# Model Overview
- Model Architecture: Qwen3_5MoeForConditionalGeneration
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD Instinct MI300/MI350/MI355
- ROCm: 7.0
- PyTorch: 2.8.0
- Transformers: 5.2.0
- Operating System(s): Linux
- Inference Engine: SGLang/vLLM
- Model Optimizer: AMD-Quark (v0.11.1)
- Weight quantization: OCP MXFP4, Static
- Activation quantization: OCP MXFP4, Dynamic
# Model Quantization
The model was quantized from Qwen/Qwen3.5-35B-A3B-FP8 using AMD-Quark. Both weights and activations are quantized to OCP MXFP4 (weights statically, activations dynamically).
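To give a feel for what MXFP4 quantization does, here is a minimal, illustrative sketch (NOT AMD-Quark's actual implementation): in the OCP microscaling format, each block of up to 32 values shares one power-of-two (E8M0) scale, and each element is rounded to the nearest FP4 E2M1 value.

```python
import math

# Representable FP4 E2M1 magnitudes (sign is stored separately).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quantize_mxfp4(block):
    """Quantize a block of floats to MXFP4 and return the dequantized result.
    Illustrative only; real kernels operate on packed 4-bit storage."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared scale exponent: floor(log2(amax)) minus the E2M1 max exponent (2),
    # so the largest magnitude lands in the representable range (clipping at 6).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    result = []
    for x in block:
        mag = abs(x) / scale
        q = min(FP4_E2M1_MAGNITUDES, key=lambda v: abs(v - mag))  # round to nearest code
        result.append(math.copysign(q * scale, x))
    return result

print(fake_quantize_mxfp4([1.0, 0.5, -2.0, 3.0]))  # -> [1.0, 0.5, -2.0, 3.0]
```

Values that happen to be FP4-representable at the shared scale round-trip exactly; everything else is rounded to the nearest code, which is where the small accuracy loss comes from.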
Quantization scripts:

```shell
cd Quark/examples/torch/language_modeling/llm_ptq/
export exclude_layers="lm_head model.visual.* mtp.* *mlp.gate *shared_expert_gate* *.linear_attn.* *.self_attn.* *.shared_expert.*"
python3 quantize_quark.py --model_dir Qwen/Qwen3.5-35B-A3B-FP8 \
    --quant_scheme mxfp4 \
    --file2file_quantization \
    --exclude_layers $exclude_layers \
    --output_dir amd/Qwen3.5-35B-A3B-MXFP4
```
For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.
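The quantized checkpoint can then be served with vLLM. A minimal sketch follows; the tensor-parallel size and context length are illustrative defaults, not values taken from this card:

```shell
# Illustrative vLLM serving command for the quantized checkpoint;
# adjust parallelism and context length to your hardware.
vllm serve amd/Qwen3.5-35B-A3B-MXFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```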
# Evaluation
The model was evaluated on the GSM8K benchmark using the vLLM framework.
## Accuracy
| Benchmark | Qwen/Qwen3.5-35B-A3B | amd/Qwen3.5-35B-A3B-MXFP4 (this model) | Recovery |
|---|---|---|---|
| gsm8k (flexible-extract) | 90.52 | 89.23 | 98.57% |
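The Recovery column is simply the quantized score as a percentage of the baseline score:

```python
# Recovery = quantized accuracy / baseline accuracy, as a percentage.
baseline = 90.52   # Qwen/Qwen3.5-35B-A3B, gsm8k flexible-extract
quantized = 89.23  # amd/Qwen3.5-35B-A3B-MXFP4 (this model)
recovery = quantized / baseline * 100
print(f"{recovery:.2f}%")  # -> 98.57%
```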
# Reproduction
The GSM8K results were obtained with the vLLM framework using the Docker image rocm/vllm-dev:nightly; vLLM was installed inside the container with fixes applied for model support.
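A typical way to launch that container on a ROCm system is sketched below; the device mappings and mount path are common ROCm Docker defaults, not values taken from this card:

```shell
# Hypothetical container launch; adjust mounts and image tag as needed.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video \
    --ipc=host --shm-size 16G \
    --security-opt seccomp=unconfined \
    -v "$PWD":/workspace -w /workspace \
    rocm/vllm-dev:nightly
```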
Evaluate the model in a new terminal:

```shell
lm_eval \
    --model vllm \
    --model_args pretrained=$MODEL,tensor_parallel_size=1,max_model_len=262144,gpu_memory_utilization=0.90,max_gen_toks=2048,trust_remote_code=True,reasoning_parser=qwen3 \
    --tasks gsm8k --num_fewshot 5 \
    --batch_size auto
```
# License
Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.