Update README.md
## Highlights

We introduce the updated version of the **Qwen3-30B-A3B-FP8 non-thinking mode**, named **Qwen3-30B-A3B-Instruct-2507-FP8**, featuring the following key enhancements:

- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage**.
- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
## Model Overview

This repo contains the FP8 version of **Qwen3-30B-A3B-Instruct-2507**, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:

- SGLang:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --tp 8 --context-length 262144
```
- vLLM:
```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --tensor-parallel-size 8 --max-model-len 262144
```
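Once one of the servers above is running, the endpoint speaks the standard OpenAI chat-completions protocol. A minimal stdlib-only sketch of a request, assuming vLLM's default address `http://localhost:8000` (adjust `API_URL` for SGLang or a custom port):

```python
import json
import urllib.request

# Assumed endpoint: vLLM's default address; adjust for SGLang or a custom port.
API_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "temperature": 0.7,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.loads(response.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:  # covers URLError: no server is listening at API_URL
    print(f"request failed: {exc}")
```

The same request works against either framework, since both expose the `/v1/chat/completions` route.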
**Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.**

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
## Note on FP8

For convenience and performance, we provide an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
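The block-wise idea can be illustrated with a toy, framework-free sketch: one scale per 128 consecutive weights, chosen so each block's largest magnitude fits the fp8 e4m3 range. Integer rounding stands in for the actual fp8 cast here (Python has no fp8 type), so this shows the scheme, not the kernels the inference frameworks use:

```python
import random

FP8_E4M3_MAX = 448.0   # largest finite magnitude representable in fp8 e4m3
BLOCK_SIZE = 128       # one scale per 128 consecutive weights

def quantize_blockwise(weights):
    """Return per-block quantized codes and one float scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        # Map the block's max magnitude onto the fp8 range (fallback for all-zero blocks).
        scale = max(abs(w) for w in block) / FP8_E4M3_MAX or 1.0
        # Integer rounding stands in for the fp8 cast.
        codes.append([round(w / scale) for w in block])
        scales.append(scale)
    return codes, scales

def dequantize_blockwise(codes, scales):
    return [c * s for block, s in zip(codes, scales) for c in block]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]
codes, scales = quantize_blockwise(weights)
recovered = dequantize_blockwise(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
print(f"{len(scales)} blocks, max reconstruction error {max_error:.4f}")
```

Because each block carries its own scale, an outlier weight only coarsens the 128 values in its own block rather than the whole tensor, which is the point of fine-grained quantization.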
You can use the Qwen3-30B-A3B-Instruct-2507-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just as you would the original bfloat16 model.
## Agentic Use

Qwen3 excels in tool-calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
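A hedged sketch of wiring Qwen-Agent to the OpenAI-compatible server started above: the endpoint URL, served model name, and tool choice below are assumptions, and the `Assistant` usage follows the pattern in the Qwen-Agent repository; consult its docs for the authoritative API.

```python
# Sketch only: assumes `pip install qwen-agent` and a vllm/sglang server
# already running at the assumed address below.
llm_cfg = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
    "model_server": "http://localhost:8000/v1",  # assumed vLLM default endpoint
    "api_key": "EMPTY",
}

messages = [{"role": "user", "content": "What tools do you have available?"}]

try:
    from qwen_agent.agents import Assistant

    bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])
    responses = []
    for responses in bot.run(messages=messages):  # streams partial responses
        pass
    print(responses)
except Exception as exc:  # qwen-agent not installed or server unreachable
    print(f"agent run skipped: {exc}")
```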