littlebird13 committed on
Commit c99c7ae · verified · 1 Parent(s): 61c2ee1

Update README.md

Files changed (1)
  1. README.md +10 -4
README.md CHANGED
@@ -12,7 +12,7 @@ pipeline_tag: text-generation
 
 ## Highlights
 
-We introduce the updated version of the **Qwen3-30B-A3B non-thinking mode**, named **Qwen3-30B-A3B-Instruct-2507-FP8**, featuring the following key enhancements:
+We introduce the updated version of the **Qwen3-30B-A3B-FP8 non-thinking mode**, named **Qwen3-30B-A3B-Instruct-2507-FP8**, featuring the following key enhancements:
 
 - **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage**.
 - **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
@@ -21,7 +21,7 @@ We introduce the updated version of the **Qwen3-30B-A3B non-thinking mode**, nam
 
 ## Model Overview
 
-**Qwen3-30B-A3B-Instruct-2507-FP8** has the following features:
+This repo contains the FP8 version of **Qwen3-30B-A3B-Instruct-2507**, which has the following features:
 - Type: Causal Language Models
 - Training Stage: Pretraining & Post-training
 - Number of Parameters: 30.5B in total and 3.3B activated
@@ -87,17 +87,23 @@ print("content:", content)
 For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
 - SGLang:
 ```shell
-python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --tp 8 --context-length 262144
+python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --tp 8 --context-length 262144
 ```
 - vLLM:
 ```shell
-vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --tensor-parallel-size 8 --max-model-len 262144
+vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --tensor-parallel-size 8 --max-model-len 262144
 ```
 
 **Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.**
 
 For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
 
+## Note on FP8
+
+For convenience and performance, we provide an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
+
+You can use the Qwen3-30B-A3B-Instruct-2507-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just as you would the original bfloat16 model.
+
 ## Agentic Use
 
 Qwen3 excels in tool calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
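The fine-grained block-wise scheme described in the "Note on FP8" hunk above can be illustrated with a toy sketch. This is not the actual quantization kernel: the cast to real fp8 E4M3 values is omitted, and the function names are hypothetical; it only shows the idea of one scale per contiguous block of 128 weights, chosen so the scaled block fits the E4M3 dynamic range (max magnitude ≈ 448).

```python
# Toy sketch of fine-grained block-wise quantization (block size 128).
# Hypothetical helper names; real fp8 rounding/casting is omitted for clarity.
FP8_E4M3_MAX = 448.0
BLOCK_SIZE = 128

def quantize_blockwise(weights):
    """Return (scaled values, per-block scales); each block of 128 weights
    is divided by its own scale so it fits the fp8 E4M3 range."""
    scales, quantized = [], []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        # Per-block scale; fall back to 1.0 for an all-zero block.
        scale = (max(abs(w) for w in block) / FP8_E4M3_MAX) or 1.0
        scales.append(scale)
        # A real kernel would cast each scaled value to fp8 here.
        quantized.extend(w / scale for w in block)
    return quantized, scales

def dequantize_blockwise(quantized, scales):
    """Recover approximate weights by multiplying each value by its block scale."""
    return [q * scales[i // BLOCK_SIZE] for i, q in enumerate(quantized)]
```

Because each 128-value block carries its own scale, one outlier weight only degrades the precision of its own block rather than the whole tensor, which is the motivation for fine-grained (rather than per-tensor) fp8 quantization.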
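Once one of the servers from the deployment commands above is running, it exposes a standard OpenAI-compatible `/v1/chat/completions` route. The following standard-library sketch builds such a request; it assumes the server listens on port 8000 (vLLM's default; SGLang defaults to 30000), and the `build_request` helper name is our own, not part of either framework.

```python
# Build an OpenAI-compatible chat-completions request using only the stdlib.
# Assumes a local server started with the vllm/sglang commands above;
# adjust base_url to match your deployment.
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8000/v1",
                  model="Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With a server running, send the request and read the reply:
# resp = urllib.request.urlopen(build_request("Hello!"))
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (the `openai` Python SDK, curl, etc.) works the same way against this endpoint; only the base URL and model name change.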