Update README.md
## Highlights

We introduce the updated version of the **Qwen3-30B-A3B-FP8 non-thinking mode**, named **Qwen3-30B-A3B-Instruct-2507-FP8**, featuring the following key enhancements:

- **Significant improvements** in general capabilities, including **instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage**.
- **Substantial gains** in long-tail knowledge coverage across **multiple languages**.
## Model Overview

This repo contains the FP8 version of **Qwen3-30B-A3B-Instruct-2507**, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:

- SGLang:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --tp 8 --context-length 262144
```
- vLLM:
```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 --tensor-parallel-size 8 --max-model-len 262144
```
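Once one of the servers above is running, the endpoint speaks the standard OpenAI chat-completions protocol. A minimal stdlib-only sketch of a request, assuming vLLM's default address `http://localhost:8000` (adjust `API_URL` for SGLang or a custom port):

```python
import json
import urllib.request

# Assumed endpoint: vLLM's default address; adjust for SGLang or a custom port.
API_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
    "messages": [
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    "temperature": 0.7,
}

request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(request, timeout=10) as response:
        reply = json.loads(response.read())
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:  # covers URLError: no server is listening at API_URL
    print(f"request failed: {exc}")
```

The same request works against either framework, since both expose the `/v1/chat/completions` route.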
**Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.**

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
## Note on FP8

For convenience and performance, we provide an `fp8`-quantized model checkpoint for Qwen3, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with a block size of 128. You can find more details in the `quantization_config` field in `config.json`.
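The block-wise idea can be illustrated with a toy, framework-free sketch: one scale per 128 consecutive weights, chosen so each block's largest magnitude fits the fp8 e4m3 range. Integer rounding stands in for the actual fp8 cast here (Python has no fp8 type), so this shows the scheme, not the kernels the inference frameworks use:

```python
import random

FP8_E4M3_MAX = 448.0   # largest finite magnitude representable in fp8 e4m3
BLOCK_SIZE = 128       # one scale per 128 consecutive weights

def quantize_blockwise(weights):
    """Return per-block quantized codes and one float scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        # Map the block's max magnitude onto the fp8 range (fallback for all-zero blocks).
        scale = max(abs(w) for w in block) / FP8_E4M3_MAX or 1.0
        # Integer rounding stands in for the fp8 cast.
        codes.append([round(w / scale) for w in block])
        scales.append(scale)
    return codes, scales

def dequantize_blockwise(codes, scales):
    return [c * s for block, s in zip(codes, scales) for c in block]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1024)]
codes, scales = quantize_blockwise(weights)
recovered = dequantize_blockwise(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
print(f"{len(scales)} blocks, max reconstruction error {max_error:.4f}")
```

Because each block carries its own scale, an outlier weight only coarsens the 128 values in its own block rather than the whole tensor, which is the point of fine-grained quantization.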
You can use the Qwen3-30B-A3B-Instruct-2507-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just as you would the original bfloat16 model.
## Agentic Use

Qwen3 excels in tool-calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
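A hedged sketch of wiring Qwen-Agent to the OpenAI-compatible server started above: the endpoint URL, served model name, and tool choice below are assumptions, and the `Assistant` usage follows the pattern in the Qwen-Agent repository; consult its docs for the authoritative API.

```python
# Sketch only: assumes `pip install qwen-agent` and a vllm/sglang server
# already running at the assumed address below.
llm_cfg = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
    "model_server": "http://localhost:8000/v1",  # assumed vLLM default endpoint
    "api_key": "EMPTY",
}

messages = [{"role": "user", "content": "What tools do you have available?"}]

try:
    from qwen_agent.agents import Assistant

    bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])
    responses = []
    for responses in bot.run(messages=messages):  # streams partial responses
        pass
    print(responses)
except Exception as exc:  # qwen-agent not installed or server unreachable
    print(f"agent run skipped: {exc}")
```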