# Gemma 4 E4B Instruct — W4A16 AutoRound Quantized
This repository hosts W4A16 INT4-quantized versions of google/gemma-4-E4B-it, a multimodal mixture-of-experts model supporting text, vision, and audio inputs. Two quantized variants are available:
| Variant | Method | Repo |
|---|---|---|
| AutoRound (RTN) | intel/auto-round | Vishva007/gemma-4-E4B-it-W4A16-AutoRound |
| GPTQ | AutoGPTQ | Vishva007/gemma-4-E4B-it-W4A16-AutoRound-GPTQ |
Note: Only the language model (LM) layers are quantized to INT4. The vision tower, audio tower, and multimodal projectors are kept at full precision (BF16) to preserve multimodal quality.
## Quantization Details
| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Quantization scheme | W4A16 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Symmetric | Yes |
| Calibration samples | 256 |
| Sequence length | 2048 |
| Non-LM modules | Kept at BF16 (vision, audio, projectors) |
| Quantized layers | All LM linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, per_layer_input_gate, per_layer_projection) |
| AutoRound mode | RTN (iters=0) — required for Gemma 4 compatibility |
| Hardware used | NVIDIA A100 80GB PCIe |
| Framework | PyTorch 2.10.0 + CUDA 12.8 |
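The W4A16 scheme in the table above (symmetric INT4 weights, group size 128, round-to-nearest) can be sketched in plain NumPy. This is an illustrative reimplementation of the numerics only, not the AutoRound kernel; function names are my own.

```python
import numpy as np

def rtn_quantize_int4(weights: np.ndarray, group_size: int = 128):
    """Symmetric round-to-nearest INT4 quantization over groups of
    `group_size` along the input dimension (the W4A16 weight path)."""
    out_dim, in_dim = weights.shape
    assert in_dim % group_size == 0, "input dim must be divisible by group size"
    w = weights.reshape(out_dim, in_dim // group_size, group_size)
    # Symmetric scale: map the max |w| in each group onto the INT4 extreme (+7).
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q.reshape(out_dim, in_dim), scales.squeeze(-1)

def dequantize_int4(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    """Reconstruct approximate weights; activations stay in 16-bit, so only
    this weight path is lossy."""
    out_dim, in_dim = q.shape
    qg = q.reshape(out_dim, in_dim // group_size, group_size).astype(np.float32)
    return (qg * scales[..., None]).reshape(out_dim, in_dim)
```

Per group, the worst-case error is half a quantization step (scale / 2), which is why larger group sizes trade accuracy for a smaller scale-storage overhead.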
## Model Architecture
Gemma 4 E4B is a multimodal MoE model (`Gemma4ForConditionalGeneration`) with:
- Text backbone: 42-layer `Gemma4TextModel` with 2560 hidden dim, mixed local/global attention
- Vision tower: 16-layer `Gemma4VisionModel` (768-dim, unquantized)
- Audio tower: 12-layer `Gemma4AudioModel` with conformer-style layers (unquantized)
- Vocabulary size: 262,144 tokens
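Since only the LM linear layers listed in the quantization table are converted to INT4, the selection logic can be sketched as a name filter. The module-name prefixes below are assumptions based on typical Hugging Face multimodal naming, not confirmed internals of this checkpoint:

```python
# Projection suffixes mirror the "Quantized layers" row of the table above.
QUANTIZED_SUFFIXES = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "per_layer_input_gate", "per_layer_projection",
)

# Hypothetical prefixes for the unquantized towers/projectors.
SKIPPED_PREFIXES = ("vision_tower", "audio_tower", "multi_modal_projector")

def should_quantize(module_name: str) -> bool:
    """Quantize only LM linear layers; leave vision, audio, and
    projector modules at full precision."""
    if module_name.startswith(SKIPPED_PREFIXES):
        return False
    return module_name.endswith(QUANTIZED_SUFFIXES)
```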
## Usage

### vLLM Inference
The recommended way to serve this model is via the official vllm/vllm-openai:gemma4 Docker image, which ships vLLM v0.19.1 with the latest Transformers patches required for Gemma 4.
### Serve with Docker (recommended)

```bash
docker run --gpus all --rm -p 8000:8000 \
  vllm/vllm-openai:gemma4 \
  vllm serve Vishva007/gemma-4-E4B-it-W4A16-AutoRound \
    --served-model-name Gemma-4-E4B-it \
    --quantization autoround \
    --kv-cache-dtype auto \
    --max-num-batched-tokens 16384 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --dtype bfloat16 \
    --max-model-len 18432 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --port 8000 \
    --default-chat-template-kwargs '{"enable_thinking": false}' \
    --mm-processor-kwargs '{"max_soft_tokens": 560}'
```
### Direct `vllm serve` (vLLM ≥ 0.19.0)

```bash
vllm serve Vishva007/gemma-4-E4B-it-W4A16-AutoRound \
  --served-model-name Gemma-4-E4B-it \
  --quantization autoround \
  --kv-cache-dtype auto \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --dtype bfloat16 \
  --max-model-len 18432 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --port 8000 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --mm-processor-kwargs '{"max_soft_tokens": 560}'
```
### `max_soft_tokens` — Image Token Budget
The max_soft_tokens parameter controls how many visual tokens are allocated per image. Higher values give richer image representations at the cost of context length and throughput.
| `max_soft_tokens` | Detail level | Recommended use |
|---|---|---|
| 70 | Minimal | Fast throughput, simple images |
| 140 | Low | Charts, diagrams |
| 280 | Medium (default) | General-purpose |
| 560 | High | Dense scenes, documents |
| 1120 | Maximum | Fine-grained visual detail |
Pass it via --mm-processor-kwargs '{"max_soft_tokens": <value>}'.
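Because every image consumes `max_soft_tokens` tokens of the shared context window, the trade-off is easy to estimate. A rough helper, where `--max-model-len 18432` matches the serve commands above and the text reservation is an assumed figure:

```python
def max_images(max_model_len: int, max_soft_tokens: int, text_budget: int) -> int:
    """Rough count of images that fit in the context window after
    reserving text_budget tokens for the prompt and response."""
    return max(0, (max_model_len - text_budget) // max_soft_tokens)
```

For example, with 2048 tokens reserved for text, the default of 280 visual tokens per image leaves room for roughly twice as many images as the high-detail setting of 560.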
### OpenAI-compatible API call
Once the server is running, query it like any OpenAI-compatible endpoint:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Gemma-4-E4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {"type": "image_url", "image_url": {"url": "https://..."}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
The quantized model was exported in both AutoRound and GPTQ formats and pushed to the Hugging Face Hub.

## Limitations & Notes
- RTN mode (`iters=0`) is used instead of full AutoRound optimization due to Gemma 4's architecture constraints.
- Some layers with shapes not divisible by 32 are skipped during quantization (minor precision impact).
- Multimodal (vision/audio) capabilities are fully preserved, as those towers are not quantized.
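The divisibility constraint from the second bullet can be checked mechanically. A minimal sketch, assuming the constraint applies to a layer's input dimension (the packing width of 32 comes from the note above):

```python
def is_skipped(in_features: int, packing_width: int = 32) -> bool:
    """A layer is left unquantized when its input dimension is not
    divisible by the INT4 packing width."""
    return in_features % packing_width != 0
```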
## Acknowledgements
Big thanks to OLAF-OSS and the contributors of gemma4-vllm — a fantastic resource for running Gemma 4 with vLLM. Their detailed QUANTIZE.md guide was instrumental in figuring out the correct quantization setup for gemma-4-E4B-it, and this work is directly inspired by ciocan/gemma-4-E4B-it-W4A16.
If you're looking to quantize or serve Gemma 4 yourself, their repo is the best starting point. 🙏
The full quantization process used to produce these models is documented here: 📓 auto_round_Gemma4-E4B.ipynb
## License
This quantized model is derived from google/gemma-4-E4B-it and is subject to the Gemma Terms of Use.
## Citation
If you use this quantized model, please also cite the original Gemma 4 work:
```bibtex
@misc{gemma4_2026,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://huggingface.co/google/gemma-4-E4B-it}
}
```
Quantized by Vishva007