Gemma 4 E4B Instruct — W4A16 Quantized AutoRound

This repository hosts W4A16 INT4-quantized versions of google/gemma-4-E4B-it, a multimodal mixture-of-experts model supporting text, vision, and audio inputs. Two quantized variants are available:

| Variant | Method | Repo |
|---|---|---|
| AutoRound (RTN) | intel/auto-round | Vishva007/gemma-4-E4B-it-W4A16-AutoRound |
| GPTQ | AutoGPTQ | Vishva007/gemma-4-E4B-it-W4A16-AutoRound-GPTQ |

Note: Only the language model (LM) layers are quantized to INT4. The vision tower, audio tower, and multimodal projectors are kept at full precision (BF16) to preserve multimodal quality.


Quantization Details

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Quantization scheme | W4A16 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Symmetric | Yes |
| Calibration samples | 256 |
| Sequence length | 2048 |
| Non-LM modules | Kept at BF16 (vision, audio, projectors) |
| Quantized layers | All LM linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, per_layer_input_gate, per_layer_projection) |
| AutoRound mode | RTN (iters=0) — required for Gemma 4 compatibility |
| Hardware used | NVIDIA A100 80GB PCIe |
| Framework | PyTorch 2.10.0 + CUDA 12.8 |
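As a rough illustration of the recipe above, the sketch below applies symmetric round-to-nearest (RTN, i.e. iters=0) INT4 quantization per group of 128 weights in plain NumPy. This mirrors the W4A16 scheme conceptually; it is not the intel/auto-round implementation, and the packing/storage details of the real quantizer differ.

```python
import numpy as np

def quantize_w4a16(w: np.ndarray, group_size: int = 128):
    """Symmetric RTN INT4 quantization of a 2-D weight matrix, one scale per group."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    # Symmetric scheme: zero-point fixed at 0, INT4 range [-8, 7], scale from group max.
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scale

def dequantize_w4a16(q: np.ndarray, scale: np.ndarray, group_size: int = 128):
    """Recover BF16-like float weights from INT4 codes and per-group scales."""
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (g * scale).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale = quantize_w4a16(w)
w_hat = dequantize_w4a16(q, scale)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The roundtrip error per element is bounded by half a quantization step (scale / 2), which is why the group size and symmetric range drive the accuracy of the quantized LM layers.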

Model Architecture

Gemma 4 E4B is a multimodal MoE model (Gemma4ForConditionalGeneration) with:

  • Text backbone: 42-layer Gemma4TextModel with 2560 hidden dim, mixed local/global attention
  • Vision tower: 16-layer Gemma4VisionModel (768-dim, unquantized)
  • Audio tower: 12-layer Gemma4AudioModel with conformer-style layers (unquantized)
  • Vocabulary size: 262,144 tokens
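A quick back-of-the-envelope check using the figures above: with a 262,144-token vocabulary and a 2560-wide text backbone, the token embedding matrix alone accounts for vocab × hidden parameters, which is a sizable share of the model left outside the INT4 LM-linear layers.

```python
# Figures taken from the architecture summary above.
vocab_size = 262_144
hidden_dim = 2560

embedding_params = vocab_size * hidden_dim
print(f"{embedding_params / 1e6:.0f}M embedding parameters")  # ~671M
```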

Usage

vLLM Inference

The recommended way to serve this model is via the official vllm/vllm-openai:gemma4 Docker image, which ships vLLM v0.19.1 with the latest Transformers patches required for Gemma 4.

Serve with Docker (recommended)

```bash
docker run --gpus all --rm -p 8000:8000 \
  vllm/vllm-openai:gemma4 \
  vllm serve Vishva007/gemma-4-E4B-it-W4A16-AutoRound \
    --served-model-name Gemma-4-E4B-it \
    --quantization autoround \
    --kv-cache-dtype auto \
    --max-num-batched-tokens 16384 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --dtype bfloat16 \
    --max-model-len 18432 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --port 8000 \
    --default-chat-template-kwargs '{"enable_thinking": false}' \
    --mm-processor-kwargs '{"max_soft_tokens": 560}'
```

Direct vllm serve (vLLM ≥ 0.19.0)

```bash
vllm serve Vishva007/gemma-4-E4B-it-W4A16-AutoRound \
  --served-model-name Gemma-4-E4B-it \
  --quantization autoround \
  --kv-cache-dtype auto \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --dtype bfloat16 \
  --max-model-len 18432 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --port 8000 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --mm-processor-kwargs '{"max_soft_tokens": 560}'
```

max_soft_tokens — Image Token Budget

The max_soft_tokens parameter controls how many visual tokens are allocated per image. Higher values give richer image representations at the cost of context length and throughput.

| max_soft_tokens | Detail level | Recommended use |
|---|---|---|
| 70 | Minimal | Fast throughput, simple images |
| 140 | Low | Charts, diagrams |
| 280 | Medium (default) | General-purpose |
| 560 | High | Dense scenes, documents |
| 1120 | Maximum | Fine-grained visual detail |

Pass it via --mm-processor-kwargs '{"max_soft_tokens": <value>}'.
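Because soft tokens share the context window with text, the image budget trades off against prompt and response length. The arithmetic below sketches how many images fit at each setting under the `--max-model-len 18432` used in the serve commands above; the 2048-token text reserve is an assumption for illustration, not a vLLM parameter.

```python
# Context-budget arithmetic for the serve flags above.
max_model_len = 18432   # from --max-model-len
text_budget = 2048      # assumed reserve for prompt + response (illustrative)

for max_soft_tokens in (70, 140, 280, 560, 1120):
    images = (max_model_len - text_budget) // max_soft_tokens
    print(f"max_soft_tokens={max_soft_tokens:>4}: ~{images} images fit")
```

At the default of 280 this leaves room for dozens of images, while the maximum setting of 1120 drops that to roughly a dozen.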

OpenAI-compatible API call

Once the server is running, query it like any OpenAI-compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Gemma-4-E4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {"type": "image_url", "image_url": {"url": "https://..."}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

The quantized model was exported in both AutoRound and GPTQ formats and pushed to the Hugging Face Hub.


Limitations & Notes

  • RTN mode (iters=0) is used instead of full AutoRound optimization due to Gemma 4's architecture constraints.
  • Some layers with shapes not divisible by 32 are skipped during quantization (minor precision impact).
  • Multimodal (vision/audio) capabilities are fully preserved as those towers are not quantized.
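The shape-based skip rule from the notes above can be sketched as a simple filter: a linear layer is only quantized when its dimensions divide evenly by 32. The exact condition inside the quantizer may differ; this only mirrors the stated rule for illustration.

```python
# Illustrative version of the layer-skip rule: layers whose dimensions are
# not multiples of 32 are left unquantized (the real check may differ).
def is_quantizable(in_features: int, out_features: int, multiple: int = 32) -> bool:
    return in_features % multiple == 0 and out_features % multiple == 0

print(is_quantizable(2560, 2560))  # projection sized to the 2560 hidden dim -> True
print(is_quantizable(2560, 1000))  # hypothetical odd-shaped head -> False
```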

Acknowledgements

Big thanks to OLAF-OSS and the contributors of gemma4-vllm — a fantastic resource for running Gemma 4 with vLLM. Their detailed QUANTIZE.md guide was instrumental in figuring out the correct quantization setup for gemma-4-E4B-it, and this work is directly inspired by ciocan/gemma-4-E4B-it-W4A16.

If you're looking to quantize or serve Gemma 4 yourself, their repo is the best starting point. 🙏

The full quantization process used to produce these models is documented here: 📓 auto_round_Gemma4-E4B.ipynb

License

This quantized model is derived from google/gemma-4-E4B-it and is subject to the Gemma Terms of Use.


Citation

If you use this quantized model, please also cite the original Gemma 4 work:

```bibtex
@misc{gemma4_2026,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://huggingface.co/google/gemma-4-E4B-it}
}
```

Quantized by Vishva007
