Gemma 4 E4B Instruct — W4A16 Quantized AutoRound

This repository hosts W4A16 INT4-quantized versions of google/gemma-4-E4B-it, a multimodal mixture-of-experts model supporting text, vision, and audio inputs. Two quantized variants are available:

| Variant | Method | Repo |
|---|---|---|
| AutoRound (RTN) | intel/auto-round | Vishva007/gemma-4-E4B-it-W4A16-AutoRound |
| GPTQ | AutoGPTQ | Vishva007/gemma-4-E4B-it-W4A16-AutoRound-GPTQ |

Note: Only the language model (LM) layers are quantized to INT4. The vision tower, audio tower, and multimodal projectors are kept at full precision (BF16) to preserve multimodal quality.


Quantization Details

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Quantization scheme | W4A16 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Symmetric | Yes |
| Calibration samples | 256 |
| Sequence length | 2048 |
| Non-LM modules | Kept at BF16 (vision, audio, projectors) |
| Quantized layers | All LM linear layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, per_layer_input_gate, per_layer_projection) |
| AutoRound mode | RTN (iters=0) — required for Gemma 4 compatibility |
| Hardware used | NVIDIA A100 80GB PCIe |
| Framework | PyTorch 2.10.0 + CUDA 12.8 |
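As a rough illustration of the recipe above, the sketch below applies symmetric round-to-nearest (RTN, i.e. iters=0) INT4 quantization per group of 128 weights in plain NumPy. This mirrors the W4A16 scheme conceptually; it is not the intel/auto-round implementation, and the packing/storage details of the real quantizer differ.

```python
import numpy as np

def quantize_w4a16(w: np.ndarray, group_size: int = 128):
    """Symmetric RTN INT4 quantization of a 2-D weight matrix, one scale per group."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size)
    # Symmetric scheme: zero-point fixed at 0, INT4 range [-8, 7], scale from group max.
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero groups
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scale

def dequantize_w4a16(q: np.ndarray, scale: np.ndarray, group_size: int = 128):
    """Recover BF16-like float weights from INT4 codes and per-group scales."""
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (g * scale).reshape(rows, cols)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
q, scale = quantize_w4a16(w)
w_hat = dequantize_w4a16(q, scale)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The roundtrip error per element is bounded by half a quantization step (scale / 2), which is why the group size and symmetric range drive the accuracy of the quantized LM layers.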

Model Architecture

Gemma 4 E4B is a multimodal MoE model (Gemma4ForConditionalGeneration) with:

  • Text backbone: 42-layer Gemma4TextModel with 2560 hidden dim, mixed local/global attention
  • Vision tower: 16-layer Gemma4VisionModel (768-dim, unquantized)
  • Audio tower: 12-layer Gemma4AudioModel with conformer-style layers (unquantized)
  • Vocabulary size: 262,144 tokens
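A quick back-of-the-envelope check using the figures above: with a 262,144-token vocabulary and a 2560-wide text backbone, the token embedding matrix alone accounts for vocab × hidden parameters, which is a sizable share of the model left outside the INT4 LM-linear layers.

```python
# Figures taken from the architecture summary above.
vocab_size = 262_144
hidden_dim = 2560

embedding_params = vocab_size * hidden_dim
print(f"{embedding_params / 1e6:.0f}M embedding parameters")  # ~671M
```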

Usage

vLLM Inference

The recommended way to serve this model is via the official vllm/vllm-openai:gemma4 Docker image, which ships vLLM v0.19.1 with the latest Transformers patches required for Gemma 4.

Serve with Docker (recommended)

```bash
docker run --gpus all --rm -p 8000:8000 \
  vllm/vllm-openai:gemma4 \
  vllm serve Vishva007/gemma-4-E4B-it-W4A16-AutoRound \
    --served-model-name Gemma-4-E4B-it \
    --quantization autoround \
    --kv-cache-dtype auto \
    --max-num-batched-tokens 16384 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --dtype bfloat16 \
    --max-model-len 18432 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --port 8000 \
    --default-chat-template-kwargs '{"enable_thinking": false}' \
    --mm-processor-kwargs '{"max_soft_tokens": 560}'
```

Direct vllm serve (vLLM ≥ 0.19.0)

```bash
vllm serve Vishva007/gemma-4-E4B-it-W4A16-AutoRound \
  --served-model-name Gemma-4-E4B-it \
  --quantization autoround \
  --kv-cache-dtype auto \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --dtype bfloat16 \
  --max-model-len 18432 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --port 8000 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --mm-processor-kwargs '{"max_soft_tokens": 560}'
```

max_soft_tokens — Image Token Budget

The max_soft_tokens parameter controls how many visual tokens are allocated per image. Higher values give richer image representations at the cost of context length and throughput.

| max_soft_tokens | Detail level | Recommended use |
|---|---|---|
| 70 | Minimal | Fast throughput, simple images |
| 140 | Low | Charts, diagrams |
| 280 | Medium (default) | General-purpose |
| 560 | High | Dense scenes, documents |
| 1120 | Maximum | Fine-grained visual detail |

Pass it via --mm-processor-kwargs '{"max_soft_tokens": <value>}'.
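Because soft tokens share the context window with text, the image budget trades off against prompt and response length. The arithmetic below sketches how many images fit at each setting under the `--max-model-len 18432` used in the serve commands above; the 2048-token text reserve is an assumption for illustration, not a vLLM parameter.

```python
# Context-budget arithmetic for the serve flags above.
max_model_len = 18432   # from --max-model-len
text_budget = 2048      # assumed reserve for prompt + response (illustrative)

for max_soft_tokens in (70, 140, 280, 560, 1120):
    images = (max_model_len - text_budget) // max_soft_tokens
    print(f"max_soft_tokens={max_soft_tokens:>4}: ~{images} images fit")
```

At the default of 280 this leaves room for dozens of images, while the maximum setting of 1120 drops that to roughly a dozen.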

OpenAI-compatible API call

Once the server is running, query it like any OpenAI-compatible endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Gemma-4-E4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see."},
                {"type": "image_url", "image_url": {"url": "https://..."}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

The quantized model was exported in both AutoRound and GPTQ formats and pushed to the Hugging Face Hub.


Limitations & Notes

  • RTN mode (iters=0) is used instead of full AutoRound optimization due to Gemma 4's architecture constraints.
  • Some layers with shapes not divisible by 32 are skipped during quantization (minor precision impact).
  • Multimodal (vision/audio) capabilities are fully preserved as those towers are not quantized.
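The shape-based skip rule from the notes above can be sketched as a simple filter: a linear layer is only quantized when its dimensions divide evenly by 32. The exact condition inside the quantizer may differ; this only mirrors the stated rule for illustration.

```python
# Illustrative version of the layer-skip rule: layers whose dimensions are
# not multiples of 32 are left unquantized (the real check may differ).
def is_quantizable(in_features: int, out_features: int, multiple: int = 32) -> bool:
    return in_features % multiple == 0 and out_features % multiple == 0

print(is_quantizable(2560, 2560))  # projection sized to the 2560 hidden dim -> True
print(is_quantizable(2560, 1000))  # hypothetical odd-shaped head -> False
```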

Acknowledgements

Big thanks to OLAF-OSS and the contributors of gemma4-vllm — a fantastic resource for running Gemma 4 with vLLM. Their detailed QUANTIZE.md guide was instrumental in figuring out the correct quantization setup for gemma-4-E4B-it, and this work is directly inspired by ciocan/gemma-4-E4B-it-W4A16.

If you're looking to quantize or serve Gemma 4 yourself, their repo is the best starting point. 🙏

The full quantization process used to produce these models is documented here: 📓 auto_round_Gemma4-E4B.ipynb

License

This quantized model is derived from google/gemma-4-E4B-it and is subject to the Gemma Terms of Use.


Citation

If you use this quantized model, please also cite the original Gemma 4 work:

```bibtex
@misc{gemma4_2026,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://huggingface.co/google/gemma-4-E4B-it}
}
```

Quantized by Vishva007
