Mixtral-8x7B-Instruct-v0.1-NVFP4

NVIDIA NVFP4 quantized version of Mixtral 8x7B-Instruct for Blackwell architecture GPUs.

Model Description

This is a 4-bit floating-point (NVFP4) quantized version of mistralai/Mixtral-8x7B-Instruct-v0.1, created using NVIDIA TensorRT Model Optimizer (modelopt).

Metric                 Value
Original Size          86.99 GB
Quantized Size         24.82 GB
Compression Ratio      3.50x
Size Reduction         71.5%
Quantization Method    NVFP4 (calibration-based)
Calibration Samples    512 (C4 dataset)
Quantization Time      ~5.6 hours

What is NVFP4?

NVFP4 is NVIDIA's native 4-bit floating-point format, introduced with the Blackwell architecture. Unlike uniform integer quantization (INT4), NVFP4 stores each weight in a tiny E2M1 floating-point format and rescales every 16-element block with an FP8 (E4M3) scale factor, plus a per-tensor FP32 global scale:

┌───────┬───────────┬──────────┐
│ Sign  │ Exponent  │ Mantissa │
│ 1 bit │  2 bits   │  1 bit   │
└───────┴───────────┴──────────┘

This provides better dynamic range for neural network weights compared to uniform integer quantization.
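
Because E2M1 has only a single mantissa bit, each 4-bit code decodes to one of a handful of magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6, with a sign bit); the per-block FP8 scale then stretches this small grid to fit each group of 16 weights. A minimal decode sketch in Python, illustrative only and not part of any inference path:

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0-15) to its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11   # 2-bit exponent, bias 1
    man = code & 1             # 1-bit mantissa
    if exp == 0:               # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

# All representable magnitudes: 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0
print(sorted({abs(decode_e2m1(c)) for c in range(16)}))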

Hardware Requirements

⚠️ This model requires Blackwell architecture GPUs (GB10, GB100, GB200) and TensorRT-LLM for inference.

Standard HuggingFace transformers cannot load this model directly due to the packed FP4 weight format.

Current Compatibility Status (December 2025)

Framework      Status
TensorRT-LLM   Partial support (GB10 not fully supported in v1.0.0)
vLLM           Not yet supported
Transformers   ❌ Cannot load packed FP4 weights

This checkpoint is published now so it is ready to deploy as soon as TensorRT-LLM and vLLM ship full Blackwell support.

Intended Use

Once framework support is available:

# Build TensorRT engine
trtllm-build --checkpoint_dir ./Mixtral-8x7B-Instruct-v0.1-NVFP4 \
             --output_dir ./engine \
             --gemm_plugin nvfp4

# Serve with TensorRT-LLM
python -m tensorrt_llm.commands.serve --model_dir ./engine

Quantization Details

Weight Format

The model uses packed FP4 weights with block-wise scaling:

weight: uint8 (packed FP4, half original dimension)
weight_scale: float8_e4m3fn (per-block scales, group_size=16)
weight_scale_2: float32 (global scale)
input_scale: float32 (activation scale)
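
For a concrete look at this layout, the shards can be opened directly with safetensors. The sketch below is illustrative only; the shard filename and the layer chosen for inspection are assumptions, so adjust them to the files actually present in this repository:

import torch
from safetensors import safe_open

# Hypothetical shard name; the checkpoint is split across several shards.
with safe_open("model-00001-of-00006.safetensors", framework="pt") as f:
    for name in f.keys():
        if "layers.0.self_attn.q_proj" in name:
            t = f.get_tensor(name)
            print(name, t.dtype, tuple(t.shape))

# Each uint8 in `weight` holds two FP4 (E2M1) codes, hence the halved dimension:
packed = torch.tensor([0x3A], dtype=torch.uint8)
low, high = packed & 0x0F, packed >> 4   # two 4-bit codes per byte
print(low.item(), high.item())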

Layers Excluded from Quantization

  • lm_head (output layer)
  • All block_sparse_moe.gate layers (router networks)

Calibration

Quantization was performed with 512 calibration samples from the C4 dataset; forward passes over these samples collect weight and activation statistics, which are used to determine the per-block and global scale factors.
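
A minimal sketch of this calibration flow is shown below. It assumes modelopt's dict-style quantization configs (mtq.NVFP4_DEFAULT_CFG), a streaming C4 loader, and the export_hf_checkpoint helper from recent modelopt releases; it is not the exact script used to produce this checkpoint.

import copy
import itertools

import torch
import modelopt.torch.quantization as mtq
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# 512 calibration samples streamed from C4, as documented above.
calib_texts = [
    row["text"]
    for row in itertools.islice(
        load_dataset("allenai/c4", "en", split="train", streaming=True), 512
    )
]

# Start from the stock NVFP4 recipe; additionally disabling the MoE router
# layers via a wildcard pattern is an assumption about the config schema.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*block_sparse_moe.gate*"] = {"enable": False}

def forward_loop(m):
    # Forward passes over the calibration set gather activation statistics.
    with torch.no_grad():
        for text in calib_texts:
            batch = tok(text, return_tensors="pt", truncation=True, max_length=2048)
            m(**{k: v.to(m.device) for k, v in batch.items()})

model = mtq.quantize(model, cfg, forward_loop)

# Export to a HF-style checkpoint (helper name per recent modelopt releases).
from modelopt.torch.export import export_hf_checkpoint
export_hf_checkpoint(model, export_dir="Mixtral-8x7B-Instruct-v0.1-NVFP4")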

Baseline Performance (BF16)

For reference, the original BF16 model on DGX Spark (GB10):

Metric                   Value
Tokens/Second            4.05 tok/s
Latency (100 tokens)     24.74 s
Perplexity (WikiText-2)  3.70

NVFP4 inference benchmarks pending framework support.
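
The WikiText-2 perplexity above can be reproduced for the BF16 baseline with a standard sliding-window evaluation; the sketch below uses common window and stride settings, which are assumptions rather than the exact configuration behind the number in the table.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

window, stride, nlls, prev_end = 2048, 512, [], 0
for start in range(0, ids.size(1), stride):
    end = min(start + window, ids.size(1))
    trg_len = end - prev_end                     # tokens new to this window
    input_ids = ids[:, start:end].to(model.device)
    labels = input_ids.clone()
    labels[:, :-trg_len] = -100                  # score only the new tokens
    with torch.no_grad():
        nlls.append(model(input_ids, labels=labels).loss * trg_len)
    prev_end = end
    if end == ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())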

Training/Quantization Environment

  • Hardware: NVIDIA DGX Spark (GB10 Blackwell GPU)
  • Memory: 128GB Unified Memory
  • Software:
    • Python 3.12
    • PyTorch 2.11
    • NVIDIA TensorRT Model Optimizer (modelopt) 0.40.0
    • Transformers 4.57.3

Citation

If you use this model, please cite:

@misc{mixtral-nvfp4-2025,
  title={Mixtral-8x7B-Instruct-v0.1-NVFP4},
  author={Joseph Dowling},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/josephdowling10/Mixtral-8x7B-Instruct-v0.1-NVFP4}}
}

License

This model inherits the Apache 2.0 license from the base Mixtral model.

Acknowledgments

  • Mistral AI for the base Mixtral 8x7B model
  • NVIDIA for TensorRT Model Optimizer and NVFP4 format