Mixtral-8x7B-Instruct-v0.1-NVFP4

NVIDIA NVFP4 quantized version of Mixtral 8x7B-Instruct for Blackwell architecture GPUs.

Model Description

This is a 4-bit floating-point (NVFP4) quantized version of mistralai/Mixtral-8x7B-Instruct-v0.1, created using NVIDIA TensorRT Model Optimizer (modelopt).

Metric                 Value
Original Size          86.99 GB
Quantized Size         24.82 GB
Compression Ratio      3.50x
Size Reduction         71.5%
Quantization Method    NVFP4 (calibration-based)
Calibration Samples    512 (C4 dataset)
Quantization Time      ~5.6 hours

What is NVFP4?

NVFP4 is NVIDIA's native 4-bit floating-point format, introduced with the Blackwell architecture. Unlike uniform integer quantization (INT4), NVFP4 stores each weight in a tiny E2M1 floating-point format and rescales every 16-element block with an FP8 (E4M3) scale factor, plus a per-tensor FP32 global scale:

┌───────┬───────────┬──────────┐
│ Sign  │ Exponent  │ Mantissa │
│ 1 bit │  2 bits   │  1 bit   │
└───────┴───────────┴──────────┘

This provides better dynamic range for neural network weights compared to uniform integer quantization.
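
Because E2M1 has only a single mantissa bit, each 4-bit code decodes to one of a handful of magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6, with a sign bit); the per-block FP8 scale then stretches this small grid to fit each group of 16 weights. A minimal decode sketch in Python, illustrative only and not part of any inference path:

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0-15) to its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11   # 2-bit exponent, bias 1
    man = code & 1             # 1-bit mantissa
    if exp == 0:               # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + man * 0.5) * 2.0 ** (exp - 1)

# All representable magnitudes: 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0
print(sorted({abs(decode_e2m1(c)) for c in range(16)}))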

Hardware Requirements

⚠️ This model requires Blackwell architecture GPUs (GB10, GB100, GB200) and TensorRT-LLM for inference.

Standard HuggingFace transformers cannot load this model directly due to the packed FP4 weight format.

Current Compatibility Status (December 2025)

Framework      Status
TensorRT-LLM   Partial support (GB10 not fully supported in v1.0.0)
vLLM           Not yet supported
Transformers   ❌ Cannot load packed FP4 weights

This checkpoint is published now so it is ready to deploy as soon as TensorRT-LLM and vLLM ship full Blackwell support.

Intended Use

Once framework support is available:

# Build TensorRT engine
trtllm-build --checkpoint_dir ./Mixtral-8x7B-Instruct-v0.1-NVFP4 \
             --output_dir ./engine \
             --gemm_plugin nvfp4

# Serve with TensorRT-LLM
python -m tensorrt_llm.commands.serve --model_dir ./engine

Quantization Details

Weight Format

The model uses packed FP4 weights with block-wise scaling:

weight: uint8 (packed FP4, half original dimension)
weight_scale: float8_e4m3fn (per-block scales, group_size=16)
weight_scale_2: float32 (global scale)
input_scale: float32 (activation scale)
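
For a concrete look at this layout, the shards can be opened directly with safetensors. The sketch below is illustrative only; the shard filename and the layer chosen for inspection are assumptions, so adjust them to the files actually present in this repository:

import torch
from safetensors import safe_open

# Hypothetical shard name; the checkpoint is split across several shards.
with safe_open("model-00001-of-00006.safetensors", framework="pt") as f:
    for name in f.keys():
        if "layers.0.self_attn.q_proj" in name:
            t = f.get_tensor(name)
            print(name, t.dtype, tuple(t.shape))

# Each uint8 in `weight` holds two FP4 (E2M1) codes, hence the halved dimension:
packed = torch.tensor([0x3A], dtype=torch.uint8)
low, high = packed & 0x0F, packed >> 4   # two 4-bit codes per byte
print(low.item(), high.item())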

Layers Excluded from Quantization

  • lm_head (output layer)
  • All block_sparse_moe.gate layers (router networks)

Calibration

Quantization was performed with 512 calibration samples from the C4 dataset; forward passes over these samples collect weight and activation statistics, which are used to determine the per-block and global scale factors.
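
A minimal sketch of this calibration flow is shown below. It assumes modelopt's dict-style quantization configs (mtq.NVFP4_DEFAULT_CFG), a streaming C4 loader, and the export_hf_checkpoint helper from recent modelopt releases; it is not the exact script used to produce this checkpoint.

import copy
import itertools

import torch
import modelopt.torch.quantization as mtq
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# 512 calibration samples streamed from C4, as documented above.
calib_texts = [
    row["text"]
    for row in itertools.islice(
        load_dataset("allenai/c4", "en", split="train", streaming=True), 512
    )
]

# Start from the stock NVFP4 recipe; additionally disabling the MoE router
# layers via a wildcard pattern is an assumption about the config schema.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*block_sparse_moe.gate*"] = {"enable": False}

def forward_loop(m):
    # Forward passes over the calibration set gather activation statistics.
    with torch.no_grad():
        for text in calib_texts:
            batch = tok(text, return_tensors="pt", truncation=True, max_length=2048)
            m(**{k: v.to(m.device) for k, v in batch.items()})

model = mtq.quantize(model, cfg, forward_loop)

# Export to a HF-style checkpoint (helper name per recent modelopt releases).
from modelopt.torch.export import export_hf_checkpoint
export_hf_checkpoint(model, export_dir="Mixtral-8x7B-Instruct-v0.1-NVFP4")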

Baseline Performance (BF16)

For reference, the original BF16 model on DGX Spark (GB10):

Metric                   Value
Tokens/Second            4.05 tok/s
Latency (100 tokens)     24.74 s
Perplexity (WikiText-2)  3.70

NVFP4 inference benchmarks pending framework support.
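
The WikiText-2 perplexity above can be reproduced for the BF16 baseline with a standard sliding-window evaluation; the sketch below uses common window and stride settings, which are assumptions rather than the exact configuration behind the number in the table.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

window, stride, nlls, prev_end = 2048, 512, [], 0
for start in range(0, ids.size(1), stride):
    end = min(start + window, ids.size(1))
    trg_len = end - prev_end                     # tokens new to this window
    input_ids = ids[:, start:end].to(model.device)
    labels = input_ids.clone()
    labels[:, :-trg_len] = -100                  # score only the new tokens
    with torch.no_grad():
        nlls.append(model(input_ids, labels=labels).loss * trg_len)
    prev_end = end
    if end == ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())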

Training/Quantization Environment

  • Hardware: NVIDIA DGX Spark (GB10 Blackwell GPU)
  • Memory: 128GB Unified Memory
  • Software:
    • Python 3.12
    • PyTorch 2.11
    • NVIDIA TensorRT Model Optimizer (modelopt) 0.40.0
    • Transformers 4.57.3

Citation

If you use this model, please cite:

@misc{mixtral-nvfp4-2025,
  title={Mixtral-8x7B-Instruct-v0.1-NVFP4},
  author={Joseph Dowling},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/josephdowling10/Mixtral-8x7B-Instruct-v0.1-NVFP4}}
}

License

This model inherits the Apache 2.0 license from the base Mixtral model.

Acknowledgments

  • Mistral AI for the base Mixtral 8x7B model
  • NVIDIA for TensorRT Model Optimizer and NVFP4 format