Instructions to use Outlier-Ai/Outlier-Nano-4B-MLX-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use Outlier-Ai/Outlier-Nano-4B-MLX-4bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Outlier-Ai/Outlier-Nano-4B-MLX-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use Outlier-Ai/Outlier-Nano-4B-MLX-4bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Outlier-Ai/Outlier-Nano-4B-MLX-4bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "Outlier-Ai/Outlier-Nano-4B-MLX-4bit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
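If ~/.pi/agent/models.json already holds other providers, pasting the JSON above over it would discard them. The following is a minimal sketch of merging the entry instead. The helper names `merge_mlx_provider` and `update_models_file` are hypothetical and not part of Pi; the field names are copied from the configuration shown above, and the behaviour of any other keys in the file is not assumed.

```python
import json
from pathlib import Path

MODEL_ID = "Outlier-Ai/Outlier-Nano-4B-MLX-4bit"


def merge_mlx_provider(config: dict, model_id: str,
                       base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a copy of `config` whose "mlx-lm" provider lists `model_id`.

    Hypothetical helper: the keys mirror the JSON shown above.
    """
    merged = json.loads(json.dumps(config))  # deep copy of plain JSON data
    providers = merged.setdefault("providers", {})
    provider = providers.setdefault("mlx-lm", {})
    provider.update({
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
    })
    models = provider.setdefault("models", [])
    # Avoid adding the same model twice when the script is re-run.
    if not any(entry.get("id") == model_id for entry in models):
        models.append({"id": model_id})
    return merged


def update_models_file(path: Path, model_id: str = MODEL_ID) -> None:
    """Read the file if it exists, merge the provider, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    merged = merge_mlx_provider(existing, model_id)
    path.write_text(json.dumps(merged, indent=2) + "\n")
```

Calling `update_models_file(Path.home() / ".pi" / "agent" / "models.json")` would then apply the change in place while keeping any other providers untouched.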
- Hermes Agent
How to use Outlier-Ai/Outlier-Nano-4B-MLX-4bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "Outlier-Ai/Outlier-Nano-4B-MLX-4bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Outlier-Ai/Outlier-Nano-4B-MLX-4bit
Run Hermes
hermes
- MLX LM
How to use Outlier-Ai/Outlier-Nano-4B-MLX-4bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "Outlier-Ai/Outlier-Nano-4B-MLX-4bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "Outlier-Ai/Outlier-Nano-4B-MLX-4bit"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Outlier-Ai/Outlier-Nano-4B-MLX-4bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
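The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server listens on port 8080 (the address used in the Pi and Hermes configuration above) and returns the usual OpenAI-style `choices[0].message.content` field; pass `base_url` if your server runs elsewhere. The function names are illustrative.

```python
import json
import urllib.request

MODEL_ID = "Outlier-Ai/Outlier-Nano-4B-MLX-4bit"


def build_chat_request(user_text: str,
                       base_url: str = "http://localhost:8080/v1",
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text: str, **kwargs) -> str:
    """Send one user message and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    with urllib.request.urlopen(build_chat_request(user_text, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply.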
Part of the Outlier shipping lineup. Outlier is a free macOS app that runs this model locally, with one click. Apple Silicon only.
Outlier Nano 4B (Combined-fused) — MLX 4-bit
This is the multi-domain Combined-fused checkpoint — the base Qwen3-Next 4B (hybrid linear:full attention 3:1) with a fused LoRA adapter trained jointly on math reasoning, translation, and Q&A. The earlier "base Nano" checkpoint is still available at git tag pre-combined-v1.
Why fused?
Same 4B parameter budget, but every domain gets a meaningful lift:
| Benchmark | Base Nano | Combined-fused | Delta |
|---|---|---|---|
| GSM8K math (n=50) | 5% | 78% | +73pp |
| MMLU-style (n=30) | 74.4% | 96.7% | +22pp |
| Hard reasoning (n=15) | — | 87–93% | new |
| Q&A factual (n=30) | 74% | 100% | +26pp |
| Code generation | — | 100% | new |
| Translation (n=30) | 20% | 90% | +70pp |
| RAG | — | 100% | new |
| Variant robustness (5× phrasings) | — | 100% | new |
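Because the sample sizes in the table are small, it can help to look at the uncertainty around each score as well as the delta. The sketch below recomputes the deltas in percentage points and adds a 95% Wilson score interval for the Combined-fused accuracy. The interval is an illustration added here, not part of the original evaluation; only rows with both a sample size and a base score are included. For the GSM8K row (78% at n=50) it spans roughly 65% to 87%.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion observed over n trials."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (benchmark, base accuracy, fused accuracy, n) copied from the table above
rows = [
    ("GSM8K math", 0.05, 0.78, 50),
    ("MMLU-style", 0.744, 0.967, 30),
    ("Q&A factual", 0.74, 1.00, 30),
    ("Translation", 0.20, 0.90, 30),
]

for name, base, fused, n in rows:
    delta_pp = (fused - base) * 100
    low, high = wilson_interval(fused, n)
    print(f"{name}: {delta_pp:+.1f}pp, fused 95% interval {low:.0%} to {high:.0%}")
```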
Validated empirically on an M1 Ultra across more than 70 tests (v2_research session, 2026-05-27). Q&A quality holds as the KV cache is quantized from 5 bits down to 3; math is the only domain that degrades, losing 10pp at kv_bits=3. Ship with kv_bits=5 everywhere as the safe floor.
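To get a feel for what each KV-bit setting buys in memory, the sketch below estimates storage per cached element. It assumes group-wise affine quantization with one 16-bit scale and one 16-bit bias per group, which is how MLX quantizes half-precision arrays; the model's actual layer and head counts are not given here, so only the ratio to a 16-bit cache is computed. The function names are illustrative.

```python
def effective_bits(kv_bits: int, group_size: int = 64, scale_bits: int = 16) -> float:
    """Bits per cached element, counting one scale and one bias per group.

    Assumption: each group of `group_size` elements carries a scale and a
    bias of `scale_bits` bits each.
    """
    return kv_bits + 2 * scale_bits / group_size


def cache_ratio(kv_bits: int, group_size: int = 64, baseline_bits: int = 16) -> float:
    """Quantized cache size as a fraction of an unquantized 16-bit cache."""
    return effective_bits(kv_bits, group_size) / baseline_bits


for bits in (5, 4, 3):
    print(f"kv_bits={bits}: {effective_bits(bits)} bits/element, "
          f"{cache_ratio(bits):.0%} of a 16-bit cache")
```

Under these assumptions, dropping from 5 bits to 3 saves relatively little additional memory compared with the first step down from 16 bits, which is consistent with treating kv_bits=5 as a conservative default.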
Performance on M1 Ultra
- TTFT (with sysprompt cache): 305 ms mean, P99 329 ms
- Decode: 89 tok/s (faster than base Nano on the same hardware)
- End-to-end (100-token response): 728 ms wall
- RAM peak: 3.17 GB constant across 100-query stress test (zero memory growth)
With a time to first token of roughly 305 ms, responses feel snappy, and because inference runs entirely on-device there is no network round-trip to a cloud LLM service.
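Figures like these can be checked on your own hardware with a small timing harness. The sketch below measures time to first chunk and the steady-state chunk rate for any iterator that yields one item per generated token. `measure_stream` is a hypothetical helper; the exact measurement setup behind the numbers above (for example, the system-prompt cache) is not specified here, so results may differ.

```python
import time
from typing import Iterable


def measure_stream(chunks: Iterable) -> dict:
    """Time a token stream: first-chunk latency and decode rate.

    The clock starts when iteration begins, so for a lazy generator the
    prompt-processing time is included in `ttft_s`.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()

    if first is None:
        return {"chunks": 0, "ttft_s": None, "decode_tok_s": None,
                "total_s": end - start}

    decode_time = end - first
    # The first chunk is excluded from the rate: it belongs to TTFT.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"chunks": count, "ttft_s": first - start,
            "decode_tok_s": rate, "total_s": end - start}
```

With mlx-lm installed, passing `stream_generate(model, tok, prompt, max_tokens=100)` from `mlx_lm` as `chunks` should give comparable measurements, assuming each yielded response corresponds to one token.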
Architecture
- Base: Qwen3-Next 4B (hybrid 3:1 linear:full attention)
- Adapter: LoRA fused into base weights (rank-8, late-layer focus 17–30, MLP gates + self-attn q_proj most affected)
- Quantization: MLX 4-bit
- Context: 32K tokens
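"Fused" here means the adapter no longer exists as a separate module at inference time. In the standard LoRA formulation, the low-rank update is folded into each affected weight matrix as W' = W + (alpha / r) · B · A. The toy sketch below illustrates that identity in pure Python with a rank of 1; the shapes, values, and `alpha` are made up for illustration, and the exact fusion procedure and scaling used for this checkpoint are not stated here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_lora(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example: d_out=2, d_in=3, rank=1 (the card reports rank 8).
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # base weight, d_out x d_in
a = [[1.0, 2.0, 3.0]]                    # LoRA A, rank x d_in
b = [[0.5], [-1.0]]                      # LoRA B, d_out x rank
alpha, rank = 2.0, 1

fused = fuse_lora(w, a, b, alpha, rank)

# The fused weight reproduces the base path plus the scaled adapter path.
x = [[1.0], [1.0], [1.0]]
base_out = matmul(w, x)
adapter_out = matmul(b, matmul(a, x))
unfused = [[bo[0] + (alpha / rank) * ao[0]] for bo, ao in zip(base_out, adapter_out)]
assert matmul(fused, x) == unfused
```

Because the merged matrix has the same shape as the original, fusing adds no parameters and no extra matrix multiplications at inference time.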
Usage
The Outlier macOS app ships this as the default Nano tier — no setup needed.
For direct mlx_lm use:
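from mlx_lm import load, generate
# Load the 4-bit weights and tokenizer from the Hugging Face Hub
model, tok = load("Outlier-Ai/Outlier-Nano-4B-MLX-4bit")
# Wrap the question in the model's chat template before generating
messages = [{"role": "user", "content": "What's 17+23?"}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True)
# kv_bits=5 with kv_group_size=64 quantizes the KV cache at the recommended floor
out = generate(model, tok, prompt=prompt,
               max_tokens=50, kv_bits=5, kv_group_size=64)
print(out)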
License
Apache 2.0 — same as the base Qwen3-Next 4B model.
Citation
@misc{outlier-nano-combined-4b-mlx-4bit-2026,
author = {Outlier-Ai},
title = {Outlier Nano 4B (Combined-fused, MLX 4-bit)},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/Outlier-Ai/Outlier-Nano-4B-MLX-4bit}
}
Evaluation results
- Accuracy on MMLU (5-shot, n=14042), test set, self-reported: 0.744
- pass@1 on HumanEval, test set, self-reported: 0.579