Instructions to use rasyosef/Mistral-NeMo-Minitron-8B-Chat with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use rasyosef/Mistral-NeMo-Minitron-8B-Chat with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="rasyosef/Mistral-NeMo-Minitron-8B-Chat")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rasyosef/Mistral-NeMo-Minitron-8B-Chat")
model = AutoModelForCausalLM.from_pretrained("rasyosef/Mistral-NeMo-Minitron-8B-Chat")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use rasyosef/Mistral-NeMo-Minitron-8B-Chat with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "rasyosef/Mistral-NeMo-Minitron-8B-Chat"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rasyosef/Mistral-NeMo-Minitron-8B-Chat",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/rasyosef/Mistral-NeMo-Minitron-8B-Chat
- SGLang
How to use rasyosef/Mistral-NeMo-Minitron-8B-Chat with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "rasyosef/Mistral-NeMo-Minitron-8B-Chat" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rasyosef/Mistral-NeMo-Minitron-8B-Chat",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "rasyosef/Mistral-NeMo-Minitron-8B-Chat" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rasyosef/Mistral-NeMo-Minitron-8B-Chat",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use rasyosef/Mistral-NeMo-Minitron-8B-Chat with Docker Model Runner:
docker model run hf.co/rasyosef/Mistral-NeMo-Minitron-8B-Chat
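For anyone who would rather call the server from Python than curl, here is a minimal sketch using only the standard library. It assumes a server started as in the vLLM or SGLang snippets is already listening locally; the helper names `build_chat_request` and `ask` are made up for this example and are not part of either project.

```python
import json
import urllib.request

MODEL_ID = "rasyosef/Mistral-NeMo-Minitron-8B-Chat"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Port 8000 matches the vLLM snippet; for the SGLang server pass `base_url="http://localhost:30000"`, e.g. `ask("What is the capital of France?", "http://localhost:30000")`.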
ChatML template does not work properly
The model hallucinates a lot. I am using the GGUF model in LM Studio and selected the ChatML prompt template, as described in the model card.
@WasamiKirua I'm not sure about the quality of the quantized versions. I'd recommend loading the model weights in bfloat16.
Here's a Colab notebook with a chat interface that you can use to interact with the chat model.
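If you want to try the bfloat16 route outside the notebook, a rough sketch with Transformers might look like the following. The function names are made up for this example, `device_map="auto"` is an assumption that needs the `accelerate` package, and recent Transformers releases also accept `dtype=` in place of `torch_dtype=`. This is not necessarily how the notebook does it.

```python
MODEL_ID = "rasyosef/Mistral-NeMo-Minitron-8B-Chat"


def load_chat_model(model_id=MODEL_ID):
    """Load the tokenizer and the unquantized weights in bfloat16."""
    # Imported lazily so this module can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # keep weights in bfloat16 instead of float32
        device_map="auto",           # assumption: requires the `accelerate` package
    )
    return tokenizer, model


def chat(tokenizer, model, prompt, max_new_tokens=256):
    """Apply the tokenizer's own chat template and return only the new text."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the generated reply is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Usage would be `tokenizer, model = load_chat_model()` followed by `chat(tokenizer, model, "Who are you?")`. Comparing these replies against the GGUF output for the same prompt should show whether the quantization is the problem.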
You are using ChatML. Use the phi-2 template for chatting instead; the author mentioned that in a discussion on the original model: https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base/discussions/5#66cbf507ed5c5babdef42cd1
If you use the wrong chat template with a GGUF, the model hallucinates.
Yeah, you have to use the chat template supported by the model's tokenizer. In this case, it's ChatML, the same template as in the model card.
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How to explain Internet for a medieval knight?<|im_end|>
<|im_start|>assistant
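If your app needs the prompt as a raw string, the layout above can be reproduced with a few lines of Python. This is only a sketch: `format_chatml` is a made-up helper, and the tokenizer's own template may handle whitespace or special tokens slightly differently, so `tokenizer.apply_chat_template` is the safer source of truth when it is available.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
])
print(prompt)
```

The printed prompt should match the five template lines above, ending with the open assistant turn.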
Also check which GGUF you are using: lower quants tend to hallucinate a lot. And set the temperature, top-p, and top-k to appropriate values.
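To make the three sampling knobs concrete, here is a toy pure-Python sketch of what temperature, top-k, and top-p do to a next-token distribution. The function and the example logits are made up for illustration; real inference engines implement this on tensors and may order or combine the filters differently. No specific "right" values are implied.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: prob} after temperature scaling, top-k, then top-p."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(p for _, p in ranked)
        ranked = [(tok, p / norm) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # hypothetical values
print(filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperature and tighter top-k/top-p cut off the unlikely tail of the distribution, which is usually where off-the-mark tokens come from.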
The SFT dataset only contained English instructions. Can you try English prompts? This looks like Italian.
It works fine with English text. Minitron is multilingual. I don't know what percentage of the training data is Italian, but in my experience Mistral NeMo generates Italian text pretty well. Perhaps, since you fine-tuned it with OpenHermes, it comes down to catastrophic forgetting? But I'm pretty sure OpenHermes also contains Italian text. No idea, honestly. Anyway, did you use axolotl to fine-tune? If so, could I ask for the configuration YAML?
The chat model is still able to generate Italian text. Catastrophic forgetting mainly occurs with full-parameter fine-tuning. Since I used LoRA, there's little chance of that happening.
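A quick back-of-the-envelope sketch of why LoRA touches so little of the model: the base weight matrix stays frozen and only two small low-rank factors are trained. The layer size and rank below are hypothetical numbers picked for illustration, not the configuration used for this model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, chosen only to show the ratio.
d_out, d_in, rank = 4096, 4096, 16
fraction = lora_params(d_out, d_in, rank) / full_params(d_out, d_in)
print(f"LoRA trains {fraction:.2%} of this layer's parameters")
```

With these example numbers the adapter is under one percent of the layer, and the original weights are left untouched.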
Here's an example:
Instruction: "Quali sono le destinazioni turistiche più apprezzate in Italia?"
Response: "Le destinazioni turistiche più apprezzate in Italia dipendono da una varietà di fattori, tra cui la storia, la cultura, la natura e la gastronomia. Alcune delle destinazioni turistiche più popolari in Italia includono: Roma, con la sua ricca storia e monumenti antichi; Venezia, famosa per le sue gondole e canali; Firenze, con il suo patrimonio artistico rinascimentale; Milano, con la sua architettura moderna e la moda; Napoli, con la sua storia e la sua cucina; e la Costa Amalfitana, con le sue spettacolari viste panoramiche. Ci sono anche molte altre destinazioni turistiche popolari in Italia, come Pompei, la Sardegna, la Sicilia, il Piemonte, la Toscana e la Liguria. In ogni regione di Italia, ci sono molte destinazioni turistiche uniche e interessanti da scoprire."
Same for me, sadly. Using imatrix quants (Q6), the model always responds off the mark. It's very dumb, especially in E/RP...
Hi @Varkoyote, can you try the same prompts with the unquantized bfloat16 version?
Here's a colab notebook with a chat interface.
https://huggingface.co/rasyosef/Mistral-NeMo-Minitron-8B-Chat/blob/main/Mistral_NeMo_Minitron_8B_chatbot.ipynb
