MiniLingua-1b
MiniLingua-1b is a multilingual base language model with approximately 1 billion parameters. It was trained from scratch with a custom SentencePiece tokenizer (128k-token vocabulary) and supports the following languages:
Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, Swedish, and programming code.
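As a rough guide to what "approximately 1 billion parameters" means for hardware, the sketch below estimates the memory needed to hold the weights alone at a few common precisions. The parameter count is rounded to exactly 1e9, which is an assumption, and runtime overhead such as activations and the KV cache is not included.

```python
# Rough weight-memory estimate. N_PARAMS is an assumed round figure,
# not the exact parameter count of MiniLingua-1b.
N_PARAMS = 1_000_000_000

# Storage size of one parameter for each dtype, in bytes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(n_params: int, dtype: str) -> float:
    """Return the size of the weights alone, in GiB, for the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for name in BYTES_PER_PARAM:
        print(f"{name}: {weight_memory_gib(N_PARAMS, name):.2f} GiB")
```

Under this assumption the weights take a little under 2 GiB in 16-bit precision and a little under 4 GiB in float32.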
Training Details
MiniLingua-1b was trained on a corpus of 1 trillion tokens.
The model was trained for 1.5 epochs over 12 days on the LUMI supercomputer, using:
- 256 AMD MI250X GPUs
- bf16 precision
- Megatron-LM library
- Data parallelism
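The figures above allow a back-of-the-envelope estimate of training throughput. The sketch below is illustrative only: it assumes that 1.5 epochs means 1.5 passes over the 1 trillion token corpus, that the run occupied all 256 GPUs for the full 12 days, and that the parameter count is a round 1e9. The compute estimate uses the common rule of thumb of about 6 FLOPs per parameter per training token, which is an approximation and not a measurement reported for this model.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumptions: 1.5 epochs over a 1e12-token corpus, 12 full days of wall-clock
# time on all 256 GPUs, and a round 1e9 parameter count.
CORPUS_TOKENS = 1e12
EPOCHS = 1.5
DAYS = 12
NUM_GPUS = 256
N_PARAMS = 1e9


def training_stats() -> dict:
    """Return approximate token and compute throughput for the run."""
    seconds = DAYS * 24 * 3600
    tokens_seen = CORPUS_TOKENS * EPOCHS
    tokens_per_sec = tokens_seen / seconds
    # Widely used rule of thumb for dense transformers:
    # about 6 FLOPs per parameter per training token.
    total_flops = 6 * N_PARAMS * tokens_seen
    return {
        "tokens_seen": tokens_seen,
        "tokens_per_sec": tokens_per_sec,
        "tokens_per_sec_per_gpu": tokens_per_sec / NUM_GPUS,
        "total_flops": total_flops,
        "flops_per_sec_per_gpu": total_flops / seconds / NUM_GPUS,
    }


if __name__ == "__main__":
    for key, value in training_stats().items():
        print(f"{key}: {value:.3e}")
```

Under these assumptions the run processes about 1.5 trillion tokens, roughly 5,650 tokens per second per GPU, and on the order of 9e21 training FLOPs in total.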
Intended Use
This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language understanding tasks in low- and high-resource European languages.
Use with transformers
Quick start with Transformers, for both GPU and CPU environments:
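from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
model_name = "minilingua-ai/MiniLingua-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights on a GPU when one is available, otherwise on CPU (requires the accelerate package).
# Older transformers releases name the dtype argument torch_dtype.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", dtype=torch.float16)
gen = pipeline("text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True)
prompt = "Translate from Bulgarian: Здравейте! Как сте? Translation:"
# do_sample=False selects greedy decoding, so the output is deterministic.
out = gen(prompt, max_new_tokens=128, do_sample=False)
# The pipeline returns a list of dicts; the text is stored under "generated_text".
print(out[0]["generated_text"])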
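Because MiniLingua-1b is a base model and not an instruction-tuned one, a completion-style prompt with a few worked examples may steer it better than a single bare instruction. The following is a minimal sketch of a prompt builder that extends the "Translate from ...: ... Translation:" pattern used in the quick start. The helper name `build_few_shot_prompt` and the example pairs are hypothetical, and the format is an assumption, not a template the model is known to have been trained on.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    source_lang: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Build a completion-style translation prompt (hypothetical format)."""
    # One line per solved example: source text followed by its translation.
    blocks = [
        f"Translate from {source_lang}: {src} Translation: {tgt}"
        for src, tgt in examples
    ]
    # The final line is left open so the model continues after "Translation:".
    blocks.append(f"Translate from {source_lang}: {query} Translation:")
    return "\n".join(blocks)


if __name__ == "__main__":
    demos = [("Добро утро!", "Good morning!"), ("Благодаря.", "Thank you.")]
    print(build_few_shot_prompt("Bulgarian", demos, "Здравейте! Как сте?"))
```

The resulting string can be passed as `prompt` to the `gen(...)` call from the quick start. Since a base model may keep producing further example lines, cutting the generated continuation at the first newline is a reasonable post-processing step.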
License
Apache 2.0: free for research and commercial use, subject to the license terms.