Strive-Ewe Hybrid Tokenizer (v1.0)

Overview

This is a hybrid tokenizer for the Ewe (ɛʋɛgbɛ) language, designed by Strive to address the limitations of global multilingual models. By expanding the XLM-RoBERTa base vocabulary with over 180,000 language-specific tokens, we significantly reduce subword fragmentation.

Performance & Metrics

1. Fragmentation Efficiency (Fertility Score)

The fertility score measures the average number of tokens produced per word. A lower score indicates better morphological alignment.

Model               Fertility Score (Ewe)   Efficiency Gain
XLM-RoBERTa Base    2.202                   –
Strive-Ewe Hybrid   1.157                   +47.45%
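The metric itself is simple to compute. A minimal sketch, where `tokenize_fn` stands in for any tokenizer's `tokenize` method and the toy "halver" tokenizer is purely illustrative:

```python
def fertility(sentences, tokenize_fn):
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(tokenize_fn(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

# Toy check: a tokenizer that splits every word into two halves
# has a fertility of exactly 2.0.
halver = lambda s: [piece for w in s.split()
                    for piece in (w[: len(w) // 2], w[len(w) // 2 :])]
print(fertility(["akpe kple alafa"], halver))  # → 2.0
```

In practice `tokenize_fn` would be `tokenizer.tokenize` from a loaded transformers tokenizer, evaluated over a held-out Ewe corpus.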

2. Context Window Optimization

By reducing the number of tokens per sentence, this tokenizer allows the model to process approximately 90% more Ewe text within the same context window (512 tokens) compared to the standard XLM-R tokenizer.
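The ~90% figure follows directly from the fertility scores in the table above; a quick back-of-the-envelope check:

```python
WINDOW = 512  # XLM-R context length in tokens

def words_per_window(fertility: float, window: int = WINDOW) -> float:
    """Approximate number of words that fit in one context window."""
    return window / fertility

base = words_per_window(2.202)    # ≈ 232.5 words per window
hybrid = words_per_window(1.157)  # ≈ 442.5 words per window
gain = hybrid / base - 1          # ≈ 0.903, i.e. ~90% more Ewe text
print(f"{gain:.1%}")
```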

3. Vocabulary Insights

  • New Tokens Added: 180,363
  • Base Model: XLM-RoBERTa-Base
  • Training Data: 1M+ sentences (NLLB-200 + Bible + Strive Custom Instruct Dataset)

Visual Comparison (Example)

Sentence: "Akpe ene kple alafa atɔ̃ nye alafa eve blata-atɔ̃."

  • XLM-R Base (23 tokens): ['_Ak', 'pe', '_en', 'e', ...]
  • Strive-Ewe (12 tokens): ['_Akpe', '_ene', '_kple', '_alafa', ...]
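The per-sentence comparison can be reproduced by tokenizing the same sentence with both tokenizers. A sketch (repo ids taken from this card; loading the tokenizers requires `transformers` and network access, so it is kept behind the main guard):

```python
def reduction(n_base: int, n_hybrid: int) -> float:
    """Fractional token reduction from the base to the hybrid tokenizer."""
    return 1 - n_hybrid / n_base

if __name__ == "__main__":
    from transformers import AutoTokenizer
    sentence = "Akpe ene kple alafa atɔ̃ nye alafa eve blata-atɔ̃."
    base = AutoTokenizer.from_pretrained("xlm-roberta-base")
    hybrid = AutoTokenizer.from_pretrained("jojonocode/strive-ewe-xlmr-hybrid-tokenizer")
    n_base, n_hybrid = len(base.tokenize(sentence)), len(hybrid.tokenize(sentence))
    print(n_base, n_hybrid, f"{reduction(n_base, n_hybrid):.1%} fewer tokens")
```

With the counts quoted above (23 vs. 12 tokens), the reduction is about 48%, consistent with the fertility gain in the table.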

Training Methodology

The tokenizer was trained using the Unigram algorithm via the train_new_from_iterator method. It preserves the original 250k XLM-R tokens to maintain multilingual compatibility with French and other languages while optimizing for the specific phonology and morphology of Ewe (notably handling special characters like ɖ, ɛ, ƒ, ɣ, ɦ, ŋ, ɔ, ʋ).
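The training step can be sketched as follows. This is a hypothetical illustration, not the released training code: the corpus, batch size, and vocabulary size are placeholders, and `train_new_from_iterator` learns a fresh vocabulary from the iterator while reusing the base tokenizer's algorithm and special tokens (merging back the original 250k entries is a separate step not shown here).

```python
def batch_iterator(sentences, batch_size=1000):
    """Yield the corpus in batches, as train_new_from_iterator expects."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]

def extend_tokenizer(base_model, sentences, vocab_size):
    """Train a new tokenizer on an Ewe corpus from a pretrained base."""
    from transformers import AutoTokenizer  # lazy import; needs `transformers`
    base = AutoTokenizer.from_pretrained(base_model)
    return base.train_new_from_iterator(batch_iterator(sentences),
                                        vocab_size=vocab_size)

if __name__ == "__main__":
    # Toy corpus; the real run used 1M+ sentences (see Vocabulary Insights).
    ewe_sentences = ["Akpe na wo.", "Akpe ene kple alafa atɔ̃."]
    new_tok = extend_tokenizer("xlm-roberta-base", ewe_sentences,
                               vocab_size=30_000)
```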

How to Use

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jojonocode/strive-ewe-xlmr-hybrid-tokenizer")

text = "Eʋegbe nye gbe nyui de."
tokens = tokenizer.tokenize(text)
print(tokens)

Citation

If you use this tokenizer in your research, please cite:

@software{strive_ewe_tokenizer_2026,
  author = {Adzonya, Joël Elisée and {Strive Team}},
  title = {Strive-Ewe Hybrid Tokenizer: Optimizing XLM-RoBERTa for Ewe Language},
  year = {2026},
  url = {https://huggingface.co/jojonocode/strive-ewe-xlmr-hybrid-tokenizer}
}