SmolLM2-360M-Khasi-CPT (Phase 2)

Model Description

This model is a fine-tuned version of Bapynshngain/SmolLM2-360M-Khasi-Base, trained on a Khasi monolingual dataset. It is a Continued Pre-Training (CPT) checkpoint derived from SmolLM2-360M-Instruct and adapted specifically for the Khasi language. It is Phase 2 of a multi-stage training pipeline for building lightweight, efficient language models for the languages of Meghalaya under the Tynrai AI initiative.

⚠️ CRITICAL WARNING: INTERMEDIATE CHECKPOINT ⚠️ This is not an instruction-following model or a translator. It is a base CPT model trained only on next-token prediction. It has acquired the Khasi vocabulary but has not yet undergone semantic alignment. When prompted, it is likely to exhibit what we call Token Collision: drifting into Romanized Hindi, Vietnamese, or English, because its newly learned Khasi representations still compete with the much larger body of Latin-script text the base model saw during pre-training.

Do not use this model for production tasks. It is published for research tracking and as a base for Supervised Fine-Tuning (SFT).

Training Pipeline & Methodology

This model was adapted using a careful, non-destructive vocabulary injection method to prevent catastrophic forgetting of the base model's English and logical reasoning capabilities.

1. Tokenizer Surgery & Smart Initialization

Rather than replacing the base BPE tokenizer outright (which would invalidate the pre-trained embeddings), we performed a vocabulary merge:

  • Extracted tokens from a custom 12K Unigram Khasi SentencePiece model (Bapynshngain/enkha-hybrid-tokenizer).
  • Filtered and injected 10,899 strictly new Khasi tokens into the SmolLM2 vocabulary.
  • Smart Initialization: The newly added embedding rows were not left randomly initialized. Instead, each new row was initialized to the average of the embeddings of the existing sub-word tokens that the base tokenizer previously used to spell that Khasi token. This gives each new token a meaningful starting point instead of noise.
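
The merge and initialization steps above can be sketched on toy data. This is a minimal illustration, not the code used to build this checkpoint: the vocabulary, the 2-dimensional embedding rows, and the helpers `greedy_tokenize`, `select_new_tokens`, and `mean_init` are all hypothetical, and a greedy longest-match splitter stands in for the real BPE tokenizer.

```python
# Toy sketch: all vocabularies, vectors, and helper names here are hypothetical.

def greedy_tokenize(text, vocab):
    """Stand-in for the base tokenizer: greedy longest-match over `vocab`."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return pieces


def select_new_tokens(base_vocab, candidates):
    """Keep only candidates missing from the base vocabulary (order preserved)."""
    seen = set(base_vocab)
    new_tokens = []
    for tok in candidates:
        if tok not in seen:
            seen.add(tok)
            new_tokens.append(tok)
    return new_tokens


def mean_init(token, base_vocab, embeddings):
    """Average the embedding rows of the pieces the base tokenizer splits `token` into."""
    ids = [base_vocab[p] for p in greedy_tokenize(token, base_vocab)]
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]


base_vocab = {"kh": 0, "a": 1, "si": 2, "ka": 3}
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
candidates = ["khasi", "ka", "khasi"]  # "ka" already exists, "khasi" is repeated

new_tokens = select_new_tokens(base_vocab, candidates)
# Compute every new row against the ORIGINAL vocabulary before appending anything.
new_rows = [mean_init(tok, base_vocab, embeddings) for tok in new_tokens]
for tok, row in zip(new_tokens, new_rows):
    base_vocab[tok] = len(embeddings)
    embeddings.append(row)
```

With a real checkpoint, the same idea would typically go through `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))` in transformers, with the mean rows then copied into the resized embedding matrix; whether this model was built exactly that way is an assumption.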

2. Continued Pre-Training (CPT)

The resized model then underwent standard Causal Language Modeling (CLM) training so that it learns how the new tokens are used in context.
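
For readers unfamiliar with the objective, the CLM loss is the average cross-entropy of predicting token t+1 from the model's output at position t. The sketch below computes it in plain Python on made-up logits; `causal_lm_loss` is a hypothetical helper, and a real run would get the same quantity from the training framework instead.

```python
import math


def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy.

    logits[t] holds the vocabulary scores produced at position t;
    the target for position t is token_ids[t + 1].
    """
    n_targets = len(token_ids) - 1
    total = 0.0
    for t in range(n_targets):
        scores = logits[t]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[token_ids[t + 1]]
    return total / n_targets


token_ids = [0, 1, 2]

# A model with no preference over a 4-token vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
uniform_loss = causal_lm_loss(uniform_logits, token_ids)

# A model that strongly favors the correct next token at each position.
confident_logits = [[0.0, 8.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident_loss = causal_lm_loss(confident_logits, token_ids)
```

The last position has no target, which is why its logits do not affect the loss.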

  • Khasi Data: ~740K monolingual Khasi sentences (Bapynshngain/Bapyn-Kha-News).
  • English Anchor Data: ~100K high-quality English documents from FineWeb-Edu (acting as ~15% of the mix to retain structural reasoning and prevent catastrophic forgetting).
  • Training setup: Trained with the Hugging Face Trainer using bfloat16 precision and cosine learning-rate decay.
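
The two recipe ingredients listed above, a Khasi/English mix and a cosine learning-rate decay, can be illustrated as follows. This is a sketch under assumptions: the card does not say how the two sources were interleaved or whether warmup was used, so `mixed_stream` (per-example sampling at a 0.15 English probability) and `cosine_lr` (no warmup, decay to zero) are hypothetical stand-ins, and the data is synthetic.

```python
import itertools
import math
import random


def mixed_stream(khasi, english, english_fraction=0.15, seed=0):
    """Yield examples, drawing from `english` with probability `english_fraction`."""
    rng = random.Random(seed)
    khasi_iter, english_iter = iter(khasi), iter(english)
    while True:
        source = english_iter if rng.random() < english_fraction else khasi_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop as soon as either source is exhausted


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from `peak_lr` at step 0 to 0 at `total_steps` (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Synthetic placeholders for the two corpora.
khasi = [("kha", i) for i in range(10_000)]
english = [("en", i) for i in range(10_000)]

sample = list(itertools.islice(mixed_stream(khasi, english), 5_000))
english_share = sum(1 for lang, _ in sample if lang == "en") / len(sample)
```

With the Hugging Face Trainer, the schedule and precision are normally selected through `TrainingArguments` (for example `lr_scheduler_type="cosine"` and `bf16=True`) instead of being hand-written like this.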

How to Use (Inference)

Because this is a base model, you must prompt it with the beginning of a Khasi sentence and allow it to autocomplete. Chat templates will not work correctly yet.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Bapynshngain/SmolLM2-360M-Khasi-CPT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)

prompt = "Ka nongbah jong ka Meghalaya ka long"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.2, # Keep temperature LOW (0.1 - 0.2) to prevent latent space bleed
        top_p=0.9,
        do_sample=True,
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
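
To see why a low temperature helps here, recall that sampling divides the logits by the temperature before the softmax, so small values concentrate probability on the top-scoring token and make off-language tokens less likely to be drawn. The snippet below shows this on made-up logits; `softmax_with_temperature` is a hypothetical helper, not part of transformers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

probs_t1 = softmax_with_temperature(toy_logits, 1.0)
probs_t02 = softmax_with_temperature(toy_logits, 0.2)

# The top token's share of the probability mass grows as temperature drops.
top_share_t1 = probs_t1[0]
top_share_t02 = probs_t02[0]
```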

Model size: 0.4B params (Safetensors). Tensor type: BF16.