XLM-RoBERTa for Nepali-English Bilingual Fake News Detection

This model is XLM-RoBERTa-base fine-tuned to detect fake news in Nepali and English media. It targets challenges typical of low-resource NLP, such as morphological complexity and code-switching.

Model Details

Model Description

  • Developed by: Plan Ghimire and Pranjal Shrestha (Department of Electronics and Computer Engineering, IOE, Thapathali Campus, Tribhuvan University, Nepal)
  • Model type: Transformer-based Text Classifier
  • Language(s) (NLP): Nepali (Devanagari) and English
  • License: MIT (or as specified by the authors)
  • Finetuned from model: xlm-roberta-base

Model Sources

Uses

Direct Use

This model is intended for the classification of news articles and social media posts into "Real" or "Fake." It is specifically trained to handle:

  • Code-switched content (mixing Nepali and English).
  • Agglutinative morphology of the Nepali language.
  • Social media text from platforms like Facebook, X, and TikTok.
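
As a rough illustration of what "code-switched" means in practice, the sketch below estimates how much of a text is written in Devanagari versus Latin script. The helpers `script_mix` and `is_code_switched` and the 0.2 threshold are hypothetical and are not part of the model or its training pipeline. Nepali written in Latin script (romanized) would count as Latin here.

```python
def script_mix(text: str) -> dict:
    """Return the share of Devanagari and Latin characters in the text.

    Devanagari is counted by Unicode block (U+0900 to U+097F), which
    includes vowel signs, digits, and the danda. Latin is counted as
    ASCII letters only. Spaces and other characters are ignored.
    """
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = devanagari + latin
    if total == 0:
        return {"devanagari": 0.0, "latin": 0.0}
    return {"devanagari": devanagari / total, "latin": latin / total}


def is_code_switched(text: str, min_share: float = 0.2) -> bool:
    """Flag text where both scripts make up at least min_share of the characters."""
    mix = script_mix(text)
    return min(mix["devanagari"], mix["latin"]) >= min_share
```

A check like this could be used to report how the model behaves on mixed-script inputs separately from single-script ones.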

Out-of-Scope Use

The model should not be used as the sole arbiter of truth without human oversight, particularly in sensitive political contexts. It is not designed for languages other than Nepali and English.
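
One way to keep a human in the loop is to route low-confidence predictions to a reviewer instead of publishing the label directly. The sketch below assumes a result dictionary with `label` and `confidence` keys, as returned by `predict_news` in the getting-started section. The function name `route_prediction` and the 0.90 threshold are illustrative assumptions, not values recommended by the authors.

```python
def route_prediction(result: dict, min_confidence: float = 0.90) -> str:
    """Return "auto_label" or "needs_human_review" for one prediction.

    result is expected to contain "label" and "confidence" keys.
    The threshold is a placeholder and should be tuned on held-out data.
    """
    if result["confidence"] < min_confidence:
        return "needs_human_review"
    return "auto_label"
```

In sensitive political contexts it may be preferable to send every "Fake" prediction to review, whatever its confidence.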

Bias, Risks, and Limitations

Limitations

  • Sequence Length: Optimized for a maximum sequence length of 128 tokens; longer inputs will need to be truncated or split before classification.
  • Context: Despite its high overall accuracy, the model may struggle with nuanced satire that closely mimics formal journalism and shows no linguistic "red flags."
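
For articles longer than the 128-token limit, one option is to split the token sequence into overlapping windows and classify each window separately. The sketch below is a minimal, model-independent version of that idea. The function name, the 32-token overlap, and the reservation of two positions for the `<s>` and `</s>` special tokens are assumptions for illustration, not the authors' preprocessing.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int],
    max_length: int = 128,
    stride: int = 32,
    num_special: int = 2,
) -> List[List[int]]:
    """Split token IDs into overlapping windows that fit within max_length.

    num_special positions are reserved for special tokens added later,
    and consecutive windows share `stride` tokens.
    """
    window = max_length - num_special
    if window <= 0 or not 0 <= stride < window:
        raise ValueError("need max_length > num_special and 0 <= stride < window")

    step = window - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks
```

The input IDs could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`. How to combine per-window predictions (for example, averaging the probabilities) is a separate design choice that the model card does not specify.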

Recommendations

The authors recommend using SHAP (SHapley Additive exPlanations) alongside the model to visualize token-level contributions, ensuring that the classification is based on credible linguistic patterns rather than dataset artifacts.
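
To make the idea behind token-level Shapley values concrete, the sketch below computes them exactly by enumerating every subset of tokens. This is not the `shap` library's algorithm, which approximates these values for real models; exact enumeration grows exponentially with the number of tokens and is only practical for very short inputs. The scoring function `toy_fake_score` and its weights are invented for illustration. In practice the score would be the model's "Fake" probability for the text with only the kept tokens.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    tokens: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[float]:
    """Exact Shapley value of each token for score(kept_tokens)."""
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [tokens[j] for j in subset]
                with_i = [tokens[j] for j in sorted(subset + (i,))]
                # Marginal contribution of token i to this coalition.
                values[i] += weight * (score(with_i) - score(without_i))
    return values


# Hypothetical additive scorer standing in for the model's "Fake" probability.
TOY_WEIGHTS = {"shocking": 0.4, "secret": 0.3}


def toy_fake_score(kept: List[str]) -> float:
    return sum(TOY_WEIGHTS.get(token, 0.0) for token in kept)


contributions = shapley_values(["shocking", "secret", "report"], toy_fake_score)
```

With an additive scorer like this one, each token's Shapley value equals its own weight, which makes the example easy to check by hand. Large contributions from tokens that look like dataset artifacts (source names, boilerplate phrases) would be a warning sign.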

How to Get Started with the Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MODEL_NAME = "planghimire/nepali-english-fake-news-detector"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(DEVICE)
model.eval()

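def predict_news(text: str):
    """Classify one text as Real or Fake and return the label with class probabilities."""
    # max_length follows the 128-token limit stated under Limitations.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        padding=True
    ).to(DEVICE)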

    with torch.no_grad():
        probs = F.softmax(model(**inputs).logits, dim=-1)[0]

    # Assumes class index 0 is "Fake" and index 1 is "Real"; check model.config.id2label to confirm.
    fake_prob, real_prob = probs.tolist()
    is_real = real_prob > fake_prob

    print(
        f"Prediction: {'REAL' if is_real else 'FAKE'} | "
        f"Confidence: {max(real_prob, fake_prob):.2%} | "
        f"Real: {real_prob:.3f} | Fake: {fake_prob:.3f}"
    )

    return {
        "label": "Real" if is_real else "Fake",
        "confidence": max(real_prob, fake_prob),
        "real_prob": real_prob,
        "fake_prob": fake_prob
    }

# Example: classify a single Nepali sentence
text = "आर्थिकतामा पर्याप्त ध्यान नदिएको भन्दै आएका गुनासोलाई बेवास्ता गर्न खोज्दै ट्रम्पले डिसेम्बर ९ मा सभामा कडा आलोचना गरे।"
predict_news(text)

