# MDLM-TinyStories

A small Masked Diffusion Language Model (MDLM) trained on TinyStories.

## Model Details

| Property | Value |
|---|---|
| Architecture | DiT + adaLN-zero + RoPE + bidirectional attention |
| Parameters | 29.4M |
| Layers | 4 |
| Hidden dim | 256 |
| Heads | 4 |
| Context length | 128 tokens |
| Tokenizer | GPT-2 (50,257 tokens + 1 mask token) |
| Training | MDLM (Rao-Blackwellized ELBO) |
| Dataset | TinyStories (50k-example subset) |
| Steps | 1500 |
| Best val loss | 7.8963 |
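
The table above can be captured in a small config object. This is an illustrative sketch only; the field names below are hypothetical and are not the actual `MDLMConfig` shipped with this repo.

```python
from dataclasses import dataclass

# Hypothetical config mirroring the Model Details table.
# Field names are illustrative, not the repo's real MDLMConfig.
@dataclass
class TinyMDLMConfig:
    n_layers: int = 4           # transformer blocks
    d_model: int = 256          # hidden dimension
    n_heads: int = 4            # attention heads
    max_seq_len: int = 128      # context length in tokens
    vocab_size: int = 50_258    # GPT-2's 50,257 tokens + 1 mask token
    mask_token_id: int = 50_257 # mask token appended after the GPT-2 vocab
```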

## How It Works

Unlike autoregressive LMs, MDLM generates text through iterative denoising:

  1. Start from a fully masked sequence of `[MASK]` tokens.
  2. At each step, the model predicts clean tokens for all masked positions.
  3. A fraction of positions is unmasked; the rest stay masked for the next step.
  4. Repeat over ~100 steps until no masks remain.

Because attention is bidirectional, every position attends to every other position at every denoising step.

Based on *Simple and Effective Masked Diffusion Language Models* (Sahoo et al., NeurIPS 2024).
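
The denoising loop above can be sketched as follows. This is a minimal illustration, not the repo's actual `sample` function: it assumes `model` maps `(batch, seq_len)` token ids to `(batch, seq_len, vocab)` logits, and it unmasks the highest-confidence positions first.

```python
import torch

MASK_ID = 50257  # assumed: the mask token appended after GPT-2's vocab

@torch.no_grad()
def sample_sketch(model, seq_len=128, num_steps=100, temperature=0.7, device="cpu"):
    # 1. Start fully masked
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        # 2. Bidirectional model predicts clean tokens for every position
        logits = model(x)
        logits[..., MASK_ID] = float("-inf")  # never sample the mask token
        probs = torch.softmax(logits / temperature, dim=-1)
        preds = torch.multinomial(probs.view(-1, probs.size(-1)), 1).view(1, seq_len)
        # 3. Unmask a fraction of the still-masked positions, most confident first
        still_masked = x == MASK_ID
        n_to_unmask = max(1, int(still_masked.sum() / (num_steps - step)))
        conf = probs.max(dim=-1).values.masked_fill(~still_masked, -1.0)
        idx = conf.view(-1).topk(n_to_unmask).indices
        x.view(-1)[idx] = preds.view(-1)[idx]
        # 4. Stop once nothing is masked
        if not (x == MASK_ID).any():
            break
    return x
```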

## Usage

```python
import torch
from model import MDLMConfig, MDLM, sample
from transformers import AutoTokenizer

# Load model and tokenizer
model = MDLM.from_pretrained("youraveragedev/mdlm-tiny-stories", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("youraveragedev/mdlm-tiny-stories")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Generate text (unconditional)
generated_ids = sample(model, seq_len=128, batch_size=1, num_steps=100, temperature=0.7, device=device)
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)
```

## Training Recipe

From the MDLM paper (NeurIPS 2024):

  • Noise process: Mask each token with probability t ~ U(0,1)
  • Loss: Cross-entropy on masked positions, weighted by 1/t (ELBO)
  • Optimizer: AdamW, lr=3e-4, linear warmup
  • Schedule: Constant after warmup
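
The noise process and loss above can be sketched in a few lines. This is an assumption-laden illustration of the objective, not the repo's training code; `model` is assumed to map `(batch, seq_len)` ids to `(batch, seq_len, vocab)` logits.

```python
import torch
import torch.nn.functional as F

MASK_ID = 50257  # assumed: the mask token appended after GPT-2's vocab

def mdlm_loss_sketch(model, tokens):
    """Mask each token with prob t ~ U(0,1); cross-entropy on masked
    positions, weighted by 1/t (the Rao-Blackwellized ELBO weighting)."""
    B, L = tokens.shape
    t = torch.rand(B, 1).clamp(min=1e-3)           # one noise level per sequence
    mask = torch.rand(B, L) < t                    # mask each token with prob t
    noisy = tokens.masked_fill(mask, MASK_ID)
    logits = model(noisy)                          # (B, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")  # (B, L)
    weight = mask.float() / t                      # 1/t on masked positions, 0 elsewhere
    return (ce * weight).sum() / mask.sum().clamp(min=1)
```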

## Citation

```bibtex
@inproceedings{sahoo2024simple,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Chiu, Justin T and Rush, Alexander and Kuleshov, Volodymyr},
  booktitle={NeurIPS},
  year={2024}
}
```
