# MDLM-TinyStories

A small Masked Diffusion Language Model (MDLM) trained on TinyStories.

## Model Details

| Property | Value |
|---|---|
| Architecture | DiT + adaLN-zero + RoPE + bidirectional attention |
| Parameters | 29.4M |
| Layers | 4 |
| Hidden dim | 256 |
| Heads | 4 |
| Context length | 128 tokens |
| Tokenizer | GPT-2 (50,257 tokens + 1 mask token) |
| Training | MDLM (Rao-Blackwellized ELBO) |
| Dataset | TinyStories (50k-example subset) |
| Steps | 1500 |
| Best val loss | 7.8963 |
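
The table above can be captured in a small config object. This is an illustrative sketch only; the field names below are hypothetical and are not the actual `MDLMConfig` shipped with this repo.

```python
from dataclasses import dataclass

# Hypothetical config mirroring the Model Details table.
# Field names are illustrative, not the repo's real MDLMConfig.
@dataclass
class TinyMDLMConfig:
    n_layers: int = 4           # transformer blocks
    d_model: int = 256          # hidden dimension
    n_heads: int = 4            # attention heads
    max_seq_len: int = 128      # context length in tokens
    vocab_size: int = 50_258    # GPT-2's 50,257 tokens + 1 mask token
    mask_token_id: int = 50_257 # mask token appended after the GPT-2 vocab
```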

## How It Works

Unlike autoregressive LMs, MDLM generates text through iterative denoising:

  1. Start from a fully masked sequence of `[MASK]` tokens.
  2. At each step, the model predicts clean tokens for all masked positions.
  3. A fraction of positions is unmasked; the rest stay masked for the next step.
  4. Repeat over ~100 steps until no masks remain.

Because attention is bidirectional, every position attends to every other position at every denoising step.

Based on *Simple and Effective Masked Diffusion Language Models* (Sahoo et al., NeurIPS 2024).
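
The denoising loop above can be sketched as follows. This is a minimal illustration, not the repo's actual `sample` function: it assumes `model` maps `(batch, seq_len)` token ids to `(batch, seq_len, vocab)` logits, and it unmasks the highest-confidence positions first.

```python
import torch

MASK_ID = 50257  # assumed: the mask token appended after GPT-2's vocab

@torch.no_grad()
def sample_sketch(model, seq_len=128, num_steps=100, temperature=0.7, device="cpu"):
    # 1. Start fully masked
    x = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        # 2. Bidirectional model predicts clean tokens for every position
        logits = model(x)
        logits[..., MASK_ID] = float("-inf")  # never sample the mask token
        probs = torch.softmax(logits / temperature, dim=-1)
        preds = torch.multinomial(probs.view(-1, probs.size(-1)), 1).view(1, seq_len)
        # 3. Unmask a fraction of the still-masked positions, most confident first
        still_masked = x == MASK_ID
        n_to_unmask = max(1, int(still_masked.sum() / (num_steps - step)))
        conf = probs.max(dim=-1).values.masked_fill(~still_masked, -1.0)
        idx = conf.view(-1).topk(n_to_unmask).indices
        x.view(-1)[idx] = preds.view(-1)[idx]
        # 4. Stop once nothing is masked
        if not (x == MASK_ID).any():
            break
    return x
```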

## Usage

```python
import torch
from model import MDLMConfig, MDLM, sample
from transformers import AutoTokenizer

# Load model and tokenizer
model = MDLM.from_pretrained("youraveragedev/mdlm-tiny-stories", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("youraveragedev/mdlm-tiny-stories")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Generate text (unconditional)
generated_ids = sample(model, seq_len=128, batch_size=1, num_steps=100, temperature=0.7, device=device)
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)
```

## Training Recipe

From the MDLM paper (NeurIPS 2024):

  • Noise process: Mask each token with probability t ~ U(0,1)
  • Loss: Cross-entropy on masked positions, weighted by 1/t (ELBO)
  • Optimizer: AdamW, lr=3e-4, linear warmup
  • Schedule: Constant after warmup
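
The noise process and loss above can be sketched in a few lines. This is an assumption-laden illustration of the objective, not the repo's training code; `model` is assumed to map `(batch, seq_len)` ids to `(batch, seq_len, vocab)` logits.

```python
import torch
import torch.nn.functional as F

MASK_ID = 50257  # assumed: the mask token appended after GPT-2's vocab

def mdlm_loss_sketch(model, tokens):
    """Mask each token with prob t ~ U(0,1); cross-entropy on masked
    positions, weighted by 1/t (the Rao-Blackwellized ELBO weighting)."""
    B, L = tokens.shape
    t = torch.rand(B, 1).clamp(min=1e-3)           # one noise level per sequence
    mask = torch.rand(B, L) < t                    # mask each token with prob t
    noisy = tokens.masked_fill(mask, MASK_ID)
    logits = model(noisy)                          # (B, L, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")  # (B, L)
    weight = mask.float() / t                      # 1/t on masked positions, 0 elsewhere
    return (ce * weight).sum() / mask.sum().clamp(min=1)
```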

## Citation

```bibtex
@inproceedings{sahoo2024simple,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Chiu, Justin T and Rush, Alexander and Kuleshov, Volodymyr},
  booktitle={NeurIPS},
  year={2024}
}
```
