Dataset and models for transforming LFM2 2.6B into a Tic Tac Toe master using RL Environments. Free course: https://t.ly/4jIFq
Stefano Fiorucci PRO
anakin87
AI & ML interests
Language Models: orchestration, post-training, GRPO, synthetic data...
Contributing to Haystack LLM framework
A small model that struggled against a random opponent now beats GPT-5-mini at tic-tac-toe
I took https://huggingface.co/LiquidAI/LFM2-2.6B and trained it through play.
Here's how:
1. Build a solid RL env with Verifiers (Prime Intellect)
2. Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3. SFT warm-up to teach format
4. Group-based RL (CISPO) against opponents making 20-70% random moves
5. RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
Done! Beats GPT-5-mini
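To make steps 4 and 5 concrete, here is a minimal sketch of an opponent that plays a random legal move with a given probability and otherwise plays a strong move. It assumes the non-random moves come from minimax; the post does not say how they are picked. The names `opponent_move` and `random_rate` and the 9-character string board are hypothetical, not taken from the course code or the Verifiers API.

```python
import random
from functools import lru_cache

# All eight winning lines on a 3x3 board indexed 0..8, row by row.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    """Indices of the empty cells."""
    return [i for i, cell in enumerate(board) if cell == " "]


def other(player):
    return "O" if player == "X" else "X"


@lru_cache(maxsize=None)
def minimax(board, player):
    """Game value for `player`, who is to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    moves = legal_moves(board)
    if not moves:
        return 0  # board full, nobody won
    # The value of a move is the negated value of the resulting position
    # from the other player's point of view.
    return max(-minimax(board[:i] + player + board[i + 1:], other(player))
               for i in moves)


def best_move(board, player):
    """A minimax-optimal move (the first one found when several tie)."""
    return max(legal_moves(board),
               key=lambda i: -minimax(board[:i] + player + board[i + 1:],
                                      other(player)))


def opponent_move(board, player, random_rate, rng):
    """With probability `random_rate` play uniformly at random, else optimally."""
    if rng.random() < random_rate:
        return rng.choice(legal_moves(board))
    return best_move(board, player)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumption: one randomness level is drawn per game from the step-4 range.
    rate = rng.uniform(0.20, 0.70)
    print(opponent_move("XX OO    ", "X", rate, rng))
    # With random_rate=0.0 the opponent always takes the immediate win at cell 2.
    print(opponent_move("XX OO    ", "X", 0.0, rng))  # 2
```

Lowering `random_rate` toward 0 gives the stronger opponents described in step 5. Drawing the rate once per game is only one option; it could equally be fixed for a whole training stage.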
---
Play against the model: https://huggingface.co/spaces/anakin87/LFM2-2.6B-mr-tictactoe
Model: https://huggingface.co/anakin87/LFM2-2.6B-mr-tictactoe
Walkthrough/course: https://github.com/anakin87/llm-rl-environments-lil-course
Dataset and checkpoints: https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe