llamacle_drgrpo_euan_v1_step30

DrGRPO RL post-training (30 cycles) on top of euan-loracles/llama70b-loracle-25k. RL pool: 2,500 held-out FineWeb-edu LoRAs. Each cycle: 32 prompts × K=16 rollouts (sub-batched as 4 × K=4), lr=7e-6, eps=0.2/0.28, NF4-DDP across 4×B200.
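As a rough illustration of the recipe above, the sketch below shows a Dr. GRPO-style advantage (reward minus group mean, with no division by the group standard deviation) and a PPO-style clipped surrogate. It is not the training code used for this checkpoint. In particular, reading eps=0.2/0.28 as asymmetric lower/upper clip bounds is an assumption, and the function names are hypothetical.

```python
import math

# Batch geometry as stated in the card.
PROMPTS_PER_CYCLE = 32
K = 16                       # rollouts per prompt
SUB_BATCHES, SUB_K = 4, 4    # K=16 processed as 4 sub-batches of 4
assert SUB_BATCHES * SUB_K == K
ROLLOUTS_PER_CYCLE = PROMPTS_PER_CYCLE * K


def drgrpo_advantages(rewards):
    """Group-relative advantages: reward minus the group mean.

    Unlike standard GRPO, there is no division by the group std.
    """
    group_mean = sum(rewards) / len(rewards)
    return [r - group_mean for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=0.2, eps_high=0.28):
    """Clipped surrogate for a single token.

    ASSUMPTION: eps_low/eps_high correspond to the card's eps=0.2/0.28.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

Under these assumptions each cycle scores 32 × 16 = 512 rollouts. Whether advantages are computed over the full group of 16 or per sub-batch of 4 is not stated in the card.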

AB Llama-70B (3 seeds, mean ± std)

| step | any-match | rollout-mean |
|---|---|---|
| 0 (euan baseline) | 66.1% ± 2.7% | 42.4% ± 1.0% |
| 10 | 70.1% ± 1.0% | 47.6% ± 3.0% |
| 20 | 70.1% ± 4.3% | 48.3% ± 2.5% |
| 30 (this ckpt) | 72.4% ± 0.0% | 50.9% ± 0.7% |
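The card does not define the two metrics, so the following is only one plausible reading: any-match as the fraction of prompts where at least one rollout is judged a match, and rollout-mean as the per-rollout match rate averaged over prompts. The function names, the boolean match matrix, and the use of the sample standard deviation across seeds are all assumptions made for illustration.

```python
from statistics import mean, stdev


def any_match(matches):
    """Fraction of prompts with at least one matching rollout.

    `matches` is a list of per-prompt lists of booleans (hypothetical layout).
    """
    return mean(1.0 if any(row) else 0.0 for row in matches)


def rollout_mean(matches):
    """Per-prompt match rate over rollouts, averaged across prompts."""
    return mean(mean(1.0 if m else 0.0 for m in row) for row in matches)


def summarize(per_seed_scores):
    """Mean and sample std across seeds (ASSUMPTION: sample, not population)."""
    return mean(per_seed_scores), stdev(per_seed_scores)
```

For example, with two prompts and two rollouts each, `[[True, False], [False, False]]` gives an any-match of 0.5 and a rollout-mean of 0.25 under this reading.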