llamacle_drgrpo_euan_v1_step30
DrGRPO RL post-training (30 cycles) on top of euan-loracles/llama70b-loracle-25k. RL pool: 2,500 held-out FineWeb-edu LoRAs. Each cycle uses 32 prompts × K=16 rollouts (sub-batched as 4 × K=4), lr=7e-6, eps=0.2/0.28, NF4-DDP across 4×B200.
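The update rule implied by the settings above can be sketched roughly as follows. This is a minimal illustration, not the training code for this checkpoint: it assumes the usual Dr. GRPO form (group-mean baseline with no division by the group standard deviation) and assumes that `eps=0.2/0.28` denotes a lower/upper clip range on the policy ratio. The names `drgrpo_advantages` and `clipped_term` are hypothetical.

```python
from statistics import fmean

# ASSUMPTION: "eps=0.2/0.28" is read here as (lower, upper) clip widths.
EPS_LOW, EPS_HIGH = 0.2, 0.28


def drgrpo_advantages(rewards):
    """Group-relative advantages: reward minus the group mean.

    Unlike standard GRPO, no division by the group std is applied.
    """
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def clipped_term(ratio, advantage, eps_low=EPS_LOW, eps_high=EPS_HIGH):
    """PPO-style clipped surrogate for one token, with asymmetric clipping."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy rewards for one sub-batch of K=4 rollouts (illustrative values only).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = drgrpo_advantages(rewards)

# Rollout budget stated in the card: 32 prompts x K=16 per cycle, 30 cycles.
rollouts_per_cycle = 32 * 16
total_rollouts = rollouts_per_cycle * 30
```

In a full implementation the per-token terms would be summed and divided by a fixed constant rather than by each sequence's length, which is the other change Dr. GRPO makes relative to GRPO; that step is omitted here for brevity.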
AB Llama-70B (3 seeds, mean ± std)
| step | any-match | rollout-mean |
|---|---|---|
| 0 (euan baseline) | 66.1% ± 2.7% | 42.4% ± 1.0% |
| 10 | 70.1% ± 1.0% | 47.6% ± 3.0% |
| 20 | 70.1% ± 4.3% | 48.3% ± 2.5% |
| 30 (this ckpt) | 72.4% ± 0.0% | 50.9% ± 0.7% |
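The card does not define the two metric columns. One plausible reading, sketched below as an assumption, is that any-match is the fraction of prompts where at least one rollout is judged correct, and rollout-mean is the per-prompt fraction of correct rollouts averaged over prompts. The helper names and the toy data are hypothetical; only the step-0 and step-30 means are taken from the table.

```python
from statistics import fmean


def any_match(results):
    """Fraction of prompts with at least one correct rollout (assumed definition)."""
    return fmean(1.0 if any(rollouts) else 0.0 for rollouts in results)


def rollout_mean(results):
    """Mean over prompts of the fraction of correct rollouts (assumed definition)."""
    return fmean(
        fmean(1.0 if ok else 0.0 for ok in rollouts) for rollouts in results
    )


# Toy judgments: 3 prompts x 4 rollouts each (illustrative, not real eval data).
toy = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, True],
]
toy_any = any_match(toy)
toy_mean = rollout_mean(toy)

# Gains from step 0 to step 30, in percentage points, using the table means.
any_match_gain = round(72.4 - 66.1, 1)
rollout_mean_gain = round(50.9 - 42.4, 1)
```

Under this reading, any-match is always at least as large as rollout-mean, which is consistent with every row of the table.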
Model tree for ceselder/llamacle_drgrpo_euan_v1_step30
Base model: meta-llama/Llama-3.1-70B
Finetuned: meta-llama/Llama-3.3-70B-Instruct