Gradient Overflow issue while using deepspeed

jaydeepb · August 28, 2025, 12:39am

Hi. I’m trying to fine-tune mistralai/Mistral-Small-24B-Base-2501 using DeepSpeed and I consistently get the overflow error. When I use bf16 or fp32, I don’t see the overflow issue, but the training loss is NaN. When I switch to fp16, the training loss is correct but it throws the overflow error. How can I fix this? The same setup works fine with smaller models. I’m using lr=1e-7.

My df_config.json:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false
}

Using deepspeed 0.17.2 and transformers 4.42.4.
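
For context on the fp16 block above: "loss_scale": 0 selects dynamic loss scaling, and "initial_scale_power": 32 starts the scale at 2^32. With dynamic scaling, a step whose scaled gradients overflow fp16 is skipped and the scale is reduced, so a run of overflow messages at the start of training is expected until the scale settles. The snippet below is only a toy sketch of that behaviour, not DeepSpeed's implementation: simulate_loss_scale and grad_magnitude are hypothetical names, and it ignores hysteresis and the loss_scale_window growth rule.

```python
FP16_MAX = 65504.0  # largest finite float16 value


def simulate_loss_scale(grad_magnitude, initial_scale_power=32,
                        min_loss_scale=1.0, max_steps=64):
    """Toy model of dynamic loss scaling (hypothetical helper).

    Halves the scale on every overflow until the scaled gradient
    fits in fp16. Returns (final_scale, skipped_steps).
    """
    scale = 2.0 ** initial_scale_power
    skipped = 0
    while (scale * grad_magnitude > FP16_MAX
           and scale > min_loss_scale
           and skipped < max_steps):
        scale /= 2.0   # overflow detected: reduce the scale
        skipped += 1   # the optimizer step is skipped
    return scale, skipped


if __name__ == "__main__":
    # Assumed largest gradient magnitude of 1.0, purely for illustration.
    print(simulate_loss_scale(1.0))
    print(simulate_loss_scale(1.0, initial_scale_power=16))
```

In this simplified model, a lower initial_scale_power means fewer skipped steps at the start. If overflow messages continue after the scale has reached min_loss_scale, the gradients themselves are probably non-finite, and that is a different problem from the warm-up behaviour sketched here.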

John6666 · August 28, 2025, 1:04am

If the GPU supports bfloat16, it’s probably better to use bfloat16. Regarding NaN issues, SDPA seems to be the culprit in many cases. Try attn_implementation="eager".
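
A minimal sketch of that suggestion, assuming the model is loaded through AutoModelForCausalLM; the load_model wrapper and the LOAD_KWARGS name are just illustrative, and only the attn_implementation value and the bfloat16 dtype come from this thread:

```python
MODEL_ID = "mistralai/Mistral-Small-24B-Base-2501"

# "eager" selects the plain PyTorch attention path instead of SDPA.
LOAD_KWARGS = {"attn_implementation": "eager"}


def load_model(model_id=MODEL_ID):
    """Load the model in bfloat16 with eager attention (illustrative wrapper)."""
    # Imported lazily so the constants above can be inspected
    # on a machine without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # requires a GPU with bfloat16 support
        **LOAD_KWARGS,
    )
```

Eager attention is typically slower and uses more memory than SDPA, so it is probably best treated as a diagnostic: if the NaN loss disappears with it, the attention kernel is the likely cause.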

jaydeepb · August 28, 2025, 4:50am

@John6666 loading the model in bfloat16 and then using bf16=true in deepspeed seems to solve this issue for now!
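
For anyone landing here later, a sketch of what the bf16 variant of the config from the first post might look like. This is a reconstruction, not the exact file that was used: it assumes the fp16 block is simply replaced by a bf16 block and everything else is left unchanged, and make_bf16_config is a hypothetical helper name.

```python
import json


def make_bf16_config():
    """Return the config from the first post with fp16 swapped for bf16."""
    return {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 8,
        "zero_optimization": {"stage": 2},
        "zero_allow_untested_optimizer": True,
        # bf16 has the same exponent range as fp32, so no loss scaling
        # settings (loss_scale, initial_scale_power, ...) are needed.
        "bf16": {"enabled": True},
        "gradient_clipping": 1.0,
        "wall_clock_breakdown": False,
    }


config = make_bf16_config()
print(json.dumps(config, indent=4))
```

The printed JSON can be saved and passed to DeepSpeed in place of the fp16 config. The model itself should also be loaded in bfloat16 so that the weights and the DeepSpeed setting agree.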
