Gradient overflow issue while using DeepSpeed

Hi. I’m trying to fine-tune mistralai/Mistral-Small-24B-Base-2501 using DeepSpeed and consistently get a gradient overflow error. With bf16 or fp32 I don’t see the overflow, but the training loss is NaN. When I switch to fp16, the training loss is correct but the overflow error appears. How can I fix this? The same setup works fine with smaller models. I’m using lr=1e-7.

My df_config.json:

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {
        "stage": 2
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 32,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false
}

Using deepspeed 0.17.2 and transformers 4.42.4.
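
For context on the fp16 block above: "loss_scale": 0 selects dynamic loss scaling, and "initial_scale_power": 32 starts the scale at 2^32. The largest finite float16 value is 65504, so some overflow messages in the first steps are expected while the scale is lowered. The sketch below is a deliberately simplified model, not DeepSpeed's actual scaler: it assumes the scale is halved on every overflow and ignores hysteresis. The function name and the example gradient magnitude are made up for illustration.

```python
FP16_MAX = 65504.0  # largest finite float16 value


def skipped_steps_until_fit(
    max_grad: float,
    initial_scale_power: int = 32,
    min_loss_scale: float = 1.0,
) -> int:
    """Count how many halvings are needed before max_grad * scale fits in float16.

    Simplified model (assumption): the scale is halved on every overflow,
    and hysteresis is ignored.
    """
    scale = 2.0 ** initial_scale_power
    skipped = 0
    while max_grad * scale > FP16_MAX and scale > min_loss_scale:
        scale = max(scale / 2.0, min_loss_scale)
        skipped += 1
    return skipped


if __name__ == "__main__":
    # Hypothetical largest unscaled gradient magnitude of 1.0.
    print(skipped_steps_until_fit(1.0, initial_scale_power=32))
    print(skipped_steps_until_fit(1.0, initial_scale_power=16))
```

Under this model, a lower "initial_scale_power" only shortens the warm-up. If overflow keeps being reported once the scale has reached "min_loss_scale", the gradients themselves do not fit in fp16.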

If the GPU supports bfloat16, it’s probably better to use bfloat16. Regarding the NaN loss, SDPA attention seems to be the culprit in many cases. Try passing attn_implementation="eager" when loading the model.
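
A minimal sketch of that suggestion, assuming the model is loaded through transformers' AutoModelForCausalLM. The helper name load_model_eager is hypothetical, and the rest of the training script is not shown.

```python
MODEL_ID = "mistralai/Mistral-Small-24B-Base-2501"

# Keyword arguments for from_pretrained; "eager" bypasses the SDPA attention path.
LOAD_KWARGS = {"attn_implementation": "eager"}


def load_model_eager(model_id: str = MODEL_ID):
    """Hypothetical helper: load the model in bfloat16 with eager attention."""
    # Imported lazily so this module can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumes the GPU supports bfloat16
        **LOAD_KWARGS,
    )
```

Eager attention is usually slower and uses more memory than SDPA, so it may be most useful as a diagnostic: if the NaN loss disappears with it, the attention kernel is a likely suspect.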

@John6666 Loading the model in bfloat16 and then setting bf16 to true in the DeepSpeed config seems to solve this issue for now!
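
A rough sketch of what that config change could look like, reusing the other settings from the config posted in the question and swapping the fp16 block for a bf16 one. The build_bf16_config helper is only for illustration. Loss-scaling options are dropped because they belong to the fp16 block.

```python
import json


def build_bf16_config() -> dict:
    """Illustrative DeepSpeed config: the original settings with fp16 replaced by bf16."""
    return {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 8,
        "zero_optimization": {"stage": 2},
        "zero_allow_untested_optimizer": True,
        # bf16 has the same exponent range as fp32, so no loss scaling is configured.
        "bf16": {"enabled": True},
        "gradient_clipping": 1.0,
        "wall_clock_breakdown": False,
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved as the DeepSpeed config file.
    print(json.dumps(build_bf16_config(), indent=4))
```

The model itself would still need to be loaded in bfloat16 (for example with torch_dtype=torch.bfloat16 in from_pretrained) so that the weights match the config.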