Hello!
I trained a LoRA adapter for a 13B model in a quantized setup: on top of a 4-bit quantized base model, I trained a 16-bit LoRA adapter.
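For context, the training setup looked roughly like this (model name, target modules, and LoRA hyperparameters below are placeholders, not my exact values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit (NF4) base with fp16 compute, as in a standard QLoRA setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForCausalLM.from_pretrained(
    "my-13b-base",                        # placeholder for the actual checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# 16-bit LoRA adapter on top of the 4-bit base
lora_config = LoraConfig(
    r=16,                                 # illustrative values
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# ... training happens here ...
```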
Now, after training, I would like to put the model back together, i.e., merge the LoRA weights into the base.
Unfortunately, if I call merge_and_unload right away, the model's outputs become complete garbage. I suppose this is because the LoRA weights get converted to 4 bits when they are added to the base weights.
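Concretely, this is roughly what I do (the adapter path is a placeholder):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Reload the 4-bit base and attach the trained adapter
base = AutoModelForCausalLM.from_pretrained(
    "my-13b-base",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/my-lora-adapter")

merged = model.merge_and_unload()   # after this, generations are garbage
```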
Therefore, I thought it would be smarter to first dequantize the model and only then merge, but when I call .dequantize() on my 4-bit base + 16-bit LoRA model, I quickly run out of memory. Dequantizing on the CPU is not an option either, as that is only implemented for 8-bit quantization.
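The dequantize-then-merge attempt, roughly (continuing from the model in the previous snippet):

```python
# Attempt: dequantize the 4-bit base first, then merge in higher precision
model.dequantize()                  # -> CUDA OOM on my GPU
merged = model.merge_and_unload()

# Moving the model to CPU before dequantizing doesn't work either,
# since dequantization there only seems to be implemented for 8-bit weights.
```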
Is there any way out of this stalemate? To sum up:
- I can’t merge_and_unload → garbage output
- I can’t dequantize and merge_and_unload → OOM on dequantize
- I can’t dequantize on CPU → not supported for 4-bit
I have also tried loading the base model in fp16 and applying my trained LoRA weights to it, but done this way the LoRA doesn’t seem to affect the output at all; it looks like the scale of the adapter weights is not compatible when the base model is loaded in fp16.
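Roughly what I did for that attempt (again, names and paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base in plain fp16 (no quantization) and attach the trained adapter
base_fp16 = AutoModelForCausalLM.from_pretrained(
    "my-13b-base",                       # placeholder
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_fp16, "path/to/my-lora-adapter")
merged = model.merge_and_unload()

# Generations from `merged` look identical to the plain fp16 base,
# as if the adapter had no effect at all.
```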
Is there any way to still salvage my model?