How do you Quantize the model?

#3
by fahadh4ilyas - opened

I'm still new to using LoRA and I thought that adapter models couldn't be merged. But then I saw your script for merging an adapter into the base model.

What I don't understand is the quantization. Do you quantize the base model first and then merge it with the adapter, or do you just quantize the merged model? Could you please explain how, and kindly share the way you quantize? Thank you....

The latter. I merge the LoRA onto the unquantised base model to produce a merged model, then I quantise that merged model.
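
Roughly, that merge step looks like this with PEFT's merge_and_unload (a minimal sketch, not necessarily my exact script; the paths are placeholders and it assumes the adapter was trained with PEFT):

```python
# Minimal sketch: merge a LoRA adapter into the unquantised base model with PEFT.
# The paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "path/to/base-model"     # unquantised fp16 base model
adapter_path = "path/to/lora-adapter"      # LoRA adapter trained with PEFT
merged_path = "path/to/merged-model"       # where the merged model is written

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, adapter_path)

# Fold the LoRA weights into the base weights and drop the adapter wrappers
merged_model = model.merge_and_unload()

merged_model.save_pretrained(merged_path)
AutoTokenizer.from_pretrained(base_model_path).save_pretrained(merged_path)
```

merge_and_unload returns a plain transformers model, and that merged model is what then goes into the quantiser.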

I quantised these with the old CUDA branch of GPTQ-for-LLaMa, to maximise compatibility. The command to do that is simple:

python llama.py source_model wikitext --wbits 4 --true-sequential --groupsize 128 --act-order  --save_safetensors outputfile.safetensors

However, for most users I recommend quantising with AutoGPTQ instead. There are example scripts in the AutoGPTQ repo on GitHub, or here is a script I made that provides command-line parameters to easily choose the quantisation configuration: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682
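
As a rough illustration of the AutoGPTQ route (a sketch following AutoGPTQ's quick-start pattern, not the gist above; the paths and the single calibration sample are placeholders), quantising the merged model with the same wbits 4 / groupsize 128 / act-order settings looks something like this:

```python
# Sketch: 4-bit GPTQ quantisation of the merged model with AutoGPTQ.
# Paths are placeholders; a real run would use a few hundred calibration samples
# (e.g. from wikitext) instead of the single example shown here.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

merged_path = "path/to/merged-model"
quantised_path = "path/to/gptq-model"

tokenizer = AutoTokenizer.from_pretrained(merged_path, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,           # same as --wbits 4
    group_size=128,   # same as --groupsize 128
    desc_act=True,    # same as --act-order
)

model = AutoGPTQForCausalLM.from_pretrained(merged_path, quantize_config)

# Calibration examples: tokenised text used to measure activations during quantisation
examples = [
    tokenizer("Calibration text goes here; use real dataset samples in practice.",
              return_tensors="pt")
]

model.quantize(examples)

model.save_quantized(quantised_path, use_safetensors=True)
tokenizer.save_pretrained(quantised_path)
```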

So we can quantize the merged model? Okay, thank you for the information and the script...

fahadh4ilyas changed discussion status to closed
