Navigating the Maze of Model Quantization: Techniques, Innovations, and the Power of Open Source
It's truly exciting to see the strides made in quantization techniques, especially in making model training and inference faster and more memory-efficient. But this progress has also introduced a new challenge: the sheer number of quantization methods available. Among the most notable are GGML (with at least 3 incompatible versions), GPTQ (including GPTQ-for-LLaMa and AutoGPTQ, most popular in 4-bit but also available in 8-bit), and bitsandbytes, in both 8-bit and 4-bit formats. The bitsandbytes formats are primarily supported by Transformers, but already-quantized models in these formats are not as widespread as one might expect.
Fine-tuning adds another layer of complexity: there are at least 30 fine-tuned variants, some differing by only a single dataset. Complexity aside, this variety is a testament to the impressive advancements made by this community. It also signals a shift in power from large corporations to open source communities. Special mention goes to u/The-Bloke for their significant contributions to the field.
A case in point is QLoRA, a fine-tuning approach that allows a 65B parameter model to be fine-tuned on a single 48GB GPU. According to its authors, their best model family, Guanaco, reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark. This is made possible through innovations like 4-bit NormalFloat (NF4, a 4-bit data type designed for normally distributed weights), Double Quantization (quantizing the quantization constants themselves), and Paged Optimizers (to manage memory spikes).
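To see why 4-bit storage is what makes the 65B-on-48GB claim plausible, here is a rough back-of-the-envelope sketch. It counts weight memory only (no activations, adapters, or optimizer state), and the block sizes and constant widths used for the Double Quantization overhead are my recollection of the QLoRA paper's setup, so treat them as assumptions rather than a spec.

```python
# Rough weight-memory arithmetic for a 65B-parameter model.
# Counts the weights only; activations, LoRA adapters and optimizer
# state are NOT included, so real usage is higher.
PARAMS = 65e9


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(16)      # 130.0 GB: far beyond a single 48GB GPU
four_bit_gb = weight_gb(4)   # 32.5 GB: fits, with room left for the rest

# Quantization constants add overhead on top of the 4 bits per weight.
# Assumed setup: one 32-bit constant per block of 64 weights.
BLOCK = 64
plain_overhead_bits = 32 / BLOCK

# Double Quantization (assumed setup): the constants are stored in 8 bits,
# plus one 32-bit second-level constant per 256 first-level constants.
SECOND_LEVEL_BLOCK = 256
dq_overhead_bits = 8 / BLOCK + 32 / (BLOCK * SECOND_LEVEL_BLOCK)

saved_bits_per_param = plain_overhead_bits - dq_overhead_bits
saved_gb = weight_gb(saved_bits_per_param)

print(f"fp16 weights:            {fp16_gb:.1f} GB")
print(f"4-bit weights:           {four_bit_gb:.1f} GB")
print(f"overhead without DQ:     {plain_overhead_bits:.3f} bits/param")
print(f"overhead with DQ:        {dq_overhead_bits:.3f} bits/param")
print(f"saved by DQ on 65B:      {saved_gb:.2f} GB")
```

Under these assumptions Double Quantization saves roughly 0.37 bits per parameter, which works out to about 3 GB on a 65B model. That is a small share of the total, but it matters when the budget is a single 48GB card.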
The QLoRA authors also claim that their new models, such as Guanaco 7B and Guanaco 65B, have even surpassed the performance of Turbo. These models use the new bitsandbytes 4-bit quantization and can be fine-tuned efficiently, even on limited hardware.
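For anyone wondering what "4-bit quantization" means mechanically, below is a toy sketch of blockwise absmax quantization in plain Python. To be clear, this is not the bitsandbytes implementation and not NF4: it uses evenly spaced integer levels in [-7, 7], whereas NF4 uses levels tuned for normally distributed weights. The function names and the block size of 64 are illustrative choices of mine.

```python
import math
from typing import List, Tuple

LEVELS = 7  # signed 4-bit codes in [-7, 7] (uniform levels, NOT NF4)


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Quantize one block: scale by its absolute maximum, round to integer codes."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0] * len(block), 0.0
    scale = absmax / LEVELS
    # Clamp to guard against floating-point results just outside the range.
    codes = [max(-LEVELS, min(LEVELS, round(x / scale))) for x in block]
    return codes, scale


def quantize(weights: List[float], block_size: int = 64) -> List[Tuple[List[int], float]]:
    """Split the weights into blocks; each block keeps its own scale constant."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks: List[Tuple[List[int], float]]) -> List[float]:
    """Rebuild approximate floats from the codes and per-block scales."""
    restored: List[float] = []
    for codes, scale in blocks:
        restored.extend(code * scale for code in codes)
    return restored


# Made-up "weights" purely for demonstration.
weights = [math.sin(i) for i in range(130)]
blocks = quantize(weights, block_size=64)
restored = dequantize(blocks)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(blocks)} blocks, max round-trip error {max_error:.4f}")
```

The per-block scale is the "quantization constant" that Double Quantization compresses further. Keeping blocks small limits how much a single outlier weight degrades the precision of its neighbours, at the cost of storing more constants.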
As the field continues to evolve and the open source community continues to impress, we anticipate further advancements and improvements in these methods. The future is undoubtedly bright for model training and quantization techniques.