ExLlamaV2: The Fastest Library to Run LLMs
Quantizing Large Language Models (LLMs) is the most popular approach for reducing the size of these models and speeding up inference.
Among these techniques, GPTQ delivers amazing performance on GPUs. Compared to unquantized models, this method uses almost 3 times less VRAM while providing a similar level of accuracy and faster generation.
It has become so popular that it was recently integrated directly into the transformers library.
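As a quick illustration of that integration, the snippet below is a minimal sketch of GPTQ quantization through transformers' `GPTQConfig` (the model name and calibration dataset are placeholders, and the optimum/auto-gptq backends are assumed to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Placeholder model; any causal LM on the Hub follows the same pattern
model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization, calibrated on the built-in "c4" dataset
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
```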
**ExLlamaV2** is a library designed to squeeze even more performance out of GPTQ. Thanks to new kernels, it’s optimized for (blazingly) fast inference. It also introduces a new quantization format, EXL2, which brings a lot of flexibility to how weights are stored.
This new format is based on the same optimization method as GPTQ and supports a range of quantization levels: 2, 3, 4, 5, 6, and 8 bits. The flexibility of EXL2 allows for mixing quantization levels within a model, achieving any average bitrate between 2 and 8 bits per weight.
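To see how an average bitrate falls out of mixing precisions, the short sketch below computes the weighted average bits per weight for a made-up assignment (the layer sizes and bit widths are illustrative only, not EXL2's actual allocation strategy):

```python
# Hypothetical mixed-precision assignment: (number of weights, bits per weight).
layers = [
    (4096 * 4096, 4),   # e.g. an attention projection kept at 4 bits
    (4096 * 11008, 5),  # e.g. a more sensitive MLP projection at 5 bits
    (4096 * 4096, 2),   # e.g. a less sensitive projection pushed down to 2 bits
]

total_bits = sum(n * bits for n, bits in layers)
total_weights = sum(n for n, _ in layers)

print(f"Average bitrate: {total_bits / total_weights:.2f} bpw")
# -> 4.15 bpw for this mix; any average between 2 and 8 is reachable by changing it
```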
Furthermore, EXL2 enables the application of multiple quantization levels to each linear layer, creating a form of sparse quantization where more important weights (columns) are quantized with more bits. The remapping trick that enables ExLlama to work efficiently with act-order models also facilitates this mixing of formats with minimal impact on performance.
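From the user's perspective, this bit allocation and remapping is handled at load time and inside the kernels, so generating from an EXL2 model looks like generating from any other ExLlamaV2 model. The snippet below is a rough sketch assuming the ExLlamaV2 Python API at the time of writing (the `quant/` directory is a placeholder, and class names may differ across versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at a directory containing an EXL2-quantized model
config = ExLlamaV2Config()
config.model_dir = "quant/"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # split the weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.85
settings.top_p = 0.8

print(generator.generate_simple("I have a dream", settings, num_tokens=200))
```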
As usual, the code is available on GitHub and Google Colab.
Tags: GPTQ ExLlamaV2 Quantization