Unlocking GPU Inferencing Power with GGUF, GPTQ/AWQ, and EXL2
If you are into the fascinating world of GPU inference and exploring the capabilities of different models, you might have encountered the tweet by turboderp_ showcasing some RTX 3090 inference on EXL2. The discussion that followed revealed intriguing insights into GGUF, GPTQ/AWQ, and EXL2, an efficient GPU inferencing powerhouse.
GGUF, described as the container format for LLMs (Large Language Models), resembles the .AVI or .MKV of the inference world. Inside this container it supports a variety of quants: traditional ones (4_0, 4_1, 6_0, 8_0), k-quants (2-6 bit in different sizes), and even a new 2-bit format based on QuIP#.
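To get a feel for GGUF as a container, you can peek inside a file and see which quant each tensor actually uses. The sketch below assumes the gguf Python package that ships with llama.cpp and its GGUFReader API; the file name is just a placeholder:

```python
# Sketch: inspecting a GGUF file with the gguf package (pip install gguf).
# The file name is a placeholder; attribute names assume the GGUFReader API.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("llama-2-7b.Q4_K_M.gguf")

# Metadata key/value fields stored in the container header
for name in list(reader.fields)[:10]:
    print("field:", name)

# Count how many tensors use each quantization type (e.g. Q4_K, Q6_K, F32)
types = Counter(t.tensor_type.name for t in reader.tensors)
print(types)
```

A single file typically mixes several tensor types, which is part of why GGUF feels more like a container than a single quantization scheme.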
The discussion also unveils the complexity of GGUF: there are at least two ways to create k-quants, the traditional quantize tool and a path that uses AWQ activation data to guide the quantization. The format is actively developed, making it a dynamic, evolving container for a variety of quantization approaches.
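For the traditional path, k-quants are produced by llama.cpp's quantize tool. A rough sketch of driving it from Python follows; the binary and file names are placeholders (newer llama.cpp builds name the tool llama-quantize), and this shows only the standard route, not the AWQ-guided one:

```python
# Sketch: producing a k-quant GGUF with llama.cpp's quantize tool from Python.
# Paths and the binary name are placeholders for whatever your build produces.
import subprocess

subprocess.run(
    ["./quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```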
However, for pure GPU inferencing, GGUF may not be the optimal choice. As described in the thread, it grew out of CPU inference hacks. If you are aiming purely for efficient GPU inferencing, two names stand out: GPTQ/AWQ and EXL2.
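For reference, GGUF models can still be pushed onto the GPU by offloading layers. Here is a minimal sketch using llama-cpp-python; the model path is a placeholder and a CUDA (or Metal) build of the library is assumed:

```python
# Sketch: running a GGUF model with GPU offload via llama-cpp-python.
# The model path is a placeholder; n_gpu_layers=-1 offloads all layers.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=4096,        # context window
)

out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
```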
GPTQ/AWQ is tailored for GPU inferencing, with claims of roughly 5x faster performance than GGUF when running purely on GPU. That is a significant speed boost for anyone who relies heavily on GPU power for their models.
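As a quick illustration, a pre-quantized GPTQ checkpoint can be loaded straight onto the GPU through Hugging Face transformers with the auto-gptq backend installed. The repo id below is only an example; swap in whichever GPTQ checkpoint you use:

```python
# Sketch: loading a pre-quantized GPTQ model on GPU with transformers + auto-gptq.
# The repo id is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPU inference is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```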
Now, let's talk about the real game-changer: EXL2. This format, part of the ExLlamaV2 project, lets you target a fractional bits-per-weight budget so the quant fits your GPU's VRAM almost exactly, unleashing the full potential of your hardware. Whether you are on Windows with minimal passive VRAM consumption or pushing the limits with a 70B model at 30 t/s, EXL2 looks like the go-to choice for efficient GPU inferencing.
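To give a flavour of what that looks like in practice, here is a rough sketch of loading and running an EXL2 quant with the exllamav2 Python library. The class names follow the project's example scripts, and the model directory is a placeholder:

```python
# Sketch: generating with an EXL2-quantized model via exllamav2.
# Class names follow the project's example scripts; the model dir is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/llama2-70b-exl2-4.65bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)          # split weights across available GPU VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("GPU inference with EXL2 is", settings, 128))
```

The load_autosplit call fills the available VRAM across your GPUs, which is exactly the "fit the quant to your card" behaviour discussed above.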
In conclusion, the world of GPU inferencing is vast, with each format and platform having its unique strengths. GGUF might be versatile, but for pure GPU power, GPTQ/AWQ and EXL2 take the lead. The choice ultimately depends on your specific needs and the scale of your inferencing tasks.