Accelerated Machine Learning on Consumer GPUs with MLC.ai
MLC.ai is a machine learning compiler that lets real-world language models run smoothly on the consumer GPUs in phones and laptops, with no server support needed. It can target several GPU backends, including Vulkan, Metal, and CUDA, making it possible to run large language models such as Vicuna at impressive speed and quality.
The performance of MLC.ai is striking. The demo mlc_chat_cli runs more than three times faster than a 7B q4_2 quantized Vicuna running on llama.cpp on an M1 Max MacBook Pro. The project was originally built for WebGPU in only a few hundred lines of code, and extending it to other GPU backends took just tens of additional lines.
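For readers who want to try the demo themselves, a minimal launcher might look like the sketch below. It only assumes that the mlc_chat_cli binary mentioned above is installed and on the PATH; no flags or model names are shown because those vary between builds.

```python
import shutil
import subprocess
import sys


def run_mlc_chat_demo() -> None:
    """Launch the interactive mlc_chat_cli demo if it is installed.

    Minimal sketch: assumes the mlc_chat_cli binary from the MLC.ai
    install instructions is on the PATH and that the model weights
    already live under the 'dist' folder in the working directory.
    """
    if shutil.which("mlc_chat_cli") is None:
        sys.exit("mlc_chat_cli not found; install the MLC.ai chat package first")
    # The CLI is interactive, so hand stdin/stdout over to the user.
    subprocess.run(["mlc_chat_cli"], check=True)


if __name__ == "__main__":
    run_mlc_chat_demo()
```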
One of the developers commented that this is the first demo in which a machine learning compiler helps deploy a real-world LLM (Vicuna) to consumer-class GPUs on phones and laptops. The possibilities are broad, and pairing it with a capable frontend such as SillyTavern, which can even run on a smartphone, would be very interesting.
Because MLC.ai targets Vulkan alongside Metal and CUDA, AMD cards perform very well, making it practical to run LLMs on those GPUs through the Vulkan API. This is good news for AMD users, who can now benefit from the same speed as everyone else.
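As an illustration of how backend selection might look from Python, the sketch below uses a ChatModule interface with an explicit Vulkan device. The package name, class, and model identifier are assumptions drawn from later MLC documentation rather than from the demo described here, so treat it as a rough outline only.

```python
# Rough sketch: assumes the mlc_chat Python package is installed and that a
# compiled model already exists under ./dist. The model name below is a
# placeholder, not one shipped with the demo described in this article.
from mlc_chat import ChatModule


def chat_on_vulkan(prompt: str) -> str:
    # "vulkan" selects the backend relevant to AMD cards; "metal", "cuda",
    # or "auto" would pick the other backends mentioned above.
    cm = ChatModule(model="vicuna-v1-7b-q4f16_0", device="vulkan")
    return cm.generate(prompt=prompt)


if __name__ == "__main__":
    print(chat_on_vulkan("Briefly explain what a machine learning compiler does."))
```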
To try out MLC.ai, users can swap in and test different language models. Installing another model, such as Pygmalion, is as simple as copying its converted weight files into the 'dist' folder, as sketched below.
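A minimal sketch of that install step follows. The source folder name is a placeholder for wherever the converted weights were produced, and the exact layout MLC.ai expects under 'dist' may differ between releases.

```python
import shutil
from pathlib import Path


def install_model(weights_dir: str, dist_dir: str = "dist") -> Path:
    """Copy a converted model folder (e.g. Pygmalion weights) into 'dist'.

    Sketch only: 'weights_dir' is a placeholder for the directory holding
    the converted model files; the layout MLC.ai expects under 'dist' may
    differ between releases.
    """
    src = Path(weights_dir)
    dest = Path(dist_dir) / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


if __name__ == "__main__":
    print(install_model("pygmalion-6b-q4"))  # hypothetical folder name
```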