New Advances in AI Model Handling: GPU and CPU Interplay
With recent breakthroughs, it appears that AI models can now be split between the CPU and GPU, with some layers offloaded to VRAM while the rest run from system RAM, potentially making expensive, high-VRAM GPUs less of a necessity. Users have reported impressive results with models like Wizard-Vicuna-13B-Uncensored.ggml.q8_0.bin using this technique, achieving fast generation with minimal VRAM use. This could be a game-changer for those with limited VRAM but ample RAM, such as owners of an RTX 3070 Ti mobile GPU paired with 64 GB of system memory.
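For readers who want to try this, the description matches the layer-offloading option in llama.cpp and its Python bindings. Below is a minimal sketch assuming llama-cpp-python installed with cuBLAS support; the model path and the number of offloaded layers (n_gpu_layers) are placeholders you would tune to your own VRAM budget.

```python
# Minimal sketch: partial CPU/GPU offload with llama-cpp-python.
# Assumes the package was built with GPU (cuBLAS) support, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="Wizard-Vicuna-13B-Uncensored.ggml.q8_0.bin",  # illustrative path
    n_gpu_layers=20,  # layers offloaded to VRAM; the rest stay in system RAM
    n_ctx=2048,       # context window
)

output = llm("Q: What is partial GPU offloading? A:", max_tokens=128)
print(output["choices"][0]["text"])
```

Raising n_gpu_layers moves more of the model into VRAM and generally speeds up generation until memory runs out, so the usual approach is to increase it until you approach your card's limit.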
There is also an ongoing discussion about splitting the load across multiple GPUs, though it remains unclear whether this would require builds compiled against OpenBLAS or cuBLAS. You can keep track of the conversation and the latest models in GGML v2 format here.
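Multi-GPU splitting was still an open question in this discussion, but as an illustration of what such an interface can look like, later llama-cpp-python releases expose a tensor_split parameter that divides the offloaded layers across devices. The snippet below is a hedged sketch rather than a confirmed feature of the builds discussed here, and the even 50/50 ratio is an arbitrary placeholder.

```python
from llama_cpp import Llama

# Hypothetical two-GPU setup: divide the offloaded layers across devices.
# tensor_split appears in later cuBLAS builds of llama-cpp-python; treat this
# as a sketch of the interface, not as part of the GGML v2 workflow above.
llm = Llama(
    model_path="Wizard-Vicuna-13B-Uncensored.ggml.q8_0.bin",
    n_gpu_layers=40,          # offload everything that fits
    tensor_split=[0.5, 0.5],  # proportion of layers per GPU (placeholder ratio)
)
```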
Support for Apple Silicon and AMD hardware is also being explored, with users eagerly awaiting those developments. Promising results on top-tier consumer hardware are already fueling excitement about the future of AI models on consumer-grade equipment.
This breakthrough seems to differ from existing approaches like GPTQ and KoboldAI, although more benchmarks on various hardware setups are needed for a comprehensive comparison.
Tags: AI, GPU, CPU, VRAM, RAM, GGMLv2, AppleSilicon, AMD, GPTQ, KoboldAI