LLaVA: Large Language and Vision Assistant
The paper presents the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on this generated data, the authors introduce LLaVA, an end-to-end trained large multimodal model that connects a vision encoder with an LLM for general-purpose visual and language understanding.
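To make the architecture concrete, below is a minimal sketch of the kind of connector LLaVA uses: a linear layer that projects frozen CLIP ViT-L/14 patch features into the LLM's word-embedding space so image tokens can be interleaved with text tokens. The specific dimensions (1024 for the vision features, 5120 for a Vicuna-13B-sized LLM) and the 256-patch count are illustrative assumptions, not values taken from the summary above.

```python
# Sketch of a LLaVA-style vision-language connector (dimensions are assumed).
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        # LLaVA v1 uses a single linear projection as the connector.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [batch, num_patches, vision_dim] from the vision encoder
        return self.proj(patch_features)  # -> [batch, num_patches, llm_dim]


# Usage: project visual features and prepend them to the text embeddings
# before feeding the combined sequence to the language model.
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 256, 1024))   # projected image "tokens"
text_embeddings = torch.randn(1, 32, 5120)              # stand-in for LLM token embeddings
inputs_embeds = torch.cat([visual_tokens, text_embeddings], dim=1)
```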
LLaVA demonstrates impressive multimodal chat abilities and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. The authors make the GPT-4-generated visual instruction tuning data, their model, and their code base publicly available.
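For intuition on the 85.1% figure, the sketch below shows how a relative score of this kind can be computed: a judge rates the candidate model's answers and GPT-4's answers to the same questions, and the relative score is the ratio of the two average ratings. The rating values here are placeholder numbers for illustration, not results from the paper.

```python
# Illustrative relative-score computation (placeholder ratings, assumed protocol).
def relative_score(candidate_ratings, reference_ratings):
    """Return candidate quality as a percentage of the reference (e.g. GPT-4) quality."""
    avg_candidate = sum(candidate_ratings) / len(candidate_ratings)
    avg_reference = sum(reference_ratings) / len(reference_ratings)
    return 100.0 * avg_candidate / avg_reference


print(relative_score([8, 7, 9], [9, 9, 10]))  # ~85.7 in this made-up example
```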
One interesting comment suggests hooking LLaVA up to a webcam to provide additional context on who the chatbot is talking to and how its responses are being received. Another asks whether there are meaningful differences between LLaVA and MiniGPT-4, citing Bing's comparison of the two models. A third expresses excitement about LLaVA's potential.
The paper represents an important step in the exploration of instruction tuning in the multimodal field, and the results demonstrate the promise of LLaVA for general-purpose visual and language understanding. Entities mentioned in the comments include MiniGPT-4, Bing, Vicuna, and CLIP ViT-L/14.
Links: