A team of researchers from Microsoft Research, UC Berkeley, CMU, UIUC, UW–Madison, and MIT-IBM Watson AI Lab has developed a new large multimodal model called “Large Language and Vision Assistant” (LLaVA).
The goal is a general-purpose visual assistant that can follow both language and image instructions to perform tasks such as visual question answering, image captioning, and visual chat.
The researchers hope that LLaVA can serve as a foundation for building, and eventually surpassing, multimodal GPT-4-level models. They also plan to improve LLaVA with reinforcement learning from human feedback (RLHF) to reduce hallucination and increase factual grounding.
The model comes in two variants, LLaVA-1.5-7B and LLaVA-1.5-13B, and the weights are available for download.
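As an illustration (not part of the original announcement), the community "llava-hf" ports of the LLaVA-1.5 checkpoints on Hugging Face are typically queried with a simple USER/ASSISTANT prompt format. The sketch below only builds that prompt string; the actual model call, which needs the `transformers` library and substantial GPU memory, is shown as a hedged comment. The model id and prompt template are assumptions based on those community ports.

```python
# Sketch: preparing a query for a LLaVA-1.5 checkpoint.
# The model id and prompt format below are assumptions based on the
# community "llava-hf" Hugging Face ports, not the original post.

def build_llava_prompt(question: str) -> str:
    """Build the USER/ASSISTANT prompt format used by LLaVA-1.5 chat models.

    The <image> placeholder marks where the processor splices in the
    image tokens before the text is fed to the language model.
    """
    return f"USER: <image>\n{question} ASSISTANT:"


if __name__ == "__main__":
    prompt = build_llava_prompt("What is unusual about this image?")
    print(prompt)

    # Hypothetical usage with the transformers library (requires a GPU
    # and downloads several GB of weights):
    #
    # from transformers import AutoProcessor, LlavaForConditionalGeneration
    # model_id = "llava-hf/llava-1.5-7b-hf"
    # model = LlavaForConditionalGeneration.from_pretrained(model_id)
    # processor = AutoProcessor.from_pretrained(model_id)
    # inputs = processor(images=image, text=prompt, return_tensors="pt")
    # output = model.generate(**inputs, max_new_tokens=100)
    # print(processor.decode(output[0], skip_special_tokens=True))
```

The 13B variant uses the same interface; only the model id changes.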
LLaVA in action
You can try out the LLaVA model for free right now at llava-vl.github.io; below is an example of visual question answering.