A team of researchers from Microsoft Research, UC Berkeley, CMU, UIUC, UW–Madison, and MIT-IBM Watson AI Lab has developed a new large multimodal model called “Large Language and Vision Assistant” (LLaVA).
The goal is a general-purpose visual assistant that can follow both language and image instructions to perform tasks such as visual question answering, image captioning, and visual chat.
The researchers hope that LLaVA can serve as a foundation for building, and eventually surpassing, multimodal GPT-4-level models. They also plan to improve LLaVA with reinforcement learning from human feedback (RLHF) to reduce hallucination and increase factual grounding.
The model comes in two variants, LLaVA-1.5-7B and LLaVA-1.5-13B, and the weights are available for download.
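As an illustration (not part of the original announcement), the community "llava-hf" ports of the LLaVA-1.5 checkpoints on Hugging Face are typically queried with a simple USER/ASSISTANT prompt format. The sketch below only builds that prompt string; the actual model call, which needs the `transformers` library and substantial GPU memory, is shown as a hedged comment. The model id and prompt template are assumptions based on those community ports.

```python
# Sketch: preparing a query for a LLaVA-1.5 checkpoint.
# The model id and prompt format below are assumptions based on the
# community "llava-hf" Hugging Face ports, not the original post.

def build_llava_prompt(question: str) -> str:
    """Build the USER/ASSISTANT prompt format used by LLaVA-1.5 chat models.

    The <image> placeholder marks where the processor splices in the
    image tokens before the text is fed to the language model.
    """
    return f"USER: <image>\n{question} ASSISTANT:"


if __name__ == "__main__":
    prompt = build_llava_prompt("What is unusual about this image?")
    print(prompt)

    # Hypothetical usage with the transformers library (requires a GPU
    # and downloads several GB of weights):
    #
    # from transformers import AutoProcessor, LlavaForConditionalGeneration
    # model_id = "llava-hf/llava-1.5-7b-hf"
    # model = LlavaForConditionalGeneration.from_pretrained(model_id)
    # processor = AutoProcessor.from_pretrained(model_id)
    # inputs = processor(images=image, text=prompt, return_tensors="pt")
    # output = model.generate(**inputs, max_new_tokens=100)
    # print(processor.decode(output[0], skip_special_tokens=True))
```

The 13B variant uses the same interface; only the model id changes.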
LLaVA in action
You can try out the LLaVA model for free right now at llava-vl.github.io; below is an example of visual question answering.