Visual Instruction Tuning (LLaVA)

3 minute read

What is the work trying to do?

Instruction tuning has been shown to improve zero-shot capabilities in LLMs. This work extends the idea to multi-modal models: the authors augment existing visual question-answering data using GPT-4 and train an image-conditioned language generation model. The architecture consists of a pre-trained large language model (LLaMA) and a vision encoder (CLIP).

What is instruction tuning for LLMs?

Explicitness in Instructions:

  • Initial instruction: “Translate this text.”
  • Tuned instruction: “Translate this English text into French.”

Asking for Step-by-step Explanations:

  • Initial instruction: “How do you make coffee?”
  • Tuned instruction: “Provide a step-by-step guide on how to make coffee.”

Guiding the Format of the Answer:

  • Initial instruction: “Tell me about the solar system.”
  • Tuned instruction: “List the planets in the solar system in order of their distance from the sun.”

Controlling Verbosity:

  • Initial instruction: “Explain photosynthesis.”
  • Tuned instruction: “Explain photosynthesis in one sentence.”

Avoiding Biases or Assumptions:

  • Initial instruction: “Describe a typical wedding.”
  • Tuned instruction: “Describe a traditional Western wedding.”

Seeking Multiple Perspectives:

  • Initial instruction: “What’s the impact of deforestation?”
  • Tuned instruction: “Describe both the environmental and economic impacts of deforestation.”

Clarifying Ambiguities:

  • Initial instruction: “How long is a marathon?”
  • Tuned instruction: “How many kilometers long is a standard marathon race?”
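Concretely, instruction tuning trains on (instruction, response) pairs serialized into a prompt template. Below is a minimal sketch of such training examples in Python; the field names and template are illustrative assumptions, not taken from any particular dataset:

```python
# Minimal sketch of instruction-tuning training data.
# Field names and the prompt template are illustrative, not from the paper.
instruction_examples = [
    {
        "instruction": "Translate this English text into French.",
        "input": "The weather is nice today.",
        "response": "Il fait beau aujourd'hui.",
    },
    {
        "instruction": "Explain photosynthesis in one sentence.",
        "input": "",
        "response": "Photosynthesis is the process by which plants turn light, "
                    "water, and carbon dioxide into sugars and oxygen.",
    },
]

def to_training_text(example: dict) -> str:
    """Serialize one example into a single prompt + target string."""
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n"
    prompt += "### Response:\n"
    return prompt + example["response"]

for ex in instruction_examples:
    print(to_training_text(ex))
    print("---")
```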

What is the prior work in this regard?

(i) End-to-end models that are task-specific but do not use instruction tuning

(ii) Systems composed of several disparate models

Side note: What is visual prompt tuning? Adding a small number of trainable parameters to an LMM to enable parameter-efficient tuning.
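For intuition, here is a minimal PyTorch sketch of that idea: a handful of learnable prompt tokens prepended to the (frozen) patch embeddings of a vision backbone. The module name and dimensions are illustrative assumptions, not from the LLaVA paper.

```python
import torch
import torch.nn as nn

class VisualPromptTokens(nn.Module):
    """Prepend a small set of learnable prompt tokens to frozen patch embeddings."""

    def __init__(self, num_prompts: int = 10, embed_dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, embed_dim) from a frozen backbone
        batch = patch_embeddings.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompts, patch_embeddings], dim=1)

# Only self.prompts is trained; the backbone and LLM weights stay frozen.
```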

How is the data (expanded instructions) generated?

One simple way is to pair each image with a question drawn from a fixed list (e.g., “Describe the image briefly.”) and treat the caption as the answer.

What is the shortcoming of such an approach? It lacks diversity in the generated questions and hence the generated answers.

How do the authors overcome this shortcoming? They provide a textual description of the images, namely captions (from the labels) and bounding boxes (a spatial representation), and prompt a text-only LLM (GPT-4) to generate conversation pairs (a sketch of this prompting setup follows the list below).

The generated questions (instruction-following data) are categorized into 3 types:

  1. Conversation
  2. Detailed description
  3. Complex reasoning
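Roughly, the generation step hands a text-only LLM a symbolic view of the image (captions plus bounding boxes) and asks for one of the three response types above. Here is a minimal sketch of how such a prompt might be assembled; the wording, helper name, and the commented-out `query_llm` call are my own assumptions, not the paper's exact prompt:

```python
def build_generation_prompt(captions, boxes, response_type="conversation"):
    """Assemble a text-only prompt describing an image for a language model.

    captions: list of caption strings for the image.
    boxes: list of (category, [x1, y1, x2, y2]) tuples with normalized coordinates.
    response_type: "conversation", "detailed description", or "complex reasoning".
    """
    caption_block = "\n".join(f"- {c}" for c in captions)
    box_block = "\n".join(f"- {cat}: {coords}" for cat, coords in boxes)
    return (
        "You are given a symbolic description of an image.\n"
        f"Captions:\n{caption_block}\n"
        f"Objects (category: bounding box):\n{box_block}\n\n"
        f"Generate a {response_type} between a user asking about the image "
        "and an assistant that answers as if it can see the image."
    )

prompt = build_generation_prompt(
    captions=["A man rides a bicycle down a busy street."],
    boxes=[("person", [0.21, 0.30, 0.45, 0.92]), ("bicycle", [0.20, 0.55, 0.48, 0.95])],
    response_type="complex reasoning",
)
# answer = query_llm(prompt)  # hypothetical call to a GPT-4-style API client
```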

Training Architecture Details

The authors use a simple linear layer to project the features generated by the vision encoder into the same dimension as the LLM’s word embeddings.
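A minimal sketch of that projection; the dimensions below are illustrative assumptions (a CLIP ViT-L/14-sized feature and a 7B-scale LLM hidden size), not values quoted from the paper:

```python
import torch
import torch.nn as nn

clip_hidden_dim = 1024   # assumed CLIP vision feature dimension
llm_hidden_dim = 4096    # assumed LLM word-embedding dimension

# The learned projection maps visual features into the LLM's word-embedding space.
projector = nn.Linear(clip_hidden_dim, llm_hidden_dim)

image_features = torch.randn(1, 256, clip_hidden_dim)  # (batch, num_patches, dim) from frozen CLIP
visual_tokens = projector(image_features)              # (batch, num_patches, llm_hidden_dim)
# visual_tokens are then placed alongside the text token embeddings fed to the LLM.
```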

How is the model trained?

This is perhaps the most interesting aspect of the entire paper.

Step 1: Pre-training for feature alignment - The vision encoder (CLIP) and the LLM (LLaMA) weights are kept frozen. On a collected and filtered (but still noisy) image-caption dataset, only the projection matrix is learned, so that the visual features become aligned with the LLM’s word-embedding space.

Step 2: End-to-end fine-tuning - On the instruction-following data generated from COCO images, the vision encoder is kept frozen while the projection matrix and the language model weights are updated.
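The two stages differ only in which parameters receive gradients. A minimal PyTorch-style sketch, under the assumption of three placeholder modules (`vision_encoder`, `projector`, `llm`) standing in for the real components:

```python
import torch.nn as nn

# Placeholder modules standing in for the real components.
vision_encoder = nn.Identity()      # frozen CLIP vision tower (no trainable stand-in params here)
projector = nn.Linear(1024, 4096)   # the learned projection matrix
llm = nn.Linear(4096, 4096)         # stand-in for the language model

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: pre-training for feature alignment -- only the projection is learned.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2: end-to-end fine-tuning -- projection and LLM are updated, CLIP stays frozen.
set_trainable(llm, True)
```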

Notable Results

The authors use GPT-4 to evaluate generated answers. On the ScienceQA dataset, LLaVA outperforms text-only GPT-4. Moreover, when the GPT-4 answers differ from the LLaVA answers, they prompt GPT-4 once more with the question and both answers so it can produce a final answer. The ensemble created this way further boosts performance.
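A rough sketch of that judge-style ensemble logic; the prompt wording and the `ask_gpt4` callable are hypothetical stand-ins, not the paper's exact setup:

```python
def ensemble_answer(question: str, llava_answer: str, gpt4_answer: str, ask_gpt4) -> str:
    """Return the shared answer if the two models agree; otherwise ask GPT-4 to arbitrate."""
    if llava_answer.strip().lower() == gpt4_answer.strip().lower():
        return llava_answer
    judge_prompt = (
        f"Question: {question}\n"
        f"Answer 1: {llava_answer}\n"
        f"Answer 2: {gpt4_answer}\n"
        "The two answers disagree. Review both and give a single final answer."
    )
    return ask_gpt4(judge_prompt)  # ask_gpt4 is a hypothetical GPT-4 client function
```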

Strengths

  • Ensemble: Using GPT-4 as an arbiter whenever the GPT-4 and LLaVA answers disagree yields an ensemble that outperforms either model on its own
  • Ablation: The authors study the impact of CoT-style (reasoning-first) prompting with LMMs and show that, even though reasoning-first answers do not improve the best achieved performance, they cut the time to convergence roughly in half

Weaknesses

  • The use of GPT-4 as the evaluator, with no human evaluation and no discussion of how well the GPT-4-based metric correlates with human judgment

Rating

3.5/5 - The primary reasons for this rating are the evaluation methodology and the lack of substantial discussion of alternative in-context learning methods or experimentation with architectural choices. However, the methodology for augmenting image-based questions and answers is quite impressive and could be adapted quickly to a new problem for improved performance.
