ALIGNING LARGE MULTIMODAL MODELS WITH FACTUALLY AUGMENTED RLHF
These are my insights from the paper "Aligning Large Multimodal Models with Factually Augmented RLHF".
What is Large Multimodal Model (LMM)?
LMMs are advanced AI systems capable of understanding and generating information from multiple data modalities or sources, such as text, images, audio, and video.
What is a hallucination?
The term refers to a situation where the LMM generates a response that includes information not present or implied in the image or previous conversation. A hallucination could be a false claim about an object, action, emotion, or any other detail that is not grounded in the image.
Types of hallucinations?
Object Hallucination:
Image Contents: Lynx, Cat
Question: How many cats can you see in this image?
Standard Human-Generated Answer: There are two cats visible in the image.
LMM Response to Evaluate: There is one cat in the image.
Analysis: The LMM's response is false because it claims there is only one cat in the image, when in fact there are two.
Hallucination: Yes.
Detail Hallucination:
Image Contents: Tree, Plant, Panda, Animal
Question: What can you see from this photo?
Standard Human-Generated Answer: In this photo, we can see branches. We can see a panda on a branch. The background is blurry.
LMM Response to Evaluate: In the photo, I can see a large black and white panda sitting on a tree branch near the water.
Analysis: The response places the panda near water, but there is no water in the image, so this detail is not grounded in the photo.
Hallucination: Yes.
Location Hallucination:
Image Contents: Clothing, Luggage and bags, Man, Building, Woman, Person
Question: Where is this photo taken?
Standard Human-Generated Answer: From the slogan "Welcome to Houston, Texas", we can infer that this photo is taken at an airport in Houston, Texas.
LMM Response to Evaluate: This photo is taken at an airport, specifically at the entrance of a terminal. However, the specific location of this airport cannot be determined from the image.
Hallucination: Partly.
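
To make the object-hallucination case above concrete, here is a minimal sketch of how such a check could be automated by comparing the objects a response mentions against the annotated image contents. The keyword matching and the candidate_objects list are my own simplification for illustration; the paper itself relies on a GPT-4 judge rather than string matching.

```python
# Minimal sketch: flag object hallucinations by checking whether the objects
# an LMM mentions appear in the annotated image contents. This keyword-based
# check is a simplification for illustration; the paper uses GPT-4 to judge
# hallucinations instead.

def find_object_hallucinations(response: str, image_contents: list[str],
                               candidate_objects: list[str]) -> list[str]:
    """Return candidate objects mentioned in the response but absent from the image."""
    grounded = {obj.lower() for obj in image_contents}
    mentioned = [obj for obj in candidate_objects if obj.lower() in response.lower()]
    return [obj for obj in mentioned if obj.lower() not in grounded]


if __name__ == "__main__":
    image_contents = ["Lynx", "Cat"]
    response = "There is one cat and a dog playing in the image."
    candidate_objects = ["cat", "dog", "lynx", "bird"]
    print(find_object_hallucinations(response, image_contents, candidate_objects))
    # -> ['dog']: the response mentions a dog that is not in the image.
```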
To address and reduce these hallucinations, the paper suggests the following methods:
Enriching Vision Instruction Tuning Data:
(LLaVA is a vision-language model trained on synthetic image-based dialogues)
The paper suggests enriching the GPT-4-generated vision instruction-tuning data from LLaVA with existing human-authored image-text pairs. This approach aims to improve the quality and reliability of the data used for training LMMs.
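
As a rough sketch (not the paper's actual pipeline), this is how mixing the synthetic LLaVA conversations with human-authored image-text pairs could look; the file names and record fields are assumptions I made for illustration.

```python
# Minimal sketch of enriching GPT-4-generated vision instruction data with
# human-authored image-text pairs. File names and record fields are assumed
# for illustration, not taken from the paper's code.
import json
import random


def qa_pair_to_conversation(record: dict) -> dict:
    """Convert a human-annotated QA pair into a conversation-style training record."""
    return {
        "image": record["image"],
        "conversations": [
            {"from": "human", "value": record["question"]},
            {"from": "gpt", "value": record["answer"]},  # human-authored ground truth
        ],
    }


def build_mixture(synthetic_path: str, human_path: str, out_path: str) -> None:
    with open(synthetic_path) as f:
        synthetic = json.load(f)          # GPT-4-generated LLaVA conversations
    with open(human_path) as f:
        human = [qa_pair_to_conversation(r) for r in json.load(f)]

    mixture = synthetic + human
    random.shuffle(mixture)               # interleave the two data sources
    with open(out_path, "w") as f:
        json.dump(mixture, f, indent=2)
```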
Adopting the Reinforcement Learning from Human Feedback (RLHF) Algorithm:
The RLHF algorithm, originally from the text domain, is adapted to bridge vision-language gaps. In this approach, human evaluators compare two responses and pinpoint the more hallucinated one. The vision-language model is then trained to maximize the simulated human rewards.
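
A standard way to train such a reward model from these pairwise human comparisons is a Bradley-Terry style loss; below is a minimal PyTorch sketch. The scalar rewards are stand-ins for the outputs of a reward model over (image, prompt, response) inputs, which I leave abstract here.

```python
# Minimal sketch of the pairwise reward-model objective used in RLHF:
# the reward of the response humans preferred (the less hallucinated one)
# should exceed the reward of the rejected one.
import torch
import torch.nn.functional as F


def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


if __name__ == "__main__":
    # Toy example with random scalar rewards for a batch of 4 comparisons.
    r_chosen = torch.randn(4, requires_grad=True)
    r_rejected = torch.randn(4, requires_grad=True)
    loss = preference_loss(r_chosen, r_rejected)
    loss.backward()
    print(float(loss))
```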
Factually Augmented RLHF:
A novel alignment algorithm called Factually Augmented RLHF is introduced. This method augments the reward model with additional factual information, such as image captions and ground-truth multiple-choice options. This strategy aims to counter the reward-hacking phenomenon observed in RLHF and further enhance model performance.
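
Here is a hedged sketch of the core idea at the input level: the text the reward model scores is augmented with ground-truth facts such as image captions and, when available, the correct multiple-choice option, so unsupported claims are easier to penalize. The prompt template below is my own illustration, not the paper's exact format.

```python
# Minimal sketch of Factually Augmented RLHF at the input level: the reward
# model sees not only the question and candidate response but also
# ground-truth facts about the image. The template is illustrative only.

def build_reward_input(question: str,
                       response: str,
                       gt_captions: list[str],
                       gt_choice: str | None = None) -> str:
    facts = " ".join(gt_captions)
    if gt_choice is not None:
        facts += f" Correct option: {gt_choice}."
    return (
        f"Question: {question}\n"
        f"Response to score: {response}\n"
        f"Ground-truth facts about the image: {facts}"
    )


if __name__ == "__main__":
    text = build_reward_input(
        question="How many cats can you see in this image?",
        response="There is one cat in the image.",
        gt_captions=["A lynx and a cat are sitting together on a rock."],
    )
    print(text)  # this augmented text (plus image features) is what the reward model scores
```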
MMHAL-BENCH Evaluation Benchmark:
To evaluate the effectiveness of the proposed strategies in real-world scenarios, the paper introduces MMHAL-BENCH. This new evaluation benchmark specifically focuses on penalizing hallucinations, providing a tangible measure of how well the model avoids generating ungrounded or inaccurate information.
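
As a rough illustration of what the benchmark's bookkeeping can look like, the sketch below aggregates per-question judgments (a quality score plus a hallucination flag, e.g. produced by a GPT-4 judge that sees the image contents and a reference answer) into an average score and a hallucination rate. The record fields and the 0-6 score range are my assumptions for illustration.

```python
# Minimal sketch of aggregating benchmark judgments into summary metrics:
# an average response score and the fraction of responses flagged as
# hallucinated. The per-question judgments are assumed as input.

def summarize_results(judgments: list[dict]) -> dict:
    """Each judgment: {'score': int in [0, 6], 'hallucination': bool} (assumed format)."""
    n = len(judgments)
    avg_score = sum(j["score"] for j in judgments) / n
    hallucination_rate = sum(j["hallucination"] for j in judgments) / n
    return {"average_score": avg_score, "hallucination_rate": hallucination_rate}


if __name__ == "__main__":
    toy = [
        {"score": 6, "hallucination": False},
        {"score": 2, "hallucination": True},
        {"score": 4, "hallucination": False},
    ]
    print(summarize_results(toy))
    # -> {'average_score': 4.0, 'hallucination_rate': 0.333...}
```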
Open-Sourcing for Community Engagement:
The paper concludes with the intention to open-source their code, models, and data. This move is expected to foster community engagement and further research, potentially leading to more strategies and refinements in reducing hallucinations in LMMs.