What can a Multimodal Large Language Model do with an image? Recently, we wrote about Apple's open-source Ferret model and touched on how it works, so now it's time to look at a use case.

Ferret may be the first MLLM to handle free-form referring and grounding: a user can refer to any region of an image, whether a point, a box, or a hand-drawn shape, and the model grounds its answers in concrete image regions. Because each answer is tied back to something actually present in the image, Ferret's hallucination-to-correct-answer ratio stays very low.

Look at the image below; it shows how interactions with Ferret work. As the user selects points, boxes, or free-form regions, the model identifies the elements, recognizes relations between objects, and uses its LLM backbone to synthesize more complex answers about the processed image. (For the curious, a toy sketch of what such a region-referring prompt could look like in code is at the end of this post.)

Are we getting much closer to a reliable AI assistant?

---

For more insights on #data and #MachineLearning, follow Sparkbit on LinkedIn. If you're looking for a tech partner in your #AI projects, DM us or leave a message via the contact form on our website at https://www.sparkbit.pl/
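
P.S. For the technically curious, here is a minimal, purely illustrative Python sketch of how a region-referring prompt for a Ferret-style model could be assembled from a user's point, box, or free-form selection. The 0-999 coordinate grid, the `<region>` placeholder, and every name in this snippet are our assumptions for illustration; Ferret's actual input format is defined in Apple's apple/ml-ferret repository.

```python
# Illustrative sketch only: the region-to-text convention below (a 0-999
# integer grid and a "<region>" feature placeholder) is an assumption
# modeled on common grounding-MLLM prompt formats, not Ferret's exact API.

from dataclasses import dataclass
from typing import Sequence, Tuple, Union

GRID = 1000  # assumed quantization grid for normalized coordinates


@dataclass
class Point:
    x: float  # pixel coordinates in the source image
    y: float


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class FreeForm:
    vertices: Sequence[Tuple[float, float]]  # user-drawn outline


Region = Union[Point, Box, FreeForm]


def _q(v: float, size: float) -> int:
    """Quantize a pixel coordinate onto the assumed 0..GRID-1 grid."""
    return min(GRID - 1, max(0, int(v / size * GRID)))


def region_to_text(region: Region, width: int, height: int) -> str:
    """Render a referred region as text. "<region>" stands in for the
    continuous visual feature that would accompany the coordinates."""
    if isinstance(region, Point):
        return f"[{_q(region.x, width)}, {_q(region.y, height)}] <region>"
    if isinstance(region, Box):
        return (f"[{_q(region.x1, width)}, {_q(region.y1, height)}, "
                f"{_q(region.x2, width)}, {_q(region.y2, height)}] <region>")
    pts = ", ".join(
        f"({_q(x, width)}, {_q(y, height)})" for x, y in region.vertices)
    return f"[{pts}] <region>"


def build_referring_prompt(question: str, region: Region,
                           width: int, height: int) -> str:
    """Splice the region reference into the question's {region} slot."""
    return question.format(region=region_to_text(region, width, height))


if __name__ == "__main__":
    # A user clicks a point and then drags a box on a 1280x960 image.
    print(build_referring_prompt(
        "What is the object at {region}?", Point(640, 480), 1280, 960))
    print(build_referring_prompt(
        "How does the object in {region} relate to the rest of the scene?",
        Box(100, 120, 540, 700), 1280, 960))
```

The idea this sketch tries to convey is Ferret's hybrid region representation: discrete coordinates spliced into the text prompt, paired with continuous visual features sampled from the referred region, which is what lets it accept points, boxes, and free-form shapes alike.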