Apple Ferret outclasses ChatGPT-Vision.

Context: LLMs are stochastic word predictors prone to hallucination. In textual tasks, grounding the model is easy - you simply paste the relevant data into the prompt. It's called in-context learning. Thanks to the sheer amount of data used for training, LLMs are remarkably effective even on previously unseen text.

How about Large Multimodal Models? Well, images are a different story.

Use case: I photograph a table full of ingredients and ask the model, "What can I cook from these?" The model's task is to recognize what's in the image (the grounding data) and match it with a recipe (the easy part, since that's textual again). This is a task that gives GPT Vision a lot of trouble. Ferret, on the other hand, handles it well - while being an order of magnitude smaller.

The most interesting part - how does this work?

The bottom line is "hybrid representation," and here's how it works:

- A standard Transformer-based model processes images by "brute force," applying an attention mechanism called global attention. Humans do it more intelligently.
- Example: a panoramic photo of a city. How do we figure out which city it is? Usually by looking for characteristic buildings (e.g., an opera house indicating the city is Sydney).
- A Transformer-based model scrutinizes all the pixels systematically - the bigger the picture, the more resources it uses.
- In Ferret, global attention is complemented by a free-form component: a spatial-aware visual sampler.
- In simple words, Ferret takes as input an image, a text prompt, and a visual cue (e.g., a free-form shape indicating an object).
- It then samples points inside the visual cue region and uses a k-nearest-neighbor (KNN) scheme to aggregate features around them, so the sampled region captures the whole object (a toy sketch of this idea follows below the post).
- This means that apart from global attention (to learn what's in the image), the model focuses on a specific part of the image (a region of interest).
- With that knowledge, it can also understand how the selected object relates to the whole image.

Apparently, Apple is getting ready to make MLLMs fully operational, and we can't wait to see how this will improve its product portfolio.

---

For more insights on #data and #MachineLearning, follow Sparkbit on LinkedIn. If you're looking for a tech partner in your #AI projects, DM us or leave a message via the contact form on our website at https://www.sparkbit.pl/
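
P.S. For the curious, here is a minimal NumPy sketch of the region-sampling idea described above: sample points inside a free-form cue, pool features from each point's k nearest neighbors, and produce one region embedding. This is an illustration under simplifying assumptions - the function name, shapes, and pooling choices are ours, not Apple's; Ferret's actual spatial-aware visual sampler is a multi-stage, learned module.

import numpy as np

def sample_region_features(feature_map, region_mask, num_points=32, k=8, seed=0):
    """Toy region sampler (assumed interface, not Ferret's API).
    feature_map: (H, W, C) visual features; region_mask: (H, W) bool free-form cue."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(region_mask)                   # pixel coords inside the cue
    coords = np.stack([ys, xs], axis=1).astype(float)

    # 1) Randomly sample points inside the region (the cue may be a rough scribble).
    idx = rng.choice(len(coords), size=min(num_points, len(coords)), replace=False)
    sampled = coords[idx]

    # 2) For each sampled point, gather its k nearest neighbors within the region
    #    and average their features -> one local descriptor per sampled point.
    descriptors = []
    for p in sampled:
        d = np.linalg.norm(coords - p, axis=1)
        nn = np.argsort(d)[:k]
        feats = feature_map[ys[nn], xs[nn]]            # (k, C)
        descriptors.append(feats.mean(axis=0))

    # 3) Pool the per-point descriptors into a single region embedding that could
    #    sit alongside the global image tokens fed to the language model.
    return np.stack(descriptors).max(axis=0)           # (C,)

# Toy usage: a 64x64 feature map and a circular "free-form" cue around (20, 40).
H, W, C = 64, 64, 16
fmap = np.random.rand(H, W, C).astype(np.float32)
yy, xx = np.mgrid[0:H, 0:W]
mask = (yy - 20) ** 2 + (xx - 40) ** 2 < 10 ** 2
region_embedding = sample_region_features(fmap, mask)
print(region_embedding.shape)                          # (16,)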