From LLMs to MLLMs: Unlocking Advanced Machine Intelligence

This article covers the Microsoft Research papers behind the Multimodal Large Language Models KOSMOS-1 and KOSMOS-2. The KOSMOS-1 paper was released in February 2023, and the KOSMOS-2 paper followed in June 2023.

Vision, the ability to detect and interpret light, was a key factor in the evolution of life and intelligence on Earth. It helped enable the Cambrian explosion, the rapid diversification of animal forms and functions about 540 million years ago, by giving animals a new way of exploring, exploiting, and competing in their environment. It also drove the evolution of neural complexity, behavior, and ecology, as animals had to process and integrate more information and make more sophisticated decisions and responses. Vision is arguably just as important for Artificial General Intelligence (AGI), the hypothetical ability of a machine or system to perform any intellectual task that a human can. Such a system would need to have, or simulate, the capabilities and flexibility of human vision: detecting, recognizing, inferring, generating, manipulating, learning, and integrating visual information, as well as using it for communication, reasoning, and creativity.

KOSMOS-1 is designed to understand and process multiple types of data, primarily text and images. It combines the strengths of large language models with the ability to perceive and analyze visual information, which makes it a versatile model capable of handling a wide range of tasks. To build KOSMOS-1, the researchers used a large amount of data collected from the web, including text corpora, image-caption pairs, and documents containing interleaved text and images. This large, diverse dataset allowed the model to learn to understand and process different types of information effectively. The model itself is built on the Transformer architecture.
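
To make the input side more concrete, here is a rough Python sketch of how an interleaved image-text document could be flattened into a single embedding sequence for a causal Transformer decoder. The token names, dimensions, and the tiny stand-in vision encoder are illustrative assumptions on my part, not the paper's actual implementation.

import torch
import torch.nn as nn

d_model = 64  # hypothetical embedding width
vocab = {"<s>": 0, "</s>": 1, "<image>": 2, "</image>": 3, "a": 4, "dog": 5}
embed = nn.Embedding(len(vocab), d_model)

# Stand-in for a vision encoder (e.g. a CLIP-style backbone) that maps
# one small image to a single visual embedding.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d_model))

def encode_segment(segment):
    """Return a (seq_len, d_model) tensor for a text or image segment."""
    if isinstance(segment, str):
        ids = torch.tensor([vocab[tok] for tok in segment.split()])
        return embed(ids)
    # Otherwise treat the segment as an image and wrap it in boundary tokens.
    img_emb = vision_encoder(segment.unsqueeze(0))  # shape (1, d_model)
    tags = embed(torch.tensor([vocab["<image>"], vocab["</image>"]]))
    return torch.cat([tags[:1], img_emb, tags[1:]], dim=0)

# An interleaved document: text, then an image, then more text.
doc = ["<s> a dog", torch.rand(3, 32, 32), "a dog </s>"]
sequence = torch.cat([encode_segment(seg) for seg in doc], dim=0)
print(sequence.shape)  # one flat sequence fed to the causal decoder

The point of the sketch is simply that once every segment is mapped into a shared embedding space, the decoder can be trained with the usual next-token prediction objective over the whole sequence.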

KOSMOS-1 has several unique capabilities that set it apart from traditional language models:

  1. Multimodal Perception: KOSMOS-1 can process and understand different types of data, such as text and images, allowing it to make sense of complex information and perform a wider range of tasks.
  2. In-Context Learning: The model can learn from a small number of examples (few-shot learning) or even from just the instructions given to it (zero-shot learning), so it can quickly adapt to new tasks without extensive retraining; a sketch of such a prompt follows this list.
  3. Instruction Following: KOSMOS-1 can understand and follow natural language instructions, making it easier for users to interact with the model and use it for various tasks.
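
As a concrete illustration of in-context learning, here is a minimal sketch of what a few-shot prompt could look like: each image is followed by its caption, and the model is asked to complete the final line. The layout and the IMG(...) placeholder are my own illustrative assumptions, not the paper's exact prompt template.

# Assumed few-shot layout: two worked image-caption pairs, then a query image.
# IMG(...) marks where an image's embeddings would be spliced into the sequence.
few_shot_prompt = [
    "IMG(dog_in_park.jpg)", "Caption: A dog running on the grass.",
    "IMG(cat_on_sofa.jpg)", "Caption: A cat sleeping on a sofa.",
    "IMG(query.jpg)",       "Caption:",  # the model completes this line
]
print("\n".join(few_shot_prompt))

In the zero-shot case, the worked examples are dropped and only a natural language instruction plus the query image remain.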

Here are some tasks for which KOSMOS-1 can be leveraged:

  • Image Captioning: Given an image, KOSMOS-1 can generate a descriptive caption that accurately summarizes the content of the image. For instance, if shown a picture of a dog playing in a park, the model might generate a caption like "A happy dog running on the grass in a sunny park" (a generic decoding sketch follows this list).
  • Visual Question Answering: KOSMOS-1 can answer questions based on visual information. For example, if asked, "What color is the car in the picture?" and given an image of a blue car, the model would correctly respond with "blue."
  • Multimodal Dialogue: The model can engage in conversations that involve both text and images. For instance, a user might ask KOSMOS-1, "What type of animal is in this picture?" and provide an image of a zebra. The model would be able to recognize the animal and respond with "It's a zebra."
  • OCR-Free Text Classification: KOSMOS-1 can analyze and understand text directly from images without relying on Optical Character Recognition (OCR) technology. This means it can process documents, web pages, and other text-based images more efficiently and accurately.
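
To show what caption generation looks like mechanically, here is a generic greedy decoding sketch: the model repeatedly scores the next token given the image embeddings and the tokens produced so far. The toy model below is a random stand-in, not KOSMOS-1's real interface.

import torch

def toy_model(image_embeddings, token_ids, vocab_size=6):
    # Stand-in for the multimodal decoder: returns arbitrary next-token scores.
    torch.manual_seed(len(token_ids))
    return torch.rand(vocab_size)

def greedy_caption(model, image_embeddings, bos_id=0, eos_id=1, max_len=10):
    """Greedy autoregressive decoding conditioned on image embeddings."""
    tokens = [bos_id]
    for _ in range(max_len):
        scores = model(image_embeddings, tokens)  # scores over the vocabulary
        next_id = int(torch.argmax(scores))
        tokens.append(next_id)
        if next_id == eos_id:  # stop at the end-of-sequence token
            break
    return tokens

print(greedy_caption(toy_model, torch.rand(4, 64)))

In practice, beam search or sampling would typically replace the greedy step, and a tokenizer would map the token IDs back to words.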

Some of the limitations of KOSMOS-1 are as follows:

  1. Lack of grounding capabilities: KOSMOS-1 does not have the ability to directly link text to specific regions in an image using bounding boxes or other object descriptions. This limitation makes it less effective in tasks that require precise understanding and localization of objects within images.
  2. Limited nonverbal reasoning: Although KOSMOS-1 demonstrates some ability to perform nonverbal reasoning tasks like the Raven IQ test, its performance is still significantly below the average level of adults.
  3. Incomplete understanding of the visual world: KOSMOS-1's training data is mainly based on web-scale multimodal corpora, which may not cover all aspects of the visual world. As a result, the model may have difficulty understanding certain objects or scenes that are not well-represented in its training data.
  4. Sensitivity to prompt design: The performance of KOSMOS-1 can be significantly influenced by the design of the input prompt. Crafting effective prompts can be challenging, and suboptimal prompts may lead to reduced performance or incorrect outputs.

KOSMOS-2 builds upon the capabilities of KOSMOS-1. It not only processes different types of data, such as text and images, but also adds grounding capabilities, meaning the model can link specific regions of an image to the relevant text and thereby deepen its understanding of both the visual and the textual information. Researchers combined the multimodal corpora used for KOSMOS-1 with a new large-scale dataset called GRIT (Grounded Image-Text pairs). GRIT is created by extracting noun phrases and referring expressions from captions and associating them with their corresponding objects or regions in images. By training on this data, KOSMOS-2 learns to connect textual descriptions to specific parts of an image, enabling it to better understand the relationship between text and visual content.
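
To make the grounding format concrete, the sketch below shows the box-to-token idea the paper describes: the image is divided into a grid of bins, and a bounding box is represented by the location tokens of the bins containing its top-left and bottom-right corners. The grid size and the <loc_k> token naming here are illustrative assumptions rather than the paper's exact vocabulary.

P = 32  # assumed number of bins per side

def box_to_location_tokens(box, image_width, image_height, bins=P):
    """Map a pixel-space box (x1, y1, x2, y2) to two discrete location tokens."""
    x1, y1, x2, y2 = box
    def bin_index(x, y):
        col = min(int(x / image_width * bins), bins - 1)
        row = min(int(y / image_height * bins), bins - 1)
        return row * bins + col
    return f"<loc_{bin_index(x1, y1)}>", f"<loc_{bin_index(x2, y2)}>"

# A hypothetical box around an object in a 640x480 image.
print(box_to_location_tokens((96, 120, 320, 400), 640, 480))

Representing boxes as discrete tokens is what lets a plain language model read and write spatial information the same way it reads and writes words.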

KOSMOS-2 has unique capabilities, including:

  1. Multimodal Grounding: KOSMOS-2 can connect textual descriptions directly to specific regions of an image, improving its ability to comprehend and analyze complex information in both text and images.
  2. Multimodal Referring: The model can understand image regions or objects referred to by users via bounding boxes, making it easier for users to interact with the model and reducing ambiguity.
  3. Improved Nonverbal Reasoning: KOSMOS-2 demonstrates better performance in nonverbal reasoning tasks like the Raven IQ test when compared to KOSMOS-1, indicating its potential for recognizing abstract concepts and identifying underlying patterns in nonverbal contexts.

Here are some examples to illustrate the capabilities and unique features of KOSMOS-2:

  • Phrase Grounding: Given an image and a phrase like "the red ball," KOSMOS-2 can generate a bounding box around the red ball in the image, demonstrating its ability to link textual descriptions to specific regions of an image.
  • Referring Expression Comprehension: If KOSMOS-2 is given an image and a referring expression like "the man wearing a blue hat," the model can identify and generate a bounding box around the person in the image wearing a blue hat (a decoding sketch follows this list).
  • Referring Expression Generation: In this task, KOSMOS-2 is given an image and a bounding box around a specific object or region. The model then generates a referring expression that accurately describes the object or region within the bounding box. For example, if given an image of a park and a bounding box around a bench, the model might generate the expression "the wooden bench under the tree."
  • Grounded Image Captioning: KOSMOS-2 can generate image captions that are grounded in specific regions of the image. For example, if given an image of a beach scene, the model might generate a caption like "A group of people is playing volleyball near the water, with a red umbrella in the foreground."
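
Complementing the earlier encoding sketch, here is how a pair of predicted location tokens could be mapped back to an approximate pixel-space box, which is essentially what referring expression comprehension needs before a box can be drawn. The grid size and token names remain my own illustrative assumptions.

import re

P = 32  # bins per side; must match the encoding side

def location_tokens_to_box(tok_tl, tok_br, image_width, image_height, bins=P):
    """Map two location tokens back to an approximate (x1, y1, x2, y2) box."""
    def bin_center(token):
        k = int(re.match(r"<loc_(\d+)>", token).group(1))
        row, col = divmod(k, bins)
        return ((col + 0.5) / bins * image_width,
                (row + 0.5) / bins * image_height)
    x1, y1 = bin_center(tok_tl)
    x2, y2 = bin_center(tok_br)
    return x1, y1, x2, y2

# Decoding the tokens produced in the earlier sketch (640x480 image).
print(location_tokens_to_box("<loc_260>", "<loc_848>", 640, 480))

The recovered box is only as precise as the grid, which is why a finer grid trades vocabulary size against localization accuracy.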

KOSMOS-2 is an advanced AI model that improves upon KOSMOS-1 by incorporating grounding. Its unique features make it a valuable tool for a wide range of applications and demonstrate the potential for even more advanced AI models in the future.

In conclusion, advancements in Multimodal Large Language Models are driving innovation and pushing the boundaries of artificial intelligence. These models demonstrate the ability to process and understand a wide range of data types and to perform complex tasks across different modalities. While KOSMOS-1 laid the foundation for a versatile and powerful model, KOSMOS-2 went a step further by incorporating grounding capabilities that link textual descriptions to specific regions of an image. This improvement allows for a deeper understanding of both visual and textual information, making it a valuable tool for various applications. I feel the ability to perceive the environment through visual input is going to be an important ingredient for advanced machine intelligence; whether or not that amounts to AGI is a debate for another day.

Reference Papers:

  • KOSMOS-1: https://arxiv.org/pdf/2302.14045.pdf
  • KOSMOS-2: https://arxiv.org/pdf/2306.14824.pdf
