From LLMs to MLLMs: Unlocking Advanced Machine Intelligence
This article discusses the Microsoft Research papers behind the Multimodal Large Language Models KOSMOS-1 and KOSMOS-2. The KOSMOS-1 paper was released in February 2023, and the KOSMOS-2 paper followed in June 2023.
Vision, the ability to detect and interpret light, was a key factor in the evolution of life and intelligence on Earth. It enabled the Cambrian explosion, the rapid diversification of animal forms and functions about 540 million years ago, by giving animals a new way to explore, exploit, and compete in their environment. It also drove the evolution of neural complexity, behavior, and ecology, as animals had to process and integrate more information and make more sophisticated decisions.

Vision may prove just as important for Artificial General Intelligence (AGI), the hypothetical ability of a machine or system to perform any intellectual task that a human can. Such a system would need to have, or simulate, the capabilities and flexibility of human vision: detecting, recognizing, inferring, generating, manipulating, learning, and integrating visual information, and using it for communication, reasoning, and creativity.
KOSMOS-1 is designed to perceive and process multiple types of data. The paper frames it as a Multimodal Large Language Model (MLLM) that aligns perception with language: it combines the strengths of language models with the ability to take in other modalities, and in practice it was trained on text and images (the architecture is designed to generalize to further modalities). To build KOSMOS-1, researchers used a large amount of data collected from the web: plain text corpora, image-caption pairs, and documents that interleave text and images. This volume and diversity of data is what lets the model learn to handle different types of information effectively. Architecturally, KOSMOS-1 is a Transformer-based causal language model: images are encoded into embeddings and placed into the same token sequence as the text, so the model predicts one unified stream. A minimal sketch of this interleaved input format follows.
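To make the interleaving concrete, here is a minimal Python sketch (not the official KOSMOS-1 code) of how a mixed image-text document could be flattened into a single sequence for a Transformer decoder. The `<image>`/`</image>` boundary tokens follow the paper; `embed_text` and `embed_image` are hypothetical stand-ins for the real tokenizer and vision encoder.

```python
# Minimal sketch: flatten an interleaved image-text document into one
# token sequence. <image>/</image> markers follow the KOSMOS-1 paper;
# everything else here is an illustrative placeholder.

from typing import List, Union


class Image:
    """Placeholder for raw image data."""
    pass


Segment = Union[str, Image]


def embed_text(text: str) -> List[str]:
    # Stand-in: a real tokenizer would return token ids / embeddings.
    return text.split()


def embed_image(img: Image) -> List[str]:
    # Stand-in: a vision encoder would return a fixed number of
    # continuous image embeddings (shortened to 4 placeholders here).
    return ["<img_emb>"] * 4


def build_sequence(doc: List[Segment]) -> List[str]:
    """Flatten text and image segments into one decoder input stream."""
    seq = ["<s>"]
    for seg in doc:
        if isinstance(seg, Image):
            seq += ["<image>"] + embed_image(seg) + ["</image>"]
        else:
            seq += embed_text(seg)
    return seq + ["</s>"]


print(build_sequence(["A cute cat.", Image(), "It sits on a mat."]))
```

The key design choice is that images do not get a separate pathway through the model: once encoded, their embeddings sit in the same sequence as word embeddings, so the standard next-token objective trains cross-modal understanding for free.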
KOSMOS-1 has several capabilities that set it apart from traditional language models:

- Zero-shot and few-shot multimodal learning: it can follow instructions and in-context examples that freely mix text and images.
- OCR-free language understanding: it can read text rendered inside images without a separate OCR step.
- Multimodal chain-of-thought prompting: it can produce intermediate reasoning steps about an image before giving a final answer.
- Nonverbal reasoning: the paper evaluates it on Raven's Progressive Matrices style IQ tests posed purely as images.
Here are some tasks KOSMOS-1 can be leveraged for:

- Language tasks such as understanding, generation, and OCR-free text classification.
- Perception-language tasks such as image captioning and visual question answering.
- Vision tasks such as zero-shot image classification, including classification guided by textual descriptions.
- Cross-modal transfer, where knowledge learned in language improves performance on visual tasks and vice versa.
Some of the limitations of KOSMOS-1 are as follows:

- It cannot ground its output to specific regions of an image; it describes the picture as a whole, which is exactly the gap KOSMOS-2 targets.
- It is relatively small by LLM standards (about 1.6B parameters), which caps its language ability.
- Like other large language models, it can hallucinate details that are not actually present in the input.
KOSMOS-2 builds upon the capabilities of KOSMOS-1. It processes the same kinds of data, text and images, but adds grounding: the model can link specific regions of an image to spans of text, which tightens its understanding of both the visual and the textual information. To train it, researchers combined the multimodal corpora used for KOSMOS-1 with a new large-scale dataset called GRIT (Grounded Image-Text pairs). GRIT is created by extracting noun phrases and referring expressions from captions and associating each with its corresponding object or region in the image. By training on this data, KOSMOS-2 learns to connect textual descriptions to specific parts of an image. Concretely, the paper serializes each region as a pair of discrete location tokens appended to the grounded phrase; a sketch of that encoding follows.
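Here is a minimal sketch of that location-token scheme, based on the description in the KOSMOS-2 paper: the image is divided into a P x P grid (P = 32 in the paper), and a bounding box becomes two tokens, the patch indices of its top-left and bottom-right corners, wrapped in `<object>...</object>` after a `<phrase>...</phrase>` span. The helper names below are illustrative, not the official code.

```python
# Minimal sketch: serialize a bounding box as KOSMOS-2-style location
# tokens. The 32 x 32 grid and token format follow the paper; the
# function names are illustrative.

P = 32  # grid size used by KOSMOS-2


def point_to_patch_index(x: float, y: float) -> int:
    """Map normalized (x, y) in [0, 1] to a patch index in [0, P*P - 1]."""
    col = min(int(x * P), P - 1)
    row = min(int(y * P), P - 1)
    return row * P + col


def box_to_tokens(x1: float, y1: float, x2: float, y2: float) -> str:
    """Encode a box by its top-left and bottom-right patch indices."""
    tl = point_to_patch_index(x1, y1)
    br = point_to_patch_index(x2, y2)
    return f"<object><patch_index_{tl:04d}><patch_index_{br:04d}></object>"


def ground_phrase(phrase: str, box) -> str:
    """Attach a region to a noun phrase, GRIT-style."""
    return f"<phrase>{phrase}</phrase>{box_to_tokens(*box)}"


# e.g. a noun phrase from a caption linked to its region:
print(ground_phrase("a snowman", (0.25, 0.1, 0.7, 0.9)))
```

Because regions become ordinary tokens, grounding needs no new output head: the same next-token prediction that generates words also generates locations.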
KOSMOS-2 has unique capabilities, including:

- Multimodal grounding: phrase grounding and referring expression comprehension, i.e., pointing a noun phrase or a description at the right bounding box in an image.
- Multimodal referring: understanding and generating descriptions for a region that is specified by its bounding box.
- Everything KOSMOS-1 could already do: perception-language tasks, language understanding, and generation.
Here are some examples to illustrate the capabilities and unique features of KOSMOS-2 (a sketch of decoding this output format follows the list):

- Phrase grounding: given an image of a snowman and the phrase "a snowman", the model emits location tokens identifying the snowman's bounding box.
- Referring expression comprehension: given "the man in the red shirt", it localizes that specific person rather than any person.
- Grounded captioning and question answering: generated text can carry location tokens, so each mentioned object is tied to where it appears in the image.
- Referring expression generation: given a bounding box, the model describes what is inside it.
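To show what consuming such output looks like, here is a minimal sketch of decoding a KOSMOS-2-style grounded string back into (phrase, bounding box) pairs, inverting the location-token scheme sketched earlier. The regex and helper names are illustrative, not the official decoding code.

```python
# Minimal sketch: parse grounded output of the form
# <phrase>...</phrase><object><patch_index_NNNN><patch_index_NNNN></object>
# back into (phrase, box) pairs.

import re

P = 32  # must match the grid size used for encoding

PATTERN = re.compile(
    r"<phrase>(.+?)</phrase><object>"
    r"<patch_index_(\d{4})><patch_index_(\d{4})></object>"
)


def patch_index_to_point(idx: int):
    """Return the (x, y) center of patch idx, normalized to [0, 1]."""
    row, col = divmod(idx, P)
    return ((col + 0.5) / P, (row + 0.5) / P)


def parse_grounded(text: str):
    """Extract every (phrase, (x1, y1, x2, y2)) pair from model output."""
    out = []
    for phrase, tl, br in PATTERN.findall(text):
        x1, y1 = patch_index_to_point(int(tl))
        x2, y2 = patch_index_to_point(int(br))
        out.append((phrase, (x1, y1, x2, y2)))
    return out


sample = ("<phrase>a snowman</phrase>"
          "<object><patch_index_0104><patch_index_0918></object>")
print(parse_grounded(sample))
```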
KOSMOS-2 is an advanced AI model that improves upon KOSMOS-1 by incorporating grounding capabilities. Its unique features make it a valuable tool for a wide range of applications and demonstrate the potential for even more advanced AI models in the future.
In conclusion, the advances in Multimodal Large Language Models are driving innovation and pushing the boundaries of artificial intelligence. These models can process and understand a wide range of data types and perform complex tasks across modalities. KOSMOS-1 laid the foundation for a versatile and powerful AI model; KOSMOS-2 took a step further by adding grounding that links textual descriptions to specific regions of an image, allowing a deeper joint understanding of visual and textual information. I believe the ability to perceive the environment through visual input will be an important ingredient of advanced machine intelligence; whether that amounts to AGI is a debate for another day.
Reference Papers:

- Language Is Not All You Need: Aligning Perception with Language Models (KOSMOS-1), arXiv:2302.14045
- KOSMOS-2: Grounding Multimodal Large Language Models to the World, arXiv:2306.14824