Multimodality: Next Wave in Artificial Intelligence
MUST Research Labs
Data Science | Cognitive Computing | Artificial Intelligence | Machine Learning | Advanced Analytics
Author: Pooja Palod
Multimodal AI learns from multiple data modalities (text, images, audio, and video), which lets it understand and analyze information more fully. In real-world problems, multimodal AI tends to outperform single-modal AI. We humans prefer multimodality too: we learn more from text accompanied by images or videos than from text alone, and human cognition is inherently multimodal. Machine learning systems, by contrast, possess far less common-sense knowledge than humans. Common sense is typically acquired through a combination of visual, linguistic, and sensory cues rather than language alone, so multimodality is one way to address this gap in artificially intelligent systems.
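To make the idea concrete, here is a minimal, hypothetical sketch of one simple way to combine modalities: late fusion, where each modality is encoded separately and the embeddings are concatenated before a shared prediction head. All names, dimensions, and the architecture itself are illustrative assumptions, not taken from any of the models discussed in this article.

```python
# Illustrative late-fusion sketch (assumption: embeddings come from separate
# pretrained text and vision encoders, e.g. a language model and a ViT).
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # project text embedding
        self.image_proj = nn.Linear(image_dim, hidden)  # project image embedding
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, num_classes),         # classify the fused features
        )

    def forward(self, text_emb, image_emb):
        # Concatenate the two projected modality embeddings, then classify.
        fused = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.head(fused)

# Dummy usage with random embeddings standing in for real encoder outputs.
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 10])
```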
This week, several of the research giants released large multimodal models. The latest are GPT-4 from OpenAI and PaLM-E from Google.
GPT-4 is a large-scale, multimodal model that accepts image and text inputs and produces text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks. On a simulated bar exam, it achieves a score in the top 10% of test takers, whereas GPT-3.5 scores in the bottom 10%. On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems. On the MMLU benchmark, an English-language suite of multiple-choice questions covering 57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English but also demonstrates strong performance in other languages.
GPT-4 outperforms its predecessor, but it retains the limitations of earlier GPT models: it still hallucinates, so its outputs are not fully reliable, and it cannot learn from experience. The use of such large language models also raises safety challenges. The authors of the technical report note that careful study of these challenges is an important area of research given the potential societal impact. Among the risks they foresee are bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more. They also describe interventions made to mitigate potential harms from deploying GPT-4, including adversarial testing with domain experts and a model-assisted safety pipeline.
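To illustrate the text-plus-image input and text output interface described above, here is a hedged sketch using the OpenAI Python client. The model name and image URL are placeholders, and it assumes a vision-capable GPT-4 variant is available on your account; check the current OpenAI documentation for the exact model names.

```python
# Sketch: sending text plus an image to a vision-capable GPT-4 model via the
# OpenAI Python client (v1+). Model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable GPT-4 variant works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the model replies with text only
```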
Google has released PaLM-E, an embodied multimodal language model. It takes PaLM, a powerful large language model, and "embodies" it (the "E" in PaLM-E) by complementing it with sensor data from a robotic agent: the language model is trained on raw streams of robot sensor data alongside text. The resulting model not only enables highly effective robot learning, it is also a state-of-the-art general-purpose visual-language model while maintaining excellent language-only capabilities. The researchers demonstrated that this diversity in training yields positive transfer from the vision and language domains into embodied decision-making, enabling robot planning tasks to be solved data-efficiently. Google has also released the PaLM API and MakerSuite, which developers and businesses can use to build AI applications.
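A rough way to picture the "embodiment" idea is that continuous sensor observations are projected into the same embedding space as the language model's text tokens and interleaved with them, so the model can reason over both. The sketch below is a simplified, hypothetical illustration of that pattern, not PaLM-E's actual implementation; all class names and dimensions are assumptions.

```python
# Conceptual sketch: map a sensor feature vector to a few pseudo-token
# embeddings and interleave them with text-token embeddings before the LLM.
import torch
import torch.nn as nn

class SensorToTokenAdapter(nn.Module):
    """Projects a sensor feature vector into a short sequence of pseudo-token embeddings."""
    def __init__(self, sensor_dim=1024, lm_dim=4096, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(sensor_dim, lm_dim * num_tokens)
        self.num_tokens, self.lm_dim = num_tokens, lm_dim

    def forward(self, sensor_feat):                      # (batch, sensor_dim)
        x = self.proj(sensor_feat)                       # (batch, lm_dim * num_tokens)
        return x.view(-1, self.num_tokens, self.lm_dim)  # (batch, num_tokens, lm_dim)

# Interleave text embeddings and sensor "tokens" into one sequence for the LLM.
adapter = SensorToTokenAdapter()
text_before = torch.randn(1, 6, 4096)   # e.g. embeddings for "Given <obs>, pick up the"
sensor_tokens = adapter(torch.randn(1, 1024))
text_after = torch.randn(1, 3, 4096)    # e.g. embeddings for "green block."
lm_input = torch.cat([text_before, sensor_tokens, text_after], dim=1)
print(lm_input.shape)  # torch.Size([1, 13, 4096]) -- fed to the language model
```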
Microsoft released Kosmos-1, a large language model capable of perceiving multimodal input, following instructions, and performing in-context learning for multimodal tasks as well as language tasks. It can be seen as extending ChatGPT-style interaction with voice and visual commands. Kosmos-1 was trained on web-scale multimodal corpora, which helps the model learn robustly from diverse sources.
Microsoft researchers also show that MLLMs benefit from cross-modal transfer, i.e., knowledge transfers from language to multimodal tasks and from multimodal tasks back to language.
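Multimodal in-context learning of the kind Kosmos-1 performs amounts to building a prompt that interleaves images and text, with a few demonstrations followed by the query. The helper below is a purely illustrative, hypothetical sketch of such a prompt structure; the segment format is an assumption and does not correspond to any real Kosmos-1 API.

```python
# Hypothetical helper: assemble a few-shot, interleaved image-text prompt of the
# kind an MLLM such as Kosmos-1 consumes. Segment format is an assumption.
def build_few_shot_prompt(examples, query_image):
    """examples: list of (image_path, label) pairs used as in-context demonstrations."""
    segments = [{"type": "text", "text": "Answer with the object shown in the image."}]
    for image_path, label in examples:
        segments.append({"type": "image", "path": image_path})        # demonstration image
        segments.append({"type": "text", "text": f"Answer: {label}"}) # its answer
    segments.append({"type": "image", "path": query_image})           # the query image
    segments.append({"type": "text", "text": "Answer:"})              # model completes this
    return segments

prompt = build_few_shot_prompt(
    examples=[("cat.jpg", "cat"), ("dog.jpg", "dog")],
    query_image="unknown.jpg",
)
for segment in prompt:
    print(segment)
```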
As we can see, all the tech giants have released their first models in the multimodal space. These multimodal models will soon find applications across almost every industry and domain, from medicine to manufacturing. Teams building on them must be careful, however, as the models carry potential bias and privacy/security pitfalls. They have opened a new era and made AI accessible to the general public, but as every coin has two sides, if not used responsibly they can cause real harm.
Contact: Joy Mustafi