An Emerging Frontier in AI: Unlocking Non-Human Modalities for Domain Experts
It is unsurprising that the ongoing AI revolution is intertwined with language models. As Yuval Harari points out [1], "Language serves as the operating system of human culture," and in the last couple of years we have witnessed AI achieve language mastery beyond human norms. Humanity's worldview and reasoning are intricately tied to language; hence, we marveled at ChatGPT's ability to write essays and poems, and DALL-E's knack for creating pictures of avocado-shaped armchairs. Somehow, language skills felt much more striking than similarly impressive advancements in non-verbal AI, such as the complex interplay of sensing and action in self-driving cars or AlphaFold's effective solution [2] to the protein folding problem. I believe, though, that language is not only making progress more noticeable: by integrating with modalities beyond human senses, it will accelerate progress beyond pure language and image applications.
The recent surge of interest centers on multi-modal models that fuse language with various human senses such as image, audio, and video. Much like their language counterparts, these models exhibit promising potential across industries such as film, gaming, education, music and video production, and ads and marketing. Yet an overlooked aspect, and the topic of this article, is achieving a similar level of performance for modalities beyond the human-perceived ones currently being explored, such as audio and video. There is no intrinsic difference between a numerical representation of air pressure waves (audio) and one of vibrations in the structure of a bridge, or even a non-physical time series such as fluctuations in stock prices. These modalities are simply data representations of physical and abstract processes that humans did not evolve to sense directly (they are meta-anthropic, if you will), but, from a data perspective, they are analogous to encodings of audio or video.
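To make that point concrete, here is a minimal Python sketch, with made-up signals and sample rates rather than real data, showing that once digitized, an audio clip and a bridge accelerometer trace are exactly the same kind of object: a one-dimensional array of floating-point samples.

```python
import numpy as np

# A minimal sketch: both "modalities" below are just sampled 1-D signals.
# The sample rates and signal shapes are illustrative assumptions, not real data.

sample_rate_audio = 44_100          # audio: air-pressure samples per second
sample_rate_bridge = 100            # bridge accelerometer: samples per second

t_audio = np.linspace(0, 1.0, sample_rate_audio, endpoint=False)
t_bridge = np.linspace(0, 60.0, 60 * sample_rate_bridge, endpoint=False)

audio = 0.5 * np.sin(2 * np.pi * 440 * t_audio)        # a 440 Hz tone
bridge = 0.01 * np.sin(2 * np.pi * 2.2 * t_bridge)     # a ~2 Hz structural vibration
bridge += 0.001 * np.random.randn(len(t_bridge))       # plus sensor noise

# From the model's perspective, both are the same kind of object:
print(audio.dtype, audio.shape)    # float64 (44100,)
print(bridge.dtype, bridge.shape)  # float64 (6000,)
```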
The slow adoption of AI in organizations stems from two key issues, among others: the skills gap in combining AI and domain expertise, and the high cost of high-quality training data. These compound with the fact that most traditional ML models target very specific use cases, each requiring both AI and domain-specific skills along with curated, use-case-specific labeled data. As we have seen with large language models, foundational models can be applied to many use cases with low-cost prompt engineering or, at most, fine-tuning. Training foundational multi-modal models that combine domain-specific modalities with large language models can lower the bar for domain experts without deep AI skills to apply AI to their specific use cases. Just as important, foundational multi-modal models can take advantage of training the two modalities independently on large corpora of unlabeled data, fusing the two aspects with a limited amount of labeled data that combines them. Multi-modal models are in a position to democratize the use of AI beyond language, audio, and video.
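As a rough illustration of that training recipe, the PyTorch sketch below uses stand-in encoders and invented dimensions (none of this reflects a specific real system): two encoders, assumed to have been pretrained separately on unlabeled sensor and text data, are frozen, and only a small fusion head is trained on the scarce paired, labeled examples.

```python
import torch
import torch.nn as nn

# Hedged sketch: two encoders pretrained separately on unlabeled data
# (stubbed here as simple modules), joined by a small fusion head trained
# on a limited set of paired, labeled examples. Names and sizes are
# illustrative assumptions.

class TimeSeriesEncoder(nn.Module):
    """Stand-in for a sensor-modality encoder pretrained on unlabeled time series."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Conv1d(1, d_model, kernel_size=9, padding=4)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, x):                 # x: (batch, length)
        h = self.conv(x.unsqueeze(1))     # (batch, d_model, length)
        return self.pool(h).squeeze(-1)   # (batch, d_model)

class TextEncoder(nn.Module):
    """Stand-in for a language model's embedding of an inspection report."""
    def __init__(self, vocab_size=10_000, d_model=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, d_model)

    def forward(self, token_ids):         # token_ids: (batch, seq_len)
        return self.embed(token_ids)      # (batch, d_model)

class FusionHead(nn.Module):
    """Small head trained on the scarce paired data; the encoders stay frozen."""
    def __init__(self, d_model=128, n_classes=3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, ts_emb, txt_emb):
        return self.classifier(torch.cat([ts_emb, txt_emb], dim=-1))

ts_encoder, txt_encoder, head = TimeSeriesEncoder(), TextEncoder(), FusionHead()
for p in list(ts_encoder.parameters()) + list(txt_encoder.parameters()):
    p.requires_grad = False               # pretrained parts are frozen

sensors = torch.randn(4, 6000)                 # 4 vibration windows (fake data)
reports = torch.randint(0, 10_000, (4, 32))    # 4 tokenized report snippets (fake data)
logits = head(ts_encoder(sensors), txt_encoder(reports))
print(logits.shape)                            # torch.Size([4, 3])
```

The point of the design is that the expensive, data-hungry parts are trained once on unlabeled corpora, while the only component that needs labeled, cross-modal examples is the small head at the end.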
Let us dive into a concrete example to illustrate this: structural monitoring of buildings and infrastructure [3]. In this scenario, sensors provide data points on vibration, temperature, and humidity for physical structures, and critical conditions are described in a large corpus of existing structural inspection documents, building codes, and textbooks. A foundational model combining the two modalities, time series analysis for the sensor data and language models for the inspection and building knowledge, enables domain experts with limited AI skills to apply the model to use cases such as safety analysis of structures, root-cause analysis of failures, and recommendations for maintenance procedures.
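To get a feel for what "applying the model" could look like for a structural engineer, here is a hedged sketch: summarize a sensor window numerically and pose a plain-language question against it. The `multimodal_model.generate` call is a hypothetical interface used only for illustration, not an existing API.

```python
import numpy as np

def summarize_window(vibration: np.ndarray, temp_c: float, humidity: float) -> str:
    """Turn a raw sensor window into a short numeric summary for the prompt."""
    rms = float(np.sqrt(np.mean(vibration ** 2)))
    peak = float(np.max(np.abs(vibration)))
    return (f"vibration RMS={rms:.4f} g, peak={peak:.4f} g, "
            f"temperature={temp_c:.1f} C, humidity={humidity:.0f}%")

window = 0.01 * np.random.randn(6000)   # illustrative accelerometer window
prompt = (
    "Sensor summary for span B of the bridge over the last hour: "
    + summarize_window(window, temp_c=4.5, humidity=82)
    + "\nIs this consistent with normal operation, and if not, "
      "what inspection steps do the relevant codes suggest?"
)
# response = multimodal_model.generate(prompt, sensor_window=window)  # hypothetical call
print(prompt)
```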
There are two analogies with image and text models that can make the possibilities clearer. Today, we have models that can detect people's mood from an image based on facial expression and body language. This would be analogous to looking at the sensor data coming from an instrumented bridge and detecting, with a confidence level, whether the bridge is “unhappy,” in the sense of being in need of maintenance. Even more interesting is the combination of the sensor modality with the knowledge in the language model. In this analogy, we consider the emergent capabilities [4, 5] of large multi-modal models such as GPT-4, where the model can understand what an image represents and combine it with its own knowledge. This allows a model to take data from sensors and determine not only that the data is anomalous and the structure's health is at risk, but also to identify likely causes along with maintenance recommendations based on past activities and operating procedures.
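A toy version of that "unhappy bridge" detector can be written in a few lines: score a new vibration window against a baseline of healthy windows and squash the deviation into a rough confidence that maintenance attention is needed. The baseline data and threshold below are illustrative assumptions; a real foundational model would learn far richer patterns than a single RMS statistic.

```python
import numpy as np

# Minimal sketch of the "is the bridge unhappy?" analogy.
# Baseline values and the z-score threshold are illustrative assumptions.

rng = np.random.default_rng(0)
healthy_rms = 0.010 + 0.001 * rng.standard_normal(500)   # RMS of known-healthy windows

def unhappiness(window: np.ndarray) -> float:
    """Return a 0..1 score: how far this window's RMS sits outside the healthy baseline."""
    rms = np.sqrt(np.mean(window ** 2))
    z = (rms - healthy_rms.mean()) / healthy_rms.std()
    # Squash the z-score into a 0..1 "confidence" via a logistic curve centered at z = 3.
    return float(1.0 / (1.0 + np.exp(-(z - 3.0))))

normal_window = 0.010 * rng.standard_normal(6000)
degraded_window = 0.025 * rng.standard_normal(6000)       # stronger vibration
print(f"normal:   {unhappiness(normal_window):.2f}")      # close to 0
print(f"degraded: {unhappiness(degraded_window):.2f}")    # close to 1
```

What the toy detector cannot do, and what the fused language modality adds, is the second half of the analogy: explaining why the reading is anomalous and what the relevant inspection procedures recommend.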
Meta-anthropic modalities can benefit many industries, with the greatest impact in those where the skills gap and the lack of clean training data are most acute. Besides the structural monitoring example, healthcare and finance are two large industries with plenty of existing data for meta-anthropic modalities (e.g., medical sensor data and exams, stock price time series), a large corpus of natural language knowledge, and, critically, natural language interpretations of the non-language data. After a first wave of large, data-rich companies building specialized foundational LLMs, I expect meta-anthropic modalities to provide the next opportunity for AI impact and democratization. What other industries do you think could revolutionize their operations by embracing AI beyond traditional human-perceived modalities?