Two Phase Modality Fusion

Introduction

Multimodal systems, whether user interfaces, robot controllers, or medical imaging machines, often reap the benefits of combining multiple input sources. Yet they also face the challenge of integrating disparate components. This integration process, known as modality fusion, involves combining information from different data sources into a single interpretation.

In many multimodal user interfaces (MMUIs), we must combine linguistic information derived from speech recognition or typed input with spatial input. This spatial input could come from a simple two-dimensional device like a mouse or pen, a vision system, a data glove, or a multi-touch screen with several degrees of freedom. In natural communication, humans tend to use the linguistic channel primarily for language and the spatial channel for spatial information, such as pointing to objects or indicating their shape. Linguistic information is discrete and structured in the form of language, while spatial information often conveys continuous, frequently multidimensional, values. These can include indicating a length (“this fish was this big” <gesture>), specifying a velocity (“make it go this fast” <gesture>), or describing a complex shape (“create a gourd that is this shape” <wave hands>).


In a virtual reality system, you could wave your hands around to form the general shape of an object. Combined with a spoken prompt, this could grant you the "God-like" powers of 3D asset creation that many of us desire.

Without a description of what the markings or hand movements are supposed to indicate, it would be very difficult for a system, or even a human, to interpret them. Some gesture-only user interfaces, like sign language interpretation systems, extract linguistic information from gestures; if you only have one channel, you will generally need to get both kinds of information from it. However, most people don’t want to learn sign language, and in natural interaction, people tend to use language for linguistic content and the spatial modality for continuously valued spatial content or for pointing at things. Nevertheless, in a multimodal user interface (MMUI), it may be necessary to classify a gesture to correctly extract the spatial information from it. For instance, if the user draws an "X" versus an arrow, we should use a different algorithm for determining the referent of the gesture.

A deictic gesture employed to select an object or location will have a different point of focus and potential level of ambiguity depending on its structure and on the conventions of the culture or domain in which the marking is used.

It may also be beneficial to develop multimodal user interfaces that extract continuously valued information from the speech channel. Significant research has focused on extracting emotional cues to enhance conversational systems. Additionally, capturing intonation and prosody can be highly useful for controlling virtual actors.

Approaches to Fusion

Each modality in a multimodal system typically undergoes multiple stages of processing before it is utilized. These stages can be conceptualized as a pipeline. In such systems, there are generally two pipelines that converge at a single node during the fusion process.

Two processing pipelines must eventually be merged at some point.

Natural language processing typically involves several stages:

· Lexical Analysis: Examining words individually.
· Syntactic Analysis: Organizing words into structures like sentences.
· Semantic Analysis: Extracting meanings in relation to context.
· Discourse Analysis: Interpreting words within the broader dialogue.
· Pragmatic Analysis: Considering real-world implications to determine the final action.

While these phases are crucial for understanding texts, command systems are generally simpler. Even fairly recently, many developers were still using simple template matching to select an intent from a few options and extract one or two entities, such as:

“Turn <on/off> the <appliance name>.” -> change_state(<on/off>, <appliance name>)

A more advanced parser can handle commands like:

“Delete all the red squares that are over one inch in size.”
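
To make the template-matching approach concrete, here is a minimal sketch in Python. The intents, patterns, and handler names are invented for illustration; a real system would have many more templates and better text normalization.

    # Minimal sketch of template-based intent matching (illustrative only).
    import re

    TEMPLATES = [
        (re.compile(r"turn (?P<state>on|off) the (?P<appliance>.+)", re.I),
         lambda m: ("change_state", m["state"].lower(), m["appliance"])),
        (re.compile(r"delete the (?P<target>.+)", re.I),
         lambda m: ("delete", m["target"])),
    ]

    def match_intent(utterance):
        text = utterance.strip().rstrip(".")
        for pattern, build in TEMPLATES:
            m = pattern.fullmatch(text)
            if m:
                return build(m)
        return None  # no template matched; fall back to a clarification dialog

    print(match_intent("Turn on the kitchen light."))
    # -> ('change_state', 'on', 'kitchen light')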

Modern large language models (LLMs) offer a similar sequence-to-sequence processing approach, effectively collapsing these phases into a largely opaque mechanism that converts transcripts into executable programs.
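
As a rough illustration of that collapse, the whole chain can be reduced to a single prompt. In the sketch below, call_llm is a placeholder for whichever model API you happen to use, not a real library call, and the command schema is invented for the example.

    # Sketch of using an LLM as an opaque transcript-to-command converter.
    # call_llm() is a stand-in for whatever chat/completion API is available.
    import json

    SYSTEM_PROMPT = (
        "Convert the user's utterance into a JSON command of the form "
        '{"action": ..., "arguments": {...}}. Use null for values the '
        "utterance leaves unspecified."
    )

    def utterance_to_command(transcript, call_llm):
        raw = call_llm(system=SYSTEM_PROMPT, user=transcript)
        return json.loads(raw)  # e.g. {"action": "rotate", "arguments": {"angle_deg": 30}}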

Most speech recognition-based agents today employ separate units for ASR and NLP, though there has been some work on end-to-end systems that take sound in and output function calls.

Multimodal system developers speak of early, mid-stage, and late fusion. Early fusion brings the pipelines together sooner, closer to the raw input. For example, we might convert the spatial input into tokens and then merge these with the tokens from the speech recognizer before parsing them, a form of early fusion. In my newest system, I am converting gestures into textual descriptions and concatenating them with the prompt before sending it to a pre-trained LLM.
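
A rough sketch of that gesture-description form of early fusion follows; the gesture classification and log format are crude invented placeholders.

    # Early fusion sketch: turn gestures into text and splice them into the prompt.
    def describe_gesture(g):
        # Crude, invented classification; real gesture recognition is far richer.
        if g["kind"] == "point":
            return f"[user points at ({g['x']:.0f}, {g['y']:.0f})]"
        if g["kind"] == "stroke":
            return f"[user draws a stroke roughly {g['length']:.0f} px long]"
        return "[unrecognized gesture]"

    def build_fused_prompt(transcript, gesture_log):
        gesture_text = " ".join(describe_gesture(g) for g in gesture_log)
        return f"{transcript}\n\nGestures made during this utterance: {gesture_text}"

    print(build_fused_prompt(
        "Rotate the square around its lower corner by 30 degrees",
        [{"kind": "point", "x": 120, "y": 340}],
    ))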

In a speech system, we could potentially have an even earlier form of fusion. When speech recognition engines were relatively primitive and required a lot of prior information, in the form of a small grammar, to correctly identify what the user was saying, systems would continually update this grammar at different points in the conversation. In one innovative system, clicking on different objects on the screen would load a particular grammar relevant to that object. This is analogous to loading the right-click menu into the ASR engine.
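
In pseudocode the idea looks something like the following; asr.load_grammar is a hypothetical method standing in for whatever the recognition engine actually exposes, and the grammars are invented.

    # Very early fusion sketch: a click narrows the recognizer's grammar.
    # asr.load_grammar() is hypothetical; real engines expose this differently
    # (e.g. SRGS grammars, phrase lists, or recognition-biasing hints).
    OBJECT_GRAMMARS = {
        "lamp":       ["turn on the lamp", "turn off the lamp", "dim the lamp"],
        "thermostat": ["set the thermostat to <number> degrees", "raise the temperature"],
    }

    def on_object_clicked(object_name, asr):
        phrases = OBJECT_GRAMMARS.get(object_name, [])
        asr.load_grammar(phrases)  # bias recognition toward commands for this object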

Alternatively, we could employ a very late fusion strategy. In the system I built for my dissertation, I used a parser to compile the natural language into a set of programs that, as part of their function, would query the log of gestures made during the turn and extract the desired parameters. Without considering the gestures, the natural language system could have some ambiguity as to how to process the command. It could therefore create multiple programs that, in addition to performing an action, would return a score indicating how well they had managed to use the accompanying markings. The hypothesis with the highest score would then be executed to perform what would hopefully be the desired action. (Note that these were not actual programs so much as complex objects with a “get_score” function and an “execute” function that reflected interpretations of what the user had said.)
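
A stripped-down sketch of that late-fusion arrangement is below; the class and method names follow the description above but are otherwise invented.

    # Late fusion sketch: each parse hypothesis scores itself against the gesture
    # log, and only the best-scoring hypothesis is executed.
    class Hypothesis:
        def __init__(self, description, score_fn, execute_fn):
            self.description = description
            self._score_fn = score_fn
            self._execute_fn = execute_fn

        def get_score(self, gesture_log):
            # How well does this interpretation account for the markings made this turn?
            return self._score_fn(gesture_log)

        def execute(self, gesture_log):
            return self._execute_fn(gesture_log)

    def run_best(hypotheses, gesture_log):
        best = max(hypotheses, key=lambda h: h.get_score(gesture_log))
        return best.execute(gesture_log)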

Multimodal input systems, whether used as interfaces or as sensors, can potentially provide a more accurate understanding of the world. An analysis of a second modality can resolve ambiguities, inaccuracies, or uncertainties found in the first. This requires that we retain those inaccuracies, ambiguities, and uncertainties until we can consider the additional information. The same thing happens even in a single-mode system: for instance, we can take the n-best results from a speech recognizer and then use our parser to act only on the one that actually makes sense for that application and its current state.

For example, a speech recognition engine may return this list of alternate results:

  1. Reheat the red chair
  2. Preheat the red chair
  3. Delete the red chair
  4. Reheat the dead chair
  5. Preheat the dead chair
  6. Delete the dead chair
  7. Reheat the red square
  8. Delete the red square
  9. Reheat the dead square
  10. Delete the dead square
  11. Preheat the dead bear
  12. Reheat that red bear
  13. Delete the dread bear

You can imagine how a graphics program should probably favor option 8, while a household control system might favor options 1 and 2. A cooking robot might consider option 11, and perhaps the cryogenic controls on an interstellar colonization ship might favor option 12 (though probably only in video games). Generally, ten options seem to be more than enough. In my experience, the first option is usually the best, and falling back to the second occasionally shaves a few points off your error rate.

In contrast, a simple dictation system would generally be better off displaying the most highly rated of the n-best results. However, if it had a good language model, it could potentially choose from among alternate results or even hypothesize what the user may have actually said or meant to say. (Note: n-best can also be useful in a dictation system to give the user alternatives as part of the recognition error recovery process.)

In a multimodal system, we can bring forward this indeterminacy from both modalities, cross-correlate them in a manner that reduces the uncertainty, and in some cases resolve a single interpretation. For instance, if there had been a red chair, a dead bear, and a dread bear on the screen of a drawing program, an accompanying gesture would have been able to indicate the correct object to delete.
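
A toy sketch of that cross-correlation: filter the recognizer's n-best list against the objects actually on screen, then prefer the candidate nearest the pointing gesture. The scene contents and scoring are invented for illustration.

    # Cross-modal disambiguation sketch: keep only n-best parses that refer to an
    # object actually present on screen, then prefer the one nearest the gesture.
    import math

    SCENE = {"red chair": (100, 200), "dead bear": (400, 250), "dread bear": (420, 260)}

    def resolve(n_best, gesture_xy):
        candidates = []
        for rank, text in enumerate(n_best):
            for name, (x, y) in SCENE.items():
                if name in text.lower():
                    dist = math.hypot(x - gesture_xy[0], y - gesture_xy[1])
                    candidates.append((dist, rank, text, name))
        if not candidates:
            return None
        _, _, text, name = min(candidates)
        return text, name

    print(resolve(["Delete the dread bear", "Delete the red chair"], (110, 205)))
    # -> ('Delete the red chair', 'red chair')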

(Note that we can represent ambiguities using an "n-best approach", where we maintain a list of alternatives that we can narrow down with further evidence. Alternatively, we could store incomplete hypotheses, such as a vector graphics representation with missing parameters, or assign probability distribution functions to parameters.)
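
For instance, an incomplete hypothesis could be represented as a partially filled structure; the field names below are invented, and the (mean, std) pair is just one way to stand in for a distribution over a parameter.

    # Sketch of an incomplete hypothesis: parameters are exact values, None
    # (still unknown), or a (mean, std) pair standing in for a distribution.
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class RotateHypothesis:
        target: Optional[str] = None                     # which object, if known yet
        pivot: Optional[Tuple[float, float]] = None      # point to rotate around
        angle_deg: Optional[Tuple[float, float]] = None  # (mean, std) estimated from the gesture

        def is_complete(self):
            return None not in (self.target, self.pivot, self.angle_deg)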

Sometimes, though, it may be desirable to iterate between the input modalities and gradually build an interpretation hypothesis by taking in a little information from one channel and then a little bit more from another modality.

The point of this post, however, is to make a case for merging modalities in an even more iterative process, in which different chunks of input are reanalyzed after an initial analysis of them has been combined with information from the other modality. This may have advantages when dealing with gestures that need to be combined with utterances in complex ways, with drawing understanding, and possibly with some generative image processing systems as well.

Consider how we can process an arc that the user makes on the screen. The interpretation of this marking will depend highly on what else is on the screen and what the user says while making it. In one situation, the user might draw it around the corner of a square and say, “Rotate the square around its lower corner by 30°,” as in the figure below.

The gesture, together with part of the utterance, helps select the object. While either might be ambiguous even in the diagram below, neither would be sufficient on its own in more complex situations, such as the first example in the subsequent diagram. Similarly, the gesture and the utterance together specify the corner around which the object should be rotated. Hand markings are not ideal for specifying extremely precise values, and arc gestures are difficult to make precisely; the linguistic channel, however, can specify angles very precisely, to fractions of an arc second if desired.

Technical terms like "counterclockwise" and "clockwise" can easily be confused, particularly in a three-dimensional domain where the frame of reference may be ambiguous. A spatial marking, however, can clarify this easily. In an animation program or simulation, the user might also specify the speed of rotation and possibly even the angular acceleration through the spatial channel. In a simulation, they might indirectly specify the coefficients of an axial spring to simulate how an object might move back and forth if perturbed by a certain amount.

Multimodal interaction allows the user to specify such motions and characteristics in a much more intuitive manner, though it does present some difficulties for the developer.


A marking made along with an utterance specifies which object to rotate, how much to rotate it by, and what point to rotate it around.


If only one object is present in the vicinity, the task should be straightforward. However, if there are multiple squares, we need to distinguish between the square with a circled lower corner and one with a circled upper corner. The same marking, in different utterances or with varying background graphics, could be interpreted as a connection line, a selection line or arc, or even a new geometric element. In these instances, we would need to extract additional parameters. Often, the utterance itself would guide the selection and parameterization of the extraction algorithm, and revisiting the gesture might be necessary.
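
To make the "revisit the gesture under guidance from the utterance" idea concrete, here is a small sketch. The scene representation, intent format, and extraction heuristics are all invented placeholders, not a real algorithm.

    # Sketch: the parsed utterance decides which extraction routine to run over
    # the raw stroke, and the stroke is revisited to fill in the parameters.
    import math

    def extract_rotation_params(stroke, scene):
        # Crude placeholder: pivot is the scene corner nearest the stroke's start,
        # and the angle is swept between the stroke's first and last points.
        px, py = min(scene["corners"], key=lambda c: math.dist(c, stroke[0]))
        a0 = math.atan2(stroke[0][1] - py, stroke[0][0] - px)
        a1 = math.atan2(stroke[-1][1] - py, stroke[-1][0] - px)
        return {"pivot": (px, py), "angle_deg": math.degrees(a1 - a0)}

    def extract_selection(stroke, scene):
        # Crude placeholder: select objects whose centers fall inside the stroke's bounding box.
        xs = [p[0] for p in stroke]
        ys = [p[1] for p in stroke]
        return [name for name, (cx, cy) in scene["objects"].items()
                if min(xs) <= cx <= max(xs) and min(ys) <= cy <= max(ys)]

    EXTRACTORS = {"rotate": extract_rotation_params, "select": extract_selection}

    def second_phase(intent, stroke, scene):
        # The utterance (already parsed into an intent) picks the routine used to
        # revisit the raw stroke and pull out the remaining parameters.
        return EXTRACTORS[intent["action"]](stroke, scene)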

Five alternative environments in which the exact same gesture, when combined with different utterances, could invoke radically different actions.

This becomes even more important when dealing with beautification. Sketch clean-up, or beautification, is the process by which the user draws something, and the system replaces it with a precise line drawing. Some systems even transform rough 2D markings into 3D geometric models. The interaction paradigm is similar to using a generative network to combine prompts with initial images. Extensive research was conducted in this field in the 1990s. Unfortunately, casually drawn sketches are almost impossible to understand without context. This context can be provided by selecting a very narrow domain, such as circuit layout or stuffed toy creation, or by offering controls to indicate how markings should be interpreted. The latter approach was cumbersome with menu-driven interfaces, while the former was limiting. However, using natural language descriptions to interpret strokes provides a more feasible solution. In fact, people often do this when having conversations over whiteboards.

In an envisioned multimodal vector drawing system, the user would draw a simple figure while simultaneously providing verbal context for its interpretation. They might say something like, "Put a block here with a hole in it." (In VR, the user could indicate a complex shape by sketching it in the air with multiple fingers at the same time.)

Our sequence-to-sequence NLP system could then be used to convert the utterance into an abstract interpretation that leaves uninitialized the numeric values needed to instantiate the model. We would then give this model to a vision system that would attempt to parameterize it by looking at the markings the user had made. If the NLP system did not reject all but one of the n-best speech results, we would have multiple abstract models to parameterize and then choose from. Worse yet, it is highly likely that the utterance would be sufficiently vague that the NLP system would return multiple potential abstract interpretations. There could be a combinatorially large number of these, and the parameterization/evaluation algorithm could be extremely computationally intensive. We may, however, be able to reduce the number of abstract models to fit against the image by using an initial fusion step in which generalized analyses of the markings are combined with the linguistic tokens.
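
A rough sketch of that parameterize-and-choose loop follows; fit_to_strokes is a stand-in for the (potentially expensive) vision routine described above, and the error-based selection is just one plausible criterion.

    # Sketch: try to parameterize every surviving abstract model against the
    # user's strokes, then keep the interpretation that fits best.
    def choose_interpretation(abstract_models, strokes, fit_to_strokes):
        best = None
        for model in abstract_models:
            params, fit_error = fit_to_strokes(model, strokes)
            if params is None:
                continue  # this interpretation could not be parameterized at all
            if best is None or fit_error < best[2]:
                best = (model, params, fit_error)
        return best  # (model, params, fit_error), or None if nothing fit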

A sketch of a "block with a hole in it" and a number of possible interpretations of the phrase. A multimodal beautification system must find the appropriate interpretation and parameterize the model from the sketch.

A while ago, I built a multimodal chart-parser-based system that performed the fusion in essentially two stages. Initially, gestures would be converted into tokens. This process might result in a set of alternative hypotheses. For instance, if the user made two lines, these could be converted into two line tokens, or, if they crossed, they might also be converted into an "X" marking. The parser would then attempt to combine these with the n-best results from the speech recognition engine. This would essentially create a number of competing evaluation/action scripts, though fewer than if we had not considered tokens from both streams, as in the system mentioned above. Then, in the second phase of fusion (essentially when the result objects were asked for scores), parameters would be extracted from the gestures to fill out the necessary fields in the command.

I am currently working on implementing a similar approach using LangChain’s agent capabilities. In this setup, the prompt will initially be enhanced with descriptions of gestures, and the agent will then use tools I provide to analyze the gesture log in more detail.
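
Whatever the framework specifics (which change frequently), the tools themselves are just functions over the gesture log. Here is a rough sketch of two such tools; the names, the toy log format, and the string-based interface are invented for illustration, not LangChain-specific code.

    # Sketch of the kind of gesture-log tools an agent could call in the second phase.
    GESTURE_LOG = []  # appended to by the input layer: dicts with "kind" and "points"

    def list_gestures(_: str = "") -> str:
        # Summarize the gestures made during the current turn.
        return "; ".join(f"{i}: {g['kind']} with {len(g['points'])} points"
                         for i, g in enumerate(GESTURE_LOG)) or "no gestures this turn"

    def gesture_bounding_box(index: str) -> str:
        # Return the bounding box of gesture <index> so references can be grounded.
        g = GESTURE_LOG[int(index)]
        xs = [p[0] for p in g["points"]]
        ys = [p[1] for p in g["points"]]
        return f"x: {min(xs)}-{max(xs)}, y: {min(ys)}-{max(ys)}"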

While this approach may not be necessary for most multimodal user interfaces, it could be quite beneficial for some applications. Additionally, the concept of dual or multi-phase modality fusion might be applicable to other multimodal systems, such as guidance systems. Though this is somewhat outside my usual scope, I hope it will inspire some of my readers.

In a future post, I plan to explore potential algorithms for parameterizing abstract models through vision and seek suggestions. Eventually, I aim to present my findings on multimodal sketch clean-up.
