What’s New in Deep Learning Research: Using Cross-Modal Learning to Build Neural Networks that See and Listen

From the time we are babies, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio and text. While listening to a symphony, we immediately visualize an orchestra; when admiring a landscape painting, our brain associates the visuals with specific sounds. The relationships between images, sounds and texts are dictated by connections between different sections of the brain responsible for analyzing specific cognitive input. In that sense, you can say that we are hardwired to learn simultaneously from multiple cognitive signals. Despite the advancements in different deep learning areas such as image, language and sound analysis, most neural networks remain specialized in a single input data type. Recently, researchers from Alphabet’s subsidiary DeepMind published a research paper proposing a method that can simultaneously analyze audio and visual inputs and learn the relationships between objects and sounds in a common environment.

Under the title “Objects that Sound”, the DeepMind research paper centers on a subdiscipline known as cross-modal learning, which studies the hidden relationships between images, sounds and text. Cross-modal learning has seen some success in the image-text relationship area, but very little has been done in terms of models that can correlate images and sounds. The explanation for that is simple: text is much closer to a semantic annotation than audio. When analyzing the text form of a provided caption of an image, the objects are directly available and the problem is then to provide a correspondence between the noun and a spatial region in the image. In the case of audio, obtaining the semantics is less direct. Think about the differences between classifying an image as to whether it contains a dog or not, and classifying an audio clip as to whether it contains the sound of a dog or not.

The traditional approach to cross-modal learning problems has been to use teacher-student supervision networks where the “teacher” has been trained using a large number of human annotations. For instance, a vision network trained on ImageNet can be used to annotate frames of a YouTube video as “acoustic guitar”, which provides training data to the “student” audio network for learning what an “acoustic guitar” sounds like. The challenge with the teacher-student approach is that images and audio are not processed along the same time and space sequence, which introduces many contextual variances. Additionally, teacher-student models are notoriously expensive to implement at scale, as they require large curated training datasets.
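To make the teacher-student setup concrete, here is a minimal, hypothetical sketch (not from the paper): a frozen, pretrained vision “teacher” pseudo-labels video frames, and an audio “student” is trained to predict those labels from the corresponding spectrograms. The networks, shapes and helper names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical teacher-student sketch: a frozen vision "teacher" pseudo-labels
# video frames, and an audio "student" learns to predict those labels from audio.

def pseudo_label(teacher: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Run the frozen vision teacher on video frames to obtain class pseudo-labels."""
    with torch.no_grad():
        logits = teacher(frames)        # (batch, num_classes)
    return logits.argmax(dim=1)         # hard labels used to supervise the student

def train_student_step(student: nn.Module, optimizer, spectrograms, labels):
    """One supervised step for the audio student using the teacher's pseudo-labels."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(student(spectrograms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```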

AVC & AVE-Net

To address the limitations of teacher-student models, the DeepMind team relied on a form of cross-modal learning known as audio-visual correspondence (AVC). The AVC method takes an input pair of a video frame and 1 second of audio and tries to decide whether they are in correspondence or not. Using the previous analogy, the AVC model will train both visual and audio networks from scratch, enabling the concept of the “acoustic guitar” to naturally emerge in both modalities.
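The training signal for AVC can be generated without any human labels: a positive pair takes a frame and the 1-second audio clip from the same moment of the same video, while a negative pair mismatches them. A minimal sketch of that pairing logic, with `clip_audio` and `sample_frame` assumed as hypothetical data-loading helpers, might look like this:

```python
import random
import torch

def make_avc_pair(videos, clip_audio, sample_frame):
    """Build one (frame, 1-second spectrogram, label) AVC training example.

    videos: list of video identifiers; clip_audio(v, t) and sample_frame(v, t)
    are assumed helpers returning a 1-second spectrogram and a frame at time t
    of video v. Label 1 = corresponding pair, 0 = mismatched pair.
    """
    v = random.choice(videos)
    t = random.random()                              # normalized time within the clip
    frame = sample_frame(v, t)

    if random.random() < 0.5:
        audio, label = clip_audio(v, t), 1           # same video, same time -> positive
    else:
        v_neg = random.choice([u for u in videos if u != v])
        audio, label = clip_audio(v_neg, random.random()), 0   # mismatched -> negative
    return frame, audio, torch.tensor(label)
```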

The specific AVC model introduced in the DeepMind paper is called the Audio-Visual Embedding Network (AVE-Net), which takes an input dataset formed by pairs of images and 1-second audio spectrograms. The model processes the input using audio and visual subnetworks, followed by a fusion layer that tries to determine whether there is a relationship between the image and the sound. The following diagram illustrates the AVE-Net neural network architecture.
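The following is a simplified PyTorch-style sketch of that idea, not DeepMind’s exact architecture: two subnetworks embed the image and the spectrogram into a shared space, and the distance between the normalized embeddings is turned into a two-way correspondence score. The encoder depths and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AVENetSketch(nn.Module):
    """Simplified AVE-Net-style model: image and audio subnetworks produce
    L2-normalized embeddings; their Euclidean distance feeds a tiny classifier
    that predicts correspond / not-correspond. Encoder details are assumptions."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Placeholder convolutional encoders; the real model uses deeper CNNs.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        self.classifier = nn.Linear(1, 2)   # distance -> {match, no match}

    def forward(self, image, spectrogram):
        v = nn.functional.normalize(self.vision(image), dim=1)
        a = nn.functional.normalize(self.audio(spectrogram), dim=1)
        dist = torch.norm(v - a, dim=1, keepdim=True)   # Euclidean distance
        return self.classifier(dist), v, a
```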

By processing audio and visuals concurrently, the AVE-Net model can use a simple Euclidean distance technique to determine the relationships between the embeddings of the two subnetworks (audio and images). Initial tests showed that AVE-Net was remarkably effective at detecting bidirectional correlations between objects and sounds, as clearly seen in the following videos.
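Because both modalities live in the same embedding space, those bidirectional lookups reduce to a nearest-neighbour search over the learned embeddings. A hypothetical query in either direction (audio-to-image or image-to-audio) could look like this:

```python
import torch

def retrieve(query_embedding: torch.Tensor, candidates: torch.Tensor, k: int = 5):
    """Return indices of the k closest candidate embeddings by Euclidean distance.

    query_embedding: (dim,) e.g. an audio embedding from the audio subnetwork.
    candidates:      (n, dim) e.g. image embeddings; works in either direction.
    """
    dists = torch.cdist(query_embedding.unsqueeze(0), candidates).squeeze(0)
    return torch.topk(dists, k, largest=False).indices
```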

In the first wave of experiments, AVE-Net outperformed traditional cross-modal learning models by a wide margin across different environments.

Can You Tell Me What Object Makes This Sound?

Parents constantly ask babies to imitate the sounds of different objects or animals. Cognitively, this is a good exercise to develop cross-modal learning abilities in infants. The AVE-Net architecture shown in the previous section proved effective at determining correlations between the image and sound domains, but it is still not able to identify which object within an image or video frame produces a specific sound. To address that challenge, the DeepMind team created a variation of the AVE-Net model that goes a level of granularity deeper and tries to correlate regions/objects within an image to specific sounds. Called the Audio-Visual Object Localization network (AVOL-Net), the model takes an image and sound pair and tries to find regions of the image which explain a specific sound, while other regions should not be correlated with it and belong to the background.

The AVOL-Net architecture looks similar to the AVE-Net model, with a variation in the visual network that generates a grid of visual embeddings corresponding to different regions of the image. The similarities between the audio embedding and all vision embeddings reveal the location of the object that makes the sound, while the maximal similarity is used as the correspondence score.
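A hypothetical sketch of that localization step, assuming the visual branch outputs a grid of per-region embeddings and the audio branch a single embedding for the 1-second clip, is shown below; the shapes and normalization choices are illustrative assumptions rather than the paper’s exact formulation.

```python
import torch
import torch.nn as nn

def localize(visual_grid: torch.Tensor, audio_embedding: torch.Tensor):
    """AVOL-Net-style localization sketch (shapes are assumptions).

    visual_grid:     (batch, h, w, dim) per-region visual embeddings.
    audio_embedding: (batch, dim) single embedding for the 1-second audio clip.
    Returns a similarity map over image regions and a per-pair correspondence score.
    """
    v = nn.functional.normalize(visual_grid, dim=-1)
    a = nn.functional.normalize(audio_embedding, dim=-1)
    # Similarity of the audio embedding to every spatial location of the image.
    sim_map = torch.einsum('bhwd,bd->bhw', v, a)
    # The most audio-like region drives the overall correspondence score.
    score = sim_map.flatten(1).max(dim=1).values
    return sim_map, score
```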

The following video shows the effectiveness of the AVOL-Net model in identifying the specific objects within an image that correspond to a target sound.

Cross-modal learning is still in its infancy, but methods such as AVE-Net and AVOL-Net represent major milestones in this area of deep learning. Both techniques were able to learn semantic relationships between images and sounds operating in the same environment, while the AVOL-Net model was able to correlate sounds to specific objects within an image. Methods such as AVE-Net and AVOL-Net can become extremely relevant for artificial intelligence (AI) agents that operate in real-world environments.
