What’s New in Deep Learning Research: Using Cross-Modal Learning to Build Neural Networks that See and Listen

From the time we are babies, we intuitively develop the ability to correlate the input from different cognitive sensors such as vision, audio and text. While listening to a symphony, we immediately visualize an orchestra; when admiring a landscape painting, our brain associates the visuals with specific sounds. The relationships between images, sounds and texts are dictated by connections between different sections of the brain responsible for analyzing specific cognitive input. In that sense, you can say that we are hardwired to learn simultaneously from multiple cognitive signals. Despite the advancements in different deep learning areas such as image, language and sound analysis, most neural networks remain specialized in a single input data type. Recently, researchers from Alphabet’s subsidiary DeepMind published a research paper proposing a method that can simultaneously analyze audio and visual inputs and learn the relationships between objects and sounds in a common environment.

Under the title “Objects that Sound”, the DeepMind research paper centers on a subdiscipline known as cross-modal learning, which studies the hidden relationships between images, sounds and text. Cross-modal learning has seen some success in the image-text relationship area, but very little has been done in terms of models that can correlate images and sounds. The explanation for that is simple: text is much closer to a semantic annotation than audio. When analyzing the text form of a provided caption of an image, the objects are directly available and the problem is then to provide a correspondence between the noun and a spatial region in the image. In the case of audio, obtaining the semantics is less direct. Think about the differences between classifying an image as to whether it contains a dog or not, and classifying an audio clip as to whether it contains the sound of a dog or not.

The traditional approach to cross-modal learning problems has been to use teacher-student supervision networks where the “teacher” has been trained using a large number of human annotations. For instance, a vision network trained on ImageNet can be used to annotate frames of a YouTube video as “acoustic guitar”, which provides training data to the “student” audio network for learning what an “acoustic guitar” sounds like. The challenge with the teacher-student approach is that images and audio are not processed along the same time and space sequence, which introduces many contextual variances. Additionally, teacher-student models are notoriously expensive to implement at scale, as they require large curated training datasets.
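To make the teacher-student setup concrete, here is a minimal, hypothetical sketch (not from the paper): a frozen, pretrained vision “teacher” pseudo-labels video frames, and an audio “student” is trained to predict those labels from the corresponding spectrograms. The networks, shapes and helper names are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical teacher-student sketch: a frozen vision "teacher" pseudo-labels
# video frames, and an audio "student" learns to predict those labels from audio.

def pseudo_label(teacher: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Run the frozen vision teacher on video frames to obtain class pseudo-labels."""
    with torch.no_grad():
        logits = teacher(frames)        # (batch, num_classes)
    return logits.argmax(dim=1)         # hard labels used to supervise the student

def train_student_step(student: nn.Module, optimizer, spectrograms, labels):
    """One supervised step for the audio student using the teacher's pseudo-labels."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(student(spectrograms), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```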

AVC & AVE-Net

To address the limitations of teacher-student models, the DeepMind team relied on a form of cross-modal learning known as audio-visual correspondence (AVC). The AVC method takes an input pair of a video frame and 1 second of audio and tries to decide whether they are in correspondence or not. Using the previous analogy, the AVC model will train both visual and audio networks from scratch, enabling the concept of the “acoustic guitar” to naturally emerge in both modalities.
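The training signal for AVC can be generated without any human labels: a positive pair takes a frame and the 1-second audio clip from the same moment of the same video, while a negative pair mismatches them. A minimal sketch of that pairing logic, with `clip_audio` and `sample_frame` assumed as hypothetical data-loading helpers, might look like this:

```python
import random
import torch

def make_avc_pair(videos, clip_audio, sample_frame):
    """Build one (frame, 1-second spectrogram, label) AVC training example.

    videos: list of video identifiers; clip_audio(v, t) and sample_frame(v, t)
    are assumed helpers returning a 1-second spectrogram and a frame at time t
    of video v. Label 1 = corresponding pair, 0 = mismatched pair.
    """
    v = random.choice(videos)
    t = random.random()                              # normalized time within the clip
    frame = sample_frame(v, t)

    if random.random() < 0.5:
        audio, label = clip_audio(v, t), 1           # same video, same time -> positive
    else:
        v_neg = random.choice([u for u in videos if u != v])
        audio, label = clip_audio(v_neg, random.random()), 0   # mismatched -> negative
    return frame, audio, torch.tensor(label)
```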

The specific AVC model introduced in the DeepMind paper is called the Audio-Visual Embedding Network (AVE-Net), which takes an input dataset formed by pairs of images and 1-second audio spectrograms. The model processes the input using audio and visual subnetworks, followed by a fusion layer that tries to determine whether there is a relationship between the image and the sound. The following diagram illustrates the AVE-Net neural network architecture.
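The following is a simplified PyTorch-style sketch of that idea, not DeepMind’s exact architecture: two subnetworks embed the image and the spectrogram into a shared space, and the distance between the normalized embeddings is turned into a two-way correspondence score. The encoder depths and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AVENetSketch(nn.Module):
    """Simplified AVE-Net-style model: image and audio subnetworks produce
    L2-normalized embeddings; their Euclidean distance feeds a tiny classifier
    that predicts correspond / not-correspond. Encoder details are assumptions."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Placeholder convolutional encoders; the real model uses deeper CNNs.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        self.classifier = nn.Linear(1, 2)   # distance -> {match, no match}

    def forward(self, image, spectrogram):
        v = nn.functional.normalize(self.vision(image), dim=1)
        a = nn.functional.normalize(self.audio(spectrogram), dim=1)
        dist = torch.norm(v - a, dim=1, keepdim=True)   # Euclidean distance
        return self.classifier(dist), v, a
```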

By processing audio and visuals concurrently, the AVE-Net model can use a simple Euclidean distance technique to determine the relationships between the embeddings of the two subnetworks (audio and images). Initial tests showed that AVE-Net was remarkably effective at detecting bidirectional correlations between objects and sounds, as clearly seen in the following videos.
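Because both modalities live in the same embedding space, those bidirectional lookups reduce to a nearest-neighbour search over the learned embeddings. A hypothetical query in either direction (audio-to-image or image-to-audio) could look like this:

```python
import torch

def retrieve(query_embedding: torch.Tensor, candidates: torch.Tensor, k: int = 5):
    """Return indices of the k closest candidate embeddings by Euclidean distance.

    query_embedding: (dim,) e.g. an audio embedding from the audio subnetwork.
    candidates:      (n, dim) e.g. image embeddings; works in either direction.
    """
    dists = torch.cdist(query_embedding.unsqueeze(0), candidates).squeeze(0)
    return torch.topk(dists, k, largest=False).indices
```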

In the first wave of experiments, AVE-Net outperformed traditional cross-modal learning models by a wide margin across different environments.

Can You Tell Me What Object Makes This Sound?

Parents constantly ask babies to imitate the sounds of different objects or animals. Cognitively, this is a good exercise to develop cross-modal learning abilities in infants. The AVE-Net architecture shown in the previous section proved effective at determining correlations between the image and sound domains, but it is still not able to identify which object within an image or video frame produces a specific sound. To address that challenge, the DeepMind team created a variation of the AVE-Net model that goes a level of granularity deeper and tries to correlate regions/objects within an image to specific sounds. Called the Audio-Visual Object Localization network (AVOL-Net), the model takes an image and sound pair and tries to find regions of the image which explain a specific sound, while other regions should not be correlated with it and belong to the background.

The AVOL-Net architecture looks similar to the AVE-Net model, with a variation in the visual network that generates a grid of visual embeddings corresponding to different regions of the image. The similarities between the audio embedding and all vision embeddings reveal the location of the object that makes the sound, while the maximal similarity is used as the correspondence score.
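A hypothetical sketch of that localization step, assuming the visual branch outputs a grid of per-region embeddings and the audio branch a single embedding for the 1-second clip, is shown below; the shapes and normalization choices are illustrative assumptions rather than the paper’s exact formulation.

```python
import torch
import torch.nn as nn

def localize(visual_grid: torch.Tensor, audio_embedding: torch.Tensor):
    """AVOL-Net-style localization sketch (shapes are assumptions).

    visual_grid:     (batch, h, w, dim) per-region visual embeddings.
    audio_embedding: (batch, dim) single embedding for the 1-second audio clip.
    Returns a similarity map over image regions and a per-pair correspondence score.
    """
    v = nn.functional.normalize(visual_grid, dim=-1)
    a = nn.functional.normalize(audio_embedding, dim=-1)
    # Similarity of the audio embedding to every spatial location of the image.
    sim_map = torch.einsum('bhwd,bd->bhw', v, a)
    # The most audio-like region drives the overall correspondence score.
    score = sim_map.flatten(1).max(dim=1).values
    return sim_map, score
```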

The following video shows the effectiveness of the AVOL-Net model in identifying the specific objects within an image that correspond to a target sound.

Cross-modal learning is still in its infancy, but methods such as AVE-Net and AVOL-Net represent major milestones in this area of deep learning. Both techniques were able to learn semantic relationships between images and sounds operating in the same environment, while the AVOL-Net model was able to correlate sounds to specific objects within an image. Methods such as AVE-Net and AVOL-Net can become extremely relevant for artificial intelligence (AI) agents that operate in real-world environments.
