Meet Alibaba's EMO: Emote Portrait Alive
Cover Art: Nidhin with Canva


In recent years, we have witnessed rapid advancements in the field of image and video generation.

One of the recent developments in this domain is EMO: Emote Portrait Alive, a framework introduced by Alibaba Group's Institute for Intelligent Computing.

EMO utilizes an audio2video diffusion model to generate expressive portrait videos with remarkable realism and accuracy, pushing the boundaries of what is possible in talking-head video generation.

Understanding the EMO Framework

Frames Encoding Method: EMO (Source)

The EMO framework is a two-stage process that combines audio and visual information to generate highly expressive portrait videos.

In the initial stage, called Frames Encoding, a neural network named ReferenceNet extracts features from a single reference image and from motion frames. This encoding lays the foundation for the subsequent diffusion process.
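EMO's code is not public, but a minimal PyTorch sketch can convey the idea of one shared encoder extracting features from both the reference image and the recent motion frames. Everything here (the ReferenceEncoder class, its layers, and the tensor shapes) is an illustrative assumption, not EMO's actual architecture:

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative stand-in for ReferenceNet: a small CNN that maps an
    image (or a motion frame) to a grid of feature vectors."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels * 2, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> features: (batch, 2*channels, H/4, W/4)
        return self.net(images)

encoder = ReferenceEncoder()
reference_image = torch.randn(1, 3, 256, 256)   # the single reference portrait
motion_frames = torch.randn(4, 3, 256, 256)     # a few preceding frames
ref_features = encoder(reference_image)         # identity features
motion_features = encoder(motion_frames)        # short-term motion context
```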

During the Diffusion Process stage, EMO utilizes a pretrained audio encoder to produce the audio embedding. A facial region mask is integrated with multi-frame noise, which governs the generation of facial imagery.
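Here is a hedged sketch of how those conditioning signals might be assembled. The speech encoder's output is faked with random tensors, and multiplying the mask into the noise is just one plausible reading of "integrated", not EMO's documented formulation:

```python
import torch

# Stand-in for a pretrained speech encoder's output: one embedding per
# video frame (num_frames and audio_dim are illustrative values).
num_frames, audio_dim = 16, 768
audio_embedding = torch.randn(num_frames, audio_dim)

# Multi-frame noise for the diffusion process: one latent per frame.
latent_c, latent_h, latent_w = 4, 32, 32
noise = torch.randn(num_frames, latent_c, latent_h, latent_w)

# Facial region mask (1 inside the face region, 0 elsewhere), broadcast
# over frames and channels so generation is steered toward the face.
face_mask = torch.zeros(1, 1, latent_h, latent_w)
face_mask[:, :, 8:24, 8:24] = 1.0
masked_noise = noise * face_mask
```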

The Backbone Network, incorporating Reference-Attention and Audio-Attention mechanisms, plays a crucial role in preserving the character’s identity and modulating their movements.
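Cross-attention is the standard mechanism for injecting such conditioning into a diffusion backbone. The sketch below uses torch.nn.MultiheadAttention to stand in for the two mechanisms; every dimension and token count is invented for illustration:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8

# Two cross-attention layers standing in for Reference-Attention (ties
# each frame to the identity features) and Audio-Attention (ties each
# frame to the audio embedding).
reference_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
audio_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

frame_tokens = torch.randn(1, 1024, embed_dim)  # latent tokens for one frame
ref_tokens = torch.randn(1, 1024, embed_dim)    # tokens from ReferenceNet
audio_tokens = torch.randn(1, 16, embed_dim)    # projected audio embeddings

# Frames query the reference features to preserve identity...
x, _ = reference_attn(frame_tokens, ref_tokens, ref_tokens)
# ...then query the audio features to drive mouth and head motion.
x, _ = audio_attn(x, audio_tokens, audio_tokens)
```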

Additionally, Temporal Modules are employed to manipulate the temporal dimension and adjust the velocity of motion.
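A common way to build such a temporal module in video diffusion models is self-attention along the frame axis, so each spatial location exchanges information across time. The sketch below shows that generic pattern; EMO's exact temporal layer may differ:

```python
import torch
import torch.nn as nn

frames, tokens, dim = 16, 1024, 256
temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

x = torch.randn(frames, tokens, dim)      # (time, space, channels)
x_t = x.permute(1, 0, 2)                  # (space, time, channels)
x_t, _ = temporal_attn(x_t, x_t, x_t)     # attend across frames per location
x = x_t.permute(1, 0, 2)                  # back to (time, space, channels)
```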

The combination of these innovative techniques enables EMO to generate vocal avatar videos with expressive facial expressions, various head poses, and any duration depending on the length of the input audio.
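Concretely, at a fixed frame rate the number of frames to generate follows directly from the audio length (the 25 fps below is an assumed value for illustration, not a figure from the paper):

```python
import math

fps = 25            # assumed output frame rate
duration_s = 12.8   # length of the input audio clip in seconds
n_frames = math.ceil(duration_s * fps)
print(n_frames)     # 320 frames for this clip
```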

Vocal Avatar Generation

EMO goes beyond traditional talking head videos by introducing the concept of vocal avatar generation.

By inputting a single character image and vocal audio, such as singing, EMO can generate vocal avatar videos with expressive facial expressions, various head poses, and any duration based on the length of the input audio.

Singing Avatars

EMO can generate singing avatars that convincingly mimic the facial expressions and head movements of the reference character.

Multilingual and Multicultural Expressions

EMO supports songs in various languages and brings diverse portrait styles to life. By intuitively recognizing tonal variations in the audio, it can generate dynamic, expression-rich avatars that reflect the cultural nuances of different languages.

Talking with Different Characters

The EMO framework can accommodate spoken audio in various languages and animate portraits from bygone eras, paintings, 3D models, and AI-generated content.

By infusing these characters with lifelike motion and realism, EMO expands the possibilities of character portrayal in multilingual and multicultural contexts.

Training and Dataset

The EMO model was trained on a dataset of over 250 hours of footage and more than 150 million images.

This dataset includes footage from television interviews and singing performances, covering multiple languages.

Qualitative Comparison


The figure shows a visual comparison between the EMO method and previous approaches. When given a single reference image, Wav2Lip often produces videos with blurry mouth regions and static head poses, lacking eye movement.

DreamTalk’s supplied style clips may distort the original face, limiting facial expressions and the dynamism of head movement. In contrast, EMO outperforms SadTalker and DreamTalk by generating a broader range of head movements and more dynamic facial expressions. EMO drives character motion directly from audio, without relying on intermediate signals such as blend shapes.

Qualitative comparisons with several talking-head generation works (Source)

Limitations

While EMO demonstrates impressive capabilities in generating expressive portrait videos, there are still limitations to be addressed.

The framework relies heavily on the quality of the input audio and reference image, and improvements in audio-visual synchronization can further enhance the realism of the generated videos.

Code

When we tried to access the EMO GitHub repository, we found that no code is available, and many issues have been opened asking about it. It may have been taken down.

The repository mentions that this project is intended solely for academic research and effect demonstration.

Reference Link: https://humanaigc.github.io/emote-portrait-alive/

Github Link: https://github.com/HumanAIGC/EMO
