Meet Alibaba's EMO: Emote Portrait Alive
Cover Art: Nidhin with Canva


In recent years, we have witnessed rapid advancements in the field of image and video generation.

One of the recent developments in this domain is EMO: Emote Portrait Alive, a framework introduced by Alibaba Group's Institute for Intelligent Computing.

EMO utilizes an audio2video diffusion model to generate expressive portrait videos with remarkable realism and accuracy, pushing the boundaries of what is possible in talking-head video generation.

Understanding the EMO Framework

Frames Encoding Method: EMO (Source)

The EMO framework is a two-stage process that combines audio and visual information to generate highly expressive portrait videos.

In the initial stage, called Frames Encoding, a neural network named ReferenceNet extracts features from a single reference image and from motion frames. This encoding lays the foundation for the subsequent diffusion process.
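EMO's code is not public, but a minimal PyTorch sketch can convey the idea of one shared encoder extracting features from both the reference image and the recent motion frames. Everything here (the ReferenceEncoder class, its layers, and the tensor shapes) is an illustrative assumption, not EMO's actual architecture:

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative stand-in for ReferenceNet: a small CNN that maps an
    image (or a motion frame) to a grid of feature vectors."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels * 2, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> features: (batch, 2*channels, H/4, W/4)
        return self.net(images)

encoder = ReferenceEncoder()
reference_image = torch.randn(1, 3, 256, 256)   # the single reference portrait
motion_frames = torch.randn(4, 3, 256, 256)     # a few preceding frames
ref_features = encoder(reference_image)         # identity features
motion_features = encoder(motion_frames)        # short-term motion context
```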

During the Diffusion Process stage, EMO utilizes a pretrained audio encoder to produce the audio embedding. A facial region mask is integrated with multi-frame noise, which governs the generation of facial imagery.
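Here is a hedged sketch of how those conditioning signals might be assembled. The speech encoder's output is faked with random tensors, and multiplying the mask into the noise is just one plausible reading of "integrated", not EMO's documented formulation:

```python
import torch

# Stand-in for a pretrained speech encoder's output: one embedding per
# video frame (num_frames and audio_dim are illustrative values).
num_frames, audio_dim = 16, 768
audio_embedding = torch.randn(num_frames, audio_dim)

# Multi-frame noise for the diffusion process: one latent per frame.
latent_c, latent_h, latent_w = 4, 32, 32
noise = torch.randn(num_frames, latent_c, latent_h, latent_w)

# Facial region mask (1 inside the face region, 0 elsewhere), broadcast
# over frames and channels so generation is steered toward the face.
face_mask = torch.zeros(1, 1, latent_h, latent_w)
face_mask[:, :, 8:24, 8:24] = 1.0
masked_noise = noise * face_mask
```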

The Backbone Network, incorporating Reference-Attention and Audio-Attention mechanisms, plays a crucial role in preserving the character’s identity and modulating their movements.
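Cross-attention is the standard mechanism for injecting such conditioning into a diffusion backbone. The sketch below uses torch.nn.MultiheadAttention to stand in for the two mechanisms; every dimension and token count is invented for illustration:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8

# Two cross-attention layers standing in for Reference-Attention (ties
# each frame to the identity features) and Audio-Attention (ties each
# frame to the audio embedding).
reference_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
audio_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

frame_tokens = torch.randn(1, 1024, embed_dim)  # latent tokens for one frame
ref_tokens = torch.randn(1, 1024, embed_dim)    # tokens from ReferenceNet
audio_tokens = torch.randn(1, 16, embed_dim)    # projected audio embeddings

# Frames query the reference features to preserve identity...
x, _ = reference_attn(frame_tokens, ref_tokens, ref_tokens)
# ...then query the audio features to drive mouth and head motion.
x, _ = audio_attn(x, audio_tokens, audio_tokens)
```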

Additionally, Temporal Modules are employed to manipulate the temporal dimension and adjust the velocity of motion.
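A common way to build such a temporal module in video diffusion models is self-attention along the frame axis, so each spatial location exchanges information across time. The sketch below shows that generic pattern; EMO's exact temporal layer may differ:

```python
import torch
import torch.nn as nn

frames, tokens, dim = 16, 1024, 256
temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

x = torch.randn(frames, tokens, dim)      # (time, space, channels)
x_t = x.permute(1, 0, 2)                  # (space, time, channels)
x_t, _ = temporal_attn(x_t, x_t, x_t)     # attend across frames per location
x = x_t.permute(1, 0, 2)                  # back to (time, space, channels)
```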

The combination of these innovative techniques enables EMO to generate vocal avatar videos with expressive facial expressions, various head poses, and any duration depending on the length of the input audio.
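Concretely, at a fixed frame rate the number of frames to generate follows directly from the audio length (the 25 fps below is an assumed value for illustration, not a figure from the paper):

```python
import math

fps = 25            # assumed output frame rate
duration_s = 12.8   # length of the input audio clip in seconds
n_frames = math.ceil(duration_s * fps)
print(n_frames)     # 320 frames for this clip
```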

Vocal Avatar Generation

EMO goes beyond traditional talking head videos by introducing the concept of vocal avatar generation.

By inputting a single character image and vocal audio, such as singing, EMO can generate vocal avatar videos with expressive facial expressions, various head poses, and any duration based on the length of the input audio.

Singing Avatars

EMO can generate singing avatars that convincingly mimic the facial expressions and head movements of the reference character.

Multilingual and Multicultural Expressions

EMO supports songs in various languages and brings diverse portrait styles to life. By intuitively recognizing tonal variations in the audio, it can generate dynamic, expression-rich avatars that reflect the cultural nuances of different languages.

Talking with Different Characters

The EMO framework can accommodate spoken audio in various languages and animate portraits from bygone eras, paintings, 3D models, and AI-generated content.

By infusing these characters with lifelike motion and realism, EMO expands the possibilities of character portrayal in multilingual and multicultural contexts.

Training and Dataset

The EMO model was trained on a dataset of over 250 hours of footage and more than 150 million images.

This dataset includes footage from television interviews and singing performances, covering multiple languages.

Qualitative Comparison


The figure shows a visual comparison between the EMO method and previous approaches. When given a single reference image, Wav2Lip often produces videos with blurry mouth regions and static head poses, lacking eye movement.

DreamTalk’s supplied style clips may distort the original face, limiting facial expressions and the dynamism of head movement. In contrast, EMO outperforms SadTalker and DreamTalk by generating a broader range of head movements and more dynamic facial expressions. EMO drives character motion directly from audio, without relying on intermediate signals such as blend shapes.

Qualitative comparisons with several talking-head generation works (Source)

Limitations

While EMO demonstrates impressive capabilities in generating expressive portrait videos, there are still limitations to be addressed.

The framework relies heavily on the quality of the input audio and reference image, and improvements in audio-visual synchronization can further enhance the realism of the generated videos.

Code

When we tried to access the EMO GitHub repository, we found that no code is available, and many issues have been opened asking about it. It may have been taken down.

The repository mentions that this project is intended solely for academic research and effect demonstration.

Reference Link: https://humanaigc.github.io/emote-portrait-alive/

Github Link: https://github.com/HumanAIGC/EMO
