TrackGo: A Flexible and Efficient Method for Controllable Video Generation
Credit: https://arxiv.org/pdf/2408.11475

Today's paper introduces TrackGo, a new approach for controllable video generation. It allows users to precisely control object motion in generated videos using free-form masks and arrows. The method leverages a new component called TrackAdapter to efficiently inject motion control information into a pre-trained video diffusion model.

Method Overview

TrackGo works in two main stages: Point Trajectories Generation and Conditional Video Generation.

In the first stage, it takes user inputs in the form of free-form masks (to specify target objects/areas) and arrows (to indicate motion trajectories). These are converted into point trajectories that precisely describe the desired motion.
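To make the first stage more concrete, here is a minimal sketch of how a mask and an arrow could be turned into point trajectories. It is an illustration under simplifying assumptions, not the paper's procedure: the function name `mask_arrow_to_trajectories`, the random sampling of points inside the mask, and the straight-line motion along the arrow are all hypothetical choices made for this example.

```python
import numpy as np

def mask_arrow_to_trajectories(mask, arrow_start, arrow_end,
                               num_frames, num_points=8, seed=0):
    """Toy conversion of one mask + one arrow into point trajectories.

    Assumption (not from the paper): points are sampled uniformly inside
    the mask and all move along the arrow in a straight line.
    Returns an array of shape (num_frames, num_points, 2) holding (x, y).
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask is empty")

    # Pick starting points inside the masked region.
    count = min(num_points, len(xs))
    idx = rng.choice(len(xs), size=count, replace=False)
    start_points = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float64)

    # The arrow defines the total displacement over the clip.
    displacement = (np.asarray(arrow_end, dtype=np.float64)
                    - np.asarray(arrow_start, dtype=np.float64))

    # Linearly interpolate each point from its start to start + displacement.
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]      # (T, 1, 1)
    return start_points[None] + t * displacement[None, None]  # (T, P, 2)


mask = np.zeros((16, 16), dtype=bool)
mask[4:8, 4:8] = True
trajectories = mask_arrow_to_trajectories(mask, (5, 5), (11, 5),
                                          num_frames=5, num_points=4)
```

Each sampled point starts inside the mask and ends displaced by the arrow vector. A real system would likely need curved arrows and per-object handling, which this sketch leaves out.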

In the second stage, TrackGo uses a pre-trained Stable Video Diffusion (SVD) model as the base architecture. It introduces the TrackAdapter, which is integrated into the temporal self-attention layers of the SVD model.

The TrackAdapter introduces a dual-branch architecture within the existing temporal self-attention layers. One branch focuses on the motion within the target area, while the original branch handles the rest. This allows precise control of specified objects while maintaining overall video coherence.

The point trajectories are encoded and injected into the model via the TrackAdapter. An attention mask mechanism is used to separate the specified motion areas from the rest, allowing fine-grained control.
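The dual-branch idea can be illustrated with a small NumPy toy. This is a simplified sketch and not the TrackAdapter itself: it assumes the two branches are plain temporal self-attention layers with separate weights, and that a binary per-location mask simply selects which branch's output is used. The names `dual_branch_attention` and `target_mask` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, w_q, w_k, w_v):
    """Attention across frames, independently at each spatial location.

    x: (N, T, C) with N spatial locations, T frames, C channels.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (N, T, T)
    return softmax(scores) @ v                                # (N, T, C)

def dual_branch_attention(x, original_weights, adapter_weights, target_mask):
    """Blend two branches with a binary mask over spatial locations.

    Assumption for illustration: locations where target_mask is True use
    the adapter branch, and all other locations use the original branch.
    """
    base = temporal_self_attention(x, *original_weights)
    adapter = temporal_self_attention(x, *adapter_weights)
    m = target_mask.astype(x.dtype)[:, None, None]            # (N, 1, 1)
    return m * adapter + (1.0 - m) * base


rng = np.random.default_rng(0)
N, T, C = 6, 4, 8
x = rng.normal(size=(N, T, C))
original_weights = [rng.normal(size=(C, C)) for _ in range(3)]
adapter_weights = [rng.normal(size=(C, C)) for _ in range(3)]
target_mask = np.array([True, True, False, False, False, False])
out = dual_branch_attention(x, original_weights, adapter_weights, target_mask)
```

Because the mask is binary, locations outside the target area receive exactly the original branch's output, which is one simple way to picture how the background can stay untouched while the target area follows the injected motion.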

During training, the authors optimize the model with a combination of a noise prediction loss and a novel attention-based loss. At inference time, users can adjust parameters that control how much the unspecified areas move, allowing for flexible video generation.
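As a rough picture of how two such terms might be combined, the sketch below adds a standard mean-squared noise prediction loss to a weighted attention term. The attention term here is a hypothetical stand-in (it penalizes attention that falls outside the target mask), and the weight `attention_weight` is an assumed hyperparameter; the summary above does not give the actual formulation of the attention-based loss.

```python
import numpy as np

def noise_prediction_loss(predicted_noise, true_noise):
    """Standard diffusion objective: MSE between predicted and true noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

def attention_outside_mask_penalty(attention_map, target_mask):
    """Hypothetical stand-in for an attention-based loss.

    Penalizes attention values that fall outside the target mask.
    This is NOT the paper's loss, only an illustration of the idea.
    """
    outside = attention_map * (1.0 - target_mask)
    return float(np.mean(outside ** 2))

def combined_loss(predicted_noise, true_noise, attention_map, target_mask,
                  attention_weight=0.1):
    """Weighted sum of the two terms; the weight is an assumed value."""
    return (noise_prediction_loss(predicted_noise, true_noise)
            + attention_weight
            * attention_outside_mask_penalty(attention_map, target_mask))


predicted_noise = np.zeros((2, 4, 4))
true_noise = np.ones((2, 4, 4))
attention_map = np.ones((4, 4))
target_mask = np.zeros((4, 4))
target_mask[:2, :] = 1.0  # top half is the target area
loss = combined_loss(predicted_noise, true_noise, attention_map, target_mask)
```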

Results

TrackGo outperformed baseline methods across multiple metrics:

  • Better video quality (lower FVD scores)
  • Improved image quality (lower FID scores)
  • More faithful motion control (lower ObjMC scores)

It also achieved faster inference speeds and required fewer additional parameters compared to other approaches.

Qualitative results showed TrackGo could handle complex scenarios involving multiple objects, fine-grained object parts, and sophisticated movement trajectories better than baselines. It maintained background consistency while precisely controlling target object motion.

A user study found that 62% of participants preferred videos generated by TrackGo over those from competing methods.

Conclusion

TrackGo introduces an effective and efficient approach for controllable video generation. By leveraging point trajectories and the novel TrackAdapter, it achieves precise motion control while maintaining high video quality. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhou, Haitao, et al. "TrackGo: A Flexible and Efficient Method for Controllable Video Generation." arXiv preprint arXiv:2408.11475 (2024).
