Tell, Don’t Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
Credit: https://arxiv.org/pdf/2403.05535

Today's paper introduces LaGTran, a novel framework that uses text supervision to improve transfer learning across domains for image and video classification tasks. The method leverages readily available text descriptions associated with images and videos to bridge domain gaps more effectively than traditional unsupervised domain adaptation approaches. LaGTran demonstrates significant performance gains on challenging benchmarks, highlighting the power of incorporating language guidance for cross-domain transfer.

Method Overview

LaGTran works by utilizing text descriptions associated with images or videos in both the source and target domains. The overall pipeline involves three main steps:

First, a text classifier is trained on the labeled source domain using the text descriptions and corresponding labels. This classifier learns to predict categories based on textual input.
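To make this step concrete, here is a minimal sketch of fine-tuning a text classifier on source-domain captions paired with category labels. The DistilBERT backbone, optimizer, and hyperparameters are illustrative assumptions, not necessarily the configuration used in the paper.

```python
# Minimal sketch: fine-tune a text classifier on labeled source-domain captions.
# Backbone, optimizer, and schedule are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def train_text_classifier(source_texts, source_labels, num_classes,
                          epochs=3, lr=2e-5):
    """Fit a caption -> category classifier on the labeled source domain."""
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    text_clf = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_classes)

    enc = tokenizer(source_texts, padding=True, truncation=True,
                    return_tensors="pt")
    loader = DataLoader(
        TensorDataset(enc["input_ids"], enc["attention_mask"],
                      torch.tensor(source_labels)),
        batch_size=32, shuffle=True)

    optim = torch.optim.AdamW(text_clf.parameters(), lr=lr)
    text_clf.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            out = text_clf(input_ids=input_ids,
                           attention_mask=attention_mask, labels=labels)
            out.loss.backward()   # cross-entropy loss returned by the model
            optim.step()
            optim.zero_grad()
    return text_clf, tokenizer
```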

Next, the trained text classifier is used to generate pseudo-labels for the unlabeled target domain data. It does this by processing the text descriptions associated with target domain images/videos and predicting their most likely categories.
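The pseudo-labeling step then amounts to a plain inference pass over the target-domain captions. The helper below assumes the classifier and tokenizer produced by the previous sketch; the names are illustrative, not an API from the paper.

```python
# Minimal sketch: pseudo-label unlabeled target data from their captions alone.
import torch

@torch.no_grad()
def pseudo_label_target(text_clf, tokenizer, target_texts, batch_size=64):
    """Return one predicted category index per target-domain caption."""
    text_clf.eval()
    pseudo_labels = []
    for i in range(0, len(target_texts), batch_size):
        enc = tokenizer(target_texts[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="pt")
        logits = text_clf(**enc).logits
        pseudo_labels.extend(logits.argmax(dim=-1).tolist())
    return pseudo_labels
```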

Finally, these pseudo-labels are used as supervision to train an image/video classifier on the target domain data, alongside the labeled source domain data. This allows the model to learn features that work well across both domains.
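A rough sketch of this final stage is below: standard supervised training on the union of labeled source images and pseudo-labeled target images. The ResNet-50 backbone and training settings are stand-ins; the paper's actual backbone and hyperparameters may differ.

```python
# Minimal sketch: train the downstream image classifier on labeled source data
# plus pseudo-labeled target data. Both datasets yield (image_tensor, label),
# where target labels come from the text-classifier pseudo-labels above.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision.models import ResNet50_Weights, resnet50

def train_joint_classifier(source_dataset, target_pseudo_dataset,
                           num_classes, epochs=10, lr=1e-3):
    model = resnet50(weights=ResNet50_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    loader = DataLoader(ConcatDataset([source_dataset, target_pseudo_dataset]),
                        batch_size=64, shuffle=True)
    optim = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            loss = criterion(model(images), labels)
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model
```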

The key insight of LaGTran is that text descriptions often contain valuable semantic information that is more invariant across domains compared to raw image or video data. By leveraging this text modality, the method can more effectively bridge domain gaps that are challenging for traditional pixel-based adaptation approaches.

Importantly, LaGTran only requires text descriptions during training. At inference time, the final classifier operates solely on image/video inputs, incurring no additional computational overhead compared to standard models.

The paper also introduces a new benchmark called Ego2Exo for studying cross-view transfer in videos between egocentric (first-person) and exocentric (third-person) perspectives. This dataset highlights the challenges of adapting between significantly different viewpoints in action recognition tasks.

Results

LaGTran achieves state-of-the-art results on several challenging domain adaptation benchmarks:

  • On the GeoNet dataset for geographic transfer, it outperforms previous methods by over 10% in average accuracy.

  • For the DomainNet dataset, LaGTran sets a new state-of-the-art, surpassing prior approaches on most transfer tasks.

  • On the newly introduced Ego2Exo benchmark, it demonstrates significant gains over existing video domain adaptation techniques.

Conclusion

This paper introduces LaGTran, a straightforward yet effective approach for leveraging text supervision to improve cross-domain transfer in computer vision tasks. For more information, please consult the full paper.

Congrats to the authors for their work!

Kalluri, Tarun, et al. "Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos." arXiv preprint arXiv:2403.05535 (2024).
