Tell, Don’t Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
Today's paper introduces LaGTran, a novel framework that uses text supervision to improve transfer learning across domains for image and video classification tasks. The method leverages readily available text descriptions associated with images and videos to bridge domain gaps more effectively than traditional unsupervised domain adaptation approaches. LaGTran demonstrates significant performance gains on challenging benchmarks, highlighting the power of incorporating language guidance for cross-domain transfer.
Method Overview
LaGTran works by utilizing text descriptions associated with images or videos in both the source and target domains. The overall pipeline involves three main steps:
First, a text classifier is trained on the labeled source domain using the text descriptions and corresponding labels. This classifier learns to predict categories based on textual input.
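As a rough sketch of this first step (not the authors' released code), one could fine-tune a BERT-style classifier on source-domain (description, label) pairs. The backbone choice, class count, and example captions below are placeholders:

```python
# Sketch of step 1 (illustrative, not the paper's exact configuration):
# fine-tune a BERT-style text classifier on source-domain descriptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 10                          # placeholder; set to the benchmark's label count
MODEL_NAME = "distilbert-base-uncased"    # assumed backbone choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
text_clf = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)
optimizer = torch.optim.AdamW(text_clf.parameters(), lr=2e-5)

def train_step(captions, labels):
    """One optimization step on a batch of (text description, label) pairs."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt")
    out = text_clf(**batch, labels=labels)   # cross-entropy on text inputs only
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Toy source-domain batch (hypothetical captions and labels):
loss = train_step(["a photo of a golden retriever", "a red sports car"],
                  torch.tensor([3, 7]))
```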
Next, the trained text classifier is used to generate pseudo-labels for the unlabeled target domain data. It does this by processing the text descriptions associated with target domain images/videos and predicting their most likely categories.
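Continuing the sketch above, pseudo-labeling the target domain could look like the following, reusing the `tokenizer` and `text_clf` objects from the previous snippet; the example target descriptions are hypothetical:

```python
# Sketch of step 2 (illustrative): use the trained text classifier to assign
# pseudo-labels to unlabeled target-domain samples from their descriptions.
import torch

@torch.no_grad()
def pseudo_label(captions):
    """Predict the most likely class for each target-domain description."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt")
    logits = text_clf(**batch).logits
    return logits.argmax(dim=-1)          # hard pseudo-labels per sample

target_pseudo_labels = pseudo_label(["a person slicing vegetables",
                                     "someone repairing a bicycle"])
```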
Finally, these pseudo-labels are used as supervision to train an image/video classifier on the target domain data, alongside the labeled source domain data. This allows the model to learn features that work well across both domains.
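This final step is standard supervised training of a vision model on the union of source labels and target pseudo-labels. A minimal sketch, assuming a ResNet-50 backbone and placeholder `source_loader`/`target_loader` iterables that yield (images, labels) and (images, pseudo-labels) batches; these names and `NUM_CLASSES` carry over from the snippets above:

```python
# Sketch of step 3 (illustrative): train an image classifier jointly on
# labeled source images and pseudo-labeled target images.
import torch
import torch.nn as nn
from torchvision.models import resnet50

image_clf = resnet50(num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(image_clf.parameters(), lr=0.01, momentum=0.9)

for (src_imgs, src_labels), (tgt_imgs, tgt_pseudo) in zip(source_loader,
                                                          target_loader):
    logits = image_clf(torch.cat([src_imgs, tgt_imgs]))
    labels = torch.cat([src_labels, tgt_pseudo])
    loss = criterion(logits, labels)      # same supervised loss for both domains
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```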
The key insight of LaGTran is that text descriptions often contain valuable semantic information that is more invariant across domains compared to raw image or video data. By leveraging this text modality, the method can more effectively bridge domain gaps that are challenging for traditional pixel-based adaptation approaches.
Importantly, LaGTran only requires text descriptions during training. At inference time, the final classifier operates solely on image/video inputs, incurring no additional computational overhead compared to standard models.
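In other words, once trained, the model is used like any ordinary classifier. A tiny illustration, continuing the placeholder names above:

```python
import torch

# At test time, only the image classifier runs; text descriptions are needed
# only during training. `image_clf` is the model from the previous sketch.
image_clf.eval()
with torch.no_grad():
    dummy_image = torch.randn(1, 3, 224, 224)   # stand-in for a real input
    prediction = image_clf(dummy_image).argmax(dim=-1)
```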
The paper also introduces a new benchmark called Ego2Exo for studying cross-view transfer in videos between egocentric (first-person) and exocentric (third-person) perspectives. This dataset highlights the challenges of adapting between significantly different viewpoints in action recognition tasks.
Results
LaGTran achieves state-of-the-art results on several challenging domain adaptation benchmarks spanning both images and videos, including the newly introduced Ego2Exo benchmark, consistently outperforming prior unsupervised domain adaptation approaches.
Conclusion
This paper introduces LaGTran, a straightforward yet effective approach for leveraging text supervision to improve cross-domain transfer in computer vision tasks. For more information, please consult the full paper.
Congrats to the authors for their work!
Kalluri, Tarun, et al. "Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos." arXiv preprint arXiv:2403.05535 (2024).