Tell, Don’t Show!: Language Guidance Eases Transfer Across Domains in Images and Videos
Today's paper introduces LaGTran, a novel framework that uses text supervision to improve transfer learning across domains for image and video classification tasks. The method leverages readily available text descriptions associated with images and videos to bridge domain gaps more effectively than traditional unsupervised domain adaptation approaches. LaGTran demonstrates significant performance gains on challenging benchmarks, highlighting the power of incorporating language guidance for cross-domain transfer.
Method Overview
LaGTran works by utilizing text descriptions associated with images or videos in both the source and target domains. The overall pipeline involves three main steps:
First, a text classifier is trained on the labeled source domain using the text descriptions and corresponding labels. This classifier learns to predict categories based on textual input.
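As a rough sketch of this first step (not the authors' released code), one could fine-tune a BERT-style classifier on source-domain (description, label) pairs. The backbone choice, class count, and example captions below are placeholders:

```python
# Sketch of step 1 (illustrative, not the paper's exact configuration):
# fine-tune a BERT-style text classifier on source-domain descriptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 10                          # placeholder; set to the benchmark's label count
MODEL_NAME = "distilbert-base-uncased"    # assumed backbone choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
text_clf = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)
optimizer = torch.optim.AdamW(text_clf.parameters(), lr=2e-5)

def train_step(captions, labels):
    """One optimization step on a batch of (text description, label) pairs."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt")
    out = text_clf(**batch, labels=labels)   # cross-entropy on text inputs only
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# Toy source-domain batch (hypothetical captions and labels):
loss = train_step(["a photo of a golden retriever", "a red sports car"],
                  torch.tensor([3, 7]))
```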
Next, the trained text classifier is used to generate pseudo-labels for the unlabeled target domain data. It does this by processing the text descriptions associated with target domain images/videos and predicting their most likely categories.
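Continuing the sketch above, pseudo-labeling the target domain could look like the following, reusing the `tokenizer` and `text_clf` objects from the previous snippet; the example target descriptions are hypothetical:

```python
# Sketch of step 2 (illustrative): use the trained text classifier to assign
# pseudo-labels to unlabeled target-domain samples from their descriptions.
import torch

@torch.no_grad()
def pseudo_label(captions):
    """Predict the most likely class for each target-domain description."""
    batch = tokenizer(captions, padding=True, truncation=True,
                      return_tensors="pt")
    logits = text_clf(**batch).logits
    return logits.argmax(dim=-1)          # hard pseudo-labels per sample

target_pseudo_labels = pseudo_label(["a person slicing vegetables",
                                     "someone repairing a bicycle"])
```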
Finally, these pseudo-labels are used as supervision to train an image/video classifier on the target domain data, alongside the labeled source domain data. This allows the model to learn features that work well across both domains.
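This final step is standard supervised training of a vision model on the union of source labels and target pseudo-labels. A minimal sketch, assuming a ResNet-50 backbone and placeholder `source_loader`/`target_loader` iterables that yield (images, labels) and (images, pseudo-labels) batches; these names and `NUM_CLASSES` carry over from the snippets above:

```python
# Sketch of step 3 (illustrative): train an image classifier jointly on
# labeled source images and pseudo-labeled target images.
import torch
import torch.nn as nn
from torchvision.models import resnet50

image_clf = resnet50(num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(image_clf.parameters(), lr=0.01, momentum=0.9)

for (src_imgs, src_labels), (tgt_imgs, tgt_pseudo) in zip(source_loader,
                                                          target_loader):
    logits = image_clf(torch.cat([src_imgs, tgt_imgs]))
    labels = torch.cat([src_labels, tgt_pseudo])
    loss = criterion(logits, labels)      # same supervised loss for both domains
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```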
The key insight of LaGTran is that text descriptions often contain valuable semantic information that is more invariant across domains compared to raw image or video data. By leveraging this text modality, the method can more effectively bridge domain gaps that are challenging for traditional pixel-based adaptation approaches.
Importantly, LaGTran only requires text descriptions during training. At inference time, the final classifier operates solely on image/video inputs, incurring no additional computational overhead compared to standard models.
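In other words, once trained, the model is used like any ordinary classifier. A tiny illustration, continuing the placeholder names above:

```python
import torch

# At test time, only the image classifier runs; text descriptions are needed
# only during training. `image_clf` is the model from the previous sketch.
image_clf.eval()
with torch.no_grad():
    dummy_image = torch.randn(1, 3, 224, 224)   # stand-in for a real input
    prediction = image_clf(dummy_image).argmax(dim=-1)
```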
The paper also introduces a new benchmark called Ego2Exo for studying cross-view transfer in videos between egocentric (first-person) and exocentric (third-person) perspectives. This dataset highlights the challenges of adapting between significantly different viewpoints in action recognition tasks.
Results
LaGTran achieves state-of-the-art results on several challenging domain adaptation benchmarks spanning both images and videos, including the newly introduced Ego2Exo benchmark, consistently outperforming prior unsupervised domain adaptation approaches.
Conclusion
This paper introduces LaGTran, a straightforward yet effective approach for leveraging text supervision to improve cross-domain transfer in computer vision tasks. For more information, please consult the full paper.
Congrats to the authors for their work!
Kalluri, Tarun, et al. "Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos." arXiv preprint arXiv:2403.05535 (2024).