Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Today's paper introduces a new approach for selecting optimal subsets of instruction data to finetune large language models (LLMs). The authors propose a diversity-centric method that uses k-means clustering and iterative refinement to efficiently sample high-quality, representative data. Their approach outperforms existing methods on various downstream tasks while using only a small fraction of the full dataset.
Method Overview
The paper introduces a two-stage approach for data selection: static data selection and iterative data selection.
For static data selection, they first use k-means clustering to group similar data points together. This ensures diversity by separating dissimilar samples into different clusters. They then sample data from each cluster, with the number of samples proportional to the cluster size. Within each cluster, they use a quality-based sampling approach, where higher quality instances have a higher probability of being selected.
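The sketch below illustrates this static selection pipeline under some assumptions: `embeddings` are precomputed instruction embeddings, `quality_scores` come from a quality scorer (e.g., a reward model), and the softmax-style weighting within clusters is one plausible way to realize "higher-quality instances are more likely to be selected"; the paper's exact formulation may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def static_select(embeddings, quality_scores, budget, n_clusters=100, seed=0):
    """Diversity-centric static selection (illustrative sketch).

    embeddings:     (N, d) instruction embeddings
    quality_scores: (N,) per-sample quality scores (e.g., reward-model scores)
    budget:         total number of samples to select
    """
    rng = np.random.default_rng(seed)
    # Cluster similar instructions together; dissimilar ones land in different clusters.
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)

    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Per-cluster budget proportional to cluster size.
        k = int(round(budget * len(idx) / len(embeddings)))
        if k == 0 or len(idx) == 0:
            continue
        # Quality-weighted sampling within the cluster (softmax over scores,
        # an assumed weighting scheme for illustration).
        scores = quality_scores[idx]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        chosen = rng.choice(idx, size=min(k, len(idx)), replace=False, p=probs)
        selected.extend(chosen.tolist())
    return np.array(selected)
```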
The iterative data selection method builds on this by incorporating feedback from the training process. After an initial round of finetuning on a small subset, they evaluate how well the model learns from different clusters. Clusters that contribute more to the model's learning receive higher weights in subsequent sampling rounds. This automatically reduces the impact of low-quality clusters and focuses sampling on more valuable data.
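A minimal sketch of the cluster re-weighting step is shown below. The exponential update and the use of per-cluster loss reduction as the feedback signal are assumptions for illustration; the paper defines its own per-cluster learning signal.

```python
import numpy as np

def update_cluster_weights(weights, cluster_gains, temperature=1.0):
    """Re-weight clusters from training feedback (illustrative sketch).

    weights:       (K,) current sampling weights over clusters
    cluster_gains: (K,) learning signal per cluster, e.g., how much the loss
                   on that cluster's samples dropped after the last round
                   (assumed signal; the paper's exact choice may differ)
    """
    # Boost clusters the model learned more from; down-weight low-gain
    # (often low-quality) clusters.
    boosted = weights * np.exp(cluster_gains / temperature)
    return boosted / boosted.sum()

def next_round_budget(weights, budget):
    """Split the next sampling round's budget across clusters by weight."""
    return np.round(weights * budget).astype(int)
```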
The authors also investigate different ways to encode the instruction data, score sample quality, and determine the optimal number of clusters. They find that using a reward model for scoring and the silhouette score to estimate the ideal number of clusters works well.
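As a small illustration of the silhouette-based choice of k, the sketch below scores a few candidate cluster counts and keeps the best one; the candidate grid and the subsampling used to keep the silhouette computation cheap are assumptions, not the paper's exact procedure.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_num_clusters(embeddings, candidates=(50, 100, 200, 500), seed=0):
    """Pick the number of clusters by silhouette score (illustrative sketch)."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        # Subsample for efficiency; silhouette is quadratic in the number of points.
        score = silhouette_score(embeddings, labels, sample_size=min(10000, len(embeddings)),
                                 random_state=seed)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```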
Results
The proposed methods consistently outperform random sampling and other state-of-the-art data selection approaches across a range of tasks including natural language reasoning, world knowledge, code generation, and math reasoning. Their best method, iterative k-means quality sampling, achieves up to 7% improvement over random selection and 3.8% over previous sampling methods while using only 5% of the full dataset.
The authors also demonstrate that their findings generalize to different base models like Mistral-7B, though results vary for more advanced models like Llama-3.
Conclusion
This paper presents an efficient, diversity-focused approach to selecting instruction data for finetuning LLMs. By using clustering to ensure diversity and incorporating model feedback through iterative refinement, they achieve significant performance improvements while using only a small fraction of the available data. For more information, please consult the full paper.
Congrats to the authors for their work!
Yu, Simon, et al. "Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement." arXiv preprint arXiv:2409.11378 (2024).