Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Today's paper introduces a new approach for selecting optimal subsets of instruction data to finetune large language models (LLMs). The authors propose a diversity-centric method that uses k-means clustering and iterative refinement to efficiently sample high-quality, representative data. Their approach outperforms existing methods on various downstream tasks while using only a small fraction of the full dataset.
Method Overview
The paper introduces a two-stage approach for data selection: static data selection and iterative data selection.
For static data selection, they first use k-means clustering to group similar data points together. This ensures diversity by separating dissimilar samples into different clusters. They then sample data from each cluster, with the number of samples proportional to the cluster size. Within each cluster, they use a quality-based sampling approach, where higher quality instances have a higher probability of being selected.
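The sketch below illustrates this static selection pipeline under some assumptions: `embeddings` are precomputed instruction embeddings, `quality_scores` come from a quality scorer (e.g., a reward model), and the softmax-style weighting within clusters is one plausible way to realize "higher-quality instances are more likely to be selected"; the paper's exact formulation may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def static_select(embeddings, quality_scores, budget, n_clusters=100, seed=0):
    """Diversity-centric static selection (illustrative sketch).

    embeddings:     (N, d) instruction embeddings
    quality_scores: (N,) per-sample quality scores (e.g., reward-model scores)
    budget:         total number of samples to select
    """
    rng = np.random.default_rng(seed)
    # Cluster similar instructions together; dissimilar ones land in different clusters.
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)

    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # Per-cluster budget proportional to cluster size.
        k = int(round(budget * len(idx) / len(embeddings)))
        if k == 0 or len(idx) == 0:
            continue
        # Quality-weighted sampling within the cluster (softmax over scores,
        # an assumed weighting scheme for illustration).
        scores = quality_scores[idx]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        chosen = rng.choice(idx, size=min(k, len(idx)), replace=False, p=probs)
        selected.extend(chosen.tolist())
    return np.array(selected)
```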
The iterative data selection method builds on this by incorporating feedback from the training process. After an initial round of finetuning on a small subset, they evaluate how well the model learns from different clusters. Clusters that contribute more to the model's learning receive higher weights in subsequent sampling rounds. This automatically reduces the impact of low-quality clusters and focuses sampling on more valuable data.
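A minimal sketch of the cluster re-weighting step is shown below. The exponential update and the use of per-cluster loss reduction as the feedback signal are assumptions for illustration; the paper defines its own per-cluster learning signal.

```python
import numpy as np

def update_cluster_weights(weights, cluster_gains, temperature=1.0):
    """Re-weight clusters from training feedback (illustrative sketch).

    weights:       (K,) current sampling weights over clusters
    cluster_gains: (K,) learning signal per cluster, e.g., how much the loss
                   on that cluster's samples dropped after the last round
                   (assumed signal; the paper's exact choice may differ)
    """
    # Boost clusters the model learned more from; down-weight low-gain
    # (often low-quality) clusters.
    boosted = weights * np.exp(cluster_gains / temperature)
    return boosted / boosted.sum()

def next_round_budget(weights, budget):
    """Split the next sampling round's budget across clusters by weight."""
    return np.round(weights * budget).astype(int)
```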
The authors also investigate different ways to encode the instruction data, score sample quality, and determine the optimal number of clusters. They find that using a reward model for scoring and the silhouette score to estimate the ideal number of clusters works well.
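As a small illustration of the silhouette-based choice of k, the sketch below scores a few candidate cluster counts and keeps the best one; the candidate grid and the subsampling used to keep the silhouette computation cheap are assumptions, not the paper's exact procedure.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_num_clusters(embeddings, candidates=(50, 100, 200, 500), seed=0):
    """Pick the number of clusters by silhouette score (illustrative sketch)."""
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings)
        # Subsample for efficiency; silhouette is quadratic in the number of points.
        score = silhouette_score(embeddings, labels, sample_size=min(10000, len(embeddings)),
                                 random_state=seed)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```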
Results
The proposed methods consistently outperform random sampling and other state-of-the-art data selection approaches across a range of tasks including natural language reasoning, world knowledge, code generation, and math reasoning. Their best method, iterative k-means quality sampling, achieves up to 7% improvement over random selection and 3.8% over previous sampling methods while using only 5% of the full dataset.
The authors also demonstrate that their findings generalize to different base models like Mistral-7B, though results vary for more advanced models like Llama-3.
Conclusion
This paper presents an efficient, diversity-focused approach to selecting instruction data for finetuning LLMs. By using clustering to ensure diversity and incorporating model feedback through iterative refinement, they achieve significant performance improvements while using only a small fraction of the available data. For more information, please consult the full paper.
Congrats to the authors for their work!
Yu, Simon, et al. "Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement." arXiv preprint arXiv:2409.11378 (2024).