Repetition in data helps. Research by FAIR, Meta
FAIR (AI at Meta) challenged the idea that LLMs need more unique training data.
They found that repetition in data isn't harmful for training models and can actually improve performance. It's especially beneficial when the total number of training examples is fixed.
They tested this on three math tasks:
- greatest common divisor (GCD),
- multiplication modulo 67,
- calculating the eigenvalues of real symmetric matrices.
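For illustration, here is a minimal sketch of how training pairs for these three tasks could be generated. The plain-text encodings, operand ranges, and matrix size are assumptions made for readability, not the paper's actual data format or tokenization.

```python
import math
import random

import numpy as np


def gcd_example(max_operand: int = 1_000_000) -> tuple[str, str]:
    """One (input, target) pair for the GCD task (encoding is an assumption)."""
    a, b = random.randint(1, max_operand), random.randint(1, max_operand)
    return f"{a} {b}", str(math.gcd(a, b))


def modmul_example(modulus: int = 67) -> tuple[str, str]:
    """One (input, target) pair for multiplication modulo 67."""
    a, b = random.randint(0, modulus - 1), random.randint(0, modulus - 1)
    return f"{a} {b}", str((a * b) % modulus)


def eigenvalue_example(n: int = 5) -> tuple[np.ndarray, np.ndarray]:
    """One (matrix, eigenvalues) pair for real symmetric matrices."""
    m = np.random.randn(n, n)
    sym = (m + m.T) / 2                  # symmetrize so all eigenvalues are real
    return sym, np.linalg.eigvalsh(sym)  # eigvalsh returns them in ascending order
```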
Here are the key findings:
Models trained on fewer unique examples, each repeated more often, frequently outperform models trained on larger sets of unique examples. The researchers found that repeating examples led to "emergent learning," enabling models to learn new patterns.
During training, each example is drawn from one of two sets:
- Small set: a small group of examples that the model sees repeatedly to help it learn faster.
- Large set: a larger group that the model sees less often to prevent overfitting.
The balance between the two sets can be tuned.
Mixing a small set of repeated examples with a larger set of randomly drawn examples speeds up learning and improves performance. Which examples are repeated doesn't matter; what matters is that they are seen more often than the rest.
The repeated set size depends on the task: smaller for simple problems like GCD and larger for complex tasks like modular multiplication and eigenvalues.
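A minimal sketch of this two-set mixing is below. The parameter names `repeated_set_size` (how many examples go into the small set) and `repeat_prob` (how often the model draws from it) are hypothetical knobs introduced here for illustration, not names from the paper.

```python
import random
from typing import Callable, Sequence


def make_two_set_sampler(
    pool: Sequence,            # large set: the full collection of training examples
    repeated_set_size: int,    # hypothetical knob: size of the small, repeated set
    repeat_prob: float,        # hypothetical knob: probability of drawing from the small set
) -> Callable[[], object]:
    """Return a sampler that mixes a small, frequently repeated set with the full pool."""
    # Per the findings above, which examples are repeated doesn't matter,
    # so the small set is just a random subset of the pool.
    repeated_set = random.sample(list(pool), repeated_set_size)

    def sample():
        if random.random() < repeat_prob:
            return random.choice(repeated_set)  # seen over and over
        return random.choice(pool)              # seen rarely; keeps coverage broad

    return sample


# Example usage with a toy pool of modular-multiplication pairs:
if __name__ == "__main__":
    pool = [((a, b), (a * b) % 67) for a in range(67) for b in range(67)]
    sampler = make_two_set_sampler(pool, repeated_set_size=200, repeat_prob=0.25)
    batch = [sampler() for _ in range(64)]
    print(batch[:3])
```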
So this approach challenges the idea that more unique data is always better, potentially changing views on training set sizes.
Original paper: "Emergent Properties with Repeated Examples"