Repetition in data helps. Research by FAIR, Meta

FAIR (AI at Meta) challenged the idea that LLMs always need more unique training data.

They found that repetition in data isn't harmful when training models and can actually improve performance. It's especially beneficial when the total training budget (the number of examples the model sees) is fixed.

  • The approach was tested on three math problems (a data-generation sketch follows the list):

- greatest common divisor (GCD),

- multiplication modulo 67,

- calculating the eigenvalues of real symmetric matrices
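
For concreteness, here is a minimal Python sketch of how training pairs for these three tasks might be generated. The value ranges, modulus handling, and matrix size are assumptions for illustration, not the paper's exact setup.

import math
import random
import numpy as np

def gcd_example(max_n=1000):
    # (a, b) -> gcd(a, b); greatest-common-divisor task
    a, b = random.randint(1, max_n), random.randint(1, max_n)
    return (a, b), math.gcd(a, b)

def mod_mult_example(modulus=67):
    # (a, b) -> a * b mod 67; modular-multiplication task
    a, b = random.randrange(modulus), random.randrange(modulus)
    return (a, b), (a * b) % modulus

def eigenvalue_example(dim=5):
    # symmetric matrix -> its eigenvalues (eigvalsh returns them in ascending order)
    m = np.random.randn(dim, dim)
    sym = (m + m.T) / 2
    return sym, np.linalg.eigvalsh(sym)

print(gcd_example(), mod_mult_example(), eigenvalue_example()[1])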

Here are the key findings:

  • Repetition of data helps:

Models trained on fewer unique examples with more repetition often outperform models trained on larger sets of single-use examples. Researchers saw that repeating examples led to "emergent learning," enabling models to learn new patterns.

Image credit: Original paper

  • Two-set training:

During training, examples from two sets are mixed (a sampling sketch follows this list):

- Small set: a small group of examples that the model sees repeatedly to help it learn faster.

- Large set: a larger group that the model sees less often to prevent overfitting.

The balance between them can be fine-tuned.
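
As an illustration, a two-set sampler might look like the Python sketch below; the mixing probability p_small and the set sizes are assumptions, not values from the paper.

import random

def two_set_sampler(small_set, large_set, p_small=0.25):
    # Infinite stream of training examples mixing the two sets.
    while True:
        if random.random() < p_small:
            yield random.choice(small_set)   # small set: seen over and over
        else:
            yield random.choice(large_set)   # large set: each example seen rarely

# Hypothetical usage: a few thousand repeated examples alongside millions of rarely seen ones.
# sampler = two_set_sampler(small_examples, large_examples, p_small=0.25)
# batch = [next(sampler) for _ in range(batch_size)]

Tuning p_small (or, equivalently, the size of the small set) controls how heavily the repeated examples dominate training, which is the balance mentioned above.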

Mixing a small set of repeated examples with a regular set of randomly selected examples makes learning faster and improves performance. It doesn't matter which examples are repeated; what matters is that some examples are seen more often than the rest.

The repeated set size depends on the task—smaller for simple problems like GCD and larger for complex tasks like modular multiplication and eigenvalues.

Image credit: Original paper

So this approach challenges the idea that more unique data is always better, potentially changing views on training set sizes.


Original paper: Emergent Properties with Repeated Examples
