Repetition in data helps. Research by FAIR, Meta
FAIR (AI at Meta) challenged the idea that LLMs need more unique training data.
They found that repetition in data isn't harmful for training models and can actually improve performance. It's especially beneficial when the total number of training examples is fixed.
They tested this on three math tasks:
- greatest common divisor (GCD),
- multiplication modulo 67,
- calculating the eigenvalues of real symmetric matrices.
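For illustration, here is a minimal sketch of how training pairs for these three tasks could be generated. The plain-text encodings, operand ranges, and matrix size are assumptions made for readability, not the paper's actual data format or tokenization.

```python
import math
import random

import numpy as np


def gcd_example(max_operand: int = 1_000_000) -> tuple[str, str]:
    """One (input, target) pair for the GCD task (encoding is an assumption)."""
    a, b = random.randint(1, max_operand), random.randint(1, max_operand)
    return f"{a} {b}", str(math.gcd(a, b))


def modmul_example(modulus: int = 67) -> tuple[str, str]:
    """One (input, target) pair for multiplication modulo 67."""
    a, b = random.randint(0, modulus - 1), random.randint(0, modulus - 1)
    return f"{a} {b}", str((a * b) % modulus)


def eigenvalue_example(n: int = 5) -> tuple[np.ndarray, np.ndarray]:
    """One (matrix, eigenvalues) pair for real symmetric matrices."""
    m = np.random.randn(n, n)
    sym = (m + m.T) / 2                  # symmetrize so all eigenvalues are real
    return sym, np.linalg.eigvalsh(sym)  # eigvalsh returns them in ascending order
```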
Here are the key findings:
Models trained on fewer unique examples, each repeated more often, frequently outperform models trained on larger sets of unique examples. The researchers found that repeating examples led to "emergent learning," enabling models to learn new patterns.
During training, each example is drawn from one of two sets:
- Small set: a small group of examples that the model sees repeatedly to help it learn faster.
- Large set: a larger group that the model sees less often to prevent overfitting.
The balance between the two sets can be tuned.
Mixing a small set of repeated examples with a larger set of randomly drawn examples speeds up learning and improves performance. Which examples are repeated doesn't matter; what matters is that they are seen more often than the rest.
The repeated set size depends on the task: smaller for simple problems like GCD and larger for complex tasks like modular multiplication and eigenvalues.
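A minimal sketch of this two-set mixing is below. The parameter names `repeated_set_size` (how many examples go into the small set) and `repeat_prob` (how often the model draws from it) are hypothetical knobs introduced here for illustration, not names from the paper.

```python
import random
from typing import Callable, Sequence


def make_two_set_sampler(
    pool: Sequence,            # large set: the full collection of training examples
    repeated_set_size: int,    # hypothetical knob: size of the small, repeated set
    repeat_prob: float,        # hypothetical knob: probability of drawing from the small set
) -> Callable[[], object]:
    """Return a sampler that mixes a small, frequently repeated set with the full pool."""
    # Per the findings above, which examples are repeated doesn't matter,
    # so the small set is just a random subset of the pool.
    repeated_set = random.sample(list(pool), repeated_set_size)

    def sample():
        if random.random() < repeat_prob:
            return random.choice(repeated_set)  # seen over and over
        return random.choice(pool)              # seen rarely; keeps coverage broad

    return sample


# Example usage with a toy pool of modular-multiplication pairs:
if __name__ == "__main__":
    pool = [((a, b), (a * b) % 67) for a in range(67) for b in range(67)]
    sampler = make_two_set_sampler(pool, repeated_set_size=200, repeat_prob=0.25)
    batch = [sampler() for _ in range(64)]
    print(batch[:3])
```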
So this approach challenges the idea that more unique data is always better, potentially changing views on training set sizes.
Original paper: "Emergent Properties with Repeated Examples"