Paper Review: NeoBERT: A Next-Generation BERT
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
NeoBERT is a next-generation bidirectional encoder; it incorporates state-of-the-art architectural advancements, modern data, and optimized pre-training to bridge the gap between encoders and powerful autoregressive language models. NeoBERT supports a context length of 4096 tokens while staying at a compact 250M parameters. Despite this, it achieves state-of-the-art results on the MTEB benchmark, outperforming larger models under identical fine-tuning conditions.
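To make this concrete, here is a minimal sketch of using the released checkpoint as a sentence encoder; the Hugging Face repo id "chandar-lab/NeoBERT" and the custom-code flag are assumptions about how the model is distributed, not details from the paper.

```python
# Hypothetical usage sketch: the repo id and trust_remote_code flag are assumptions
# about how the NeoBERT checkpoint is published on the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True)
model = AutoModel.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True)

text = "NeoBERT handles sequences of up to 4096 tokens."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)

# Mean-pool the final hidden states into a single sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```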
The approach
The architecture
NeoBERT incorporates several architectural improvements to enhance efficiency and performance: rotary positional embeddings (RoPE) instead of absolute positional embeddings, pre-layer RMSNorm in place of post-layer LayerNorm, the SwiGLU activation in the feed-forward blocks, and a deeper-but-not-wider layout chosen to stay close to the optimal depth-to-width ratio at its 250M-parameter scale.
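For illustration only, here is a minimal PyTorch sketch of two of these components, pre-RMSNorm and a bias-free SwiGLU feed-forward block; the dimensions and module names are placeholders, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Feed-forward block with a SiLU-gated linear unit and no bias terms."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_value = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

# Toy check with placeholder sizes.
x = torch.randn(2, 16, 768)
y = SwiGLU(768, 2048)(RMSNorm(768)(x))
print(y.shape)  # (2, 16, 768)
```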
Data
NeoBERT is pre-trained on RefinedWeb, a web-scale corpus that is far larger and more recent than the Wikipedia and BookCorpus data behind BERT and RoBERTa.
Pre-training
Following RoBERTa’s example, NeoBERT is pre-trained solely on masked language modeling, with a 20% masking rate, for 2.1T tokens. For efficiency, it uses DeepSpeed ZeRO, FlashAttention, and fused operators (xFormers), keeps tensor dimensions aligned with the GPU architecture (multiples of 64), and removes biases to simplify computation.
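As a rough illustration of the objective rather than the paper's actual 2.1T-token pipeline, the generic Hugging Face collator can apply a 20% masking rate; the bert-base-uncased tokenizer here is just a stand-in for NeoBERT's own.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative only: bert-base-uncased is a stand-in tokenizer, not NeoBERT's.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,  # 20% of tokens selected for masking
)

examples = [tokenizer("Encoders remain widely used in production systems.")]
batch = collator(examples)
print(batch["input_ids"])  # some tokens replaced by [MASK] (or random/kept, per BERT's 80/10/10 rule)
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```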
Ablations
The largest improvements include replacing the dataset (+3.6% GLUE) and increasing the model size (+2.0% GLUE).
Experiments
On GLUE, despite being 100M to 150M parameters smaller than comparable large models, NeoBERT scores 89.0, matching previous state-of-the-art models. GLUE, while dated, is reported for easy comparison with prior encoders.
The MTEB benchmark is a more modern and challenging evaluation, covering 7 task types and 56 English datasets. Because masked language models struggle when their embeddings are evaluated directly, the authors apply a model-agnostic contrastive fine-tuning strategy to all models to keep comparisons fair. Each model is trained with contrastive learning on a dataset of 9 million query-document pairs, using hard negatives and in-batch negatives; training beyond 2,000 steps provides minimal gains.
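The loss below is only a generic sketch of that idea, an InfoNCE-style objective with in-batch negatives plus one hard negative per query; the temperature and embedding sizes are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE-style loss over L2-normalized (batch, dim) embeddings.

    Diagonal query/positive pairs are the targets; all other positives in the
    batch act as in-batch negatives, and neg_emb holds one hard negative per
    query. The temperature is an illustrative choice, not taken from the paper.
    """
    scores_pos = query_emb @ pos_emb.T                            # (batch, batch)
    scores_neg = (query_emb * neg_emb).sum(dim=-1, keepdim=True)  # (batch, 1)
    logits = torch.cat([scores_pos, scores_neg], dim=1) / temperature
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
q, p, n = (F.normalize(torch.randn(8, 768), dim=-1) for _ in range(3))
print(contrastive_loss(q, p, n))
```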
NeoBERT outperforms all large baselines on MTEB-English with a +4.5% relative increase over the second-best model despite having fewer parameters.