Paper Review: NeoBERT: A Next-Generation BERT

Paper

Code

NeoBERT is a next-generation bidirectional encoder: it incorporates state-of-the-art architectural advancements, modern data, and an optimized pre-training recipe to bridge the gap between encoders and powerful autoregressive language models. NeoBERT supports a context length of 4096 tokens while maintaining a compact 250M parameter size. Despite this compact size, it achieves state-of-the-art results on the MTEB benchmark, outperforming larger models under identical fine-tuning conditions.

The Approach

The Architecture

NeoBERT incorporates several architectural improvements to enhance efficiency and performance:

  • Research shows that while most language models suffer from “depth inefficiency,” smaller models like BERT and RoBERTa instead face “width inefficiency.” NeoBERT therefore retains BERT-base’s width of 768 but increases the depth.
  • Traditional absolute positional embeddings struggle with long sequences. NeoBERT uses Rotary Position Embeddings (RoPE) for better extrapolation and supports YaRN (Yet another RoPE extensioN) to handle extended contexts effectively.
  • NeoBERT adopts Pre-Layer Normalization, placing the normalization inside the residual connections, and uses RMSNorm instead of LayerNorm.
  • NeoBERT replaces GELU with SwiGLU, the gated activation function used in models like LLaMA. A minimal sketch of these components follows this list.
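
Below is a minimal PyTorch sketch of these components (RMSNorm, a SwiGLU feed-forward layer, the Pre-LN residual pattern, and a rotary-embedding helper). It illustrates the general techniques only; the module and variable names are assumptions, not NeoBERT’s actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS, no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SwiGLU(nn.Module):
    """Gated feed-forward layer: SiLU(x W_gate) * (x W_up), then a down projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # biases removed, as in the paper
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class PreLNBlock(nn.Module):
    """Pre-LN residual pattern: x + sublayer(norm(x)) for both attention and feed-forward."""
    def __init__(self, dim: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.attn, self.ffn = attn, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))    # normalization is applied before each sub-layer
        return x + self.ffn(self.norm2(x))


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key vectors of shape (batch, seq, heads, head_dim) by position-dependent angles."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Roughly speaking, YaRN builds on this rotary scheme by rescaling the rotation frequencies so the model copes better with contexts longer than those seen during most of training.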

Data

  • NeoBERT is pre-trained on RefinedWeb, a massive dataset with 600B tokens - 18 times larger than RoBERTa’s corpus.
  • NeoBERT uses two-stage pre-training: it is first trained for 1M steps (2T tokens) with a maximum sequence length of 1024 tokens, then for an additional 50K steps (100B tokens) with the maximum sequence length increased to 4096 tokens (see the sketch after this list).
  • Additional sub-datasets are used to expose NeoBERT to longer sequences, ensuring a mix of sequence lengths during training.
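
As a compact summary of the two-stage schedule, here is a hypothetical configuration sketch; the structure and field names are assumptions, while the step counts, token budgets, and sequence lengths are the values reported above.

```python
# Hypothetical configuration for the two training stages (field names are assumptions).
PRETRAINING_STAGES = [
    {"stage": 1, "steps": 1_000_000, "max_seq_len": 1024, "token_budget": "2T"},
    {"stage": 2, "steps": 50_000, "max_seq_len": 4096, "token_budget": "100B"},
]
```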

Pre-training

Following RoBERTa’s example, NeoBERT is pre-trained solely on masked language modeling (dropping next-sentence prediction) with a 20% masking rate, for a total of 2.1T tokens. For efficiency, it uses DeepSpeed ZeRO, FlashAttention, and fused operators (xFormers), keeps model dimensions aligned with the GPU architecture (multiples of 64), and removes biases to simplify computation.
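
As an illustration of the objective, here is a minimal sketch of masking at a 20% rate. The 80/10/10 split between [MASK], random, and unchanged tokens is the standard BERT recipe and, like the function and variable names, an assumption here rather than the authors’ exact pipeline; special tokens are ignored for brevity.

```python
import torch


def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mask_prob: float = 0.20, ignore_index: int = -100):
    """Return (corrupted_inputs, labels) for masked language modeling."""
    labels = input_ids.clone()
    # Select 20% of positions as prediction targets.
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = ignore_index  # the loss is computed only on selected positions

    inputs = input_ids.clone()
    # 80% of selected tokens become [MASK].
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    inputs[to_mask] = mask_token_id
    # Of the remaining selected tokens, half become a random token, half stay unchanged.
    to_randomize = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    inputs[to_randomize] = torch.randint(vocab_size, input_ids.shape)[to_randomize]
    return inputs, labels
```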

Ablations

The largest improvements include replacing the dataset (+3.6% GLUE) and increasing the model size (+2.0% GLUE).

Experiments

Despite being 100M to 150M parameters smaller than comparable large models, NeoBERT achieves an 89.0% GLUE score, matching the performance of previous state-of-the-art models. The GLUE benchmark, while outdated, is reported for easy comparison with prior encoders.

MTEB is a more modern and challenging benchmark, covering 7 tasks and 56 datasets in English. Since traditional masked language models struggle in direct embedding evaluations, the authors apply a model-agnostic contrastive fine-tuning strategy to all models to ensure fair comparisons. Each model is fine-tuned with contrastive learning on a dataset of 9 million query-document pairs, with hard negatives and in-batch negatives; training beyond 2,000 steps provides minimal gains.
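
For reference, here is a minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, the general technique behind this fine-tuning stage. The function name and temperature are illustrative assumptions, and explicit hard negatives (extra columns in the score matrix) are omitted for brevity.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim) embeddings of paired queries and documents."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = q @ d.T / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # diagonal entries are the positive pairs
    return F.cross_entropy(scores, targets)             # other documents in the batch act as negatives
```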

NeoBERT outperforms all large baselines on MTEB-English with a +4.5% relative increase over the second-best model despite having fewer parameters.
