Paper Review: NeoBERT: A Next-Generation BERT
Andrey Lukyanenko
Data Scientist / Machine Learning Engineer. Kaggle Competition Master, Notebooks Top-1.
NeoBERT is a next-generation bidirectional encoder; it incorporates state-of-the-art architectural advancements, modern data, and optimized pre-training to bridge the gap between encoders and powerful autoregressive language models. NeoBERT supports a context length of 4096 tokens while staying at a compact 250M parameters. Despite this, it achieves state-of-the-art results on the MTEB benchmark, outperforming larger models under identical fine-tuning conditions.
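To make this concrete, here is a minimal sketch of using the released checkpoint as a sentence encoder; the Hugging Face repo id "chandar-lab/NeoBERT" and the custom-code flag are assumptions about how the model is distributed, not details from the paper.

```python
# Hypothetical usage sketch: the repo id and trust_remote_code flag are assumptions
# about how the NeoBERT checkpoint is published on the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True)
model = AutoModel.from_pretrained("chandar-lab/NeoBERT", trust_remote_code=True)

text = "NeoBERT handles sequences of up to 4096 tokens."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)

# Mean-pool the final hidden states into a single sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```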
The approach
The architecture
NeoBERT incorporates several architectural improvements to enhance efficiency and performance: rotary positional embeddings (RoPE) instead of absolute positional embeddings, pre-layer RMSNorm in place of post-layer LayerNorm, the SwiGLU activation in the feed-forward blocks, and a deeper-but-not-wider layout chosen to stay close to the optimal depth-to-width ratio at its 250M-parameter scale.
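For illustration only, here is a minimal PyTorch sketch of two of these components, pre-RMSNorm and a bias-free SwiGLU feed-forward block; the dimensions and module names are placeholders, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Feed-forward block with a SiLU-gated linear unit and no bias terms."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_value = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

# Toy check with placeholder sizes.
x = torch.randn(2, 16, 768)
y = SwiGLU(768, 2048)(RMSNorm(768)(x))
print(y.shape)  # (2, 16, 768)
```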
Data
NeoBERT is pre-trained on RefinedWeb, a web-scale corpus that is far larger and more recent than the Wikipedia and BookCorpus data behind BERT and RoBERTa.
Pre-training
Following RoBERTa’s example, NeoBERT is pre-trained solely on masked language modeling, with a 20% masking rate, for 2.1T tokens. For efficiency, it uses DeepSpeed ZeRO, FlashAttention, and fused operators (xFormers), keeps tensor dimensions aligned with the GPU architecture (multiples of 64), and removes biases to simplify computation.
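As a rough illustration of the objective rather than the paper's actual 2.1T-token pipeline, the generic Hugging Face collator can apply a 20% masking rate; the bert-base-uncased tokenizer here is just a stand-in for NeoBERT's own.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative only: bert-base-uncased is a stand-in tokenizer, not NeoBERT's.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.2,  # 20% of tokens selected for masking
)

examples = [tokenizer("Encoders remain widely used in production systems.")]
batch = collator(examples)
print(batch["input_ids"])  # some tokens replaced by [MASK] (or random/kept, per BERT's 80/10/10 rule)
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```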
Ablations
The largest improvements include replacing the dataset (+3.6% GLUE) and increasing the model size (+2.0% GLUE).
Experiments
On GLUE, despite being 100M to 150M parameters smaller than comparable large models, NeoBERT scores 89.0, matching previous state-of-the-art models. GLUE, while dated, is reported for easy comparison with prior encoders.
The MTEB benchmark is a more modern and challenging evaluation, covering 7 task types and 56 English datasets. Because masked language models struggle when their embeddings are evaluated directly, the authors apply a model-agnostic contrastive fine-tuning strategy to all models to keep comparisons fair. Each model is trained with contrastive learning on a dataset of 9 million query-document pairs, using hard negatives and in-batch negatives; training beyond 2,000 steps provides minimal gains.
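The loss below is only a generic sketch of that idea, an InfoNCE-style objective with in-batch negatives plus one hard negative per query; the temperature and embedding sizes are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE-style loss over L2-normalized (batch, dim) embeddings.

    Diagonal query/positive pairs are the targets; all other positives in the
    batch act as in-batch negatives, and neg_emb holds one hard negative per
    query. The temperature is an illustrative choice, not taken from the paper.
    """
    scores_pos = query_emb @ pos_emb.T                            # (batch, batch)
    scores_neg = (query_emb * neg_emb).sum(dim=-1, keepdim=True)  # (batch, 1)
    logits = torch.cat([scores_pos, scores_neg], dim=1) / temperature
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
q, p, n = (F.normalize(torch.randn(8, 768), dim=-1) for _ in range(3))
print(contrastive_loss(q, p, n))
```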
NeoBERT outperforms all large baselines on MTEB-English with a +4.5% relative increase over the second-best model despite having fewer parameters.