DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Today's paper introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. The model is further pre-trained from DeepSeek-V2 with an additional 6 trillion tokens, significantly enhancing its coding and mathematical reasoning capabilities while maintaining strong general language performance. DeepSeek-Coder-V2 expands support to 338 programming languages and extends the context length to 128K tokens.
Method Overview
DeepSeek-Coder-V2 is built upon the foundation of DeepSeek-V2 and undergoes additional pre-training with a carefully curated dataset. The pre-training corpus consists of 60% source code (1,170B tokens), 10% math content (221B tokens), and 30% natural language data. This diverse dataset exposes the model to a wide range of programming languages and mathematical concepts.
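To make the mixture concrete, here is a minimal sketch of how a training loop could sample documents according to these ratios. The 60/10/30 split and approximate token counts come from the paper; the source names, sampling loop, and function are illustrative assumptions, not the authors' actual data pipeline.

```python
import random

# Approximate corpus mixture reported in the paper (by token share).
# The source labels and sampling loop are illustrative, not the authors' pipeline.
MIXTURE = {
    "source_code": 0.60,       # ~1,170B tokens across 338 programming languages
    "math": 0.10,              # ~221B tokens of math-related content
    "natural_language": 0.30,  # general natural language data
}

def sample_source(mixture: dict) -> str:
    """Pick which corpus to draw the next training document from."""
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    counts = {name: 0 for name in MIXTURE}
    for _ in range(10_000):
        counts[sample_source(MIXTURE)] += 1
    print(counts)  # counts land roughly in a 60/10/30 ratio
```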
The training process involves two main phases: pre-training and alignment. During pre-training, the model is exposed to the multi-source corpus, allowing it to learn patterns and structures from code, math, and general language data. This phase significantly enhances the model's coding and mathematical reasoning abilities.
In the alignment phase, the model undergoes fine-tuning using an instruction dataset that includes code, math, and general instruction data. This is followed by reinforcement learning using the Group Relative Policy Optimization (GRPO) algorithm. The GRPO process aligns the model's behavior with human preferences, particularly in coding tasks. Preference data is collected using compiler feedback and test cases, and a reward model guides the policy model's training.
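The core idea of GRPO is to score a group of sampled completions for the same prompt and use each completion's reward relative to its group as the advantage, avoiding a separate value network. The sketch below illustrates that group-relative advantage and a PPO-style clipped loss under simple assumptions (scalar rewards per completion, e.g. from a reward model trained on compiler/test-case feedback); the KL regularization used in practice is omitted, and tensor shapes are illustrative rather than taken from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled completions.
    Returns advantages of the same shape: each completion's reward normalized
    against the other completions sampled for the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped objective, but driven by group-relative advantages
    instead of a learned value function (the key simplification in GRPO).
    KL penalty against a reference policy is left out for brevity."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```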
To support code completion, the model also incorporates a Fill-In-the-Middle (FIM) training objective. This allows DeepSeek-Coder-V2 to effectively complete a missing span of code given the surrounding context, as illustrated below.
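A minimal sketch of FIM data construction: a document is split into prefix, middle, and suffix, then rearranged so the model learns to predict the middle from its surroundings. The sentinel token names here are placeholders; the actual special tokens are defined by the model's tokenizer and may differ.

```python
import random

# Placeholder sentinel tokens; the real special tokens come from the tokenizer.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(code: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and emit it in
    prefix-suffix-middle order, so the model is trained to fill the hole."""
    a, b = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

if __name__ == "__main__":
    rng = random.Random(0)
    snippet = "def add(x, y):\n    return x + y\n"
    print(make_fim_example(snippet, rng))
```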
The model's architecture is based on a Mixture-of-Experts (MoE) design, which allows for efficient scaling and specialized knowledge across different domains. The context length is extended to 128K tokens, enabling the model to handle more complex and extensive coding tasks.
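For intuition, here is a toy top-k routed MoE feed-forward layer. It only illustrates the general idea of routing each token to a few expert MLPs and mixing their outputs; the actual DeepSeek-V2 architecture adds refinements such as shared experts, load-balancing objectives, and efficient batched dispatch, and the dimensions and expert counts below are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router picks top-k experts per token
    and combines their outputs with the router's weights."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)  # (num_tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e          # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TopKMoE(d_model=64, d_ff=256)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])
```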
Results
DeepSeek-Coder-V2 demonstrates impressive performance across code and math benchmarks such as HumanEval, MBPP+, LiveCodeBench, MATH, and GSM8K, matching or exceeding GPT4-Turbo on several of them.
The model also maintains comparable performance to DeepSeek-V2 in general language tasks, showcasing its versatility.
Conclusion
DeepSeek-Coder-V2 represents a significant advancement in open-source code language models, achieving performance comparable to or exceeding closed-source models like GPT4-Turbo in various code and math benchmarks. By combining a diverse pre-training corpus, extended context length, and advanced alignment techniques, the model demonstrates strong capabilities in coding, mathematical reasoning, and general language understanding. This work narrows the gap between open-source and closed-source models in the field of code generation. For more information please consult the full paper.
Congrats to the authors for their work!
Zhu, Qihao, et al. "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence." arXiv preprint arXiv:2406.11931 (2024).