DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Credit: https://arxiv.org/pdf/2406.11931

Today's paper introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. The model is further pre-trained from DeepSeek-V2 with an additional 6 trillion tokens, significantly enhancing its coding and mathematical reasoning capabilities while maintaining strong general language performance. DeepSeek-Coder-V2 expands support to 338 programming languages and extends the context length to 128K tokens.

Method Overview

DeepSeek-Coder-V2 is built upon the foundation of DeepSeek-V2 and undergoes additional pre-training with a carefully curated dataset. The pre-training corpus consists of 60% source code (1,170B tokens), 10% math content (221B tokens), and 30% natural language data. This diverse dataset exposes the model to a wide range of programming languages and mathematical concepts.
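
To make the mixture concrete, here is a small, purely illustrative Python sketch of drawing training documents in those proportions. The 60/10/30 split mirrors the paper's reported composition, but the sampling function itself is an assumption for illustration, not the authors' data pipeline.

```python
import random

# 60% code, 10% math, 30% natural language, per the paper's reported mixture.
MIXTURE = {"code": 0.60, "math": 0.10, "natural_language": 0.30}

def sample_corpus(rng: random.Random) -> str:
    # Choose which sub-corpus the next training document is drawn from,
    # in proportion to the mixture weights above.
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
print([sample_corpus(rng) for _ in range(5)])
```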

The training process involves two main phases: pre-training and alignment. During pre-training, the model is exposed to the multi-source corpus, allowing it to learn patterns and structures from code, math, and general language data. This phase significantly enhances the model's coding and mathematical reasoning abilities.

In the alignment phase, the model undergoes fine-tuning using an instruction dataset that includes code, math, and general instruction data. This is followed by reinforcement learning using the Group Relative Policy Optimization (GRPO) algorithm. The GRPO process aligns the model's behavior with human preferences, particularly in coding tasks. Preference data is collected using compiler feedback and test cases, and a reward model guides the policy model's training.
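
For intuition, here is a minimal PyTorch sketch of the group-relative idea behind GRPO: several outputs are sampled per prompt, and each output's advantage is its reward normalized by the mean and standard deviation of its own group, so no separate value network is needed. The function names, the sequence-level log-probabilities, and the hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scores, e.g. from a reward model
    # trained on compiler feedback and test cases. Each sampled output's
    # advantage is its reward normalized within its own group.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    return (rewards - mean) / std

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_coef=0.04):
    # Sequence-level log-probabilities, all shaped (num_prompts, group_size).
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Unbiased (k3) estimator of KL(policy || reference), keeping the policy
    # close to the supervised fine-tuned model.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1).mean()
    return policy_loss + kl_coef * kl
```

Because the baseline comes from the group itself, GRPO dispenses with the critic network used in PPO, which keeps the memory footprint of RL training manageable for a model of this size.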

To support code completion, the model also incorporates a Fill-In-the-Middle (FIM) training objective, which allows DeepSeek-Coder-V2 to complete a partial code snippet given the code both before and after the gap.
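
As a rough illustration, the sketch below constructs one FIM training example in prefix-suffix-middle order. The sentinel strings are placeholders of my own choosing; the real special tokens are defined by the model's tokenizer and may differ.

```python
import random

# Hypothetical sentinel tokens used only for this sketch.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def make_fim_example(code: str, rng: random.Random) -> str:
    # Split a source file into prefix / middle / suffix at two random points,
    # then rearrange it so the model learns to generate the missing middle
    # conditioned on both the prefix and the suffix.
    i, j = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(make_fim_example("def add(a, b):\n    return a + b\n", random.Random(0)))
```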

The model's architecture is based on a Mixture-of-Experts (MoE) design, which allows for efficient scaling and specialized knowledge across different domains. The context length is extended to 128K tokens, enabling the model to handle more complex and extensive coding tasks.
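
The sketch below shows a generic top-k routed MoE layer of the kind such architectures build on: a gating network scores the experts, each token is dispatched to its top-k experts, and the expert outputs are mixed with renormalized gate weights. This is a simplified stand-in for intuition, not DeepSeek-V2's exact DeepSeekMoE design (which, for example, also employs shared experts).

```python
import torch
import torch.nn.functional as F

def topk_moe_forward(x, experts, gate, k=2):
    # x: (tokens, d_model); experts: list of feed-forward modules;
    # gate: nn.Linear(d_model, n_experts) producing routing scores.
    scores = F.softmax(gate(x), dim=-1)            # (tokens, n_experts)
    weights, indices = scores.topk(k, dim=-1)      # (tokens, k)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen k
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = indices[:, slot] == e           # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Toy usage: 4 small experts over a batch of 8 token embeddings.
d, n_exp = 16, 4
experts = [torch.nn.Linear(d, d) for _ in range(n_exp)]
gate = torch.nn.Linear(d, n_exp)
y = topk_moe_forward(torch.randn(8, d), experts, gate, k=2)
```

Only the selected experts run for each token, which is what lets an MoE model scale total parameter count while keeping per-token compute roughly constant.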

Results

DeepSeek-Coder-V2 demonstrates impressive performance across various code and math benchmarks:

  • Achieves 90.2% accuracy on HumanEval (Python), outperforming GPT4-Turbo (88.2%) and other closed-source models.
  • Scores 75.3% on MBPP+, surpassing GPT4-Turbo (72.2%) and other competitors.
  • Performs exceptionally well on GSM8K with 94.9% accuracy, slightly below GPT4-Turbo (95.0%) but ahead of other models.
  • Excels in practical coding benchmarks such as Aider (73.7%) and LiveCodeBench (43.4%), outperforming most other models.

The model also maintains comparable performance to DeepSeek-V2 in general language tasks, showcasing its versatility.

Conclusion

DeepSeek-Coder-V2 represents a significant advancement in open-source code language models, achieving performance comparable to or exceeding closed-source models like GPT4-Turbo on various code and math benchmarks. By combining a diverse pre-training corpus, an extended context length, and advanced alignment techniques, the model demonstrates strong capabilities in coding, mathematical reasoning, and general language understanding. This work narrows the gap between open-source and closed-source models in code intelligence. For more information, please consult the full paper.

Congrats to the authors for their work!

Zhu, Qihao, et al. "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence." arXiv preprint arXiv:2406.11931 (2024).
