The first releases of Code LLM - Code Intelligence Breakdown | Multi-Program Synthesis

Hello All,

This is Raghul Gopal, an AWS Community Builder (ML & GenAI) and a research enthusiast in AI & AGI. Welcome to Week 1 of the Learn with Me newsletter, where I focus on the advancements of Generative AI.

1. DeepSeek-Coder – When the LLM Meets Programming

DeepSeek-Coder is a base model series positioned at the frontier alongside models such as Code Llama and Mistral 7B. It is a series of open-source models ranging from 1.3B to 33B parameters, trained from scratch on 2 trillion tokens spanning 87 programming languages. The models are pre-trained on a high-quality, project-level code corpus and employ a fill-in-the-blank task with a 16K context window to enhance code generation and infilling. DeepSeek-Coder also surpasses existing closed-source models such as Codex and GPT-3.5, paving the way for open-source models.

A major challenge for LLMs in code generation and debugging is the performance gap between open-source and closed-source models. DeepSeek-Coder was released to help close this gap. It is built for a comprehensive understanding of programming languages and syntax, and in addition to the next-token prediction loss used during pre-training, it employs the Fill-in-the-Middle (FIM) approach.

Let’s jot down the key points about the DeepSeek-Coder series:

  1. DeepSeek-Coder Base 33B delivers superior performance across all benchmarks.
  2. DeepSeek-Coder Instruct 33B surpasses OpenAI GPT-3.5 Turbo.
  3. DeepSeek-Coder Base 7B delivers performance competitive with Code Llama 34B, a model roughly five times larger.

One of the fascinating things about DeepSeek-Coder is its pre-training setup. The data is organized at the repository level so that the model learns to use context across files within a repository. Let’s look at the composition of the data collected from various sources below.

  • 87% source code
  • 10% English code-related natural language corpus
  • 3% code-unrelated Chinese natural language corpus
  • English corpus: GitHub Markdown and Stack Exchange, used for tasks such as library usage and bug fixing
  • Chinese corpus: high-quality articles aimed at improving the model’s proficiency in understanding the Chinese language

The data collection process involved GitHub data crawling, data filtering, and dependency parsing. Previous LLMs were pre-trained on file-level source code, ignoring the dependencies between different files in a project. This research leverages the dependencies between files within the same repository: it parses those dependencies and arranges the files so that the context each file relies on is placed before that file in the input sequence, which more accurately reflects real coding practices and project structure.

Dependency analysis over the files of a project is handled by an algorithm the paper calls Topological Sort for Dependency Analysis. In outline (a simplified sketch follows this list):

  • Build the dependency graph between the files in a repository and track each file’s number of unresolved dependencies.
  • Identify disconnected subgraphs of the overall dependency graph and order each one with a modified topological sort.
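
To make this concrete, here is a minimal Python sketch of ordering a repository’s files by their dependencies. It is a simplified stand-in for the paper’s modified topological sort: the dependency map, the tie-breaking rule (fewest unresolved dependencies, which also lets the loop make progress on cyclic graphs), and the example file names are illustrative assumptions rather than the authors’ exact algorithm.

```python
def order_files_by_dependency(files: dict[str, set[str]]) -> list[str]:
    """Order files so that, whenever possible, a file appears after the
    files it depends on. `files` maps a filename to the set of repository
    files it imports/includes."""
    remaining = {f: {d for d in deps if d in files} for f, deps in files.items()}
    ordered = []
    while remaining:
        # A strict topological sort would require zero unresolved dependencies;
        # taking the minimum instead keeps the loop moving even when the
        # dependency graph contains cycles.
        nxt = min(remaining, key=lambda f: len(remaining[f]))
        ordered.append(nxt)
        del remaining[nxt]
        for deps in remaining.values():
            deps.discard(nxt)
    return ordered

# Example: c.py imports b.py, and b.py imports a.py.
print(order_files_by_dependency({"a.py": set(), "b.py": {"a.py"}, "c.py": {"b.py"}}))
# -> ['a.py', 'b.py', 'c.py']
```

The ordered files are then concatenated, so the context a file relies on appears earlier in the training sequence.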

Now, let’s talk about repository-level deduplication. Deduplicating LLM training datasets has been shown to yield significant performance improvements, and this research deduplicates at the repository level rather than the file level. In addition to the filtering done during the GitHub crawling stage, a compiler and a quality model with heuristic rules are used to filter out low-quality data, including code with syntax errors, poor readability, or low modularity. Finally, to ensure the training data is not contaminated with test-set material, an n-gram filtering process removes any code segment that matches specific criteria (a sketch of the idea follows).
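
A rough sketch of what n-gram-based decontamination can look like is below. The 10-gram threshold, whitespace tokenization, and function names are illustrative assumptions; the paper applies its own matching criteria.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_items: list[str], n: int = 10) -> bool:
    """Flag a training sample if any of its n-grams also appears in a
    benchmark prompt or solution."""
    sample_grams = ngrams(sample.split(), n)
    return any(sample_grams & ngrams(item.split(), n) for item in benchmark_items)

# Usage: drop any crawled file for which is_contaminated(file_text, test_set) is True.
```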

Now, let’s talk about the training objectives used in DeepSeek-Coder. The first is next-token prediction. The second is the Fill-in-the-Middle (FIM) approach: the text is randomly split into three parts, a prefix, a middle, and a suffix, which are then shuffled and joined with special sentinel tokens. There are two serialization modes, PSM and SPM, where P stands for prefix, S for suffix, and M for middle.
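
Here is a minimal sketch of the PSM serialization. The sentinel strings are placeholders (the real sentinels are dedicated special tokens in the model’s vocabulary), and SPM simply places the suffix segment before the prefix.

```python
import random

# Placeholder sentinel strings; real FIM sentinels are special vocabulary tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"

def fim_psm(doc: str) -> str:
    """PSM mode: emit prefix, then suffix, then the middle the model must fill in.
    Assumes a non-empty document."""
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(fim_psm("def add(a, b):\n    return a + b\n"))
```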

The Hugging Face Tokenizer library was used to train a byte-pair encoding (BPE) tokenizer with a vocabulary size of 32,000 (a minimal training sketch appears after the architecture list below). The model architecture is as follows:

  • Model sizes: 1.3B, 6.7B, and 33B parameters
  • Built on the DeepSeek LLM architecture outlined by DeepSeek AI
  • Decoder-only architecture incorporating Rotary Position Embedding (RoPE)
  • Grouped Query Attention (GQA) with a group size of 8, integrated to enhance both training and inference efficiency
  • FlashAttention v2, to expedite the computation involved in the attention mechanism
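
Returning to the tokenizer mentioned above, here is a minimal sketch of training a 32,000-token BPE tokenizer with the Hugging Face tokenizers library. The corpus file name and the special-token names are assumptions, not the paper’s exact configuration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer with a 32,000-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],  # assumed token names
)
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)  # assumed corpus file
tokenizer.save("bpe_tokenizer.json")
```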

For optimization, they used the AdamW optimizer with beta1 = 0.9 and beta2 = 0.95. Training uses the HAI-LLM framework, which incorporates several parallelism strategies to optimize computational efficiency: tensor parallelism, ZeRO data parallelism, and PipeDream pipeline parallelism.
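
A minimal sketch of the optimizer setup in PyTorch is below. The betas come from the paper; the learning rate, weight decay, and the toy model are illustrative placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # placeholder learning rate, not from the text above
    betas=(0.9, 0.95),  # beta1 and beta2 as reported
    weight_decay=0.1,   # placeholder, a common value for LLM pre-training
)
```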

Finally, DeepSeek-Coder Base was enhanced through instruction-based fine-tuning on high-quality data to produce DeepSeek-Coder Instruct. This data is structured in the Alpaca instruction format.
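
For reference, here is what a single Alpaca-format record looks like, along with one common way of flattening it into a training prompt. The example content is invented, and the template shown is the standard Alpaca one rather than DeepSeek’s exact wording.

```python
# One training record in the Alpaca instruction format.
record = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",
    "output": "def is_palindrome(s: str) -> bool:\n    s = s.lower()\n    return s == s[::-1]",
}

# Standard Alpaca prompt template (no-input variant).
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{record['instruction']}\n\n"
    f"### Response:\n{record['output']}"
)
print(prompt)
```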

[Figures in the original article: hyperparameters of DeepSeek-Coder, an example output from DeepSeek-Coder, performance on the multilingual HumanEval and MBPP benchmarks, and performance on the LeetCode Contest benchmark.]

Access the paper from this link: https://arxiv.org/abs/2401.14196

2. CodeGen – An Open-Source Large Language Model for Code with Multi-Turn Program Synthesis

CodeGen is an open-source language model for code with multi-turn program synthesis. It is a family of large language models with up to 16.1B parameters, trained on natural language and programming language data, released together with an open-source training library called JAXFORMER. The researchers also constructed MTPB, the Multi-Turn Programming Benchmark, which contains 115 diverse problems factorized into multi-turn prompts.

Two key challenges when striving to achieve program synthesis are:

  1. The intractability of the search space (it is infeasible to search the space of programs exhaustively).
  2. The difficulty of properly specifying user intent.

To remain expressive, the search space must be large. One way to navigate this enormous program space is to learn a conditional distribution of the next token given the preceding tokens, leveraging transformers; this is where multi-turn program synthesis comes in.

In multi-turn program synthesis, the user communicates with the synthesis system by providing specifications in natural language, and the system responds with synthesized subprograms, so that the user and the system together complete the program over multiple turns.
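
A toy sketch of that loop is below, assuming a hypothetical generate() wrapper around a code LLM; the turn contents and the wrapper are invented for illustration.

```python
def generate(prompt: str) -> str:
    # Placeholder: in practice this would call a code LLM (e.g., CodeGen)
    # and return its continuation of `prompt`.
    return "<model-generated code for the latest step>"

turns = [
    "# Step 1: load 'data.csv' into a pandas DataFrame called df",
    "# Step 2: keep only rows where the 'score' column is above 0.5",
    "# Step 3: write the filtered DataFrame to 'filtered.csv' without the index",
]

program = ""
for turn in turns:
    # Each new specification is appended to everything synthesized so far,
    # so the model completes the program incrementally, turn by turn.
    program += turn + "\n"
    program += generate(program) + "\n"
print(program)
```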

CodeGen is an autoregressive model that predicts the next token given the previously generated tokens, trained on a natural language corpus and programming language data curated from GitHub. The family of CodeGen models is trained sequentially on three datasets: The Pile, BigQuery, and BigPython. Preprocessing involves filtering, deduplication, tokenization, shuffling, and concatenation.

Autoregressive models predict the next token conditioned on the previously generated tokens (a minimal decoding sketch follows).
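
As a concrete example, here is a bare-bones greedy decoding loop over the publicly released Salesforce/codegen-350M-mono checkpoint via Hugging Face transformers. The prompt and the 64-token budget are arbitrary choices; in practice you would call model.generate() instead of writing the loop by hand.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
model.eval()

prompt = "# Return the factorial of n\ndef factorial(n):"
ids = tok(prompt, return_tensors="pt").input_ids

# Greedy autoregressive loop: each new token is conditioned on every token
# generated so far, then appended to the sequence.
for _ in range(64):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```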

Let’s look at the model specifications for the CodeGen family.

  1. Number of model parameters: 350M, 2.7B, 6.1B, and 16.1B
  2. Transformer decoder with left-to-right causal masking
  3. Positional encoding: rotary position embedding (RoPE)
  4. Forward pass: self-attention and feed-forward circuits computed in parallel for improved efficiency (a sketch follows this list)
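
Below is a minimal PyTorch sketch of the “parallel” block pattern referenced in point 4, where self-attention and the feed-forward network read the same normalized input and their outputs are summed into the residual stream. It illustrates the general GPT-J-style design rather than CodeGen’s exact implementation; rotary position embeddings and other details are omitted.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block where attention and the feed-forward network run on
    the same layer-normalized input, instead of one after the other."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)  # single shared LayerNorm
        seq_len = x.size(1)
        # Left-to-right causal mask: position i may not attend to positions > i.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + attn_out + self.ff(h)  # parallel residual sum

# Quick shape check.
block = ParallelBlock(d_model=256, n_heads=8)
print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```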

[Figures in the original article: evaluation results of the CodeGen family of models and an example of multi-turn program synthesis.]

Access the paper from here: https://arxiv.org/abs/2203.13474

Access the models and training code from this repo: https://github.com/salesforce/CodeGen

These two models have served as a foundation for many researchers and for many of the open-source program synthesis models that followed.

That’s it for this week. Happy Day, Happy AI.

Follow me, Raghul Gopal, here to learn more about new releases in AI and AGI, explained clearly.
