The first releases of Code LLM - Code Intelligence Breakdown | Multi-Program Synthesis

Hello All,

This is Raghul Gopal, an AWS Community Builder (ML & GenAI) and a research enthusiast in AI & AGI. Welcome to Week 1 of the Learn with Me newsletter, where I focus on the advancements of Generative AI.

1. DeepSeek-Coder – When the LLM Meets Programming

DeepSeek-Coder is a base model series positioned at the frontier alongside models such as Code Llama and Mistral 7B. It is a series of open-source models ranging from 1.3B to 33B parameters, trained from scratch on 2 trillion tokens spanning 87 programming languages. The models are pre-trained on a high-quality, project-level code corpus and employ a fill-in-the-blank task with a 16K context window to enhance code generation and infilling. DeepSeek-Coder also surpasses existing closed-source models such as Codex and GPT-3.5, paving the way for open-source models.

A major challenge for LLMs in code generation and debugging is the performance gap between open-source and closed-source models. DeepSeek-Coder was released to help close this gap. It is built for a comprehensive understanding of programming languages and syntax, and in addition to the next-token prediction loss used during pre-training, it employs the Fill-in-the-Middle (FIM) approach.

Let’s jot down the key points about the DeepSeek-Coder series:

  1. DeepSeek-Coder Base 33B delivers superior performance across all benchmarks.
  2. DeepSeek-Coder Instruct 33B surpasses OpenAI GPT-3.5 Turbo.
  3. DeepSeek-Coder Base 7B delivers performance competitive with Code Llama 34B, a model roughly five times larger.

One of the fascinating things about DeepSeek-Coder is its pre-training setup. The data is organized at the repository level so that the model learns to use context across files within a repository. Let’s look at the composition of the data collected from various sources below.

  • 87% source code
  • 10% English code-related natural language corpus
  • 3% code-unrelated Chinese natural language corpus
  • English corpus: GitHub Markdown and Stack Exchange, used for tasks such as library usage and bug fixing
  • Chinese corpus: high-quality articles aimed at improving the model’s proficiency in understanding the Chinese language

The data collection process involved GitHub data crawling, data filtering, and dependency parsing. Previous LLMs were pre-trained on file-level source code, ignoring the dependencies between different files in a project. This research leverages the dependencies between files within the same repository: it parses those dependencies and arranges the files so that the context each file relies on is placed before that file in the input sequence, which more accurately reflects real coding practices and project structure.

Dependency analysis over the files of a project is handled by an algorithm the paper calls Topological Sort for Dependency Analysis. In outline (a simplified sketch follows this list):

  • Build the dependency graph between the files in a repository and track each file’s number of unresolved dependencies.
  • Identify disconnected subgraphs of the overall dependency graph and order each one with a modified topological sort.
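
To make this concrete, here is a minimal Python sketch of ordering a repository’s files by their dependencies. It is a simplified stand-in for the paper’s modified topological sort: the dependency map, the tie-breaking rule (fewest unresolved dependencies, which also lets the loop make progress on cyclic graphs), and the example file names are illustrative assumptions rather than the authors’ exact algorithm.

```python
def order_files_by_dependency(files: dict[str, set[str]]) -> list[str]:
    """Order files so that, whenever possible, a file appears after the
    files it depends on. `files` maps a filename to the set of repository
    files it imports/includes."""
    remaining = {f: {d for d in deps if d in files} for f, deps in files.items()}
    ordered = []
    while remaining:
        # A strict topological sort would require zero unresolved dependencies;
        # taking the minimum instead keeps the loop moving even when the
        # dependency graph contains cycles.
        nxt = min(remaining, key=lambda f: len(remaining[f]))
        ordered.append(nxt)
        del remaining[nxt]
        for deps in remaining.values():
            deps.discard(nxt)
    return ordered

# Example: c.py imports b.py, and b.py imports a.py.
print(order_files_by_dependency({"a.py": set(), "b.py": {"a.py"}, "c.py": {"b.py"}}))
# -> ['a.py', 'b.py', 'c.py']
```

The ordered files are then concatenated, so the context a file relies on appears earlier in the training sequence.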

Now, let’s talk about repository-level deduplication. Deduplicating LLM training datasets has been shown to yield significant performance improvements, and this research deduplicates at the repository level rather than the file level. In addition to the filtering done during the GitHub crawling stage, a compiler and a quality model with heuristic rules are used to filter out low-quality data, including code with syntax errors, poor readability, or low modularity. Finally, to ensure the training data is not contaminated with test-set material, an n-gram filtering process removes any code segment that matches specific criteria (a sketch of the idea follows).
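
A rough sketch of what n-gram-based decontamination can look like is below. The 10-gram threshold, whitespace tokenization, and function names are illustrative assumptions; the paper applies its own matching criteria.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_items: list[str], n: int = 10) -> bool:
    """Flag a training sample if any of its n-grams also appears in a
    benchmark prompt or solution."""
    sample_grams = ngrams(sample.split(), n)
    return any(sample_grams & ngrams(item.split(), n) for item in benchmark_items)

# Usage: drop any crawled file for which is_contaminated(file_text, test_set) is True.
```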

Now, let’s talk about the training objectives used in DeepSeek-Coder. The first is next-token prediction. The second is the Fill-in-the-Middle (FIM) approach: the text is randomly split into three parts, a prefix, a middle, and a suffix, which are then shuffled and joined with special sentinel tokens. There are two serialization modes, PSM and SPM, where P stands for prefix, S for suffix, and M for middle.
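
Here is a minimal sketch of the PSM serialization. The sentinel strings are placeholders (the real sentinels are dedicated special tokens in the model’s vocabulary), and SPM simply places the suffix segment before the prefix.

```python
import random

# Placeholder sentinel strings; real FIM sentinels are special vocabulary tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"

def fim_psm(doc: str) -> str:
    """PSM mode: emit prefix, then suffix, then the middle the model must fill in.
    Assumes a non-empty document."""
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(fim_psm("def add(a, b):\n    return a + b\n"))
```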

The Hugging Face Tokenizer library was used to train a byte-pair encoding (BPE) tokenizer with a vocabulary size of 32,000 (a minimal training sketch appears after the architecture list below). The model architecture is as follows:

  • Model sizes: 1.3B, 6.7B, and 33B parameters
  • Built on the DeepSeek LLM architecture outlined by DeepSeek AI
  • Decoder-only architecture incorporating Rotary Position Embedding (RoPE)
  • Grouped Query Attention (GQA) with a group size of 8, integrated to enhance both training and inference efficiency
  • FlashAttention v2, to expedite the computation involved in the attention mechanism
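
Returning to the tokenizer mentioned above, here is a minimal sketch of training a 32,000-token BPE tokenizer with the Hugging Face tokenizers library. The corpus file name and the special-token names are assumptions, not the paper’s exact configuration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer with a 32,000-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],  # assumed token names
)
tokenizer.train(files=["code_corpus.txt"], trainer=trainer)  # assumed corpus file
tokenizer.save("bpe_tokenizer.json")
```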

For optimization, they used the AdamW optimizer with beta1 = 0.9 and beta2 = 0.95. Training uses the HAI-LLM framework, which incorporates several parallelism strategies to optimize computational efficiency: tensor parallelism, ZeRO data parallelism, and PipeDream pipeline parallelism.
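
A minimal sketch of the optimizer setup in PyTorch is below. The betas come from the paper; the learning rate, weight decay, and the toy model are illustrative placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # placeholder learning rate, not from the text above
    betas=(0.9, 0.95),  # beta1 and beta2 as reported
    weight_decay=0.1,   # placeholder, a common value for LLM pre-training
)
```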

Finally, DeepSeek-Coder Base was enhanced through instruction-based fine-tuning on high-quality data to produce DeepSeek-Coder Instruct. This data is structured in the Alpaca instruction format.
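
For reference, here is what a single Alpaca-format record looks like, along with one common way of flattening it into a training prompt. The example content is invented, and the template shown is the standard Alpaca one rather than DeepSeek’s exact wording.

```python
# One training record in the Alpaca instruction format.
record = {
    "instruction": "Write a Python function that checks whether a string is a palindrome.",
    "input": "",
    "output": "def is_palindrome(s: str) -> bool:\n    s = s.lower()\n    return s == s[::-1]",
}

# Standard Alpaca prompt template (no-input variant).
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{record['instruction']}\n\n"
    f"### Response:\n{record['output']}"
)
print(prompt)
```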

[Figures in the original article: hyperparameters of DeepSeek-Coder, an example output from DeepSeek-Coder, performance on the multilingual HumanEval and MBPP benchmarks, and performance on the LeetCode Contest benchmark.]

Access the paper from this link: https://arxiv.org/abs/2401.14196

2. CodeGen – An Open-Source Large Language Model for Code with Multi-Turn Program Synthesis

CodeGen is an open-source language model for code with multi-turn program synthesis. It is a family of large language models with up to 16.1B parameters, trained on natural language and programming language data, released together with an open-source training library called JAXFORMER. The researchers also constructed MTPB, the Multi-Turn Programming Benchmark, which contains 115 diverse problems factorized into multi-turn prompts.

Two key challenges when striving to achieve program synthesis are:

  1. The intractability of the search space (it is infeasible to search the space of programs exhaustively).
  2. The difficulty of properly specifying user intent.

To remain expressive, the search space must be large. One way to navigate this enormous program space is to learn a conditional distribution of the next token given the preceding tokens, leveraging transformers; this is where multi-turn program synthesis comes in.

In multi-turn program synthesis, the user communicates with the synthesis system by providing specifications in natural language, and the system responds with synthesized subprograms, so that the user and the system together complete the program over multiple turns.
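
A toy sketch of that loop is below, assuming a hypothetical generate() wrapper around a code LLM; the turn contents and the wrapper are invented for illustration.

```python
def generate(prompt: str) -> str:
    # Placeholder: in practice this would call a code LLM (e.g., CodeGen)
    # and return its continuation of `prompt`.
    return "<model-generated code for the latest step>"

turns = [
    "# Step 1: load 'data.csv' into a pandas DataFrame called df",
    "# Step 2: keep only rows where the 'score' column is above 0.5",
    "# Step 3: write the filtered DataFrame to 'filtered.csv' without the index",
]

program = ""
for turn in turns:
    # Each new specification is appended to everything synthesized so far,
    # so the model completes the program incrementally, turn by turn.
    program += turn + "\n"
    program += generate(program) + "\n"
print(program)
```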

CodeGen is an autoregressive model that predicts the next token given the previously generated tokens, trained on a natural language corpus and programming language data curated from GitHub. The family of CodeGen models is trained sequentially on three datasets: The Pile, BigQuery, and BigPython. Preprocessing involves filtering, deduplication, tokenization, shuffling, and concatenation.

Autoregressive models predict the next token conditioned on the previously generated tokens (a minimal decoding sketch follows).
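
As a concrete example, here is a bare-bones greedy decoding loop over the publicly released Salesforce/codegen-350M-mono checkpoint via Hugging Face transformers. The prompt and the 64-token budget are arbitrary choices; in practice you would call model.generate() instead of writing the loop by hand.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
model.eval()

prompt = "# Return the factorial of n\ndef factorial(n):"
ids = tok(prompt, return_tensors="pt").input_ids

# Greedy autoregressive loop: each new token is conditioned on every token
# generated so far, then appended to the sequence.
for _ in range(64):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```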

Let’s look at the model specifications for the CodeGen family.

  1. Number of model parameters: 350M, 2.7B, 6.1B, and 16.1B
  2. Transformer decoder with left-to-right causal masking
  3. Positional encoding: rotary position embedding (RoPE)
  4. Forward pass: self-attention and feed-forward circuits computed in parallel for improved efficiency (a sketch follows this list)
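
Below is a minimal PyTorch sketch of the “parallel” block pattern referenced in point 4, where self-attention and the feed-forward network read the same normalized input and their outputs are summed into the residual stream. It illustrates the general GPT-J-style design rather than CodeGen’s exact implementation; rotary position embeddings and other details are omitted.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Transformer block where attention and the feed-forward network run on
    the same layer-normalized input, instead of one after the other."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)  # single shared LayerNorm
        seq_len = x.size(1)
        # Left-to-right causal mask: position i may not attend to positions > i.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + attn_out + self.ff(h)  # parallel residual sum

# Quick shape check.
block = ParallelBlock(d_model=256, n_heads=8)
print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```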

[Figures in the original article: evaluation results of the CodeGen family of models and an example of multi-turn program synthesis.]

Access the paper from here: https://arxiv.org/abs/2203.13474

Access the models and training code from this repo: https://github.com/salesforce/CodeGen

These two models have served as a foundation for many researchers and for many of the open-source program synthesis models that followed.

That’s it for this week. Happy Day, Happy AI.

Follow me, Raghul Gopal, here to learn more about new releases in AI and AGI, explained clearly.
