Gen AI - Generating Code using Advanced Large Language Models
DeepSeek Coder - Generating Code using Advanced Large Language Models

Today's article covers a very interesting topic: how to use Large Language Models (LLMs) to generate code.

In this article I will first cover the model architecture of code-generating LLMs, followed by the data used for pre-training and the instruction tuning of such LLMs, and finally an example of code generation using an LLM.

I will explain all of these concepts using the DeepSeek-Coder LLM.

DeepSeek-Coder is a state-of-the-art code-generating LLM from DeepSeek-AI (arXiv: 2401.14196).

So, let's dive in. Let's start with the model architecture.

In general, code-generating LLMs are decoder-only Transformer models.

The Transformer model generates code token by token, just like any other text generation; the only difference is that the tokens generated are code tokens, since code is also text.
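To make the token-by-token idea concrete, below is a minimal sketch of a greedy decoding loop using the Hugging Face transformers library. The model name and prompt are illustrative assumptions on my part, and in practice you would simply call model.generate() as shown later in this article.

===========================================

# Minimal sketch of token-by-token (greedy) code generation; illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-coder-1.3b-base"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

prompt = "def fibonacci(n):"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate a handful of tokens one at a time, always picking the most likely next token.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits            # [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy pick of the next code token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))

===========================================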

Refer to the image below for a high-level view of the model architecture.

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

(Image Source mentioned in credits below)

Looking at the DeepSeek-Coder model architecture, it is a decoder-only Transformer model that generates tokens. It uses Rotary Position Embeddings (RoPE) and Grouped Query Attention (GQA). Why Rotary Position Embeddings? They help the model capture the relative positions of tokens in the code sequence, which is key for code generation. Why Grouped Query Attention? It is faster to compute and uses less memory than standard multi-head attention.
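As a rough illustration of the rotary position embedding idea, here is a simplified sketch (my own, not DeepSeek-Coder's actual implementation) that rotates query/key vectors by position-dependent angles, so that attention scores end up depending on the relative distance between tokens.

===========================================

# Simplified sketch of Rotary Position Embeddings (RoPE); illustrative only.
import torch

def apply_rope(x, positions, base=10000.0):
    # x: [seq_len, head_dim] query or key vectors; head_dim must be even.
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions[:, None].float() * freqs[None, :]   # [seq_len, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair of coordinates by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)                  # 8 tokens, head dimension 64
q_rot = apply_rope(q, torch.arange(8))  # positions 0..7
print(q_rot.shape)                      # torch.Size([8, 64])

===========================================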

Overall, what sets this LLM apart is that the training tokens are code and the output tokens are code as well. Straightforward and simple, yet the whole concept is revolutionary.

The key to this code-generating LLM is the dataset used for pre-training and the subsequent instruction fine-tuning.

Before we dive into the next section of the article covering pre-training and instruction fine-tuning, let's cover the types of code generated.

Code generation can be categorized into three main categories.

  • General code generation, such as Python code, SQL code, etc.
  • Complex-reasoning code generation
  • Fill-in-the-Middle (FIM): filling in the code between two code blocks

Under the code generation category, the generated code can be as complex as an implementation of a linear regression algorithm.

Having covered what code generation LLMs are, let's now cover the datasets used for pre-training, the pre-training itself, and instruction tuning.

The DeepSeek-Coder training dataset is 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus comes from GitHub Markdown and StackExchange.
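Taking the roughly 2 trillion training tokens reported in the paper as the total, this mixture works out to approximately the following token counts (a back-of-the-envelope calculation on my part):

===========================================

# Back-of-the-envelope split of ~2T training tokens across the reported data mixture.
total_tokens = 2_000_000_000_000  # ~2 trillion tokens in total
mixture = {
    "source code": 0.87,
    "English code-related corpus": 0.10,
    "Chinese corpus": 0.03,
}
for name, share in mixture.items():
    print(f"{name}: ~{share * total_tokens / 1e12:.2f}T tokens")
# source code: ~1.74T, English corpus: ~0.20T, Chinese corpus: ~0.06T

===========================================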

Refer to the image below for the DeepSeek-Coder data pipeline used to prepare the training data.

DeepSeek-Coder Data Pipeline for Data Used in Training the Model

The key to this data pipeline is its filtering stages, such as rule-based filtering, deduplication, and quality screening, which are used to create the right code dataset for the model.
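To give a flavour of what rule-based filtering can look like, here is an illustrative sketch that drops code files using simple line-length and alphabetic-ratio heuristics. The thresholds are assumptions on my part, loosely in the spirit of the filters described in the paper, not the exact pipeline.

===========================================

# Illustrative rule-based filter for code files; thresholds are assumptions.
def keep_code_file(text: str) -> bool:
    lines = text.splitlines()
    if not lines:
        return False
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    max_line_len = max(len(line) for line in lines)
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    # Drop files with extremely long lines (likely minified or generated code)
    # or with very little alphabetic content (likely embedded data blobs).
    return avg_line_len <= 100 and max_line_len <= 1000 and alpha_ratio >= 0.25

corpus = ["def add(a, b):\n    return a + b\n", "0,1," * 2000]
filtered = [f for f in corpus if keep_code_file(f)]
print(len(filtered))  # 1 -> only the readable Python file survives

===========================================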

The DeepSeek-Coder model is trained on the following programming languages (refer to the image below).

Languages Used to Train DeepSeek-Coder

Let's cover Pre-Training.

DeepSeek-Coder Model Pre-Training

The model is pre-trained on next-token prediction and Fill-in-the-Middle (FIM) tasks.

For FIM, the following construct is used during training, where "<|fim_start|>" is a sentinel (mask) token:

<|fim_start|> {prefix code} <|fim_hole|> {suffix code} <|fim_end|> {middle code} <|eos_token|>
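Here is a small sketch of how such a FIM prompt could be assembled. Note that I am following the sentinel spellings used above for illustration; the released DeepSeek-Coder tokenizer has its own special-token spellings, so check the model card before running this against a real checkpoint.

===========================================

# Illustrative construction of a Fill-in-the-Middle (FIM) prompt.
# Sentinel spellings follow the article's notation and are placeholders here.
prefix = "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n"
suffix = "\n    return quick_sort(left) + [pivot] + quick_sort(right)\n"

fim_prompt = f"<|fim_start|>{prefix}<|fim_hole|>{suffix}<|fim_end|>"

# The model is then asked to generate the missing middle segment, e.g.:
#   inputs = tokenizer(fim_prompt, return_tensors="pt")
#   outputs = model.generate(**inputs, max_new_tokens=128)
print(fim_prompt)

===========================================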

The training steps are as follows:

  • First, the model is pre-trained on code with a 4K context window on 1.8T tokens.
  • It is then further pre-trained with a 16K context window on an additional 200B tokens, producing the DeepSeek-Coder-Base model.
  • Finally, it is instruction-tuned on 2B tokens with the 16K window, producing the DeepSeek-Coder-Instruct model.

Having covered the datasets used for pre-training, the pre-training itself, and instruction tuning, let's look at a sample of code generated by the DeepSeek-Coder LLM.

Refer to the Python code below, written by a human, which invokes DeepSeek-Coder to generate code for the prompt "write a quick sort algorithm in python".

===========================================

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the instruction-tuned DeepSeek-Coder tokenizer and model (bfloat16, on GPU).
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# Chat-style prompt asking the model to write code.
messages = [
    {'role': 'user', 'content': "write a quick sort algorithm in python."}
]

# Apply the model's chat template and move the token ids to the GPU.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# tokenizer.eos_token_id is the id of the <|EOT|> token.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens (skipping the prompt) and print the code.
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

===========================================

Refer below for the code generated by "deepseek-coder-6.7b-instruct" for "write a quick sort algorithm in python":

Sure, here is a simple implementation of the Quick Sort algorithm in Python:

def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return quick_sort(less_than_pivot) + [pivot] + quick_sort(greater_than_pivot)

# Test the function
arr = [10, 7, 8, 9, 1, 5]
print ("Original array:", arr)
print ("Sorted array:", quick_sort(arr))

This code works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The pivot element is then in its final position. The process is then repeated for the sub-arrays.
        

Amazing, isn't it?

Code generation LLMs are disrupting the entire landscape of software development and software engineering.

A prominent business use case is GitHub Copilot.

Thanks, all. I hope you had a good read.

Disclaimer: The opinions/views expressed above are the author's own and have no bearing on, or affiliation with, the author's current or past employers.

Credit:

https://arxiv.org/pdf/2401.14196

https://github.com/deepseek-ai/deepseek-coder?tab=readme-ov-file

Image Credit:

https://medium.com/@tanalpha-aditya/deepseek-coder-when-the-llm-meets-programming-better-than-gpt-3-5-054cf85e3493

https://github.com/deepseek-ai/deepseek-coder

