Gen AI - Generating Code using Advanced Large Language Models
DeepSeek Coder - Generating Code using Advanced Large Language Models

Today's article covers a very interesting topic: how to use Large Language Models (LLMs) to generate code.

In this article I will first cover the model architecture of code-generating LLMs, followed by the data used for pre-training and the instruction tuning of such LLMs, and finally an example of code generation using an LLM.

I will explain all of these concepts using the DeepSeek-Coder LLM.

DeepSeek-Coder is a state-of-the-art code-generating LLM from DeepSeek-AI (arXiv: 2401.14196).

So, let's dive in. Let's start with the model architecture.

In general, code-generating LLMs are decoder-only Transformer models.

The Transformer model generates code token by token, just like any other text generation; the only difference is that the tokens generated are code tokens, since code is also text.
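To make the token-by-token idea concrete, below is a minimal sketch of a greedy decoding loop using the Hugging Face transformers library. The model name and prompt are illustrative assumptions on my part, and in practice you would simply call model.generate() as shown later in this article.

===========================================

# Minimal sketch of token-by-token (greedy) code generation; illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/deepseek-coder-1.3b-base"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

prompt = "def fibonacci(n):"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate a handful of tokens one at a time, always picking the most likely next token.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits            # [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy pick of the next code token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))

===========================================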

Refer to the image below for a high-level view of the model architecture.

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence

(Image Source mentioned in credits below)

Looking at the DeepSeek-Coder model architecture, it is a decoder-only Transformer model that generates tokens. It uses Rotary Position Embeddings (RoPE) and Grouped Query Attention (GQA). Why Rotary Position Embeddings? They help the model capture the relative positions of tokens in the code sequence, which is key for code generation. Why Grouped Query Attention? It is faster to compute and uses less memory than standard multi-head attention.
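As a rough illustration of the rotary position embedding idea, here is a simplified sketch (my own, not DeepSeek-Coder's actual implementation) that rotates query/key vectors by position-dependent angles, so that attention scores end up depending on the relative distance between tokens.

===========================================

# Simplified sketch of Rotary Position Embeddings (RoPE); illustrative only.
import torch

def apply_rope(x, positions, base=10000.0):
    # x: [seq_len, head_dim] query or key vectors; head_dim must be even.
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per pair of dimensions.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions[:, None].float() * freqs[None, :]   # [seq_len, half]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair of coordinates by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)                  # 8 tokens, head dimension 64
q_rot = apply_rope(q, torch.arange(8))  # positions 0..7
print(q_rot.shape)                      # torch.Size([8, 64])

===========================================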

Overall, what sets this LLM apart is that the training tokens are code and the output tokens are code as well. Straightforward and simple, yet the whole concept is revolutionary.

The key to this code-generating LLM is the dataset used for pre-training and the subsequent instruction fine-tuning.

Before we dive into the next section of the article covering pre-training and instruction fine-tuning, let's cover the types of code generated.

Code generation can be categorized into three main categories.

  • General code generation, such as Python code, SQL code, etc.
  • Complex-reasoning code generation
  • Fill-in-the-Middle (FIM): filling in the code between two code blocks

Under the code generation category, the generated code can be as complex as an implementation of a linear regression algorithm.

Having covered what code generation LLMs are, let's now cover the datasets used for pre-training, the pre-training itself, and instruction tuning.

The DeepSeek-Coder training dataset is 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus comes from GitHub Markdown and StackExchange.
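Taking the roughly 2 trillion training tokens reported in the paper as the total, this mixture works out to approximately the following token counts (a back-of-the-envelope calculation on my part):

===========================================

# Back-of-the-envelope split of ~2T training tokens across the reported data mixture.
total_tokens = 2_000_000_000_000  # ~2 trillion tokens in total
mixture = {
    "source code": 0.87,
    "English code-related corpus": 0.10,
    "Chinese corpus": 0.03,
}
for name, share in mixture.items():
    print(f"{name}: ~{share * total_tokens / 1e12:.2f}T tokens")
# source code: ~1.74T, English corpus: ~0.20T, Chinese corpus: ~0.06T

===========================================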

Refer to the image below for the DeepSeek-Coder data pipeline used to prepare the training data.

DeepSeek-Coder Data Pipeline for Data Used in Training the Model

The key to this data pipeline is its filtering stages, such as rule-based filtering, deduplication, and quality screening, which are used to create the right code dataset for the model.
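To give a flavour of what rule-based filtering can look like, here is an illustrative sketch that drops code files using simple line-length and alphabetic-ratio heuristics. The thresholds are assumptions on my part, loosely in the spirit of the filters described in the paper, not the exact pipeline.

===========================================

# Illustrative rule-based filter for code files; thresholds are assumptions.
def keep_code_file(text: str) -> bool:
    lines = text.splitlines()
    if not lines:
        return False
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    max_line_len = max(len(line) for line in lines)
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    # Drop files with extremely long lines (likely minified or generated code)
    # or with very little alphabetic content (likely embedded data blobs).
    return avg_line_len <= 100 and max_line_len <= 1000 and alpha_ratio >= 0.25

corpus = ["def add(a, b):\n    return a + b\n", "0,1," * 2000]
filtered = [f for f in corpus if keep_code_file(f)]
print(len(filtered))  # 1 -> only the readable Python file survives

===========================================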

The DeepSeek-Coder model is trained on the following programming languages (refer to the image below).

Languages Used to Train DeepSeek-Coder

Let's cover Pre-Training.

DeepSeek-Coder Model Pre-Training

The model is pre-trained on next-token prediction and Fill-in-the-Middle (FIM) tasks.

For FIM, the following construct is used during training, where "<|fim_start|>" is a sentinel (mask) token:

<|fim_start|> {prefix code} <|fim_hole|> {suffix code} <|fim_end|> {middle code} <|eos_token|>
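Here is a small sketch of how such a FIM prompt could be assembled. Note that I am following the sentinel spellings used above for illustration; the released DeepSeek-Coder tokenizer has its own special-token spellings, so check the model card before running this against a real checkpoint.

===========================================

# Illustrative construction of a Fill-in-the-Middle (FIM) prompt.
# Sentinel spellings follow the article's notation and are placeholders here.
prefix = "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n"
suffix = "\n    return quick_sort(left) + [pivot] + quick_sort(right)\n"

fim_prompt = f"<|fim_start|>{prefix}<|fim_hole|>{suffix}<|fim_end|>"

# The model is then asked to generate the missing middle segment, e.g.:
#   inputs = tokenizer(fim_prompt, return_tensors="pt")
#   outputs = model.generate(**inputs, max_new_tokens=128)
print(fim_prompt)

===========================================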

The training steps are as follows:

  • First, the model is pre-trained on code with a 4K context window on 1.8T tokens.
  • It is then further pre-trained with a 16K context window on an additional 200B tokens, producing the DeepSeek-Coder-Base model.
  • Finally, it is instruction-tuned on 2B tokens with the 16K window, producing the DeepSeek-Coder-Instruct model.

Having covered the datasets used for pre-training, the pre-training itself, and instruction tuning, let's look at a sample of code generated by the DeepSeek-Coder LLM.

Refer to the Python code below, written by a human, which invokes DeepSeek-Coder to generate code for the prompt "write a quick sort algorithm in python".

===========================================

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the instruction-tuned DeepSeek-Coder tokenizer and model (bfloat16, on GPU).
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# Chat-style prompt asking the model to write code.
messages = [
    {'role': 'user', 'content': "write a quick sort algorithm in python."}
]

# Apply the model's chat template and move the token ids to the GPU.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# tokenizer.eos_token_id is the id of the <|EOT|> token.
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens (skipping the prompt) and print the code.
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

===========================================

Refer below for the code generated by "deepseek-coder-6.7b-instruct" for "write a quick sort algorithm in python":

Sure, here is a simple implementation of the Quick Sort algorithm in Python:

def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return quick_sort(less_than_pivot) + [pivot] + quick_sort(greater_than_pivot)

# Test the function
arr = [10, 7, 8, 9, 1, 5]
print ("Original array:", arr)
print ("Sorted array:", quick_sort(arr))

This code works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The pivot element is then in its final position. The process is then repeated for the sub-arrays.
        

Amazing, isn't it?

Code generation LLMs are disrupting the entire landscape of software development and software engineering.

A prominent business use case is GitHub Copilot.

Thanks, all. I hope you had a good read.

Disclaimer: The opinions/views expressed above are the author's own and have no bearing on, or affiliation with, the author's current or past employers.

Credit:

https://arxiv.org/pdf/2401.14196

https://github.com/deepseek-ai/deepseek-coder?tab=readme-ov-file

Image Credit:

https://medium.com/@tanalpha-aditya/deepseek-coder-when-the-llm-meets-programming-better-than-gpt-3-5-054cf85e3493

https://github.com/deepseek-ai/deepseek-coder

