Exploring GenAIverse - Part III

In Part I, we discussed LLMs at a high level, and in Part II, we built a Retrieval Augmented Generation (RAG) implementation. In this part, we'll look at how the compute requirements of machine learning and GenAI applications have driven the evolution of chip design.

Processing is at the core of GenAI applications. From training large language models to running inference on them, GenAI applications need a lot of processing power. New chip architectures are evolving to meet these high compute requirements, and specialized chips are being designed and fabricated to keep up with the growing demand.

Evolution of compute requirements

In this section, we'll cover how the need for processing power led to the design and development of new chips - CPUs, GPUs, TPUs and LPUs.

Central Processing Units - CPUs

These are ubiquitous and are present in almost all servers, laptops and other devices. CPUs can process many different types of operations, which is why they are used so widely across enterprises and personal devices.

CPUs have evolved over time (from single core to multi core) to address growing transaction volumes and extremely low latency requirements. The computing needs of n-tier applications (web sites, enterprise applications, etc.) were user-experience centric: response times in a typical client-server architecture were optimized to keep the end user engaged, and the applications had to scale to handle millions of requests per second while providing the same experience to every user. Some of the architectural patterns used to absorb high transaction volumes and reduce latency are horizontal and vertical scaling, load balancing, content delivery networks, and distributed databases that keep the data "near" the end users.
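As a toy illustration of one of these patterns, the sketch below shows a minimal round-robin load balancer in Python; the server names and routing logic are hypothetical and purely illustrative.

from itertools import cycle

# Hypothetical pool of application servers sitting behind a load balancer
servers = ["app-server-1", "app-server-2", "app-server-3"]
round_robin = cycle(servers)

def route_request(request_id: int) -> str:
    # Send each incoming request to the next server in rotation
    target = next(round_robin)
    return f"request {request_id} -> {target}"

for request_id in range(6):
    print(route_request(request_id))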

Additionally, enterprises of all sizes can now leverage "on demand compute" through the cloud providers.

With these developments, modern applications are able to provide an engaging experience to users. Millions of users concurrently stream the latest shows in 4K, retail events like Amazon Prime Day break their transaction-processing records every year, and Visa's global processing network, VisaNet, can handle more than 65,000 transactions per second (TPS).

The degree of parallelism a server can achieve is proportional to the number of cores its CPUs have. Most enterprise servers can handle high-volume workloads with 16+ cores (along with horizontal and vertical scaling and other techniques). However, CPUs were not able to keep up with the requirements of a different set of users - gamers.
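Before moving to GPUs, here is a rough sketch of that CPU-level parallelism using Python's standard multiprocessing module; the workload sizes are arbitrary and chosen only for illustration.

import multiprocessing as mp

def cpu_bound_task(n: int) -> int:
    # Simulate a compute-heavy operation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    print(f"Available CPU cores: {mp.cpu_count()}")
    workloads = [2_000_000] * 8  # eight independent chunks of work
    # Parallelism is capped by the number of available cores
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(cpu_bound_task, workloads)
    print(f"Processed {len(results)} chunks in parallel")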

Graphics Processing Unit - GPUs

When the gaming industry evolved from monochrome handheld devices to 4K gaming, compute needs scaled up exponentially. In 4K gaming, millions of pixels have to be evaluated in every frame through large matrix multiplications and vector transformations. This needed immense processing power, and the chips needed far more parallelism than an 8, 16 or even 32 core CPU could offer. This led to the design and development of Graphics Processing Units - GPUs.

A GPU can have thousands of cores (the NVIDIA RTX 4090 has more than 16,000). They are built on a hub-and-spoke architecture, wherein the workload is distributed through the hub and then combined back, leading to a high degree of parallelism. They work in conjunction with the CPU to do the processing.
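A minimal PyTorch sketch of this division of labour is shown below, assuming PyTorch is installed and a CUDA-capable GPU is available; the CPU prepares the matrices and the GPU's cores perform the large matrix multiplication in parallel.

import torch

size = 4096
a = torch.randn(size, size)
b = torch.randn(size, size)

# The same large matrix multiplication on the CPU...
c_cpu = a @ b
print("CPU result shape:", tuple(c_cpu.shape))

# ...and on the GPU, where thousands of cores work on the tiles in parallel
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()  # wait for the GPU kernels to finish
    print("GPU result shape:", tuple(c_gpu.shape))
else:
    print("No CUDA GPU available; ran on CPU only")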

Since they can run massive numbers of threads in parallel, GPUs were also leveraged at scale for bitcoin mining. Server farms were set up across the globe to monetize the mining of bitcoins, and demand for GPUs to mine Ethereum drove GPU prices up more than threefold in 2020.

The advantages of GPUs were not restricted to gaming and mining; they were also leveraged to train machine learning models.

The turning point came in 2008, when Andrew Ng, then a researcher at Stanford, along with his team, trained an AI model with 100 million parameters using two Nvidia GPUs in a single day, achieving a 70x speedup over CPU processing. This led to the widespread use of GPUs for training neural networks and large language models. To this day, big tech players are setting up GPU data centers for developing and training large language models, and most large language models run on GPUs. Nvidia has developed its own AI inferencing platform - Nvidia NIM - which hosts AI models that can be inferenced through microservices powered by GPUs. We'll do a sample implementation on Nvidia NIM in a later section.

Tensor Processing Units - TPUs

Since machine learning and neural network workloads involve a lot of large matrix multiplications and additions, Google designed specialized chips - Tensor Processing Units (TPUs) - optimized specifically for machine learning workloads. The performance of these chips was improved by reducing the precision to 8-bit integers (versus 32-bit floating-point numbers on CPUs), as neural network workloads do not require high precision. With this, more processing power could be packed into a chip, improving performance significantly.
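The NumPy sketch below illustrates the general idea of reduced precision by quantizing 32-bit floating-point weights to 8-bit integers and back; the symmetric scaling scheme here is a simplified illustration, not the TPU's actual implementation.

import numpy as np

# Full-precision weights, as a CPU or GPU would typically hold them
weights_fp32 = np.random.randn(4, 4).astype(np.float32)

# Simple symmetric quantization to 8-bit integers
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

# Dequantize to see how small the precision loss actually is
weights_restored = weights_int8.astype(np.float32) * scale
print("max quantization error:", np.abs(weights_fp32 - weights_restored).max())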

Additionally, the chips were designed to perform only a limited set of operations, but to perform them very quickly (whereas a CPU is built for general-purpose computing and supports many types of operations). The TPU has also evolved with every version (TPU v1, TPU v2 and TPU v3), and its efficiency has been further optimized with every release.

However, the processing requirements of developing and training large language models for natural language processing differ from those of other deep learning workloads, leading to a new category of chips - LPUs.

Language Processing Unit - LPUs

While in gaming the requirement is to process millions of pixels every second, that processing can be done in parallel because the order in which the pixels are processed does not matter. In LLMs, however, the sequence of words is important for retaining context: the meaning can change drastically if the order of the words changes. In synthetic text generation, the next word in a sentence is identified based on the probability of that word occurring in the sequence.
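The toy Python sketch below illustrates this idea of picking the next word from a probability distribution over candidates; the words and probabilities are made up purely for illustration.

import random

# Hypothetical probabilities for the word following "The cat sat on the"
next_word_probs = {"mat": 0.62, "sofa": 0.21, "roof": 0.12, "keyboard": 0.05}

# Greedy decoding: always pick the most probable next word
greedy_word = max(next_word_probs, key=next_word_probs.get)

# Sampling: draw a word in proportion to its probability
words, probs = zip(*next_word_probs.items())
sampled_word = random.choices(words, weights=probs, k=1)[0]

print("greedy next word:", greedy_word)
print("sampled next word:", sampled_word)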

Although GPUs can be leveraged to train LLMs and neural networks, there are inefficiencies when it comes to processing sequential data. Since large AI models run on a large number of interconnected GPUs and memory chips, moving data between them hurts overall utilization and leaves GPUs idle. Moreover, GPUs were primarily designed for processing graphics.

This led to the design of a new category of chips - Language Processing Units (LPUs) from Groq. These chips are based on an assembly line and conveyor belt architecture, where sequential data is processed as if on an assembly line. The assembly line can move across chips, reducing the back-and-forth trips of data between chips, and the memory is included on the chip itself to further reduce latency. With LPUs, AI inference performance has improved significantly.

AI Inferencing implementations on LPUs and GPUs

In this section, we'll do sample implementations on Groq and Nvidia NIM through LangChain. As a prerequisite, we need to create API keys on Groq and Nvidia to access the models through the Python code.
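The examples below assume the keys are stored in a local .env file and loaded with python-dotenv; a minimal .env with placeholder values (the variable names match those used in the code that follows) would look like this:

GROQ_API_KEY=<your-groq-api-key>
NVIDIA_API_KEY=<your-nvidia-api-key>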

In the example below, we'll run inference on the "llama-3.1-8b-instant" model through Groq LPUs:

import os

from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate

# Load the Groq API key from the local .env file
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")

# Chat model served from Groq's LPU-based inference platform
llm = ChatGroq(groq_api_key=groq_api_key, model_name="llama-3.1-8b-instant")

# Prompt template with a single {question} placeholder
prompt = ChatPromptTemplate.from_template(
    """
    Answer the question briefly in about 200 words
    Question: {question}
    """
)

# Pipe the prompt into the model and invoke the chain
chain = prompt | llm
response = chain.invoke({"question": "Briefly give an overview of neural network"})

print(response)

The following response is received from the LLM through Groq inferencing:

content='**Overview of Neural Networks**\n\nA neural network is a computer system inspired by the structure and function of the human brain. It is a machine learning model consisting of layers of interconnected nodes or "neurons" that process and transmit information. Each neuron receives one or more inputs, performs a computation on those inputs, and then sends the output to other neurons.\n\n**Key Components:**\n\n1. Artificial Neurons (Nodes): Process inputs, perform computations, and transmit outputs.\n2. Connections (Edges): Link neurons, allowing information to flow between them.\n3. Activation Functions: Determine the output of a neuron based on its inputs.\n4. Training: The process of adjusting the network\'s weights and biases to minimize error.\n\n**How Neural Networks Work:**\n\n1. Input Layer: Receives input data.\n2. Hidden Layers: Process and transform the input data.\n3. Output Layer: Produces the final output.\n\nNeural networks are used for a wide range of tasks, including image and speech recognition, natural language processing, and predictive modeling. They are particularly effective at learning complex patterns in data and making accurate predictions or classifications.' response_metadata={'token_usage': {'completion_tokens': 238, 'prompt_tokens': 59, 'total_tokens': 297, 'completion_time': 0.317333333, 'prompt_time': 0.013999549, 'queue_time': 0.0013055789999999994, 'total_time': 0.331332882}, 'model_name': 'llama-3.1-8b-instant', 'system_fingerprint': 'fp_f66ccb39ec', 'finish_reason': 'stop', 'logprobs': None} id='run-f7ee1145-5e24-441c-b9d0-547becccc94d-0' usage_metadata={'input_tokens': 59, 'output_tokens': 238, 'total_tokens': 297}

In the example below, we'll run inference on the "llama-3.1-405b-instruct" model through the Nvidia NIM platform on GPUs:


import os

from dotenv import load_dotenv
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Load the Nvidia API key from the local .env file
load_dotenv()
nvidia_api_key = os.getenv("NVIDIA_API_KEY")

# Chat model hosted on Nvidia's GPU-backed inference endpoints
client = ChatNVIDIA(
    model="meta/llama-3.1-405b-instruct",
    api_key=nvidia_api_key,
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
)

# Stream the response and print the chunks as they arrive
for chunk in client.stream([{"role": "user", "content": "Can you provide an overview of neural networks in 200 words"}]):
    print(chunk.content, end="")

The response from the LLM is below:        

Here is a 200-word overview of neural networks:

What is a Neural Network?
A neural network is a machine learning model inspired by the structure and function of the human brain. It consists of layers of interconnected nodes or "neurons" that process and transmit information.

How Does it Work? Here's a simplified overview:
1. Input Layer: Data is fed into the network through the input layer.
2. Hidden Layers: The data is processed through multiple hidden layers, where each node applies a non-linear transformation to the input data.
3. Output Layer: The final output is generated by the output layer, based on the transformations applied by the hidden layers.

Key Concepts
- Activation Functions: Each node applies an activation function to the input data, introducing non-linearity to the model.
- Backpropagation: The network is trained using backpropagation, an optimization algorithm that adjusts the weights and biases of each node to minimize the error between predicted and actual outputs.
- Training: The network is trained on a dataset, adjusting the weights and biases to improve its performance on a specific task.

Applications
Neural networks have many applications, including:
- Image and speech recognition
- Natural language processing
- Predictive modeling
- Robotics and control systems

I hope this provides a helpful introduction to neural networks! Let me know if you have any further questions.

Thinking out of the Chip

The push to optimize AI model inferencing has led to new ideas. Cerebras has packed a whopping 900,000 cores along with 44 GB of on-wafer memory into a single wafer-scale chip, avoiding the latencies introduced by inter-chip connectivity. The WSE-3 chip from Cerebras is 57 times the size of an Nvidia GPU (the Nvidia H100).

Below is a visual comparison of the WSE-3 with a GPU (from Cerebras' website).


Summary

We got a view of the GenAIverse through these three posts. The models and the inferencing frameworks are evolving at a very fast pace, and new ones are being released all the time. The convergence of cloud platforms and the GenAI ecosystem has removed the need to host specialized, expensive infrastructure and train humongous models on premises. We can now explore the GenAIverse with a few lines of code. However, as with any technology, it is the innovative solutions that will steal the show.
