Leveraging Generative AI & Language Models for Businesses - How To Build Your Own Large Language Model
Prem Naraindas
Founder & CEO at Katonic.ai | The Australian Top Innovator 2023 | LinkedIn TOP VOICE 2024
Generative AI is a simple concept despite its intimidating label. It refers to AI algorithms that produce output, such as text, images, video, code, data, or 3D renderings, from the data they are trained on. Its focus is on generating content, in contrast to other forms of AI with different purposes, such as analysing data or controlling a self-driving car.
Why is generative AI a hot topic right now?
Generative AI programs like OpenAI's ChatGPT and DALL-E are gaining popularity, driving the buzz around the term "generative AI". These programs can quickly produce a wide range of content, including computer code, essays, emails, social media captions, images, poems, and raps, which has attracted widespread attention.
Generative AI and language models promise to change how businesses approach design, support, development, and more. This blog discusses why you should run your own large language models (LLMs) instead of relying on new API providers, and it covers the evolving tech stack for cost-effective LLM fine-tuning and serving, which combines HuggingFace, PyTorch, and Ray. It also shows how Katonic, a leading MLOps platform, addresses these challenges and enables data science teams to rapidly develop and train generative AI models at scale, including ChatGPT-style models.
Why would I want to run my own LLM?
There are many, many providers of LLM APIs online. Why would you want to run your own? There are a few reasons.
OK, so how do I run my own?
The LLM space is incredibly fast-moving and evolving rapidly. What we are seeing emerge is a particular technology stack that combines HuggingFace for models, PyTorch for training, and Ray for distributed compute.
Recent results with Dolly and Vicuna (both trained on Ray, or trained from models built with Ray, such as GPT-J) show that relatively small LLMs, say the open-source GPT-J-6B with 6 billion parameters, can be incredibly powerful when fine-tuned on the right data. The key is the fine-tuning and the right data. You do not always need the latest and greatest model with 150-billion-plus parameters to get useful results. Let's get started!
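To make this concrete, here is a minimal sketch of what fine-tuning GPT-J-6B with HuggingFace Transformers can look like. The training file, output directory, and hyperparameters below are illustrative placeholders, not recommendations, and fine-tuning a 6-billion-parameter model in practice needs a large GPU or memory-saving techniques such as DeepSpeed offloading:

```python
# Illustrative sketch: causal language model fine-tuning with HuggingFace.
# "train.txt" and "gptj-finetuned" are placeholder paths.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus: one training example per line in a local text file.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gptj-finetuned",
        per_device_train_batch_size=1,   # small batch to fit GPU memory
        gradient_accumulation_steps=8,   # simulate a larger effective batch
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```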
Challenges in Generative AI infrastructure
Generative AI infrastructure presents new challenges for distributed training, online serving, and offline inference workloads.
Distributed training
Distributed training is different from your normal training workflow. It applies to large training scenarios, such as NLP and computer vision models, where the dataset or model cannot fit on a single machine. In distributed training, the strategy is to distribute both the data and the model across different machines so that training can execute in parallel.
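As a rough illustration, here is a minimal sketch of data-parallel training with Ray Train and PyTorch (Ray 2.x API). The model and data are toy placeholders; in practice each worker would read its own shard of a real dataset:

```python
# Illustrative sketch: data-parallel PyTorch training on Ray Train.
import torch
import torch.nn as nn
import ray.train.torch as ray_torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Ray Train wraps the model in DistributedDataParallel and places it
    # on this worker's device; gradients are averaged across workers.
    model = ray_torch.prepare_model(nn.Linear(128, 2))
    device = ray_torch.get_device()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(config["epochs"]):
        # Toy synthetic batch standing in for a real data shard.
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 2, (32,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Four data-parallel workers, one GPU each; counts are illustrative.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-4, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```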
Distributed training for generative models comes with a number of common challenges.
Some of the largest-scale generative model training is being done on Ray today:
Fig. Alpa uses Ray as the underlying substrate to schedule GPUs for distributed training of large models, including generative AI models.
Online serving and fine-tuning
Generative AI requires medium-scale workloads (e.g., 1-100 nodes) for training and fine-tuning. Typically, users at this scale want to scale out existing training or inference workloads they can already run on one node (e.g., using DeepSpeed, Accelerate, or a variety of other common single-node frameworks). In other words, they want to run many copies of a workload to deploy an online inference, fine-tuning, or training service.
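As an illustration of "many copies of a workload", here is a hedged sketch of serving replicated copies of a generative model with Ray Serve. The model name, replica count, and GPU allocation are assumptions for illustration:

```python
# Illustrative sketch: replicated model serving with Ray Serve (Ray 2.x).
from ray import serve
from transformers import pipeline

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        # Each replica is a separate Ray actor holding its own copy of
        # the model on its own GPU.
        self.pipe = pipeline("text-generation",
                             model="EleutherAI/gpt-j-6B", device=0)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]

# Deploys the replicas behind Ray Serve's HTTP proxy.
serve.run(Generator.bind())
```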
Fig. A100 GPUs provide much more GRAM per GPU, but cost much more per gigabyte of GPU memory than A10 or T4 GPUs. Multi-node Ray clusters can therefore serve generative workloads at a significantly lower cost when GRAM is the bottleneck.
Doing this form of scale-out can be incredibly tricky to get right and costly to implement. For example, consider the task of scaling a fine-tuning or online inference service for multi-node language models. There are many details to get right, such as optimizing data movement, fault tolerance, and autoscaling of model replicas. Frameworks such as DeepSpeed and Accelerate handle the sharding of model operators, but not the execution of the higher-level applications invoking these models.
However, it is challenging to scale deployments involving many machines. It is also difficult to drive high utilization out of the box.
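For the autoscaling piece specifically, Ray Serve lets you declare autoscaling bounds on a deployment so replicas are added or removed with request load. The thresholds below are illustrative, and the exact field names vary slightly across Ray versions:

```python
# Illustrative sketch: replica autoscaling in Ray Serve (Ray 2.x).
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,                             # scale down when idle
        "max_replicas": 8,                             # cap GPU spend
        "target_num_ongoing_requests_per_replica": 4,  # scale-out trigger
    },
)
class AutoscalingGenerator:
    async def __call__(self, request):
        ...  # model inference, as in the previous sketch
```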
Offline batch inference
On the offline side, batch inference for these models is also challenging: it requires data-intensive preprocessing followed by GPU-intensive model evaluation. Companies like Meta and Google have built custom services (DPP, tf.data service) to perform this at scale on heterogeneous CPU/GPU clusters. While such services were a rarity in the past, we increasingly see users asking how to do this in the context of generative AI inference. These users now also need to tackle the distributed-systems challenges of scheduling, observability, and fault tolerance.
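Here is a hedged sketch of that heterogeneous pattern with Ray Data: CPU workers handle preprocessing while a pool of GPU actors runs the model. The S3 paths and model name are hypothetical, and some argument names (e.g., `concurrency`) differ across Ray versions:

```python
# Illustrative sketch: CPU preprocessing + GPU batch inference with Ray Data.
import ray
from transformers import pipeline

# Data-intensive preprocessing runs on CPU workers across the cluster.
ds = ray.data.read_text("s3://my-bucket/prompts/")  # hypothetical input path
ds = ds.map(lambda row: {"prompt": row["text"].strip()})

# GPU-intensive model evaluation runs in a pool of GPU actors.
class Predictor:
    def __init__(self):
        self.pipe = pipeline("text-generation",
                             model="EleutherAI/gpt-j-6B", device=0)

    def __call__(self, batch):
        outputs = self.pipe(list(batch["prompt"]), max_new_tokens=32)
        batch["output"] = [o[0]["generated_text"] for o in outputs]
        return batch

ds = ds.map_batches(Predictor, concurrency=2, num_gpus=1, batch_size=8)
ds.write_json("s3://my-bucket/outputs/")            # hypothetical output path
```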
How Katonic addresses these challenges
Distributed processing is the best way to scale machine learning. Apache Spark is an easy default option: it is a popular distributed framework that works well for data processing and "embarrassingly parallel" tasks. For machine learning, however, Ray is a better option.
Ray is an open-source, unified computing framework that simplifies scaling AI and Python workloads. Ray is great for machine learning as it can leverage GPUs and handle distributed data. It includes a set of scalable libraries for distributed training, hyperparameter optimization, and reinforcement learning. In addition, its fine-grained controls let you adjust processing to the workload using distributed actors and in-memory data processing.
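The distributed actors mentioned above are the core primitive. A tiny sketch shows the idea: stateful Python workers spread across a cluster, with each method call executing remotely and in parallel:

```python
# Minimal sketch of Ray's actor model.
import ray

ray.init()  # attaches to an existing cluster, or starts one locally

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counters = [Counter.remote() for _ in range(4)]    # four parallel actors
futures = [c.increment.remote() for c in counters]  # remote, non-blocking calls
print(ray.get(futures))  # -> [1, 1, 1, 1]
```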
Today, Ray is used by leading AI organizations to train large language models (LLMs) at scale (e.g., by OpenAI to train ChatGPT, Cohere to train their models, EleutherAI to train GPT-J, and Alpa for multi-node training and serving). However, one of the reasons these models are so exciting is that open-source versions can be fine-tuned and deployed to address particular problems without being trained from scratch. Indeed, users in the community increasingly ask how to use Ray to orchestrate their own generative AI workloads, building off foundation models trained by larger players.
A Challenge
However, getting from point A to a Ray cluster may not be so simple:
To use Ray, many companies look to provision and manage dedicated clusters just for Ray jobs. But your team doesn't have many spare cycles for DevOps right now, and neither does IT, and you will end up paying for that cluster while it sits idle between Ray jobs.
Alternatively, you can subscribe to a Ray service provider. That eliminates the DevOps problem. But you'll have to copy your data to the provider's datastore or go through the treacherous process of connecting it to your data. It also means multiple logins and collaboration platforms to manage. You want to use Ray for some of your projects, not all of them.
A Solution
Katonic offers a cost-effective and secure solution. Katonic 4.0 now supports the open-source Ray framework, which enables data science teams to rapidly develop and train generative AI models at scale, including ChatGPT-style models.
This solution involves configuring and orchestrating a Ray cluster directly on the infrastructure that supports the Katonic platform. With Katonic, your users can spin up Ray clusters when needed. Katonic automates the DevOps away; your team can focus on delivering quality work.
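The exact workflow is platform-specific, but as a rough sketch, once a cluster has been provisioned (for example from the Katonic platform), client code typically attaches to it by address via Ray's client API. The address below is a placeholder, not a real Katonic endpoint:

```python
# Hedged sketch: attaching client code to an already-provisioned Ray cluster.
import ray

# Placeholder address; in practice this comes from your cluster's head node.
ray.init(address="ray://head-node-host:10001")
print(ray.cluster_resources())  # e.g., total CPUs/GPUs available to your jobs
```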
The integration with Katonic's on-demand, auto-scaling compute clusters streamlines the development process, while also supporting data preparation via Apache Spark and machine learning and deep learning via XGBoost, TensorFlow, and PyTorch.
That means your Ray clusters can be on-prem or on any major cloud provider without waiting for IT, DevOps, or the cloud provider to catch up with industry innovation. As always with Katonic, your data is connected centrally, and access controls and audit trails are built in. Best of all, you get a head start on the competition.
Conclusion
We have shown how the Katonic Platform combines Ray, HuggingFace, and PyTorch into a solution that lets data science teams fine-tune and serve generative AI models rapidly, securely, and cost-effectively at scale.
Our upcoming blog post will provide a step-by-step guide to using Hugging Face and Ray with the Katonic MLOps platform. It will show you how to build a system for fine-tuning and serving LLMs, regardless of model size, in under 40 minutes and at a cost of less than $7 for a 6-billion-parameter model. Stay tuned!