Leveraging Generative AI & Language Models for Businesses - How To Build Your Own Large Language Model
Prem Naraindas
Founder & CEO at Katonic.ai | The Australian Top Innovator 2023 | LinkedIn TOP VOICE 2024
Generative AI is a simple concept despite its intimidating label. It refers to AI algorithms that produce output, such as text, images, video, code, data, or 3D renderings, from the data they are trained on. Its focus is on generating content, in contrast to other forms of AI with different purposes, such as analysing data or controlling a self-driving car.
Why is generative AI a hot topic right now?
Generative AI programs like OpenAI's ChatGPT and DALL-E are gaining popularity, driving the buzz around the term "generative AI". These programs can quickly produce a wide range of content, including computer code, essays, emails, social media captions, images, poems, and raps, which has attracted widespread attention.
Generative AI and language models promise to change how businesses approach design, support, development, and more. This blog discusses why you should run your own large language models (LLMs) instead of relying on new API providers, and it covers the evolving tech stack for cost-effective LLM fine-tuning and serving, which combines HuggingFace, PyTorch, and Ray. It also shows how Katonic, a leading MLOps platform, addresses these challenges and enables data science teams to rapidly develop and train generative AI models at scale, including ChatGPT-style models.
Why would I want to run my own LLM?
There are many, many providers of LLM APIs online. Why would you want to run your own? There are a few reasons.
OK, so how do I run my own?
The LLM space is incredibly fast-moving and evolving rapidly. What we are seeing emerge is a particular technology stack that combines HuggingFace for models, PyTorch for training, and Ray for distributed compute.
Recent results with Dolly and Vicuna (both trained on Ray, or trained from models built with Ray, such as GPT-J) show that relatively small LLMs, say the open-source GPT-J-6B with 6 billion parameters, can be incredibly powerful when fine-tuned on the right data. The key is the fine-tuning and the right data. You do not always need the latest and greatest model with 150-billion-plus parameters to get useful results. Let's get started!
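To make this concrete, here is a minimal sketch of what fine-tuning GPT-J-6B with HuggingFace Transformers can look like. The training file, output directory, and hyperparameters below are illustrative placeholders, not recommendations, and fine-tuning a 6-billion-parameter model in practice needs a large GPU or memory-saving techniques such as DeepSpeed offloading:

```python
# Illustrative sketch: causal language model fine-tuning with HuggingFace.
# "train.txt" and "gptj-finetuned" are placeholder paths.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder corpus: one training example per line in a local text file.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gptj-finetuned",
        per_device_train_batch_size=1,   # small batch to fit GPU memory
        gradient_accumulation_steps=8,   # simulate a larger effective batch
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```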
Challenges in Generative AI infrastructure
Generative AI infrastructure presents new challenges for distributed training, online serving, and offline inference workloads.
Distributed training
Distributed training is different from your normal training workflow. It applies to large training scenarios, such as NLP and computer vision models, where the dataset or model cannot fit on a single machine. In distributed training, the strategy is to distribute both the data and the model across different machines so that training can execute in parallel.
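As a rough illustration, here is a minimal sketch of data-parallel training with Ray Train and PyTorch (Ray 2.x API). The model and data are toy placeholders; in practice each worker would read its own shard of a real dataset:

```python
# Illustrative sketch: data-parallel PyTorch training on Ray Train.
import torch
import torch.nn as nn
import ray.train.torch as ray_torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Ray Train wraps the model in DistributedDataParallel and places it
    # on this worker's device; gradients are averaged across workers.
    model = ray_torch.prepare_model(nn.Linear(128, 2))
    device = ray_torch.get_device()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(config["epochs"]):
        # Toy synthetic batch standing in for a real data shard.
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 2, (32,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Four data-parallel workers, one GPU each; counts are illustrative.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-4, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```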
Distributed training for generative models comes with a number of common challenges.
Some of the largest-scale generative model training is being done on Ray today:
Fig. Alpa uses Ray as the underlying substrate to schedule GPUs for distributed training of large models, including generative AI models.
Online serving and fine-tuning
Generative AI requires medium-scale workloads (e.g., 1-100 nodes) for training and fine-tuning. Typically, users at this scale want to scale out existing training or inference workloads they can already run on one node (e.g., using DeepSpeed, Accelerate, or a variety of other common single-node frameworks). In other words, they want to run many copies of a workload to deploy an online inference, fine-tuning, or training service.
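As an illustration of "many copies of a workload", here is a hedged sketch of serving replicated copies of a generative model with Ray Serve. The model name, replica count, and GPU allocation are assumptions for illustration:

```python
# Illustrative sketch: replicated model serving with Ray Serve (Ray 2.x).
from ray import serve
from transformers import pipeline

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        # Each replica is a separate Ray actor holding its own copy of
        # the model on its own GPU.
        self.pipe = pipeline("text-generation",
                             model="EleutherAI/gpt-j-6B", device=0)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return self.pipe(prompt, max_new_tokens=64)[0]["generated_text"]

# Deploys the replicas behind Ray Serve's HTTP proxy.
serve.run(Generator.bind())
```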
Fig. A100 GPUs provide much more GRAM per GPU, but cost much more per gigabyte of GPU memory than A10 or T4 GPUs. Multi-node Ray clusters can therefore serve generative workloads at a significantly lower cost when GRAM is the bottleneck.
Doing this form of scale-out can be incredibly tricky to get right and costly to implement. For example, consider the task of scaling a fine-tuning or online inference service for multi-node language models. There are many details to get right, such as optimizing data movement, fault tolerance, and autoscaling of model replicas. Frameworks such as DeepSpeed and Accelerate handle the sharding of model operators, but not the execution of the higher-level applications invoking these models.
However, it is challenging to scale deployments involving many machines. It is also difficult to drive high utilization out of the box.
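For the autoscaling piece specifically, Ray Serve lets you declare autoscaling bounds on a deployment so replicas are added or removed with request load. The thresholds below are illustrative, and the exact field names vary slightly across Ray versions:

```python
# Illustrative sketch: replica autoscaling in Ray Serve (Ray 2.x).
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,                             # scale down when idle
        "max_replicas": 8,                             # cap GPU spend
        "target_num_ongoing_requests_per_replica": 4,  # scale-out trigger
    },
)
class AutoscalingGenerator:
    async def __call__(self, request):
        ...  # model inference, as in the previous sketch
```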
Offline batch inference
On the offline side, batch inference for these models is also challenging: it requires data-intensive preprocessing followed by GPU-intensive model evaluation. Companies like Meta and Google have built custom services (DPP, tf.data service) to perform this at scale on heterogeneous CPU/GPU clusters. While such services were a rarity in the past, we increasingly see users asking how to do this in the context of generative AI inference. These users now also need to tackle the distributed-systems challenges of scheduling, observability, and fault tolerance.
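Here is a hedged sketch of that heterogeneous pattern with Ray Data: CPU workers handle preprocessing while a pool of GPU actors runs the model. The S3 paths and model name are hypothetical, and some argument names (e.g., `concurrency`) differ across Ray versions:

```python
# Illustrative sketch: CPU preprocessing + GPU batch inference with Ray Data.
import ray
from transformers import pipeline

# Data-intensive preprocessing runs on CPU workers across the cluster.
ds = ray.data.read_text("s3://my-bucket/prompts/")  # hypothetical input path
ds = ds.map(lambda row: {"prompt": row["text"].strip()})

# GPU-intensive model evaluation runs in a pool of GPU actors.
class Predictor:
    def __init__(self):
        self.pipe = pipeline("text-generation",
                             model="EleutherAI/gpt-j-6B", device=0)

    def __call__(self, batch):
        outputs = self.pipe(list(batch["prompt"]), max_new_tokens=32)
        batch["output"] = [o[0]["generated_text"] for o in outputs]
        return batch

ds = ds.map_batches(Predictor, concurrency=2, num_gpus=1, batch_size=8)
ds.write_json("s3://my-bucket/outputs/")            # hypothetical output path
```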
How Katonic addresses these challenges
Distributed processing is the best way to scale machine learning. Apache Spark is an easy default option: it is a popular distributed framework that works well for data processing and "embarrassingly parallel" tasks. For machine learning, however, Ray is a better option.
Ray is an open-source, unified computing framework that simplifies scaling AI and Python workloads. Ray is great for machine learning as it can leverage GPUs and handle distributed data. It includes a set of scalable libraries for distributed training, hyperparameter optimization, and reinforcement learning. In addition, its fine-grained controls let you adjust processing to the workload using distributed actors and in-memory data processing.
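The distributed actors mentioned above are the core primitive. A tiny sketch shows the idea: stateful Python workers spread across a cluster, with each method call executing remotely and in parallel:

```python
# Minimal sketch of Ray's actor model.
import ray

ray.init()  # attaches to an existing cluster, or starts one locally

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counters = [Counter.remote() for _ in range(4)]    # four parallel actors
futures = [c.increment.remote() for c in counters]  # remote, non-blocking calls
print(ray.get(futures))  # -> [1, 1, 1, 1]
```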
Today, Ray is used by leading AI organizations to train large language models (LLMs) at scale (e.g., by OpenAI to train ChatGPT, Cohere to train their models, EleutherAI to train GPT-J, and Alpa for multi-node training and serving). However, one of the reasons these models are so exciting is that open-source versions can be fine-tuned and deployed to address particular problems without being trained from scratch. Indeed, users in the community increasingly ask how to use Ray to orchestrate their own generative AI workloads, building off foundation models trained by larger players.
A Challenge
However, getting from point A to a Ray cluster may not be so simple:
To use Ray, many companies look to provision and manage dedicated clusters just for Ray jobs. But your team doesn't have many spare cycles for DevOps right now, and neither does IT, and you will end up paying for that cluster while it sits idle between Ray jobs.
Alternatively, you can subscribe to a Ray service provider. That eliminates the DevOps problem. But you'll have to copy your data to the provider's datastore or go through the treacherous process of connecting it to your data. It also means multiple logins and collaboration platforms to manage. You want to use Ray for some of your projects, not all of them.
A Solution
Katonic offers a cost-effective and secure solution. Katonic 4.0 now supports the open-source Ray framework, which enables data science teams to rapidly develop and train generative AI models at scale, including ChatGPT-style models.
This solution involves configuring and orchestrating a Ray cluster directly on the infrastructure that supports the Katonic platform. With Katonic, your users can spin up Ray clusters when needed. Katonic automates the DevOps away; your team can focus on delivering quality work.
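The exact workflow is platform-specific, but as a rough sketch, once a cluster has been provisioned (for example from the Katonic platform), client code typically attaches to it by address via Ray's client API. The address below is a placeholder, not a real Katonic endpoint:

```python
# Hedged sketch: attaching client code to an already-provisioned Ray cluster.
import ray

# Placeholder address; in practice this comes from your cluster's head node.
ray.init(address="ray://head-node-host:10001")
print(ray.cluster_resources())  # e.g., total CPUs/GPUs available to your jobs
```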
The integration with Katonic's on-demand, auto-scaling compute clusters streamlines the development process, while also supporting data preparation via Apache Spark and machine learning and deep learning via XGBoost, TensorFlow, and PyTorch.
That means your Ray clusters can be on-prem or on any major cloud provider without waiting for IT, DevOps, or the cloud provider to catch up with industry innovation. As always with Katonic, your data is connected centrally, and access controls and audit trails are built in. Best of all, you get a head start on the competition.
Conclusion
We have shown how the Katonic Platform combines Ray, HuggingFace, and PyTorch into a solution that lets data science teams fine-tune and serve generative AI models rapidly, securely, and cost-effectively at scale.
Our upcoming blog post will provide a step-by-step guide to using Hugging Face and Ray with the Katonic MLOps platform. It will show you how to build a system for fine-tuning and serving LLMs, regardless of model size, in under 40 minutes and at a cost of less than $7 for a 6-billion-parameter model. Stay tuned!