Deploying LLMs in Production: The Anatomy of LLM Applications
XenonStack
Introduction
Large Language Models (LLMs) are deep learning models that can be used for a wide range of Natural Language Processing (NLP) tasks. Large language models use transformer architectures and are trained on massive datasets — hence, large. This enables them to recognize, translate, predict, or generate text and other content.
Large language models are a type of neural network (NN), a computing system inspired by the human brain. These networks consist of layered nodes that pass signals to one another, much like neurons.
LLM applications and use cases
LLMs have opened up a new class of enterprise use cases. Trained on such a large amount of information, these models provide a strong foundation to build on. This makes them incredibly versatile and easily adaptable to a variety of applications. Whether leveraging the built-in knowledge of the model or fine-tuning it for your use case, there are many ways to provide a useful and intuitive user experience. Some common LLM applications are as follows:
Chatbots
Provide customer support, product recommendations, and conversational search.
Document Understanding
Summarize documents, extract key information, and analyse sentiment.
Code completion
Explain and document code, suggest improvements, and auto-generate code.
Content generation
Write blog posts, emails, ad copy, and outlines based on prompts.
Search
Answer natural language questions over a large collection of texts.
Translation
Translate between languages and perform style transfer to change the tone and reading level.
LLM components and reference architecture
LLM Models
The LLM model is the heart of the LLM application and selecting a model depends on a variety of factors. The primary elements are the size of the model (# of parameters) and whether it is open sourced or behind an API. Some models may have tokenizers and embedding models built in, while others will require you to run these steps yourself. Tokenizers split input text into smaller chunks while embedding models convert the text into numeric vectors which an LLM can understand.
The easiest place to begin prototyping is with OpenAI’s GPT-4, as it is quick to get up and running given how adaptable the model is. In some cases, a long context window may be required when performing a lot of in-context learning, making Claude 2 an excellent choice. However, when using such large models, both cost and latency could become prohibitive.
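If the model you choose does not bundle these steps, the tokenizer and embedding model can be run separately. Below is a minimal sketch in Python; tiktoken and sentence-transformers are illustrative choices here, not requirements of any particular model.

```python
import tiktoken
from sentence_transformers import SentenceTransformer

text = "Large language models use transformer architectures."

# Tokenizer: split the input text into the integer tokens a model consumes.
encoder = tiktoken.encoding_for_model("gpt-4")
tokens = encoder.encode(text)
print(len(tokens), tokens[:5])

# Embedding model: convert the text into a numeric vector for similarity search.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vector = embedder.encode(text)
print(vector.shape)  # (384,) for this particular embedding model
```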
Prompt Template
Much like the experimentation process that has evolved in traditional Machine Learning model training, prompt engineering has emerged as a core activity in LLM application development. There are several prompting techniques that lead to better results.
The first technique is to write clear instructions and specify the steps required to complete the task. Most tasks can be done using a zero-shot prompt but adding some examples with a few-shot prompt can further improve and tailor the response.
Even simply including phrases such as “show your work” or “let’s think step by step” leads the model to develop a solution iteratively, increasing the chances of a correct answer.
This chain-of-thought prompting is even more powerful if a few examples of reasoning logic are provided. Prompts can then be templated and reused by an LLM application.
They can be integrated into your code with a simple f-string or str.format(), but there are also libraries like LangChain, Guidance, and LMQL that offer more control. For chat completion APIs, there is generally a system prompt to assign the task, followed by alternating user and assistant messages.
You can seed the chat with examples of user questions and assistant answers before running it. Experimenting and iterating on these prompt templates in a structured way will lead to improved model performance. The outputs of the model should be evaluated by either a scoring function or human feedback.
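As a concrete sketch, the snippet below templates a prompt with plain str.format() and seeds a chat with a few example turns; the template wording and example messages are made up for illustration.

```python
# A reusable prompt template; libraries like LangChain, Guidance, and LMQL
# offer richer templating, but plain str.format() captures the core idea.
TEMPLATE = (
    "You are a support assistant. Answer the customer's question "
    "in a friendly tone.\n\nQuestion: {question}\nAnswer:"
)
prompt = TEMPLATE.format(question="How do I reset my password?")

# For chat completion APIs: a system prompt assigns the task, and a few
# example user/assistant turns (few-shot) seed the expected style.
messages = [
    {"role": "system", "content": "You answer account questions concisely."},
    {"role": "user", "content": "Why was I charged twice this month?"},
    {"role": "assistant", "content": "A duplicate charge is usually a pending authorisation that will drop off."},
    {"role": "user", "content": prompt},  # the live question goes last
]
```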
Vector Database
Many use cases will require access to information that the LLM has not been trained on. This may be a private knowledge base (e.g., a company wiki), recent information (e.g., events this weekend), or a specific document (e.g., a research paper). There might be other cases where the context window of the model is too small to include everything in the prompt (e.g., an entire book).
Adding an external database can provide access to supplementary information in a technique known as retrieval augmented generation (RAG). Vector databases have been around for some time but have surged in popularity recently as they make their way into LLM applications.
Relevant external information is first chunked into blocks (e.g., sentences or paragraphs), tokenized, and run through an embedding model. These are then fed into a vector database, such as Pinecone, Chroma, Qdrant, or pgvector (open source). When a prompt is made, it is also vectorized and a similarity search is performed to retrieve relevant entries from the database.
These entries are then fed in alongside the original prompt, providing the context needed to produce a coherent response. The LLM may cite which chunks it used in its response, which adds a degree of trust to counteract potential hallucinations.
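A minimal sketch of this flow, using Chroma and its default embedding function; the documents and question are invented, and any of the databases above would follow the same pattern.

```python
import chromadb

client = chromadb.Client()  # in-memory instance, fine for prototyping
collection = client.create_collection("company_wiki")

# Chunk the external information (one sentence per chunk here) and index it;
# Chroma embeds each chunk with its default embedding model.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Expense reports must be submitted within 30 days of purchase.",
        "The VPN client is required for remote access to internal databases.",
    ],
)

# The user prompt is vectorized and a similarity search retrieves relevant chunks.
question = "How long do I have to file an expense report?"
results = collection.query(query_texts=[question], n_results=1)
context = results["documents"][0][0]

# The retrieved context is fed in alongside the original prompt.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```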
Agents and Tools
Using LLMs alone can enable powerful applications, but there are inherent limitations. For example, they cannot continually prompt themselves, make external API calls, nor retrieve a web page. An LLM agent has access to tools that can perform actions beyond text generation.
For example, an LLM agent could perform a web search, execute code, perform a database lookup, or do maths calculations. OpenAI language models can decide which tools to use and return a JSON object with the arguments to call the function. This enables a whole new suite of use cases, such as booking a flight, generating an image, or sending an email.
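A minimal sketch of that function-calling flow with the OpenAI Python SDK; the get_weather tool and its schema are hypothetical, and the response handling assumes the model chose to call the tool.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe a tool the model is allowed to call; the schema is illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Do I need an umbrella in London today?"}],
    tools=tools,
)

# If the model decides to use the tool, it returns the arguments as JSON;
# otherwise tool_calls is None and the message contains a direct answer.
tool_call = response.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)  # e.g. {"city": "London"}
```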
Some frameworks enable agents to execute iteratively to complete a task. This could be as simple as breaking down a task into several subtasks or “self-asking” to continue gathering relevant information for the task. These can then be executed in a chain to work towards a correct answer iteratively.
The idea of combining this reasoning with taking actions using tools led to the emergence of a framework called ReAct. ReAct enters a loop of thinking about which action to take, taking that action, observing the result, and deciding on a subsequent action until a solution is found.
Although this approach has been shown to outperform baseline LLMs, evaluating the performance of these systems and achieving reliable results is still a major challenge. Security is also a concern as LLMs are given the ability to take actions, such as posting on the web.
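A stripped-down sketch of such a reason-act loop is shown below; llm() and run_tool() are hypothetical stand-ins for the model call and the tool registry, simplified so the loop itself stays readable.

```python
def llm(history: str) -> dict:
    # Stand-in for a model call that parses the model's Thought/Action output.
    # It finishes immediately here, purely to keep the sketch runnable.
    return {"thought": "I can answer directly.", "action": "finish", "input": "42"}

def run_tool(name: str, argument: str) -> str:
    # Stand-in for a tool registry (web search, code execution, calculator, ...).
    return f"(result of {name} on {argument!r})"

def react(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(history)                      # think: decide on the next action
        if step["action"] == "finish":           # stop once a solution is found
            return step["input"]
        observation = run_tool(step["action"], step["input"])  # act
        history += f"\nThought: {step['thought']}\nObservation: {observation}"  # observe
    return "No answer found within the step limit."

print(react("What is 6 * 7?"))
```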
Orchestrator
Orchestrators provide a framework to tie these components together. They create abstractions on top of LLMs, prompt templates, data sources, agents, and tools. Templating frameworks like Guidance and LMQL enable complex prompts that specify inputs, outputs, and constraints.
In addition, these tools can improve performance by providing memory management, token healing, beam search, session management, error handling, and more. Being able to create a logical control flow on top of LLMs enables developers to build differentiated, specialized, and hardened use cases.
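At its core, this orchestration is ordinary control flow wrapped around model calls, which the frameworks above formalize. A minimal plain-Python sketch follows; call_llm() is a hypothetical wrapper around whichever model or API you use.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your chosen model or API."""
    return "placeholder response"

def answer_with_review(question: str) -> str:
    # Step 1: draft an answer.
    draft = call_llm(f"Answer the question:\n{question}")
    # Step 2: branch on a quality check -- plain control flow on top of LLM calls.
    verdict = call_llm(f"Is this answer complete? Reply YES or NO.\n{draft}")
    if verdict.strip().upper().startswith("NO"):
        draft = call_llm(f"Improve this answer:\n{draft}\n\nQuestion: {question}")
    return draft
```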
Monitoring
Given the stochastic nature and rapid evolution of LLMs, your application should be monitored in production. Standard metrics such as CPU, GPU, and memory usage, latency, and throughput should be tracked. Drift and outlier detectors can also be deployed to alert you to changing or anomalous inputs over time. Requests and responses should also be logged so that unexpected or harmful outputs can be evaluated.
With complex chains, or graphs, of LLM agents strung together, it can be difficult to track exactly how the prompt is evolving. Tools like LangSmith and Seldon Core v2 provide the ability to trace the flow of data, providing visibility into the behaviour of your LLM application. Not only that, but Seldon Core v2 provides a data-centric deployment graph that enables advanced monitoring with drift detectors and explainers.
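As a starting point, request/response logging and latency tracking can be added with a thin wrapper around the model call, leaving drift detection and tracing to the dedicated tools above; call_llm() is again a hypothetical model wrapper.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_app")

def monitored_call(call_llm, prompt: str) -> str:
    """Wrap any LLM call so the prompt, response, and latency are logged."""
    start = time.perf_counter()
    response = call_llm(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info(json.dumps({
        "prompt": prompt,
        "response": response,
        "latency_ms": round(latency_ms, 1),
    }))
    return response
```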
Conclusion
Deploying and managing LLMs in a production environment can be challenging due to resource management, model performance, model versioning, and infrastructure issues. Tools such as MLflow can simplify deploying and administering LLMs in production by providing APIs for managing the model lifecycle, with built-in support for Hugging Face Transformers, OpenAI, and LangChain models. In this blog, we walked through the core components of an LLM application: models, prompt templates, vector databases, agents and tools, orchestrators, and monitoring. Collaboration between data scientists, engineers, and other stakeholders in the machine learning lifecycle can also be improved by adopting such tooling.
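As an illustrative sketch of that workflow, the snippet below logs a Hugging Face text-generation pipeline with MLflow's transformers flavor and reloads it for serving; it assumes MLflow 2.x, and exact arguments may differ between versions.

```python
import mlflow
from transformers import pipeline

# Log a Hugging Face text-generation pipeline as an MLflow model.
generator = pipeline("text-generation", model="gpt2")

with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=generator,
        artifact_path="text_generator",
    )

# Reload the logged model through the generic pyfunc interface for serving.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["Deploying LLMs in production is"]))
```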