LLMs in Action: A Practical Guide for Software Architects and Developers
Picture generated with DreamStudio


Generative AI, particularly Large Language Models (LLMs), has gained immense popularity among various groups of people, including AI professionals, data scientists, tech enthusiasts, business executives, politicians, and the general public. It's hard to miss the buzz surrounding this technology and its potential to revolutionize numerous industries and aspects of our lives.

It's interesting to note that despite the growing interest in LLMs among software architects and developers, there seems to be a lack of resources tailored specifically to their needs. Most of the technical LLM content (at least the content I read) is geared towards the AI research community that produces the algorithms at the foundation of LLMs, and towards AI engineers who assemble different AI / LLM components to engineer what I call a "Complicated LLM Subsystem".

But these "Complicated LLM Subsystems" are not standalone software products you can directly present to an end-user. For instance, while PageRank is a "Complicated AI Subsystem", it is not a product. Google Search is the actual product. This means that at some point, developers had to interact with and integrate this subsystem into the overall Google Search architecture.

As the use of LLMs becomes more prevalent in software development, many developers will need to integrate these powerful tools into their existing or new applications. This process can be challenging, especially for those who are new to working with LLMs. That's why I think it is the perfect time to share some advice to help software architects avoid the mistakes I made and flatten the (steep) LLM learning curve.

Which LLM(s) should I use in my context?

Selecting the appropriate LLM(s) for your specific use case is crucial for achieving optimal results. Indeed, LLMs have demonstrated remarkable capabilities across a variety of natural language processing tasks, such as content generation, chit-chat, summarization, role playing, information extraction from documents, classification, and sentiment analysis (but they are not very reliable at even basic math, so don't use them for that). Nonetheless, they are not all equally good at all of these tasks. To choose the right LLM, you should consider the following factors:

  1. Tasks at stake: Identify the primary language tasks of your project: is it generating text, answering questions, summarizing reports, extracting information from documents, or something else? As we'll see in the following section, online benchmarks and leaderboards are available for free to help you target candidate LLMs.
  2. Language and localization performance: Some LLMs are proficient in multiple languages, while others may be more focused on English. For example, BLOOM is proficient not only in English but also in other languages like French and German. Consider the language and regional requirements of your project and choose an LLM that can handle the desired languages effectively.
  3. Throughput and response time: Consider how your throughput and response time requirements can impact your LLM choice. If your use case is interactive, you'll probably want to rely on a fast LLM or choose an LLM provider whose infrastructure is powerful enough to deliver acceptable response times. Interestingly enough, I have not yet seen LLM SaaS providers offer a "throughput SLA" (e.g., N tokens/sec guaranteed) or a "response time SLA" (e.g., time-to-first-token < N seconds) for their service. This is definitely concerning for critical applications and probably a good reason to think carefully about whether you are willing to take a hard dependency on LLM SaaS providers for critical functions.
  4. Budget and pricing model: You basically have a choice between two pricing models with LLMs. You can leverage a SaaS LLM provider that will probably use a PAYG (pay-as-you-go) model where you will be billed per token sent to and generated by the LLM. Or you can leverage an open source LLM that will be deployed on a GPU-equipped cluster, either in the public cloud or on-premises. With the former option you have no upfront cost but potentially large variable costs, while with the latter option you have a substantial upfront cost to acquire and/or rent the GPUs, but you have no variable costs involved (see the cost sketch after this list).
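To make the trade-off concrete, here is a minimal back-of-the-envelope sketch comparing the two pricing models. All figures (per-token price, monthly GPU cluster cost, tokens per request) are hypothetical placeholders, not real provider prices; plug in your own numbers.

```python
# Back-of-the-envelope comparison of PAYG (SaaS) vs. self-hosted LLM costs.
# All numbers below are hypothetical placeholders, not actual provider prices.

PRICE_PER_1K_TOKENS = 0.002      # $ per 1K tokens (prompt + completion), PAYG
GPU_CLUSTER_MONTHLY = 5000.0     # $ per month to rent and operate a GPU cluster

def payg_monthly_cost(tokens_per_request: int, requests_per_month: int) -> float:
    """Variable cost of a pay-as-you-go SaaS LLM."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

def self_hosted_monthly_cost() -> float:
    """Roughly fixed monthly cost of a self-hosted open source LLM."""
    return GPU_CLUSTER_MONTHLY

if __name__ == "__main__":
    for requests in (10_000, 100_000, 1_000_000, 10_000_000):
        payg = payg_monthly_cost(tokens_per_request=1500, requests_per_month=requests)
        print(f"{requests:>10,} requests/month: PAYG ≈ ${payg:>10,.0f} | self-hosted ≈ ${self_hosted_monthly_cost():>10,.0f}")
```

At low volumes the pay-as-you-go model is clearly cheaper; past a certain traffic threshold the fixed cost of a self-hosted cluster starts to pay off. Where that break-even point sits depends entirely on your provider's pricing and your infrastructure costs.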

By thinking through these four questions, you should already be able to reduce the list of candidate LLMs for your project, but that might not be sufficient. Which brings us to the second section of this article.

Welcome to the LLM battle arena: choose your Pokemon (oops... I meant LLM!)

What makes LLMs hard to follow is that there is so much interest in, and so many contributions from, the AI community that almost every week a new model comes out of nowhere and tops the various leaderboards, before being replaced a few weeks later by a new contender.

Fortunately, for a few months now, most of the released models have had animal names, which makes them far easier to remember than the strange acronyms LLMs had a few years ago...

Picture generated with DreamStudio

So, while it is important to know the basics of certain significant animals of the LLM bestiary (e.g., Llama, Alpaca, Vicuna, Orca, Falcon, Platypus), it's equally crucial to recognize that these models may not endure, and having a deeper knowledge of the critical benchmarks for assessing LLMs is more valuable since they tend to remain consistent and reliable over time.

  • Question answering benchmarks, such as BoolQ, NarrativeQA, HellaSwag, TruthfulQA, and MMLU assess an LLM's ability to understand and answer questions based on given context or knowledge. These benchmarks often involve various question types, including multiple-choice, open-ended, and yes/no questions. By evaluating LLMs on these benchmarks, you can gain insights into their comprehension and reasoning abilities, which are crucial for tasks like content generation, chit-chat, and role playing.
  • Summarization benchmarks like SAMSum measure how well an LLM can condense lengthy texts into shorter, coherent summaries.
  • Classification benchmarks such as SST-2 or AGNews test an LLM's ability to categorize text based on predefined categories, such as sentiment analysis or topic classification. By comparing LLMs on these benchmarks, you can determine which models excel at information classification.
  • Logic and mathematics benchmarks like MultiArith, AddSub, AQUA-RAT or GSM8K assess an LLM's ability to solve problems that require logical reasoning or mathematical calculations. These benchmarks can help you understand an LLM's problem-solving capabilities and its potential for tasks like code generation, data analysis, and scientific research.

Having said that, it would be a waste of time and money to evaluate your candidate LLMs yourself against the benchmarks closest to the language tasks you want to accomplish. Fortunately, several leaderboards, such as the Open LLM Leaderboard and Stanford's Holistic Evaluation of Language Models (HELM), regularly evaluate LLMs against these benchmarks. These leaderboards provide a comprehensive view of a model's performance across various tasks and benchmarks, making it easier to identify the best LLM for your specific context.

Besides leaderboards that can be somewhat static, there are interactive "LLM battle arenas" where you can compare the performance of various LLMs by yourself. One notable platform is the Chatbot Arena at lmsys.org which uses the Elo rating system, similar to that used in chess, to rank the LLMs against each other. This allows for a more comprehensive and nuanced comparison of their capabilities.
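For intuition, here is a minimal sketch of the Elo update rule such arenas rely on: each head-to-head comparison between two models nudges their ratings up or down depending on how surprising the result is. The K-factor and the starting ratings below are illustrative defaults; the actual Chatbot Arena methodology may differ in its details.

```python
# Minimal Elo update after one head-to-head "battle" between two LLMs.
# K and the ratings are illustrative, not the arena's actual settings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after a single comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# An upset: the lower-rated model wins and gains more points than usual.
print(elo_update(1200, 1300, a_won=True))
```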

By using all of these resources, you can definitely gain a deeper understanding of the strengths and weaknesses of each LLM and make a more informed choice for your project.


Interaction and integration of your software components with your LLM(s)

Now that you've selected your target LLM(s), the natural next step is to integrate your application with them effectively to achieve the desired outcome. To ensure a successful integration, let's review a few key items together. By addressing these points, you'll be well on your way to realizing the full potential of your LLMs.

Prompt engineering

Effective prompt engineering is a vital component of working with LLMs to yield the desired outputs. By skillfully crafting prompts, you can direct the LLM to generate more precise and relevant responses. There are numerous online resources available to help you customize your prompts, but here are some tried-and-true principles to keep in mind:

  1. Invest in your System Message: A well-crafted system message is essential for optimizing the performance of your LLM. This instruction or context, provided before the user's prompt, guides the LLM in understanding its tasks and adapting to your unique application. Clear and concise system messages enable the LLM to effectively configure its behavior, leading to improved results. For instance, if you desire a chatbot with succinct responses, your system message might read: "You are a laconic assistant. Provide brief, straightforward answers without additional explanation." By investing time and effort into creating effective system messages, you can significantly enhance the LLM's capabilities and tailor them to your specific requirements.
  2. Be specific in your task descriptions: When creating prompts, it's essential to be very specific about the task you want the LLM to perform. Avoid vague or open-ended requests; instead, provide clear and concise instructions that guide the model towards the desired outcome. Include specific examples or templates to help the LLM understand what you're looking for. For example, instead of asking the LLM to "Write a summary", you could ask it to "Write a summary of the following text in 100 words or less, focusing on the main points and conclusions." By doing so, you'll increase the likelihood of getting relevant and accurate responses from your LLM.
  3. Leverage One-shot or Few-shot learning techniques: One-shot and few-shot learning techniques involve providing the LLM with one or a few examples of the desired output. This can help the model understand the task and generate more accurate results. For instance, you might provide the LLM with a single example of a well-written summary or a few examples of correct classifications (the prompt sketch after this list illustrates this).
  4. Structure your input with delimiters: Using delimiters to structure your input can help the LLM better understand the task at stake and produce more accurate results. For example, you might use a format like "Task: [task description] | Input: [input text] | Output: [expected output]." This structured approach makes it easier for the LLM to parse the prompt and generate the desired output.
  5. Define the output structure you expect: Clearly outline the structure of the output you expect from the LLM. This can help the model generate more organized and coherent responses. For example, if you want the LLM to generate a list of items, you might specify that the output should be formatted as a numbered or bulleted list. If you need the LLM to provide a step-by-step solution to a problem, you could ask it to "Provide a step-by-step solution to the following problem, with each step clearly explained and justified."
  6. Breakdown your task in smaller steps: Encourage the LLM to think step-by-step by asking it to break down the task into smaller, more manageable steps. This can help the model generate more accurate and coherent responses, especially for complex tasks. For example, you might ask the LLM to "First, identify the main points of the text, and then summarize each point in one sentence. Finally, combine the sentences into a coherent summary."
  7. Apply the ReAct framework: The ReAct framework is a general paradigm that combines reasoning and acting with LLMs. ReAct prompts LLMs to generate verbal reasoning traces and actions, which can help improve the model's performance on tasks like question answering and decision making. By following the ReAct framework, you can create prompts that help the LLM better understand the task and generate more accurate results.
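To make several of these principles concrete, here is a minimal sketch of a chat-style prompt that combines a system message, a specific task description, a one-shot example, and delimiters. The role-based message format (system / user / assistant) is the common chat convention used by most chat LLM APIs; the wording of the prompts themselves is purely illustrative.

```python
# A minimal chat-style prompt combining principles 1-4: system message,
# specific task description, one-shot example, and delimiters.
# All texts are illustrative; adapt them to your own use case.

messages = [
    {
        "role": "system",
        "content": (
            "You are a laconic assistant. Provide brief, straightforward "
            "answers without additional explanation."
        ),
    },
    # One-shot example: show the model exactly what a good answer looks like.
    {
        "role": "user",
        "content": (
            "Task: Summarize the text in one sentence. | "
            "Input: Our Q2 revenue grew 12% year over year, driven mainly by "
            "the new subscription tier, while churn remained flat. | Output:"
        ),
    },
    {
        "role": "assistant",
        "content": "Q2 revenue grew 12% thanks to the new subscription tier, with stable churn.",
    },
    # The actual request, structured with the same delimiters.
    {
        "role": "user",
        "content": "Task: Summarize the text in one sentence. | Input: {your_text_here} | Output:",
    },
]
```

This list of messages can then be passed as-is to whichever chat completion API or SDK you integrate with (see the integration modes below).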


Key LLM inference parameters

When working with LLMs, it's essential to understand the key parameters used at inference and how to configure them to achieve the desired output. Three parameters are important to consider:

  1. Temperature: Temperature is a parameter that controls the randomness or creativity of the generated text in a LLM. A higher temperature value typically makes the output more diverse and creative but might also increase its likelihood of straying from the context. Conversely, a lower temperature value produces more predictable and conservative output. For example, if you want the LLM to generate more creative text, you might set the temperature to 1.0, whereas if you prefer more focused and coherent output, you could set the temperature to 0.5.
  2. Max length (or Max tokens): Max length (or max tokens) is a parameter that defines the maximum number of tokens the LLM can generate in its responses. LLMs typically have a maximum token limit due to computational constraints, as processing very long sequences can be memory-intensive and may lead to increased computational complexity. When working with long sequences of text, you might need to employ techniques like truncation, where only a portion of the text is considered, or sliding window approaches that process the text in segments.
  3. Sampling: Sampling is the process of randomly selecting the next token based on the probability distribution over the entire vocabulary given by the LLM. There are several sampling techniques, but the two most popular are temperature sampling and top-k sampling. Temperature sampling is inspired by statistical mechanics and adjusts the probability distribution of the next token based on the temperature parameter. Top-k sampling, on the other hand, selects the next token from the top-k most probable tokens, which can help reduce the likelihood of generating low-probability or irrelevant tokens (see the sketch after this list).
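To illustrate how temperature and top-k sampling shape the choice of the next token, here is a minimal sketch in plain Python. The toy logits are made up; a real LLM produces scores over a vocabulary of tens of thousands of tokens, but the mechanics are the same.

```python
import math
import random

# Toy logits for a tiny vocabulary; a real LLM produces thousands of these.
logits = {"cat": 2.0, "dog": 1.5, "car": 0.3, "banana": -1.0}

def sample_next_token(logits: dict, temperature: float = 1.0, top_k: int = 3) -> str:
    """Pick the next token using temperature scaling followed by top-k filtering."""
    # 1. Temperature scaling: lower T sharpens the distribution, higher T flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # 2. Top-k filtering: keep only the k highest-scoring tokens.
    top = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 3. Softmax over the remaining tokens to obtain a probability distribution.
    max_logit = max(v for _, v in top)                      # for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in top}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # 4. Random draw according to those probabilities.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(sample_next_token(logits, temperature=0.5, top_k=2))   # conservative output
print(sample_next_token(logits, temperature=1.5, top_k=4))   # more adventurous output
```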

By understanding and configuring these LLM inference parameters, you can better control the behavior and output of your LLM, ensuring that it generates the desired results for your specific use case.


LLM integration modes

Currently, there are three primary integration modes for connecting your application code to the "LLM Complicated Subsystem". Each mode offers its own unique benefits and drawbacks which are discussed below:

  • Integration via API: This method is particularly suitable for SaaS LLMs, allowing you to simply call the REST API exposed by the LLM provider. For instance, you might use the OpenAI API to integrate GPT-3.5-turbo with your application, enabling you to send prompts and receive generated text through a straightforward REST API (a minimal sketch follows this list). This approach is ideal for relatively simple scenarios where you don't need much prompt orchestration or integration with complementary data sources, and there is no need for developers to learn a complex SDK.
  • Integration via Langchain: Langchain is an open-source framework designed to simplify the creation of applications using LLMs. It provides developers with the tools to build applications powered by LLMs and offers a structured approach for connecting LLMs to other data sources and orchestrating a series of prompts to achieve desired outcomes. Although it is an excellent choice for intricate scenarios and building sophisticated conversational agents, keep in mind that the learning curve is a bit steep and Langchain doesn't offer enterprise-level support for the SDK at present.
  • Integration via LMQL: LMQL (Language Model Query Language) is a query language designed for working with LLMs, allowing you to interact with the model using a familiar syntax. For example, you might use LMQL to query an LLM for specific information, such as: '''argmax """Review: {review}\n Q: What is the underlying sentiment of this review and why?\n A:[ANALYSIS]\n Based on this, the overall sentiment of the message can be considered to be [CLASSIFICATION]""" from "openai/text-davinci-003" WHERE CLASSIFICATION in ["positive", "neutral", "negative"] '''. While it is clearly an experimental project at the moment, and I wouldn't recommend it for production, the SQL-friendly interface is an opportunity for development teams not at ease with Python or JavaScript, and perhaps for scenarios requiring strong integration with RDBMS if, in the future, the project adds features like federated queries that let you easily inject data from SQL tables into prompts.
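As an illustration of the first integration mode, here is a minimal sketch that calls a chat completion REST endpoint directly with the requests library. The endpoint URL, model name, and payload follow the publicly documented OpenAI chat completions API; adapt them to your own provider, and keep your API key out of source code.

```python
import os
import requests

# Minimal sketch of the "integration via API" mode: a direct REST call to a
# SaaS LLM provider. Endpoint and payload follow the OpenAI chat completions
# API; adjust them for your provider of choice.
API_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]          # never hard-code secrets

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [
        {"role": "system", "content": "You are a laconic assistant."},
        {"role": "user", "content": "Summarize the benefits of the Decorator pattern in one sentence."},
    ],
    "temperature": 0.5,
    "max_tokens": 200,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```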


Leverage the Decorator pattern

The Decorator is a structural design pattern well known to software developers that allows you to add functionality to an object dynamically by wrapping it with one or more decorator objects. These decorators can modify the behavior of the original object by adding new functionality before or after its execution, or by modifying its inputs and outputs. I recommend leveraging this pattern when you integrate with "Complicated LLM Subsystems".

Indeed, the behavior of these complicated systems can be a bit erratic or at least non-deterministic, and you probably want to do at least two things:

  • Control the inputs sent to the LLM
  • Control the outputs generated by the LLM

Input checks can be done either in the system message or prompts. This helps ensure that the LLM receives valid input and reduces the risk of errors or unexpected behavior. For example, you can include checks for input length, format, or content to prevent the LLM from processing inappropriate or irrelevant data. But in scenarios where you want to prevent the data from ever reaching the LLM (like PII or insider information) you need to have an Input Interceptor to preprocess the input before it reaches the LLM. For instance, you might use an input interceptor to remove sensitive information, correct spelling errors, or filter out irrelevant content before passing the input to the LLM. You probably also want your Input Interceptor to detect prompt injection attacks. While this is still a nascent domain, you can leverage the Rebuff project to improve your security posture.

Output checks can also be implemented at different levels. First of all, some models like Llama 2 come with built-in safety tuning, obtained through techniques such as Reinforcement Learning from Human Feedback (RLHF), which reduces violent, illegal, or hateful content. This helps the LLM's output align with your content policies and maintain a safe user experience. Secondly, you can also implement output checks in the system message or prompts. These checks should be based on your intended use, audience, and content policies. For example, you might implement checks to ensure that the LLM's output does not contain sensitive information or content that violates your company's guidelines. Finally, you can also implement a Post-processing Interceptor to further refine the LLM's output and ensure its quality and relevance. For instance, you might use a content classifier to categorize the LLM's output based on topic, sentiment, or other criteria, and then apply additional processing or filtering based on the classification results.
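Here is a minimal sketch of the Decorator pattern applied to an LLM client. The base client is a stand-in for your real integration, and the PII regex and banned-word list are hypothetical placeholders; the point is the shape: interceptors wrap the call and control inputs and outputs without touching the underlying client.

```python
import re

# Minimal sketch of the Decorator pattern around an LLM client.
# The base client is a stand-in for your real integration (API, Langchain, ...).

class LLMClient:
    """Base component: sends a prompt to the LLM and returns its answer."""
    def complete(self, prompt: str) -> str:
        # Call your real LLM here; this stub just echoes for demonstration.
        return f"LLM answer to: {prompt}"

class InputInterceptor(LLMClient):
    """Decorator that redacts email-like PII before the prompt reaches the LLM."""
    def __init__(self, wrapped: LLMClient):
        self.wrapped = wrapped

    def complete(self, prompt: str) -> str:
        redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[REDACTED_EMAIL]", prompt)
        return self.wrapped.complete(redacted)

class OutputInterceptor(LLMClient):
    """Decorator that post-processes the LLM output against simple content rules."""
    BANNED = ("confidential",)          # hypothetical content policy

    def __init__(self, wrapped: LLMClient):
        self.wrapped = wrapped

    def complete(self, prompt: str) -> str:
        answer = self.wrapped.complete(prompt)
        if any(word in answer.lower() for word in self.BANNED):
            return "The generated answer was withheld by the content policy."
        return answer

# Decorators compose: inputs are checked before the call, outputs after it.
client = OutputInterceptor(InputInterceptor(LLMClient()))
print(client.complete("Please email john.doe@example.com a summary of our plan."))
```

In a real integration, the same structure lets you plug in a prompt injection detector (e.g., Rebuff) as another input decorator, or a content classifier as another output decorator, without changing the code that calls the LLM.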


Conclusion

We've reached the end of this short article. Hopefully you learned useful things that you'll be able to apply in your own context. As a reminder, we went through the following items:

  1. What questions should I answer to constitute a short-list of candidate LLMs for my application.
  2. How I can leverage benchmarks, leaderboards and battle arenas to narrow down the short-list to the definitive choice(s)
  3. What are the key design and implementation decisions I should take when interacting and integrating with a Complicated LLM Subsystem.

You also have to keep in mind that the LLM ecosystem is growing and innovating super fast and that what is true today might not be tomorrow. So, my final advice to you is to regularly invest time to monitor the LLM ecosystem evolution and take advantage of it. It's worth it.
