Our Approach to LLM Development
Intelliverse.ai
We aim to empower AI researchers, innovators, and organizations to build scalable AI and Data solutions
How we approach LLM development
We have worked extensively with LLMs and their production applications in recent projects, and we have come across several challenges that need to be addressed. One of the major challenges is the lack of technical rigor in prompt engineering, which can exacerbate flaws the LLMs already have. It's not just about creating something that looks cool; making an LLM application production-ready is far more challenging.
There are three sections to this article:
Part 1 covers the main problems of productionizing LLM applications, along with solutions.
In the second section, we'll cover how to compose multiple tasks using control flows (such as if statements and for loops) and tools (including SQL executors, shells, web browsers, and third-party APIs) to build more sophisticated and powerful applications.
The third section discusses some of the intriguing use cases businesses are building on top of LLMs and how to construct them from simpler tasks.
Part 1. Challenges of productionizing prompt engineering
Compared to programming languages, natural languages are far more flexible, which can lead to ambiguity and silent failures. This ambiguity underlines one of the difficulties of using language models for prompt engineering: the flexibility of user-defined prompts can result in silent failures and unclear output formats, while the stochastic nature of language models can produce an inconsistent user experience. For example, sending the same scoring prompt multiple times can return different scores, an inconsistency that surfaces directly in the user experience; the sketch below illustrates this.
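A minimal sketch of this check, assuming the 2023-era openai Python SDK (openai.ChatCompletion) with an API key configured; the scoring prompt is our own illustration:

```python
import openai  # 2023-era SDK; reads OPENAI_API_KEY from the environment

PROMPT = (
    "On a scale of 1 to 10, how funny is this joke? Reply with a number only.\n"
    "Joke: I told my computer a joke. Now it won't stop laughing."
)

for run in range(3):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
    )
    # The same prompt can yield a different score on each run.
    print(f"run {run}: {resp.choices[0].message.content.strip()}")
```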
Solution: While some practitioners have structured their workflows around the ambiguity and simply accepted it, OpenAI is actively working to mitigate these problems. In the meantime, applying engineering rigor can make prompt engineering more methodical, if not deterministic.
Prompt Evaluation
Prompt evaluation is standard practice in prompt engineering: you give a few examples in the prompt and count on the LLM to generalize from them. Those examples serve a second purpose: checking whether the LLM understands the prompt and whether it overfits the few-shot examples. One method is to feed the exact examples from the prompt back in and see whether the model reproduces the expected outputs; if it doesn't, the prompt may be unclear, and you may need to rewrite it or break the task into smaller pieces. A complementary method is to evaluate the model on held-out examples to check for overfitting.
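A minimal sketch of the first check, where llm_call is a placeholder for your prompt-plus-API wrapper and the examples are hypothetical:

```python
# Hypothetical few-shot examples copied from the prompt itself.
EVAL_EXAMPLES = [
    ("The movie was a delight.", "positive"),
    ("I want my money back.", "negative"),
]

def evaluate_prompt(llm_call, examples):
    """Feed each example input back through the prompt and check that
    the model reproduces the expected output."""
    failures = []
    for text, expected in examples:
        got = llm_call(text).strip().lower()
        if got != expected:
            failures.append((text, expected, got))
    return failures  # empty list means the model reproduces its own examples
```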
Prompt Versioning
Prompt versioning is particularly important since even minor modifications to a prompt can produce very different outcomes. It can be useful to track each prompt's performance using version control tools like Git.
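One lightweight way to do this is to keep each prompt in its own file so that Git history tracks every change; a sketch (the prompts/ directory and file names are our own convention):

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")  # each prompt lives in its own tracked file

def load_prompt(name: str) -> str:
    """Load a prompt template by name, e.g. prompts/summarize.txt."""
    return (PROMPT_DIR / f"{name}.txt").read_text()

# Every prompt change then becomes a reviewable commit:
#   git log --oneline -- prompts/summarize.txt
```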
Prompt Optimization
Techniques for prompt optimization include breaking a large prompt into smaller, simpler prompts, asking the model to explain its reasoning step by step, and generating many outputs for the same input and choosing the best one. There are products that claim to optimize prompts automatically, but they are pricey and frequently just apply these same techniques under the hood; because they don't require coding knowledge, they appeal to non-coders.
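A minimal sketch of the generate-many-and-pick-the-best approach, using simple majority voting (llm_call is a placeholder for your sampling API call):

```python
from collections import Counter

def best_of_n(llm_call, prompt: str, n: int = 5) -> str:
    """Sample n outputs for the same prompt and keep the most frequent
    answer, a simple form of self-consistency voting."""
    outputs = [llm_call(prompt).strip() for _ in range(n)]
    return Counter(outputs).most_common(1)[0][0]
```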
Cost and Latency
Cost: The price of using the OpenAI API is based on the number of input and output tokens used for the inference. Longer prompts with more tokens cost more, and adding extra context or internet-sourced data raises the token count. Still, prompt engineering is typically cheaper and faster for experimentation than standard machine learning expenditure on data collection and model training.
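A back-of-the-envelope cost estimate using tiktoken to count tokens; the per-1K prices below are illustrative mid-2023 numbers, so check the current pricing page before relying on them:

```python
import tiktoken

# Illustrative gpt-3.5-turbo prices (USD per 1K tokens), mid-2023.
PRICE_PER_1K = {"input": 0.0015, "output": 0.002}

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def estimate_cost(prompt: str, expected_output_tokens: int) -> float:
    """Rough cost of one call: input tokens plus expected output tokens."""
    input_tokens = len(enc.encode(prompt))
    return (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + expected_output_tokens / 1000 * PRICE_PER_1K["output"]
    )

print(estimate_cost("Summarize this report: ...", 150))
```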
Latency: Since input tokens can be processed in parallel, input length has little impact on latency. Output length, however, does: tokens are generated sequentially, so longer output sequences mean higher delay. GPT-3.5-turbo has a low latency of around 500 ms for short input and output sequences, but for output sequences longer than 20 tokens, the minimum latency rises to almost one second.
Challenges of productionizing LLM applications
Productionizing LLM applications can be difficult because APIs like OpenAI's can be unreliable, and service level agreements (SLAs) are not yet available. Since the industry is evolving so quickly, predictions about latency and costs for LLM applications can quickly become outdated, and teams may need to periodically reevaluate their feasibility estimates and their decisions about using open-source models versus paid APIs.
Prompting versus finetuning versus alternatives
Prompting is quick and simple for many cases, but it is limited by the input token length.
Finetuning can boost model performance and cut down on prompt instructions, but it requires more examples.
Prompt tuning
A promising middle ground between prompting and finetuning is prompt tuning, which trains a small set of soft prompt embeddings while keeping the model itself frozen. Prompt tuning can approach the performance of full model tuning and outperform manual prompt engineering.
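A minimal sketch of the idea, assuming PyTorch and Hugging Face transformers, with gpt2 as a small stand-in model; the hyperparameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; any causal LM accepting inputs_embeds works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():
    p.requires_grad = False  # the LLM stays frozen; only the soft prompt trains

n_virtual = 20  # number of trainable "virtual tokens"
emb_dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def train_step(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(ids)
    # Prepend the soft prompt to the real token embeddings.
    inputs = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    # Ignore the virtual-token positions when computing the loss.
    labels = torch.cat([torch.full((1, n_virtual), -100), ids], dim=1)
    loss = model(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```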
Finetuning with distillation
Distillation trains a smaller model to imitate the behavior of a bigger one. It can reduce model size and cost while maintaining performance, which makes it a promising method for optimizing the behavior of smaller models based on bigger ones.
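A minimal sketch of a standard distillation loss, assuming PyTorch; the temperature and mixing weight are illustrative defaults:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Blend a soft-target KL term (match the teacher's distribution)
    with an ordinary hard-label cross-entropy term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescales gradients, as in Hinton et al.'s formulation
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```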
Embeddings + vector database
Using LLMs to generate embeddings for ML applications such as search and recommendation systems is another promising direction. At $0.0004/1K tokens as of April 2023, the price for embeddings with the compact text-embedding-ada-002 model is reasonable. Embeddings only need to be generated once per item, and the OpenAI API makes it simple to create embeddings for queries and new items in real time. The main cost of embedding models for real-time use cases is loading the embeddings into a vector database for low-latency retrieval.
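A minimal sketch of the flow, assuming the 2023-era openai SDK; a plain NumPy cosine-similarity scan stands in for the vector database:

```python
import numpy as np
import openai  # 2023-era SDK: openai.Embedding.create

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# Embed catalog items once; only new items and queries need embedding later.
items = ["red running shoes", "wireless headphones", "espresso machine"]
item_vecs = embed(items)

def search(query: str, k: int = 2):
    """Return the k items most similar to the query by cosine similarity."""
    q = embed([query])[0]
    scores = item_vecs @ q / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(q)
    )
    return [items[i] for i in np.argsort(-scores)[:k]]

print(search("sneakers"))
```

In production, the NumPy scan would be replaced by a vector database for low-latency retrieval at scale.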
Backward and forward compatibility
Foundational models can work out of the box for many tasks without retraining, but they still need to be retrained or finetuned from time to time. When moving to a newer model, prompts often need rewriting, and it is important to unit-test all prompts against evaluation examples. Prompt patterns are not robust to changes, so updates may be needed whenever the underlying model or API behavior changes. This can be challenging in complex applications with multiple prompts and changing team members, where understanding the original prompt logic is hard.
Part 2. Task Composability
Most applications consist of multiple tasks. For example, a program can perform a sequence of tasks: convert a natural language input into a SQL query, execute the query, then convert the SQL result into a natural language response. These tasks can be composed using agents, tools, and control flows. An agent is an application that can execute multiple tasks according to a given control flow; control flows can be sequential, parallel, if statements, or for loops, and the conditions for control flows can themselves be determined by prompting the LLM. Testing an agent is important for ensuring its reliability, and each task should be tested separately before the tasks are combined, as in the sketch below.
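A minimal sketch of that pipeline with one control-flow check; llm is a placeholder for any callable that sends a prompt to a model and returns text:

```python
import sqlite3

def answer(question: str, db_path: str, llm) -> str:
    # Task 1: natural language -> SQL
    sql = llm(f"Write a single SQLite SELECT query for: {question}\nSQL:")
    # Control flow: only execute queries that look read-only.
    if not sql.lstrip().lower().startswith("select"):
        return "Sorry, I couldn't translate that into a safe query."
    # Task 2: execute the SQL query
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    # Task 3: SQL result -> natural language
    return llm(
        f"Question: {question}\nQuery result: {rows}\nAnswer in plain English:"
    )
```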
There are two major types of failure modes:
1. One or more component tasks fail.
2. All tasks produce correct results, but the overall solution is incorrect. This is measured by the composability gap: the fraction of compositional questions that the model answers incorrectly out of all the compositional questions for which it answers the sub-questions correctly. Closing this gap remains an open challenge in the development of language models.
Unit Testing and Integration Testing
As with software engineering, you can and should unit test each component as well as the control flow. For each component, define pairs of (input, expected output) as evaluation examples, which you can use to evaluate your application every time you update your prompts or control flows. You can also run an integration test for the entire application.
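A pytest-style sketch for the natural-language-to-SQL component above; the nl_to_sql import is hypothetical and stands for your prompt-plus-API wrapper:

```python
# test_prompts.py, run with pytest
from my_app import nl_to_sql  # hypothetical module under test

EVAL_EXAMPLES = [
    ("How many users signed up today?",
     "select count(*) from users where signup_date = date('now')"),
]

def normalize(sql: str) -> str:
    return " ".join(sql.lower().split())

def test_nl_to_sql_examples():
    for question, expected in EVAL_EXAMPLES:
        assert normalize(nl_to_sql(question)) == normalize(expected)
```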
Part 3. Promising Use Cases
The internet has been flooded with cool demos of applications built with LLMs. Some of the most common and promising applications are AI assistants, chatbots, programming and gaming, learning, talk-to-your-data, search and recommendation, sales, and SEO.
AI Assistant
The most common consumer use case is, without a doubt, the AI assistant. The ultimate goal is an assistant that can help you with everything. There are AI assistants built for different tasks and different groups of users: scheduling, note-taking, pair programming, responding to emails, helping with parents, making reservations, booking flights, shopping, and so on.
All major corporations have been racing toward this goal for years: Google with Bard and Google Assistant, Facebook with M and Blender, and OpenAI (and, consequently, Microsoft) with ChatGPT. Quora, which runs a high risk of being replaced by AIs, has introduced Poe, a chat app that supports multiple LLMs. Surprisingly, neither Apple nor Amazon has entered the fray yet.
Chatbots
Chatbots are comparable to AI assistants in terms of APIs. Character.ai is perhaps the most intriguing company in the consumer-chatbot market. Things may become even more interesting if a revenue-sharing arrangement lets chatbot creators get compensated.
Programming and gaming
LLMs turn out to be quite skilled at writing and debugging code, with GitHub Copilot as a pioneer. There have been fascinating demonstrations of using LLMs to write code, build web apps from natural language, identify security issues, and create games.
Learning
Exploration of ChatGPT is gaining momentum across EdTech companies.
Example use cases include automatically creating quizzes, grading essays and providing feedback, walking students through math problems, and serving as a debate partner.
With homeschooling growing in popularity, ChatGPT will probably see heavy use in supporting parents who homeschool their children.
Talk-to-your-data
Building solutions that let enterprise users query their internal data and policies in natural language or a Q&A format is currently the most popular enterprise application. Startups are concentrating on verticals such as resumes, financial data, legal contracts, and customer support. The conventional method involves loading internal data into a database, translating the natural language input into the database's query language, running the query, and translating the results back into natural language. Although impressive, the viability of this category as a standalone business is questionable, because it could be added as a feature to existing systems like Google Drive or Notion. An OpenAI tutorial explains how to talk to your vector database.
A question many people ask: can LLMs analyze my data for me?
In a recent project, we tried feeding gpt-3.5-turbo some data, and it seemed able to recognize some trends. However, this only works for modest amounts of data that fit into the input context; most production datasets are far bigger than that.
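A minimal sketch of that experiment, assuming the 2023-era openai Python SDK; the CSV snippet is made up for illustration:

```python
import openai  # 2023-era SDK; reads OPENAI_API_KEY from the environment

# A small table that fits comfortably in the context window.
csv_snippet = """month,revenue
Jan,12000
Feb,12500
Mar,14800
Apr,15100"""

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Here is some data:\n{csv_snippet}\nWhat trends do you see?",
    }],
)
print(resp.choices[0].message.content)
```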
Search and recommendation
Search and recommendation have always been the foundation of enterprise use cases, and LLMs are giving them a resurgence. Even when consumers don't yet know exactly what they need, LLMs can help them find it.
SEO for LLMs
LLMs excel at generating content, which could saturate search engines with SEO-optimized material for any given keyword. As search engines develop new algorithms to identify AI-generated content, SEO may turn into even more of a cat-and-mouse game.
Conclusion
In conclusion, LLM applications are in the early stages of development, and everything is evolving fast. New APIs, applications, and infrastructure are constantly being released and improved.
Not all changes will be significant, and some prompt-tweaking techniques may not be worthwhile in the long term.
Methods for staying current in the field:
i. Wait six months, ignore the majority of the hype, and see what sticks.
ii. Read only the summaries, using tools like Bing Chat.
iii. Build with each new tool as it is released, in an effort to stay on top of new developments.