How to Prevent a Failed GenAI Implementation Through Improved Search
Dr Dave G.
Chief Technology Officer | Digital Transformation Specialist | Experienced Business Leader and Consultant | Thought Leader in AI, Data, Cloud (AWS and Azure), RPA and GenAI
Having now delivered many Generative AI (GenAI) projects, I have learnt a good deal about what makes them succeed or fail, and a few common failure modes stand out. One of the most common failure modes for a GenAI pilot is poor configuration of the supporting Search technology. It has been a truism since the dawn of IT that "Garbage In = Garbage Out". If the Search technology that supports your GenAI solution is not designed correctly, you get "Garbage In", the results your Large Language Model (LLM) provides will be very poor, and you get "Garbage Out".
This blog article provides a very non-technical explanation of basic search tooling and how getting it right can significantly increase your chances of a successful GenAI pilot. It should take you less than 10 minutes to read, and it should equip you, as a business sponsor or senior technical leader, to ask your delivery folks the right questions when a GenAI solution deployed to your organisation doesn't seem to work. Hopefully, those undertaking a GenAI pilot will find it useful.
Background: What is GenAI, a Foundation Model, and an LLM? Why are they important? How are they used?
Traditionally, Artificial Intelligence (AI) required organisations to build a specific AI model for each individual use case, whether machine vision, natural language processing or something else. These models, trained on use-case-specific data, needed to be tuned and constantly monitored for performance, and updated regularly when performance degraded. Organisations needed to pay for the people and tools necessary to build and manage each of these individual AI models. As a result, AI was only really economical for higher-volume use cases, where the costs of AI model development and management were offset by the benefits of using AI.
In 2017, researchers at Google, together with a collaborator from the University of Toronto, published a research paper called "Attention Is All You Need". See here if you'd like to read it. Without going into detail, this paper fundamentally changed AI by giving organisations a computationally efficient way of building extremely large AI models. These extremely large models, trained on a wide variety of data, could be applied to many use cases, not just one. As a result, the concept of the "Foundation Model" was born. The term "Foundation Model" was first coined by the Stanford AI team (see here) to refer to AI models that can be applied across many use cases. Foundation Models allow organisations to adopt a build-once, use-many-times approach. This radically changes the economics of AI by making even lower-volume use cases economical. It also allows organisations to use models built by other organisations, reducing the necessary investment and improving the economics of AI even further.
Generative AI (GenAI) is the use of AI to generate content, whether text, images, or voice. Large Language Models (LLMs) are a form of Foundational, Generative AI used specifically for text generation. I recently published a video on what generative AI is, which you can watch here if you'd like further information and explanation.
ChatGPT, which OpenAI released for public consumption a little over a year ago (see here) and which most people have played with, is built on a Large Language Model (LLM), a form of Foundational, Generative AI. ChatGPT demonstrated the capability of these Foundational GenAI models, and ever since, organisations have been racing to adopt this new technology because of its benefits.
Adoption of these Foundational, Generative AI solutions has been so fast that ChatGPT reached mass adoption more quickly than Facebook, the mobile phone, or even the Internet (see here). It is no surprise, then, that so many of my clients are experimenting with this technology, and that many organisations are asking big tech companies like Microsoft, AWS, Google, and IBM to help them deploy Foundational, Generative AI.
This rapid adoption is largely due to the perceived benefits of Generative AI, from content retrieval to content generation to decision-making support. These new, more powerful, and more economical AI models mean that the use of AI as a tool for automation, cost savings and improved customer experience has accelerated dramatically over the last 12 to 18 months.
However, GenAI solutions do have challenges, one of which is Hallucination, which I will discuss in the next section.
LLM Hallucination: What is it? Why Does it Happen? And Why is it Important?
So now that you are clear on what a Foundation Model, Generative AI, and a Large Language Model (LLM) are, let's discuss Hallucination. Hallucination is a term the press has popularised in connection with GenAI tools like ChatGPT. It refers to the fact that LLMs can fabricate entirely incorrect information in their responses. Because "hallucination" implies a human-conscious quality that is inappropriate to apply to a machine, those of us in the industry prefer to avoid the term and typically use words like "fabrication" to describe this phenomenon. If you want to read a really good article on the topic of GenAI Hallucination, see here.
In practical terms, GenAI hallucinations can occur for a variety of reasons. These include the GenAI model being trained on incorrect data, the question or prompt being put to the LLM being improperly formatted, or the fact that an LLM is a stochastic algorithm that will still predict a plausible-sounding response even when it has insufficient information to provide the right one.
When ChatGPT first came on the scene, there were some very famous cases of it hallucinating, and users inexperienced with GenAI got into trouble by basing decisions on those hallucinations. This was largely because ChatGPT was trained on the internet, and since the internet contains a lot of incorrect or non-factual content, some of what ChatGPT produced was not factual. Because ChatGPT produced fluent responses, and because people tend to mistake fluency for accuracy, people got into trouble when relying on those responses.
The technology has come a long way, and the propensity of LLMs like ChatGPT to hallucinate has been reduced significantly, partly because of improvements in the technology and partly because the content these LLMs are trained on is better curated and therefore more accurate. However, hallucination is still a challenge that has to be addressed in most GenAI implementations.
One way to avoid Hallucination in your LLM is to "Ground" it by feeding it your own information, using a technique called Retrieval Augmented Generation, or RAG. RAG is very common; by some estimates, more than 60% of GenAI solutions will involve some form of RAG to improve content accuracy and avoid Hallucination. I will talk about RAG in the next section and why Search is a critical component of all RAG solutions.
What is Retrieval Augmented Generation (RAG), how does it address Hallucination, and why is Search Important to RAG?
In simple terms, Retrieval Augmented Generation (RAG) refers to the idea that you can reduce "Hallucination" and increase the "Accuracy" of your LLM by "Grounding it." This is where you provide the LLM with your own curated data upon which it can base its responses.
RAG is essentially a two-step process. When the user gives the LLM a question or an instruction, the GenAI solution first searches a curated database for content relevant to that question or instruction. That content is then fed to the LLM, and the LLM is instructed to base its answer or response on it. The degree to which the LLM uses the content provided, versus information from its own training, is controllable; typically this is a trade-off between accuracy and fluency of the response.
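To make the two steps concrete, here is a deliberately minimal sketch in Python. The word-overlap retriever and the call_llm stub are simplifications of my own, standing in for a real search index and a real LLM client; they are not any particular vendor's API.

```python
# Minimal two-step RAG sketch: (1) retrieve relevant content from a
# curated store, (2) instruct the LLM to ground its answer in it.
# The retriever is a naive word-overlap scorer purely for illustration;
# call_llm is a stand-in for whichever LLM client you actually use.

DOCUMENTS = [
    "Annual leave requests must be approved by your line manager.",
    "The standard lasagna recipe requires 45 minutes in the oven.",
    "Expense claims over $500 require CFO sign-off.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Step 1: find the documents most relevant to the user's question."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(d.lower().split())), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (OpenAI, Azure OpenAI, watsonx, etc.)."""
    return f"[LLM response grounded in a prompt of {len(prompt)} characters]"

def answer(question: str) -> str:
    context = "\n".join(retrieve(question, DOCUMENTS))
    # Step 2: ground the LLM by instructing it to use the retrieved text.
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer("Who approves annual leave?"))
```

In a production solution, that retrieve function would be replaced by one of the search techniques discussed in the rest of this article, and it is exactly there that GenAI pilots tend to go wrong.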
RAG is used not only to reduce Hallucinations and improve accuracy but also to provide the LLM with content very specific to the application it is being used for and, therefore, unlikely to be data that was included in the LLM's original training. So, it is a way of tailoring your LLM to your business needs. As a result, RAG is a very common technique.
But herein lies the problem. If you are asking the LLM to base its response on information found by searching a database of curated data, then the quality of that data, and the ability to find the right data in that database, become extremely important.
I will leave the topic of Data Quality for another blog article; the point of this one is to focus on Search and finding the right information. Let us assume for the moment, then, that all the information in your database is correct and up to date. Even so, if your search returns the wrong information for your LLM to base its answer on, the quality of that response will be poor. So, having good search functionality engineered into your GenAI solution is essential for the overall solution to work effectively.
Since the beginning of IT, effective information search has been a challenge. Whole companies, Google among them, have been built on providing more accurate and effective search algorithms. Interestingly, even though this challenge has existed for many years, most corporate IT organisations are still relatively immature in their understanding and use of search, which leads to problems when building GenAI or LLM solutions that employ the RAG technique.
Keyword or Text Search vs Semantic-based Search
When most people think of search, they think of keyword or text search, sometimes called lexical search. In simple terms, this means matching the words in your query to the words in the content of your database. For example, searching a cooking database for "Really good recipes for lasagna" might return a lot of matches on the word "lasagna". Modern keyword search is a little more advanced: it usually filters out so-called stop words like "the" and "a", which are so common they cause incorrect matches, and it addresses synonyms (such as "cold" and "chilly") and word stemming (such as "cold" vs "colder"). But at the end of the day, it is still a text-based lexical match. There is no understanding of the semantic meaning of the query when the content is retrieved.
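To show just how literal lexical matching is, here is a toy example in Python. The stop-word list and the crude suffix-stripping are deliberate simplifications of what a real engine (for instance one based on BM25) would do.

```python
# A toy lexical (keyword) search: stop-word removal, crude suffix
# stripping as a stand-in for real stemming, then plain word matching.

STOP_WORDS = {"the", "a", "an", "for", "really", "is", "it"}

def tokenize(text: str) -> set[str]:
    words = [w.strip(".,!?").lower() for w in text.split()]
    # Crude "stemming": strip common suffixes so "colder" matches "cold".
    stems = [w.removesuffix("er").removesuffix("s") for w in words]
    return {w for w in stems if w and w not in STOP_WORDS}

def keyword_score(query: str, document: str) -> int:
    """Count how many (stemmed, non-stop) words the two texts share."""
    return len(tokenize(query) & tokenize(document))

docs = ["Really good recipes for lasagna", "It's a bright day outside"]
query = "The weather today is sunny"
print([(d, keyword_score(query, d)) for d in docs])
# The weather query scores 0 against both documents: there is no lexical
# overlap, even though the second document means the same thing.
```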
Then there is semantic-based search, which is search based on meaning. Semantic-based search seeks to improve accuracy by understanding the searcher's intent and contextual meaning, and comparing this to the meaning of the content within the curated database. For example, the two sentences "The weather today is sunny" and "It's a bright day outside" would match on semantic similarity of meaning, yet have no keyword match at all. The key benefit of semantic-based search is that it can retrieve relevant content even when the query's wording differs entirely from the content's.
However, not all vendors and technologies provide semantic-based forms of search. Moreover, implementing semantic search over a large amount of content can be challenging. This is where vector search comes in.
What is a Vector Search/ Query?
Vector Search is a technique for retrieving results from large volumes of content based on semantic meaning. In simple terms, each item of content within the database is translated into a multi-dimensional mathematical representation of its semantic meaning. This process is called embedding, and its output is vectors (series of numbers) representing the semantic meaning of each piece of content. The query string is converted into a vector in the same way. The query vector is then compared against the content vectors using techniques such as ANN (Approximate Nearest Neighbour) search, and the content with the closest matching vectors is returned.
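The sketch below shows the whole pipeline end to end, using the open-source sentence-transformers library; the model name is a small, commonly used example of my choosing, not a recommendation, and any embedding model would serve. At scale, the brute-force comparison on the last lines would be handed to an ANN index (FAISS is one well-known example).

```python
# Vector search sketch: embed the documents and the query, then rank
# documents by cosine similarity to the query vector.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Really good recipes for lasagna",
    "The weather today is sunny",
]
# normalize_embeddings=True lets a plain dot product act as cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("It's a bright day outside", normalize_embeddings=True)

scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(docs[best], float(scores[best]))
# The weather sentence wins despite sharing no keywords with the query:
# exactly the match that the lexical search above scored as zero.
```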
There are many different embedding techniques and variants in the vector search process. All of these are designed to provide better semantic-based matches when searching through large volumes of content.
What is a Hybrid Search/ Query?
While vector search based on semantic matching will generally provide better results, it does not excel at every search use case. It falls short when searching for specific people's names, the names of objects, acronyms, or IDs. A simple text-based search provides better results in these cases. Text-based search is also better at precise matching, matching on just a few characters, or matching low-frequency vocabulary. It is therefore not uncommon to combine text-based and vector searches in an attempt to improve overall search results. This is referred to as Hybrid Search.
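One simple way (among several) to blend the two is a weighted combination of normalised scores, sketched below. The document names and score values are made up for illustration; in practice they would come from your keyword engine and your vector index respectively.

```python
# Hybrid search sketch: blend lexical and vector scores with a tunable
# weight. alpha=1.0 is pure keyword search, alpha=0.0 is pure vector search.

def hybrid_scores(keyword: dict[str, float],
                  vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    def normalise(scores: dict[str, float]) -> dict[str, float]:
        top = max(scores.values()) or 1.0  # avoid dividing by zero
        return {doc: s / top for doc, s in scores.items()}
    kw, vec = normalise(keyword), normalise(vector)
    docs = kw.keys() | vec.keys()
    return {d: alpha * kw.get(d, 0.0) + (1 - alpha) * vec.get(d, 0.0)
            for d in docs}

keyword = {"doc_id_policy": 0.9, "doc_weather": 0.1}   # exact ID match wins
vector  = {"doc_id_policy": 0.3, "doc_weather": 0.8}   # semantic match wins
print(sorted(hybrid_scores(keyword, vector).items(),
             key=lambda kv: kv[1], reverse=True))
```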
Reranking Search to Improve Results
When multiple types of search are combined to improve a search result, as in hybrid search, a system of combining and prioritising the search results from the different query types becomes necessary. This is where reranking comes in. There are multiple rerank techniques, all of which attempt to prioritise the retrieved content based on its relevance to the user's question or instruction.
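One widely used scheme is Reciprocal Rank Fusion (RRF), which needs only the rank positions from each result list; reranking with a cross-encoder model is a heavier-weight alternative. A minimal sketch:

```python
# Reciprocal Rank Fusion (RRF): each result earns 1 / (k + rank) from
# every list it appears in, and results are re-sorted by total score.
# k=60 is the value commonly cited from the original RRF paper.

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_a", "doc_b", "doc_c"]
vector_results = ["doc_c", "doc_a", "doc_d"]
print(rrf([keyword_results, vector_results]))
# doc_a and doc_c rise to the top because both searches agree on them.
```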
Chunking and Why it is Important
All LLMs, whether proprietary or open source, have a built-in limit (the context window) on the amount of content you can pass to the LLM in a prompt. This is a property of the underlying mathematical model. It means that a document or piece of content has to be broken up into chunks when stored in your database. The search, whether text-based, vector, or hybrid, then returns the most appropriate chunks to pass to the LLM.
Chunk size can materially affect the search result if the piece of information the LLM needs to address the query is actually far away from the part of the content that the search query matched on, which is common for content such as operational procedures. So, smaller chunks can lead to poorer LLM responses. On the other hand, passing larger chunks can be more costly, because most LLMs charge per token, and the number of tokens is proportional to the amount of content passed to the LLM. Larger chunks can also affect the query results, depending on the type of query used. Different techniques can be used to identify the optimal chunk size, but there is often a degree of trial and error involved. For a really good article on the concept of chunking, see here.
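A minimal fixed-size chunker with overlap is sketched below, purely to illustrate the mechanics. The sizes are arbitrary starting points of mine, not recommendations; as noted above, tuning them is usually trial and error.

```python
# Fixed-size chunking with overlap. Overlap helps when the relevant
# sentence sits near a chunk boundary, at the cost of some duplication.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # each chunk starts 'step' words after the last
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

procedure = " ".join(f"step{i}" for i in range(1, 501))  # a long document
chunks = chunk_text(procedure)
print(len(chunks), "chunks; first chunk ends at:", chunks[0].split()[-1])
```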
How the Content is Stored is also Important
Setting aside data quality once more, I have seen a more basic issue with Search that has impacted the ability to do Retrieval Augmented Generation properly and provide an effective GenAI solution: content type. I have come to realise that many technology vendors restrict the document formats their search tools can effectively ingest and parse. For example, some technology vendors don't support .aspx formatted pages or other document formats. How the content is stored can therefore limit your ability to use RAG to ground your LLM.
Using Chat to improve your search results
Using something like LangChain to maintain the context of a conversation can also improve search results, by providing the search with more content on which to match semantic meaning. How many previous turns of the chat to use to improve the search is, again, often identified by trial and error.
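The essence of the technique, stripped of any framework, is sketched below. The three-turn window is an arbitrary assumption of mine, and a common refinement (which frameworks like LangChain package up) is to have the LLM itself rewrite the history and question into a single standalone query before searching.

```python
# Enrich the search query with recent conversation history so the
# retrieval step has more semantic context to match against.

def build_search_query(history: list[str], question: str, window: int = 3) -> str:
    recent = history[-window:]  # last few turns only, to bound query size
    return " ".join(recent + [question])

history = ["What is our parental leave policy?",
           "Primary carers receive 16 weeks of paid leave."]
print(build_search_query(history, "Does that include adoption?"))
# Without the history, "Does that include adoption?" matches almost nothing.
```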
How to incorporate Feedback and Continuous Process Improvement
So far, I've discussed several different options for using search with GenAI: keyword or text-based search, semantic and vector search, hybrid search, reranking, chunking strategies, content formats, and the use of chat context.
Many more Search options for GenAI are being developed constantly, and working through all this optionality often involves trial and error. To begin the GenAI solution design process, it is important to have a large set of question-and-answer (or action) pairs to test the GenAI solution against. It is also important to have an ongoing process for feeding back user satisfaction with the LLM, so that GenAI developers and Prompt Engineers can continuously adjust and improve their techniques to optimise the responses. This deserves consideration, particularly around search design, at the beginning of the pilot process.
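A bare-bones version of such a test harness might look like the sketch below. The keyword-containment check is a crude stand-in for human review or LLM-as-judge scoring, and ask_genai is a placeholder for your end-to-end GenAI solution.

```python
# Evaluation loop over question/answer pairs. Re-run it after every
# change to chunking, search type or reranking to see whether the
# change actually improved responses.

TEST_SET = [
    {"question": "Who approves annual leave?", "must_mention": "line manager"},
    {"question": "What needs CFO sign-off?", "must_mention": "$500"},
]

def ask_genai(question: str) -> str:
    """Placeholder for the real end-to-end RAG pipeline."""
    return "Annual leave is approved by your line manager."

def evaluate(test_set: list[dict]) -> float:
    passed = sum(
        1 for case in test_set
        if case["must_mention"].lower() in ask_genai(case["question"]).lower()
    )
    return passed / len(test_set)

print(f"pass rate: {evaluate(TEST_SET):.0%}")
```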
Conclusions
I began this blog by saying that I've often seen a GenAI pilot fail because of ineffective Search use. If Hallucination is a problem or if the content you want the LLM to use in its responses is very specific to your organisation, then you often have to use Retrieval-Augmented Generation in your GenAI solution. A core component of RAG is the Search design that is used to retrieve the content to augment your LLM. Many organisations don't understand this and attempt to use very simple keyword or Text-Based Search when building their GenAI solution and never consider all the optionality I talk about above. They then wonder why their LLM performs poorly, and their pilot is perceived as a failure.
Dr David Goad is the CTO and Head of Advisory for IBM Consulting Australia and New Zealand. He is also a Microsoft Regional Director. David is frequently asked to speak at conferences on the topics of Generative AI, AI, IoT, Cloud and Robotic Process Automation. He teaches courses in Digital Strategy and Digital Transformation at a number of universities. David can be reached at [email protected] if you have questions about this article.