Tuning Information Retrieval in Agent Builder Search applications with Google Search Adaptor


Introduction:

We've all experienced the frustration of a search application returning a list of seemingly related results, yet none quite hitting the mark. Behind the scenes, a complex dance of algorithms determines what information makes the cut. At the heart of this process lies Information Retrieval, the stage tasked with scanning the vast ocean of information to select the most relevant chunks to inject into the context of large language models like Gemini: the essence of Retrieval Augmented Generation (RAG) systems. It's a delicate balance of speed and precision, where the quality of retrieved information directly impacts the relevance of generated responses.

So what happens when the returned list of chunks or documents does not hit the mark? This is the moment when you start looking for another solution, or you delve under the hood of the existing one and try to replace or tune its elements.

An alternative approach is to keep the system as is and use a solution like Search-Adaptor from Google Research, which adds a module on top of a fixed, pre-trained embedding model. This module is customized for customer data and effectively modifies the embeddings generated by the pre-trained model to better capture the nuances of your specific data domain.

A similar technique has been implemented in Agent Builder Search, and in this blog post we want to uncover how designers of Search applications can leverage it to tune their private Google Search experience, and then evaluate the impact of Search Tuning on generated responses.


A vast ocean of information requires a consistent representation of different information types so that it can be efficiently stored and scanned. In the modern search landscape, embeddings have emerged as the most effective solution. By transforming diverse data types (text, images, videos) into numerical vectors, embeddings capture the semantic essence of information, allowing for easy comparison of diverse things. The distance between embeddings carries semantic meaning, i.e., similar items are closer together.

To answer a query with this approach, the system must first map the query to the embedding space. It then must find, among all database embeddings, the ones closest to the query; this is the nearest neighbor search problem.

One of the most common ways to define query-embedding similarity is by their inner product; this type of nearest neighbor search is known as maximum inner-product search (MIPS). Because the database size can easily be in the millions or even billions, MIPS is often the computational bottleneck for inference speed, and exhaustive search is impractical. This necessitates approximate MIPS algorithms that exchange some accuracy for a significant speedup over brute-force search. One of the most efficient algorithms in this space is ScaNN (Scalable Nearest Neighbors), open-sourced by Google: https://github.com/google-research/google-research/tree/master/scann
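
To make this more concrete, here is a minimal sketch of approximate MIPS with the open-source ScaNN library (assuming pip install scann); the data is random toy data and the tuning parameters are purely illustrative, not recommendations.

import numpy as np
import scann

# Toy database: 100,000 embeddings of dimension 128, unit-normalized.
database = np.random.rand(100_000, 128).astype(np.float32)
database /= np.linalg.norm(database, axis=1, keepdims=True)

# Approximate MIPS: partition the database into leaves, score candidates with
# asymmetric hashing, then exactly re-score the best 100 candidates.
searcher = (
    scann.scann_ops_pybind.builder(database, 10, "dot_product")
    .tree(num_leaves=1000, num_leaves_to_search=100, training_sample_size=50_000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

query = np.random.rand(128).astype(np.float32)
neighbor_ids, scores = searcher.search(query)  # indices and inner products of the top 10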

However, speed is not the only challenge. Ensuring information retrieval results include the precise chunks you think are essential adds another layer of complexity. Picture this: your information retrieval system, diligently scanning the database of embeddings, presents you with a cluster of blue dots representing documents deemed most similar to the user's query (the red dot). However, you know the ideal answer lies further afield, encapsulated within the information represented by a lone orange dot.

While it might be tempting to simply crank up the number of returned chunks, hoping the desired information will magically appear in the expanded list, this approach isn’t always the most efficient. Blindly increasing the retrieval count can introduce noise and irrelevant information, potentially diluting the quality of the results.

An alternative would be to modify how things are represented in the embedding space. If you think of an embedding model as a cartographer, meticulously mapping information into a navigable digital space, then modifying the embedding space is like redrawing the map.

There are methods to do this! If you're working in a specialized field like medicine, law, or finance, you can fine-tune the embedding model. This means training it on domain-specific data, molding the embedding space to better capture the nuances of your specific data domain.

Google Cloud provides a robust suite of pre-trained embedding models accessible via API, catering to a variety of language and semantic understanding needs. For English-centric tasks, the state-of-the-art "text-embedding-004" model delivers high-quality vector representations for text data. If your project demands multilingual support, including Polish, you should use the "text-multilingual-embedding-002" model.
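
As a quick illustration, and assuming the Vertex AI SDK is installed and initialized for your project, fetching embeddings from these models looks roughly like this (a sketch, not a full application):

from vertexai.language_models import TextEmbeddingModel

# Pre-trained model served via API; swap in "text-embedding-004" for English-only data.
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
embeddings = model.get_embeddings(["How do I open a savings account?"])
print(len(embeddings[0].values))  # dimensionality of the returned vector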

In our next article, we’ll dive deep into the practical side of things and show you how to fine-tune Google’s embedding models on Google Cloud.

In this article, however, we will answer another question: what if you're working with an embedding model that's only accessible through an API, or the embedding model is not exposed to app developers at all?

This scenario poses a unique challenge, but fear not! Google Research has pioneered innovative techniques to tackle this very situation.

One of them is explained in a Google research paper entitled "Search-Adaptor: Embedding Customization for Information Retrieval". Search-Adaptor relies on adding a low-capacity adapter module on top of a fixed, pre-trained embedding model. This adapter module is customized for customer data and effectively modifies the embeddings generated by the pre-trained model. An additional advantage of this approach is that the resulting adapter can be integrated with any embedding model, including those only available via prediction APIs. On multiple English, multilingual, and multimodal retrieval datasets, the paper reports consistent and significant performance benefits for Search-Adaptor, e.g., more than 5% improvement in nDCG@10 for Google Embedding APIs, averaged over 14 BEIR datasets.

How is the Search-Adaptor method (also known as Search Tuning) different from embedding fine-tuning?

A key advantage of Search Tuning with Search-Adaptor over embedding fine-tuning is its non-invasive nature. The base embedding model remains untouched, with the adaptor dynamically applied during the search process. This means that when you update your Search-Adaptor, there’s no need for the time-consuming and resource-intensive task of reindexing your entire document collection.

Contrast this with traditional fine-tuning, where the base embedding model is effectively replaced with a new, retrained model. Such an approach often necessitates a complete reindexing, potentially disrupting your search workflow and incurring additional computational costs. Search Tuning, on the other hand, offers a seamless and efficient way to adapt your embeddings on the fly, ensuring your search system stays up-to-date and responsive to evolving information needs.

Search-Adaptor isn’t just theoretical?—?it’s been put into practice to enhance Google Cloud’s Vertex AI Search. This managed service allows you to build a private, customized Google Search experience on your company’s data.

Training Search-Adaptors is available in the Search application under Configurations -> Tuning:

Once the search tuning job is completed, you can specify whether you want to continue using the base model or to pass user queries through the Search-Adaptor (tuned model):

The improvements are expressed as normalized discounted cumulative gain at the first 10 positions (nDCG@10). Let's decode this metric step by step. Normalized discounted cumulative gain is a more advanced variant of cumulative gain (CG), so let's first clarify that simpler metric. Imagine you're searching for the best restaurants in your city. You type your query into Google Search, and it gives you a list of 10 restaurants. Cumulative Gain is a simple way to measure how good those suggestions are overall.

  • Imagine that each restaurant in the list has a "relevance score". Let's say the most relevant restaurant gets a score of 3, the next most relevant gets a 2, and so on. How would the system know the relevance score? Someone needs to build an evaluation set and define the relevance scores. This set will be considered ground truth data for training and testing.
  • You iterate through the test dataset and calculate CG from the returned results for every question. You calculate it by simply adding up the relevance scores of all the results. So if the top 3 results have scores of 3, 2, and 3, the CG would be 3 + 2 + 3 = 8. Because we check the first three positions, this is CG@3. If we were checking the first 10 results, it would be CG@10. As simple as that.
  • Higher CG means the search engine returned more relevant results overall.

Now, CG has a problem: it treats all positions in the list equally. But in reality, the position of items counts! Having your favorite restaurant (high relevance score) on the list is fine, but having it in first position is even better. Discounted Cumulative Gain (DCG) fixes this by giving more weight to the higher positions. It does this by dividing each relevance score by a factor that increases as you go down the list. So the top result's score is divided by 1 (no change), the second result's score is divided by something a bit bigger than 1, the third result's score by something even bigger, and so on. This means that if a very relevant result is buried deep in the list, its contribution to the DCG will be small. Higher DCG means the search engine returned more relevant results at the top of the list, where it matters most.

Finally, DCG has one more issue: its value depends on the specific query and the relevance scores you assigned. This makes it hard to compare DCGs across different searches. nDCG solves this by normalizing the DCG. It does this by dividing the DCG by the "ideal DCG", the DCG you'd get if the results were perfectly ordered from most to least relevant. This gives you a score between 0 and 1, where 1 means the results are perfectly ordered, and 0 means they're completely irrelevant. nDCG lets you compare the quality of search results across different queries, even if they have different relevance scores. Search Tuning in Agent Builder Search applications calculates nDCG@10 for you.
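
To make the arithmetic tangible, here is a small, self-contained sketch of CG, DCG, and nDCG using the common rel / log2(position + 1) discount (other DCG variants exist); the relevance scores are made up for the restaurant example above:

import numpy as np

def dcg_at_k(relevances, k):
    relevances = np.asarray(relevances, dtype=float)[:k]
    # Position 1 is divided by log2(2) = 1, position 2 by log2(3), and so on.
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return float(np.sum(relevances / discounts))

def ndcg_at_k(relevances, k):
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)  # perfectly ordered results
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

returned = [3, 2, 3, 0, 1]   # relevance of the results in the order the engine returned them
print(sum(returned[:3]))     # CG@3 = 3 + 2 + 3 = 8
print(round(dcg_at_k(returned, 10), 3))
print(round(ndcg_at_k(returned, 10), 3))   # 1.0 would mean a perfectly ordered list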

So what do you need to start Search Tuning? You need data that will be used to train the Adapter so that it knows how to modify the original embedding space to better represent your data. The training data consists of example queries, a corpus (documents), and relations between queries and documents, provided as training and testing sets:

Queries: a list of questions. The recommendation is to provide 100 questions.

Corpus: a list of chunks representing your knowledge base. Ideally you need 10,000 chunks, and 100 of these must correspond to the questions from the Queries dataset.

Training: a file with query:chunk pairs and the corresponding relevance score. If you do not provide negative examples, the tuning job will create them from the long list of documents that are not paired with questions.

While the Search Tuning job can automatically split your training set into training, validation, and testing subsets, providing your own dedicated testing set is strongly advised. This ensures consistency in evaluation and prevents potential discrepancies in nDCG@10 results across multiple training runs due to variations in data splits.
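
For orientation, the snippet below writes tiny example files in the shape described above. The field names follow the BEIR-style layout referenced in the Search Tuning documentation, so treat them as an assumption and double-check them against the current docs before launching a tuning job.

import json

queries = [{"_id": "q1", "text": "Can additional fees apply to a mortgage loan?"}]
corpus = [{"_id": "d1", "title": "Mortgage fees", "text": "Beyond the total cost of credit, the bank may charge ..."}]

with open("queries.jsonl", "w") as f:
    f.writelines(json.dumps(q) + "\n" for q in queries)
with open("documents.jsonl", "w") as f:
    f.writelines(json.dumps(d) + "\n" for d in corpus)

# training.tsv / testing.tsv: one (query, document, relevance score) triple per line.
with open("training.tsv", "w") as f:
    f.write("query-id\tcorpus-id\tscore\n")
    f.write("q1\td1\t1\n")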

Building datasets for Search Tuning can be a time-consuming undertaking. To expedite the process and gain preliminary insights, consider starting with synthetic data generated directly from the documents uploaded to your Agent Builder Search application. This approach offers a convenient starting point, allowing you to explore the potential benefits of Search Tuning before investing significant resources in manual dataset creation.

To embark on your Search Tuning journey, follow this streamlined procedure:

  • Prepare your documents: Organize and refine the documents you’ve uploaded to your Agent Builder Search application.
  • Apply Document AI Layout Parser: Layout Parser extracts document content elements like text, tables, and lists, and creates context-aware chunks.

  • Leverage Gemini: Employ Gemini to generate question-answer pairs from the chunks, creating both positive and negative examples (see the sketch after this list).

  • Export data: Structure your data into the required formats, including queries.jsonl, documents.jsonl, training.tsv, and testing.tsv files.
  • Initiate Search tuning job: Kick-start the Search Tuning process, utilizing the prepared datasets.
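
For the Gemini step above, a minimal sketch of generating a question-answer pair from a single chunk with the Vertex AI SDK might look like the following; the prompt and parsing are deliberately simplified, and in practice you would batch over all chunks and also ask for hard negative examples.

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="genai-app-builder", location="europe-west4")
model = GenerativeModel("gemini-1.5-pro")  # model name may need a version suffix in your region

chunk = "Beyond the total cost of credit, the bank may charge fees for annexing the loan agreement ..."
prompt = (
    "Based only on the passage below, write one question a customer could ask "
    "and its answer, as two lines prefixed with 'Q:' and 'A:'.\n\n" + chunk
)

response = model.generate_content(prompt)
print(response.text)  # parse the Q:/A: lines into your queries and synthetic answers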

** I will be sharing a command-line tool to accelerate this process. Stay tuned.

When your Search Tuning job is done, you will have a chance to estimate the impact of tuning by checking the pre- and post-tuning nDCG@10 metric.

However, it is important to remember that while benchmarks like nDCG@10 offer valuable guidance, they don't capture the full picture of search performance. After all, every user has unique preferences and expectations. True success lies in delivering a search experience that resonates with your users and satisfies their information needs.

To gain a deeper understanding of search effectiveness from a user’s perspective, consider comparing generated summaries with and without the Search Adaptor. The Agent Builder SDK provides a convenient way to send queries programmatically, allowing you to specify whether to use the base model or the adapted one through the ‘customFineTuningSpec’ section:
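
Below is a minimal sketch of such a request using the google-cloud-discoveryengine Python client (the v1alpha surface, where the search-adaptor flag is exposed). The resource names are placeholders and the exact field spelling should be verified against your client library version:

from google.cloud import discoveryengine_v1alpha as discoveryengine

client = discoveryengine.SearchServiceClient()
serving_config = client.serving_config_path(
    project="genai-app-builder",      # placeholder identifiers
    location="global",
    data_store="my-datastore",
    serving_config="default_search",
)

request = discoveryengine.SearchRequest(
    serving_config=serving_config,
    query="Can additional fees apply to a mortgage loan?",
    page_size=10,
    content_search_spec=discoveryengine.SearchRequest.ContentSearchSpec(
        summary_spec=discoveryengine.SearchRequest.ContentSearchSpec.SummarySpec(
            summary_result_count=1,
            include_citations=False,
        ),
        extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
            max_extractive_answer_count=1,
        ),
    ),
    # Toggle between the base model (False) and the tuned adaptor (True):
    custom_fine_tuning_spec=discoveryengine.CustomFineTuningSpec(enable_search_adaptor=True),
)

pager = client.search(request)
search_response = next(iter(pager.pages))  # the first page carries the generated summary
print(search_response.summary.summary_text)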

This configuration instructs the Search app to generate an answer (summary) to our query by considering only the first item (summaryResultCount: 1) from the list of 10 relevant search results (pageSize: 10). We just want to generate the summary from a single, most relevant chunk, which we hope will be at the top of the list of search results. The summary is generated from so-called extractive answers that come with the search results. An extractive answer is verbatim text extracted from the original document to provide a contextually relevant answer to the query. We can ask for many extractive answers per result, but here we want to test whether the system finds the most relevant chunk in the documents. This is why we set maxExtractiveAnswerCount to 1 and expect exactly one extractive answer per search result.

We also want to disable citations (includeCitations: False). Citations are a very powerful mechanism to highlight the parts of the generated answer that are backed by our data, and you should definitely keep them enabled in production: they help your users identify the parts of the answer they can trust (these will include citations) and those that may be hallucinations (the system was not confident enough to assign any citation to such statements). Here we disable them because during evaluation we just want to compare plain-text summaries.

To gain deeper insights into the impact of Search-Adaptor, I've implemented a Python script that automates sending queries to my search app and materializes the search results into a BigQuery table. This table includes the original question, the synthetic answer generated by Gemini during question-answer pair creation, and, crucially, the search answers generated both with and without the Search-Adaptor applied.

This structured data allows for in-depth analysis and comparison to understand the nuanced ways in which the adaptor influences search results. By examining these outputs side-by-side, we can identify patterns, strengths, and areas for further improvement, ultimately driving towards a search experience that truly resonates with users.

However, comparing a multitude of search results manually can be a daunting task. To efficiently assess the impact of Search-Adaptor on the quality of generated answers, I turned to the Vertex AI Evaluation Service and its powerful AutoSxS mechanism. AutoSxS uses an Autorater, a language model fine-tuned to act as a judge that compares two responses and decides which is better. These can be responses generated by two different language models, but we will use it to compare responses generated by our search system with and without tuning.

We will run the evaluation from Vertex AI Colab:

Ensure the Evaluation library is installed:

!pip install -q google-cloud-aiplatform[evaluation]==1.63.0        

The next step involves setting up a prompt template for the Autorater. It will receive a question and an ideal (synthetic) answer. The Autorater's task is to analyze the responses generated with and without the Adaptor, determining which aligns better with the ideal answer for a given question. It can also declare a TIE if both responses are equally aligned or unaligned. We prioritize evaluating alignment with the ideal answer instead of solely focusing on the overall quality of the generated responses. This approach allows us to specifically assess the impact of the Information Retrieval stage. Our underlying assumption is that the Information Retrieval stage should ideally retrieve the exact document used to generate the corresponding synthetic answer. The Evaluation service offers a collection of pre-defined templates and metrics for quality assessment, which I strongly recommend exploring. Overall, I believe the Vertex AI Evaluation Service empowers you to make informed, data-driven choices when selecting from an expanding array of models available in registries like Vertex AI Model Garden or Hugging Face.

prompt_template = """# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models.
We will provide you with the user input and a pair of AI-generated responses (Response A and Response B).
User input will consist of two parts: a question (Question:) and an ideal response (Ideal Response:).
You should first read the user input carefully to analyze the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will first judge responses individually, following the Rating Rubric and Evaluation Steps.
Then you will give step-by-step explanations for your judgement, compare results to declare the winner based on the Rating Rubric and Evaluation Steps.


# Evaluation
## Metric Definition
You will be assessing alignment with ideal response, which measures the accuracy of facts and consistency of statements between AI generated response and ideal response.

## Criteria
Alignment: The response includes facts and statements that are consistent with ideal response. The response does not bring any information not available in ideal response.

## Rating Rubric
"A": Response A is more aligned with ideal response than Response B.
"SAME": Both response A and B are equally aligned, or unaligned.
"B": Response B is more aligned with ideal response than Response A.

## Evaluation Steps
STEP 1: Analyze Response A based on the alignment criteria: Identify any information in the response not present in the prompt and provide assessment according to the criterion.
STEP 2: Analyze Response B based on the alignment criteria: Identify any information in the response not present in the prompt and provide assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.


# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}

## AI-generated Responses

### Response A
{baseline_model_response}

### Response B
{response}
"""        

Before delegating the evaluation job to Vertex AI Evaluation Service, set up a Vertex AI session:

import vertexai

vertexai.init(
    project=PROJECT_ID,
    location=LOCATION
)        

Since the evaluation job expects a pandas DataFrame as input, we need to convert our generated search responses, currently stored in a BigQuery table, into this format. The BigQuery DataFrames (bigframes) library lets us do this seamlessly, translating DataFrame transformations directly into BigQuery SQL behind the scenes.

import bigframes.pandas as bf

bf.options.bigquery.location = "europe-west4"
bf.options.bigquery.project = "genai-app-builder"

df = bf.read_gbq("genai-app-builder.searchtuning.qzbYbPrPAR_search_results")

## we just need a few columns from the BigQuery table where we keep the search results
columns_to_select = ['question', 'search_answer_without_adapter', 'search_answer_with_adapter', 'synthetic_answer', 'rid']
eval_dataset=df.loc[:, columns_to_select] 
eval_dataset_as_pd = eval_dataset.to_pandas()
eval_dataset_as_pd.head()        

We’re not quite ready to proceed. The Autorater prompt necessitates both a question and an ideal response. To accommodate this, we’ll need to create a virtual column named prompt:

rapid_eval_dataset = eval_dataset_as_pd[eval_dataset_as_pd['synthetic_answer'].notnull()].copy()
rapid_eval_dataset["prompt"] = 'Question: \n' + rapid_eval_dataset['question'] + '\nIdeal Response: \n' + rapid_eval_dataset['synthetic_answer']
rapid_eval_dataset["baseline_model_response"] = rapid_eval_dataset['search_answer_without_adapter']
rapid_eval_dataset["response"] = rapid_eval_dataset['search_answer_with_adapter']
rapid_eval_dataset.head()

We're now prepared to transform the Autorater prompt template into a PairwiseMetric object. We'll assign the alias 'auto-sxs-4-search-tuning' to this metric for easy reference.

from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric
)

pairwise_alignment_metric = PairwiseMetric(
  metric="auto-sxs-4-search-tuning",
  metric_prompt_template=prompt_template,
)        

An EvalTask is used to instantiate the evaluation job. This task necessitates three key inputs: the evaluation dataset we've prepared, a collection of metrics (in this case, we'll utilize a single custom metric defined from our prompt template), and the name of the Vertex AI Experiment where we intend to capture and store the evaluation results:

pairwise_alignment_eval_task = EvalTask(
  dataset=rapid_eval_dataset,
  metrics=[pairwise_alignment_metric],
  experiment=EXPERIMENT_NAME,
)        

Our evaluation dataset contains over 150 questions. The evaluation service utilizes Gemini 1.5 Pro as the Autorater, and depending on your account settings there's a possibility of encountering quota limitations. To manage the number of requests executed concurrently by the Autorater, you can adjust the evaluation_service_qps parameter when initiating the evaluation job through the evaluate() function:

pairwise_text_quality_result = pairwise_alignment_eval_task.evaluate(
  evaluation_service_qps=1,
)        

You may also want to request a quota increase.

Evaluation results can be viewed within Colab:
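
Assuming the current vertexai.evaluation SDK, the object returned by evaluate() exposes both aggregate numbers and a per-question table:

print(pairwise_text_quality_result.summary_metrics)   # aggregate win rates and tie share
pairwise_text_quality_result.metrics_table.head()     # per-question choice and explanation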

However, I'd rather materialize the results in BigQuery:

results = pairwise_text_quality_result.metrics_table   # per-question results as a pandas DataFrame
results.to_gbq("genai-app-builder.searchtuning.qzbYbPrPAR_autosxs_results",
    if_exists='append'
)

This table provides a wealth of information for understanding the impact of Search-Adaptor. For each question, we get:

  • Autorater Choice: which answer, generated with (CANDIDATE) or without the adaptor (BASELINE), is deemed superior.
  • Explanation: a concise explanation justifying the choice, highlighting specific strengths or weaknesses in each answer.

QUESTION (translated from Polish): In the case of a mortgage loan, can additional fees arise beyond the costs included in the total cost of the loan? If so, which ones, and can their maximum amount be estimated?
BASELINE: Yes, additional fees may apply beyond the costs included in the total cost of a mortgage loan. The bank may charge fees for annexing the loan agreement in order to introduce a new benchmark index and an adjustment (correction spread). The bank may also charge fees for early repayment of the loan. However, in the case of early repayment, the bank may waive the commission or fee if it arises from the loan agreement. Unfortunately, the maximum amount of these additional fees is not specified in the provided sources.
CANDIDATE: In the case of a mortgage loan, additional fees may arise beyond the costs included in the total cost of the loan. However, since 22 July 2017, the borrower does not bear additional costs related to early partial or full repayment of a mortgage loan. In the case of the "Bezpieczny Kredyt 2%" program, early repayment results in the loss of subsidies, except in specific situations. The maximum amount of additional fees is not specified, because their occurrence depends on the individual terms of the loan agreement.
EXPLANATION: Response A mentions additional fees that may not be included in the total cost, like annexation fees, which are not mentioned in the ideal response. Response B correctly identifies that early repayment fees are no longer applicable as of July 22, 2017, but focuses heavily on early repayment, which is a specific aspect and not the main point of the ideal response. Both responses fail to perfectly align with the ideal response, but Response B provides more relevant information overall.

We can also check the aggregated summary metrics to learn that in the majority of cases the Autorater called a TIE, considering both answers to be equally good or bad. The Autorater considered the answers generated with the Adaptor to be better in 8% of cases.
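
If you prefer to compute this breakdown yourself from the materialized table, a simple aggregation does the job; the column name below follows the metric alias defined earlier, but check results.columns for the exact spelling in your SDK version.

choice_column = "auto-sxs-4-search-tuning/pairwise_choice"    # assumed naming pattern
print(results[choice_column].value_counts(normalize=True))    # share of CANDIDATE / BASELINE / TIE verdicts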

Summary:

For information retrieval systems, the pursuit of precision and relevance is an ongoing endeavor. While pre-trained embedding models provide a robust foundation, the ability to tailor their output to specific domains or user needs is crucial for achieving optimal search performance.

This article has illuminated a helpful tool in this pursuit: Search Adaptor. We’ve explored how Search-Adaptor, with its non-invasive approach, helps you to modify embeddings on the fly, eliminating the need for time-consuming reindexing and ensuring your system remains agile and responsive. The integration of Search-Adaptor into Vertex AI Search applications showcases its practical application in building private, customized search experiences tailored to enterprise requirements.

Remember, the key takeaway is this: The embedding space can be sculpted and refined. By employing the strategies and tools discussed here, you can tune precision and relevance in your information retrieval system, ultimately delivering a better search experience to your users. Whether you’re working with specialized domains, multilingual data, or seeking to optimize performance within API constraints, the path to enhanced search lies in understanding and mastering the art of embedding customization.

This article is authored by Lukasz Olejniczak, Customer Engineer at Google Cloud. The views expressed are those of the author and don't necessarily reflect those of Google.

Please clap for this article if you enjoyed reading it. For more about Google Cloud, data science, data engineering, and AI/ML, follow me on LinkedIn.
