Accelerate Search Testing with AI
LLMs Are Surprisingly Good at User Testing
I have a story for you about a fascinating new way to use large language models in your search design.
If you’re already an AI fan like me, you might think this is going to be about semantic search that uses a vector database to index content… or retrieval augmented generation (RAG) that pulls your own custom data into the prompt of an LLM for question answering.
I’m not talking about any of that. That’s all very well documented already. I want to tell you something new. I hope this story gives you a little dopamine hit of new insight like it did for me…
The Complexity of Federated Search
A friend of mine leads product and digital experience for a major US hospital system, and he was asking me about a large redesign they’re doing for their website search.
This is a very sophisticated search function, not only because of the large volume of information it must index, but because it is “federated,” meaning it has results coming from several distinct data sources. A user might ask a query about “skin cancer” and you might want to return:
Within any ONE of these separate sources, you can use a number of established ranking methods to match a query; scoring attributes like relevance, recency, or popularity.
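To make that concrete, within a single source the ranking is often just a weighted blend of those attribute scores. Here is a minimal sketch; the weights and the two-year recency window are arbitrary choices for illustration, not a specific product's formula:

from datetime import datetime, timezone

def score_result(relevance, published, popularity,
                 w_rel=0.7, w_rec=0.2, w_pop=0.1):
    # relevance and popularity are assumed to already be normalized to [0, 1];
    # recency decays linearly to zero over roughly two years.
    age_days = (datetime.now(timezone.utc) - published).days
    recency = max(0.0, 1.0 - age_days / 730.0)
    return w_rel * relevance + w_rec * recency + w_pop * popularity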
Existing Methods to Design Federated Search
If you’re trying to find the best way to mix all these different scoring methods… from multiple data sources… with different kinds of metadata and indexes… this gets tricky! In many ways, it’s more of a design problem than an engineering problem.
Luckily, it’s very much an empirical question, meaning we can test and refine. Say hypothetically your initial search design is the following:
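As a purely illustrative sketch (the source names, boosts, and result caps below are placeholders, not the hospital system's actual plan), a federated design like the one described next might be expressed as a simple configuration:

# Hypothetical federated search plan: per-source boosts and per-source result caps.
# All names and numbers here are placeholders for illustration only.
search_design_v1 = {
    "knowledge_base_articles": {"boost": 3.0, "max_results": 6},
    "provider_directory":      {"boost": 1.0, "max_results": 2},
    "blog_posts":              {"boost": 0.8, "max_results": 2},
    "events_and_classes":      {"boost": 0.5, "max_results": 1},
}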
This plan assumes the knowledge base articles are the most desirable and puts little emphasis on the other sources. With so many parameters and rules to tune, the design space is clearly very large and flexible.
Traditionally, you would use UX best practices to set this up for launch, and then collect data on how it performs. This involves deploying analytics on the search results such as:
Once the search is deployed and instrumented, it’s a matter of waiting for enough search traffic to analyze and iterate on possible improvements. It’s also a good idea to A/B test multiple versions of a search design to compare performance and iterate faster.
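Here is a minimal sketch of what that instrumentation could look like, assuming a simple hash-based A/B split and a stand-in log destination (your real pipeline would write to an analytics event stream or warehouse):

import hashlib
import json
import time

def assign_variant(user_id, variants=("design_a", "design_b")):
    # Deterministic bucketing so the same user always sees the same search design.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % len(variants)
    return variants[bucket]

def log_search_event(user_id, query, clicked_rank):
    # clicked_rank of None means the user abandoned the search results.
    event = {
        "ts": time.time(),
        "user_id": user_id,
        "variant": assign_variant(user_id),
        "query": query,
        "clicked_rank": clicked_rank,
    }
    print(json.dumps(event))   # stand-in for your analytics pipeline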
Human Pre-Testing
With sufficient time and budget, you can also go for the gold standard: qualitative user testing to refine your search design BEFORE launch.
Here, users ask a battery of test questions and provide a subjective judgment on the value or correctness of the given search results.
This allows you to do substantially more design refinement to optimize your launch.
NEW: AI Pre-Testing
This is where it gets really interesting… because now you can do that qualitative testing programmatically with a large language model.
This is the punchline and major finding of my story: an AI is now sufficiently capable of judging the quality of search results for essentially any topic. It can do this with all the comparative advantages that machine automation has over humans: cheaper, faster, more reliable, more repeatable, and less mind-numbing.
AI Pre-Testing Methodology
How might you approach this practically?
def build_llm_prompt(query, results):
    # Builds the judging prompt for one test query and its search results.
    # {your_website_here} and {about_your_company} are placeholders for your
    # own site name and "about us" text.
    return f'''
Act as a user who is looking for information
on {your_website_here}.
Also consider the company's "about us"
information: {about_your_company}
You have just asked the following query in
the website search:
{query}
And you've received the following search results:
{results}
Your task: For each of the results and their
summary snippets, rate the quality and
responsiveness from one (worst) to five (best)
and give a concise explanation for your answer.
Also include any suggestions for improvement
of the search design.
Provide your results in JSON format with
data keys for:
["test_query", "result", "rating", "explanation",
"improvement_suggestions"]
'''
You’ll want to do this all programmatically via an API, hitting GPT, Llama, or whatever LLM you are using. Feel free to play around with the prompt until you get it just the way you want.
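For example, here's a minimal sketch using the OpenAI Python client; the model name is just an example, and test_query and search_results are assumed to be parallel lists making up your test battery:

import json
from openai import OpenAI   # or swap in whichever LLM client you actually use

client = OpenAI()           # assumes OPENAI_API_KEY is set in your environment

graded = []
for i in range(len(test_query)):
    prompt = build_llm_prompt(test_query[i], search_results[i])
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # example model; use your own
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
    )
    graded.append(json.loads(response.choices[0].message.content))

Each entry in graded is then one of the JSON rating objects described in the prompt.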
Conclusion
Now you have an AI agent that programmatically grades your search design!
You can sum up the ratings across your different design plans, see which ones score better, and refine them to get something much closer to optimal before you launch.
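As a rough sketch, assuming you've collected the graded JSON objects above into a dict keyed by design variant (called ratings_by_design here, a name I'm making up), the comparison can be as simple as averaging:

def average_rating(graded_items):
    # graded_items: parsed JSON objects from the LLM judge, each with a "rating" key
    return sum(item["rating"] for item in graded_items) / len(graded_items)

# ratings_by_design: {"design_a": [...], "design_b": [...]} -- hypothetical structure
scores = {name: average_rating(items) for name, items in ratings_by_design.items()}
best_design = max(scores, key=scores.get)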
I think that’s really exciting. You can run this over and over, every time you deploy a new change to the search algorithm.
This is just one example of the many ways you might automate all kinds of user testing and evaluation if you lean in and think creatively about it.
Thanks for reading, and Happy Innovating!
***
Dave Costenaro is Chief Data Officer at CSG Solutions | Helping businesses thrive with data-driven strategies | Contact Us