Accelerate Search Testing with AI
Image created with OpenAI's Dall-E-3


LLMs Are Surprisingly Good at User Testing

I have a story for you about a fascinating new way to use large language models in your search design.

If you’re already an AI fan like me, you might think this is going to be about semantic search that uses a vector database to index content… or retrieval augmented generation (RAG) that pulls your own custom data into the prompt of an LLM for question answering.

I’m not talking about any of that. That’s all very well documented already. I want to tell you something new. I hope this story gives you a little dopamine hit of new insight like it did for me…

The Complexity of Federated Search

A friend of mine leads product and digital experience for a major US hospital system, and he was asking me about a large redesign they’re doing for their website search.

This is a very sophisticated search function, not only because of the large volume of information it must index, but because it is “federated,” meaning it has results coming from several distinct data sources. A user might ask a query about “skin cancer” and you might want to return:

  • Articles from a knowledge base on general symptoms and treatments,
  • Top hits from a location search of healthcare providers,
  • Relevant insurance providers from a partners database, and
  • Results from a database of pharmaceuticals.

Within any ONE of these separate sources, you can use a number of established ranking methods to match a query, scoring attributes like relevance, recency, or popularity.
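To make that concrete, here is a minimal Python sketch of blending such signals with a weighted sum. The field names, the weights, and the assumption that each signal is pre-normalized to [0, 1] are placeholders of mine, not a prescription:

# A sketch of a weighted-sum ranker for one data source.
# Field names and weights are hypothetical; each signal is
# assumed to be pre-normalized to the range [0, 1].

def score_document(doc, weights):
    return (weights["relevance"] * doc["relevance"]
            + weights["recency"] * doc["recency"]
            + weights["popularity"] * doc["popularity"])

docs = [
    {"id": "kb-101", "relevance": 0.92, "recency": 0.40, "popularity": 0.75},
    {"id": "kb-205", "relevance": 0.85, "recency": 0.90, "popularity": 0.20},
]
weights = {"relevance": 0.6, "recency": 0.1, "popularity": 0.3}
ranked = sorted(docs, key=lambda d: score_document(d, weights), reverse=True)

In practice the relevance signal usually comes straight from the search engine's own scorer; the point is simply that the blend is tunable.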

Existing Methods to Design Federated Search

If you’re trying to find the best way to mix all these different scoring methods…from multiple data sources…with different kinds of metadata and indexes…this gets tricky! In many ways, it’s more of a design problem than an engineering problem.

Luckily, it’s very much an empirical question, meaning we can test and refine. Say hypothetically your initial search design is the following:

  • Return top 10 knowledge base articles by relevance score
  • Re-rank these by popularity score (number of views in the last 30 days)
  • Insert a callout box called “Other Resources” between article results 3 and 4, containing the #1 result by relevance score from each of the three other data sources, shown in random order.

This plan assumes the knowledge base articles are the most desirable, and it places much less emphasis on the other sources. The design space, with all its parameters and rules to tune, is clearly very large and flexible.
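Expressed as code, that hypothetical design might look like the following sketch. Every name here is a placeholder: each source is assumed to be a callable that returns result dicts already ordered by relevance, and "views_30d" stands in for a popularity field.

import random

def assemble_results(query, knowledge_base, providers, insurers, pharma):
    # Each source argument is assumed to be a callable:
    # source(query) -> list of result dicts, ordered by relevance.
    articles = knowledge_base(query)[:10]                      # rule 1: top 10 by relevance
    articles.sort(key=lambda a: a["views_30d"], reverse=True)  # rule 2: re-rank by popularity
    callout = [src(query)[0] for src in (providers, insurers, pharma)]
    random.shuffle(callout)                                    # rule 3: shuffle the #1 hits
    return (articles[:3]
            + [{"callout": "Other Resources", "items": callout}]
            + articles[3:])

Every slice, weight, and position in there is a design parameter you could tune.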

Traditionally, you would use UX best practices to set this up for launch, and then collect data on how it performs. This involves deploying analytics on the search results such as:

  • Click-through rate for each search result
  • Time spent reading the page after a click-through (More time equates to more value, quality, and goodness for that query+result pair)
  • And finally, you would measure any conversion you’re targeting downstream of a click-through. (For my friend’s use case, that meant converting queries and clicks into appointments scheduled with their healthcare providers.)

Once the search is deployed and instrumented, it’s a matter of waiting for enough search traffic to analyze and iterate on possible improvements. It’s also a good idea to A/B test multiple versions of a search design to compare performance and iterate faster.
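If it helps to see the bookkeeping, here is a rough sketch of rolling those three metrics up per result from a click-event log. The event schema is an assumption of mine, not a standard:

from collections import defaultdict

def summarize(events):
    # Each event is assumed to look like:
    # {"result_id": ..., "clicked": bool, "dwell_seconds": float, "converted": bool}
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0,
                                 "dwell": 0.0, "conversions": 0})
    for e in events:
        s = stats[e["result_id"]]
        s["impressions"] += 1
        if e["clicked"]:
            s["clicks"] += 1
            s["dwell"] += e["dwell_seconds"]
            s["conversions"] += int(e["converted"])
    for s in stats.values():
        s["ctr"] = s["clicks"] / s["impressions"]
        s["avg_dwell"] = s["dwell"] / s["clicks"] if s["clicks"] else 0.0
        s["conversion_rate"] = s["conversions"] / s["impressions"]
    return dict(stats)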

Human Pre-Testing

With sufficient time and budget, you can also pursue the gold standard: qualitative testing with real human users to refine your search design BEFORE launch.

Here, users ask a battery of test questions and provide a subjective judgment on the value or correctness of the given search results.

This allows you to do substantially more design refinement to optimize your launch.

NEW: AI Pre-Testing

This is where it gets really interesting…because now, you can do that qualitative testing programmatically with a large language model.

This is the punchline and major finding of my story: an AI is now sufficiently capable of judging the quality of search results for essentially any topic. It can do this with all the comparative advantages that machine automation has over humans: cheaper, faster, more reliable, more repeatable, and less mind-numbing.

AI Pre-Testing Methodology

How might you approach this practically?

  1. First, assemble all of the different search design parameters, rules, and plans you want to test against each other.
  2. Second, compose a battery of test queries. (You can pull these from your recent users’ search history, or you can even ask an LLM to give you N test queries for your website, given its context and purpose.) Note that searches tend to have a Pareto-like distribution, with a small number of very frequent queries and a very long tail of infrequent and unique queries. You’ll want to test a representative sample that covers all the important aspects of your website and user journey. Depending on your use case, this might be a test set in the low hundreds of queries for a small-to-medium-sized website, up to a few thousand queries for a highly trafficked website.
  3. Third, run each test query through each search design in order to obtain the search results.
  4. And fourth, evaluate the quality of each result with an LLM by looping over them with a prompt template like the following:

# Evaluation prompt for query i and its returned results.
llm_prompt_string = f'''
Act as a user who is looking for information
on {your_website_here}.

Also consider the company's "about us"
information: {about_your_company}

You have just asked the following query in
the website search:
{test_queries[i]}

And you've received the following search results:
{search_results[i]}

Your task: For each of the results and their
summary snippets, rate the quality and
responsiveness from one (worst) to five (best)
and give a concise explanation for your answer.
Also include any suggestions for improvement
of the search design.

Provide your results in JSON format with
data keys for:
["test_query", "result", "rating", "explanation",
"improvement_suggestions"]
'''

You’ll want to do all of this programmatically via an API, hitting GPT, Llama, or whatever LLM you are using; a minimal sketch is below. Feel free to play around with the prompt until you get it just the way you want.
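For instance, here is a rough sketch using OpenAI's Python client. The helper build_prompt is assumed to wrap the template above into a function of a query and its results, and the model name is just one option:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def rate_results(prompt):
    # Send one evaluation prompt and return the model's (ideally JSON) reply.
    response = client.chat.completions.create(
        model="gpt-4o",  # or whichever model you're using
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# build_prompt is a hypothetical helper wrapping the template above.
ratings = [rate_results(build_prompt(q, r))
           for q, r in zip(test_queries, search_results)]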

Conclusion

Now you have an AI agent that programmatically grades your search design!

You can sum up the ratings across your different design plans, see which ones score better, and refine them to get something much closer to optimal before you launch.
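The aggregation step might look something like this sketch, assuming each reply parses as the JSON list of objects the prompt requested, and that replies_by_design is a dict mapping each design name to its raw LLM replies:

import json

def design_score(replies):
    # Average the 1-5 ratings for one search design, assuming each
    # reply parses as the JSON list of objects the prompt asked for.
    ratings = [item["rating"] for reply in replies
               for item in json.loads(reply)]
    return sum(ratings) / len(ratings)

# replies_by_design: hypothetical dict of design name -> list of raw replies
scores = {name: design_score(replies)
          for name, replies in replies_by_design.items()}
best_design = max(scores, key=scores.get)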

I think that’s really exciting. You can run this over and over, every time you deploy a new change to the search algorithm.

This is just one example. If you lean in and think creatively, there are many other kinds of user testing and evaluation you might automate the same way.

Thanks for reading, and Happy Innovating!

***

Dave Costenaro is Chief Data Officer at CSG Solutions | Helping businesses thrive with data-driven strategies
