Accelerate Search Testing with AI
Image created with OpenAI's Dall-E-3


LLMs Are Surprisingly Good at User Testing

I have a story for you about a fascinating new way to use large language models in your search design.

If you’re already an AI fan like me, you might think this is going to be about semantic search that uses a vector database to index content… or retrieval augmented generation (RAG) that pulls your own custom data into the prompt of an LLM for question answering.

I’m not talking about any of that. That’s all very well documented already. I want to tell you something new. I hope this story gives you a little dopamine hit of new insight like it did for me…

The Complexity of Federated Search

A friend of mine leads product and digital experience for a major US hospital system, and he was asking me about a large redesign they’re doing for their website search.

This is a very sophisticated search function, not only because of the large volume of information it must index, but because it is “federated,” meaning it has results coming from several distinct data sources. A user might ask a query about “skin cancer” and you might want to return:

  • Articles from a knowledge base on general symptoms and treatments,
  • Top hits from a location search of healthcare providers,
  • Relevant insurance providers from a partners database, and
  • Results from a database of pharmaceuticals.

Within any ONE of these separate sources, you can use a number of established ranking methods to match a query, scoring attributes like relevance, recency, or popularity.
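To make that concrete, here is a minimal Python sketch of blending such signals with a weighted sum. The field names, the weights, and the assumption that each signal is pre-normalized to [0, 1] are placeholders of mine, not a prescription:

# A sketch of a weighted-sum ranker for one data source.
# Field names and weights are hypothetical; each signal is
# assumed to be pre-normalized to the range [0, 1].

def score_document(doc, weights):
    return (weights["relevance"] * doc["relevance"]
            + weights["recency"] * doc["recency"]
            + weights["popularity"] * doc["popularity"])

docs = [
    {"id": "kb-101", "relevance": 0.92, "recency": 0.40, "popularity": 0.75},
    {"id": "kb-205", "relevance": 0.85, "recency": 0.90, "popularity": 0.20},
]
weights = {"relevance": 0.6, "recency": 0.1, "popularity": 0.3}
ranked = sorted(docs, key=lambda d: score_document(d, weights), reverse=True)

In practice the relevance signal usually comes straight from the search engine's own scorer; the point is simply that the blend is tunable.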

Existing Methods to Design Federated Search

If you’re trying to find the best way to mix all these different scoring methods…from multiple data sources…with different kinds of metadata and indexes…this gets tricky! In many ways, it’s more of a design problem than an engineering problem.

Luckily, it’s very much an empirical question, meaning we can test and refine. Say hypothetically your initial search design is the following:

  • Return top 10 knowledge base articles by relevance score
  • Re-rank these by popularity score (number of views in the last 30 days)
  • Insert a callout box called “Other Resources” between article results 3 and 4, containing the #1 result by relevance score from each of the three other data sources, shown in random order.

This plan assumes the knowledge base articles are the most desirable, and it places much less emphasis on the other sources. The design space, with all its parameters and rules to tune, is clearly very large and flexible.
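Expressed as code, that hypothetical design might look like the following sketch. Every name here is a placeholder: each source is assumed to be a callable that returns result dicts already ordered by relevance, and "views_30d" stands in for a popularity field.

import random

def assemble_results(query, knowledge_base, providers, insurers, pharma):
    # Each source argument is assumed to be a callable:
    # source(query) -> list of result dicts, ordered by relevance.
    articles = knowledge_base(query)[:10]                      # rule 1: top 10 by relevance
    articles.sort(key=lambda a: a["views_30d"], reverse=True)  # rule 2: re-rank by popularity
    callout = [src(query)[0] for src in (providers, insurers, pharma)]
    random.shuffle(callout)                                    # rule 3: shuffle the #1 hits
    return (articles[:3]
            + [{"callout": "Other Resources", "items": callout}]
            + articles[3:])

Every slice, weight, and position in there is a design parameter you could tune.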

Traditionally, you would use UX best practices to set this up for launch, and then collect data on how it performs. This involves deploying analytics on the search results such as:

  • Click-through rate for each search result
  • Time spent reading the page after a click-through (More time equates to more value, quality, and goodness for that query+result pair)
  • And finally, you would measure any conversion you’re targeting downstream of a click-through. (For my friend’s use case, that meant converting queries and clicks into appointments scheduled with their healthcare providers.)

Once the search is deployed and instrumented, it’s a matter of waiting for enough search traffic to analyze and iterate on possible improvements. It’s also a good idea to A/B test multiple versions of a search design to compare performance and iterate faster.
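If it helps to see the bookkeeping, here is a rough sketch of rolling those three metrics up per result from a click-event log. The event schema is an assumption of mine, not a standard:

from collections import defaultdict

def summarize(events):
    # Each event is assumed to look like:
    # {"result_id": ..., "clicked": bool, "dwell_seconds": float, "converted": bool}
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0,
                                 "dwell": 0.0, "conversions": 0})
    for e in events:
        s = stats[e["result_id"]]
        s["impressions"] += 1
        if e["clicked"]:
            s["clicks"] += 1
            s["dwell"] += e["dwell_seconds"]
            s["conversions"] += int(e["converted"])
    for s in stats.values():
        s["ctr"] = s["clicks"] / s["impressions"]
        s["avg_dwell"] = s["dwell"] / s["clicks"] if s["clicks"] else 0.0
        s["conversion_rate"] = s["conversions"] / s["impressions"]
    return dict(stats)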

Human Pre-Testing

With sufficient time and budget, you can also pursue the gold standard: qualitative testing with real human users to refine your search design BEFORE launch.

Here, users ask a battery of test questions and provide a subjective judgment on the value or correctness of the given search results.

This allows you to do substantially more design refinement to optimize your launch.

NEW: AI Pre-Testing

This is where it gets really interesting…because now, you can do that qualitative testing programmatically with a large language model.

This is the punchline and major finding of my story: an AI is now sufficiently capable of judging the quality of search results for essentially any topic. It can do this with all the comparative advantages that machine automation has over humans: cheaper, faster, more reliable, more repeatable, and less mind-numbing.

AI Pre-Testing Methodology

How might you approach this practically?

  1. First, assemble all of the different search design parameters, rules, and plans you want to test against each other.
  2. Second, compose a battery of test queries. (You can pull these from your recent users’ search history, or you can even ask an LLM to give you N test queries for your website, given its context and purpose.) Note that searches tend to have a Pareto-like distribution, with a small number of very frequent queries and a very long tail of infrequent and unique queries. You’ll want to test a representative sample that covers all the important aspects of your website and user journey. Depending on your use case, this might be a test set in the low hundreds of queries for a small-to-medium-sized website, up to a few thousand queries for a highly trafficked website.
  3. Third, run each test query through each search design in order to obtain the search results.
  4. And fourth, evaluate the quality of each result with an LLM by looping over them with a prompt template like the following:

# Evaluation prompt for query i and its returned results.
llm_prompt_string = f'''
Act as a user who is looking for information
on {your_website_here}.

Also consider the company's "about us"
information: {about_your_company}

You have just asked the following query in
the website search:
{test_queries[i]}

And you've received the following search results:
{search_results[i]}

Your task: For each of the results and their
summary snippets, rate the quality and
responsiveness from one (worst) to five (best)
and give a concise explanation for your answer.
Also include any suggestions for improvement
of the search design.

Provide your results in JSON format with
data keys for:
["test_query", "result", "rating", "explanation",
"improvement_suggestions"]
'''

You’ll want to do all of this programmatically via an API, hitting GPT, Llama, or whatever LLM you are using; a minimal sketch is below. Feel free to play around with the prompt until you get it just the way you want.
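For instance, here is a rough sketch using OpenAI's Python client. The helper build_prompt is assumed to wrap the template above into a function of a query and its results, and the model name is just one option:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def rate_results(prompt):
    # Send one evaluation prompt and return the model's (ideally JSON) reply.
    response = client.chat.completions.create(
        model="gpt-4o",  # or whichever model you're using
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# build_prompt is a hypothetical helper wrapping the template above.
ratings = [rate_results(build_prompt(q, r))
           for q, r in zip(test_queries, search_results)]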

Conclusion

Now you have an AI agent that programmatically grades your search design!

You can sum up the ratings across your different design plans, see which ones score better, and refine them to get something much closer to optimal before you launch.
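The aggregation step might look something like this sketch, assuming each reply parses as the JSON list of objects the prompt requested, and that replies_by_design is a dict mapping each design name to its raw LLM replies:

import json

def design_score(replies):
    # Average the 1-5 ratings for one search design, assuming each
    # reply parses as the JSON list of objects the prompt asked for.
    ratings = [item["rating"] for reply in replies
               for item in json.loads(reply)]
    return sum(ratings) / len(ratings)

# replies_by_design: hypothetical dict of design name -> list of raw replies
scores = {name: design_score(replies)
          for name, replies in replies_by_design.items()}
best_design = max(scores, key=scores.get)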

I think that’s really exciting. You can run this over and over, every time you deploy a new change to the search algorithm.

This is just one example. If you lean in and think creatively, there are many other kinds of user testing and evaluation you might automate the same way.

Thanks for reading, and Happy Innovating!

***

Dave Costenaro is Chief Data Officer at CSG Solutions | Helping businesses thrive with data-driven strategies
