Query sampling for relevancy testing
Best practice question: When testing for relevancy, is it best to test against the most frequently asked queries, or to choose a subset of queries that may not include those that are most frequently asked?
(Thanks Mike H. for the question on the Slack relevancy group this morning. Right as the coffee was kicking in, too!)
There's an important inflection point in anyone's search relevancy journey. I think it's the moment when you look at a graph that shows the frequency distribution of all users' queries. (If you work on search and don't have one of those, go ship one; I'll wait…)
What most sites will show is that search query frequency follows an exponential distribution. There are a relatively small handful of queries that are super common—the "head" of the distribution. That tapers off into a "long tail" of less-frequent queries, with some amount of "shoulder" or "torso" in the middle, depending on where you'd like to squint and draw the lines.
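If you don't have that graph yet, it doesn't take much to sketch one. Here's a minimal sketch, assuming your query logs can be dumped as one raw query string per line; the file name is a placeholder and your log format will certainly differ.

```python
# Minimal sketch: inspect the query frequency distribution from a log dump.
# Assumes one raw query string per line in "query_log.txt" (placeholder path).
from collections import Counter

with open("query_log.txt") as f:
    queries = [line.strip().lower() for line in f if line.strip()]

counts = Counter(queries)
ranked = counts.most_common()  # [(query, frequency), ...] sorted by frequency

total = sum(counts.values())
print(f"{len(ranked)} distinct queries across {total} searches")
for query, freq in ranked[:10]:
    print(f"{freq:>8}  {query}")  # the "head"
print("...")
singletons = sum(1 for _, freq in ranked if freq == 1)
print(f"{singletons} queries were seen exactly once")  # a rough look at the tail
```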
If you're new to these kinds of terms and want more statistics jargon, Wikipedia can help scratch that itch for you with its article on exponential distribution. I find the article on long tail to be closer to the intersection of statistics and business that search relevancy tends to explore.
That process of squinting-and-drawing-lines quickly leads to an important revelation: we can't possibly test every one of these queries. Experienced product managers and relevancy engineers know that this is one of the intrinsic challenges of search. When the user can express their intent in a free-form way, the system has to parse and interpret that intent from a practically infinite domain of possibility.
And it's not just that there are a lot of different queries. In a long-tail distribution, some of these queries your system has never seen before, and will never see again. How do you even test for those?
So what's the best practice here? Test against the most frequent queries? Or choose a broader subset?
The answer, this time, is simple: Yes.
You have to nail the common queries
As a group, head queries are the stuff that end users, universally, really want.
Whatever the business goals for search, there is by definition some budget behind it, and that budget can definitely afford to focus some human attention on these queries. That may cover the single most common query, or the top ten, or the top thousand.
Perhaps a bit counter to conventional wisdom, I think head queries may show product-level problems that are bigger than search. Any product manager with access to query logs for a search service should absolutely be digging through that data for insights into what to work on next.
It may be that the number one query represents a need that can be promoted out of the search workflow entirely! Perhaps some prominent real-estate somewhere else on the site will solve what brought the user to the search bar. The correct treatment may be to prevent the query from happening in the first place!
As one embarks on relevancy, this kind of observation is an important one to keep front of mind: we are not optimizing search results, we are optimizing the end-user experience. And that is a holistic product concern.
There are other, similar, treatments for these kinds of queries. One is called, in the biz, a "navigational" query. If an end-user searches your ecommerce catalog for "help" or "orders", it's not hard to deduce that they wanted an easy way to skip straight to some other location in the site, and found the search bar a convenient place to express that need.
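A sketch of that kind of special-case treatment, with made-up query strings and destination paths, might look like this:

```python
# Sketch: short-circuit obvious navigational queries before they ever hit the
# search engine. The query strings and destination URLs are illustrative only.
NAVIGATIONAL_REDIRECTS = {
    "help": "/support",
    "orders": "/account/orders",
    "returns": "/account/returns",
}

def handle_query(raw_query: str) -> dict:
    normalized = raw_query.strip().lower()
    if normalized in NAVIGATIONAL_REDIRECTS:
        return {"action": "redirect", "target": NAVIGATIONAL_REDIRECTS[normalized]}
    return {"action": "search", "query": normalized}
```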
We search relevancy engineers like to build a model that fits all possible inputs. But for your head queries, don't be afraid to get a lot more specific about how something is handled!
Quantifying the head
There may be a stats-nerd approach to this. (What's your go-to? Let me know!)
To recap, I think of the head as "everything that the business can devote manual human attention toward." That attention might range from a single engineer tasked with all-things-search to a whole team of domain experts maintaining bespoke relevancy labels in their particular vertical.
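One quick way to put a number on it is to ask how much traffic the top-N queries cover, where N is whatever your team can realistically curate by hand. A rough sketch, assuming you already have a list of (query, frequency) pairs sorted by frequency (like the `ranked` list from the earlier sketch):

```python
# Sketch: given (query, frequency) pairs sorted by descending frequency, report
# how much search traffic the top-N queries cover.
def head_coverage(ranked_counts, n):
    total = sum(freq for _, freq in ranked_counts)
    head = sum(freq for _, freq in ranked_counts[:n])
    return head / total

# e.g. print(f"top 100 queries cover {head_coverage(ranked, 100):.1%} of traffic")
```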
What helps here is having people create judgment lists that can be used for offline testing — or, better, for continuous integration on every tweak to the model, or for periodic checks against new and changing data. Once you know what the right answer looks like, these practices protect you against what may go wrong.
I'll touch on the structure of a judgment list again a little bit later. But as we start to get prescriptive here, I think it's worth observing that the original question may be less about where to get queries from, and more about how to scale the testing of them.
The torso
The torso and tail queries start to give us this sense of scale.
In contrast to the head, I think of the "torso" as anything that doesn't quite merit that level of human attention. But it's still a large amount of traffic, and valuable, and these queries need quality results!
This is what I tend to think of as the real substance of a search system. We have left the domain of the well-understood special-snowflake queries, and it is our task to build a model of the corpus and the query from first principles. It's about identifying the intent and features of a query, populating an index designed to answer those queries, and iterating on the structure of the search request to bridge the gap and make a match.
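To make that concrete, "iterating on the structure of the search request" often looks like tuning a query body along these lines. This is a generic Elasticsearch-style example; the field names, boosts, and filter are purely illustrative and your schema will differ.

```python
# Illustrative only: an Elasticsearch-style query body where the field names,
# boosts, and filters are the knobs you iterate on as you tune relevancy.
def build_search_request(user_query: str) -> dict:
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": user_query,
                        "fields": ["title^3", "brand^2", "description"],
                    }
                },
                "filter": [{"term": {"in_stock": True}}],
            }
        },
        "size": 20,
    }
```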
Framed this way, I hope it's very clear that the queries we sample and test against must come from well beyond the head! Indeed, I think sampling only from the head while developers build the index and search implementation will over-fit relevancy to those common cases that can be manually tuned anyway.
So, yes: sample from torso queries. But if the question is "how do I scalably test relevancy on torso and tail queries," then, friend, read on. (And let me welcome you to the domain of search relevancy.)
The tail
The torso definition was broad, but I have one more pragmatic line for the tail: queries your system has never seen before, and may never see again.
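If you want to put a number on how big that bucket is for your own system, one rough approach is to compare today's queries against a trailing window of history. A sketch, with placeholder file paths:

```python
# Sketch: what fraction of today's queries never appeared in the last N days?
# "history.txt" and "today.txt" are placeholder log files, one query per line.
def load_queries(path):
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

history = set(load_queries("history.txt"))
today = load_queries("today.txt")

unseen = sum(1 for q in today if q not in history)
print(f"{unseen / len(today):.1%} of today's queries were never seen before")
```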
This diversity of expressed need is an intrinsic challenge of search. When end-users are given a free-form way to express their needs, you are exploring a huge domain of possible expressions and parsing that into something a computer will understand.
It's tempting to write this group off - why bother? We'll never see that query again. Or they'll rephrase it if they don't get hits. Or we can just sample something random. And yes, it's really hard to hit this much of a moving target.
As a group, the long tail may still be quite large. When Google published their blog post announcing the use of BERT in Search, they shared that around 15% of the queries Google sees each day had never been seen before.
That's a huge amount of traffic and cannot be ignored. Of course, Google's solution involved a lot of PhD research and built on the Transformer architecture, which completely incidentally kicked off this tiny little no-big-deal revolution that we today call "Large Language Models" and "AI." No big deal, really; I'm not sure why I even mention it.
Okay, so I can—and definitely will—get into some nitty gritty about how you, the pragmatic search relevancy team with questions, can make use of this tech. Indeed, it is applicable to improving the relevancy of the long tail.
For teams on their search journey, it's sufficient to say: yes, sample a handful of queries from the tail, put them into the test suite alongside your head and torso queries, and just keep iterating.
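A crude way to assemble that mixed sample might look like the sketch below; the frequency cutoffs are arbitrary and should really be set by eyeballing your own distribution (the probability-proportional-to-size post linked at the end is the more principled approach).

```python
# Sketch: a crude stratified sample of test queries. The cutoffs (>=1000 for
# head, >=10 for torso) are arbitrary; set them from your own distribution.
import random

def sample_test_queries(ranked_counts, n_head=20, n_torso=50, n_tail=30, seed=42):
    rng = random.Random(seed)
    head = [q for q, freq in ranked_counts if freq >= 1000]
    torso = [q for q, freq in ranked_counts if 10 <= freq < 1000]
    tail = [q for q, freq in ranked_counts if freq < 10]
    return (
        head[:n_head]                                  # take the head verbatim
        + rng.sample(torso, min(n_torso, len(torso)))  # spread across the torso
        + rng.sample(tail, min(n_tail, len(tail)))     # and a dash of tail
    )
```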
How you test informs where you sample
This question, at its core, comes down to a question about testing methods under limited time, budget, and tooling. If it were trivially easy to test the results of every search query, then this wouldn't be a question, would it?
So let's give a brief tour of testing methods.
Direct human attention lends itself to a technique called judgment lists. Given an index, and the shape of a search query, you run a user query and gather results. You then label each of these results with a score of how relevant that result is for the query. You can do a bit of math on these — jargon alert: what is NDCG and why should I care? — and voila! you have a score to optimize. Tweak your index, and your query, and make the charts go up and to the right.
You can do this in a spreadsheet! Or a notebook! Don't get too hung up on the structure, just get started and fit what you can into the time and resources that you have.
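Here's a notebook-sized sketch of that idea: a toy judgment list and a plain NDCG@k, just to show there's no magic in the math. The doc IDs, grades, and 0-2 scale are made up for illustration; use whatever grading scale your team agrees on.

```python
# Sketch: a toy judgment list and NDCG@k, small enough for a notebook.
# Grades: 0 = irrelevant, 1 = okay, 2 = great (illustrative scale).
import math

judgments = {  # (query, doc_id) -> grade; toy data for illustration
    ("running shoes", "doc_17"): 2,
    ("running shoes", "doc_03"): 1,
    ("running shoes", "doc_42"): 0,
}

def dcg(grades):
    # rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg_at_k(query, result_doc_ids, k=10):
    grades = [judgments.get((query, d), 0) for d in result_doc_ids[:k]]
    ideal = sorted((g for (q, _), g in judgments.items() if q == query), reverse=True)[:k]
    return dcg(grades) / dcg(ideal) if dcg(ideal) > 0 else 0.0

print(ndcg_at_k("running shoes", ["doc_03", "doc_17", "doc_42"]))
```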
Scaling up beyond humans enters the domain of machine learning for search. You'll start here by collecting clicks — or whatever other signals of user engagement your app can gather. The setup is similar to the judgment lists above: a bit of math on the clicks gives you labels derived from user engagement. Now you can grind through your server logs to build relevancy models that you can test against.
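The simplest version of that math is a click-through rate per query/document pair, as sketched below. Real systems use click models that correct for position bias, but the shape of the idea is the same; the log row format here is assumed.

```python
# Sketch: derive rough relevance labels from engagement logs. Each log row is
# assumed to be (query, doc_id, was_clicked). Real systems use click models
# that correct for position bias; this is the simplest version of the idea.
from collections import defaultdict

def ctr_labels(log_rows, min_impressions=20):
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc_id, was_clicked in log_rows:
        impressions[(query, doc_id)] += 1
        clicks[(query, doc_id)] += int(was_clicked)
    return {
        pair: clicks[pair] / n
        for pair, n in impressions.items()
        if n >= min_impressions  # skip pairs with too little data to trust
    }
```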
Interestingly, you already have access to a lot of training and evaluation data to help automate all of this, and there is a promising application of Large Language Models in bootstrapping and, to a degree, automating relevancy testing.
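A heavily hedged sketch of that idea: ask a model to grade sampled query/result pairs. `call_llm` is a placeholder for whatever model client you actually use, the prompt and 0-2 scale are illustrative, and human spot-checks of the output remain essential.

```python
# Sketch: use an LLM to bootstrap judgments for sampled queries. `call_llm` is
# a placeholder for whatever model client you actually use; the prompt and the
# 0-2 grading scale are illustrative, not a recommendation.
PROMPT = """You are grading search results.
Query: {query}
Result title: {title}
Result description: {description}
Reply with a single digit: 2 (highly relevant), 1 (somewhat relevant), 0 (not relevant)."""

def llm_judgment(query, doc, call_llm):
    reply = call_llm(PROMPT.format(query=query, title=doc["title"],
                                   description=doc["description"]))
    return int(reply.strip()[0])  # naive parse; validate before trusting it
```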
But that's about what I could fit into a coffee break sketching out some of the "intro to relevancy" questions that might kick start someone down the path of optimizing user interactions with their site.
Did you get this far and are you still looking for some more concreteness on okay-really-how-many to sample? Check out this post on probability-proportional-to-size sampling from Nate Day at OpenSource Connections.
Last but not least, I've been supporting search deployments for 15 years now and am happy to write more on this topic! Message me if there's anything in here you'd be interested to read more about, or if you have other questions about search relevancy or scaling.