Query Understanding, Divided into Three Parts
Julius Caesar started his famous text on the Gallic Wars with the sentence “Gallia est omnis divisa in partes tres.”
Like Gaul, query understanding is, as a whole, divided into three parts: holistic understanding, reductionist understanding, and resolution.
Let’s walk through this breakdown, starting with Caesar’s “Gallia est omnis divisa in partes tres” as an example of a search query. We’ll assume that the search application provides access to a broad set of educational materials (textbooks, literature, etc.).
Holistic Understanding
Holistic understanding is the first step in query understanding. The goal of holistic understanding is to broadly — but not deeply — classify the query.
Typical classes of holistic query understanding include language identification and query categorization.
Implementing holistic query understanding requires building classifiers, either through rules (e.g., regular expressions) or by training a machine learned classification model. A machine learning approach is generally more accurate, flexible, and scalable.
To train a classifier, each holistic query understanding component (for language identification, query categorization, etc.) requires a collection of labeled training data that maps queries to their associated classes. The labels can come from explicit human judgements or from historical search behavior (e.g., mapping a query to a category based on clicks). Since search queries are strings — and often short strings — it’s a good idea to use a character-level embedding like fastText to represent each query as a feature vector.
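As a concrete illustration, here is a minimal sketch of a query categorizer built with the fastText Python package. The training file name, labels, and hyperparameters are assumptions for this example; in practice the labeled data would come from the human judgements or click-derived mappings described above.

```python
# Minimal sketch of a query categorizer using the fastText Python package.
# Assumes train_queries.txt holds one labeled query per line in fastText's
# supervised format, e.g. "__label__history roman empire poetry".
import fasttext

# Character n-grams (minn/maxn) help with short, noisy query strings.
model = fasttext.train_supervised(
    input="train_queries.txt",
    minn=2,     # smallest character n-gram
    maxn=4,     # largest character n-gram
    epoch=25,
    lr=0.5,
)

labels, probabilities = model.predict("gallia est omnis divisa in partes tres")
print(labels[0], probabilities[0])  # e.g. __label__history 0.87
```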
To summarize, holistic understanding looks at the query as a whole. It aims to be broad rather than deep, laying the foundation for later query processing.
Reductionist Understanding
The second step in query understanding is a reductionist understanding that breaks down the query into parts and tries to understand those parts.
Reductionist query understanding performs two related tasks: query segmentation and entity recognition. Query segmentation divides the search query into a sequence of semantic units, each of which consists of one or more tokens. Entity recognition classifies each segment into an entity type.
Our example query doesn’t yield an interesting segmentation, so let’s use this one instead (assuming the same search application): roman empire poetry.
This query should be segmented into two segments: roman empire and poetry. Then the first segment can be recognized as a subject area familiar from the previous example, while the second segment can be recognized as a genre. The segmented, recognized query is Subject: “roman empire”, Genre: “poetry”.
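To make that output concrete, here is a small sketch of one way to represent a segmented, recognized query in code; the data structures, type names, and BIO tags are illustrative conventions rather than anything standard.

```python
# Illustrative representation of the segmented, recognized query
# "roman empire poetry" as two typed segments.
from dataclasses import dataclass

@dataclass
class Segment:
    text: str          # surface form of the segment
    entity_type: str   # e.g. "subject" or "genre"

segments = [
    Segment(text="roman empire", entity_type="subject"),
    Segment(text="poetry", entity_type="genre"),
]

# The equivalent token-level BIO tagging, which is the form most
# sequence-labeling models emit:
#   roman -> B-subject, empire -> I-subject, poetry -> B-genre
bio_tags = ["B-subject", "I-subject", "B-genre"]
```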
Like classification, segmentation and entity recognition generally depend on machine learned models, which in turn depend on labeled training data.
It’s possible to train a single model for this task, but a model that covers all queries is likely to be very complex — since entity types vary significantly across categories. Instead, we can take advantage of the fact that holistic query understanding comes before reductionist query understanding and build a collection of models for segmentation and entity recognition. Holistic understanding makes it possible to select the right model for reductionist understanding — one that corresponds to the right language, category, etc.
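A minimal sketch of that selection step, assuming a registry of taggers keyed by the language and category that holistic understanding produced (the keys, the fallback, and the tagger objects are all assumptions for illustration):

```python
# Illustrative sketch: holistic understanding picks which reductionist model
# (segmentation + entity recognition tagger) to apply to the query.
def select_tagger(language, category, registry):
    """Return the tagger trained for this language and category,
    falling back to a generic tagger if no specialized one exists."""
    return registry.get((language, category), registry[("en", "general")])

# Example registry (the tagger objects themselves are hypothetical):
# registry = {("en", "education"): education_tagger,
#             ("en", "general"):   generic_tagger}
# tagger = select_tagger("en", "education", registry)
```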
As with classification, labels for segmentation and entity recognition can come from explicit human judgements or from historical search behavior. But inferring segments and entity types from clicks is a bit trickier than for whole-query classification. In order to directly infer segmentation and entity recognition labels from a query-document pair, each query token has to uniquely match one structured document field, and each multi-word segment has to correspond to a phrase in the matching field. It’s possible to relax this requirement, but doing so generally leads to a more complex approach.
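Here is a rough sketch of that strict matching rule, assuming documents expose their structured fields as strings; it discards any query whose tokens do not each match exactly one field, and it omits the phrase check for multi-word segments for brevity.

```python
# Infer token-level labels from a clicked document, keeping only the
# unambiguous cases described above. Field names are illustrative.
def infer_labels(query, doc_fields):
    tokens = query.lower().split()
    labels = []
    for token in tokens:
        matches = [field for field, value in doc_fields.items()
                   if token in value.lower().split()]
        if len(matches) != 1:
            return None  # ambiguous or unmatched token: discard this example
        labels.append(matches[0])
    return list(zip(tokens, labels))

doc = {"subject": "Roman Empire", "genre": "Poetry"}
print(infer_labels("roman empire poetry", doc))
# [('roman', 'subject'), ('empire', 'subject'), ('poetry', 'genre')]
```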
Traditionally, people used hidden Markov models (HMM) and conditional random fields (CRF) for segmentation and entity recognition. A more modern approach uses deep learning — specifically, a Bidirectional LSTM-CRF model.
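For orientation, here is a simplified PyTorch sketch of the BiLSTM half of such a tagger. A full BiLSTM-CRF would feed these per-token scores into a CRF layer that decodes the best tag sequence as a whole; that layer is omitted here, and the vocabulary and tag-set sizes are placeholders.

```python
# Simplified BiLSTM tagger in PyTorch. In a BiLSTM-CRF, the per-token
# scores (emissions) produced here would feed a CRF layer for
# sequence-level training and decoding.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=10000, num_tags=5, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.hidden_to_tag = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded query tokens
        embedded = self.embedding(token_ids)
        lstm_out, _ = self.lstm(embedded)
        return self.hidden_to_tag(lstm_out)  # (batch, seq_len, num_tags)
```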
To summarize, reductionist understanding breaks down the query into parts and tries to understand those parts. It often relies on holistic understanding to select an appropriate machine learning model, and then applies that model to perform segmentation and entity recognition.
Resolution
Together, holistic and reductionist query understanding should yield a precise understanding of the searcher’s intent. The last step in query understanding is resolution. Resolution uses the results of the previous two steps to assemble a query for the back-end search engine.
Resolution has two parts. The first maps the recognized entities to query elements. The second assembles these elements into a query.
The first part ideally maps each recognized entity to an entity in a structured data knowledge base, which is typically a taxonomy, an ontology, or a faceted classification. These representations and variants of them are sometimes called knowledge graphs. A modern search engine indexes documents by their structured data entities, assigning each structured data entity a unique identifier. In most cases, the combination of an entity type and a string should be enough to uniquely match an entity. In other cases, the matching may require building a classifier — but that’s beyond the scope of this post.
Returning to our roman empire poetry example, each of the two segments, Subject: “roman empire” and Genre: “poetry”, should map to an entity in the structured data knowledge base.
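A minimal sketch of that lookup, with a plain dictionary standing in for the knowledge base and made-up entity identifiers:

```python
# Hypothetical knowledge base mapping (entity type, surface string) pairs
# to entity identifiers; the identifiers are made up for illustration.
knowledge_base = {
    ("subject", "roman empire"): "subj:1042",
    ("genre", "poetry"): "genre:7",
}

def resolve(segments):
    """segments: list of (entity_type, text) pairs from reductionist understanding."""
    resolved, unresolved = [], []
    for entity_type, text in segments:
        entity_id = knowledge_base.get((entity_type, text.lower()))
        if entity_id:
            resolved.append(entity_id)
        else:
            unresolved.append(text)  # keep unmatched segments as keywords
    return resolved, unresolved

print(resolve([("subject", "roman empire"), ("genre", "poetry")]))
# (['subj:1042', 'genre:7'], [])
```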
The second step assembles the entities — as well as any segments that couldn’t be recognized as entities — into a query that is executed against the search engine. This query may be a simple conjunction — that is, an AND of all of the entities and unmatched keywords. In our example, that would be an AND of the structured data entities corresponding to Subject: Roman Empire and Genre: Poetry.
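As an illustration, here is a sketch of assembling such a conjunction for an Elasticsearch-style backend, assuming documents are indexed with an entity_ids field for resolved entities and a text field for free-text matching; the field names are assumptions for this example.

```python
# Assemble a conjunctive (AND) query from resolved entity IDs and any
# unmatched keywords, in Elasticsearch-style query DSL. Field names
# ("entity_ids", "text") are assumed for illustration.
def build_query(entity_ids, unmatched_keywords):
    clauses = [{"term": {"entity_ids": eid}} for eid in entity_ids]
    clauses += [{"match": {"text": kw}} for kw in unmatched_keywords]
    return {"query": {"bool": {"must": clauses}}}

query = build_query(["subj:1042", "genre:7"], [])
# ANDs the entities for Subject: Roman Empire and Genre: Poetry
```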
For many queries, that’s all that’s necessary. But query assembly may be more complicated. It might expand or relax some of the entities to increase recall — which is especially useful for long queries that might otherwise return few or no results. Query assembly can also make decisions based on high-level intent, such as picking a ranking model or targeting a particular segment of the document collection. It can even determine other aspects of the search experience, such as which facets to present to the searcher.
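One possible relaxation strategy, sketched below against the same assumed Elasticsearch-style query: if the strict conjunction returns too few results, rerun it as a disjunction that requires only most of the clauses to match.

```python
# Relax a strict conjunction into a partial-match disjunction. The 75%
# threshold is an arbitrary choice for illustration.
def relax(strict_query, min_match="75%"):
    clauses = strict_query["query"]["bool"]["must"]
    return {"query": {"bool": {"should": clauses,
                               "minimum_should_match": min_match}}}
```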
In summary, resolution isn’t so much about understanding the query as about translating that understanding into a strategy for retrieving, ranking, and presenting results.
Rome ne fut pas faite toute en un jour
Like Rome, query understanding can’t be built in one day. Implementing holistic understanding, reductionist understanding, and resolution is a lot of work, and as a search team you can always find room to improve all of these. But if you’re not already looking at query understanding in this framework — or if you’re not looking at query understanding at all — I urge you to consider it. It won’t reduce the challenges, but it will help you tackle them in stages.
Bonam fortunam!
P.S. Many thanks to my colleagues Prathyusha and Jon at eBay, whose discussions were key for this framing, and to Dr. David Murphy, who taught me all the Latin I know.