Documents, Queries, and Categories

Daniel Tunkelang

Query Understanding

Published: November 25, 2024

I have published a number of posts and presentations about the bag-of-documents model, which represents query intent as a distribution in a document vector space. I have also written about its dual, the bag-of-queries model, which represents a document as a distribution over the queries for which it is relevant. More recently, I have argued that categories are fundamental for search applications and described ways to obtain them.

Documents, queries, and categories are all key ingredients for building successful search applications. This post aims to tie them together.

Queries represent the distribution of documents they target.

To review, the bag-of-documents model represents a query as a distribution over the vectors of documents relevant to the query. For frequent queries, it is possible to simply aggregate documents based on a query’s engagement history (i.e., clicks and conversions) and compute the mean of their vectors. This process not only produces bag-of-documents representations for frequent queries, but also provides training data to build a model that computes bag-of-documents representations for infrequent queries.
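The aggregation step for frequent queries can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the `(query, doc_id, count)` log format, the toy two-dimensional vectors, and the choice to weight each document by its engagement count are all assumptions made for the example. The mean is also only a summary of the distribution; a fuller representation might track its spread as well.

```python
from collections import defaultdict


def bag_of_documents(engagements, doc_vectors):
    """Return {query: engagement-weighted mean of its documents' vectors}.

    engagements: iterable of (query, doc_id, count) tuples with count > 0
                 (a hypothetical log format, assumed for this sketch).
    doc_vectors: dict mapping doc_id to a list of floats of equal length.
    """
    sums = {}
    totals = defaultdict(float)
    for query, doc_id, count in engagements:
        vec = doc_vectors[doc_id]
        if query not in sums:
            sums[query] = [0.0] * len(vec)
        # Accumulate the weighted sum of document vectors for this query.
        for i, x in enumerate(vec):
            sums[query][i] += count * x
        totals[query] += count
    # Divide by total engagement to turn each sum into a weighted mean.
    return {q: [s / totals[q] for s in sums[q]] for q in sums}


# Toy data: three documents in a two-dimensional vector space.
doc_vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.0, 1.0],
    "d3": [1.0, 1.0],
}
engagements = [
    ("running shoes", "d1", 3),
    ("running shoes", "d3", 1),
    ("trail shoes", "d2", 2),
]
centroids = bag_of_documents(engagements, doc_vectors)
print(centroids["running shoes"])  # [1.0, 0.25]
```

The resulting (query, vector) pairs are the kind of data that could then be used to train a model for infrequent queries.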

Implementation details aside, the key insight is that a query is a partial specification of a document. While a query with high specificity might map to an individual document, most queries have lower specificity and map to a subset of documents. Moreover, while the set of available documents may vary over time, the meaning of a query does not necessarily change. A query represents an information need, which defines a distribution of relevant search results.

Documents represent the distribution of queries that target them.

There is a duality between queries and documents: if a query is a bag of documents, then a document is a bag of queries. Specifically, the bag-of-queries model offers a sparse document representation.

While the bag-of-documents model represents a query as a distribution over the vectors of relevant documents, the bag-of-queries model represents a document as a distribution over the queries to which the document is relevant.

In other words, just as a query is a partial specification of a document, a document is a partial specification of a query. Some documents may only have a single query — or a set of queries that express equivalent search intent — that target them. Other documents are targets for a variety of search intents. Thus, a document can be represented as a distribution over one or more information needs.
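The dual representation can be sketched from the same kind of engagement data. As before, the `(query, doc_id, count)` log format is an assumption for illustration, and normalizing raw engagement counts is only one plausible way to estimate the distribution. The representation is sparse because each document stores only the queries that actually target it.

```python
from collections import Counter, defaultdict


def bag_of_queries(engagements):
    """Return {doc_id: {query: probability}} from engagement counts.

    engagements: iterable of (query, doc_id, count) tuples with count > 0
                 (a hypothetical log format, assumed for this sketch).
    """
    counts = defaultdict(Counter)
    for query, doc_id, count in engagements:
        counts[doc_id][query] += count
    bags = {}
    for doc_id, query_counts in counts.items():
        total = sum(query_counts.values())
        # Normalize counts so each document's query weights sum to 1.
        bags[doc_id] = {q: c / total for q, c in query_counts.items()}
    return bags


engagements = [
    ("running shoes", "d1", 3),
    ("jogging sneakers", "d1", 1),
    ("trail shoes", "d2", 2),
]
bags = bag_of_queries(engagements)
print(bags["d1"])  # {'running shoes': 0.75, 'jogging sneakers': 0.25}
print(bags["d2"])  # {'trail shoes': 1.0}
```

Here `d2` is targeted by a single query, while `d1` is a target for two queries that may or may not express the same intent.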

Categories are a unifying abstraction for documents and queries.

While robust document and query representations are essential, it is important to establish an abstraction layer that unifies them.

Categories optimized for coverage, coherence, and distinctiveness relate documents and queries to their most similar neighbors, which also serve as their best substitutes. Such categories help ensure the 3 Rs of search: relevance, recall, and ranking. Moreover, a great way to obtain categories is to mine frequent queries.
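One way to picture categories as a shared abstraction is to place queries and documents in the same vector space and map each to its most similar category. The sketch below is only an illustration under stated assumptions: the category names, the category vectors, and the use of cosine similarity with a nearest-category rule are hypothetical choices, not a method prescribed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_category(vector, category_vectors):
    """Return the name of the category whose vector is most similar to `vector`."""
    return max(category_vectors, key=lambda name: cosine(vector, category_vectors[name]))


# Hypothetical categories, e.g. centroids derived from frequent queries.
category_vectors = {
    "road running": [1.0, 0.1],
    "trail running": [0.1, 1.0],
}

query_vector = [1.0, 0.25]     # a bag-of-documents mean for some query
document_vector = [0.9, 0.0]   # a document embedding in the same space

query_category = nearest_category(query_vector, category_vectors)
document_category = nearest_category(document_vector, category_vectors)
print(query_category, document_category)
```

When a query and a document land in the same category, that category gives a common handle for reasoning about relevance and substitutes.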

Summary

Understanding the relationship between documents, queries, and categories is essential for building effective search applications. The bag-of-documents and bag-of-queries models illustrate the duality between queries and documents, with each serving as a partial specification of the other. Categories serve as a crucial abstraction layer, ensuring relevance, recall, and ranking. By integrating all three, we can build more robust search applications.
