Elasticsearch Filters and BM25 Relevancy Scoring

In today's data-driven world, finding the needle in the digital haystack is more crucial than ever. As a search engineer who's spent years optimizing Elasticsearch implementations, I wanted to share some insights into how ES actually determines what's "relevant" in your search results.

Elasticsearch Filters

Elasticsearch (ES) is a powerful search and analytics engine that provides fast and efficient ways to search through large volumes of data.

Filters in Elasticsearch are binary decisions that determine whether a document matches certain criteria. Unlike queries that calculate relevance scores, filters simply include or exclude documents. This makes them significantly faster and more efficient.

Common types of filters include:

  • Term filters (exact matches)
  • Range filters (for numeric or date ranges)
  • Exists filters (field must exist)
  • Bool filters (combining multiple filters with AND/OR logic)

Filters are cached automatically by Elasticsearch, making them ideal for frequently used filtering criteria like status fields, categories, or date ranges. Filters do not take part in score calculations.

How does ES calculate relevancy score?

While filters help us narrow down our result set, they don't tell us which documents are most relevant to a user's query. This is where Elasticsearch's scoring mechanism comes into play. Let's explore how Elasticsearch determines which results should rank higher than others. Elasticsearch uses the BM25 algorithm to calculate the relevancy score. BM25 (Best Matching 25) is an improvement over the older TF-IDF scoring model and has been the default in Elasticsearch since version 5.0.

Understanding BM25

BM25 calculates scores based on three main components:

  1. Term Frequency (TF): How often a term appears in a document
  2. Inverse Document Frequency (IDF): How rare or common a term is across all documents
  3. Field Length Normalization: Adjusting for document length

Term Frequency Formula

The formula used by BM-25 to calculate the TF:

TF = (freq)(k1+1) / (freq + k1 * (1 - b + b *(dl/avg_dl)))        

If we use raw term frequency, it gives higher weightage to longer documents, as naturally the longer documents might contain those terms more number of times than shorter documents. But that does not mean if a certain doc contains the same terms a large number of times, that makes it more relevant.

Terms in the formula:

  • freq = frequency of that term in that particular document
  • k1 = term saturation
  • b = length normalization
  • dl = length of a document
  • avgdl = average length of documents in that index

How do these terms help?

The purpose of k1 is to control how quickly the importance of term saturates as its frequency increases.

  • k1 => controls how fast or slow it takes to saturate
  • k1: It's like a "brake" on how quickly TF saturates

Intuitive Explanation with Examples

Let's use an example to make this more concrete. Suppose we have a query term "apple" and three documents:

  • Document A: "apple" appears 1 time.
  • Document B: "apple" appears 5 times.
  • Document C: "apple" appears 10 times.

We'll calculate TFBM25 for each document using different k1 values.

Case 1: k1 = 0.5

TFBM25 = (0.5+1)?tf / (0.5+tf) = 1.5?tf / (0.5+tf)
Document A: 1.5?1 / (0.5+1) = 1.0
Document B: 1.5?5 / (0.5+5) = 1.36
Document C: 1.5?10 / (0.5+10) = 1.43
        

Here, the score increases only slightly from Document A to Document C. Saturation happens quickly.

Case 2: k1 = 1.2 (default)

TFBM25 = (1.2+1)?tf / (1.2+tf) = 2.2?tf / (1.2+tf)
Document A: 2.2?1 / (1.2+1) = 1.0
Document B: 2.2?5 / (1.2+5) = 1.79
Document C: 2.2?10 / (1.2+10) = 1.96
        

Here, the score increases more significantly from Document A to Document C. Saturation happens more slowly.

Case 3: k1 = 2.0

TFBM25 = (2.0+1)?tf / (2.0+tf) = 3.0?tf / (2.0+tf)
Document A: 3.0?1 / (2.0+1) = 1.0
Document B: 3.0?5 / (2.0+5) = 2.14
Document C: 3.0?10 / (2.0+10) = 2.5
        

Here, the score increases significantly even at higher term frequencies. Saturation happens very slowly.

How b helps?

b = 0.75 (default value)

b helps normalize the length of the document, it helps when we might give undue advantage to long documents.

Balanced approach:

  • If dl > avgdl, the denominator increases, lowering TF.
  • If dl < avgdl, the denominator decreases, raising TF slightly.

IDF (Inverse Document Frequency)

IDF describes how important a term is across all documents.

If a term appears in all the documents, it's not that important. If the term appears in few of the documents, it is important.

IDF Formula:

IDF = ln([ (N - n + 0.5) / (n + 0.5) ] + 1)
        

Where:

  • N = total number of documents
  • n = number of documents containing the term
  • smoothening factor = 0.5 is added to prevent the denominator from becoming 0

The 1 is added so the value inside the log always remains more than 1, ensuring the final result never becomes negative.

How is this formula useful?

Let's say a term appears in almost all the documents, so intuitively we can agree that this particular term is of no use when finding the relevant document, as it doesn't add any new value.

How does the IDF formula take care of this?

  • As n approaches N, ln(1) = 0, so the IDF becomes zero, indicating the term is not at all useful for the search.

Final score calculation

score = TF × IDF × boost
        

In most cases, boost = 1.0 unless specifically adjusted in the query.

Tuning BM25 in Elasticsearch

Elasticsearch allows you to customize the BM25 parameters to better suit your specific use case by adjusting the index settings:

PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "BM25",
          "b": 0.75,
          "k1": 1.2
        }
      }
    }
  }
}
        

By understanding how these parameters affect the scoring, you can fine-tune your Elasticsearch queries to deliver the most relevant results for your specific use case.

#Elasticsearch #SearchEngineering #BM25 #DataScience

Note : written by me, formatted by Claude.

要查看或添加评论,请登录

社区洞察

其他会员也浏览了