登录查看更多内容

ElasticSearch Phonetic Algorithms

Daniel Ranallo

Technologist | Advisor | Strategist | Executive | Dad

发布日期: 2019年11月20日

What is a phonetic algorithm

Phonetic algorithms are transformations on text to be indexed that modify the text, generally to a coded transformation based on how the word is pronounced. This allows incoming query requests to be transformed in the same way and ideally hit upon the intended result. There are multiple phonetic algorithms built into the ElasticSearch phonetic analysis plugin, such as Soundex, Metaphone, nysiis, and caverphone. The primary usage of these phonetic algorithms is to encode hard to spell and uncommon words based on how they sound, to make them retrievable by what they sound like. This unique encoding scheme makes them particularly useful for indexing and retrieving by a person’s name. In fact, at least two of the schemes in the plugin were specifically developed to deal with the problem of finding people’s names; New York State Identification and Intelligence System (nysiis) and Soundex. These algorithms sacrifice some precision of results in favor of improving the recall (ability to find) of the system.

Why use a phonetic algorithm

Names are hard. Imagine you are building a system where users will need to search for other users by name, think LinkedIn or Facebook find a friend features. You can probably find your close friends by directly typing in their name as your friend spells it. Because this is a close friend, you at least have a chance of spelling it correctly. But what about that guy you met at that conference Jon Smith, or was it John Smith, or Jean Smidth. Coupled with other search terms, like hypothetical Jon’s company, this can be a very useful feature. Also, consider trying to find any of the very interesting and informative talks by Venkat Subramaniam. You could search for his name with the terms Sbraminem, Subraminem, or even subramen all of which result in the Soundex code S165, meaning you would get hits, meaning high recall. If there were a bunch of talks by someone with the name Sbraminem, using just Soundex you would have poor precision; lots of results you don’t want and a few you do.

Phonetic Analysis Plugin

ElasticSearch’s open-source plugin supports many algorithms. Below is a synthesis of each algorithm included in the plugin that I can find information about:

Metaphone - An improvement on Soundex that takes into consideration different variations on spellings and pronunciations for words in the English language. Ranallo become RNL
Double Metaphone - An improvement on Metaphone that encodes English words in a few different ways, resulting in up to two codes per input that helps to encompass more varieties of word origins. Ranallo interestingly becomes RNL under both transformations.
Soundex - A phonetic algorithm used for analysis of US Census records; given it history, expect it to work well for American surnames. Ranallo becomes R540.
Refined Soundex - An improvement on Soundex that reduces the number of names that will result in the same code. Shifts the precision/recall ratio to higher precision with lower recall.
NYSIIS - An algorithm designed to work particularly well with American names. Empirically, it is slightly better than soundex. Ranallo becomes RANAL
Caverphone1 and Caverphone2 - Algorithms originating from Census related work in New Zealand. Optimized for English names as pronounced with a specific accent (Dunedin, New Zealand). Ranallo becomes RNLA111111 under both versions.
Cologne - Transformation of a word to a numeral following predefined path rules, for example A-> 0, F->3. Ranallo becomes 7060550
Diatch Mokotoff - A refinement of Soundex tuned for Slavic and Yiddish names. Ranallo becomes 968800
Beider Morse - Similar to Soundex but designed to be language independent and more generic with a lower false positive hit rate.
Two additional algorithms, Koelner and Haase are built into the plugin, though I had a harder time finding good information of what makes them unique.

Phonetics with ElasticSearch

Now that we know what a phonetic algorithm is, and some facts that might help decide which phonetic algorithm is best for the use case; it’s time to try it out. The first step is enabling the phonetics plugin for your ElasticSearch server. The documentation shows one way to do that by executing some commands on your server. I prefer immutable infrastructure in almost all cases; this is no different. To enable in docker at image build time, you can add the following line to your Dockerfile:

RUN bin/elasticsearch-plugin install analysis-phonetic

Next, you need to configure the template to index specific fields with the chosen phonetic algorithm. With the example document below:

{

  "id": "aa936e9c-c436-4337-a15e-5a224da7214b",

  "name": "Daniel Ranallo"

}

An index template similar to the following would enable phonetic algorithms on the name field using a dynamic template for the field with the name name:

{
  "index_patterns": [
    "person"
  ],
  "settings": {
    "analysis": {
      "analyzer": {
        "metaphone": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "metaphone_example"
          ]
        },
        "soundex": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "refined_soundex_example"
          ]
        },
        "nysiis": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "nysiis_example"
          ]
        }
      },
      "filter": {
        "metaphone_example": {
          "encoder": "doublemetaphone",
          "replace": false,
          "type": "phonetic"
        },
        "refined_soundex_example": {
          "type": "phonetic",
          "encoder": "refinedsoundex",
          "replace": false
        },
        "nysiis_example": {
          "type": "phonetic",
          "encoder": "nysiis",
          "replace": false
        }
      },
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "phonetic": {
            "match_mapping_type": "string",
            "match": "name",
            "mapping": {
              "type": "text",
              "analyzer": "standard",
              "fields": {
                "metaphone": {
                  "type": "text",
                  "analyzer": "metaphone"
                },
                "soundex": {
                  "type": "text",
                  "analyzer": "soundex"
                },
                "nysiis": {
                  "type": "text",
                  "analyzer": "nysiis"
                }
              }
            }
          }
        }
      ]
    }
  }
}

Set this template by curling a PUT to the server address /templates/example_template. This must be done before indexing data as type mappings are immutable for existing fields. Once the template is set and some data is indexed, you’re ready to query for some documents. Depending on if optimizing for recall or precision is more important for your use case, you can use one phonetic field, all the fields or even none of the fields. If optimizing for recall, a query like the following may suffice:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "ranali",
            "fields": [
              "name",
              "name.soundex",
              "name.metaphone",
              "name.nysiis"
            ]
          }
        }
      ]
    }
  }
}

Swapping the query value from ranali to other common misspellings of my surname such as renali and rinala still results in the expected document.

Performance implications

Most of the work for transformation under all the algorithms is fairly insignificant and will be performed at index time, which will decrease indexing throughput. This may be acceptable for your business requirements. There is also a small performance hit on the analyzer chain as queries are executed, since the query must go through the same transformation. The biggest performance hit I have observed comes from the use of multi-field search. The example query is searching 4 fields, which means 4 inverted indices searched to determine document match. ElasticSearch is good at this sort of work, but if unacceptable latencies are experienced, consider tuning the query by removing extraneous fields.

Conclusion

I have found the phonetic algorithms plugin in ElasticSearch to be highly effective, allowing me to get up and running in a short period of time and match on some very difficult to spell names. The downside is that making the tradeoff for recall over precision has, in some cases, resulted in some very odd results. For example, in the original Soundex implementation, the names Jessica and Joshua both mapped to the same Soundex code. This can be confusing to users but using newer versions of the algorithms and collecting data on what is working for your users and what isn’t, you can usually get around this sort of problem. The phonetic algorithms above can significantly improve the performance of search algorithms, especially when searching for people’s names.

Cover image from https://cityofgastonia.news/1920-census-unc-digitalinnovationlab/

ElasticSearch Phonetic Algorithms

Daniel Ranallo

Technologist | Advisor | Strategist | Executive | Dad

What is a phonetic algorithm

Why use a phonetic algorithm

Phonetic Analysis Plugin

Phonetics with ElasticSearch

Performance implications

Conclusion

更多精彩文章

社区洞察

其他会员也浏览了

Tuning Information Retrieval in Agent Builder Search applications with Google Search?Adaptor.

Elevating RAG with Ensemble Techniques: Unlocking Richer and More Contextual Responses

Dr RAG and Mr HyDE

Notes on Data Compression: Part 4 (JPEG)

Building an Efficient Data Scraper Tool : A Step-by-Step Guide to Algorithm Creation

Economical Statistics: How to Modify a Prediction Crawler's Navigation

Expanded Explanation: Semantic Search and Knowledge Graphs with Large Language Models (LLMs)

My Learnings from CS 242: Information Retrieval & Web Search

Regex – the ultimate language we love to hate!

Speeding Up Your AI-powered Search with JAI Async

What is a phonetic algorithm

Why use a phonetic algorithm

Phonetic Analysis Plugin

Phonetics with ElasticSearch

Performance implications

Conclusion

Lambda Architecture Overview

2019年12月11日

Taints and Tolerations and Selectors, oh my

2019年12月9日

Basic Kubernetes for Developers

2019年12月8日

ElasticSearch Overview

2019年11月16日

Apache Flink Introduction

2019年11月14日

Life Of A Feature Request

2017年11月7日

Simple Rules Based MongoDB Queries

2017年10月15日

社区洞察

其他会员也浏览了

Tuning Information Retrieval in Agent Builder Search applications with Google Search?Adaptor.

Elevating RAG with Ensemble Techniques: Unlocking Richer and More Contextual Responses

Dr RAG and Mr HyDE

Notes on Data Compression: Part 4 (JPEG)

Building an Efficient Data Scraper Tool : A Step-by-Step Guide to Algorithm Creation

Economical Statistics: How to Modify a Prediction Crawler's Navigation

Expanded Explanation: Semantic Search and Knowledge Graphs with Large Language Models (LLMs)

My Learnings from CS 242: Information Retrieval & Web Search

Regex – the ultimate language we love to hate!

Speeding Up Your AI-powered Search with JAI Async