ElasticSearch Phonetic Algorithms
Recall of indexed versions of census records can be greatly improved with phonetic algorithms

ElasticSearch Phonetic Algorithms

What is a phonetic algorithm

Phonetic algorithms are transformations on text to be indexed that modify the text, generally to a coded transformation based on how the word is pronounced. This allows incoming query requests to be transformed in the same way and ideally hit upon the intended result. There are multiple phonetic algorithms built into the ElasticSearch phonetic analysis plugin, such as Soundex, Metaphone, nysiis, and caverphone. The primary usage of these phonetic algorithms is to encode hard to spell and uncommon words based on how they sound, to make them retrievable by what they sound like. This unique encoding scheme makes them particularly useful for indexing and retrieving by a person’s name. In fact, at least two of the schemes in the plugin were specifically developed to deal with the problem of finding people’s names; New York State Identification and Intelligence System (nysiis) and Soundex. These algorithms sacrifice some precision of results in favor of improving the recall (ability to find) of the system.


Why use a phonetic algorithm

Names are hard. Imagine you are building a system where users will need to search for other users by name, think LinkedIn or Facebook find a friend features. You can probably find your close friends by directly typing in their name as your friend spells it. Because this is a close friend, you at least have a chance of spelling it correctly. But what about that guy you met at that conference Jon Smith, or was it John Smith, or Jean Smidth. Coupled with other search terms, like hypothetical Jon’s company, this can be a very useful feature. Also, consider trying to find any of the very interesting and informative talks by Venkat Subramaniam. You could search for his name with the terms Sbraminem, Subraminem, or even subramen all of which result in the Soundex code S165, meaning you would get hits, meaning high recall. If there were a bunch of talks by someone with the name Sbraminem, using just Soundex you would have poor precision; lots of results you don’t want and a few you do.


Phonetic Analysis Plugin

ElasticSearch’s open-source plugin supports many algorithms. Below is a synthesis of each algorithm included in the plugin that I can find information about:

  • Metaphone - An improvement on Soundex that takes into consideration different variations on spellings and pronunciations for words in the English language. Ranallo become RNL
  • Double Metaphone - An improvement on Metaphone that encodes English words in a few different ways, resulting in up to two codes per input that helps to encompass more varieties of word origins. Ranallo interestingly becomes RNL under both transformations.
  • Soundex - A phonetic algorithm used for analysis of US Census records; given it history, expect it to work well for American surnames. Ranallo becomes R540.
  • Refined Soundex - An improvement on Soundex that reduces the number of names that will result in the same code. Shifts the precision/recall ratio to higher precision with lower recall.
  • NYSIIS - An algorithm designed to work particularly well with American names. Empirically, it is slightly better than soundex. Ranallo becomes RANAL
  • Caverphone1 and Caverphone2 - Algorithms originating from Census related work in New Zealand. Optimized for English names as pronounced with a specific accent (Dunedin, New Zealand). Ranallo becomes RNLA111111 under both versions.
  • Cologne - Transformation of a word to a numeral following predefined path rules, for example A-> 0, F->3. Ranallo becomes 7060550
  • Diatch Mokotoff - A refinement of Soundex tuned for Slavic and Yiddish names. Ranallo becomes 968800
  • Beider Morse - Similar to Soundex but designed to be language independent and more generic with a lower false positive hit rate.
  • Two additional algorithms, Koelner and Haase are built into the plugin, though I had a harder time finding good information of what makes them unique.

Phonetics with ElasticSearch

Now that we know what a phonetic algorithm is, and some facts that might help decide which phonetic algorithm is best for the use case; it’s time to try it out. The first step is enabling the phonetics plugin for your ElasticSearch server. The documentation shows one way to do that by executing some commands on your server. I prefer immutable infrastructure in almost all cases; this is no different. To enable in docker at image build time, you can add the following line to your Dockerfile:

RUN bin/elasticsearch-plugin install analysis-phonetic

Next, you need to configure the template to index specific fields with the chosen phonetic algorithm. With the example document below:

{

  "id": "aa936e9c-c436-4337-a15e-5a224da7214b",

  "name": "Daniel Ranallo"

}


An index template similar to the following would enable phonetic algorithms on the name field using a dynamic template for the field with the name name:

{
  "index_patterns": [
    "person"
  ],
  "settings": {
    "analysis": {
      "analyzer": {
        "metaphone": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "metaphone_example"
          ]
        },
        "soundex": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "refined_soundex_example"
          ]
        },
        "nysiis": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "nysiis_example"
          ]
        }
      },
      "filter": {
        "metaphone_example": {
          "encoder": "doublemetaphone",
          "replace": false,
          "type": "phonetic"
        },
        "refined_soundex_example": {
          "type": "phonetic",
          "encoder": "refinedsoundex",
          "replace": false
        },
        "nysiis_example": {
          "type": "phonetic",
          "encoder": "nysiis",
          "replace": false
        }
      },
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "phonetic": {
            "match_mapping_type": "string",
            "match": "name",
            "mapping": {
              "type": "text",
              "analyzer": "standard",
              "fields": {
                "metaphone": {
                  "type": "text",
                  "analyzer": "metaphone"
                },
                "soundex": {
                  "type": "text",
                  "analyzer": "soundex"
                },
                "nysiis": {
                  "type": "text",
                  "analyzer": "nysiis"
                }
              }
            }
          }
        }
      ]
    }
  }
}

Set this template by curling a PUT to the server address /templates/example_template. This must be done before indexing data as type mappings are immutable for existing fields. Once the template is set and some data is indexed, you’re ready to query for some documents. Depending on if optimizing for recall or precision is more important for your use case, you can use one phonetic field, all the fields or even none of the fields. If optimizing for recall, a query like the following may suffice:

{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "ranali",
            "fields": [
              "name",
              "name.soundex",
              "name.metaphone",
              "name.nysiis"
            ]
          }
        }
      ]
    }
  }
}

Swapping the query value from ranali to other common misspellings of my surname such as renali and rinala still results in the expected document.

Performance implications

Most of the work for transformation under all the algorithms is fairly insignificant and will be performed at index time, which will decrease indexing throughput. This may be acceptable for your business requirements. There is also a small performance hit on the analyzer chain as queries are executed, since the query must go through the same transformation. The biggest performance hit I have observed comes from the use of multi-field search. The example query is searching 4 fields, which means 4 inverted indices searched to determine document match. ElasticSearch is good at this sort of work, but if unacceptable latencies are experienced, consider tuning the query by removing extraneous fields.

Conclusion

I have found the phonetic algorithms plugin in ElasticSearch to be highly effective, allowing me to get up and running in a short period of time and match on some very difficult to spell names. The downside is that making the tradeoff for recall over precision has, in some cases, resulted in some very odd results. For example, in the original Soundex implementation, the names Jessica and Joshua both mapped to the same Soundex code. This can be confusing to users but using newer versions of the algorithms and collecting data on what is working for your users and what isn’t, you can usually get around this sort of problem. The phonetic algorithms above can significantly improve the performance of search algorithms, especially when searching for people’s names.

Cover image from https://cityofgastonia.news/1920-census-unc-digitalinnovationlab/

nirjan pakhira

In love with Development | Full stack developer at Data Consultants Corp.

2 年

The best explanations I've even seen

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了