TL;DR: This article gives you an introduction to phonetic search: why and when to use it, and what its limitations are. If you thought fuzziness was your best option for finding similar terms, this blog post explains an alternative approach.
What is the problem with fuzziness?
A common approach to widen your search and to cater for typos is to enable
fuzziness in your search. This has two effects. First, there is a limit on
the fuzziness per term depending on its length, see the Elasticsearch
Fuzziness documentation.
Second, this will increase the CPU utilization of your nodes. Also, from a
results perspective this may not be what you wanted. A fuzzy search for
`lead` might include documents containing `leak`, but those may not be the
documents that you are after.
Phonetic search takes a different approach. It happens at index
time: the tokens are converted and stored in the inverted index in a
different representation. This representation is not based on the edit
distance of terms (which would not be possible, as you would need to compare
against a term at query time), but takes the pronunciation and the order of certain
letters within a term into account and creates a normalized output term, so
that in the best case the representations of two names like `Alexander`
and `Aleksandar` are the same.
Installing the phonetic analysis plugin
bin/elasticsearch-plugin install analysis-phonetic
After this you can use the `phonetic` token filter in your mappings. This is
a minimal example using the analyze API:
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "koelnerphonetik"
}
]
}
The response looks like this:
{
"tokens" : [
{
"token" : "0548627"
},
{
"token" : "0548627"
},
{
"token" : "0548627"
}
]
}
As you can see, the three inputs end up with exactly the same token in the
inverted index. So if anyone searches for `Oleksandr`, any document
containing `Alexander` or `Aleksandar` will also be returned. Looking
at the two terms `Alexander` and `Oleksandr`, this could not be matched with
a fuzzy query, as the Levenshtein edit distance between the two terms is
4 (the number of changes required to get from one term to the other).
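To make the edit-distance argument concrete, here is a minimal Levenshtein distance sketch in Python (my own illustration, not part of Elasticsearch or Lucene):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("lead", "leak"))            # 1, within reach of a fuzzy query
print(levenshtein("Alexander", "Oleksandr"))  # 4, too far for a fuzzy query
```

Default fuzziness caps out at an edit distance of 2, which is why the two name variants above can never match via a fuzzy query.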
The basic idea is to devise an algorithm that returns the same token for
similar sounding or similarly written terms like the names above. Some of those
algorithms return a term without any vowels, others, like the `koelnerphonetik`,
return a sequence of numbers.
Even though it is only a single token filter, it is basically a multiplexer
for quite a few different phonetic encoders, selected via the `encoder`
parameter within the filter. The Elasticsearch plugin
uses phonetic analyzers from Apache Lucene, which in turn uses classes
based on Apache `commons-codec`. Within Elasticsearch all of this is grouped in
`PhoneticTokenFilterFactory`.
Let’s take a look at each of those together with some sample terms to see
what they end up as.
soundex
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "soundex"
}
]
}
Elasticsearch uses the Apache Commons Codec Soundex implementation.
Input | Output |
---|---|
Alexander | A425 |
Aleksandar | A425 |
Oleksandr | O425 |
Meyer | M600 |
Meier | M600 |
Maier | M600 |
Soundex is the poster child of
these phonetic algorithms, as it is probably the best known one. Many
databases implement some form of Soundex in their SQL query engine to
support a simple phonetic algorithm. It is also really old, originating
from 1918 (not a typo 😀). As you can see in the above output, the first letter is preserved
and then a number-based encoding is used. There are a few encoding rules
(let's use `Alexander` as the input term and go through them step by step):
- Drop all occurrences of all vowels plus `y`, `h` and `w`. Term is now `Alxndr`.
- `b`, `f`, `p`, `v` encode to `1`. Term stays `Alxndr`.
- `c`, `g`, `j`, `k`, `q`, `s`, `x`, `z` encode to `2`. Term becomes `Al2ndr`.
- `d`, `t` encode to `3`. Term becomes `Al2n3r`.
- `l` encodes to `4`. Term becomes `A42n3r`.
- `m`, `n` encode to `5`. Term becomes `A4253r`.
- `r` encodes to `6`. Term becomes `A42536`.
- Reduce two adjacent identical letters to one (checked before doing any other reduction). Term stays the same.
- Two letters with the same number separated by `h` or `w` are coded as a single number. Term stays the same.
- The output will always be one letter and three numbers, either padded with zeros or retaining only the first three numbers. This means losing a bit of precision for longer words. For example `experiment` and `experience` both become `E216` (similar to certain algorithmic stemmers, where both of those terms become `experi`). Term becomes `A425` as seen above.
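The rules above fit into a few lines of Python. This is my own simplified re-implementation for illustration, not the commons-codec code:

```python
def soundex(term: str) -> str:
    """American Soundex sketch: one letter plus three digits."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    term = term.lower()
    digits = []
    prev = codes.get(term[0], "")
    for ch in term[1:]:
        if ch in "hw":
            continue  # h and w do not separate identical codes
        code = codes.get(ch, "")  # vowels and y encode to ""
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel survives
    # keep the first letter, pad with zeros, cut to one letter + three digits
    return (term[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Alexander"))  # A425
print(soundex("Meier"))      # M600
```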
In case you are wondering why `Maier` is `M600`: dropping all vowels leaves
us with `Mr` as the term; encoding that yields `M` and `6`, and the rest is
filled with zeros.
Of course a concrete implementation can be much more efficient than this step-by-step walkthrough.
There are also some limitations (that apply to the refined Soundex
implementation below as well). When a term contains a non-ASCII character, like the
German term `Äpfel` (the plural of `Apfel`) or `Bälle` (the plural of `Ball`), it
is not converted at all. So it might make sense to apply ASCII folding
before using Soundex:
GET _analyze?filter_path=tokens.token
{
"text": [ "Äpfel" ],
"tokenizer": "standard",
"filter": [
{
"type":"asciifolding"
},
{
"type": "phonetic",
"encoder": "soundex"
}
]
}
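Outside of Elasticsearch, a rough approximation of such folding is available in Python's standard library. Note this is only a sketch: Lucene's asciifolding maps `ß` to `ss`, for example, while this decomposition approach simply drops it:

```python
import unicodedata

def ascii_fold(term: str) -> str:
    # Decompose accented characters (NFKD) and drop the combining marks.
    return (unicodedata.normalize("NFKD", term)
            .encode("ascii", "ignore")
            .decode("ascii"))

print(ascii_fold("Äpfel"))  # Apfel
print(ascii_fold("Bälle"))  # Balle
```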
Note: the above implementation is actually the so-called American Soundex.
refined_soundex
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "refined_soundex"
}
]
}
Elasticsearch uses the Apache Commons Codec Refined Soundex implementation.
Input | Output |
---|---|
Alexander | A070508609 |
Aleksandar | A070508609 |
Oleksandr | O07030869 |
Meyer | M809 |
Meier | M809 |
Maier | M809 |
This is an adaptation of the initial Soundex idea. The most obvious difference is the length of the output term, as the output is no longer limited to one letter and three numbers. Also, there is no removal of letters, and the encoding alphabet has changed a little; it contains the numbers from 0 up to 9. Identical adjacent letters are still reduced in most cases.
I could not find the proper origin of the refined Soundex algorithm, so if you happen to know it, feel free to drop me a note.
metaphone
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "metaphone"
}
]
}
Elasticsearch uses the Apache Commons Codec Metaphone implementation.
Input | Output |
---|---|
Alexander | ALKS |
Aleksandar | ALKS |
Oleksandr | OLKS |
Meyer | MYR |
Meier | MR |
Maier | MR |
The Metaphone algorithm was published in 1990 with a focus on English words and their pronunciation, even though there is also a Portuguese version (which also has a Postgres extension).
The way Metaphone works is similar to Soundex; it basically expands Soundex
with more rules. The remaining string is uppercased.
Vowels are dropped except at the beginning of a word; the other characters in
use are `0BFHJKLMNPRSTWXY`.
double_metaphone
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "double_metaphone"
}
]
}
Elasticsearch uses the Apache Commons Codec Double Metaphone implementation.
Input | Output |
---|---|
Alexander | ALKS |
Aleksandar | ALKS |
Oleksandr | ALKS |
Meyer | MR |
Meier | MR |
Maier | MR |
The second generation of the Metaphone algorithm is from the year 2000.
Double Metaphone can emit more than one output term, even though the
above examples only return one each. Try `Smith` and `Schmidt`: you
will retrieve four terms, and both input terms will have `XMT` as one output.
The `double_metaphone` algorithm introduced the possibility of having more
than one output term for an input term.
The price for this is higher complexity: the Java class is twice
the size of the original Metaphone algorithm. In order to cater for a few
more edge cases in English that have been introduced by words from
other languages, there are lengthy checks for single letters, like the
`handleC` method.
There is also a third variant of the Metaphone algorithm; you can check out its implementation in the OpenRefine project.
caverphone1
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "caverphone1"
}
]
}
Elasticsearch uses the Apache Commons Codec Caverphone1 implementation.
Input | Output |
---|---|
Alexander | ALKNT1 |
Aleksandar | ALKSNT |
Oleksandr | ALKSNT |
Meyer | MY1111 |
Meier | M11111 |
Maier | M11111 |
Originally from 2002, and back then optimized for a use-case in a specific New
Zealand area. A big difference is that this algorithm does not only go letter by
letter but also replaces certain multi-letter prefixes like `cough` or
`rough`. It also has a length limitation of 6 for the output, padding with
`1`s.
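These two ideas (multi-letter prefix replacement and a fixed-length output padded with `1`) can be sketched in Python. The prefix table below is only a tiny illustrative subset, and the real algorithm performs many more steps before padding:

```python
def caverphone_sketch(term: str) -> str:
    """Toy sketch: a few multi-letter prefix rules, then pad/cut to length 6."""
    term = term.lower()
    # a small subset of multi-letter replacements (illustrative only)
    for pattern, replacement in (("cough", "cou2f"), ("rough", "rou2f")):
        if term.startswith(pattern):
            term = replacement + term[len(pattern):]
            break
    # fixed output length of 6, padded with 1s
    return (term + "111111")[:6]

print(caverphone_sketch("rough"))  # rou2f1
```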
Note: If you want to test it, make sure you are using `caverphone1` in
your mapping setup, as `caverphone` defaults to version 2 of the algorithm.
The describing paper includes a simple JavaScript implementation.
caverphone2
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "caverphone2"
}
]
}
Elasticsearch uses the Apache Commons Codec Caverphone2 implementation.
Input | Output |
---|---|
Alexander | ALKNTA1111 |
Aleksandar | ALKSNTA111 |
Oleksandr | ALKSNTA111 |
Meyer | MA11111111 |
Meier | MA11111111 |
Maier | MA11111111 |
The biggest change on the outside here is the length of the output term. The author's intention was to provide a more general phonetic match than the specific Caverphone 1.0 algorithm. The algorithm was invented in late 2004.
The describing paper includes a simple python implementation along with a comparison of some terms to Soundex and Metaphone.
nysiis
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "nysiis"
}
]
}
Elasticsearch uses the Apache Commons Codec NYSIIS implementation.
Input | Output |
---|---|
Alexander | ALAXAN |
Aleksandar | ALACSA |
Oleksandr | OLACSA |
Meyer | MAYAR |
Meier | MAR |
Maier | MAR |
NYSIIS, short for New York State Identification and Intelligence System, is an algorithm from 1970 and is basically an improved Soundex algorithm.
It contains some specializations, like specific encodings for certain starts
or ends of a term, e.g. terms starting with `Mac` or `Sch`.
cologne
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "cologne"
}
]
}
Elasticsearch uses the Apache Commons Codec Cologne Phonetic implementation.
Input | Output |
---|---|
Alexander | 0548627 |
Aleksandar | 0548627 |
Oleksandr | 0548627 |
Meyer | 67 |
Meier | 67 |
Maier | 67 |
Let’s go a bit more German now - but as you might have guessed, we’re staying
in the realm of Soundex-related algorithms. This one is optimized for the
German language. Introduced in 1969, it is not limited in length, and even
has a letter that produces a two-digit code (`X`, try `Xylophon` or
`Exoskelett`).
The Wikipedia page as well as the commons-codec javadocs list the letter replacements.
koelnerphonetik
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "koelnerphonetik"
}
]
}
Elasticsearch uses a custom Koelner Phonetik implementation.
Input | Output |
---|---|
Alexander | 0548627 |
Aleksandar | 0548627 |
Oleksandr | 0548627 |
Meyer | 67 |
Meier | 67 |
Maier | 67 |
Now this one is slightly confusing. It has the same name as the cologne
phonetic, except that the German name of the city of Cologne is used,
`Köln`. It also uses a different implementation, and I am not sure why
that is the case. It was added to Elasticsearch about 9 years ago, so
it is hard to get the history right I guess 😀
After digging around a little I found out that there are indeed differences
because the `commons-codec` version shipped with Elasticsearch
contains two bugs. If you are interested in more details, check out
Elasticsearch GitHub issue #74614.
haasephonetik
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "haasephonetik"
}
]
}
Elasticsearch uses a custom HaasePhonetik implementation.
Input | Output |
---|---|
Alexander | 9548627 |
Aleksandar | 9548627 |
Oleksandr | 9548627 |
Meyer | 67 |
Meier | 67 |
Maier | 67 |
This is an extension of the Koelner Phonetik, done by Martin Haase. It was initially used for the management of names in a hospital and was also called the extended Koelner Phonetik. As you can see when comparing the terms to the Koelner output, there is no change in this example, except that the first character becomes a 9 if the term starts with a vowel.
The implementation linked above extends the `KoelnerPhonetik` class, so
you can go ahead and spot the differences for various terms and
their encodings.
daitch_mokotoff
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "daitch_mokotoff"
}
]
}
Elasticsearch uses the Apache Commons Codec Daitch Mokotoff implementation.
Input | Output |
---|---|
Alexander | 085463 |
Aleksandar | 085463 |
Oleksandr | 085463 |
Meyer | 619000 |
Meier | 619000 |
Maier | 619000 |
This is an improved Soundex algorithm from 1985, with better accuracy for eastern European names, where pronunciation is similar but spelling differs. Unlike the other implementations, this one performs ASCII folding by default. Compared to earlier Soundex implementations, it also has the ability to take several characters at once into account instead of going character by character.
In order to define replacements for several characters, this algorithm features the concept of rules: there is a `dmrules.txt` file in the commons-codec source that describes what a pattern is replaced with (depending on whether it occurs before a vowel, at the start of a word, or elsewhere).
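The general mechanism of such multi-character rules can be sketched as a longest-pattern-first replacement loop. The rule table below is entirely made up for illustration and is not the real `dmrules.txt` content:

```python
def apply_rules(term, rules):
    """Replace multi-character patterns with codes, longest pattern first."""
    term = term.lower()
    rules = sorted(rules, key=lambda r: -len(r[0]))  # prefer longer matches
    out = []
    i = 0
    while i < len(term):
        for pattern, code in rules:
            if term.startswith(pattern, i):
                out.append(code)
                i += len(pattern)
                break
        else:
            i += 1  # no rule matched, skip the character
    return "".join(out)

# made-up rule set, just to show multi-character matching
toy_rules = [("sch", "4"), ("s", "3"), ("w", "7"),
             ("a", "0"), ("r", "9"), ("z", "4")]
print(apply_rules("Schwarz", toy_rules))  # 47094
```

The real rules additionally depend on context (start of word, before a vowel, elsewhere), which this sketch ignores.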
beider_morse
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "beider_morse"
}
]
}
Elasticsearch uses the Apache Commons Codec Beider-Morse implementation.
Input | Output |
---|---|
Alexander | many terms… |
Aleksandar | many terms… |
Oleksandr | many terms… |
Meyer | mDr |
Meier | mDr |
Maier | mDr |
Finally, a completely different approach to encoding, and a rather new one, as it was published in 2008! It also happens to be the last of the phonetic encodings currently supported by Elasticsearch (check the next paragraph for what else is out there). Stephen Morse and Alexander Beider came up with this to address some shortcomings of the Daitch-Mokotoff algorithm, especially for surnames.
I have not put the output in the table above, because that would have been too many terms. Just for `Alexander` we get 48 terms back! Many of them look very similar, so what is the deal here and how does it work?
Beider-Morse first tries to guess the language of the input term. In order to do that, a few language rule files ship with Beider-Morse. Those rules also let you configure whether you are expecting names to be generic or of Ashkenazic/Sephardic Jewish origin.
Take this example:
GET _analyze?filter_path=tokens.token
{
"text": [ "Schwarz"],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "beider_morse"
}
]
}
GET _analyze?filter_path=tokens.token
{
"text": [ "Szwarc"],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "beider_morse"
}
]
}
returning
{
"tokens" : [
{
"token" : "svYrts"
},
{
"token" : "svarts"
},
{
"token" : "svorts"
}
]
}
{
"tokens" : [
{
"token" : "svarts"
},
{
"token" : "svorts"
}
]
}
As you can see, despite very different spellings, both terms end up
as `svarts` and `svorts` in the index. The reason for this is that
`Schwarz` has been detected as German, whereas `Szwarc` has been detected as
Polish, and thus different rules are applied.
You can take a look at the different rules here
and figure out that words beginning with `sch` are considered German or
Russian, and words ending with `rz` are considered Polish or German.
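This language-guessing step can be sketched as a set intersection: every matching spelling pattern narrows down the candidate languages. The patterns are only the two observations above; the real rule files are much larger:

```python
def guess_languages(name: str) -> set:
    """Narrow a candidate language set via spelling patterns (toy sketch)."""
    candidates = {"german", "russian", "polish", "english"}
    name = name.lower()
    if name.startswith("sch"):
        candidates &= {"german", "russian"}
    if name.endswith("rz"):
        candidates &= {"german", "polish"}
    return candidates

print(guess_languages("Schwarz"))  # {'german'}
```

With both patterns matching, only German remains, so German pronunciation rules get applied to `Schwarz`.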
That said, Beider-Morse will probably create more tokens than the other algorithms, so keep that in mind for your space requirements.
Others not implemented in Elasticsearch
If you thought that this is all you need to know, then you’re wrong. Let’s briefly go through some more algorithms.
Match Rating Approach
Available as an Apache Commons Codec implementation.
This approach was developed in 1977 by Western Airlines. It normalizes the term
(removing accents), removes vowels as well as double consonants, and then
reduces the result down to its first three and last three letters. So for
`Alexander` this would be `ALXNDR`.
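A rough sketch of that encoding in Python (my own simplification, not the commons-codec implementation):

```python
def match_rating_sketch(name: str) -> str:
    """Drop non-leading vowels, collapse doubled letters, keep 3+3 letters."""
    name = name.upper()
    # keep the first letter even if it is a vowel
    kept = name[0] + "".join(c for c in name[1:] if c not in "AEIOU")
    out = []
    for c in kept:
        if not out or out[-1] != c:  # collapse doubled consonants
            out.append(c)
    code = "".join(out)
    # longer codes are reduced to first three plus last three letters
    return code if len(code) <= 6 else code[:3] + code[-3:]

print(match_rating_sketch("Alexander"))  # ALXNDR
```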
Moar Soundex
Fuzzy Soundex is another variant found in some implementations; it performs additional substitutions.
Next up there is the Lein Name Coding, another variation on Soundex/NYSIIS, which produces another 4-character code for any given name.
Then there is ONCA - the Oxford Name Compression Algorithm from 1997, another extension of Soundex and NYSIIS.
The Roger Root Name Coding Procedure produces a
fully numeric 5-character code. The linked PDF is the same as for the Lein
Name Coding; check the page labeled 28 for a description of how this
works.
Statistics Canada Name Coding is a special coding
over the French standard alphabet. The linked PDF is the same as for the Lein
Name Coding; check the page labeled 26 for a description of how this
works.
The final one on this list (if you dig, you will find more, also for German
use-cases like `phonem`) is
phonex,
which tries to improve matching a little based on common names.
Eudex
Another interesting candidate is Eudex, derived from the classification of pulmonic consonants, and faster than Soundex as well. Check out the Rust library or the Java port.
Phonetic algorithm as a building block
Spell checkers sometimes use phonetic algorithms as a building block (along with customized spell checking implementations nowadays), the difference being that the input is compared against a set of stored terms. For example, Elasticsearch uses such functionality to check configuration settings and suggest alternatives in case an invalid setting is specified.
An interesting variation of this is not using the regular edit distance, but
taking the position of keys on the keyboard into account (making this more
complex if you do not know the user’s keyboard layout). The most common use-case
of this is probably known as T9 among the elderly… Take a look at the
abydos distance package, which
features some algorithms that take the keyboard into account when measuring
the distance between two strings.
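To illustrate the idea, here is a toy weighted edit distance where substituting physically adjacent keys on a QWERTY layout is cheaper; the 0.5 weight and the layout handling are made up for the sketch:

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, i) for r, row in enumerate(ROWS) for i, c in enumerate(row)}

def sub_cost(a: str, b: str) -> float:
    """Half-price substitutions for neighboring keys (toy weighting)."""
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return 0.5 if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1 else 1.0

def keyboard_distance(s: str, t: str) -> float:
    prev = [float(j) for j in range(len(t) + 1)]
    for i, ca in enumerate(s.lower(), 1):
        cur = [float(i)]
        for j, cb in enumerate(t.lower(), 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + sub_cost(ca, cb)))
        prev = cur
    return prev[-1]

print(keyboard_distance("lead", "lwad"))  # 0.5: w sits right next to e
print(keyboard_distance("lead", "lpad"))  # 1.0: p is far from e
```

Under such a weighting, typos caused by hitting a neighboring key rank closer to the intended term than arbitrary one-letter changes.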
There are also a couple of newer projects like NeuSpell, a neural spelling correction toolkit.
Summary
I intentionally included the publication dates of those algorithms here. None of this is new or fancy; most of it is decades old, yet it does not see too much usage. One should also keep in mind that the use-case is narrower than expected, as this is mostly useful for names - but product names can also be names, and some of the algorithms have been extended to support a wider model.
You can also see that this focuses a lot on western languages. A few of the algorithms have a bit more support for non-western languages, but I rarely see that in the wild.
I hope you enjoyed our ride across the different phonetic search algorithms and keep this in mind when someone asks you to implement a fuzzy search… ask for the use-case first!
If there are fancy or new algorithms and implementations I forgot about over here, please give me a ping and I am happy to add them.
Resources
- java-string-similarity - a library implementing different string similarity and distance measures
- Introduction into Information Retrieval (Manning, Raghavan, Schütze): Phonetic Correction
- Zobel and Dart: Phonetic String Matching: Lessons from Information Retrieval
- Weighting Edit Distance to Improve Spelling Correction in Music Entity Search
- Approximate string-matching with q-grams and maximal matches
- Aspekte der Kodierung phonetischer Ähnlichkeiten in deutschen Eigennamen
- Improved An Algorithm For Arabic Name Matching
- Phonetic Matching: A Better Soundex
- Caverphone 1.0 specification
- Caverphone 2.0 specification
- phonetics, a node library implement various Soundex like algorithms
- phonics R package, providing a variety of phonetic indexing algorithms in the `R` programming language
- DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm Based on Learned High Dimensional Encoding, plus python implementations
- python implementation of a Russian Soundex algorithm
- python abydos phonetic package
Final remarks
If you made it down here, wooow! Thanks for sticking with me. You can follow or ping me on Twitter or GitHub, or reach me via email (just to tell me you read this whole thing :-).
If there is anything to correct, drop me a note, and I am happy to do so and append to this post!
The same applies to questions. If you have a question, go ahead and ask!
If you want me to speak about this, drop me an email!