TL;DR: This article gives you an introduction to phonetic search: why and when to use it, and what its limitations are. If you thought fuzziness was your best option for finding similar terms, this blog post explains an alternative approach.
What is the problem with fuzziness?
A common approach to widen your search and to cater for typos is to enable
fuzziness in your search. This has two effects. First, there is a limit on
the fuzziness per term depending on its length, see the Elasticsearch
Fuzziness documentation.
Second, this will increase the CPU utilization of your nodes. Also, from a
results perspective this may not be what you wanted. A fuzzy search for
`lead` might include documents containing `leak`, but those may not be the
documents that you are after.
Phonetic search takes a different approach. It happens at index
time: the tokens are converted and stored in the inverted index in a
different representation. This representation is not based on the edit
distance of terms (which would not be possible, as you would need to compare
against a term at query time), but takes the pronunciation and the order of certain
letters within a term into account and creates a normalized output term, so
that in the best case the representations of two names like `Alexander`
and `Aleksandar` are the same.
Installing the phonetic analysis plugin
bin/elasticsearch-plugin install analysis-phonetic
After this you can use the `phonetic` token filter in your mappings. This is
a minimal example using the analyze API:
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "koelnerphonetik"
}
]
}
The response looks like this:
{
"tokens" : [
{
"token" : "0548627"
},
{
"token" : "0548627"
},
{
"token" : "0548627"
}
]
}
As you can see, the three inputs end up with exactly the same token in the
inverted index. So if anyone searches for `Oleksandr`, any document
containing `Alexander` or `Aleksandar` will also be returned. Looking
at the two terms `Alexander` and `Oleksandr`, this could not be matched with
a fuzzy query, as the Levenshtein edit distance between the two terms is
4 (the number of changes required to get from one term to the other).
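To make the edit-distance argument concrete, here is a minimal Levenshtein distance sketch in Python (my own illustration, not part of Elasticsearch or Lucene):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = cur
    return prev[-1]

print(levenshtein("lead", "leak"))            # 1, within reach of a fuzzy query
print(levenshtein("Alexander", "Oleksandr"))  # 4, too far for a fuzzy query
```

Default fuzziness caps out at an edit distance of 2, which is why the two name variants above can never match via a fuzzy query.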
The basic idea is to devise an algorithm that returns the same token for
similar sounding or similarly written terms like the names above. Some of those
algorithms return a term without any vowels, others, like the `koelnerphonetik`,
return a sequence of numbers.
Even though it is only a single token filter, it is basically a multiplexer
for quite a few different phonetic encoders, selected via the `encoder`
parameter within the filter. The Elasticsearch plugin
uses phonetic analyzers from Apache Lucene, which in turn uses classes
based on Apache `commons-codec`. Within Elasticsearch all of this is grouped in
`PhoneticTokenFilterFactory`.
Let’s take a look at each of those together with some sample terms to see
what they end up as.
soundex
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "soundex"
}
]
}
Elasticsearch uses the Apache Commons Codec Soundex implementation.
Input | Output |
---|---|
Alexander | A425 |
Aleksandar | A425 |
Oleksandr | O425 |
Meyer | M600 |
Meier | M600 |
Maier | M600 |
Soundex is the poster child of
these phonetic algorithms, as it is probably the best known one. Many
databases implement some form of Soundex in their SQL query engine to
support a simple phonetic algorithm. It is also really old, originating
from 1918 (not a typo 😀). As you can see in the above output, the first letter is preserved
and then a number-based encoding is used. There are a few encoding rules
(let's use `Alexander` as the input term and go through them step by step):
- Drop all occurrences of all vowels plus `y`, `h` and `w`. Term is now `Alxndr`.
- `b`, `f`, `p`, `v` encode to `1`. Term stays `Alxndr`.
- `c`, `g`, `j`, `k`, `q`, `s`, `x`, `z` encode to `2`. Term becomes `Al2ndr`.
- `d`, `t` encode to `3`. Term becomes `Al2n3r`.
- `l` encodes to `4`. Term becomes `A42n3r`.
- `m`, `n` encode to `5`. Term becomes `A4253r`.
- `r` encodes to `6`. Term becomes `A42536`.
- Reduce two adjacent identical letters to one (checked before doing any other reduction). Term stays the same.
- Two letters with the same number separated by `h` or `w` are coded as a single number. Term stays the same.
- The output will always be one letter and three numbers, either padded with zeros or retaining only the first three numbers. This means losing a bit of precision for longer words. For example `experiment` and `experience` both become `E216` (similar to certain algorithmic stemmers, where both of those terms become `experi`). Term becomes `A425` as seen above.
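The rules above fit into a few lines of Python. This is my own simplified re-implementation for illustration, not the commons-codec code:

```python
def soundex(term: str) -> str:
    """American Soundex sketch: one letter plus three digits."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    term = term.lower()
    digits = []
    prev = codes.get(term[0], "")
    for ch in term[1:]:
        if ch in "hw":
            continue  # h and w do not separate identical codes
        code = codes.get(ch, "")  # vowels and y encode to ""
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel survives
    # keep the first letter, pad with zeros, cut to one letter + three digits
    return (term[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Alexander"))  # A425
print(soundex("Meier"))      # M600
```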
In case you are wondering why `Maier` is `M600`: dropping all vowels leaves
us with `Mr` as the term; encoding that yields `M` and `6`, and the rest is
filled with zeros.
Of course a concrete implementation can be much more efficient than this step-by-step walkthrough.
There are also some limitations (that apply to the refined Soundex
implementation below as well). When a term contains a non-ASCII character, like the
German term `Äpfel` (the plural of `Apfel`) or `Bälle` (the plural of `Ball`), it
is not converted at all. So it might make sense to apply ASCII folding
before using Soundex:
GET _analyze?filter_path=tokens.token
{
"text": [ "Äpfel" ],
"tokenizer": "standard",
"filter": [
{
"type":"asciifolding"
},
{
"type": "phonetic",
"encoder": "soundex"
}
]
}
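Outside of Elasticsearch, a rough approximation of such folding is available in Python's standard library. Note this is only a sketch: Lucene's asciifolding maps `ß` to `ss`, for example, while this decomposition approach simply drops it:

```python
import unicodedata

def ascii_fold(term: str) -> str:
    # Decompose accented characters (NFKD) and drop the combining marks.
    return (unicodedata.normalize("NFKD", term)
            .encode("ascii", "ignore")
            .decode("ascii"))

print(ascii_fold("Äpfel"))  # Apfel
print(ascii_fold("Bälle"))  # Balle
```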
Note: the above implementation is actually the so-called American Soundex.
refined_soundex
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "refined_soundex"
}
]
}
Elasticsearch uses the Apache Commons Codec Refined Soundex implementation.
Input | Output |
---|---|
Alexander | A070508609 |
Aleksandar | A070508609 |
Oleksandr | O07030869 |
Meyer | M809 |
Meier | M809 |
Maier | M809 |
This is an adaptation of the initial Soundex idea. The most obvious difference is the length of the output term, as the output is no longer limited to one letter and three numbers. Also, there is no removal of letters, and the encoding alphabet has changed a little; it contains the numbers from 0 up to 9. Identical adjacent letters are still reduced in most cases.
I could not find the proper origin of the refined Soundex algorithm, so if you happen to know it, feel free to drop me a note.
metaphone
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "metaphone"
}
]
}
Elasticsearch uses the Apache Commons Codec Metaphone implementation.
Input | Output |
---|---|
Alexander | ALKS |
Aleksandar | ALKS |
Oleksandr | OLKS |
Meyer | MYR |
Meier | MR |
Maier | MR |
The Metaphone algorithm was published in 1990 with a focus on English words and their pronunciation, even though there is also a Portuguese version (which also has a Postgres extension).
The way Metaphone works is similar to Soundex; it basically expands Soundex
with more rules. The remaining string is uppercased.
Vowels are dropped except at the beginning of a word; the other characters in
use are `0BFHJKLMNPRSTWXY`.
double_metaphone
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "double_metaphone"
}
]
}
Elasticsearch uses the Apache Commons Codec Double Metaphone implementation.
Input | Output |
---|---|
Alexander | ALKS |
Aleksandar | ALKS |
Oleksandr | ALKS |
Meyer | MR |
Meier | MR |
Maier | MR |
The second generation of the Metaphone algorithm is from the year 2000.
Double Metaphone can emit more than one output term, even though the
above examples only return one each. Try `Smith` and `Schmidt`: you
will retrieve four terms, and both input terms will have `XMT` as one output.
The `double_metaphone` algorithm introduced the possibility of having more
than one output term for an input term.
The price for this is higher complexity: the Java class is twice
the size of the original Metaphone algorithm. In order to cater for a few
more edge cases in English that have been introduced by words from
other languages, there are lengthy checks for single letters, like the
`handleC` method.
There is also a third variant of the Metaphone algorithm; you can check out its implementation in the OpenRefine project.
caverphone1
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "caverphone1"
}
]
}
Elasticsearch uses the Apache Commons Codec Caverphone1 implementation.
Input | Output |
---|---|
Alexander | ALKNT1 |
Aleksandar | ALKSNT |
Oleksandr | ALKSNT |
Meyer | MY1111 |
Meier | M11111 |
Maier | M11111 |
Originally from 2002, and back then optimized for a use-case in a specific New
Zealand area. A big difference is that this algorithm does not only go letter by
letter but also replaces certain multi-letter prefixes like `cough` or
`rough`. It also has a length limitation of 6 for the output, padding with
`1`s.
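These two ideas (multi-letter prefix replacement and a fixed-length output padded with `1`) can be sketched in Python. The prefix table below is only a tiny illustrative subset, and the real algorithm performs many more steps before padding:

```python
def caverphone_sketch(term: str) -> str:
    """Toy sketch: a few multi-letter prefix rules, then pad/cut to length 6."""
    term = term.lower()
    # a small subset of multi-letter replacements (illustrative only)
    for pattern, replacement in (("cough", "cou2f"), ("rough", "rou2f")):
        if term.startswith(pattern):
            term = replacement + term[len(pattern):]
            break
    # fixed output length of 6, padded with 1s
    return (term + "111111")[:6]

print(caverphone_sketch("rough"))  # rou2f1
```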
Note: If you want to test it, make sure you are using `caverphone1` in
your mapping setup, as `caverphone` defaults to version 2 of the algorithm.
The describing paper includes a simple JavaScript implementation.
caverphone2
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "caverphone2"
}
]
}
Elasticsearch uses the Apache Commons Codec Caverphone2 implementation.
Input | Output |
---|---|
Alexander | ALKNTA1111 |
Aleksandar | ALKSNTA111 |
Oleksandr | ALKSNTA111 |
Meyer | MA11111111 |
Meier | MA11111111 |
Maier | MA11111111 |
The biggest change on the outside here is the length of the output term. The author's intention was to provide a more general phonetic match than the specific Caverphone 1.0 algorithm. The algorithm was invented in late 2004.
The describing paper includes a simple python implementation along with a comparison of some terms to Soundex and Metaphone.
nysiis
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "nysiis"
}
]
}
Elasticsearch uses the Apache Commons Codec NYSIIS implementation.
Input | Output |
---|---|
Alexander | ALAXAN |
Aleksandar | ALACSA |
Oleksandr | OLACSA |
Meyer | MAYAR |
Meier | MAR |
Maier | MAR |
NYSIIS, short for New York State Identification and Intelligence System, is an algorithm from 1970 and is basically an improved Soundex algorithm.
It contains some specializations, like specific encodings for certain starts
or ends of a term, e.g. terms starting with `Mac` or `Sch`.
cologne
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "cologne"
}
]
}
Elasticsearch uses the Apache Commons Codec Cologne Phonetic implementation.
Input | Output |
---|---|
Alexander | 0548627 |
Aleksandar | 0548627 |
Oleksandr | 0548627 |
Meyer | 67 |
Meier | 67 |
Maier | 67 |
Let’s go a bit more German now - but as you might have guessed, we’re staying
in the realm of Soundex-related algorithms. This one is optimized for the
German language. Introduced in 1969, it is not limited in length, and even
has a letter that produces a two-digit code (`X`, try `Xylophon` or
`Exoskelett`).
The Wikipedia page as well as the commons-codec javadocs list the letter replacements.
koelnerphonetik
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "koelnerphonetik"
}
]
}
Elasticsearch uses a custom Koelner Phonetik implementation.
Input | Output |
---|---|
Alexander | 0548627 |
Aleksandar | 0548627 |
Oleksandr | 0548627 |
Meyer | 67 |
Meier | 67 |
Maier | 67 |
Now this one is slightly confusing. It has the same name as the cologne
phonetic, except that the German name of the city of Cologne is used,
`Köln`. It also uses a different implementation, and I am not sure why
that is the case. It was added to Elasticsearch about 9 years ago, so
it is hard to get the history right I guess 😀
After digging around a little I found out that there are indeed differences
because the `commons-codec` version shipped with Elasticsearch
contains two bugs. If you are interested in more details, check out
Elasticsearch GitHub issue #74614.
haasephonetik
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "haasephonetik"
}
]
}
Elasticsearch uses a custom HaasePhonetik implementation.
Input | Output |
---|---|
Alexander | 9548627 |
Aleksandar | 9548627 |
Oleksandr | 9548627 |
Meyer | 67 |
Meier | 67 |
Maier | 67 |
This is an extension of the Koelner Phonetik, done by Martin Haase. It was initially used for the management of names in a hospital and was also called the extended Koelner Phonetik. As you can see when comparing the terms to the Koelner output, there is no change in this example, except that the first character becomes a 9 if the term starts with a vowel.
The implementation linked above extends the `KoelnerPhonetik` class, so
you can go ahead and spot the differences for various terms and
their encodings.
daitch_mokotoff
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "daitch_mokotoff"
}
]
}
Elasticsearch uses the Apache Commons Codec Daitch Mokotoff implementation.
Input | Output |
---|---|
Alexander | 085463 |
Aleksandar | 085463 |
Oleksandr | 085463 |
Meyer | 619000 |
Meier | 619000 |
Maier | 619000 |
This is an improved Soundex algorithm from 1985, with better accuracy for eastern European names, where pronunciation is similar but spelling differs. Unlike the other implementations, this one performs ASCII folding by default. Compared to earlier Soundex implementations, it also has the ability to take several characters at once into account instead of going character by character.
In order to define replacements for several characters, this algorithm features the concept of rules: there is a `dmrules.txt` file in the commons-codec source that describes what a pattern is replaced with (depending on whether it occurs before a vowel, at the start of a word, or elsewhere).
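The general mechanism of such multi-character rules can be sketched as a longest-pattern-first replacement loop. The rule table below is entirely made up for illustration and is not the real `dmrules.txt` content:

```python
def apply_rules(term, rules):
    """Replace multi-character patterns with codes, longest pattern first."""
    term = term.lower()
    rules = sorted(rules, key=lambda r: -len(r[0]))  # prefer longer matches
    out = []
    i = 0
    while i < len(term):
        for pattern, code in rules:
            if term.startswith(pattern, i):
                out.append(code)
                i += len(pattern)
                break
        else:
            i += 1  # no rule matched, skip the character
    return "".join(out)

# made-up rule set, just to show multi-character matching
toy_rules = [("sch", "4"), ("s", "3"), ("w", "7"),
             ("a", "0"), ("r", "9"), ("z", "4")]
print(apply_rules("Schwarz", toy_rules))  # 47094
```

The real rules additionally depend on context (start of word, before a vowel, elsewhere), which this sketch ignores.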
beider_morse
GET _analyze?filter_path=tokens.token
{
"text": [ "Alexander", "Aleksandar", "Oleksandr" ],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "beider_morse"
}
]
}
Elasticsearch uses the Apache Commons Codec Beider-Morse implementation.
Input | Output |
---|---|
Alexander | many terms… |
Aleksandar | many terms… |
Oleksandr | many terms… |
Meyer | mDr |
Meier | mDr |
Maier | mDr |
Finally, a completely different approach to encoding, and a rather new one, as it was published in 2008! It also happens to be the last of the phonetic encodings currently supported by Elasticsearch (check the next paragraph for what else is out there). Stephen Morse and Alexander Beider came up with this to address some shortcomings of the Daitch-Mokotoff algorithm, especially for surnames.
I have not put the output in the table above, because that would have been too many terms. Just for `Alexander` we get 48 terms back! Many of them look very similar, so what is the deal here and how does it work?
Beider-Morse first tries to guess the language of the input term. In order to do that, a few language rule files ship with Beider-Morse. Those rules also let you configure whether you are expecting names to be generic or of Ashkenazic/Sephardic Jewish origin.
Take this example:
GET _analyze?filter_path=tokens.token
{
"text": [ "Schwarz"],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "beider_morse"
}
]
}
GET _analyze?filter_path=tokens.token
{
"text": [ "Szwarc"],
"tokenizer": "standard",
"filter": [
{
"type": "phonetic",
"encoder": "beider_morse"
}
]
}
returning
{
"tokens" : [
{
"token" : "svYrts"
},
{
"token" : "svarts"
},
{
"token" : "svorts"
}
]
}
{
"tokens" : [
{
"token" : "svarts"
},
{
"token" : "svorts"
}
]
}
As you can see, despite very different spellings, both terms end up
as `svarts` and `svorts` in the index. The reason for this is that
`Schwarz` has been detected as German, whereas `Szwarc` has been detected as
Polish, and thus different rules are applied.
You can take a look at the different rules here
and figure out that words beginning with `sch` are considered German or
Russian, and words ending with `rz` are considered Polish or German.
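This language-guessing step can be sketched as a set intersection: every matching spelling pattern narrows down the candidate languages. The patterns are only the two observations above; the real rule files are much larger:

```python
def guess_languages(name: str) -> set:
    """Narrow a candidate language set via spelling patterns (toy sketch)."""
    candidates = {"german", "russian", "polish", "english"}
    name = name.lower()
    if name.startswith("sch"):
        candidates &= {"german", "russian"}
    if name.endswith("rz"):
        candidates &= {"german", "polish"}
    return candidates

print(guess_languages("Schwarz"))  # {'german'}
```

With both patterns matching, only German remains, so German pronunciation rules get applied to `Schwarz`.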
That said, Beider-Morse will probably create more tokens than the other algorithms, so keep that in mind for your space requirements.
Others not implemented in Elasticsearch
If you thought that this is all you need to know, then you’re wrong. Let’s briefly go through some more algorithms.
Match Rating Approach
Available as an Apache Commons Codec implementation.
This approach was developed in 1977 by Western Airlines. It normalizes the term
(removing accents), removes vowels as well as double consonants, and then
reduces the result down to its first three and last three letters. So for
`Alexander` this would be `ALXNDR`.
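A rough sketch of that encoding in Python (my own simplification, not the commons-codec implementation):

```python
def match_rating_sketch(name: str) -> str:
    """Drop non-leading vowels, collapse doubled letters, keep 3+3 letters."""
    name = name.upper()
    # keep the first letter even if it is a vowel
    kept = name[0] + "".join(c for c in name[1:] if c not in "AEIOU")
    out = []
    for c in kept:
        if not out or out[-1] != c:  # collapse doubled consonants
            out.append(c)
    code = "".join(out)
    # longer codes are reduced to first three plus last three letters
    return code if len(code) <= 6 else code[:3] + code[-3:]

print(match_rating_sketch("Alexander"))  # ALXNDR
```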
Moar Soundex
Fuzzy Soundex is another variant found in some implementations; it performs additional substitutions.
Next up there is the Lein Name Coding, another variation on Soundex/NYSIIS, which produces another 4-character code for any given name.
Then there is ONCA - the Oxford Name Compression Algorithm from 1997, another extension of Soundex and NYSIIS.
The Roger Root Name Coding Procedure produces a
fully numeric 5-character code. The linked PDF is the same as for the Lein
Name Coding; check the page labeled 28 for a description of how this
works.
Statistics Canada Name Coding is a special coding
over the French standard alphabet. The linked PDF is the same as for the Lein
Name Coding; check the page labeled 26 for a description of how this
works.
The final one on this list (if you dig, you will find more, also for German
use-cases like `phonem`) is
phonex,
which tries to improve matching a little based on common names.
Eudex
Another interesting candidate is Eudex, derived from the classification of pulmonic consonants, and faster than Soundex as well. Check out the Rust library or the Java port.
Phonetic algorithm as a building block
Spell checkers sometimes use phonetic algorithms as a building block (along with customized spell checking implementations nowadays), the difference being that the input is compared against a set of stored terms. For example, Elasticsearch uses such functionality to check configuration settings and suggest alternatives in case an invalid setting is specified.
An interesting variation of this is not using the regular edit distance, but
taking the position of keys on the keyboard into account (making this more
complex if you do not know the user’s keyboard layout). The most common use-case
of this is probably known as T9 among the elderly… Take a look at the
abydos distance package, which
features some algorithms that take the keyboard into account when measuring
the distance between two strings.
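To illustrate the idea, here is a toy weighted edit distance where substituting physically adjacent keys on a QWERTY layout is cheaper; the 0.5 weight and the layout handling are made up for the sketch:

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, i) for r, row in enumerate(ROWS) for i, c in enumerate(row)}

def sub_cost(a: str, b: str) -> float:
    """Half-price substitutions for neighboring keys (toy weighting)."""
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return 0.5 if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1 else 1.0

def keyboard_distance(s: str, t: str) -> float:
    prev = [float(j) for j in range(len(t) + 1)]
    for i, ca in enumerate(s.lower(), 1):
        cur = [float(i)]
        for j, cb in enumerate(t.lower(), 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + sub_cost(ca, cb)))
        prev = cur
    return prev[-1]

print(keyboard_distance("lead", "lwad"))  # 0.5: w sits right next to e
print(keyboard_distance("lead", "lpad"))  # 1.0: p is far from e
```

Under such a weighting, typos caused by hitting a neighboring key rank closer to the intended term than arbitrary one-letter changes.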
There are also a couple of newer projects like NeuSpell, a neural spelling correction toolkit.
Summary
I intentionally included the publication dates of those algorithms here. None of this is new or fancy; most of it is decades old, yet it does not see too much usage. One should also keep in mind that the use-case is narrower than expected, as this is mostly useful for names - but product names can also be names, and some of the algorithms have been extended to support a wider model.
You can also see that this focuses a lot on western languages. A few of the algorithms have a bit more support for non-western languages, but I rarely see that in the wild.
I hope you enjoyed our ride across the different phonetic search algorithms and keep this in mind when someone asks you to implement a fuzzy search… ask for the use-case first!
If there are fancy or new algorithms and implementations I forgot about over here, please give me a ping and I am happy to add them.
Resources
- java-string-similarity - a library implementing different string similarity and distance measures
- Introduction into Information Retrieval (Manning, Raghavan, Schütze): Phonetic Correction
- Zobel and Dart: Phonetic String Matching: Lessons from Information Retrieval
- Weighting Edit Distance to Improve Spelling Correction in Music Entity Search
- Approximate string-matching with q-grams and maximal matches
- Aspekte der Kodierung phonetischer Ähnlichkeiten in deutschen Eigennamen
- Improved An Algorithm For Arabic Name Matching
- Phonetic Matching: A Better Soundex
- Caverphone 1.0 specification
- Caverphone 2.0 specification
- phonetics, a node library implement various Soundex like algorithms
- phonics R package, providing a variety of phonetic indexing algorithms in the `R` programming language
- DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm Based on Learned High Dimensional Encoding, plus python implementations
- python implementation of a Russian Soundex algorithm
- python abydos phonetic package
Final remarks
If you made it down here, wooow! Thanks for sticking with me. You can follow or ping me on Twitter or GitHub, or reach me via email (just to tell me you read this whole thing :-).
If there is anything to correct, drop me a note, and I am happy to do so and append to this post!
The same applies to questions. If you have a question, go ahead and ask!
If you want me to speak about this, drop me an email!