TLDR; This article takes a look at customizable shingles using the common grams filter within Elasticsearch. This allows for a trade-off between required space for shingles and faster searches.
It’s all about stop words - again…
Stop words are omnipresent in your data. Sometimes these are terms that occur often in a certain language, like the, a, that, they, of and or. Sometimes these are terms that just occur very often in your data set - think of medical terms when you are indexing patient reports or medicine manuals.
Those terms tend to occur in many documents and therefore the terms dictionary will point to a lot of documents for that term.
A common practice used to be to drop those stop words - sometimes to save space, sometimes to prevent slow queries, sometimes to not return irrelevant data. While all of these are somewhat valid reasons, dropping stop words comes with other problems. First, you are not able to search for phrases anymore; second, you might not even be able to find the data you are searching for, because a phrase like to be or not to be only contains stop words, and thus such documents cannot be found anymore once stop words have been dropped at indexing time.
Many users keep stop words in their data and accept the fact of a slightly larger terms dictionary. This blog post will show a couple of alternatives and their trade-offs. Let’s go…
Preparing the data
As usual, let’s grab a data set from the internet and get it into Elasticsearch. I picked the Million Song Dataset, which provides a single text file with around 500k tracks - sounds like a good candidate for indexing.
A single line looks like this:
1922<SEP>TRSGHLU128F421DF83<SEP>Alberta Hunter<SEP>Don't Pan Me
The separator is creatively named <SEP>, and each row has four columns:
- The year of the song
- A unique ID
- Artist name
- Song name
OK, let’s get this into a bulk format. We could try to be fancy and use things like xsv for this, but a little bit of good old awk is actually enough, including some cheap JSON escaping:
cat tracks_per_year.txt | awk -F '<SEP>' '{ gsub(/"/, "\\\"", $4) ; gsub(/"/, "\\\"", $3) ; printf "{\"index\":{ \"_id\":\"%s\"}}\n{ \"artist\":\"%s\", \"title\":\"%s\", \"year\":%s}\n", $2, $3, $4,$1 }' > output.ndjson
What does this do? Let’s go through it step by step:
- gsub(/"/, "\\\"", $4) replaces double quotes with escaped double quotes in the song name (the fourth field in the data)
- gsub(/"/, "\\\"", $3) - same for the artist name
- printf emulates the bulk index JSON format, using the ID as the unique identifier and putting artist, title and year into the JSON body
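If awk is not your thing, the same transformation can be sketched in Python - a minimal sketch assuming the four-column <SEP> layout shown above, where json.dumps takes care of the quote escaping the awk one-liner did by hand:

```python
import json

SEP = "<SEP>"

def to_bulk_lines(line):
    """Turn one <SEP>-separated track row into the two lines of a bulk index action."""
    year, track_id, artist, title = line.rstrip("\n").split(SEP)
    action = {"index": {"_id": track_id}}
    # json.dumps handles the quote escaping automatically
    doc = {"artist": artist, "title": title, "year": int(year)}
    return json.dumps(action) + "\n" + json.dumps(doc)

# Example row from the data set
print(to_bulk_lines("1922<SEP>TRSGHLU128F421DF83<SEP>Alberta Hunter<SEP>Don't Pan Me"))
```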
We’ll end up with a file around 54 MB in size. Let’s split it into files below 10 MB each for easier bulk indexing. Running wc -l output.ndjson yields a little more than one million lines, so 150000 lines per file sounds like a good candidate.
split -l 150000 output.ndjson
You will now end up with seven files of around 8 MB each. Now all we have to do is index those:
for i in x* ; do
curl -s -H "Content-Type: application/x-ndjson" -XPOST \
localhost:9200/artists/_bulk --data-binary "@$i"
echo $i
done
This probably takes some time (if you need it faster, redirect the output of the bulk response to /dev/null, as printing to a terminal always takes some time).
Let’s verify the number of documents indexed:
GET artists/_count
For me this was 514907 on Elasticsearch 7.12 - so a few documents have not been indexed, but good enough for me; I suppose some more escaping would have been needed to index all documents.
Before we continue, let’s force merge our data down to a single segment, so we can compare sizes better later on:
POST artists/_forcemerge?max_num_segments=1
After force merging, you can use the index stats to figure out how many segments your index has
GET artists/_stats?human&filter_path=indices.artists.primaries.store.size,indices.artists.primaries.segments.count
returns this to me
{
"indices" : {
"artists" : {
"primaries" : {
"store" : {
"size" : "52.7mb"
},
"segments" : {
"count" : 1
}
}
}
}
}
Ok, so our data needs 52.7 megabytes in a single segment. Time for a first search.
Match query search
Let’s start with the most boring search containing a stop word
GET artists/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"artist": {
"query": "The Who",
"operator": "and"
}
}
}
]
}
}
}
This returns songs from The Who and also reveals that there are some duplicate songs in the data set - as always in life, data is rarely clean.
So, imagine we are searching for At the Drive-In (Relationship of Command, one of my all-time favorite albums), but we only remember the beginning of the band name: at the. Oh look, two stop words!
One thing we could do is switch to a match phrase query.
GET artists/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"artist": {
"query": "at the"
}
}
}
]
}
}
}
Or, if we are sure the query matches the beginning of the field, we can get by with a match_phrase_prefix query:
GET artists/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase_prefix": {
"artist": {
"query": "at the"
}
}
}
]
}
}
}
Another common option would be to search for at the without requiring an order, but boosting the phrase query like this:
GET artists/_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"artist": "at the"
}
}
],
"must": [
{
"match": {
"artist": {
"query": "at the",
"operator": "and"
}
}
}
]
}
}
}
So where’s the cost of these queries? Of course, on my local machine those are still really fast, because a 50 MB data set with 500k entries is not a lot to handle for my 2017 notebook - it easily fits in the page cache.
However, there might be some potential to optimize. The first thing that comes to mind is to improve the query side of things, but as usual, thinking about the index side makes a lot of sense here - without removing stop words.
Common Terms Query - the deprecated way
There is a special query named the Common Terms Query, which divides the query terms into two groups: high and low frequency terms. By searching for the low frequency terms first and only then scoring the high frequency terms, the number of score calculations was greatly reduced.
What used to be a smart plan in previous Elasticsearch and Lucene versions has nowadays become obsolete because of the ability to skip blocks of documents efficiently (block-max WAND). This was introduced in Elasticsearch 7.0, so you can just use a match query.
So, this approach is mentioned here merely for historical reasons and to remind you that you can get rid of it - I will not provide any examples here.
Using index_phrases in text fields
Instead of indexing single terms, we can index phrases using the index_phrases option in the mapping. Time for some reindex magic.
PUT artists-index-phrases
{
"mappings": {
"properties": {
"artist": {
"type": "text",
"index_phrases" : true,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
POST _reindex
{
"source": {"index": "artists"},
"dest": {"index": "artists-index-phrases"}
}
POST artists-index-phrases/_forcemerge?max_num_segments=1
GET artists-index-phrases/_stats?human&filter_path=indices.artists-index-phrases.primaries.store.size,indices.artists-index-phrases.primaries.segments.count
After creating a new index with the index_phrases option set to true, we can reindex and force merge our data down to a single segment, so that its size is comparable to our original data set.
Turns out the data is 55.2 MB, and thus only 2.5 MB bigger. Indexing shingles may easily require far more space, but in this example we used the artist name, which is rather short compared to something like a product description. This is a good example of why you should always test with your own data and not blindly trust things on the internet 😀
Now, what have we gained by setting this option? Basically, we got a free query optimization under the hood. Run the following query:
GET artists-index-phrases/_search
{
"explain": true,
"query": {
"match_phrase": {
"artist": "at the"
}
}
}
First, you get back correct results. But taking an extra look at the explain output shows that the shingle field is used automatically, without us doing anything:
"_explanation" : {
"value" : 6.609586,
"description" : "weight(artist._index_phrase:at the in 4466) [PerFieldSimilarity], result of:",
Elasticsearch checks at query time if the field has index_phrases configured, and if the slop is zero, the ._index_phrase field gets used automatically - which is faster, as this is a single lookup.
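A quick way to verify this yourself (a suggested experiment, not from the original numbers above): run the same phrase query with a slop greater than zero and compare the explain output - since the shingle field only covers exact phrases, the plain artist field should be used instead:

```
GET artists-index-phrases/_search
{
  "explain": true,
  "query": {
    "match_phrase": {
      "artist": {
        "query": "at the",
        "slop": 1
      }
    }
  }
}
```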
The middle ground - using common_grams
Even though we were lucky here, let’s imagine for a second that we needed much more space, and that this amount of space might not be justified by the very few queries coming in that use high frequency terms. Luckily there is a solution for that as well - namely the common_grams token filter.
The common grams token filter allows you to configure a list of common terms that should be combined into a shingle (two terms stored as one in the terms dictionary). Let’s take a look at the EnglishAnalyzer and steal its stop word list.
GET /_analyze?filter_path=**.token
{
"tokenizer": "standard",
"filter": [
"lowercase" ,
{
"type": "common_grams",
"common_words": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
]
}
],
"text": "At the Drive-In"
}
This returns
{
"tokens" : [
{
"token" : "at"
},
{
"token" : "at_the"
},
{
"token" : "the"
},
{
"token" : "the_drive"
},
{
"token" : "drive"
},
{
"token" : "drive_in"
},
{
"token" : "in"
}
]
}
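To build an intuition for what the filter does, here is a minimal Python sketch of the pairing logic - a rough approximation for illustration, not the actual Lucene implementation (the real filter also tracks positions and offsets):

```python
def common_grams(tokens, common_words, query_mode=False):
    """Rough emulation of the common_grams token filter output order."""
    grams = {}        # gram starting at position i
    covered = set()   # positions that ended up inside a gram
    for i in range(len(tokens) - 1):
        if tokens[i] in common_words or tokens[i + 1] in common_words:
            grams[i] = tokens[i] + "_" + tokens[i + 1]
            covered.update({i, i + 1})
    out = []
    for i, tok in enumerate(tokens):
        # index mode keeps every unigram; query mode only keeps unigrams
        # that are not part of any gram
        if not query_mode or i not in covered:
            out.append(tok)
        if i in grams:
            out.append(grams[i])
    return out

common = {"at", "the", "in", "of", "and"}  # abbreviated stop word set
print(common_grams(["at", "the", "drive", "in"], common))
# ['at', 'at_the', 'the', 'the_drive', 'drive', 'drive_in', 'in']
```

With query_mode=True the same input collapses to just the three shingles, mirroring the query-side behavior discussed further below.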
This looks good for the indexing side. Each common word is also indexed as a single term, because we want single term queries to match as well. Let’s do another round of reindexing:
PUT artists-common-grams
{
"mappings": {
"properties": {
"artist": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"common_grams_analyzer" : {
"type" : "custom",
"tokenizer" :"standard",
"filter" : [ "lowercase", "stopword_common_grams"]
}
},
"filter": {
"stopword_common_grams" : {
"type": "common_grams",
"common_words": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
]
}
}
}
}
}
POST _reindex
{
"source": { "index": "artists"},
"dest": { "index": "artists-common-grams"}
}
GET artists-common-grams/_count
POST artists-common-grams/_flush
POST artists-common-grams/_forcemerge?max_num_segments=1
GET artists-common-grams/_stats?human&filter_path=indices.artists-common-grams.primaries.store.size,indices.artists-common-grams.primaries.segments.count
After running the last command, you should see that your data is about as big as the data from the first index, so adding those shingles for your common words did not really contribute much to the size.
However, we have to do one last step to make this work, and that is setting a different default search analyzer. Take a moment to read the docs about query_mode. Let’s take a look at the differences by running the following two requests:
GET /_analyze?filter_path=**.token
{
"tokenizer": "standard",
"filter": [
"lowercase" ,
{
"type": "common_grams",
"common_words": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
]
}
],
"text": "At the Drive-In"
}
GET /_analyze?filter_path=**.token
{
"tokenizer": "standard",
"filter": [
"lowercase" ,
{
"type": "common_grams",
"common_words": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
],
"query_mode": true
}
],
"text": "At the Drive-In"
}
returns
# GET /_analyze?filter_path=**.token
{
"tokens" : [
{
"token" : "at"
},
{
"token" : "at_the"
},
{
"token" : "the"
},
{
"token" : "the_drive"
},
{
"token" : "drive"
},
{
"token" : "drive_in"
},
{
"token" : "in"
}
]
}
# GET /_analyze?filter_path=**.token
{
"tokens" : [
{
"token" : "at_the"
},
{
"token" : "the_drive"
},
{
"token" : "drive_in"
}
]
}
As you can see, the second request only returns the shingles - which is exactly what we need at query time. Just to be sure: if you append a non-stopword to that query, you will of course also query for single terms. Try it out!
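For instance (a suggested variation of the request above, with the common word list shortened for brevity), prepend a couple of regular terms and run the analysis with query_mode enabled again - a term like greatest, which is not adjacent to any common word, comes back as a plain unigram, while everything next to a stop word ends up in a shingle:

```
GET /_analyze?filter_path=**.token
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "common_grams",
      "common_words": [ "at", "the", "in" ],
      "query_mode": true
    }
  ],
  "text": "Greatest hits at the Drive-In"
}
```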
Let’s add a search analyzer to our mapping, which requires closing and opening the index.
POST artists-common-grams/_close
PUT artists-common-grams/_settings
{
"analysis": {
"analyzer": {
"common_grams_query_analyzer" : {
"type" : "custom",
"tokenizer" :"standard",
"filter" : [ "lowercase", "stopword_common_grams"]
}
},
"filter": {
"stopword_common_grams_query" : {
"type": "common_grams",
"common_words": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
],
"query_mode" : true
}
}
}
}
GET artists-common-grams/_settings
PUT artists-common-grams/_mapping
{
"properties": {
"artist": {
"type" : "text",
"search_analyzer": "common_grams_query_analyzer"
}
}
}
GET artists-common-grams/_mapping
POST artists-common-grams/_open
And now, finally, it’s time for a query with a proper explain.
GET artists-common-grams/_search
{
"explain": true,
"query": {
"match": {
"artist": "drive in"
}
}
}
which returns
"details" : [
{
"value" : 9.762707,
"description" : """weight(artist:"drive in" in 83495) [PerFieldSimilarity], result of:""",
This shows that drive in is used as a single term.
Et voilà! Querying your data with your custom shingles improves performance, while keeping storage at bay.
You might mention now that this is a little tedious - creating two analyzers and two filters in the mapping. This is indeed the case, and you may want to follow this GitHub issue for updates, which plans to simplify this by using query_mode automatically, or by adding an option to specify which common grams to store when index_phrases is set in the mapping.
Have fun searching & finding!