Author Avatar Image
Alexander Reelsen

Backend developer, productivity fan, likes the JVM, full text search, distributed databases & systems

Using the common grams filter for faster queries
Apr 14, 2021
11 minutes read

TLDR; This article takes a look at customizable shingles using the common grams filter within Elasticsearch. This allows for a trade-off between required space for shingles and faster searches.

It’s all about stop words - again…

Stop words are omnipresent in your data. Sometimes it’s terms that occur often in a certain language like the, a, that, they, of and or. Sometimes these are terms, that occur just very often in your data set - think of medical terms when you are indexing patient reports or medicine manuals.

Those terms tend to occur in many documents and therefore the terms dictionary will point to a lot of documents for that term.

A common practice used to be to drop those stop words - sometimes to save space, sometimes to prevent slow queries, sometimes to not return irrelevant data. While all are somewhat correct, this comes with other problems. First you are not able to search for phrases anymore, second you might not be even able to find the data your are searching for, because a phrase like to be or not to be only contains stop words and thus documents cannot be found anymore, once stop words have been dropped on indexing.

Many users keep stop words in their data and accept the fact of a slightly larger terms dictionary. This blog post will show a couple of alternatives and their trade-offs. Let’s go…

Preparing the data

As usual, let’s grab a data set from the internet, and get it into Elasticsearch.

I picked the Million Song Dataset, which contains a single text file with around 500k tracks. Sounds like a good candidate for indexing. This is the link.

A single line looks like this:

1922<SEP>TRSGHLU128F421DF83<SEP>Alberta Hunter<SEP>Don't Pan Me

The separator is creatively named <SEP>, each row has four columns

  • The year of the song
  • A unique ID
  • Artist name
  • Song name

OK, let’s get this into a bulk format. We could try to be fancy and use things like xsv for this, but a little bit of good old awk is actually enough, including some cheap JSON escaping:

cat tracks_per_year.txt |  awk -F '<SEP>' '{  gsub(/"/, "\\\"", $4) ;  gsub(/"/, "\\\"", $3) ;  printf "{\"index\":{ \"_id\":\"%s\"}}\n{ \"artist\":\"%s\", \"title\":\"%s\", \"year\":%s}\n", $2, $3, $4,$1 }' > output.ndjson

What does this do? Let’s go through it step by step

  • gsub(/"/, "\\\"", $4) replaces double ticks with escaped double ticks for the song name (fourth field in the data)
  • gsub(/"/, "\\\"", $3) - same for the artists name
  • printf emulates the bulk index JSON format, using the ID as unique identifier and putting, artist, title and year in the JSON body

We’ll end up with a file around 54 MB in size. Let’s reduce the file size below 10MB for easier bulk indexing. A wc -l output.ndjson yields a little more than one million lines, so 150000 lines per file sounds like a good candidate.

split -l 150000 output.ndjson

You will now end up with seven files having around 8 MB per file. Now all we have to do, is indexing those:

for i in x* ; do 
  curl -s -H "Content-Type: application/x-ndjson" -XPOST \
    localhost:9200/artists/_bulk --data-binary "@$i"
  echo $i
done

This probably took some time (if you need it faster, forward the output of the bulk response to /dev/null as printing to a terminal always takes some time.

Let’s verify the amount of data being indexed

GET artists/_count

For me this is was 514907 on Elasticsearch 7.12 - so a few documents have not been indexed, but good enough for me - I suppose some more escaping would have been needed to index all documents.

Before we continue, let’s force merge out data to a single segment, so we can compare it better in the future:

POST artists/_forcemerge?max_num_segments=1

After force merging, you can use the index stats to figure out how many segments your index has

GET artists/_stats?human&filter_path=indices.artists.primaries.store.size,indices.artists.primaries.segments.count

returns this to me

{
  "indices" : {
    "artists" : {
      "primaries" : {
        "store" : {
          "size" : "52.7mb"
        },
        "segments" : {
          "count" : 1
        }
      }
    }
  }
}

Ok, so our data needs 52.7 megabyte in a single segment. Time for a first search.

Let’s start with the most boring search containing a stop word

GET artists/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "artist": {
              "query": "The Who",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

This returns songs from The Who and also reveals, that there are some duplicate songs in the data set - as always in life, data is rarely clean.

So, imagine we are searching for At the drive in (Relationship of Command, one of my favorite all time albums) but we only remember the beginning of the band name at the. Oh look! two stop words!

One thing we could do is switching to a match phrase query.

GET artists/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "artist": {
              "query": "at the"
            }
          }
        }
      ]
    }
  }
}

Or if we are sure about the term starting at the beginning we can go around with a match_phrase_prefix query

GET artists/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase_prefix": {
            "artist": {
              "query": "at the"
            }
          }
        }
      ]
    }
  }
}

Another common option would be to search for at the without an order, but boost the phrase query like this

GET artists/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "artist": "at the"
          }
        }
      ],
      "must": [
        {
          "match": {
            "artist": {
              "query": "at the",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

So where’s the cost for these queries? Of course, on my local machine those are still really fast, because a 50MB data set with 500k entries is not a lot too handle on my 2017 notebook - easily fitting in the page cache.

However there might be some potential to optimize. And the first thing that comes to mind is to improve the query side of things, but as usual thinking about the index side makes a lot of sense here - without removing stop words.

Common Terms Query - the deprecated way

There is a special query named the Common Terms Query, which divides the query terms into two groups, high and low frequency terms. By searching for the low frequency terms first and only then scoring for the high frequency terms the number of score calculations was greatly reduced.

What used to be a smart plan in previous Elasticsearch and Lucene versions, has nowadays become obsolete, because of the ability to skip blocks of documents efficiently. This was introduced in Elasticsearch 7.0 and so you can just use a match query.

So, this approach is here merely to historical reasons and to remind you, that you can get rid of that - I will not provide any examples here.

Using index_phrases in text fields

Instead of indexing single terms, we can index phrases using the index_phrases option in the mapping. Time for some reindex magic.

PUT artists-index-phrases
{
  "mappings": {
    "properties": {
      "artist": {
        "type": "text",
        "index_phrases" : true,
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

POST _reindex
{
  "source": {"index": "artists"},
  "dest": {"index": "artists-index-phrases"}
}

POST artists-index-phrases/_forcemerge?max_num_segments=1

GET artists-index-phrases/_stats?human&filter_path=indices.artists-index-phrases.primaries.store.size,indices.artists-index-phrases.primaries.segments.count

After creating a new index with the index_phrases option set to true, we can reindex and forcemerge our data down to a single segment, so that its size is comparable to our original data set.

Turns out, the data is 55.2 MB, and thus only 2.5 MB bigger. Indexing shingles may easily require far more space, but in this example, we used the artist name, which is rather short compared to something like a product description. This is a good example, that you should always test with your own data and don’t trust things on the internet 😀

Now, what have we gained by setting this option. Basically we got a free query optimization under the hood. When running the following query

GET artists-index-phrases/_search
{
  "explain": true, 
  "query": {
    "match_phrase": {
      "artist": "at the"
    }
  }
}

First, you get back correct results, but taking an extra look at the explain output, shows that the shingle field is used automatically, without us doing anything.

"_explanation" : {
  "value" : 6.609586,
  "description" : "weight(artist._index_phrase:at the in 4466) [PerFieldSimilarity], result of:",

Elasticsearch checks at query time if the field has index_phrases configured and if the slop is zero, the ._index_phrase field gets used automatically - which is faster, as this is a single lookup.

The middle ground - using common_grams

Despite us being lucky, let’s imagine for a second, that we would need much more space and that the amount of space might not be justified to the very few queries coming in using high frequency terms. Luckily there is a solution for that as well - namely the common_grams token filter.

The common grams token filter allows you to configure a list of common terms, that should be combined to a shingle (two terms as one in the terms dictionary). Let’s take a look at the EnglishAnalyzer and steal the stop word list.

GET /_analyze?filter_path=**.token
{
  "tokenizer": "standard",
  "filter": [
    "lowercase" ,
    {
      "type": "common_grams",
      "common_words": [
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
      ]
    }
  ],
  "text": "At the Drive-In"
}

This returns

{
  "tokens" : [
    {
      "token" : "at"
    },
    {
      "token" : "at_the"
    },
    {
      "token" : "the"
    },
    {
      "token" : "the_drive"
    },
    {
      "token" : "drive"
    },
    {
      "token" : "drive_in"
    },
    {
      "token" : "in"
    }
  ]
}

This looks good for the indexing side. We will index each common word also as a single term, because we want to find single term queries as well. Let’s do another round of reindexing

PUT artists-common-grams
{
  "mappings": {
    "properties": {
      "artist": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "common_grams_analyzer" : {
          "type" : "custom",
          "tokenizer" :"standard",
          "filter" : [ "lowercase", "stopword_common_grams"]
        }
      }, 
      "filter": {
        "stopword_common_grams" : {
          "type": "common_grams",
          "common_words": [
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
          ]
        }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "artists"},
  "dest": { "index": "artists-common-grams"}
}

GET artists-common-grams/_count

POST artists-common-grams/_flush

POST artists-common-grams/_forcemerge?max_num_segments=1

GET artists-common-grams/_stats?human&filter_path=indices.artists-common-grams.primaries.store.size,indices.artists-common-grams.primaries.segments.count

After running the last command, you should see, that your data is as big as the index data from the first index, so adding those shingles from your common words did not really attribute to size.

However, we have to do one last step to make this work, and that is setting a different default search analyzer. Take a moment to read the docs about the query_mode. Let’s take a look at the differences by running the following two requests:

GET /_analyze?filter_path=**.token
{
  "tokenizer": "standard",
  "filter": [
    "lowercase" ,
    {
      "type": "common_grams",
      "common_words": [
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
      ]
    }
  ],
  "text": "At the Drive-In"
}

GET /_analyze?filter_path=**.token
{
  "tokenizer": "standard",
  "filter": [
    "lowercase" ,
    {
      "type": "common_grams",
      "common_words": [
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
      ],
      "query_mode": true
    }
  ],
  "text": "At the Drive-In"
}

returns

# GET /_analyze?filter_path=**.token
{
  "tokens" : [
    {
      "token" : "at"
    },
    {
      "token" : "at_the"
    },
    {
      "token" : "the"
    },
    {
      "token" : "the_drive"
    },
    {
      "token" : "drive"
    },
    {
      "token" : "drive_in"
    },
    {
      "token" : "in"
    }
  ]
}

# GET /_analyze?filter_path=**.token
{
  "tokens" : [
    {
      "token" : "at_the"
    },
    {
      "token" : "the_drive"
    },
    {
      "token" : "drive_in"
    }
  ]
}

As you can see, the second request only returns the shingles - which is exactly what we need on query time. Just to be sure: If you append a non-stopword to that query, you will of course query for single terms. Try it out!

Let’s add a search analyzer to our mapping, which requires closing and opening the index.

POST artists-common-grams/_close

PUT artists-common-grams/_settings
{
  "analysis": {
    "analyzer": {
      "common_grams_query_analyzer" : {
        "type" : "custom",
        "tokenizer" :"standard",
        "filter" : [ "lowercase", "stopword_common_grams"]
      }
    }, 
    "filter": {
      "stopword_common_grams_query" : {
        "type": "common_grams",
        "common_words": [
      "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
        ],
        "query_mode" : true
      }
    }
  }
}

GET artists-common-grams/_settings

PUT artists-common-grams/_mapping
{
  "properties": {
    "artist": {
      "type" : "text",
      "search_analyzer": "common_grams_query_analyzer"
    }
  }
}

GET artists-common-grams/_mapping

POST artists-common-grams/_open

And now, finally, it’s time for a query with a proper explain.

GET artists-common-grams/_search
{
  "explain": true, 
  "query": {
    "match": {
      "artist": "drive in"
    }
  }
}

which returns

"details" : [
  {
    "value" : 9.762707,
    "description" : """weight(artist:"drive in" in 83495) [PerFieldSimilarity], result of:""",

This shows that drive in is used as a single term.

Et voila! Querying your data with your custom shingles to improve performance, while keeping storage at bay.

You might mention now, that this is a little tedious, to create two analyzers in the mapping and two filters. This is indeed and you may want to follow this GitHub issue for updates, which plans simplify this by using query_mode automatically or adding an option which common grams to store when specifying index_phrases in the mapping.

Have fun searching & finding!


Back to posts