Search hierarchies using the path hierarchy tokenizer

Mar 17, 2021

10 minutes read

TLDR; This article takes a look of how to approach a common problem, that usually gets’ solved with parent/child or recursive relationships by taking a look at a different solution by using a functionality called the path hierarchy tokenizer within Elasticsearch.

What use-case can you cover with this?

Before we dive deeper, let’s get the scope right! There are two common use-cases, that come to my mind immediately - I’m sure there are more! The first one is about modeling a file-system the the file paths looking like this:

/home/alr/devel/spinscale.de/website
/home/alr/devel/spinscale.de/website/index.html
/tmp/trash.xml

The second use-case comes from my prime example from my former job, an e-commerce product search engine. Each product could be in multiple categories, that were modeled as breadcrumbs.

Electronics > White goods > Refrigerators > Fridge/Freezer Combination

Before we dive into the details of how to store this, let’s get one question out of the way. What about the join field type?

Why not the `join` field type?

You may have stumbled over the join field type already. That field type in combination with the has_parent query, the has_child query, the children aggregation and the inner hits functionality allows to create a single 1:n or parent/child relationship within your data. Each parent and child are separate documents. This is sometimes considered as a solution for the above use-case, as this allows to create a hierarchy - which might be a good idea for the breadcrumbs above. Have a hierarchy Electronics, that contains all the White goods referenced to it. This comes with a couple of advantages and disadvantages:

Single document updates are easy
Moving a category to another is easy, a single document update with a new parent is sufficient
Powerful query capabilities to execute full text search against parent and children in the same query
To walk up the hierarchy, you would need as many parent queries, as you have depth
Every document can only have a single parent
Everything would need to be stored on a single shard

The last two points are fatal enough to not consider this a proper solution. The join field type has its use-cases (especially on high update rates either on the children or the parent and when you do not have a single root), but if possible, try to prevent it. And the above use case of e-commerce categorization or Unix paths do not require the join field type.

Let’s take a look how!

Understanding the `path_hierarchy` tokenizer

The main component to make everything work, is the path hierarchy tokenizer. Let’s get up and running with a proper analyze example first

GET _analyze?filter_path=**.token
{
  "tokenizer": {
    "type" : "path_hierarchy",
    "delimiter" : ">"
  }, 
  "text": [
    "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
  ]
}

This returns the following tokens:

{
  "tokens" : [
    {
      "token" : "Electronics "
    },
    {
      "token" : "Electronics > White goods "
    },
    {
      "token" : "Electronics > White goods > Refrigerators "
    },
    {
      "token" : "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
    }
  ]
}

This has one minor problem, and that is the use of white spaces at the end, let’s fix this by adding a trim filter:

GET _analyze?filter_path=**.token
{
  "tokenizer": {
    "type" : "path_hierarchy",
    "delimiter" : ">"
  },
  "filter": ["trim"], 
  "text": [
    "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
  ]
}

Side note: If your hierarchy is reversed, you can use the "reverse": true

GET _analyze?filter_path=**.token
{
  "tokenizer": {
    "type" : "path_hierarchy",
    "delimiter" : ">",
    "reverse" : true
  },
  "filter": ["trim"], 
  "text": [
    "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
  ]
}

which returns

{
  "tokens" : [
    {
      "token" : "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
    },
    {
      "token" : "White goods > Refrigerators > Fridge/Freezer Combination"
    },
    {
      "token" : "Refrigerators > Fridge/Freezer Combination"
    },
    {
      "token" : "Fridge/Freezer Combination"
    }
  ]
}

For our use-case we will stick with the regular tokenizer and not use reverse. Create an index with the appropriate mapping:

PUT products
{
  "mappings": {
    "properties": {
      "category": {
        "type": "text", 
        "analyzer": "breadcrumb"
      }
    }
  },
  "settings": {
    "analysis": {
      "tokenizer": {
        "breadcrumb_path_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": ">"
        }
      },
      "analyzer": {
        "breadcrumb": {
          "type": "custom",
          "tokenizer": "breadcrumb_path_hierarchy",
          "filter" : ["trim"]
        }
      }
    }
  }
}

and index sample documents

PUT products/_doc/lg-freezer
{
  "category" : "Electronics > White goods > Refrigerators > Fridge/Freezer Combination",
  "name" : "LG Freezer"
}

PUT products/_doc/bauknecht-washing-machine
{
  "category" : "Electronics > White goods > Washing > Washing Machines",
  "name" : "Bauknecht P-500"
}

PUT products/_doc/bauknecht-washing-machine-p-200
{
  "category" : "Electronics > White goods > Washing > Washing Machines",
  "name" : "Bauknecht P-200"
}

Querying for prefixes

Now let’s execute some search to understand what we can search for. Let’s start with searching all products in Washing Machines first and then for all White goods

GET products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "Electronics > White goods > Washing > Washing Machines"
          }
        }
      ]
    }
  }
}

If you want to search for white goods, you have to provide the whole breadcrumb like this

GET products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "Electronics > White goods"
          }
        }
      ]
    }
  }
}

So, this way we can query for all documents within a certain category, no matter if they are within a sub category. As we have the power of full text search, we could use the should part of a bool query to influence the score - maybe scoring the most recently bought products up or go fancy and take the preferences of a user into account.

Utilizing the terms lookup query

In order to find products within the same category, one could utilize the terms lookup query, and optionally also exclude the own document like this:

GET products/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "_id": "bauknecht-washing-machine"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "category": {
              "index": "products",
              "id": "bauknecht-washing-machine",
              "path": "category"
            }
          }
        }
      ]
    }
  }
}

The above will only work for the deepest category as this will directly extract the field from the source.

The complexity of updates

I have not stressed this enough in this article and it might not be obvious in this example: The ability to model a hierarchy like this and search within that hierarchy without actually creating a tree but solving this via full-text search is a really nice trick to not use something like the join field type.

One of the first things I usually tell at trainings or at public talks is, that search is always a trade off. Either you do things at index time (like enriching your documents or change the structure) in order to be fast at query time (using the path_hierarchy tokenizer) or you do not spend CPU time at indexing, but will probably end up with slower queries (this is the implementation of the join field type by using an in-memory data structure to join documents at query time).

Imagine you have change your category structure, and move all washing machines from

Electronics > White goods > Washing > Washing Machines

Electronics > Household > Washing Machines

If you modeled your product categories via the join field type, this might be just a single document that needs a new name. In our path hierarchy tokenizer case, we would need to run an update_by_query operation.

POST products/_update_by_query
{
  "script": {
    "source": "ctx._source.category = 'Electronics > Household > Washing Machines'",
    "lang": "painless"
  },
  "query": {
    "term": {
      "category": "Electronics > White goods > Washing > Washing Machines"
    }
  }
}

So, the number of updates along with the duration is dependent from the number of documents within a category - and thus might take some time with a massive data set. As this is not transactional, a proper update strategy might be to update this in two runs. First add the new category, then remove the old one - see the next paragraph.

This is a common example for having to do work at index time to ensure that query time operations are as fast possible.

Support for multiple paths

Another drive by feature is the support of multiple path - something that is not possible with the join field type as you could only specify a single parent. All you have to do, is to have an array of strings in the category field like this

PUT products/_doc/bauknecht-washing-machine
{
  "category": [
    "Electronics > Household > Washing Machines",
    "Electronics > White goods > Washing > Washing Machines"
  ],
  "name": "Bauknecht P-500"
}

Now you could pick any category of the categories (or any prefix like Electronics > Household to filter for documents.

Using a different search analyzer

You may have noticed, that I only used term queries in all my examples. That happened partly due to laziness and to keep the examples small, but might not reflect the real world. Due to the analyzer configuration, there was no need to take something like synonym support or lower casing into account.

Let’s have an example, why an analyzed query using the index analyzer is a problem:

PUT products/_doc/wrong-shown-product
{
  "category": [
    "Electronics > Washing > Washing Machines"
  ],
  "name": "SHOULD NOT BE SHOWN"
}

GET products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "category": "Electronics > Household > Washing Machines"
          }
        }
      ]
    }
  }
}

Running the last query will return all results, even though we probably only expected two. Why is this?

Let’s take a look at the output of the query explanation for the product we did not expect using the explain API

GET products/_explain/wrong-shown-product
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "category": "Electronics > Household > Washing Machines"
          }
        }
      ]
    }
  }
}

shows the following for the above indexed product:

{
  "_index" : "products",
  "_type" : "_doc",
  "_id" : "wrong-shown-product",
  "matched" : true,
  "explanation" : {
    "value" : 0.0,
    "description" : "ConstantScore(Synonym(category:Electronics category:Electronics > Household category:Electronics > Household > Washing Machines))^0.0",
    "details" : [ ]
  }
}

Because of using the same analyzer when indexing data, this search actually searches for category:Electronics so that every document matches. Let’s switch the analyzer being used (note the slightly changed structure of the query):

GET products/_explain/wrong-shown-product
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "category": {
              "query": "Electronics > Household > Washing Machines",
              "analyzer" : "keyword"
            }
          }
        }
      ]
    }
  }
}

This will correctly returned matched: false, as the query will not be analyzed - which in this context is the same as using a term query.

So, what could be a good use-case when you need an analyzed query? Whenever the category filter is based on user input. If you are modeling file paths instead of e-commerce categories, you might want to lowercase a path first before filtering by category, because you model a case-insensitive file system.

Summary

I hope you enjoyed this ride. The main purpose of this blog post is not to show you any example of how a certain use-case is solved, but rather keep your mind bent that sometimes solving a search use-case requires different thinking than how you would solve this with other data stores - including advantages and disadvantages.

Final remarks

If you made it down here, wooow! Thanks for sticking with me. You can follow or ping me on twitter, github or reach me via Email.

If there is anything to correct, drop me a note, and I am happy to do so!

Same applies for questions. If you have question, go ahead and ask!

Back to posts

Backend developer, productivity fan, likes the JVM, full text search, distributed databases & systems