TLDR; This article looks at a common problem that usually gets solved with parent/child or recursive relationships, and explores a different solution: a functionality called the path hierarchy tokenizer within Elasticsearch.
What use-case can you cover with this?
Before we dive deeper, let’s get the scope right! There are two common use-cases that come to my mind immediately - I’m sure there are more! The first one is about modeling a file system, with file paths looking like this:
/home/alr/devel/spinscale.de/website
/home/alr/devel/spinscale.de/website/index.html
/tmp/trash.xml
The second use-case comes from my prime example at my former job, an e-commerce product search engine. Each product could be in multiple categories, which were modeled as breadcrumbs.
Electronics > White goods > Refrigerators > Fridge/Freezer Combination
Before we dive into the details of how to store this, let’s get one question out of the way. What about the join field type?
Why not the join field type?
You may have stumbled over the join field type already. That field type, in combination with the has_parent query, the has_child query, the children aggregation and the inner hits functionality, allows you to create a single 1:n or parent/child relationship within your data.
Each parent and child are separate documents. This is sometimes considered a solution for the above use-case, as it allows you to create a hierarchy - which might be a good idea for the breadcrumbs above: have a hierarchy Electronics that contains all the White goods referencing it. This comes with a couple of advantages and disadvantages:
- Single document updates are easy
- Moving a category to another parent is easy; a single document update with a new parent is sufficient
- Powerful query capabilities to execute full text search against parent and children in the same query
- To walk up the hierarchy, you would need as many parent queries as you have levels of depth
- Every document can only have a single parent
- Everything would need to be stored on a single shard
The last two points are fatal enough to not consider this a proper solution. The join field type has its use-cases (especially with high update rates on either the children or the parent, and when you do not have a single root), but if possible, try to avoid it. And the above use-cases of e-commerce categorization and Unix paths do not require the join field type. Let’s take a look at how!
Understanding the path_hierarchy tokenizer
The main component to make everything work is the path hierarchy tokenizer. Let’s get up and running with a proper _analyze example first:
GET _analyze?filter_path=**.token
{
"tokenizer": {
"type" : "path_hierarchy",
"delimiter" : ">"
},
"text": [
"Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
]
}
This returns the following tokens:
{
"tokens" : [
{
"token" : "Electronics "
},
{
"token" : "Electronics > White goods "
},
{
"token" : "Electronics > White goods > Refrigerators "
},
{
"token" : "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
}
]
}
This has one minor problem, and that is the trailing whitespace on each token. Let’s fix this by adding a trim filter:
GET _analyze?filter_path=**.token
{
"tokenizer": {
"type" : "path_hierarchy",
"delimiter" : ">"
},
"filter": ["trim"],
"text": [
"Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
]
}
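To build an intuition for the token stream, here is a rough Python emulation of what the path_hierarchy tokenizer plus the trim filter produce - an illustration of the resulting tokens only, not the actual Lucene implementation:

```python
# Rough emulation of Elasticsearch's path_hierarchy tokenizer plus the
# trim token filter. Illustration only, not the real implementation.
def path_hierarchy_tokens(text, delimiter=">", reverse=False):
    parts = text.split(delimiter)
    if reverse:
        # reverse=true emits ever-shorter suffixes of the path
        raw = [delimiter.join(parts[i:]) for i in range(len(parts))]
    else:
        # default: emit ever-longer prefixes of the path
        raw = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    # the trim filter removes leading/trailing whitespace from each token
    return [token.strip() for token in raw]

breadcrumb = "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
for token in path_hierarchy_tokens(breadcrumb):
    print(token)
```

Running this prints each prefix of the breadcrumb on its own line, mirroring the trimmed token output of the _analyze call above.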
Side note: if your hierarchy is reversed, you can use the "reverse": true setting:
GET _analyze?filter_path=**.token
{
"tokenizer": {
"type" : "path_hierarchy",
"delimiter" : ">",
"reverse" : true
},
"filter": ["trim"],
"text": [
"Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
]
}
which returns
{
"tokens" : [
{
"token" : "Electronics > White goods > Refrigerators > Fridge/Freezer Combination"
},
{
"token" : "White goods > Refrigerators > Fridge/Freezer Combination"
},
{
"token" : "Refrigerators > Fridge/Freezer Combination"
},
{
"token" : "Fridge/Freezer Combination"
}
]
}
For our use-case we will stick with the regular tokenizer and not use reverse. Create an index with the appropriate mapping:
PUT products
{
"mappings": {
"properties": {
"category": {
"type": "text",
"analyzer": "breadcrumb"
}
}
},
"settings": {
"analysis": {
"tokenizer": {
"breadcrumb_path_hierarchy": {
"type": "path_hierarchy",
"delimiter": ">"
}
},
"analyzer": {
"breadcrumb": {
"type": "custom",
"tokenizer": "breadcrumb_path_hierarchy",
"filter" : ["trim"]
}
}
}
}
}
and index a few sample documents:
PUT products/_doc/lg-freezer
{
"category" : "Electronics > White goods > Refrigerators > Fridge/Freezer Combination",
"name" : "LG Freezer"
}
PUT products/_doc/bauknecht-washing-machine
{
"category" : "Electronics > White goods > Washing > Washing Machines",
"name" : "Bauknecht P-500"
}
PUT products/_doc/bauknecht-washing-machine-p-200
{
"category" : "Electronics > White goods > Washing > Washing Machines",
"name" : "Bauknecht P-200"
}
Querying for prefixes
Now let’s execute some searches to understand what we can search for. Let’s start by searching for all products in Washing Machines first and then for all White goods:
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"category": "Electronics > White goods > Washing > Washing Machines"
}
}
]
}
}
}
If you want to search for white goods, you have to provide the full breadcrumb path down to that category, like this:
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"category": "Electronics > White goods"
}
}
]
}
}
}
So, this way we can query for all documents within a certain category, no matter whether they are in a sub-category. As we have the power of full text search, we could use the should part of a bool query to influence the score - maybe boosting the most recently bought products, or going fancy and taking the preferences of a user into account.
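As a sketch of that idea (the boost on the Bauknecht brand name here is purely illustrative), a bool query could filter on the category and use a should clause to rank matching products higher:

```json
GET products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "Electronics > White goods"
          }
        }
      ],
      "should": [
        {
          "match": {
            "name": "Bauknecht"
          }
        }
      ]
    }
  }
}
```

The filter clause does not influence scoring, so the ranking is driven entirely by the should clause.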
Utilizing the terms lookup query
In order to find products within the same category, one could utilize the terms lookup query, and optionally also exclude the document itself like this:
GET products/_search
{
"query": {
"bool": {
"must_not": [
{
"term": {
"_id": "bauknecht-washing-machine"
}
}
],
"filter": [
{
"terms": {
"category": {
"index": "products",
"id": "bauknecht-washing-machine",
"path": "category"
}
}
}
]
}
}
}
The above will only work for the deepest category, as the terms lookup directly extracts the field value from the _source of the referenced document.
The complexity of updates
I have not stressed this enough in this article, and it might not be obvious in this example: the ability to model a hierarchy like this and search within that hierarchy without actually creating a tree, but solving this via full-text search, is a really nice trick to avoid something like the join field type.
One of the first things I usually tell at trainings or public talks is that search is always a trade-off. Either you do work at index time (like enriching your documents or changing the structure) in order to be fast at query time (using the path_hierarchy tokenizer), or you do not spend CPU time at indexing, but will probably end up with slower queries (this is the implementation of the join field type, which uses an in-memory data structure to join documents at query time).
Imagine you have to change your category structure, and move all washing machines from
Electronics > White goods > Washing > Washing Machines
to
Electronics > Household > Washing Machines
If you modeled your product categories via the join field type, this might be just a single document that needs a new name. In our path hierarchy tokenizer case, we would need to run an update_by_query operation.
POST products/_update_by_query
{
"script": {
"source": "ctx._source.category = 'Electronics > Household > Washing Machines'",
"lang": "painless"
},
"query": {
"term": {
"category": "Electronics > White goods > Washing > Washing Machines"
}
}
}
So, the number of updates along with the duration depends on the number of documents within a category - and thus might take some time with a massive data set. As this is not transactional, a proper update strategy might be to do this in two runs: first add the new category, then remove the old one - see the next paragraph.
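A sketch of the first step of such a two-run strategy, assuming category is still a single string: the script turns the field into an array containing both the old and the new breadcrumb, so that documents remain findable under either one.

```json
POST products/_update_by_query
{
  "script": {
    "source": "ctx._source.category = [ctx._source.category, 'Electronics > Household > Washing Machines']",
    "lang": "painless"
  },
  "query": {
    "term": {
      "category": "Electronics > White goods > Washing > Washing Machines"
    }
  }
}
```

Once all clients query via the new breadcrumb, a second update_by_query can remove the old entry from the array.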
This is a common example of having to do work at index time to ensure that query time operations are as fast as possible.
Support for multiple paths
Another drive by feature is the support of multiple path - something that is
not possible with the join
field type as you could only specify a single
parent. All you have to do, is to have an array of strings in the category
field like this
PUT products/_doc/bauknecht-washing-machine
{
"category": [
"Electronics > Household > Washing Machines",
"Electronics > White goods > Washing > Washing Machines"
],
"name": "Bauknecht P-500"
}
Now you could pick any of the categories (or any prefix like Electronics > Household) to filter for documents.
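For example, a filter on the Electronics > Household prefix now matches the washing machine as well, even though its second breadcrumb lives in a completely different branch:

```json
GET products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "Electronics > Household"
          }
        }
      ]
    }
  }
}
```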
Using a different search analyzer
You may have noticed that I only used term queries in all my examples. That happened partly due to laziness and to keep the examples small, but might not reflect the real world. Due to the analyzer configuration, there was no need to take something like synonym support or lowercasing into account.
Let’s look at an example of why an analyzed query using the index analyzer is a problem:
PUT products/_doc/wrong-shown-product
{
"category": [
"Electronics > Washing > Washing Machines"
],
"name": "SHOULD NOT BE SHOWN"
}
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"match": {
"category": "Electronics > Household > Washing Machines"
}
}
]
}
}
}
Running the last query will return all results, even though we probably only expected two. Why is this?
Let’s take a look at the output of the query explanation for the product we did not expect using the explain API
GET products/_explain/wrong-shown-product
{
"query": {
"bool": {
"filter": [
{
"match": {
"category": "Electronics > Household > Washing Machines"
}
}
]
}
}
}
shows the following for the above indexed product:
{
"_index" : "products",
"_type" : "_doc",
"_id" : "wrong-shown-product",
"matched" : true,
"explanation" : {
"value" : 0.0,
"description" : "ConstantScore(Synonym(category:Electronics category:Electronics > Household category:Electronics > Household > Washing Machines))^0.0",
"details" : [ ]
}
}
Because the same analyzer is used as when indexing data, this search also searches for category:Electronics, so that every document matches. Let’s switch the analyzer being used (note the slightly changed structure of the query):
GET products/_explain/wrong-shown-product
{
"query": {
"bool": {
"filter": [
{
"match": {
"category": {
"query": "Electronics > Household > Washing Machines",
"analyzer" : "keyword"
}
}
}
]
}
}
}
This will correctly return matched: false, as the query will not be analyzed - which in this context is the same as using a term query.
So, what could be a good use-case when you need an analyzed query? Whenever the category filter is based on user input. If you are modeling file paths instead of e-commerce categories, you might want to lowercase a path first before filtering by category, because you model a case-insensitive file system.
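As a sketch of that file-system case, a hypothetical files index could combine the path_hierarchy tokenizer (with / as the delimiter) with a lowercase filter at index time:

```json
PUT files
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "file_path_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        }
      },
      "analyzer": {
        "file_path": {
          "type": "custom",
          "tokenizer": "file_path_hierarchy",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "path": {
        "type": "text",
        "analyzer": "file_path"
      }
    }
  }
}
```

Since every indexed path token is lowercased, a term filter on a lowercased user-supplied path (or a match query with a lowercasing, non-splitting search analyzer) would match regardless of the casing of the original path.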
Summary
I hope you enjoyed this ride. The main purpose of this blog post is not just to show you how a certain use-case is solved, but rather to remind you that solving a search use-case sometimes requires different thinking than how you would solve it with other data stores - including different advantages and disadvantages.
Final remarks
If you made it down here, wooow! Thanks for sticking with me. You can follow or ping me on Twitter, GitHub or reach me via email.
If there is anything to correct, drop me a note, and I am happy to do so!
Same applies for questions. If you have question, go ahead and ask!