TL;DR: This blog post dives into the new combined_fields query, which was added in Elasticsearch 7.13, and explains why it’s a really nice addition, especially for the e-commerce use case. However, in order to understand this query better, we will also spend some time with the multi_match query. And of course you will learn about the complexity of the German language within search - because who doesn’t?
What’s the issue
Let’s take a look at a grossly simplified JSON representation of an e-commerce product.
{
"name" : "Gestreiftes Kleid / Abendkleid",
"color" : "rot",
"brand" : "Esprit",
"size" : "L",
"price" : "32.49"
}
This intentionally ignores product variants, in-stock data, size/uom normalization etc.
Let’s quickly index five documents for testing
PUT products/_bulk?refresh
{ "index" : { "_id" : 1} }
{"name":"Gestreiftes Kleid","color":["gelb","blau"]}
{ "index" : { "_id" : 2} }
{"name":"Gestreiftes Kleid / Abendkleid","color":"rot","brand":"Esprit","size":"L","price":"32.49"}
{ "index" : { "_id" : 3} }
{"name":"Gestreiftes Kleid","color":["creme rose"]}
{ "index" : { "_id" : 4} }
{"name":"Gestreiftes Kleid 3000","color":["rot"]}
{ "index" : { "_id" : 5} }
{"name":"Hoodie Faster Runner","brand":"nike","size":"XL","color":"black"}
A common user search might be kleid rot. It consists of two terms that live in two different fields, so each field matches only one of them. This is hard with an AND query, because by default the multi_match query expects all terms to be present in a single field. The following does not return any results:
GET products/_search
{
"query": {
"multi_match": {
"query": "kleid rot",
"fields": ["name", "brand", "size", "color"],
"operator": "and"
}
}
}
Changing the operator from "and" to "or" returns every document that contains kleid or rot, which in our data set means every dress.
GET products/_search
{
"query": {
"multi_match": {
"query": "kleid rot",
"fields": ["name", "brand", "size", "color"],
"operator": "or"
}
}
}
With this particular query we cannot play around with minimum_should_match, as the query consists of only two terms.
One way to dig into this is using the most_fields type in the multi_match query:
GET products/_search
{
"query": {
"multi_match": {
"query": "kleid rot",
"fields": ["name", "brand", "size", "color"],
"operator": "or",
"type": "most_fields"
}
}
}
This sums up the scores of all fields and shows that documents that have rot in their color and kleid in their name are scored the highest. However, the third document does not contain rot, and so you may end up with a lot of unwanted documents - which still may be fine in an e-commerce setup.
We can make one final change to get only proper matches and use the cross_fields type with an "and" operator:
GET products/_search
{
"query": {
"multi_match": {
"query": "kleid rot",
"fields": ["name", "brand", "size", "color"],
"operator": "and",
"type": "cross_fields"
}
}
}
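To see why the first AND query came back empty while cross_fields finds matches, here is a minimal Python sketch of the two matching strategies. It is a toy model and not Elasticsearch code: the regex tokenizer only approximates the standard analyzer, the function names field_centric_and and term_centric_and are made up for illustration, and scoring is ignored entirely.

```python
import re

DOCS = {
    1: {"name": "Gestreiftes Kleid", "color": ["gelb", "blau"]},
    2: {"name": "Gestreiftes Kleid / Abendkleid", "color": "rot",
        "brand": "Esprit", "size": "L"},
    3: {"name": "Gestreiftes Kleid", "color": ["creme rose"]},
    4: {"name": "Gestreiftes Kleid 3000", "color": ["rot"]},
    5: {"name": "Hoodie Faster Runner", "brand": "nike",
        "size": "XL", "color": "black"},
}
FIELDS = ["name", "brand", "size", "color"]


def tokens(value):
    """Lowercase and split on non-word characters (rough stand-in for the standard analyzer)."""
    values = value if isinstance(value, list) else [value]
    return {tok for v in values for tok in re.findall(r"\w+", str(v).lower())}


def field_centric_and(query, doc):
    """Match only if one single field contains every query term."""
    terms = tokens(query)
    return any(terms <= tokens(doc.get(field, "")) for field in FIELDS)


def term_centric_and(query, doc):
    """Match if every query term shows up in at least one of the fields."""
    return all(
        any(term in tokens(doc.get(field, "")) for field in FIELDS)
        for term in tokens(query)
    )


field_hits = [doc_id for doc_id, doc in DOCS.items() if field_centric_and("kleid rot", doc)]
term_hits = [doc_id for doc_id, doc in DOCS.items() if term_centric_and("kleid rot", doc)]
print(field_hits)  # []
print(term_hits)   # [2, 4]
```

The term-centric variant is the behavior you want for a query like kleid rot: no single field has to satisfy the whole query, but every term has to be found somewhere.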
This only returns two documents. We could now also go with minimum_should_match and an "or" query like this:
GET products/_search
{
"query": {
"multi_match": {
"query": "kleid rot esprit",
"fields": ["name", "brand", "size", "color"],
"operator": "or",
"minimum_should_match": 2,
"type": "cross_fields"
}
}
}
This still returns two documents - and the red Esprit dress is ranked higher. In addition, you could search for a term that is not included in the index, like espritt, and still get results back.
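The effect of minimum_should_match on a term-centric query can be sketched in a few lines of Python. This is a simplification under stated assumptions: the term sets are typed in by hand from the five test documents, the helper name matches is hypothetical, and the number of matched terms only stands in for the real score.

```python
# Terms of each test document across the four searched fields, lowercased by hand.
DOC_TERMS = {
    1: {"gestreiftes", "kleid", "gelb", "blau"},
    2: {"gestreiftes", "kleid", "abendkleid", "rot", "esprit", "l"},
    3: {"gestreiftes", "kleid", "creme", "rose"},
    4: {"gestreiftes", "kleid", "3000", "rot"},
    5: {"hoodie", "faster", "runner", "nike", "xl", "black"},
}


def matches(query, minimum_should_match):
    """Return {doc_id: matched_term_count} for documents reaching the minimum."""
    terms = query.lower().split()
    counts = {
        doc_id: sum(term in doc_terms for term in terms)
        for doc_id, doc_terms in DOC_TERMS.items()
    }
    return {doc_id: n for doc_id, n in counts.items() if n >= minimum_should_match}


exact = matches("kleid rot esprit", 2)
typo = matches("kleid rot espritt", 2)
print(exact)  # {2: 3, 4: 2}
print(typo)   # {2: 2, 4: 2}
```

With the misspelled brand both red dresses still qualify, because two of the three terms are enough.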
So everything is good, and why did I name this blog post after a new query if it’s not even shown yet?!
From a high-level perspective the cross_fields type and the new combined_fields query are rather similar.
GET products/_search
{
"query": {
"combined_fields": {
"query": "kleid rot esprit",
"fields": [ "name", "brand", "size", "color" ],
"operator": "or",
"minimum_should_match": 2
}
}
}
The results are similar as well, but the scoring is different. And this is where we have to dive a little into the details.
So, is this a new query just for scoring? Yes it is, and I will explain why this makes sense. First, the cross_fields type could produce broken scores. Also, cross_fields was using its own scoring formula, which confused users.
A solution to this problem is BM25F resulting in a more robust approach when scoring. The idea is similar to cross fields, where the query is term-centric, analyzing the query as individual terms, then looking for each term in any of the fields, basically treating all the fields as a single big field. However BM25F combines document statistics, so that the use of BM25 and its TF saturation is kept - and as a side effect the above scoring issues do not occur (not really a side effect, is it?).
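The idea of saturating once over the combined field can be illustrated with a small Python sketch. It is a simplified, BM25F-style calculation for a single term that leaves out IDF; the function names, the field weights, and the example numbers are assumptions for illustration, and this is not the code Elasticsearch or Lucene runs.

```python
K1, B = 1.2, 0.75  # commonly used BM25 defaults


def saturate(tf, length, avg_length):
    """Term-frequency part of BM25: grows with tf but stays below 1."""
    norm = K1 * (1 - B + B * length / avg_length)
    return tf / (tf + norm)


def summed_per_field(tfs, lengths, avg_lengths):
    """Naive approach: score every field on its own and add the results up."""
    return sum(saturate(tf, l, a) for tf, l, a in zip(tfs, lengths, avg_lengths))


def combined(tfs, lengths, avg_lengths, weights):
    """BM25F-style: merge weighted frequencies and lengths first, saturate once."""
    tf = sum(w * t for w, t in zip(weights, tfs))
    length = sum(w * l for w, l in zip(weights, lengths))
    avg_length = sum(w * a for w, a in zip(weights, avg_lengths))
    return saturate(tf, length, avg_length)


# Hypothetical document: the term occurs once in each of four fields,
# and every field has exactly average length.
tfs, lengths, avgs, weights = [1, 1, 1, 1], [3, 1, 1, 1], [3, 1, 1, 1], [1, 1, 1, 1]
per_field = summed_per_field(tfs, lengths, avgs)
bm25f = combined(tfs, lengths, avgs, weights)
print(round(per_field, 3), round(bm25f, 3))  # 1.818 0.769
```

Adding up per-field scores lets a term that is repeated across fields escape the saturation limit, while the combined variant keeps it bounded like a regular BM25 score on one big field.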
If you want to read more about BM25F, check out the PDF linked in the combined_fields query documentation: The Probabilistic Relevance Framework: BM25 and Beyond.
Digging deeper
It wouldn’t be fun if there were no limitations, right? Now we’ll finally figure out why I picked a German example. German is not only the language of compound words, which require decompounders. In English a dress has stripes and is striped, so both terms can be stemmed to the same root. In German this is different: a dress has Streifen, but it is gestreift. I don’t know the exact translation, but I think this is an adjective in the past participle - not the present participle. In German this would be a partizipatives Adjektiv.
Long story short, if you want to search for kleid gestreift or rotes kleid you need some extra processing. Let’s take a look at the german analyzer, which takes German stopwords into account and does some stemming:
GET _analyze?filter_path=**.token
{
"analyzer": "german",
"text": ["rotes Kleid"]
}
results in
{
"tokens" : [
{
"token" : "rot"
},
{
"token" : "kleid"
}
]
}
OK, so this works. However, we’re still unlucky with kleid gestreift. Searching for gestreiftes Kleid will reduce this to gestreift kleid, so some stemming is happening. Also, streifen will be stemmed to streif. Even though you could pull some more stemming tricks, use the hunspell token filter, download a dictionary and improve on that, there is currently no way to get from gestreift to streifen or streif. That would rather require lemmatization, which in turn means you need to apply part-of-speech tagging as well to figure out the context of a word. This is even harder in e-commerce, as users rarely write full sentences, so any model would have a hard time.
Another approach, which cannot be generalized but might be worthwhile for some terms, could be a crude use of synonyms. You could use a toolkit like nltk or HanTa to create lemmas, and then use the lemmatized version as a synonym. So for gestreift this could be streifen. Let’s make some index changes to try this out:
POST products/_close
PUT products/_settings
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_stemmer",
"search_synonym_filter"
]
}
},
"filter": {
"search_synonym_filter": {
"type": "synonym_graph",
"synonyms": [
"gestreift => streifen"
]
},
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
}
}
}
}
}
POST products/_open
GET products/_analyze?filter_path=**.token
{
"text": "rotes kleid gestreift",
"analyzer": "synonym_search_analyzer"
}
returns
{
"tokens" : [
{
"token" : "rot"
},
{
"token" : "kleid"
},
{
"token" : "streif"
}
]
}
This looks good!
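The synonym rule above was written by hand. If you collect lemmas for more than one term, generating the rules is straightforward. The following Python sketch assumes you already have a mapping from surface form to lemma; the LEMMAS dictionary is hand-written and hypothetical, standing in for whatever a lemmatizer would produce, and synonym_rules is a made-up helper name.

```python
import json

# Hypothetical lemmatizer output (surface form -> lemma), written by hand for illustration.
LEMMAS = {
    "gestreift": "streifen",
    "gepunktet": "punkt",
    "kleid": "kleid",  # already its own lemma, so it needs no rule
}


def synonym_rules(lemmas):
    """Build explicit 'form => lemma' mappings, skipping forms that equal their lemma."""
    return [
        f"{form} => {lemma}"
        for form, lemma in sorted(lemmas.items())
        if form != lemma
    ]


rules = synonym_rules(LEMMAS)
filter_definition = {
    "search_synonym_filter": {"type": "synonym_graph", "synonyms": rules}
}
print(json.dumps(filter_definition, indent=2))
```

The printed JSON has the same shape as the filter section of the index settings, so it could be pasted there after a manual review of the generated rules.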
Now, in order to get this into our index, let’s recreate and re-index the data with the correct settings
DELETE products
PUT products
{
"mappings": {
"properties": {
"name": {
"type": "text",
"search_analyzer": "synonym_search_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"brand": {
"type": "text",
"search_analyzer": "synonym_search_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"size": {
"type": "text",
"search_analyzer": "synonym_search_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"color": {
"type": "text",
"search_analyzer": "synonym_search_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_search_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_stemmer",
"search_synonym_filter"
]
}
},
"filter": {
"search_synonym_filter": {
"type": "synonym_graph",
"synonyms": [
"gestreift => streifen"
]
},
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
}
}
}
}
}
PUT products/_bulk?refresh
{ "index" : { "_id" : 1} }
{"name":"Gestreiftes Kleid","color":["gelb","blau"]}
{ "index" : { "_id" : 2} }
{"name":"Gestreiftes Kleid / Abendkleid","color":"rot","brand":"Esprit","size":"L","price":"32.49"}
{ "index" : { "_id" : 3} }
{"name":"Gestreiftes Kleid","color":["creme rose"]}
{ "index" : { "_id" : 4} }
{"name":"Gestreiftes Kleid 3000","color":["rot"]}
{ "index" : { "_id" : 5} }
{"name":"Hoodie Faster Runner","brand":"nike","size":"XL","color":"black"}
Now we can search for rotes esprit kleid mit streifen (red Esprit dress with stripes), first with cross_fields and then with combined_fields:
GET products/_search
{
"query": {
"multi_match": {
"query": "rotes esprit kleid mit streifen",
"fields": ["name", "brand", "size", "color"],
"operator": "or",
"minimum_should_match": 2,
"type": "cross_fields"
}
}
}
GET products/_search
{
"query": {
"combined_fields": {
"query": "rotes esprit kleid mit streifen",
"fields": [ "name", "brand", "size", "color" ],
"operator": "or",
"minimum_should_match": 2
}
}
}
When you compare the scores, you will see some differences. However, there is more to it: the multi_match query allows you to use different analyzers in the mapping, whereas the combined_fields query requires all fields to use the same analyzer. This is indeed a problem, because in our example it might make sense to stem the color field to match rotes for rot, but it does not make sense to stem the brand field: a search for nike gets stemmed to nik, which is pointless, as brands do not have root forms. Also, fuzziness is not supported in the combined_fields query.
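Whether a set of fields qualifies for combined_fields can be checked up front against the mapping. Below is a minimal Python sketch of such a check. It assumes that a field without an explicit search_analyzer falls back to its analyzer setting and then to standard, and it ignores any index-level default analyzer; the function names and the unstemmed brand field are hypothetical.

```python
def search_analyzers(properties, fields):
    """Map each field to the analyzer that would be used at search time."""
    result = {}
    for field in fields:
        config = properties.get(field, {})
        # Fallback order assumed here: search_analyzer, then analyzer, then standard.
        result[field] = config.get("search_analyzer", config.get("analyzer", "standard"))
    return result


def usable_with_combined_fields(properties, fields):
    """True if all fields share one search analyzer."""
    return len(set(search_analyzers(properties, fields).values())) == 1


properties = {
    "name": {"type": "text", "search_analyzer": "synonym_search_analyzer"},
    "color": {"type": "text", "search_analyzer": "synonym_search_analyzer"},
    "brand": {"type": "text"},  # hypothetical: brand deliberately left unstemmed
}
same = usable_with_combined_fields(properties, ["name", "color"])
mixed = usable_with_combined_fields(properties, ["name", "color", "brand"])
print(same, mixed)  # True False
```

As soon as the brand field gets its own analysis chain, the field list no longer qualifies and you are back to multi_match.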
It might also make more sense to use the synonyms only for the name field but not for the other fields that are part of the query. For this you would need to fall back to the multi_match query with cross_fields.
There is one more difference to the cross_fields type, which is outlined by Mark in this GitHub discussion. In our example the meaning of a term does not change between the name and the description of a product, but in the case of an address book with first names and last names it indeed does, as Alex is much rarer as a last name than as a first name and thus might be more or less important depending on the field.
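The address book point is easier to see with numbers. The Python sketch below computes the inverse document frequency of a term per field and for both fields merged into one. All counts are invented for illustration, and the sketch does not model how cross_fields or combined_fields actually blend their statistics; it only shows what gets lost once two fields are treated as a single one.

```python
import math


def idf(doc_count, doc_freq):
    """Inverse document frequency in the form Lucene's BM25 uses."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


CONTACTS = 10_000       # hypothetical address book size
FIRST_NAME_ALEX = 400   # hypothetical: contacts with Alex as first name
LAST_NAME_ALEX = 3      # hypothetical: contacts with Alex as last name

per_field = {
    "first_name": idf(CONTACTS, FIRST_NAME_ALEX),
    "last_name": idf(CONTACTS, LAST_NAME_ALEX),
}
# Merged view, assuming no contact has Alex in both fields.
merged = idf(CONTACTS, FIRST_NAME_ALEX + LAST_NAME_ALEX)

print({field: round(value, 2) for field, value in per_field.items()})
print(round(merged, 2))
```

Per field, the rare last name gets a far higher weight than the common first name. In the merged view there is only one weight left, close to the one of the common first name.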
That said, if you can live with the limitations, the new combined_fields query is worth a try.
One last thing: if your rules become more complex than a single synonym and you would also like to introduce some more scoring/ranking, taking a look at Querqy might be a good idea as well.
Summary
- Check out the search results for kleid rot gestreift and kleid rot streifen at Zalando (second biggest e-commerce retailer in Germany), and you will understand. Update: A fine fellow at Zalando contacted me, and it turns out that they are very capable of parsing this correctly if you have a German locale set in your browser. And indeed this is true. When you set your locale to German at the top, you will be redirected to the German search, and then searching for rot kleid gestreift and kleid mit roten streifen returns the same results.
- If you are interested in Implementing A Modern E-Commerce Search, read my other long blog post
- The original GitHub issue discussion
- The combined_fields pull request