Alexander Reelsen

Backend developer, productivity fan, likes the JVM, full text search, distributed databases & systems

Understanding Elasticsearch combined fields and multi match queries
May 28, 2021
9 minutes read

TLDR; This blog post dives into the new combined_fields query, which was added in Elasticsearch 7.13, and explains why it’s a really nice addition, especially for the e-commerce use case. However, in order to understand this query better, we will also spend some time with the multi_match query. And of course you will learn about the complexity of the German language within search - because who doesn’t?

What’s the issue

Let’s take a look at a grossly simplified JSON representation of an e-commerce product.

{
  "name" : "Gestreiftes Kleid / Abendkleid",
  "color" : "rot",
  "brand" : "Esprit",
  "size" : "L",
  "price" : "32.49"
}

This intentionally ignores product variants, in-stock data, size/uom normalization etc.

Let’s quickly index five documents for testing

PUT products/_bulk?refresh
{ "index" : { "_id" : 1} }
{"name":"Gestreiftes Kleid","color":["gelb","blau"]}
{ "index" : { "_id" : 2} }
{"name":"Gestreiftes Kleid / Abendkleid","color":"rot","brand":"Esprit","size":"L","price":"32.49"}
{ "index" : { "_id" : 3} }
{"name":"Gestreiftes Kleid","color":["creme rose"]}
{ "index" : { "_id" : 4} }
{"name":"Gestreiftes Kleid 3000","color":["rot"]}
{ "index" : { "_id" : 5} }
{"name":"Hoodie Faster Runner","brand":"nike","size":"XL","color":"black"}

A common user search might be kleid rot: two terms that live in two different fields, so each field matches only one of them. This is hard with an AND query. The following does not return any results:

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "kleid rot",
      "fields": ["name", "brand", "size", "color"],
      "operator": "and"
    }
  }
}
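Why does this come back empty? With its default type, multi_match applies the and operator per field, so one single field would have to contain both terms. The following Python sketch only imitates that matching logic. It assumes a naive lowercase tokenizer instead of the real analyzer, and the helper names (tokens, field_centric_and) are made up for illustration.

```python
import re

# The five test documents from above, reduced to the searched fields.
DOCS = {
    1: {"name": "Gestreiftes Kleid", "color": ["gelb", "blau"]},
    2: {"name": "Gestreiftes Kleid / Abendkleid", "color": "rot",
        "brand": "Esprit", "size": "L"},
    3: {"name": "Gestreiftes Kleid", "color": ["creme rose"]},
    4: {"name": "Gestreiftes Kleid 3000", "color": ["rot"]},
    5: {"name": "Hoodie Faster Runner", "brand": "nike",
        "size": "XL", "color": "black"},
}
FIELDS = ["name", "brand", "size", "color"]


def tokens(value):
    """Naive stand-in for an analyzer: lowercase and split on non-word characters."""
    values = value if isinstance(value, list) else [value]
    return {t for v in values for t in re.findall(r"\w+", v.lower())}


def field_centric_and(query, doc):
    """True if at least one single field contains ALL query terms."""
    terms = tokens(query)
    return any(terms <= tokens(doc[f]) for f in FIELDS if f in doc)


hits = [doc_id for doc_id, doc in DOCS.items() if field_centric_and("kleid rot", doc)]
print(hits)  # empty: no single field holds both "kleid" and "rot"
```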

Changing the operator from and to or returns anything that contains kleid.

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "kleid rot",
      "fields": ["name", "brand", "size", "color"],
      "operator": "or"
    }
  }
}

Using this particular query we cannot play around with minimum_should_match as only two terms are forming the query.

One way to dig into this is to use the most_fields type in the multi_match query:

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "kleid rot",
      "fields": ["name", "brand", "size", "color"],
      "operator": "or",
      "type": "most_fields"
    }
  }
}

This sums up the scores of all fields, so documents that have rot in their color and kleid in their name are scored the highest. However, the third document does not contain rot, so you may end up with a lot of unwanted documents - which may still be fine in an e-commerce setup.
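To make the difference between taking the best field and summing all fields more tangible, here is a small Python sketch. It is a toy model, not the real scoring: the per-field score is simply the number of query terms found in that field (Elasticsearch uses BM25 per field), the color arrays are flattened to strings, and the function names are invented for illustration.

```python
import re

# Only the two fields that matter for the query "kleid rot".
DOCS = {
    1: {"name": "Gestreiftes Kleid", "color": "gelb blau"},
    2: {"name": "Gestreiftes Kleid / Abendkleid", "color": "rot"},
    3: {"name": "Gestreiftes Kleid", "color": "creme rose"},
    4: {"name": "Gestreiftes Kleid 3000", "color": "rot"},
}


def toy_field_score(query, text):
    """Toy stand-in for a per-field score: number of query terms found in the field."""
    terms = set(re.findall(r"\w+", query.lower()))
    return len(terms & set(re.findall(r"\w+", text.lower())))


def score(query, doc, combine):
    """Combine the per-field scores with max (best field) or sum (all fields)."""
    return combine(toy_field_score(query, text) for text in doc.values())


best_fields = {i: score("kleid rot", d, max) for i, d in DOCS.items()}
most_fields = {i: score("kleid rot", d, sum) for i, d in DOCS.items()}
print(best_fields)  # {1: 1, 2: 1, 3: 1, 4: 1}
print(most_fields)  # {1: 1, 2: 2, 3: 1, 4: 2}
```

In this toy model only the summing variant lifts documents 2 and 4 above the rest, while documents 1 and 3 still match.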

We can do one final change to get only proper matches and use cross_fields with an and operator.

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "kleid rot",
      "fields": ["name", "brand", "size", "color"],
      "operator": "and",
      "type": "cross_fields"
    }
  }
}

This only returns two documents. We could now also go with minimum_should_match and an or query, like this:

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "kleid rot esprit",
      "fields": ["name", "brand", "size", "color"],
      "operator": "or",
      "minimum_should_match": 2, 
      "type": "cross_fields"
    }
  }
}

This still returns two documents - and the red esprit dress is ranked higher. In addition, a query containing a term that is not in the index at all, such as the misspelled espritt, would still return data.
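The term-centric matching behind the last two queries can be sketched in a few lines of Python. Again, this only mimics which documents match, not how they are scored. It assumes a naive lowercase tokenizer, and the function names are made up for illustration.

```python
import re

DOCS = {
    1: {"name": "Gestreiftes Kleid", "color": ["gelb", "blau"]},
    2: {"name": "Gestreiftes Kleid / Abendkleid", "color": "rot",
        "brand": "Esprit", "size": "L"},
    3: {"name": "Gestreiftes Kleid", "color": ["creme rose"]},
    4: {"name": "Gestreiftes Kleid 3000", "color": ["rot"]},
    5: {"name": "Hoodie Faster Runner", "brand": "nike",
        "size": "XL", "color": "black"},
}


def tokens(value):
    """Naive stand-in for an analyzer: lowercase and split on non-word characters."""
    values = value if isinstance(value, list) else [value]
    return {t for v in values for t in re.findall(r"\w+", v.lower())}


def term_centric_hits(query, minimum_should_match):
    """Treat all fields as one big field and count how many query terms occur in it."""
    terms = tokens(query)
    hits = []
    for doc_id, doc in DOCS.items():
        big_field = set().union(*(tokens(v) for v in doc.values()))
        if len(terms & big_field) >= minimum_should_match:
            hits.append(doc_id)
    return hits


hits_and = term_centric_hits("kleid rot", 2)          # "and": both terms required
hits_msm = term_centric_hits("kleid rot esprit", 2)   # two out of three terms
hits_typo = term_centric_hits("kleid rot espritt", 2)  # unknown term is tolerated
print(hits_and, hits_msm, hits_typo)
```

All three calls return documents 2 and 4, matching what the cross_fields queries above return.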

So everything is good, and why did I name this blog post after a new query if it’s not even shown yet?!

From a high level perspective the cross_fields type and the new combined_fields query are rather similar.

GET products/_search
{
  "query": {
    "combined_fields": {
      "query": "kleid rot esprit",
      "fields": [ "name", "brand", "size", "color" ],
      "operator": "or",
      "minimum_should_match": 2
    }
  }
}

The results are similar as well, but the scoring is different. And this is where we have to dive a little into the details.

So, is this a new query just for scoring? Yes it is, and I will explain why this makes sense. First, the cross_fields type could produce broken scores. Also, cross_fields was using its own scoring formula, which confused users.

A solution to this problem is BM25F resulting in a more robust approach when scoring. The idea is similar to cross fields, where the query is term-centric, analyzing the query as individual terms, then looking for each term in any of the fields, basically treating all the fields as a single big field. However BM25F combines document statistics, so that the use of BM25 and its TF saturation is kept - and as a side effect the above scoring issues do not occur (not really a side effect, is it?).
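A rough Python sketch may help to see what keeping the TF saturation means. It is a simplified, textbook-style formula with constant factors dropped, not the exact Lucene implementation, and all numbers (idf of 1.0, field lengths, equal weights) are invented for the example.

```python
def bm25_term(tf, doc_len, avg_len, idf=1.0, k1=1.2, b=0.75):
    """Simplified BM25 contribution of one term in one field."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf / (tf + norm)


def bm25f_term(tfs, doc_lens, avg_lens, weights, idf=1.0, k1=1.2, b=0.75):
    """Simplified BM25F: merge the weighted per-field statistics, then saturate once."""
    tf = sum(w * t for w, t in zip(weights, tfs))
    doc_len = sum(w * l for w, l in zip(weights, doc_lens))
    avg_len = sum(w * a for w, a in zip(weights, avg_lens))
    return bm25_term(tf, doc_len, avg_len, idf, k1, b)


# Hypothetical document: the term occurs once in each of three fields and
# every field is 3 tokens long, which is also the made-up average length.
tfs, lens, avgs, weights = [1, 1, 1], [3, 3, 3], [3, 3, 3], [1, 1, 1]

summed = sum(bm25_term(t, l, a) for t, l, a in zip(tfs, lens, avgs))
combined = bm25f_term(tfs, lens, avgs, weights)
print(f"per-field sum: {summed:.3f}, BM25F-style: {combined:.3f}")
```

Summing independent per-field scores can exceed the ceiling that a single term has in BM25 (the idf, here 1.0), because each field saturates on its own. The combined variant saturates once over the merged statistics and stays below that ceiling.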

If you want to read more about BM25F, check out the PDF linked in the combined_fields query documentation: The Probabilistic Relevance Framework: BM25 and Beyond.

Digging deeper

It wouldn’t be fun if there were not any limitations, right? Now we’ll finally figure out why I picked a German example. German is not only the language of compound words, which require decompounders. In English, a dress has stripes and is striped, so both terms can be stemmed to the same root. In German, this is different: a dress has Streifen, but it is gestreift. I don’t know the exact translation, but I think this is an adjective in the past participle - not the present participle. In German this would be a partizipatives Adjektiv.

Long story short, if you want to search for kleid gestreift or rotes kleid you need some extra processing. Let’s take a look at the german analyzer, which takes German stopwords into account and does some sort of stemming:

GET _analyze?filter_path=**.token
{
  "analyzer": "german",
  "text": ["rotes Kleid"]
}

results in

{
  "tokens" : [
    {
      "token" : "rot"
    },
    {
      "token" : "kleid"
    }
  ]
}

OK, so this works. However, we’re still unlucky for kleid gestreift. Searching for gestreiftes Kleid will reduce this to gestreift kleid, so some stemming is happening. Also, streifen will be stemmed to streif. Even though you could pull out some more stemming tricks, use the hunspell token filter and download a dictionary to improve on that, there is currently no way to get from gestreift to streifen or streif. That would rather require lemmatization, which in turn means you need to apply part-of-speech tagging as well to figure out the context of a word. This is even harder in e-commerce, as users rarely write full sentences, so any model would have a hard time.

Another approach, which cannot be generalized but might be worthwhile for some terms, is a simple-minded use of synonyms. You could use a toolkit like nltk or HanTa to create lemmas, and then use the lemmatized version as a synonym. So for gestreift this could be streifen. Let’s do some index changes to try this out:

POST products/_close

PUT products/_settings
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "german_stop",
              "german_stemmer",
              "search_synonym_filter"
            ]
          }
        },
        "filter": {
          "search_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "gestreift => streifen"
            ]
          },
          "german_stop": {
            "type": "stop",
            "stopwords": "_german_"
          },
          "german_stemmer": {
            "type": "stemmer",
            "language": "light_german"
          }
        }
      }
    }
  }
}

POST products/_open

GET products/_analyze?filter_path=**.token
{
  "text": "rotes kleid gestreift",
  "analyzer": "synonym_search_analyzer"
}

returns

{
  "tokens" : [
    {
      "token" : "rot"
    },
    {
      "token" : "kleid"
    },
    {
      "token" : "streif"
    }
  ]
}

This looks good!
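The single synonym rule above was written by hand. If you have a list of word-to-lemma pairs, generating such rules is straightforward. The following Python sketch assumes a hypothetical, hand-maintained lemma table instead of calling a real lemmatizer. Only the gestreift entry comes from this post; the second entry is invented purely for illustration.

```python
import json

# Hypothetical lemma table. In practice this could be produced by a lemmatizer
# or curated manually; the second entry is made up for this example.
LEMMAS = {
    "gestreift": "streifen",
    "gepunktet": "punkt",
}


def to_synonym_rules(lemmas):
    """Turn word -> lemma pairs into explicit synonym rules ("word => lemma")."""
    return [f"{word} => {lemma}"
            for word, lemma in sorted(lemmas.items())
            if word != lemma]


rules = to_synonym_rules(LEMMAS)

# Same shape as the search_synonym_filter definition used in the index settings.
synonym_filter = {
    "search_synonym_filter": {
        "type": "synonym_graph",
        "synonyms": rules,
    }
}
print(json.dumps(synonym_filter, indent=2))
```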

Now, in order to get this into our index, let’s recreate and re-index the data with the correct settings

DELETE products

PUT products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "search_analyzer": "synonym_search_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "brand": {
        "type": "text",
        "search_analyzer": "synonym_search_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "size": {
        "type": "text",
        "search_analyzer": "synonym_search_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "color": {
        "type": "text",
        "search_analyzer": "synonym_search_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }

    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "german_stop",
              "german_stemmer",
              "search_synonym_filter"
            ]
          }
        },
        "filter": {
          "search_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "gestreift => streifen"
            ]
          },
          "german_stop": {
            "type": "stop",
            "stopwords": "_german_"
          },
          "german_stemmer": {
            "type": "stemmer",
            "language": "light_german"
          }
        }
      }
    }
  }
}

PUT products/_bulk?refresh
{ "index" : { "_id" : 1} }
{"name":"Gestreiftes Kleid","color":["gelb","blau"]}
{ "index" : { "_id" : 2} }
{"name":"Gestreiftes Kleid / Abendkleid","color":"rot","brand":"Esprit","size":"L","price":"32.49"}
{ "index" : { "_id" : 3} }
{"name":"Gestreiftes Kleid","color":["creme rose"]}
{ "index" : { "_id" : 4} }
{"name":"Gestreiftes Kleid 3000","color":["rot"]}
{ "index" : { "_id" : 5} }
{"name":"Hoodie Faster Runner","brand":"nike","size":"XL","color":"black"}

Now we can search for rotes esprit kleid mit streifen (red esprit dress with stripes)

GET products/_search
{
  "query": {
    "multi_match": {
      "query": "rotes esprit kleid mit streifen",
      "fields": ["name", "brand", "size", "color"],
      "operator": "or",
      "minimum_should_match": 2, 
      "type": "cross_fields"
    }
  }
}

GET products/_search
{
  "query": {
    "combined_fields": {
      "query": "rotes esprit kleid mit streifen",
      "fields": [ "name", "brand", "size", "color" ],
      "operator": "or",
      "minimum_should_match": 2
    }
  }
}

When you compare the scores, you will see some differences. However, there is more to it: the multi_match query allows the fields to use different analyzers in the mapping, whereas the combined_fields query requires all fields to share the same analyzer. This is indeed a problem, because in our example it might make sense to stem the color field to match rotes for rot, but it does not make sense to stem the brand field. A search for nike would get stemmed to nik, which is pointless, as brands do not have root forms. Also, fuzziness is not supported in the combined_fields query.

It might also make more sense to use the synonyms only for the name field but not for the other fields that are part of the query. For this you would need to fall back to the multi_match query with cross_fields.
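If you build queries programmatically, a small check against the mapping can decide which query to use. The Python sketch below is an assumption about how such a check could look, not an Elasticsearch API. The function names are hypothetical, and the mapping excerpts are plain dictionaries shaped like the properties section of a mapping.

```python
def search_analyzer(field_mapping):
    """Effective search analyzer: search_analyzer, else analyzer, else standard."""
    return field_mapping.get("search_analyzer",
                             field_mapping.get("analyzer", "standard"))


def can_use_combined_fields(properties, fields):
    """combined_fields needs text fields that all share one search analyzer."""
    if any(properties[f].get("type") != "text" for f in fields):
        return False
    return len({search_analyzer(properties[f]) for f in fields}) == 1


def pick_query_name(properties, fields):
    """Fall back to multi_match (to be used with cross_fields) if analyzers differ."""
    return "combined_fields" if can_use_combined_fields(properties, fields) else "multi_match"


# Mapping as used in this post: every field shares the same search analyzer.
same = {f: {"type": "text", "search_analyzer": "synonym_search_analyzer"}
        for f in ["name", "brand", "size", "color"]}

# Hypothetical variant: the brand field is not stemmed and uses the default analyzer.
mixed = dict(same, brand={"type": "text"})

print(pick_query_name(same, ["name", "brand", "size", "color"]))
print(pick_query_name(mixed, ["name", "brand", "size", "color"]))
```

The first call picks combined_fields and the second one falls back to multi_match, because the brand field in the second mapping resolves to a different search analyzer.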

There is one more difference to the cross_fields type, which is outlined by Mark in this GitHub discussion. In our example the meaning of a term does not change between the name or the description of a product, but in the case of an address book with first names and last names it indeed does: Alex is much rarer as a last name than as a first name, and thus might be more or less important.

That said, if you can live with the limitations, testing out the new combined fields query is worth a try.

One last thing: if your rules become more complex than a single synonym and you would also like to introduce some more scoring/ranking, taking a look at querqy might be a good idea as well.

Summary

