Implementing search feedback loops - part 1

June 5, 2026 · 17 min read · #technology #search #elasticsearch

TLDR; This is a multi post series about implementing feedback loops in search applications that I had sitting in my drafts for quite some time. Instead of going high tech with the latest models, we try to make it as simple as possible and integrate this in our search queries.

Think of an e-commerce context with the examples provided in this and the following posts. This is where I am coming from. So everything here revolves around e-commerce product search.

Also note, that everything I show here is heavily simplified, the real e-commerce world and its product data is a wild beast to say the least.

What is a search feedback loop?

The idea of a feedback loop is to take user interactions into account when running a search, for example:

Often clicked products are ranked higher
Often bought products are ranked higher than clicked products
Which products were found using the same search term that was typed in?
Which categories did a search term belong to? For example a search for iphone 17 should probably rate products in the phones category higher than in the phone accessories category.

One of the biggest drawbacks of such loops is reinforcement. If a product is displayed among the top 10 most of the time, it will also be clicked more often - disallowing other products that probably are great matches for a regular full text search to even occur in the top 10. The long tail products never get a chance.

Goal: Keep search understandable

The goal of this blog post series is to keep search understandable and try to integrate as much as possible into a search request, instead of having a complex chain of data science models to query before the request hits your search engine. The same applies for re-ranking. Also, often models are language specific - which in practice means, English is well supported, and then quality usually declines sharply. While again this is limited and such models might be more powerful, it will allow for a search request to be easily followed and make it explainable. Using your own data for searches also allows you to focus in your niche and the behavior of your users.

Setup

For completeness, here is my docker setup:

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:9.4.1
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - http.cors.enabled=true
      - http.cors.allow-origin=/https?:\/\/(localhost|127\.0\.0\.1)(:[0-9]+)?/
      - http.cors.allow-methods=OPTIONS,HEAD,GET,POST
    mem_limit: 2g
    ports:
      - "9200:9200"
    volumes:
      - ./data:/usr/share/elasticsearch/data
    networks:
      - elastic

  kibana:
    image: docker.elastic.co/kibana/kibana:9.4.1
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    networks:
      - elastic

  web:
    image: python:3-alpine
    container_name: web
    working_dir: /web
    command: python3 -m http.server 8000
    ports:
      - "8000:8000"
    volumes:
      - ./web:/web
    networks:
      - elastic

networks:
  elastic:
    driver: bridge

This starts Elasticsearch, Kibana and a static HTTP server. Using the static HTTP server and ignoring all security rules we can directly query Elasticsearch from the browser - that’s why the CORS settings are required as well.

Using the python web server we can now build a sandbox, that directly queries Elasticsearch.

Requirement: A query sandbox

So what is a query sandbox? Essentially, whenever you want to play around with a query from your dataset, you need to compare current and new queries with your live data.

Especially within a company context such a sandbox fulfills several use-cases:

Allow search engineers to test current queries with current data
Allow product managers/engineering managers to understand queries and to test
Share new queries among stake holders
Easily modify existing queries
Easily understand scoring (share the explain output of Elasticsearch)
Advanced: Integrate with analytics data, to use the top 100 most common queries and compare their outputs
Advanced: NDCG/MRR comparison

A sandbox is a good candidate to vibe code a PoC. Some sort of a split screen, that allows to load the existing query for a search term, a JSON editor and a result view to compare the results.

See this example screenshot:

Sample data

I also generated some sample data, around phone and phone accessories, but only 1k products. They look like this:

{
  "name": "iPhone 14 Pro 128GB Space Black",
  "base_model": "iPhone 14 Pro",
  "brand": "Apple",
  "category": "phone",
  "year": 2022,
  "year_feature": 3,
  "color": "Space Black",
  "weight_g": 206,
  "memory_gb": 128,
  "display_inch": 6.1,
  "megapixel_front": 12,
  "megapixel_back": 48,
  "battery_mah": 3200,
  "image_url": "https://placehold.co/300x300/3c3c3d/ffffff?text=iPhone+14+Pro+128GB+Space+Black&font=Roboto"
}

They represent names and some phone specific attributes. Note: if you had a much bigger and diverse catalog, attribute mapping would need to become a lot smarter, but let’s keep it simple for this example.

Mapping

{
  "products": {
    "mappings": {
      "properties": {
        "base_model": {
          "type": "keyword",
          "fields": { 
            "text": { "type": "text" } 
          }
        },
        "battery_mah":  { "type": "integer" },
        "brand":        { "type": "keyword" },
        "category":     { "type": "keyword" },
        "color":        { "type": "keyword" },
        "display_inch": { "type": "float" },
        "image_url":    { "type": "keyword", "index": false },
        "megapixel_back": { "type": "float" },
        "megapixel_front": { "type": "float" },
        "memory_gb":    { "type": "integer" },
        "name": {
          "type": "text",
          "fields": {
            "keyword":  { "type": "keyword" },
            "no_norms": { "type": "text", "norms": false }
          }
        },
        "weight_g":     { "type": "integer" },
        "year":         { "type": "integer" },
        "year_feature": { "type": "rank_feature" }
      }
    }
  }
}

The mapping allows for simple search, disables norms on one text field, and also has a rank_feature field, which we will see later. Let’s go through some sample queries of how one might tweak an initial query before going deeper.

Sample queries

Let’s start with the most simple search:

{
  "query": { 
    "match": { 
      "name": { 
        "query": "<user typed query>",
        "minimum_should_match": "67%"
      } 
    }
  }
}

This is probably the most simple query. Running against the name field, and at least 67% of the terms need to match, so it takes the number of terms of the input query into account. By not needing to match all the terms we can allow for some tolerance towards typos, or misspellings (maybe even in our own product data) or a user typing in a way too long search string (hello dear agents).

By default, the length of the stored field is taken into account, but in e-commerce you do not really care about the difference between iphone 14 pro max and iphone 14 pro max 128 gb black in a product name, so this could be taken out of the equation by disabling norms and then query accordingly (see the mapping above):

{
  "query": { 
    "match": { 
      "name.no_norms": { 
        "query": "<user typed query>",
        "minimum_should_match": "67%"
      } 
    }
  }
}

We can also query several fields like this:

{
  "query": { 
    "multi_match": {
      "query": "<user typed query>",
      "fields": ["name^2", "base_model.text", "brand"],
      "minimum_should_match": "67%"
    }
  }
}

This query searches in name field with a boost, but also the base_model and the brand field. This is a classic example, where you would need to think about term-centric vs. field centric queries.

We could also take the age of a product into account, or the year it was released:

{
  "query": { 
    "bool": {
      "must": [
        { "match": { "name": { "query": "<user typed query>", "minimum_should_match": "67%" } } }
      ],
      "should": [
        { "rank_feature": { "field": "year_feature", "saturation": { "pivot": 3 } } }
      ]
    }
  }
}

This will score more recent products higher, which often is what you want in an e-commerce setup - but not always :-)

Another common problem in e-commerce: Some terms are more important than others, for example iphone 14 128gb yellow - probably the iphone 14 is more important than the specs, because this is the model.

We could go with many variations here, like trying to be smarter about the number of words matched (or store shingles and search for terms right next to each other). We will skip this topic a bit to build some infrastructure to take clicks into account, so no need to focus on full text search only.

All of these queries have one issue. They only act on the input data of the products and never on any interaction with them. So if your product data is mediocre (or even worse, someone uses it to game your search engine, if you have smart sellers) you might have good results from a full text search perspective, but they will always be static. Let’s change that.

Taking clicks into account

In order to be able to take clicks into account, we first need some click data. A single record would consist of a timestamp and the product id in its most simple form. Before thinking about product data, let’s think about how we want the score to evolve over time. For this we need to take a look at different use-cases:

Constant popular product

This is your cash cow, and rarely happens that a product is constantly clicking well. Usually this means, it has been placed either in ads or has direct links from the main page or other well performing landing pages. This is a good reminder that clicks alone might not be a good metric, unless correlated with sales (or paid clicks are excluded). Maybe batteries are a good use-case for such a product, as they are in constant need.

This product should outscore all others in terms of a click score.

Constant semi-popular product

Not a high flyer, but something that probably gives you a solid revenue, so you should make sure, you have a few of those. Replacement parts or additional parts like specific charging cables might go into this category.

This product should also have a solid click score because of its consistency.

Recently hyped product

Classic flagship product use case, or a flagship product release, where the previously expensive product line got a lot cheaper and now finds a fair share of buyers. Phones are probably a good candidate here.

This is the first indicator that recency should be taken into account when creating a click score.

Past hyped product

Either something like a recently hyped product after a year, or a true short term hype/one hit wonder triggered by some viral TikTok trend, or a seasonal thing (pumpkin spice latte, however in this case this is not possible because it would repeat after a year). Germans might remember the dubai chocolate trend.

Recency here is even more important, as such a product is probably the least important of all the ones listed here.

Declining popularity product

Similar to the recently hyped product, but declining over a way longer duration.

Worth a discussion, if this or the constant semi popular product should have a higher click score (probably this one).

As you can see, recency definitely has to be taken into account into our formula, we cannot just take the average or total number of clicks per day or week and consider that a great formula.

As mentioned previously, in a real world example, you would also try to catch cheating by comparing with sales data, but again we will keep it simple here.

Also, there are dozens of other curves in real product data (when a campaign started etc), but again, we’ll keep it simple here.

Creating a recency based formula

OK, so what capabilities would our formula need?

A single day should not be able to dominate
Recency should be important (last months data is more important than last year’s)
Volume: The more clicks a product has, it should get a higher score
No need for incremental updates, full recompute is fine - this would change with a massive catalog (not a lot of e-commerce companies probably have such)
The gap between popular and insanely popular should be relatively small, we’re more interested in diversifying products.

The input data looks like this for the product-clicks index and represents a single event:

{
  "@timestamp": "2026-06-04T10:00:48Z",
  "product_id": "iphone-16-512gb-black"
}

For each product id, we need our recency based formula to be applied.

ESQL to the rescue

Instead of coming up with a complex scripted metric aggregation, ES|QL is of great help here.

FROM product-clicks
  | WHERE @timestamp >= NOW() - 18 months
  | EVAL day = DATE_TRUNC(1 day, @timestamp)
  | STATS clicks = COUNT(*) BY product_id, day
  | EVAL age = DATE_DIFF("days", day, NOW())
  | EVAL weighted = SQRT(clicks) * POW(0.5, age / TO_DOUBLE(182))
  | STATS raw = SUM(weighted) BY product_id
  | EVAL ref = SQRT(TO_DOUBLE(30)) * (1 - POW(0.5, TO_DOUBLE(547) / TO_DOUBLE(182))) / (1 - POW(0.5, 1.0 / TO_DOUBLE(182)))
  | EVAL score = 1 - EXP(-raw / ref)
  | KEEP product_id, score
  | SORT score DESC

So, this is what this looks like in Kibana

Let’s dissect this query step by step

FROM product-clicks
  | WHERE @timestamp >= NOW() - 18 months
  | EVAL day = DATE_TRUNC(1 day, @timestamp)
  | STATS clicks = COUNT(*) BY product_id, day

This is grouping the data by product_id and day and counting the clicks.

  | EVAL age = DATE_DIFF("days", day, NOW())
  | EVAL weighted = SQRT(clicks) * POW(0.5, age / TO_DOUBLE(182))

This adds two new fields. age is the difference from the bucketed day till now, weighted takes the square root of the counted clicks per day and multiplies it by 0.5 to the power of age divided by the number of days of half a year. So this value is higher the nearer the date is to the current date.

age	POW(0.5, age / TO_DOUBLE(182))
0	1
1	0.996
10	0.963
50	0.827
100	0.683
200	0.467
365	0.249
547	0.125

As this value is multiplied with SQRT(clicks) you will end up with higher values when either the day is more recent or the number of clicks is high.

The next line

  | STATS raw = SUM(weighted) BY product_id

is simply adding all the daily sums for each product id.

Now the next row looks static, but mainly because I used a script to create the whole ESQL query with three constants in-lined. The 182 is essentially the half-life, where we want the score to half (see the table above). 547 is roughly 18 months and 30 is the popular daily click rate, where we consider a product to be popular - this could be calculated more dynamically as your traffic patterns probably change over time.

  | EVAL ref = SQRT(TO_DOUBLE(30)) * (1 - POW(0.5, TO_DOUBLE(547) / TO_DOUBLE(182))) / (1 - POW(0.5, 1.0 / TO_DOUBLE(182)))

Now the final task is score normalizing

  | EVAL score = 1 - EXP(-raw / ref)
  | KEEP product_id, score
  | SORT score DESC

How does it work? The basic idea is the expression 1 - e^(-x) is normalizing all positive values into a value between 0 and 1. With -raw/ref the sum of all clicks is used in combination with the formula above.

Now with the output of this query, we can take the score for each product id and index it into our products index, for example into a click_score and click_score_feature field, so that we have these fields as rank_feature and regular field (if you want to play around with it as part of a function_score query). Also note, that rank_feature fields cannot contain 0, so it must be a positive value.

A look at our use-cases

This example shows five products, which have the aforementioned characteristics and also show their score:

You can see the ids at the top, producing the following scores, applying the ESQL query from above:

iphone-16-512gb-black, score 0.772, clicked a lot all the times, but has only been released within the time frame
apple-magsafe-charger, score 0.42, low traffic, but happening all the time
iphone-14-512gb-midnight, score 0.393, clicked a lot at the beginning, but continuously fading
apple-airpods-pro-2, score 0.388, short hype, relatively recent, slowly fading
apple-usb-c-to-lightning-cable-1m, score 0.062, one hit wonder from the past, probably campaign driven

Integrating into the query

So how to integrate that into our query? We could go simple and use a similar should clause, like previously

{
  "query": {
    "bool": {
      "must": [
        { "match": { "name.no_norms": { "query": "iphone 14", "minimum_should_match": "67%" } } }
      ],
      "should": [
        { "rank_feature": { "field": "click_score_feature", "linear": {} } }
      ]
    }
  }
}

Resulting in something like this

As you can see, the iPhone 14 with the click score is clearly scored above the others. If you compare the scores, you could see, that the click_score value of 0.396 was added to the full text score of 4.772. In many cases you don’t know about the maximum value of the full text score, especially if your boosting becomes more complex. You also do not know how much a value of 0.396 will change the total score.

Normalizing full text scores

So, what can we do to normalize the score? Elasticsearch has a function_score query that could be used for normalization. Within function score you don’t have the max score, so you would need to do per document score normalization doing something like this which is a bit lossy, as high scoring documents end up roughly the same.

{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "name.no_norms": { "query": "iphone", "minimum_should_match": "67%" }
        }
      },
      "boost_mode": "replace",
      "score_mode": "sum",
      "functions": [
        { "script_score": { "script": "_score / (_score + 1)" } },
        { "field_value_factor": { "field": "click_score", "factor": 0.1, "missing" : 0 } }
      ]
    }
  }
}

Also, this executes a script for every hit, which is not super performant. Using the factor or k you could now influence the weight of the click score.

Another alternative would be to go with a linear retriever depending on your license:

{
  "retriever": {
    "linear": {
      "rank_window_size": 100,
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "match": {
                  "name": { "query": "iphone 14", "minimum_should_match": "67%" }
                }
              }
            }
          },
          "weight": 1,
          "normalizer": "l2_norm"
        },
        {
          "retriever": {
            "standard": {
              "query": {
                "bool": {
                  "should": [
                    { "rank_feature": { "field": "click_score_feature" } }
                  ],
                  "filter": {
                    "match": { "name": { "query": "iphone 14", "minimum_should_match": "67%" } }
                  }
                }
              }
            }
          },
          "weight": 0.1,
          "normalizer": "l2_norm"
        }
      ]
    }
  }
}

I wanted to test the minmax normalizer, but my dataset was too small so I ended up with the same max scores, which normalized a lot of the data simply down to 1 as the score, and then the ordering was relying on the click score, which was not what I wanted (same applies with searching on fields without normalization that creates a lot of similar scores).

I would probably go with the boolean query and the rank_feature in the should part and play around with the different kinds of functions that are supported (saturation, log, sigmoid and linear) as that seems to be the most maintainable solution when looking at the query, albeit losing a bit of control that you get with the retrievers.

Risks

Make sure you still have a search engine and not a ranking engine. If you put too much weight into the click based score, newer products will have a hard time coming up. Also multi year seasonality is ignored. You could add another score for that and have one more should clause. Or have tags for that like a field "tag": [ "black_week", "christmas", "carnival" ], that gets a boost at a certain time of the year.

Summary

Reminder, this example mixed a square root of the daily clicks (to reduce the impact of outliers), an exponential moving sum of these clicks using time decay and saturation of that result. That’s the whole idea behind the click_score.

Again, even though I begin to sound like a broken record, keep in mind that everything here is insanely simplified and should just give you some ideas, how to integrate a factor like clicks or sales into your searches without requiring to query a second system.

Usually you will not have the click data in Elasticsearch, but probably in your analytics system, like Google Analytics or something similar, so make sure it can be extracted. In order to make this easier to maintain, an algorithm that works incrementally would also help, so that there is no need to recalculate that number for every product and every day.

Whatever you change in your search, make sure you have the ability to A/B test built into your infrastructure.

Oh, and if you take anything from this article, it should be that you need some sort of sandbox mechanism with your data in order to help others understand your search.

Thanks for going through this article. I’m not yet sure what the next article will focus on, but we will again be using historical data to improve the present.

Final remarks

If you made it down here, wooow! Thanks for sticking with me. You can follow or contact me on Bluesky, Mastodon, GitHub, LinkedIn or reach me via Email (just to tell me you read this whole thing :-).

If there is anything to correct, drop me a note, and I am happy to fix and append a note to this post!

Same applies for questions. If you have a question, go ahead and ask!