Alexander Reelsen

Backend developer, productivity fan, likes distributed systems & the new serverless era

Search engines & libraries: an overview
Oct 20, 2020
22 minutes read

TLDR; This blog post will give an overview of some lesser known search engines and libraries. I have been an Elasticsearch user for almost a decade now and work for the company behind the Elastic Stack, so I am highly biased. I compare, more or less subconsciously, against Elasticsearch, which is powered by Apache Lucene, a search library that has been around for 20 years and keeps innovating in the space. I will not get into any Lucene based libraries, with the exception of Elasticsearch to get an indexing baseline number.

Quite a few of these newer search engines & libraries are written in either Go or Rust. Many people claim that you cannot write a search engine on the JVM. To me, Lucene has proven those folks wrong for 20 years. One of the major arguments has always been garbage collectors and their unpredictable and long pause times on the JVM. With the rise of Shenandoah and ZGC, I assume that argument will be even harder to hold in the future. If you want to know more, check out these excellent blog posts about modern garbage collection (part 1, part 2; the first one is already a bit older).

Rewriting a search engine in another language is quite the task, so naturally no library will have all the Lucene functionality available - and that’s fine, no need to blame anyone here. Most newer search engines have been written to serve a certain use case, and that use-case may only be a very small subset of what you need to solve a generic search use-case - which is what Apache Lucene tries to cover.

Long story short, I’m still biased and will compare engines to Lucene or Elasticsearch. Deal with it 😀

The sample data set

Elasticsearch is running nightly benchmarks and the data being used is available. You can check out the so-called rally tracks in the GitHub repository and find links to download sample data. I downloaded this file with 2.7 million documents to index. Sounds like a decent size to get some numbers!

A single document looks like this

{
  "@timestamp" : 893964617,
  "clientip" : "40.135.0.0",
  "request" : "GET /images/hm_bg.jpg HTTP/1.0",
  "status" : 200,
  "size" : 24736
}

In order to have some unscientific benchmark data, I indexed this data into Elasticsearch on my three year old MacBook Pro, a quad core i7 with 16 GB of memory.

I’m running Elasticsearch 7.9.2 with all the defaults locally. Single instance, no replicas.

Let’s change the data first to adhere to the ES bulk format and then split it into useful bulk chunks. Thanks to the power of the shell, there is no need for more tools.

cat documents-181998.json \
  | awk '{ print "{ \"index\" : {} }" ; print }' > documents-full-bulk.json

Now let’s split this, ending up roughly with 9MB per file.

# this will create a bunch of smaller files
cat documents-full-bulk.json | split -l 120000
# append a newline to each file to adhere to the bulk format
for i in x* ; do echo >> $i ; done

Next, create this mapping in es-mapping.json

{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date",
        "format" : "epoch_second"
      },
      "clientip": {
        "type": "ip"
      },
      "request": {
        "type": "text"
      },
      "status": {
        "type": "short"
      },
      "size": {
        "type": "long"
      }
    }
  }
}
# create the index
curl -X PUT localhost:9200/logs -d@es-mapping.json --header "Content-Type: application/json"
# do the bulk loading (split named its output files x*)
for i in x* ; do 
  curl -X POST localhost:9200/logs/_bulk --data-binary @$i \
    --header "Content-Type: application/json" > /tmp/bulk-$i.json
done

Indexing the data took about 45 seconds - of course everything ran serially; this could be designed to index in parallel with several clients. The size of the resulting index was 166 MB.

My main goal here was to get some ballpark figures. If you take a look at the mapping, you can see it is probably somewhat more optimized than others. The IP address is an ip data type, and the status code is a short to use as little space as possible. Indexing the IP field as text adds 30 MB to the index size, so mapping is important - and thus support for data types.

That said, the data could potentially be processed even further in order to extract the requested path and the HTTP verb from the request field.

In order to compare each solution, let’s also create a few sample queries

GET logs/_search
{
  "query": {
    "match": {
      "request": "hm_bg.jpg"
    }
  }
}

This executes a full text search on the request field. The first execution returned a took time of fifteen milliseconds; all the follow-up queries took zero or one milliseconds. The data set fits completely in memory, and because I was only interested in the top 10, Lucene can short circuit this search and just tell me that more than 10k documents matched. Even when setting "track_total_hits": true and seeing that 14k hits were found, speeds are roughly the same.

Adding a terms or date_histogram aggregation does not make the search much slower (about 3-4 milliseconds now).

GET logs/_search
{
  "track_total_hits": true, 
  "query": {
    "match": {
      "request": "hm_bg.jpg"
    }
  },
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1d"
      }
    }
  }
}

Aggregating over the whole data set and creating more buckets, based on request size combined with a nested date histogram, results in queries of about 200ms on my notebook when keeping the shard level request cache disabled. That cache is enabled by default and speeds up aggregations quite a lot after the first request in case the data does not change, resulting in faster dashboards for example.

GET logs/_search?request_cache=false
{
  "size": 0,
  "aggs": {
    "by_size": {
      "histogram": {
        "field": "size",
        "interval": 10000
      },
      "aggs": {
        "by_day": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "1d"
          }
        }
      }
    }
  }
}

Let’s dive into other search libraries.

Bleve

The first candidate here is Bleve, which has been around for some time as well. Just by taking a look at the documentation you immediately spot terms well known from Lucene. The whole text analysis chain is basically a 1:1 copy of what Lucene is doing, including terminology. To me that’s a good thing, as reinventing terminology will only lead to more confusion down the road.

Bleve is a library written in Go, which allows you to configure a mapping that consists of analyzers for different fields - again rather similar to Lucene. The queries are also very familiar, and Bleve features a query_string query as well.

There is support for highlighting as well as faceting; however, unlike aggregations in Elasticsearch, it seems you cannot nest facets.

Let’s talk about the gist of searches: scoring. Scoring seems to be TF/IDF based and supports query-based boosting.

It seems to me that Bleve has lost some steam. The video section was last updated 5 years ago, and the last two years of development look a bit slower compared to earlier years. There have been frequent patch releases throughout 2020 though.

There are a couple of applications using Bleve for searching, mainly Couchbase via its secondary indexes capability - judging from the copyright notices of some files Bleve was written with Couchbase in mind (by developers employed by Couchbase).

Let’s try to get up and running. You can clone the repo and run go build, and also switch to the cmd directory and run go build to end up with the CLI. One thing I found neither on the homepage nor in the GitHub repo are downloads. As this is a library, I guess most of the time you just drag it in as a dependency.

Let’s create an index with a mapping first

./cmd/bleve/bleve create http_logs --mapping ../bleve-mapping.json

The mapping looks like this. I copied a little from the tests, as I could not find any concrete documentation.

{
  "default_mapping": {
    "enabled": true,
    "dynamic": true,
    "properties": {
      "@timestamp": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "datetime"
          }
        ]
      },
      "request": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "en",
            "store": true,
            "index": true,
            "include_term_vectors": true,
            "include_in_all": true
          }
        ],
        "default_analyzer": ""
      },
      "status": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "number",
            "include_in_all": true
          }
        ]
      },
      "size": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "number"
          }
        ]
      },
      "clientip": {
        "enabled": true,
        "dynamic": true,
        "fields": [
          {
            "type": "text",
            "analyzer": "keyword",
            "store": true,
            "index": true
          }
        ]
      }
    }
  },
  "type_field": "_type",
  "default_type": "_default",
  "default_analyzer": "en",
  "default_datetime_parser": "dateTimeOptional",
  "default_field": "_all",
  "byte_array_converter": "json",
  "analysis": {}
}

This was my command to index all the data

./cmd/bleve/bleve bulk http_logs ../documents-181998.json --batch 10000 | ts

In case you do not know ts, it is part of moreutils and prepends every printed line with a timestamp.

Importing the data set took about 43 minutes on my MacBook, resulting in about 1050 documents per second and a total index size of 3.4 gigabytes - which is quite the difference to Elasticsearch. However, by default every field is included in the _all field, and term vectors are included as well as doc values - maximum flexibility also means maximum size in this case.

The lack of documentation was my biggest issue. On the other hand, I did not use this library as it was intended to be used, but only via its CLI interface, which I suppose is more of a demo tool.

I also did not find a way to introspect my data and check what took so much space or time.

Let’s try some searches. Counting first

$ ./cmd/bleve/bleve count http_logs
2699193

All right, some documents are missing. I don’t know why; the logs did not indicate any errors though.

$ ./cmd/bleve/bleve fields http_logs
0 - clientip
1 - request
2 - status
3 - size
4 - _all

Ok, so maybe we can get rid of the _all field - also, the above call hangs and never returns on my system. After reindexing with the _all field disabled and term vectors set to false, I ended up with a 1.5 GB index and an indexing time reduced to 16 minutes.
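For reference, this is roughly how the request field looked in my adjusted mapping - the two include_* flags are the only changes compared to the first attempt above:

```json
"request": {
  "enabled": true,
  "dynamic": true,
  "fields": [
    {
      "type": "text",
      "analyzer": "en",
      "store": true,
      "index": true,
      "include_term_vectors": false,
      "include_in_all": false
    }
  ]
}
```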

Running queries works like this

./cmd/bleve/bleve query http_logs 'request:POST'

The above query takes less than 2ms. Switching the above query to GET matches most of the indexed documents and results in a much longer execution, always above 400ms on my system.

Unfortunately querying with the CLI tool is rather simple and adding facets needs to be done programmatically, so I cannot comment on that feature here.

It is interesting to note that by default an index is basically a single file in a directory; unlike Lucene, it is not segment based. I also did not execute any deletes or reindex, so this might change the file structure.

If you want an HTTP API and distributed system features on top of Bleve, you might want to take a look at blast - however, development has rather stalled, with fewer than 50 commits this year. The responses that blast sends back to clients look similar to Elasticsearch’s, even though they are not the same.

When integrating Bleve natively into your Go applications, it is much more powerful than what I showed here - this is also what the docs revolve around.

In terms of speed, it seems that its core has not caught up with what Lucene is capable of nowadays, like implementing block-max WAND, which can speed up searches tremendously.

Also, there are different index stores in the source, but they are not documented much, so I have no idea if changing some settings would lead to very different results.

Tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

That pretty much sums it up. Tantivy is a lot more up-to-date than Bleve, featuring BM25 scoring, multi threaded indexing and block-max WAND support. When reading the Rust docs it becomes clear that Lucene was a big inspiration here as well - even though a more recent Lucene than for Bleve.

There is a dedicated tantivy-cli GitHub project, so let’s take a look at that one. I built it via cargo install tantivy-cli, which was 0.13.1 at the time of writing.

Next up is to create a directory and put a mapping in it, before indexing

mkdir tantivy-index
touch tantivy-index/meta.json

The meta.json file looked like this

{
  "segments": [],
  "schema": [
    {
      "name": "timestamp",
      "type": "i64",
      "options": {
        "indexed": true,
        "fast": "single",
        "stored": false
      }
    },
    {
      "name": "clientip",
      "type": "text",
      "options": {
        "indexing": {
          "record": "position",
          "tokenizer": "raw"
        },
        "stored": true
      }
    },
    {
      "name": "request",
      "type": "text",
      "options": {
        "indexing": {
          "record": "position",
          "tokenizer": "simple"
        },
        "stored": true
      }
    },
    {
      "name": "status",
      "type": "i64",
      "options": {
        "indexed": true,
        "fast": "single",
        "stored": true
      }
    },
    {
      "name": "size",
      "type": "i64",
      "options": {
        "indexed": true,
        "fast": "single",
        "stored": true
      }
    }
  ],
  "opstamp": 0
}

The index could probably be optimized further. By default tokens longer than 40 characters are removed, so this might already be a problem for longer URLs in the request field.

In order to index the data, you can replace the @timestamp field with timestamp and reuse the downloaded rally tracks file. Also, the timestamp field is a number: you cannot specify a custom date format for a real date field in Tantivy, as it seems only RFC3339 formatted dates are supported right now, so I left it as a number.
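Renaming the field is another shell one-liner - a sketch using sed; the documents-tantivy.json output name is the file used for indexing in the next step:

```shell
# rename the @timestamp field to timestamp to match the Tantivy schema
sed 's/"@timestamp"/"timestamp"/' documents-181998.json > documents-tantivy.json
```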

cat documents-tantivy.json | tantivy index -i tantivy-index

The result was fast indexing, which took roughly 20 seconds. Again, this is not comparable to Elasticsearch due to different data types. Also, the size of the index was much smaller with the above mapping, roughly 145 MB, which could be reduced further by running tantivy merge - down to 136 MB in my case. Speed and index size are in another league than Bleve.

Querying on the command line looks like this

tantivy search --index tantivy-index --query 'status:400'

Searches are quite fast, but when you search for something with a lot of hits (e.g. status:200) it slows down, because all the data gets returned.

However, you can start a webserver and run queries against it with a little bit more control. Start the webserver via

tantivy serve --index tantivy-index

Then query like this

curl http://localhost:3000/api/?q=status:200

The syntax loosely resembles the Lucene classic query parser. It supports ranges, AND, OR, exclusions, boosting as well as phrases.

This only returns the first 10 documents by default; however, you can use the offset and nhits parameters to support pagination.

A returned hit looks like this:

{
  "score": 0.22429849,
  "doc": {
    "clientip": [
      "57.0.0.0"
    ],
    "request": [
      "GET /images/hm_f98_top.gif HTTP/1.0"
    ],
    "size": [
      915
    ],
    "status": [
      200
    ]
  },
  "id": 1
}

Each hit contains a score, and each field is returned as an array, so multi values are easy to support I assume.

The HTTP interface is somewhat limited compared to what you can do with the library, like writing your own custom queries based on a boolean query. Highlighting is supported but not exposed in the HTTP interface.

Even though distributed features are currently marked as out of scope, there are some solutions trying to implement this on top of Tantivy, for example Toshi - its query DSL looks a lot like Elasticsearch as well. Another distributed system based on Tantivy is Bayard. What I was missing from the docs in both cases are descriptions of fault events, like nodes no longer being reachable, and the level of resiliency/automatic repair.

Also, Tantivy features an extensive benchmarking suite comparing a few search libraries. This is quite awesome and you should take a look.

The author of Tantivy, Paul Masurel, has recently announced that he is going to work full time on Tantivy, so I expect more things to come here (OHAI aggregations :-)!

MeiliSearch

The award for the fastest time to get up and running in this blog post easily goes to MeiliSearch. MeiliSearch can be installed via brew, and all it takes to start up is running meilisearch - you end up with a service running on port 7700.

Also, the official documentation is really nice and comprehensible.

MeiliSearch is not able to generate IDs automatically. This implies that it does not make much sense to use it for the time series and logging use-case.

The getting started experience is quite good as well. The documentation has a sample movie data set and MeiliSearch has a super simple web UI to test your data set after importing it.

MeiliSearch approaches scoring differently compared to libraries using BM25/TF-IDF scoring. There is a well explained ranking rules page in the documentation. There is no formula, but a couple of explanations: words without typos get ranked higher, documents with more matching terms are ranked higher, and proximity and position in the document are taken into account as well as exactness. You can also add your own attributes, like taking the popularity or the remaining stock into account for ranking.

Faceting is also supported, albeit only single dimensional if I read the docs correctly. Highlighting, as well as returning information about which part of a document has matched, is also possible.

From a security perspective, there is an API that allows read-only operations directly from the browser to MeiliSearch.

Let’s show some examples based on the movies data set. A regular query looks like this (using httpie):

echo '{ "q": "wars" }' \
  | http --json GET http://localhost:7700/indexes/movies/search

The response looks like this

{
  "hits": [
    {
      "id": "293982",
      "title": "Warsaw 44",
      "poster": "...",
      "overview": "...",
      "release_date": 1411088400
    },
    {
      "id": "181808",
      "title": "Star Wars: The Last Jedi",
      "poster": "...",
      "overview": "...",
      "release_date": 1513123200
    },
    ...
  ],
  "offset": 0,
  "limit": 20,
  "nbHits": 96,
  "exhaustiveNbHits": false,
  "processingTimeMs": 1,
  "query": "wars"
}

The first obvious thing is that there is no score returned, even though some sort of ranking still has to be calculated. The documents are sorted, but there is no number indicating anything - which again is one of the goals of the search engine: to not confuse the user with such data.

Interestingly enough, the scoring formula did not work too well here from a user perspective, as a partial term match ranks above a full term match (the term Warsaw occurs twice in the overview and thus is deemed more important than Wars of Star Wars in the title). As already mentioned, you can tune that.

Let’s make exactness the most important criterion

curl \
  -X POST 'http://localhost:7700/indexes/movies/settings/ranking-rules' \
  --data '[
      "exactness",
      "typo",
      "words",
      "proximity",
      "attribute",
      "wordsPosition"
  ]'

Wait a bit or poll for the update to be finished - the above request returns an ID that can be polled - and then rerun the above search.

What I do miss is some more insight into the different ranking factors, something like the Explain API or the explain parameter of a search in Elasticsearch.

Let’s check out filtering first

curl http://localhost:7700/indexes/movies/search \
  -d '{"q":"wars","filters": "release_date >= 1542490000" }'

You can basically have a bunch of filter criteria in there, which are kept separate from the query itself.

The supplied data set does not allow faceting on anything, as faceting currently seems to be limited to counting by a grouping criterion like a genre (which is not part of the movies data set). From an Elasticsearch perspective this is a terms aggregation, but it misses aggregations based on histograms or date ranges - which I assume are coming.

One interesting part about the faceting implementation is how it easily caters for the use-case of filtering within certain facets, so you can emulate the classic ecommerce navigation with filters on the left based on the displayed facets. See the documentation for a concrete example.

Lastly, highlighting:

curl http://localhost:7700/indexes/movies/search \
  -d '{"q":"wars","attributesToHighlight": ["title"] }'

This returns an additional JSON data structure called _formatted that uses <em> tags to highlight the found fields.

Also, there do not seem to be custom data types like dates (you can use a number instead) or IP addresses - which is fine for a small use-case without any logging needs, but rather for ecommerce/wiki data.

Even though the whole communication is based on HTTP, there are already dedicated clients for JavaScript, Ruby, Python, Go, Swift, Rust and PHP (+Laravel).

For the fun of it, I changed my JSON logging data set to have a unique id per entry and then indexed it into MeiliSearch, which took about 28 minutes. It seemed that only a single CPU was in use the whole time. Also, the indexing requests were asynchronous: even though my HTTP requests including the data returned immediately, part of the body was an updateId, which I could poll to check whether the operation had finished. The most interesting part was the index size: I ended up with 6 GB on disk - which is really huge. Then again, I indexed without optimizing or configuring anything, I just used all the defaults. And again, if you use this for the ecommerce/workplace search use-case, you possibly have far fewer documents, and this will be just fine.
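Adding the unique id was another shell one-liner; a sketch using awk that injects the line number as an id field (the output file name is my own choice):

```shell
# prepend an incrementing id field to every JSON document
awk '{ sub(/^\{/, "{ \"id\": " NR ","); print }' documents-181998.json > documents-meili.json
```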

MeiliSearch is backed by Meili, who recently got their first round of funding. To me, MeiliSearch looks like a direct competitor to Algolia. They offer similar free tools like doc scrapers (the Meili one was forked from the Algolia one) and integrations with a lot of web development tooling. The alternatives comparison mentions this as well.

All in all, MeiliSearch is a bit different from the other libraries, and this might actually help them a lot in the future, as many factors are easier to comprehend for less search-affine users. Also, I suppose that with the recent funding, topics like MeiliSearch-as-a-Service, a neat UI on top of that and becoming a distributed system to improve safety/reliability are probably on the priority list.

Others & Honorable mentions

Noise

Noise is another search tool written in Rust. It was started by Damien Katz (one of the authors of CouchDB) together with Volker Mische. Volker also gave a talk at Berlin Buzzwords 2018 and at the Munich Search Meetup. Even though equipped with a few interesting concepts, it seems to have lost some steam in recent months.

Rucene

Another Rust port of Lucene, creatively named Rucene. The GitHub README states that it is based on Lucene 6.2.1, which is rather old. However, there has been constant activity this year, so it is hard to tell if this is up-to-date.

Also there is no documentation.

Typesense

Typesense is another competitor to Algolia and MeiliSearch. They also offer their product as a service (with a delightfully transparent pricing calculator).

Typesense is written in C++, can run in a clustered mode and has a JSON API via HTTP to query and return data, along with a couple of clients for different programming languages.

It supports highlighting, faceting, curation of results (pushing results at the top, for example promoted documents), aliases and API keys. It also integrates with instantsearch to easily implement in-browser searches.

Honorable mentions

Milvus is an open source vector similarity search engine, allowing you to search using different similarity metrics (euclidean/jaccard distance etc). It also runs as a cluster and provides several SDKs, but serves a different use-case than classic full text search.

Riot is another very Lucene-like library written in Go. It supports sharding of the data and running across several machines, but I could not find any further documentation - which is rather slim in general. There have not been a lot of updates in the last 18 months.

Something completely different, but also an emerging trend, are search engines for static sites. Instead of using something that requires scraping your site, you can create all those data structures at build time and offer a built-in search. One of those solutions is tinysearch, which is written in Rust but uses WebAssembly to run in the browser.

One more search engine that was a good candidate for a closer look is Sonic, another search engine written in Rust. It is not using HTTP, but its own cleartext TCP protocol. The main blocker was the lack of user facing documentation, even though there is an interesting description of the inner workings. Sonic uses LSM trees via RocksDB to store data and also creates FSTs for suggestions/auto-complete.

PISA - Performant Indexes and Search for Academia. Love the title. PISA is written in C++ and intended as a playground for novel ideas from academia. However it requires all the data to fit in memory.

Terrier is another academic open source project written in Java. Terrier supports different scoring models as well as learning to rank. Interestingly enough there is also a library to read Lucene indexes into Terrier. Also Terrier has quite a bit of documentation in the doc/ directory within the GitHub repository.

To finish the academic part, there is also Anserini, an open source IR toolkit on top of Lucene.

This was just a small overview. There are dozens of other products out there, some with a specific use-case like Pilosa (a distributed bitmap index), cloud-only solutions like Rockset with a focus on analytics, or even Debian Code Search - Michael Stapelberg has a couple of interesting posts in his blog about how he improved the code search over time. The dcs GitHub repository is also worth a look. There are also extensions to existing products like RediSearch, but dubious benchmarks have made me not look further into it. Finally, there is Vespa, another distributed search & recommendation engine. There was also an interesting blog post from OkCupid describing the recommendation use-case for their dating app. This talk might also be worth a look.

Update: Charlie Hull pointed me via a tweet to Xapian, another library with a lot of bindings written in C++.

Update 2: Another mention on Twitter was about including search functionality from existing databases. This is actually a good point, as databases like Postgres feature full text search capabilities as well. Postgres uses specialized index types called GIN and GiST. There is a great series of blog posts about the different index types in PostgreSQL, one of them covering GIN. Also, ArangoDB, a multi model database, integrates search operations into its query language.

Update 3: I forgot another long-standing open source search solution: Sphinx. However, development seems to have slowed down, with 6 releases in the last three years.

Summary

As you can see, I started with a certain data set at the top, which was not really useful for all the search engines. Just a reminder that you should probably start off differently if you try to compare several search engines 😱

I’m excited for the future of search engines. Competition will lead to innovation in this space.

That said, there are still a lot of features that only Lucene has, especially when it comes to more complex scoring like rank feature or distance feature types/queries, two phase iteration to optimize query execution independent of the query order, or index sorting. This would be easy to implement for quite a few of those search engines, but OTOH they often also serve different use-cases and handle way less data (think of an ecommerce shop with several tens of thousands of products), where certain optimizations are not needed yet. One would rightfully focus on different features to solve that use-case; see another of my blog posts about implementing a modern ecommerce search.

Final remarks

If you made it down here, wooow! Thanks for sticking with me. You can follow or ping me on twitter, GitHub or reach me via Email (just to tell me, you read this whole thing :-).

If there is anything to correct, drop me a note, and I am happy to do so and append to this post!

Same applies for questions. If you have question, go ahead and ask!

If you want me to speak about this, drop me an email!


Back to posts