pg_bm25: a Rust-based Postgres extension that significantly improves Postgres’ full text
pg_bm25 is named after BM25, the algorithm used by modern search engines to calculate the
relevance scores of search results.
Today, Postgres’ native full text search, which uses the
tsvector type, has two main problems:
- Performance: Searching and ranking over large tables is sluggish. When tables grow to millions of rows, a single full text search can take several minutes.
- Functionality: Postgres has no support for operations like fuzzy search, relevance tuning, or BM25 relevance scoring, which are the bread and butter of modern search engines.
pg_bm25 aims to bridge the gap between the native capabilities of Postgres’ full text search and those of a specialized search engine like ElasticSearch.
The goal is to eliminate the need to bring a cumbersome service like ElasticSearch into the data stack.
Some features of
- 100% Postgres native, with zero dependencies on an external search engine
- Built on top of Tantivy, a Rust-based alternative to the Apache Lucene search library
- Query times over 1M rows are 20x faster compared to
ts_rank, Postgres’ built-in full text search and sort functions
- Support for fuzzy search, aggregations, highlighting, and relevance tuning
- Relevance scoring uses BM25, the same algorithm used by ElasticSearch
- Real-time search — new data is immediately searchable without manual reindexing
pg_bm25 stands on the shoulders of several open-source giants. The goal of this blog post is to recognize these projects
and to share how
pg_bm25 was built.
The Shoulders of Giants
Putting a search engine inside of Postgres is hard. A few projects have attempted it, but with one caveat: every single one has relied on an external ElasticSearch instance. This means introducing a complex and expensive piece of infrastructure into the data stack. Perhaps the best-known example of this kind of design is a Postgres extension called ZomboDB.
In 2016, an open source search library called Tantivy emerged. Tantivy was designed as a Rust-based alternative to Apache Lucene, the search library that powers ElasticSearch. Three years later, a library called pgrx — built by the same author of ZomboDB — made it possible to build Postgres extensions in Rust. Combined, these projects laid the groundwork for a Postgres extension that could create Elastic-quality search experiences within Postgres.
Creating the Inverted Index
Like ElasticSearch, the backbone of Tantivy’s search engine is a data structure called the inverted index, which stores a mapping from words to their locations in a set of documents. An inverted index is like the table of contents of a book — without it, you might have to examine every page to find a specific chapter.
Rather than creating this inverted index externally,
pg_bm25 stores the
index inside Postgres as a new, Postgres-native index type, which we call the BM25 index. This is made possible
through Postgres’ index access method API.
When a BM25 index is created, Postgres automatically updates it as new data arrives
or is deleted in the underlying SQL table. In this way,
pg_bm25 enables real-time search without any
additional reindexing logic.
Building the SQL Interface
Following index creation, the next step was to expose an intuitive SQL interface for users to write search queries.
This was accomplished through the Postgres operator API,
which enables the creation of custom Postgres operators. We chose the
@@@ operator to signify the beginning
of a query to the BM25 index in homage to the
@@ operator used by Postgres’ native full text search.
The end result is the ability to search any table with a single SQL query.
SELECT * FROM my_table WHERE my_table @@@ '"my query string"'
Wherever possible, we designed the SQL interface, ParadeQL, to transparently mirror Tantivy’s API. For instance,
the right-hand side of the
@@@ operator accepts Tantivy’s mini query language and configuration options.
SELECT * FROM my_table WHERE my_table @@@ 'description:keyboard^2 OR electronics:::fuzzy_fields=description&distance=2'
On a table with one million rows,
pg_bm25 indexes 50 seconds faster than
tsvector and ranks results
20x faster. Indexing and search times are nearly identical to those of a dedicated ElasticSearch instance.
With further optimizations, we’re aiming to reduce the query times compared to ElasticSearch
by an additional 2x.
More detailed benchmark results can be found in the README.