0.10.0 - ParadeDB

Upgrading `pg_search`

Upgrading to pg_search 0.10.0 from any prior version requires dropping the pg_search extension using DROP EXTENSION pg_search CASCADE; and re-creating it with CREATE EXTENSION pg_search;. ALTER EXTENSION pg_search UPDATE; is not supported for this release.

Dropping the extension will also drop everything in your schema that has a dependency on the pg_search extension or the paradedb schema, including existing BM25 indexes. As a result, all BM25 indexes need to be re-created using CALL paradedb.create_bm25(...).

The primary reason for this is that we’ve changed where we physically store the underlying indexes on disk. This change resolves a situation where an index from database A could overwrite an index from database B that also happen to share the same internal OID (#1651). Additionally, we’ve improved the representation of our internal document id field (“ctid”), which leads to smaller indexes (#1584). Unfortunately, neither of these changes are backwards compatible. Our intent is that this will be the last release causing on-disk breaking changes.

Breaking Changes 🚨

New File Paths

As mentioned above, a big breaking change is that we now store index files in a different location on disk. The location is still relative to the $PGDATA directory, but is now organized by database id, index id, and the index’s assigned “file number”.

Primarily, this ensures databases won’t inadvertently overwrite each other’s indexes. Secondarily, it allows pg_search to better interoperate with transaction handling around CALL paradedb.create_bm25() and CALL paradedb.drop_bm25().

Compact Internal IDs

We now more compactly store “ctid” values in the index. This is a breaking change as v0.10.0’s understanding of this field is not compatible with prior versions.

On large indexes (>10G) we’ve seen about a 2% savings in disk storage. The savings will be even greater for larger indexes.

Cleanup after `DROP INDEX`

Prior to 0.10.0, dropping a BM25 index would not remove the old index files from disk. As of 0.10.0, the physical index files are deleted during CALL paradedb.drop_bm25(). This also extends to the DROP INDEX statement, and other statements that may lead to dropping an index, such as DROP SCHEMA and DROP TABLE.

Object Dependencies

pg_search now creates internal dependencies between the SCHEMA and INDEX pg_search creates (in both directions). The result is that if one object is dropped, the other will be too. Doing so ensures database objects are cleanly removed during schema modifications.

One Index Per Table

Until now, pg_search has allowed creating multiple BM25 indexes on a table.

However, when searching it wasn’t always guaranteed the specified index would actually be used. As of 0.10.0, only one BM25 index can be created per table.

`paradedb.fuzzy_term` Argument Defaults

While minor, it’s worth noting that the default prefix argument to the paradedb.fuzzy_term() function has changed to false. This change could impact search results of queries using this function. The new behavior, by default, matches more documents.

Stability Improvements 💪

`JSON` Fields Don’t Crash

Tables with a column of type json would cause a crash during indexing if they were used in the set of json_fields => during CALL paradedb.create_bm25(). This has been resolved.

Background Worker Doesn’t Terminate Unexpectedly

The background worker responsible for performing index write operations is now more resilient to unexpected errors. It was possible for it to exit early in some situations, which could then lead to a client backend crashing.

Improved `COMMIT`/`ABORT` Handling

0.10.0 improves the code around COMMIT and ABORT, resolving issues where a COMMIT wouldn’t always happen after an ABORT.

Vacuums Remove Dead Rows

0.9.3 introduced a bug where VACUUM wouldn’t remove dead rows. This has also been resolved.

Improved Locking Primitives

Similar to above, our internal locking structures are now more resilient to unexpected errors. This is an improvement for the code along with user-facing error propagation.

New Features 🎉

Improved Query Planner Integration

pg_search now supports more Postgres query plan types and provides greatly improved query cost estimations.

With this, our pseudo-secret @@@ operator is now ready for direct use.

The documentation covers its usage, but it’s worth mentioning that using the @@@ operator in a WHERE clause, as opposed to SELECTing directly from idxname.search(...), can lead to drastically improved query execution times in situations where scores (or the effects of scoring on ordering) are not necessary.

The @@@ operator also allows combining standard SQL WHERE clause predicates with advanced text-search queries. Additionally, it’s our recommended way of working with JOINed tables where one (or both) sides of the JOIN require some kind of text-search filtering.

Tokenization Configuration

It’s now possible to configure token filters like lowercase and stemmers for any tokenizer.

Introduce `fuzzy_phrase` Query

The fuzzy_phrase query is a new query builder function that makes it easier to perform fuzzy search over phrase queries.

Lenient and Conjunction Configs

Added lenient query parsing and conjunction configs to the search function.

PostGIS support

ParadeDB’s Dockerfile now includes support for PostGIS.

Documentation 📚

We’ve completely overhauled our documentation. Please refer to them for more details on new features in this release.

New Contributors 👋

@aalexandrov made their first contribution in #1583
@wendyzhan05 made their first contribution in #1617

Full Changelog

The full changelog is available here.

Changelog

​Upgrading pg_search

​Breaking Changes 🚨

​New File Paths

​Compact Internal IDs

​Cleanup after DROP INDEX

​Object Dependencies

​One Index Per Table

​paradedb.fuzzy_term Argument Defaults

​Stability Improvements 💪

​JSON Fields Don’t Crash

​Background Worker Doesn’t Terminate Unexpectedly

​Improved COMMIT/ABORT Handling

​Vacuums Remove Dead Rows

​Improved Locking Primitives

​New Features 🎉

​Improved Query Planner Integration

​Tokenization Configuration

​Introduce fuzzy_phrase Query

​Lenient and Conjunction Configs

​PostGIS support

​Documentation 📚

​New Contributors 👋

​Full Changelog