0.20.0 promotes the v2 API to default. The v2 API should be used for all new indexes. The original API is now referred to as the legacy API and will be removed in a future version.

While there are no breaking changes, the default tokenizer has been changed from simple to unicode_words, which will affect new indexes that don't specify tokenizers for all columns.

Thanks to our contributors:
- buriedpot - Fixed ExecutorRun hook handling (#3461)
- matthew p robertson - Added trim token filter (#3545)
- Daniil Tatarinov - Implemented JSON key sorting and GROUP BY NULL handling (#3479, #3454)
New Features 🎉
Search Aggregation and Faceting
ParadeDB 0.20 introduces powerful search aggregation capabilities through the new pdb.agg() function. This function can be used in two ways to push analytics down into the Tantivy index for optimal performance.
The first is as a window function for fast faceting alongside TopN queries:
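The original sample query for this was not preserved in these notes, so the sketch below is illustrative only: it assumes pdb.agg() accepts a Tantivy-style JSON aggregation definition and can run as a window function over the matching rows, and it uses a hypothetical table and columns.

```sql
-- Hypothetical example: top 10 matches plus facet counts in one pass.
-- Assumes pdb.agg() takes a Tantivy-style JSON aggregation definition
-- and runs as a window function over the search result set.
SELECT
    id,
    description,
    pdb.agg('{"by_rating": {"terms": {"field": "rating"}}}') OVER () AS facets
FROM mock_items
WHERE description @@@ 'shoes'
ORDER BY id
LIMIT 10;
```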
The second is as a regular aggregate function. In addition, plain aggregates such as COUNT(*) are automatically routed to use this optimized path when possible.
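For the automatic routing case, a minimal illustration (hypothetical table; @@@ is ParadeDB's search operator):

```sql
-- Plain aggregates over a search predicate are routed to the
-- Tantivy index automatically when possible; no special syntax
-- is required.
SELECT COUNT(*)
FROM mock_items
WHERE description @@@ 'shoes';
```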
v2 API
The v2 API has reached feature parity with the legacy API (except for custom stopwords and dismax). It has now been promoted to the default API and is documented at http://docs.paradedb.com/documentation. The legacy API documentation remains available and will be removed in a future version. As a reminder, the v2 API has the following improvements (a sketch of the new syntax follows the list):

- Index creation using SQL rather than JSON blobs
- Tokenizers as Postgres types
- Columnar fast fields by default for all non-text and literal-tokenized types, removing the need to configure them manually
- Improved SQL query API, optimizing for both developer experience AND ORM integration
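As a rough illustration of these points, here is a sketch of a v2-style index definition. The cast-to-tokenizer-type syntax and all names are assumptions based on the descriptions above, not copied from the docs:

```sql
-- Hypothetical schema and assumed v2 syntax: the cast to a pdb
-- tokenizer type configures tokenization per field, and the integer
-- column gets columnar fast fields by default.
CREATE TABLE mock_items (
    id          serial PRIMARY KEY,
    description text,
    rating      integer
);

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.unicode_words), rating)
WITH (key_field = 'id');
```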
New Default Tokenizer: The default tokenizer is now pdb.unicode_words, which splits text based on the Unicode Standard Annex #29 rules for better international text support.
Text Array Tokenization: Arrays of text can now be tokenized directly in the v2 API, enabling more flexible document structures.
New Token Filters: Added a trim token filter that removes leading and trailing whitespace from tokens, improving search precision.
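To make the array case concrete, a hedged sketch under the same assumed v2 cast syntax (schema hypothetical; the trim filter's configuration syntax is not shown in these notes, so it is only referenced in comments):

```sql
-- Hypothetical: a text[] column indexed directly in the v2 API, with
-- each array element tokenized like ordinary text (assumed syntax).
-- The new trim token filter can additionally strip leading and
-- trailing whitespace from tokens (configuration not shown).
CREATE TABLE docs (
    id   serial PRIMARY KEY,
    tags text[]
);

CREATE INDEX docs_idx ON docs
USING bm25 (id, (tags::pdb.unicode_words))
WITH (key_field = 'id');
```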
Performance Improvements 🚀
Write Throughput
We have made significant improvements to ParadeDB's write throughput through two major architectural changes.

The first is enabling mutable segments by default, which are designed to incur minimal overhead during single-row writes. The overhead of tokenizing, serializing, and flushing an immutable segment is now completely eliminated for these operations.

The second improvement is default background merging. All merging operations now happen in background threads by default, dramatically improving write performance by removing merge overhead from the critical write path. The system allows up to 2 concurrent background mergers for optimal resource utilization.
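For intuition, the hot path this removes work from is just an ordinary insert (hypothetical table):

```sql
-- With mutable segments on by default, a single-row write appends to
-- a mutable segment instead of building, serializing, and flushing a
-- new immutable segment. Merging into larger segments happens later
-- in background threads (up to 2 concurrent mergers by default).
INSERT INTO mock_items (description, rating)
VALUES ('blue running shoes', 4);
```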
Query Performance
Window Aggregate Pipelining: Implemented pipelined execution of window aggregates, significantly improving performance for analytical queries that combine search with aggregation operations.
Optimized Large Term Sets: Added a fast field variant of TermSet for queries involving very large sets of terms, reducing memory usage and improving query response times.
Reduced Memory Copying: Eliminated unnecessary Postgres buffer copies during query processing, reducing CPU overhead and improving throughput for complex queries.
Numeric Data Type Support: ParadeDB now supports pushing down numeric data types in aggregate queries, enabling efficient calculations on decimal fields like prices and financial data.
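As a small illustration of the numeric pushdown (hypothetical table with a numeric price column):

```sql
-- Aggregates over numeric columns combined with a search predicate
-- can now be pushed down into the Tantivy index.
SELECT SUM(price) AS total, AVG(price) AS average
FROM mock_items
WHERE description @@@ 'shoes';
```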
The full changelog is available on the GitHub Release.