Overview - ParadeDB

Basic Usage

ParadeDB supports aggregates through plain SQL functions like COUNT, SUM, and AVG. For instance, the following query tallies the total number of search results for a full text query.

SELECT COUNT(*) FROM mock_items
WHERE description @@@ 'shoes';

Fast Aggregates

The performance of aggregates over very large result sets (millions of rows or more) is a well-known problem in Postgres. In ParadeDB Enterprise, aggregates over full text search results are drastically accelerated by fast fields. Fast fields are stored in a column-oriented fashion, which makes them much quicker to read and process. The following steps are necessary to prepare the ParadeDB instance for fast fields.

Configure Parallel Workers

ParadeDB Enterprise uses Postgres parallel workers. By default, Postgres allows two workers per parallel query. Increasing the number of parallel workers allows parallel queries to use all of the available hardware on the host machine and can deliver significant speedups. For instance, if the host machine has 16 CPUs, max_worker_processes, max_parallel_workers, and max_parallel_workers_per_gather should be set to 16.

Remember to restart Postgres in order for postgresql.conf changes to take effect.

Run `VACUUM`

VACUUM updates the table’s visibility map, which speeds up Postgres’ visibility checks. Specifying INDEX_CLEANUP false allows VACUUM to run much faster over large tables by skipping the cleanup of dead rows in the index.

Make sure to pass the table name, not the index name, to VACUUM.

VACUUM (INDEX_CLEANUP false) mock_items;

If the table experiences frequent updates, we recommend configuring autovacuum.

Run `pg_prewarm`

The pg_prewarm extension can be used to preload data from the index into the Postgres buffer cache, which improves the response times of “cold” queries (i.e. the first search query after Postgres has restarted).

CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('search_idx');

Index Fast Fields

You can check which fields are indexed as fast by running paradedb.schema.

SELECT name, fast FROM paradedb.schema('search_idx') ORDER BY fast;

Expected Response

     name     | fast
--------------+------
 metadata     | f
 description  | f
 category     | f
 id           | t
 in_stock     | t
 rating       | t
 weight_range | t
 created_at   | t
 ctid         | t
(9 rows)

In order for a query to be accelerated, all of the returned columns must be fast fields. For instance, the following query is accelerated because rating is a fast field, and COUNT(*) does not specify any fields.

SELECT rating, COUNT(*) FROM mock_items
WHERE description @@@ 'shoes'
GROUP BY rating LIMIT 5;

You can verify if a query will be accelerated by running EXPLAIN. Accelerated queries will show either a Custom Scan or Parallel Index Only Scan in the query plan.

Expected Response

                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.04..10.09 rows=3 width=12)
   ->  GroupAggregate  (cost=10.04..10.09 rows=3 width=12)
         Group Key: rating
         ->  Sort  (cost=10.04..10.05 rows=3 width=4)
               Sort Key: rating
               ->  Custom Scan (ParadeDB Scan) on mock_items  (cost=10.00..10.02 rows=3 width=4)
                     Table: mock_items
                     Index: search_idx
                     Segment Count: 1
                     Exec Method: MixedFastFieldExecState
                     Fast Fields: rating
                     Numeric Fast Fields: rating
                     Scores: false
                     Tantivy Query: {"with_index":{"query":{"parse_with_field":{"field":"description","query_string":"shoes","lenient":null,"conjunction_mode":null}}}}
(14 rows)

On the other hand, a query that selects a non-fast field like description is not accelerated.

SELECT description, COUNT(*) FROM mock_items
WHERE description @@@ 'shoes'
GROUP BY description LIMIT 5;

If we run EXPLAIN on this query, we’ll see a regular Index Scan, which indicates that no fast field optimization occurred.

EXPLAIN SELECT description, COUNT(*) FROM mock_items
WHERE description @@@ 'shoes'
GROUP BY description LIMIT 5;

Expected Response

                                                                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=14.04..14.08 rows=2 width=40)
   ->  GroupAggregate  (cost=14.04..14.08 rows=2 width=40)
         Group Key: description
         ->  Sort  (cost=14.04..14.04 rows=2 width=32)
               Sort Key: description
               ->  Index Scan using search_idx on mock_items  (cost=10.00..14.03 rows=2 width=32)
                     Index Cond: (id @@@ '{"with_index":{"query":{"parse_with_field":{"field":"description","query_string":"shoes","lenient":null,"conjunction_mode":null}}}}'::paradedb.searchqueryinput COLLATE "default")
(7 rows)

That’s it! ParadeDB is now ready for fast aggregates.

Documentation

​Basic Usage

​Fast Aggregates

​Configure Parallel Workers

​Run VACUUM

​Run pg_prewarm

​Index Fast Fields

Basic Usage

Fast Aggregates

Configure Parallel Workers

Run `VACUUM`

Run `pg_prewarm`

Index Fast Fields