# Date Histogram Source: https://docs.paradedb.com/documentation/aggregates/bucket/datehistogram Count the number of occurrences over fixed time intervals The date histogram aggregation constructs a histogram for date fields. ```sql theme={null} SELECT pdb.agg('{"date_histogram": {"field": "created_at", "fixed_interval": "30d"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"buckets": [{"key": 1679616000000.0, "doc_count": 14, "key_as_string": "2023-03-24T00:00:00Z"}, {"key": 1682208000000.0, "doc_count": 27, "key_as_string": "2023-04-23T00:00:00Z"}]} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/bucket/struct.DateHistogramAggregationReq.html) for all available options. # Filters Source: https://docs.paradedb.com/documentation/aggregates/bucket/filters Compute aggregations over multiple filters in one query The filters aggregation allows a single query to return aggregations for multiple search queries at a time. To use this aggregation, pass `pdb.agg` to the left-hand side of `FILTER` and a search query to the right-hand side. For example: ```sql theme={null} SELECT pdb.agg('{"value_count": {"field": "id"}}') FILTER (WHERE category === 'electronics') AS electronics_count, pdb.agg('{"value_count": {"field": "id"}}') FILTER (WHERE category === 'footwear') AS footwear_count FROM mock_items; ``` ```ini Expected Response theme={null} electronics_count | footwear_count -------------------+---------------- {"value": 5.0} | {"value": 6.0} (1 row) ``` # Histogram Source: https://docs.paradedb.com/documentation/aggregates/bucket/histogram Count the number of occurrences over some interval The histogram aggregation dynamically creates buckets for a given `interval` and counts the number of occurrences in each bucket. Each value is rounded down to its bucket. For instance, a rating of `18` with an interval of `5` rounds down to a bucket with key `15`. ```sql theme={null} SELECT pdb.agg('{"histogram": {"field": "rating", "interval": "1"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"buckets": [{"key": 1.0, "doc_count": 1}, {"key": 2.0, "doc_count": 3}, {"key": 3.0, "doc_count": 9}, {"key": 4.0, "doc_count": 16}, {"key": 5.0, "doc_count": 12}]} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/bucket/struct.HistogramAggregation.html) for all available options. # Range Source: https://docs.paradedb.com/documentation/aggregates/bucket/range Count the number of occurrences over user-defined buckets The range aggregation counts the number of occurrences over user-defined buckets. The buckets must be continuous and cannot overlap. 
```sql theme={null} SELECT pdb.agg('{"range": {"field": "rating", "ranges": [{"to": 3.0 }, {"from": 3.0, "to": 6.0} ]}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg ---------------------------------------------------------------------------------------------------------------------------------------------------------------- {"buckets": [{"to": 3.0, "key": "*-3", "doc_count": 4}, {"to": 6.0, "key": "3-6", "from": 3.0, "doc_count": 37}, {"key": "6-*", "from": 6.0, "doc_count": 0}]} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/bucket/struct.RangeAggregation.html) for all available options. # Terms Source: https://docs.paradedb.com/documentation/aggregates/bucket/terms Count the number of occurrences for each value in a result set If a text or JSON field is in the `GROUP BY` or `ORDER BY` clause, it must use the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. A terms aggregation counts the number of occurrences for every unique value in a field. For example, the following query groups the `mock_items` table by `rating`, and calculates the number of items for each unique `rating`. ```sql theme={null} SELECT rating, pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating LIMIT 10; ``` ```ini Expected Response theme={null} rating | agg --------+----------------- 4 | {"value": 16.0} 5 | {"value": 12.0} 3 | {"value": 9.0} 2 | {"value": 3.0} 1 | {"value": 1.0} (5 rows) ``` Ordering by the bucketing field is supported: ```sql theme={null} SELECT rating, pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating ORDER BY rating LIMIT 10; ``` Ordering by the aggregate value is not yet supported. For performance reasons, we strongly recommend adding a `LIMIT` to the `GROUP BY`. Terms aggregations without a `LIMIT` consume more memory and are slower to execute. If a query does not have a limit and more than `65000` unique values are found in a field, an error will be returned. # Facets Source: https://docs.paradedb.com/documentation/aggregates/facets Compute a Top N and aggregate in one query A common pattern in search is to query for both an aggregate and a set of search results. For example, "find the top 10 results, and also count the total number of results." Instead of issuing two separate queries -- one for the search results, and another for the aggregate -- `pdb.agg` allows for these results to be returned in a single "faceted" query. This can significantly improve read throughput, since issuing a single query uses less CPU and disk I/O. For example, this query returns the top 3 search results alongside the total number of results found. ```sql theme={null} SELECT id, description, rating, pdb.agg('{"value_count": {"field": "id"}}') OVER () FROM mock_items WHERE category === 'electronics' ORDER BY rating DESC LIMIT 3; ``` ```ini Expected Response theme={null} id | description | rating | agg ----+-----------------------------+--------+---------------- 12 | Innovative wireless earbuds | 5 | {"value": 5.0} 1 | Ergonomic metal keyboard | 4 | {"value": 5.0} 2 | Plastic Keyboard | 4 | {"value": 5.0} (3 rows) ``` Faceted queries require that `pdb.agg` be used as a window function: `pdb.agg() OVER ()`. ### Faceted Performance Optimization On every query, ParadeDB runs checks to ensure that deleted or updated-away rows are not factored into the result set. 
If your table is not frequently updated or you can tolerate an approximate result, the performance of faceted queries can be improved by disabling these visibility checks. To do so, set the second argument of `pdb.agg` to `false`.

```sql theme={null}
SELECT description, rating, category, pdb.agg('{"value_count": {"field": "id"}}', false) OVER ()
FROM mock_items
WHERE description ||| 'running shoes'
ORDER BY rating
LIMIT 5;
```

Disabling this check can improve query times by 2-4x in some cases (at the expense of correctness).

# Limitations
Source: https://docs.paradedb.com/documentation/aggregates/limitations

Caveats for aggregate support

## ParadeDB Operator

In order for ParadeDB to push down an aggregate, a ParadeDB text search operator must be present in the query.

```sql theme={null}
-- Not pushed down
SELECT COUNT(*) FROM mock_items WHERE rating = 5;

-- Pushed down
SELECT COUNT(*) FROM mock_items WHERE rating = 5 AND id @@@ pdb.all();
```

If your query does not contain a ParadeDB operator, a way to "force" aggregate pushdown is to append the [all query](/documentation/query-builder/compound/all) to the query's `WHERE` clause.

## Join Support

ParadeDB is currently only able to push down aggregates over a single table. JOINs are not yet pushed down but are on the [roadmap](/welcome/roadmap).

## NUMERIC Columns

`NUMERIC` columns do not support aggregate pushdown. Queries with aggregates on `NUMERIC` columns will automatically fall back to PostgreSQL for aggregation. For numeric data that requires aggregate pushdown, use `FLOAT` or `DOUBLE PRECISION` instead:

```sql theme={null}
-- Aggregates can be pushed down
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  price DOUBLE PRECISION
);

-- Aggregates fall back to PostgreSQL
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  price NUMERIC(10,2)
);
```

Filter pushdown (equality and range queries) is fully supported for all `NUMERIC` columns. Only aggregate pushdown is not supported.

# Average
Source: https://docs.paradedb.com/documentation/aggregates/metrics/average

Compute the average value of a field

The following query computes the average value over a specific field:

```sql theme={null}
SELECT pdb.agg('{"avg": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response theme={null}
              agg
-------------------------------
 {"value": 3.8536585365853657}
(1 row)
```

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.AverageAggregation.html) for all available options.

## SQL Average Syntax

SQL's `AVG` syntax is supported in beta. To enable it, first run

```sql theme={null}
SET paradedb.enable_aggregate_custom_scan TO on;
```

With this feature enabled, the following query is equivalent to the above and is executed in the same way.

```sql theme={null}
SELECT AVG(rating) FROM mock_items WHERE id @@@ pdb.all();
```

By default, `AVG` ignores null values. Use `COALESCE` to include them in the final average:

```sql theme={null}
SELECT AVG(COALESCE(rating, 0)) FROM mock_items WHERE id @@@ pdb.all();
```

# Cardinality
Source: https://docs.paradedb.com/documentation/aggregates/metrics/cardinality

Compute the number of distinct values in a field

The cardinality aggregation estimates the number of distinct values in a field.
```sql theme={null}
SELECT pdb.agg('{"cardinality": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response theme={null}
      agg
----------------
 {"value": 5.0}
(1 row)
```

Unlike SQL's `DISTINCT`, which computes an exact count but is computationally expensive, the cardinality aggregation uses the HyperLogLog++ algorithm to closely approximate the number of distinct values.

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.CardinalityAggregationReq.html) for all available options.

# Count
Source: https://docs.paradedb.com/documentation/aggregates/metrics/count

Count the number of values in a field

The following query counts the number of values in a field:

```sql theme={null}
SELECT pdb.agg('{"value_count": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response theme={null}
       agg
-----------------
 {"value": 41.0}
(1 row)
```

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.CountAggregation.html) for all available options.

## SQL Count Syntax

SQL's `COUNT` syntax is supported in beta. To enable it, first run

```sql theme={null}
SET paradedb.enable_aggregate_custom_scan TO on;
```

With this feature enabled, the following query is equivalent to the above and is executed in the same way.

```sql theme={null}
SELECT COUNT(rating) FROM mock_items WHERE id @@@ pdb.all();
```

To count all rows, including rows with null values, use `COUNT(*)`:

```sql theme={null}
SELECT COUNT(*) FROM mock_items WHERE id @@@ pdb.all();
```

# Min/Max
Source: https://docs.paradedb.com/documentation/aggregates/metrics/minmax

Compute the min/max value of a field

`min` and `max` return the smallest and largest values of a column, respectively.

```sql Min theme={null}
SELECT pdb.agg('{"min": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```sql Max theme={null}
SELECT pdb.agg('{"max": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response (Min) theme={null}
      agg
----------------
 {"value": 1.0}
(1 row)
```

```ini Expected Response (Max) theme={null}
      agg
----------------
 {"value": 5.0}
(1 row)
```

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.MinAggregation.html) for all available options.

## SQL Min/Max Syntax

SQL's `MIN`/`MAX` syntax is supported in beta. To enable it, first run

```sql theme={null}
SET paradedb.enable_aggregate_custom_scan TO on;
```

With this feature enabled, the following query is equivalent to the above and is executed in the same way.

```sql Min theme={null}
SELECT MIN(rating) FROM mock_items WHERE id @@@ pdb.all();
```

```sql Max theme={null}
SELECT MAX(rating) FROM mock_items WHERE id @@@ pdb.all();
```

By default, `MIN`/`MAX` ignore null values. Use `COALESCE` to include them in the final result:

```sql theme={null}
SELECT MIN(COALESCE(rating, 0)) FROM mock_items WHERE id @@@ pdb.all();
```

# Percentiles
Source: https://docs.paradedb.com/documentation/aggregates/metrics/percentiles

Analyze the distribution of a field

The percentiles aggregation computes the values below which a given percentage of the data falls. In this example, the aggregation will return the 50th and 95th percentiles for `rating`.
```sql theme={null} SELECT pdb.agg('{"percentiles": {"field": "rating", "percents": [50, 95]}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg --------------------------------------------------------------------- {"values": {"50.0": 4.014835333028612, "95.0": 5.0028295751107414}} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.PercentilesAggregationReq.html) for all available options. # Stats Source: https://docs.paradedb.com/documentation/aggregates/metrics/stats Compute several metrics at once The stats aggregation returns the count, sum, min, max, and average all at once. ```sql theme={null} SELECT pdb.agg('{"stats": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg -------------------------------------------------------------------------------- {"avg": 3.8536585365853657, "max": 5.0, "min": 1.0, "sum": 158.0, "count": 41} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.StatsAggregation.html) for all available options. # Sum Source: https://docs.paradedb.com/documentation/aggregates/metrics/sum Compute the sum of a field The sum aggregation computes the sum of a field. ```sql theme={null} SELECT pdb.agg('{"sum": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg ------------------ {"value": 158.0} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.SumAggregation.html) for all available options. ## SQL Sum Syntax SQL's `SUM` syntax is supported in beta. To enable it, first run ```sql theme={null} SET paradedb.enable_aggregate_custom_scan TO on; ``` With this feature enabled, the following query is equivalent to the above and is executed in the same way. ```sql theme={null} SELECT SUM(rating) FROM mock_items WHERE id @@@ pdb.all(); ``` By default, `SUM` ignores null values. Use `COALESCE` to include them in the final sum: ```sql theme={null} SELECT SUM(COALESCE(rating, 0)) FROM mock_items WHERE id @@@ pdb.all(); ``` # Top Hits Source: https://docs.paradedb.com/documentation/aggregates/metrics/tophits Compute the top hits for each bucket in a terms aggregation The top hits aggregation is meant to be used in conjunction with the [terms](/documentation/aggregates/bucket/terms) aggregation. It returns the top documents for each bucket of a terms aggregation. For example, the following query answers "what are top 3 results sorted by `created_at` for each `rating` category?" 
```sql theme={null} SELECT pdb.agg('{"top_hits": {"size": 3, "sort": [{"created_at": "desc"}], "docvalue_fields": ["id", "created_at"]}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating; ``` ```ini Expected Response theme={null} agg --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"hits": [{"sort": [10907000251854775808], "docvalue_fields": {"id": [25], "created_at": ["2023-05-09T10:30:15Z"]}}, {"sort": [10906844884854775808], "docvalue_fields": {"id": [26], "created_at": ["2023-05-07T15:20:48Z"]}}, {"sort": [10906666358854775808], "docvalue_fields": {"id": [13], "created_at": ["2023-05-05T13:45:22Z"]}}]} {"hits": [{"sort": [10906756363854775808], "docvalue_fields": {"id": [24], "created_at": ["2023-05-06T14:45:27Z"]}}, {"sort": [10906385295854775808], "docvalue_fields": {"id": [28], "created_at": ["2023-05-02T07:40:59Z"]}}, {"sort": [10906236353854775808], "docvalue_fields": {"id": [29], "created_at": ["2023-04-30T14:18:37Z"]}}]} {"hits": [{"sort": [10906480573854775808], "docvalue_fields": {"id": [17], "created_at": ["2023-05-03T10:08:57Z"]}}, {"sort": [10906315942854775808], "docvalue_fields": {"id": [20], "created_at": ["2023-05-01T12:25:06Z"]}}, {"sort": [10906218361854775808], "docvalue_fields": {"id": [8], "created_at": ["2023-04-30T09:18:45Z"]}}]} {"hits": [{"sort": [10906573359854775808], "docvalue_fields": {"id": [27], "created_at": ["2023-05-04T11:55:23Z"]}}, {"sort": [10905961160854775808], "docvalue_fields": {"id": [15], "created_at": ["2023-04-27T09:52:04Z"]}}, {"sort": [10905202003854775808], "docvalue_fields": {"id": [7], "created_at": ["2023-04-18T14:59:27Z"]}}]} {"hits": [{"sort": [10906586188854775808], "docvalue_fields": {"id": [10], "created_at": ["2023-05-04T15:29:12Z"]}}]} (5 rows) ``` The `sort` value returned by the aggregation is Tantivy's internal sort ID and should be ignored. To get the actual fields, pass a list of fields to `docvalue_fields`. If a text or JSON field is passed to `docvalue_fields`, it must be indexed with the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer. To specify an offset, use `from`: ```sql theme={null} SELECT pdb.agg('{"top_hits": {"size": 3, "from": 1, "sort": [{"created_at": "desc"}], "docvalue_fields": ["id", "created_at"]}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating; ``` If multiple fields are passed into `sort`, the additional fields are used as tiebreakers: ```sql theme={null} SELECT pdb.agg('{"top_hits": {"size": 3, "sort": [{"created_at": "desc"}, {"id": "asc"}], "docvalue_fields": ["id", "created_at"]}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating; ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.TopHitsAggregationReq.html) for all available options. # Aggregate Syntax Source: https://docs.paradedb.com/documentation/aggregates/overview Accelerate aggregates with the ParadeDB index The `pdb.agg` function accepts an Elasticsearch-compatible JSON aggregate query string. 
It executes the aggregate using the [columnar](/welcome/architecture#columnar-index) portion of the ParadeDB index, which can significantly accelerate performance compared to vanilla Postgres. For example, the following query counts the total number of results for a search query. ```sql theme={null} SELECT pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE category === 'electronics'; ``` ```ini Expected Response theme={null} agg ---------------- {"value": 5.0} (1 row) ``` This query counts the number of results for every distinct group: ```sql theme={null} SELECT rating, pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE category === 'electronics' GROUP BY rating ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} rating | agg --------+---------------- 3 | {"value": 1.0} 4 | {"value": 3.0} 5 | {"value": 1.0} (3 rows) ``` ## Multiple Aggregations To compute multiple aggregations at once, simply include multiple `pdb.agg` functions in the target list: ```sql theme={null} SELECT pdb.agg('{"avg": {"field": "rating"}}') AS avg_rating, pdb.agg('{"value_count": {"field": "id"}}') AS count FROM mock_items WHERE category === 'electronics'; ``` ```ini Expected Response theme={null} avg_rating | count ----------------+---------------- {"value": 4.0} | {"value": 5.0} (1 row) ``` ## JSON Fields If `metadata` is a JSON field with key `color`, use `metadata.color` as the field name: ```sql theme={null} SELECT pdb.agg('{"terms": {"field": "metadata.color"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` If a text or JSON field is used inside `pdb.agg`, it must use the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer. # Performance Tuning Source: https://docs.paradedb.com/documentation/aggregates/tuning Several settings can be tuned to improve the performance of aggregates in ParadeDB ### Configure Parallel Workers ParadeDB uses Postgres parallel workers. By default, Postgres allows two workers per parallel query. Increasing the number of [parallel workers](/documentation/performance-tuning/reads) allows parallel queries to use all of the available hardware on the host machine and can deliver significant speedups. ### Run `VACUUM` `VACUUM` updates the table's [visibility map](https://www.postgresql.org/docs/current/storage-vm.html), which speeds up Postgres' visibility checks. ```sql theme={null} VACUUM mock_items; ``` If the table experiences frequent updates, we recommend configuring [autovacuum](https://www.postgresql.org/docs/current/routine-vacuuming.html). ### Run `pg_prewarm` The `pg_prewarm` extension can be used to preload data from the index into the Postgres buffer cache, which improves the response times of "cold" queries (i.e. the first search query after Postgres has restarted). ```sql theme={null} CREATE EXTENSION pg_prewarm; SELECT pg_prewarm('search_idx'); ``` # Filtering Source: https://docs.paradedb.com/documentation/filtering Filter search results based on metadata from other fields Adding filters to text search is as simple as using PostgreSQL's built-in `WHERE` clauses and operators. For instance, the following query filters out results that do not meet `rating > 2`. 
```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' AND rating > 2; ``` ## Filter Pushdown ### Non-Text Fields While not required, filtering performance over non-text columns can be improved by including them in the BM25 index. When these columns are part of the index, `WHERE` clauses that reference them can be pushed down into the index scan itself. This can result in faster query execution over large datasets. For example, if `rating` and `created_at` are frequently used in filters, they can be added to the BM25 index during index creation: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25(id, description, rating, created_at) WITH (key_field = 'id'); ``` Filter pushdown is currently supported for the following combinations of types and operators: | Operator | Left Operand Type | Right Operand Type | Example | | ------------------------------------------ | ----------------- | ------------------ | -------------------------- | | `=`, `<`, `>`, `<=`, `>=`, `<>`, `BETWEEN` | `int2` | `int2` | `WHERE rating = 2` | | | `int4` | `int4` | | | | `int8` | `int8` | | | | `int2` | `int4` | | | | `int2` | `int8` | | | | `int4` | `int8` | | | | `float4` | `float4` | | | | `float8` | `float8` | | | | `float4` | `float8` | | | | `numeric` | `numeric` | `WHERE price = 99.99` | | | `date` | `date` | | | | `time` | `time` | | | | `timetz` | `timetz` | | | | `timestamp` | `timestamp` | | | | `timestamptz` | `timestamptz` | | | | `uuid` | `uuid` | | | `=` | `bool` | `bool` | `WHERE in_stock = true` | | `IN`, `ANY`, `ALL` | `bool` | `bool[]` | `WHERE rating IN (1,2,3)` | | | `int2` | `int2[]` | | | | `int4` | `int4[]` | | | | `int8` | `int8[]` | | | | `int2` | `int4[]` | | | | `int2` | `int8[]` | | | | `int4` | `int8[]` | | | | `float4` | `float4[]` | | | | `float8` | `float8[]` | | | | `float4` | `float8[]` | | | | `date` | `date[]` | | | | `timetz` | `timetz[]` | | | | `timestamp` | `timestamp[]` | | | | `timestamptz` | `timestamptz[]` | | | | `uuid` | `uuid[]` | | | `IS`, `IS NOT` | `bool` | `bool` | `WHERE in_stock IS true` | | `IS NULL`, `IS NOT NULL` | `bool` | | `WHERE rating IS NOT NULL` | | | `int2` | | | | | `int4` | | | | | `int8` | | | | | `int2` | | | | | `int2` | | | | | `int4` | | | | | `float4` | | | | | `float8` | | | | | `float4` | | | | | `date` | | | | | `time` | | | | | `timetz` | | | | | `timestamp` | | | | | `timestamptz` | | | | | `uuid` | | | ### Text Fields Suppose we have a text filter that looks for an exact string match like `category = 'Footwear'`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' AND category = 'Footwear'; ``` To push down the `category = 'Footwear'` filter, `category` must be indexed using the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25(id, description, (category::pdb.literal)) WITH (key_field = 'id'); ``` Pushdown of set filters over text fields also requires the literal tokenizer: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' AND category IN ('Footwear', 'Apparel'); ``` # Fuzzy Source: https://docs.paradedb.com/documentation/full-text/fuzzy Allow for typos in the query string Fuzziness allows for tokens to be considered a match even if they are not identical, allowing for typos in the query string. 
While fuzzy matching works for non-Latin characters (Chinese, Japanese, Korean, etc.), it may not give the expected results (large result sets may be returned), because Levenshtein distance relies on individual character differences. If you need this functionality, please thumbs-up this [issue](https://github.com/paradedb/paradedb/issues/3782) and leave a comment with your use case.

## Overview

To add fuzziness to a query, cast it to the `fuzzy(n)` type, where `n` is the [edit distance](#how-it-works). Fuzziness is supported for [match](/documentation/full-text/match) and [term](/documentation/full-text/term) queries.

```sql theme={null}
-- Fuzzy match disjunction
SELECT id, description
FROM mock_items
WHERE description ||| 'runing shose'::pdb.fuzzy(2)
LIMIT 5;

-- Fuzzy match conjunction
SELECT id, description
FROM mock_items
WHERE description &&& 'runing shose'::pdb.fuzzy(2)
LIMIT 5;

-- Fuzzy Term
SELECT id, description
FROM mock_items
WHERE description === 'shose'::pdb.fuzzy(2)
LIMIT 5;
```

## How It Works

By default, the [match](/documentation/full-text/match) and [term](/documentation/full-text/term) queries require exact token matches between the query and indexed text. When a query is cast to `fuzzy(n)`, this requirement is relaxed -- tokens are matched if their Levenshtein distance, or edit distance, is less than or equal to `n`.

Edit distance is a measure of how many single-character operations are needed to turn one string into another. The allowed operations are:

* **Insertion** adds a character, e.g. "shoe" → "shoes" (insert "s") has an edit distance of `1`
* **Deletion** removes a character, e.g. "runnning" → "running" (delete one "n") has an edit distance of `1`
* **Substitution** replaces one character with another, e.g. "shoes" → "shows" (replace "e" with "w") has an edit distance of `1`
* **Transposition** swaps two adjacent characters, e.g. "shose" → "shoes" (swap "s" and "e"); by default this counts as two single-character edits, so it has an edit distance of `2`

For performance reasons, the maximum allowed edit distance is `2`. Casting a query to `fuzzy(0)` is the same as an exact token match.

## Fuzzy Prefix

`fuzzy` also supports prefix matching. For instance, "runn" is a prefix of "running" because it matches the beginning of the token exactly. "rann" is a **fuzzy prefix** of "running" because it matches the beginning within an edit distance of `1`.

To treat the query string as a prefix, set the second argument of `fuzzy` to either `t` or `"true"`:

```sql theme={null}
SELECT id, description
FROM mock_items
WHERE description === 'rann'::pdb.fuzzy(1, t)
LIMIT 5;
```

Postgres requires that `true` be double-quoted, i.e. `fuzzy(1, "true")`.

When used with [match](/documentation/full-text/match) queries, fuzzy prefix treats all tokens in the query string as prefixes. For instance, the following query means "find all documents containing the fuzzy prefix `rann` AND the fuzzy prefix `slee`":

```sql theme={null}
SELECT id, description
FROM mock_items
WHERE description &&& 'slee rann'::pdb.fuzzy(1, t)
LIMIT 5;
```

## Transposition Cost

By default, the cost of a transposition (i.e. "shose" → "shoes") is `2`. Setting the third argument of `fuzzy` to `t` lowers the cost of a transposition to `1`:

```sql theme={null}
SELECT id, description
FROM mock_items
WHERE description === 'shose'::pdb.fuzzy(1, f, t)
LIMIT 5;
```

The default value for the second and third arguments of `fuzzy` is `f`, which means `fuzzy(1)` is equivalent to `fuzzy(1, f, f)`.
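Putting these options together, here is a sketch that combines an edit distance of `2`, prefix matching, and cheaper transpositions. The query string is illustrative; the argument order (edit distance, prefix flag, transposition flag) follows the sections above.

```sql theme={null}
-- Edit distance of 2, treat each token as a prefix (second argument),
-- and count a transposition as a single edit (third argument)
SELECT id, description
FROM mock_items
WHERE description ||| 'runn shose'::pdb.fuzzy(2, t, t)
LIMIT 5;
```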
# Highlighting
Source: https://docs.paradedb.com/documentation/full-text/highlight

Generate snippets for portions of the source text that match the query string

Highlighting is an expensive process and can slow down query times. We recommend passing a `LIMIT` to any query where `pdb.snippet` or `pdb.snippets` is called to restrict the number of snippets that need to be generated.

Highlighting is not supported for fuzzy search.

Highlighting refers to the practice of visually emphasizing the portions of a document that match a user's search query.

## Basic Usage

`pdb.snippet()` can be added to any query where an `@@@` operator is present. `pdb.snippet` returns the single best snippet, sorted by relevance score.

The following query generates highlighted snippets against the `description` field.

```sql theme={null}
SELECT id, pdb.snippet(description)
FROM mock_items
WHERE description ||| 'shoes'
LIMIT 5;
```

`pdb.snippet` accepts the following optional parameters:

* `start_tag`: The leading indicator around the highlighted region.
* `end_tag`: The trailing indicator around the highlighted region.
* `max_num_chars`: Max number of characters for a highlighted snippet. A snippet may contain multiple matches if they are close to each other.

By default, `<b></b>` encloses the snippet. This can be configured with `start_tag` and `end_tag`:

```sql theme={null}
SELECT id, pdb.snippet(description, start_tag => '<i>', end_tag => '</i>')
FROM mock_items
WHERE description ||| 'shoes'
LIMIT 5;
```

## Multiple Snippets

`pdb.snippets()` returns an array of snippets, allowing you to retrieve multiple highlighted matches from a document. This is particularly useful when a document has several relevant matches spread throughout its content.

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15)
FROM mock_items
WHERE description ||| 'artistic vase'
LIMIT 5;
```

```ini Expected Response theme={null}
 id |                 snippets
----+------------------------------------------
 19 | {<b>Artistic</b>,"ceramic <b>vase</b>"}
(1 row)
```

`pdb.snippets` accepts the following optional parameters:

* `start_tag`: The leading indicator around the highlighted region.
* `end_tag`: The trailing indicator around the highlighted region.
* `max_num_chars`: Max number of characters for a highlighted snippet. When `max_num_chars` is small, multiple snippets may be generated for a single document.
* `limit`: The maximum number of snippets to return per document.
* `offset`: The number of snippets to skip before returning results. Use with `limit` for pagination.
* `sort_by`: The order in which to sort the snippets. Can be `'score'` (default, sorts by relevance) or `'position'` (sorts by appearance in the document).

### Limiting and Offsetting Snippets

You can control the number and order of snippets returned using the `limit`, `offset`, and `sort_by` parameters. For example, to get only the first snippet:

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15, "limit" => 1)
FROM mock_items
WHERE description ||| 'running'
LIMIT 5;
```

To get the second snippet (by skipping the first one):

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15, "limit" => 1, "offset" => 1)
FROM mock_items
WHERE description ||| 'running'
LIMIT 5;
```

### Sorting Snippets

Snippets can be sorted either by their relevance score (`'score'`) or their position within the document (`'position'`). To sort snippets by their appearance in the document:

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15, sort_by => 'position')
FROM mock_items
WHERE description ||| 'artistic vase'
LIMIT 5;
```

## Byte Offsets

`pdb.snippet_positions()` returns the byte offsets in the original text where the snippets would appear.
It returns an array of tuples, where the first element of the tuple is the byte index of the first byte of the highlighted region, and the second element is the byte index after the last byte of the region. ```sql theme={null} SELECT id, pdb.snippet(description), pdb.snippet_positions(description) FROM mock_items WHERE description ||| 'shoes' LIMIT 5; ``` ```ini Expected Response theme={null} id | snippet | snippet_positions ----+----------------------------+------------------- 3 | Sleek running shoes | {"{14,19}"} 4 | White jogging shoes | {"{14,19}"} 5 | Generic shoes | {"{8,13}"} (3 rows) ``` # Match Source: https://docs.paradedb.com/documentation/full-text/match Returns documents that match the provided query string, which is tokenized before matching Match queries are the go-to query type for text search in ParadeDB. There are two types of match queries: [match disjunction](#match-disjunction) and [match conjunction](#match-conjunction). ## Match Disjunction Match disjunction uses the `|||` operator and means "find all documents that contain one or more of the terms tokenized from this text input." To understand what this looks like in practice, let's consider the following query: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes'; ``` This query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear White jogging shoes | 3 | Footwear Generic shoes | 4 | Footwear (3 rows) ``` ### How It Works Let's look at what the `|||` operator does: 1. Retrieves the tokenizer configuration of the `description` column. In this example, let's assume `description` uses the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer. 2. Tokenizes the query string with the same tokenizer. This means `running shoes` becomes two tokens: `running` and `shoes`. 3. Finds all rows where `description` contains **any one** of the tokens, `running` or `shoes`. This is why all results have either `running` or `shoes` tokens in `description`. ### Examples Let's consider a few more hypothetical documents to see whether they would be returned by match disjunction. These examples assume that the index uses the default tokenizer and token filters, and that the query is `running shoes`. | Original Text | Tokens | Match | Reason | Related | | ------------------- | ------------------------- | ----- | --------------------------------------- | --------------------------------------------------------------------- | | Sleek running shoes | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | | | Running shoes sleek | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Phrase](/documentation/full-text/phrase) | | SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Lowercasing](/documentation/indexing/create-index) | | Sleek run shoe | `sleek` `run` `shoe` | ❌ | Contains neither `running` nor `shoes`. | [Stemming](/documentation/indexing/create-index) | | Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌ | Contains neither `running` nor `shoes`. | [Fuzzy](/documentation/full-text/fuzzy) | | White jogging shoes | `white` `jogging` `shoes` | ✅ | Contains `shoes`. | [Match conjunction](/documentation/full-text/match#match-conjunction) | ## Match Conjunction Suppose we want to find rows that contain both `running` **and** `shoes`. This is where the `&&&` match conjunction operator comes in. 
`&&&` means "find all documents that contain all terms tokenized from this text input." ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& 'running shoes'; ``` This query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` Note that `White jogging shoes` and `Generic shoes` are no longer returned because they do not have the token `running`. ### How It Works Match conjunction works exactly like match disjunction, except for one key distinction. Instead of finding documents containing at least one matching token from the query, it finds documents where **all tokens** from the query are a match. ### Examples Let’s consider a few more hypothetical documents to see whether they would be returned by match conjunction. These examples assume that the index uses the default tokenizer and token filters, and that the query is `running shoes`. | Original Text | Tokens | Match | Reason | Related | | ------------------- | ------------------------- | ----- | -------------------------------------------- | --------------------------------------------------------------------- | | Sleek running shoes | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | | | Running shoes sleek | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Phrase](/documentation/full-text/phrase) | | SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Lowercasing](/documentation/indexing/create-index) | | Sleek run shoe | `sleek` `run` `shoe` | ❌ | Does not contain both `running` and `shoes`. | [Stemming](/documentation/indexing/create-index) | | Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌ | Does not contain both `running` and `shoes`. | [Fuzzy](/documentation/full-text/fuzzy) | | White jogging shoes | `white` `jogging` `shoes` | ❌ | Does not contain both `running` and `shoes`. | [Match conjunction](/documentation/full-text/match#match-conjunction) | If the query string only contains one token, then `|||` and `&&&` are effectively the same: ```sql theme={null} -- These two queries produce the same results SELECT description, rating, category FROM mock_items WHERE description ||| 'shoes'; SELECT description, rating, category FROM mock_items WHERE description &&& 'shoes'; ``` ## Using a Custom Tokenizer By default, the match query automatically tokenizes the query string with the same tokenizer used by the field it's being searched against. This behavior can be overridden by explicitly casting the query to a different tokenizer. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes'::pdb.whitespace; ``` ## Using Pretokenized Text The match operators also accept text arrays. If a text array is provided, each element of the array is treated as an exact token, which means that no further processing is done. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& ARRAY['running', 'shoes']; ``` # How Text Search Works Source: https://docs.paradedb.com/documentation/full-text/overview Understand how ParadeDB uses token matching to efficiently search large corpuses of text Text search in ParadeDB, like Elasticsearch and most search engines, is centered around the concept of **token matching**. Token matching consists of two steps. 
First, at indexing time, text is processed by a tokenizer, which breaks input into discrete units called **tokens** or **terms**. For example, the [default](/documentation/indexing/create-index) tokenizer splits the text `Sleek running shoes` into the tokens `sleek`, `running`, and `shoes`. Second, at query time, the query engine looks for token matches based on the specified query and query type. Some common query types include: * [Match](/documentation/full-text/match): Matches documents containing any or all query tokens * [Phrase](/documentation/full-text/phrase): Matches documents where all tokens appear in the same order as the query * [Term](/documentation/full-text/term): Matches documents containing an exact token * ...and many more [advanced](/documentation/query-builder/overview) query types ## Not Substring Matching While ParadeDB supports substring matching via [regex](/documentation/query-builder/term/regex) queries, it's important to note that token matching is **not** the same as substring matching. Token matching is a much more versatile and powerful technique. It enables relevance scoring, language-specific analysis, typo tolerance, and more expressive query types — capabilities that go far beyond simply looking for a sequence of characters. ## Similarity Search Text search is different than similarity search, also known as vector search. Whereas text search matches based on token matches, similarity search matches based on semantic meaning. ParadeDB currently does not build its own extensions for similarity search. Most ParadeDB users install [pgvector](https://github.com/pgvector/pgvector), the Postgres extension for vector search, for this use case. We have tentative long-term plans in our [roadmap](/welcome/roadmap#vector-search-improvements) to make improvements to Postgres' vector search. If this is useful to you, please [reach out](mailto:support@paradedb.com). # Phrase Source: https://docs.paradedb.com/documentation/full-text/phrase Phrase queries are like match queries, but with order and position of matching tokens enforced Phrase queries work exactly like [match conjunction](/documentation/full-text/match#match-conjunction), but are more strict in that they require the order and position of tokens to be the same. Suppose our query is `running shoes`, and we want to omit results like `running sleek shoes` or `shoes running` — these results contain the right tokens, but not in the exact order and position that the query specifies. Enter the `###` phrase operator: ```sql theme={null} INSERT INTO mock_items (description, rating, category) VALUES ('running sleek shoes', 5, 'Footwear'), ('shoes running', 5, 'Footwear'); SELECT description, rating, category FROM mock_items WHERE description ### 'running shoes'; ``` This query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` Note that `running sleek shoes` and `shoes running` did not match the phrase `running shoes` despite having the tokens `running` and `shoes` because they appear in the wrong order or with other words in between. ## How It Works Let's look at what happens under the hood for the above phrase query: 1. Retrieves the tokenizer configuration of the `description` column. In this example, let's assume `description` uses the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer. 2. Tokenizes the query string with the same tokenizer. 
This means `running shoes` becomes two tokens: `running` and `shoes`.
3. Finds all rows where `description` contains `running` immediately followed by `shoes`.

## Examples

Let’s consider a few more hypothetical documents to see whether they would be returned by the phrase query. These examples assume that the index uses the default tokenizer and token filters, and that the query is `running shoes`.

| Original Text       | Tokens                    | Match | Reason                                               | Related                                             |
| ------------------- | ------------------------- | ----- | ---------------------------------------------------- | --------------------------------------------------- |
| Sleek running shoes | `sleek` `running` `shoes` | ✅     | Contains `running` and `shoes`, in that order.       |                                                     |
| Running shoes sleek | `sleek` `running` `shoes` | ✅     | Contains `running` immediately followed by `shoes`.  |                                                     |
| SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅     | Contains `running` and `shoes`, in that order.       | [Lowercasing](/documentation/indexing/create-index) |
| Sleek run shoe      | `sleek` `run` `shoe`      | ❌     | Does not contain both `running` and `shoes`.         | [Stemming](/documentation/indexing/create-index)    |
| Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌     | Does not contain both `running` and `shoes`.         |                                                     |
| White jogging shoes | `white` `jogging` `shoes` | ❌     | Does not contain both `running` and `shoes`.         |                                                     |

## Adding Slop

Slop allows the token ordering requirement of phrase queries to be relaxed. It specifies how many changes — like extra words in between or transposed word positions — are allowed while still considering the phrase a match:

* An extra word in between (e.g. `sleek shoes` vs. `sleek running shoes`) has a slop of `1`
* A transposition (e.g. `running shoes` vs. `shoes running`) has a slop of `2`

To apply slop to a phrase query, cast the query to `slop(n)`, where `n` is the maximum allowed slop.

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### 'shoes running'::pdb.slop(2);
```

## Using a Custom Tokenizer

The phrase query supports custom query tokenization.

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### 'running shoes'::pdb.whitespace;
```

## Using Pretokenized Text

The phrase operator also accepts a text array as the right-hand side argument. If a text array is provided, each element of the array is treated as an exact token, which means that no further processing is done.

The following query matches documents containing the token `running` immediately followed by `shoes`:

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### ARRAY['running', 'shoes'];
```

Adding slop is supported:

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### ARRAY['shoes', 'running']::pdb.slop(2);
```

# Proximity
Source: https://docs.paradedb.com/documentation/full-text/proximity

Match documents based on token proximity within the source document

Proximity queries are used to match documents containing tokens that are within a certain token distance of one another.

## Overview

The following query finds all documents where the token `sleek` is at most `1` token away from `shoes`.

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description @@@ ('sleek' ## 1 ## 'shoes');
```

Like the [term](/documentation/full-text/term) query, the query string in a proximity query is treated as a finalized token.
`##` does not care about order -- the term on the left-hand side may appear before or after the term on the right-hand side. To ensure that the left-hand term appears before the right-hand term, use `##>`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ ('sleek' ##> 1 ##> 'shoes'); ``` ## Proximity Regex In addition to exact tokens, proximity queries can also match against regex expressions. The following query finds all documents where any token matching the regex query `sl.*` is at most `1` token away from the token `shoes`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_regex('sl.*') ## 1 ## 'shoes'); ``` By default, `pdb.prox_regex` will expand to the first `50` regex matches in each document. This limit can be overridden by providing a second argument: ```sql theme={null} -- Expand up to 100 regex matches SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_regex('sl.*', 100) ## 1 ## 'shoes'); ``` ## Proximity Array `pdb.prox_array` matches against an array of tokens instead of a single token. For example, the following query finds all documents where any of the tokens `sleek` or `white` is within `1` token of `shoes`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_array('sleek', 'white') ## 1 ## 'shoes'); ``` `pdb.prox_array` can also take regex: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_array(pdb.prox_regex('sl.*'), 'white') ## 1 ## 'shoes'); ``` # Term Source: https://docs.paradedb.com/documentation/full-text/term Look for exact token matches in the source document, without any further processing of the query string Term queries look for exact token matches. A term query is like an exact string match, but at the token level. Unlike [match](/documentation/full-text/match) or [phrase](/documentation/full-text/phrase) queries, term queries treat the query string as a **finalized token**. This means that the query string is taken as-is, without any further tokenization or filtering. Term queries use the `===` operator. To understand exactly how it works, let's consider the following two term queries: ```sql theme={null} -- Term query 1 SELECT description, rating, category FROM mock_items WHERE description === 'running'; -- Term query 2 SELECT description, rating, category FROM mock_items WHERE description === 'RUNNING'; ``` The first query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` However, the second query returns no results. This is because term queries look for exact matches, which includes case sensitivity, and there are no documents in the example dataset containing the token `RUNNING`. All tokenizers besides the literal tokenizer [lowercase](/documentation/token-filters/lowercase) tokens by default. Make sure to account for this when searching for a term. If you are using `===` to do an exact string match on the original text, make sure that the text uses the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. ## How It Works Under the hood, `===` simply finds all documents where any of their tokens are an exact string match against the query token. A document's tokens are determined by the field's tokenizer and token filters, configured at index creation time. 
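For example, if the goal is an exact match against the original text, the column can be indexed with the literal tokenizer so that the entire value becomes a single token. A minimal sketch, assuming `category` is indexed this way (as in the filtering documentation):

```sql theme={null}
-- With the literal tokenizer, the whole value is a single token,
-- so === behaves like an exact, case-sensitive string match
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, (category::pdb.literal))
WITH (key_field = 'id');

SELECT description, rating, category
FROM mock_items
WHERE category === 'Footwear';
```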
## Examples Let’s consider a few more hypothetical documents to see whether they would be returned by the term query. These examples assume that index uses the default tokenizer and token filters, and that the term query is `running`. | Original Text | Tokens | Match | Reason | Related | | ------------------- | ------------------------- | ----- | ------------------------------------- | --------------------------------------------------- | | Sleek running shoes | `sleek` `running` `shoes` | ✅ | Contains the token `running`. | | | Running shoes sleek | `sleek` `running` `shoes` | ✅ | Contains the token `running`. | | | SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅ | Contains the token `running`. | [Lowercasing](/documentation/indexing/create-index) | | Sleek run shoe | `sleek` `run` `shoe` | ❌ | Does not contain the token `running`. | [Stemming](/documentation/indexing/create-index) | | Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌ | Does not contain the token `running`. | [Fuzzy](/documentation/full-text/fuzzy) | | White jogging shoes | `white` `jogging` `shoes` | ❌ | Does not contain the token `running`. | | ## Term Set Passing a text array to the right-hand side of `===` means "find all documents containing any one of these tokens." ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description === ARRAY['shoes', 'running']; ``` # Install ParadeDB Source: https://docs.paradedb.com/documentation/getting-started/install How to run the ParadeDB Docker image The fastest way to install ParadeDB is by pulling the ParadeDB Docker image and running it locally. If your primary Postgres is in a virtual private cloud (VPC), we recommend deploying ParadeDB on a compute instance within your VPC to avoid exposing public IP addresses and needing to provision traffic routing rules. **Note**: ParadeDB supports Postgres 15+, and the `latest` tag ships with Postgres 18. To specify a different Postgres version, please refer to the available tags on [Docker Hub](https://hub.docker.com/r/paradedb/paradedb/tags). ```bash theme={null} docker run \ --name paradedb \ -e POSTGRES_USER=myuser \ -e POSTGRES_PASSWORD=mypassword \ -e POSTGRES_DB=mydatabase \ -v paradedb_data:/var/lib/postgresql/ \ -p 5432:5432 \ -d \ paradedb/paradedb:latest ``` You may replace `myuser`, `mypassword`, and `mydatabase` with whatever values you want. These will be your database connection credentials. To connect to ParadeDB, install the `psql` client and run ```bash theme={null} docker exec -it paradedb psql -U myuser -d mydatabase -W ``` To see all the ways in which you can install ParadeDB, please refer to our [deployment documentation](/deploy/overview). That's it! Next, let's [run a few queries](/documentation/getting-started/quickstart) over mock data with ParadeDB. # Load Data from Postgres Source: https://docs.paradedb.com/documentation/getting-started/load Dump data from an existing Postgres and load into ParadeDB The easiest way to copy data from another Postgres into ParadeDB is with the `pg_dump` and `pg_restore` utilities. These are installed by default when you install `psql`. This approach is ideal for quickly testing ParadeDB. See the [deployment guide](/deploy/overview) for how to deploy ParadeDB into production. ## Create a Dump Run `pg_dump` to create a copy of your database. The `pg_dump` version needs be greater than or equal to that of your Postgres database. You can check the version with `pg_dump --version`. 
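For example, assuming the Postgres client tools are on your `PATH`, you can compare the client and server versions before dumping:

```bash theme={null}
# Version of the pg_dump client
pg_dump --version

# Version of the existing Postgres server
psql -h <host> -U <username> -d <dbname> -c 'SELECT version();'
```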
Below, we use the "custom" format (`-Fc`) for both `pg_dump` and `pg_restore`. Please review the [Postgres `pg_dump` documentation](https://www.postgresql.org/docs/current/app-pgdump.html) for other options that may be more appropriate for your environment.

Replace `host`, `username`, and `dbname` with your existing Postgres database credentials. If you deployed ParadeDB within your VPC, the `host` will be the private IP address of your existing Postgres database.

```bash theme={null}
pg_dump -Fc --no-acl --no-owner \
  -h <host> \
  -U <username> \
  <dbname> > old_db.dump
```

If your database is large, this can take some time. You can speed this up by dumping specific tables.

```bash theme={null}
pg_dump -Fc --no-acl --no-owner \
  -h <host> \
  -U <username> \
  -t <table_1> -t <table_2> \
  <dbname> > old_db.dump
```

## Restore the Dump

Run `pg_restore` to load this data into ParadeDB. The `pg_restore` version needs to be greater than or equal to that of your `pg_dump`. You can check the version with `pg_restore --version`.

Replace `host`, `username`, and `dbname` with your ParadeDB credentials.

```bash theme={null}
pg_restore --verbose --clean --no-acl --no-owner \
  -h <host> \
  -U <username> \
  -d <dbname> \
  -Fc \
  old_db.dump
```

Congratulations! You are now ready to run real queries over your data. To get started, refer to our [full text search documentation](https://docs.paradedb.com/documentation/full-text/overview).

# Quickstart
Source: https://docs.paradedb.com/documentation/getting-started/quickstart

Get started with ParadeDB in five minutes

This guide will walk you through a few queries to give you a feel for ParadeDB.

## Create Example Table

ParadeDB comes with a helpful procedure that creates a table populated with mock data to help you get started. Once connected with `psql`, run the following commands to create and inspect this table.

```sql theme={null}
CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;
```

```ini Expected Response theme={null}
        description        | rating |  category
---------------------------+--------+-------------
 Ergonomic metal keyboard  |      4 | Electronics
 Plastic Keyboard          |      4 | Electronics
 Sleek running shoes       |      5 | Footwear
(3 rows)
```

Next, let's create a BM25 index called `search_idx` on this table. A BM25 index is a covering index, which means that multiple columns can be included in the same index.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range)
WITH (key_field='id');
```

As a general rule of thumb, any columns that you want to filter, `GROUP BY`, `ORDER BY`, or aggregate as part of a full text query should be added to the index for faster performance.

Note the mandatory `key_field` option. See [choosing a key field](/documentation/indexing/create-index#choosing-a-key-field) for more details.

## Match Query

We're now ready to execute a basic text search query. We'll look for matches where `description` matches `running shoes` and `rating` is greater than `2`.
```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' AND rating > 2 ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category ---------------------+--------+---------- White jogging shoes | 3 | Footwear Generic shoes | 4 | Footwear Sleek running shoes | 5 | Footwear (3 rows) ``` `|||` is ParadeDB's custom [match disjunction](/documentation/full-text/match#disjunction) operator, which means "find all documents containing `running OR shoes`." If we want all documents containing `running AND shoes`, we can use ParadeDB's `&&&` [match conjunction](/documentation/full-text/match#conjunction) operator. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& 'running shoes' AND rating > 2 ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` ## BM25 Scoring Next, let's add [BM25 scoring](/documentation/sorting/score) to the results, which sorts matches by relevance. To do this, we'll use `pdb.score`. ```sql theme={null} SELECT description, pdb.score(id) FROM mock_items WHERE description ||| 'running shoes' AND rating > 2 ORDER BY score DESC LIMIT 5; ``` ```ini Expected Response theme={null} description | score ---------------------+----------- Sleek running shoes | 6.817111 Generic shoes | 3.8772602 White jogging shoes | 3.4849067 (3 rows) ``` ## Highlighting Finally, let's also [highlight](/documentation/full-text/highlight) the relevant portions of the documents that were matched. To do this, we'll use `pdb.snippet`. ```sql theme={null} SELECT description, pdb.snippet(description), pdb.score(id) FROM mock_items WHERE description ||| 'running shoes' AND rating > 2 ORDER BY score DESC LIMIT 5; ``` ```ini Expected Response theme={null} description | snippet | score ---------------------+-----------------------------------+----------- Sleek running shoes | Sleek <b>running</b> <b>shoes</b> | 6.817111 Generic shoes | Generic <b>shoes</b> | 3.8772602 White jogging shoes | White jogging <b>shoes</b> | 3.4849067 (3 rows) ``` ## Top N ParadeDB is highly optimized for quickly returning the [Top N](/documentation/sorting/topn) results out of the index. In SQL, this means queries that contain an `ORDER BY...LIMIT`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category ---------------------+--------+---------- White jogging shoes | 3 | Footwear Generic shoes | 4 | Footwear Sleek running shoes | 5 | Footwear (3 rows) ``` ## Facets [Faceted queries](/documentation/aggregates/facets) allow a single query to return both the Top N results and an aggregate value, which is more CPU-efficient than issuing two separate queries. For example, the following query returns the top 3 results as well as the total number of results matched.
```sql theme={null} SELECT description, rating, category, pdb.agg('{"value_count": {"field": "id"}}') OVER () FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category | agg ---------------------+--------+----------+---------------- White jogging shoes | 3 | Footwear | {"value": 3.0} Generic shoes | 4 | Footwear | {"value": 3.0} Sleek running shoes | 5 | Footwear | {"value": 3.0} (3 rows) ``` That's it! Next, let's [load your data](/documentation/getting-started/load) to start running real queries. # Create an Index Source: https://docs.paradedb.com/documentation/indexing/create-index Index a Postgres table for full text search Before a table can be searched, it must be indexed. ParadeDB uses a custom index type called the BM25 index. The following code block creates a BM25 index over several columns in the `mock_items` table. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category) WITH (key_field='id'); ``` By default, text columns are tokenized using the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer, which splits text according to the Unicode segmentation standard. Because index creation is a time-consuming operation, we recommend experimenting with the [available tokenizers](/documentation/tokenizers/overview) to find the most suitable one before running `CREATE INDEX`. For instance, if a column contains multiple languages, the ICU tokenizer may be more appropriate. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.icu), category) WITH (key_field='id'); ``` Only one BM25 index can exist per table. We recommend indexing all columns in a table that may be present in a search query, including columns used for sorting, grouping, filtering, and aggregations. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range) WITH (key_field='id'); ``` Most Postgres types, including text, JSON, numeric, timestamp, range, boolean, and arrays, can be indexed. ## Track Create Index Progress To monitor the progress of a long-running `CREATE INDEX`, open a separate Postgres connection and query `pg_stat_progress_create_index`: ```sql theme={null} SELECT pid, phase, blocks_done, blocks_total FROM pg_stat_progress_create_index; ``` Comparing `blocks_done` to `blocks_total` will provide a good approximation of the progress so far. If `blocks_done` equals `blocks_total`, that means that all rows have been indexed and the index is being flushed to disk. ## Choosing a Key Field In the `CREATE INDEX` statement above, note the mandatory `key_field` option. Every BM25 index needs a `key_field`, which is the name of a column that will function as a row’s unique identifier within the index. The `key_field` must: 1. Have a `UNIQUE` constraint. Usually this means the table's `PRIMARY KEY`. 2. Be the first column in the column list. 3. Be untokenized, if it is a text field. ## Token Filters After tokens are created, [token filters](/documentation/token-filters/overview) can be configured to apply further processing like lowercasing, stemming, or unaccenting. 
For example, the following code block adds English stemming to `description`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stemmer=english')), category) WITH (key_field='id'); ``` # Indexing Text Arrays Source: https://docs.paradedb.com/documentation/indexing/indexing-arrays Add text arrays to the index The BM25 index accepts arrays of type `text[]` or `varchar[]`. ```sql theme={null} CREATE TABLE array_demo (id SERIAL PRIMARY KEY, categories TEXT[]); INSERT INTO array_demo (categories) VALUES ('{"food","groceries and produce"}'), ('{"electronics","computers"}'), ('{"books","fiction","mystery"}'); CREATE INDEX ON array_demo USING bm25 (id, categories) WITH (key_field = 'id'); ``` Under the hood, each element in the array is indexed as a separate entry. This means that an array is considered a match if **any** of its entries is a match. ```sql theme={null} SELECT * FROM array_demo WHERE categories === 'food'; ``` ```ini Expected Response theme={null} id | categories ----+-------------------------------- 1 | {food,"groceries and produce"} (1 row) ``` Text arrays can be [tokenized](/documentation/tokenizers/overview) and [filtered](/documentation/token-filters/overview) in the same way as text fields: ```sql theme={null} CREATE INDEX ON array_demo USING bm25 (id, (categories::pdb.literal)) WITH (key_field = 'id'); ``` # Indexing Expressions Source: https://docs.paradedb.com/documentation/indexing/indexing-expressions Add Postgres expressions to the index In addition to indexing columns, Postgres expressions can also be indexed. ## Indexing Text/JSON Expressions The following statement indexes an expression which concatenates `description` and `category`, which are both text fields: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, ((description || ' ' || category)::pdb.simple('alias=description_concat'))) WITH (key_field='id'); ``` To index a text/JSON expression: 1. Add the expression to the column list. In this example, the expression is `description || ' ' || category`. 2. Cast it to a [tokenizer](/documentation/tokenizers/overview), in this example `pdb.simple`. 3. ParadeDB will try and infer a field name based on the field used in the expression. However, if the field name cannot be inferred (e.g. because the expression involves more than one field), you will be required to add an `alias=` to the tokenizer. Querying against the expression is the same as querying a regular field: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE (description || ' ' || category) &&& 'running shoes'; ``` The expression on the left-hand side of the operator must exactly match the expression that was indexed. ## Indexing Non-Text Expressions To index a non-text expression, cast the expression to `pdb.alias`. For example, the following statement indexes the expression `rating + 1`, which returns an integer: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, ((rating + 1)::pdb.alias('rating'))) WITH (key_field='id'); ``` With the expression indexed, queries containing the expression can be pushed down to the ParadeDB index: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& 'running shoes' AND rating + 1 > 3; ``` # Indexing JSON Source: https://docs.paradedb.com/documentation/indexing/indexing-json Add JSON and JSONB types to the index When indexing JSON, ParadeDB automatically indexes all sub-fields of the JSON object. 
The type of each sub-field is also inferred automatically. For example, consider the following statement where `metadata` is `JSONB`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, metadata) WITH (key_field='id'); ``` A single `metadata` JSON may look like: ```json theme={null} { "color": "Silver", "location": "United States" } ``` ParadeDB will automatically index both `metadata.color` and `metadata.location` as text. By default, all text sub-fields of a JSON object use the same tokenizer. The tokenizer can be configured the same way as text fields: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (metadata::pdb.ngram(2,3))) WITH (key_field='id'); ``` Instead of indexing the entire JSON, sub-fields of the JSON can be indexed individually. This allows for configuring separate tokenizers within a larger JSON: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, ((metadata->>'color')::pdb.ngram(2,3))) WITH (key_field='id'); ``` # Reindexing Source: https://docs.paradedb.com/documentation/indexing/reindexing Rebuild an existing index with zero downtime ## Changing the Schema If an index's schema is changed, it must be rebuilt. This includes: 1. Adding a field to the index 2. Removing a field from the index 3. Renaming an indexed column in the underlying table 4. Changing a field's tokenizer Let's assume the existing index is called `search_idx`, and we want to create a new index called `search_idx_v2`. First, use `CREATE INDEX CONCURRENTLY` to build a new index in the background. ```sql theme={null} CREATE INDEX CONCURRENTLY search_idx_v2 ON mock_items USING bm25 (id, description, category) WITH (key_field = 'id'); ``` The `CONCURRENTLY` clause is required. `CONCURRENTLY` allows the existing index to continue serving queries while the new index is being built. From another session, you can use `pg_stat_progress_create_index` to [track the progress](/documentation/indexing/create-index#track-create-index-progress) of the new index. Once the new index is done building, confirm that it is valid: ```sql theme={null} SELECT ix.indisvalid, ix.indisready, ix.indislive FROM pg_class i JOIN pg_index ix ON ix.indexrelid = i.oid WHERE i.relname = 'search_idx_v2'; ``` ```csv Expected Response theme={null} indisvalid | indisready | indislive ------------+------------+----------- t | t | t (1 row) ``` If all three columns are `true`, the original index can safely be dropped, which will redirect queries to the new index. ```sql theme={null} DROP INDEX search_idx; ``` ## Rebuilding the Index `REINDEX` is used to rebuild an index without changing the schema. The basic syntax for `REINDEX` is: ```sql theme={null} REINDEX INDEX search_idx; ``` This operation takes an exclusive lock on the table, which blocks incoming writes (but not reads) while the new index is being built. To allow for concurrent writes during a reindex, use `REINDEX CONCURRENTLY`: ```sql theme={null} REINDEX INDEX CONCURRENTLY search_idx; ``` The tradeoff is that `REINDEX CONCURRENTLY` is slower than a plain `REINDEX`. Generally speaking, `REINDEX CONCURRENTLY` is recommended for production systems that cannot tolerate temporarily blocked writes. ## Important Caveats Although `CREATE INDEX CONCURRENTLY` and `REINDEX CONCURRENTLY` run in the background, Postgres requires that the session that is executing the command remain open. If the session is closed, Postgres will cancel the operation.
This is relevant if you are using a connection pooler like `pgbouncer`, which may terminate sessions after a certain idle timeout is reached. If `REINDEX CONCURRENTLY` fails or is cancelled, an invalid transient index will be left behind that must be dropped manually. To check for invalid indexes in `psql`, run `\d <table_name>` and look for `INVALID` indexes. # Verify Index Integrity Source: https://docs.paradedb.com/documentation/indexing/verify-index Check BM25 indexes for corruption and structural issues ParadeDB provides `amcheck`-style index verification functions to detect corruption and validate the structural integrity of BM25 indexes. These functions are useful for: * Proactive corruption detection before issues become critical * Validating index health after hardware failures or unexpected shutdowns * Verifying backup integrity * Debugging index-related issues ## Basic Verification The `pdb.verify_index` function performs structural integrity checks on a BM25 index: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx'); ``` This returns a table with three columns: | Column | Type | Description | | ------------ | ------- | --------------------------------------------- | | `check_name` | text | Name of the verification check | | `passed` | boolean | Whether the check passed | | `details` | text | Additional information about the check result | ### Example Output ``` check_name | passed | details ----------------------------------------+--------+----------------------------------------------- search_idx: schema_valid | t | Index schema loaded successfully search_idx: index_readable | t | Index reader opened successfully search_idx: checksums_valid | t | All segment checksums validated successfully search_idx: segment_metadata_valid | t | 3 segments validated successfully ``` ## Heap Reference Validation To verify that all indexed entries still exist in the heap table, use the `heapallindexed` option: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true); ``` This adds an additional check that validates every indexed `ctid` (tuple identifier) references a valid row in the table. This is particularly useful for detecting index entries that reference deleted or non-existent rows. The `heapallindexed` option can be slow on large indexes as it must verify every document. Consider using `sample_rate` for quick spot checks on large indexes. ## Options ### Sampling for Large Indexes For large indexes, you can check a random sample of documents instead of all documents: ```sql theme={null} -- Check 10% of documents SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, sample_rate := 0.1 ); ``` ### Progress Reporting For long-running verifications, enable progress reporting to see status updates: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true ); ``` Progress messages are emitted via PostgreSQL's `WARNING` channel.
### Verbose Mode For detailed logging including segment-by-segment progress and resume hints, enable verbose mode: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true, verbose := true ); ``` ### Stop on First Error To stop verification immediately when the first error is found (similar to `pg_amcheck --on-error-stop`): ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', on_error_stop := true); ``` ## Parallel Verification A single `verify_index` call processes segments sequentially within one PostgreSQL backend. For very large indexes, you can distribute verification across multiple database connections by specifying which segments each connection should check using the `segment_ids` parameter. This allows you to utilize multiple CPU cores by running verification in parallel processes. ### Listing Segments First, list all segments in the index: ```sql theme={null} SELECT * FROM pdb.index_segments('search_idx'); ``` ``` partition_name | segment_idx | segment_id | num_docs | num_deleted | max_doc ----------------+-------------+------------+----------+-------------+--------- search_idx | 0 | b7e661af | 10000 | 0 | 10000 search_idx | 1 | b4fc1b40 | 10000 | 0 | 10000 search_idx | 2 | 9894b412 | 10000 | 0 | 10000 search_idx | 3 | 4d0168d6 | 5000 | 0 | 5000 ``` ### Verifying Specific Segments Then verify specific segments using the `segment_ids` parameter: ```sql theme={null} -- Worker 1: Verify even segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[0, 2] ); -- Worker 2: Verify odd segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[1, 3] ); ``` ### Automation Example Distribute verification across N workers: ```sql theme={null} -- Get segments for worker 0 (of 4 workers) SELECT array_agg(segment_idx) AS segments FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0; -- Run verification with those segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ( SELECT array_agg(segment_idx) FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0 ) ); ``` ## Verifying All BM25 Indexes To verify all BM25 indexes in the database at once: ```sql theme={null} SELECT * FROM pdb.verify_all_indexes(); ``` ### Filtering by Pattern Filter indexes by schema or name pattern (using SQL `LIKE` syntax): ```sql theme={null} -- Verify indexes in the 'public' schema only SELECT * FROM pdb.verify_all_indexes(schema_pattern := 'public'); -- Verify indexes matching a name pattern SELECT * FROM pdb.verify_all_indexes(index_pattern := 'search_%'); -- Combine filters SELECT * FROM pdb.verify_all_indexes( schema_pattern := 'app_%', index_pattern := '%_idx', heapallindexed := true ); ``` ## Listing All BM25 Indexes To see all BM25 indexes in the database with summary statistics: ```sql theme={null} SELECT * FROM pdb.indexes(); ``` ``` schemaname | tablename | indexname | indexrelid | num_segments | total_docs ------------+-------------+---------------+------------+--------------+------------ public | products | products_idx | 16421 | 3 | 50000 public | documents | documents_idx | 16435 | 5 | 125000 app | articles | articles_idx | 16448 | 2 | 10000 ``` ## Function Reference ### `pdb.verify_index` Verifies a single BM25 index. 
| Parameter | Type | Default | Description | | ----------------- | -------- | ---------- | ---------------------------------------------------------- | | `index` | regclass | (required) | The index to verify | | `heapallindexed` | boolean | `false` | Check that all indexed ctids exist in the heap | | `sample_rate` | float | `NULL` | Fraction of documents to check (0.0-1.0). NULL = check all | | `report_progress` | boolean | `false` | Emit progress messages | | `verbose` | boolean | `false` | Emit detailed segment-level progress and resume hints | | `on_error_stop` | boolean | `false` | Stop on first error found | | `segment_ids` | int\[] | `NULL` | Specific segment indices to check. NULL = all segments | ### `pdb.verify_all_indexes` Verifies all BM25 indexes in the database. | Parameter | Type | Default | Description | | ----------------- | ------- | ------- | ---------------------------------------------------------- | | `schema_pattern` | text | `NULL` | Filter by schema name (SQL LIKE pattern). NULL = all | | `index_pattern` | text | `NULL` | Filter by index name (SQL LIKE pattern). NULL = all | | `heapallindexed` | boolean | `false` | Check that all indexed ctids exist in the heap | | `sample_rate` | float | `NULL` | Fraction of documents to check (0.0-1.0). NULL = check all | | `report_progress` | boolean | `false` | Emit progress messages | | `on_error_stop` | boolean | `false` | Stop on first error found | ### `pdb.index_segments` Lists all segments in a BM25 index. | Parameter | Type | Default | Description | | --------- | -------- | ---------- | -------------------- | | `index` | regclass | (required) | The index to inspect | Returns: | Column | Type | Description | | ---------------- | ------ | ------------------------------------------------ | | `partition_name` | text | Name of the index partition | | `segment_idx` | int | Segment index (use with `segment_ids` parameter) | | `segment_id` | text | Tantivy segment UUID | | `num_docs` | bigint | Number of live documents | | `num_deleted` | bigint | Number of deleted documents | | `max_doc` | bigint | Maximum document ID | ### `pdb.indexes` Lists all BM25 indexes in the database. Returns: | Column | Type | Description | | -------------- | ------ | ----------------------------------- | | `schemaname` | text | Schema containing the index | | `tablename` | text | Table the index is on | | `indexname` | text | Name of the index | | `indexrelid` | oid | OID of the index | | `num_segments` | int | Number of Tantivy segments | | `total_docs` | bigint | Total documents across all segments | # Index Creation Source: https://docs.paradedb.com/documentation/performance-tuning/create-index Settings to make index creation faster These actions can improve the performance and memory consumption of `CREATE INDEX` and `REINDEX` statements. ### Raise Parallel Indexing Workers ParadeDB uses Postgres' `max_parallel_maintenance_workers` setting to determine the degree of parallelism during `CREATE INDEX`/`REINDEX`. Postgres' default is `2`, which may be too low for large tables. ```sql theme={null} SET max_parallel_maintenance_workers = 8; ``` In order for `max_parallel_maintenance_workers` to take effect, it must be less than or equal to both `max_parallel_workers` and `max_worker_processes`. ### Configure Indexing Memory The default Postgres `maintenance_work_mem` value of `64MB` is quite conservative and can slow down parallel index builds. We recommend at least `64MB` per [parallel indexing worker](#raise-parallel-indexing-workers). 
```sql theme={null} SET maintenance_work_mem = '2GB'; ``` Each worker is required to have at least `15MB` memory. If `maintenance_work_mem` is set too low, an error will be returned. ### Defer Index Creation If possible, creating the BM25 index should be deferred until **after** a table has been populated. To illustrate: ```sql theme={null} -- This is preferred CREATE TABLE test (id SERIAL, data text); INSERT INTO test (data) VALUES ('hello world'), ('many more values'); CREATE INDEX ON test USING bm25 (id, data) WITH (key_field = 'id'); -- ...to this CREATE TABLE test (id SERIAL, data text); CREATE INDEX ON test USING bm25 (id, data) WITH (key_field = 'id'); INSERT INTO test (data) VALUES ('hello world'), ('many more values'); ``` This allows the BM25 index to create a more tightly packed, efficient representation on disk and will lead to faster build times. # Joins Source: https://docs.paradedb.com/documentation/performance-tuning/joins Optimize JOIN queries in ParadeDB ParadeDB supports all PostgreSQL JOIN types and extends them with BM25-powered full-text search. This guide explains how JOINs behave with search, how to identify sub-optimal query plans, and offers strategies to keep queries fast. ## Supported JOIN Types ParadeDB supports all PostgreSQL JOINs: * `INNER JOIN` * `LEFT / RIGHT / FULL OUTER JOIN` * `CROSS JOIN` * `LATERAL` * Semi and Anti JOINs For the most part you can mix search and relational queries without changing your SQL. ## Scoring in JOINs When using `paradedb.score()` or `paradedb.snippet()` inside JOINs: * Scores and snippets are computed **before the JOIN** at the base table level. * JOIN conditions never change the score, they only determine which rows are combined. This design keeps scores predictable and consistent across queries. ## Performance Characteristics ### Fast Cases Queries are efficient when search filters can be applied directly to the underlying tables. In these cases, PostgreSQL can push down the `|||` operators so that each table does its own filtered index scan before the JOIN runs. That means: * Each table only contributes rows that already match the search condition. * The JOIN operates on much smaller intermediate sets. In this query, both `a.bio` and `b.content` are filtered independently. The planner runs efficient index scans on each table and then joins the results. ```sql theme={null} SELECT a.name, b.title, paradedb.score(a.id) FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science fiction' AND b.content ||| 'space travel'; ``` The plan will have this shape: ``` Gather -> Parallel Hash Join Hash Cond: (b.id = a.id) -> Parallel ParadeDB Scan on authors a -> Parallel Hash -> Parallel ParadeDB Scan on books b ``` ### Slower Cases Queries become slower when search conditions span multiple tables in a way that prevents PostgreSQL from pushing them down. The most common example is an `OR` across different tables: ```sql theme={null} SELECT a.name, b.title FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science' OR b.content ||| 'artificial'; ``` Because the condition references both `a` and `b`, PostgreSQL cannot apply it until after the join. As a result, both tables must be scanned in full, joined, and only then filtered. The plan will have this shape: ``` Gather -> Parallel Hash Join Hash Cond: (a.id = b.author_id) Join Filter: (a.bio ||| (...) 
OR b.content ||| (...)) -> Parallel Seq Scan on authors a -> Parallel Hash -> Parallel Seq Scan on books b ``` Note that the `|||` query is in the *Join Filter*, not in the scan. ## Diagnosing Performance Use `EXPLAIN` to check the query plan: ```sql theme={null} EXPLAIN (ANALYZE, BUFFERS) SELECT a.name, b.title, paradedb.score(a.id) FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science' OR b.content ||| 'artificial'; ``` Watch for: * `Custom Scan` nodes with large row counts * ParadeDB operators inside JOIN conditions * `Tantivy Query: all` (full index scan) ## Writing Faster JOIN Queries ### Replace Cross-Table OR with UNION If you don’t need scores/snippets and have a simple JOIN, express the OR as a UNION of two separately filtered joins. This lets PostgreSQL push each search predicate down to a Custom Index Scan and avoid a join-time filter. ```sql theme={null} SELECT a.name, b.title FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science' UNION SELECT a.name, b.title FROM authors a JOIN books b ON a.id = b.author_id WHERE b.content ||| 'artificial'; ``` ### Use CTEs for Complex Queries Use common table expressions (CTEs) to pre-filter each table with its own search condition, then join the smaller result sets together. If possible, add a `LIMIT` to each CTE to keep the result sets small. ```sql theme={null} WITH matching_authors AS ( SELECT id, name, paradedb.score(id) AS author_score FROM authors WHERE bio ||| 'science' LIMIT 100 ), matching_books AS ( SELECT id, title, author_id, paradedb.score(id) AS book_score FROM books WHERE content ||| 'artificial' LIMIT 100 ) SELECT COALESCE(ma.name, a.name) AS name, COALESCE(mb.title, b.title) AS title, ma.author_score, mb.book_score FROM matching_authors ma FULL JOIN matching_books mb ON ma.id = mb.author_id LEFT JOIN authors a ON mb.author_id = a.id AND ma.id IS NULL LEFT JOIN books b ON ma.id = b.author_id AND mb.id IS NULL; ``` BM25 scores should not simply be added together; if you want to combine scores, consider using [reciprocal rank fusion (RRF)](https://www.paradedb.com/learn/search-concepts/reciprocal-rank-fusion). ## Roadmap We want to remove the need to think about how to structure `JOIN`s in ParadeDB. At the moment we are actively working on: * A `CustomScan Join API` for native join handling * Smarter cost estimation for the PostgreSQL planner # How to Tune ParadeDB Source: https://docs.paradedb.com/documentation/performance-tuning/overview Settings for better read and write performance ParadeDB uses Postgres' settings, which can be found in the `postgresql.conf` file. To find your `postgresql.conf` file, use `SHOW`. ```sql theme={null} SHOW config_file; ``` These settings can be changed in several ways: 1. By editing the `postgresql.conf` file and restarting Postgres. This makes the setting permanent for all sessions. `postgresql.conf` accepts ParadeDB's custom `paradedb.*` settings. 2. By running `SET`. This temporarily changes the setting for the current session. Note that Postgres does not allow all `postgresql.conf` settings to be changed with `SET`. ```sql theme={null} SET maintenance_work_mem = '8GB'; ``` If ParadeDB is deployed with [CloudNativePG](/deploy/self-hosted/kubernetes), these settings should be set in your `.tfvars` file.
```hcl .tfvars theme={null} postgresql = { parameters = { max_worker_processes = 76 max_parallel_workers = 64 # Note that paradedb.* settings must be wrapped in double quotes "paradedb.global_mutable_segment_rows" = 1000 } } ``` # Read Throughput Source: https://docs.paradedb.com/documentation/performance-tuning/reads Settings to improve read performance As a general rule of thumb, the performance of expensive search queries can be greatly improved if they are able to access more parallel Postgres workers and more shared buffer memory. ## Raise Parallel Workers There are three settings that control how many parallel workers ultimately get assigned to a query. First, `max_worker_processes` is a global limit for the number of workers. Next, `max_parallel_workers` is a subset of `max_worker_processes`, and sets the limit for workers used in parallel queries. Finally, `max_parallel_workers_per_gather` limits how many workers a *single query* can receive. ```ini postgresql.conf theme={null} max_worker_processes = 72 max_parallel_workers = 64 max_parallel_workers_per_gather = 4 ``` In the above example, the maximum number of workers that a single query can receive is set to `4`. The `max_parallel_workers` pool is set to `64`, which means that `16` queries can execute simultaneously with `4` workers each. Finally, `max_worker_processes` is set to `72` to give headroom for other workers like autovacuum and replication. In practice, we recommend experimenting with different settings, as the best configuration depends on the underlying hardware, query patterns, and volume of data. If all `max_parallel_workers` are in use, Postgres will still execute additional queries, but those queries will run without parallelism. This means that queries do not fail -- they just may run slower due to lack of parallelism. ## Raise Shared Buffers `shared_buffers` controls how much memory is available to the Postgres buffer cache. We recommend allocating no more than 40% of total memory to `shared_buffers`. ```ini postgresql.conf theme={null} shared_buffers = 8GB ``` The `pg_prewarm` extension can be used to load the BM25 index into the buffer cache after Postgres restarts. A higher `shared_buffers` value allows more of the index to be stored in the buffer cache. ```sql theme={null} CREATE EXTENSION pg_prewarm; SELECT pg_prewarm('search_idx'); ``` ## Configure Autovacuum If an index experiences frequent writes, the search performance of some queries like [sorting](/documentation/sorting/score) or [aggregates](/documentation/aggregates/overview) can degrade if `VACUUM` has not been recently run. This is because writes can cause parts of Postgres' visibility map to go out of date, and `VACUUM` updates the visibility map. To determine if search performance is degraded by lack of `VACUUM`, run `EXPLAIN ANALYZE` over a query. A `Parallel Custom Scan` in the query plan with a large number of `Heap Fetches` typically means that `VACUUM` should be run. Postgres can be configured to automatically vacuum a table when a certain number of rows have been updated. Autovacuum settings can be set globally in `postgresql.conf` or for a specific table. ```sql theme={null} ALTER TABLE mock_items SET (autovacuum_vacuum_threshold = 500); ``` There are several [autovacuum settings](https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM), but the important ones to note are: 1. `autovacuum_vacuum_scale_factor` triggers an autovacuum if a certain percentage of rows in a table have been updated. 2.
`autovacuum_vacuum_threshold` triggers an autovacuum if an absolute number of rows have been updated. 3. `autovacuum_naptime` ensures that vacuum does not run too frequently. This means that setting `autovacuum_vacuum_scale_factor` to `0` and `autovacuum_vacuum_threshold` to `100000` will trigger an autovacuum for every `100000` row updates. As a general rule of thumb, we recommend autovacuuming at least once every `100000` single-row updates. ## Adjust Target Segment Count By default, `CREATE INDEX`/`REINDEX` will create as many segments as there are CPUs on the host machine. This can be changed using the `target_segment_count` index option. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, rating) WITH (key_field = 'id', target_segment_count = 32, ...); ``` This property is attached to the index so that during `REINDEX`, the same value will be used. It can be changed with ALTER INDEX, like so: ```sql theme={null} ALTER INDEX search_idx SET (target_segment_count = 8); ``` However, a `REINDEX` is required to rebalance the index to that segment count. For optimal performance, the segment count should equal the number of parallel workers that a query can receive, which is controlled by [`max_parallel_workers_per_gather`](/documentation/performance-tuning/reads#raise-parallel-workers). If `max_parallel_workers_per_gather` is greater than the number of CPUs on the host machine, then increasing the target segment count to match `max_parallel_workers_per_gather` can improve query performance. `target_segment_count` is merely a suggestion. While `pg_search` will endeavor to ensure the created index will have exactly this many segments, it is possible for it to have less or more. Mostly this depends on the distribution of work across parallel builder processes, memory constraints, and table size. # Write Throughput Source: https://docs.paradedb.com/documentation/performance-tuning/writes Settings to improve write performance These actions can improve the throughput of `INSERT`/`UPDATE`/`COPY` statements to the BM25 index. ## Ensure Merging Happens in the Background During every `INSERT`/`UPDATE`/`COPY`/`VACUUM`, the BM25 index runs a compaction process that looks for opportunities to merge segments together. The goal is to consolidate smaller segments into larger ones, reducing the total number of segments and improving query performance. Segments become candidates for merging if their combined size meets or exceeds one of several **configurable layer thresholds**. These thresholds define target segment sizes — such as `10KB`, `100KB`, `1MB`, etc. For each layer, the compactor checks if there are enough smaller segments whose total size adds up to the threshold. The default layer sizes are `100KB`, `1MB`, `100MB`, `1GB`, and `10GB` but can be configured. ```sql theme={null} ALTER INDEX search_idx SET (background_layer_sizes = '100MB, 1GB'); ``` By default, merging happens in the background so that writes are not blocked. The `layer_sizes` option allows merging to happen in the foreground. This is not typically recommended because it slows down writes, but can be used to apply back pressure to writes if segments are being created faster than they can be merged down. ```sql theme={null} ALTER INDEX search_idx SET (layer_sizes = '100KB, 1MB'); ``` Setting `layer_sizes` to `0` disables foreground merging, and setting `background_layer_sizes` to `0` disables background merging. 
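For example, to rely entirely on background merging, foreground merging can be turned off (a minimal sketch, assuming the option accepts `'0'` in the same string form as the layer sizes shown above):

```sql theme={null}
-- Disable foreground merging; segments are merged only by the background process
ALTER INDEX search_idx SET (layer_sizes = '0');
```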
## Increase Work Memory for Bulk Updates `work_mem` controls how much memory to allocate to a single `INSERT`/`UPDATE`/`COPY` statement. Each statement that writes to a BM25 index is required to have at least `15MB` memory. If `work_mem` is below `15MB`, it will be ignored and `15MB` will be used. If your typical update pattern is large bulk updates (not single-row updates), a larger value may be better. ```sql theme={null} SET work_mem = '64MB'; ``` Since many write operations can be running concurrently, this value should be raised more conservatively than `maintenance_work_mem`. ## Increase Mutable Segment Size The `mutable_segment_rows` setting enables use of mutable segments, which buffer new rows in order to amortize the cost of indexing them. By default, it is set to `1000`, which means that 1000 writes are buffered before being flushed. ```sql theme={null} ALTER INDEX search_idx SET (mutable_segment_rows = 1000); ``` A higher value generally improves write throughput at the expense of read performance, since the mutable data structure is slower to search. Additionally, the mutable data structure is read into memory, so higher values cause reads to consume more RAM. Alternatively, this setting can be set to apply to all indexes in the database: ```sql theme={null} SET paradedb.global_mutable_segment_rows = 1000; ``` If both a per-index setting and global setting exist, the global `paradedb.global_mutable_segment_rows` will be used. To ignore the global setting, set `paradedb.global_mutable_segment_rows` to `-1` (this is the default). ```sql theme={null} SET paradedb.global_mutable_segment_rows = -1; ``` # All Source: https://docs.paradedb.com/documentation/query-builder/compound/all Search all rows in the index The all query means "search all rows in the index." The primary use case for the all query is to force the query to be executed by the ParadeDB index instead of Postgres' other execution methods. Because ParadeDB executes a query only when a ParadeDB operator is present in the query, the all query injects an operator into the query without changing the query's meaning. To use it, pass the [key field](/documentation/indexing/create-index#choosing-a-key-field) to the left-hand side of `@@@` and `pdb.all()` to the right-hand side. ```sql theme={null} -- Top N executed by standard Postgres SELECT * FROM mock_items WHERE rating IS NOT NULL ORDER BY rating LIMIT 5; -- Top N executed by ParadeDB SELECT * FROM mock_items WHERE rating IS NOT NULL AND id @@@ pdb.all() ORDER BY rating LIMIT 5; ``` This is useful for cases where queries that don't contain a ParadeDB operator can be more efficiently executed by ParadeDB vs. standard Postgres, like [Top N](/documentation/sorting/topn) or [aggregate](/documentation/aggregates/overview) queries. # Query Parser Source: https://docs.paradedb.com/documentation/query-builder/compound/query-parser Accept raw user-provided query strings The parse query accepts a [Tantivy query string](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html). The intended use case is for accepting raw query strings provided by the end user. To use it, pass the [key field](/documentation/indexing/create-index#choosing-a-key-field) to the left-hand side of `@@@` and `pdb.parse('<query string>')` to the right-hand side.
```sql theme={null} SELECT description, rating, category FROM mock_items WHERE id @@@ pdb.parse('description:(sleek shoes) AND rating:>3'); ``` Please refer to the [Tantivy docs](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) for an overview of the query string language. ## Lenient Parsing By default, strict syntax parsing is used. This means that if any part of the query does not conform to Tantivy’s query string syntax, the query fails. For instance, a valid field name must be provided before every query (i.e. `category:footwear`). By setting `lenient` to `true`, the query is executed on a best-effort basis. For example, if no field names are provided, the query is executed over all fields in the index. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE id @@@ pdb.parse('description:(sleek shoes) AND rating:>3', lenient => true); ``` ## Conjunction Mode By default, terms in the query string are `OR`ed together. With `conjunction_mode` set to `true`, they are instead `AND`ed together. For instance, the following query returns documents containing both `sleek` and `shoes`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE id @@@ pdb.parse('description:(sleek shoes)', conjunction_mode => true); ``` # How Advanced Query Functions Work Source: https://docs.paradedb.com/documentation/query-builder/overview ParadeDB's query builder functions provide advanced query types In addition to basic [match](/documentation/full-text/match), [phrase](/documentation/full-text/phrase), and [term](/documentation/full-text/term) queries, additional advanced query types are exposed as query builder functions. Query builder functions use the `@@@` operator. `@@@` takes a column on the left-hand side and a query builder function on the right-hand side. It means "find all rows where the column matches the given query." For example: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.regex('key.*rd'); ``` ```ini Expected Response theme={null} description | rating | category --------------------------+--------+------------- Ergonomic metal keyboard | 4 | Electronics Plastic Keyboard | 4 | Electronics (2 rows) ``` This uses the [regex](/documentation/query-builder/term/regex) builder function to match all rows where `description` matches the regex expression `key.*rd`. # Phrase Prefix Source: https://docs.paradedb.com/documentation/query-builder/phrase/phrase-prefix Finds documents containing a phrase followed by a term prefix Phrase prefix identifies documents containing a phrase followed by a term prefix. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.phrase_prefix(ARRAY['running', 'sh']); ```
The phrase prefix query takes an `ARRAY` of tokens to match, where the final element is treated as a term prefix rather than a complete term. The `max_expansions` option limits the number of term variations that the prefix can expand to during the search. This helps in controlling the breadth of the search by setting a cap on how many different terms the prefix can match. ## Performance Considerations Expanding a prefix might lead to thousands of matching terms, which impacts search times. With `max_expansions`, the prefix term is expanded to at most `max_expansions` terms in lexicographic order. For instance, if `sh` matches `shall`, `share`, `shoe`, and `shore` but `max_expansions` is set to 3, `sh` will only be expanded to `shall`, `share`, and `shoe`. # Regex Phrase Source: https://docs.paradedb.com/documentation/query-builder/phrase/regex-phrase Matches a specific sequence of regex queries Regex phrase matches a specific sequence of regex queries. Think of it like a conjunction of [regex](/documentation/query-builder/term/regex) queries, with positions and ordering of tokens enforced. For example, the regex phrase query for `ru.* shoes` will match `running shoes`, but will not match `shoes running`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.regex_phrase(ARRAY['ru.*', 'shoes']); ```
The regex phrase query takes an `ARRAY` of regex expressions that form the search phrase. These expressions must appear in the specified order within the document for a match to occur, although some flexibility is allowed based on the `slop` parameter: a slop of `0` requires the terms to appear exactly as they are in the phrase and adjacent to each other, while higher slop values allow for transpositions and distance between terms. Please see [regex](/documentation/query-builder/term/regex) for allowed regex constructs. The total number of terms that the regex phrase query can expand to is also capped; if this number is exceeded, an error will be returned. # More Like This Source: https://docs.paradedb.com/documentation/query-builder/specialized/more-like-this Finds documents that are "like" another document. The more like this (MLT) query finds documents that are "like" another document. To use this query, pass the [key field](/documentation/indexing/create-index#choosing-a-key-field) value of the input document to `pdb.more_like_this`. For instance, the following query finds documents that are "like" a document with an `id` of `3`: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3) ORDER BY id; ``` ```ini Expected Response theme={null} id | description | rating | category ----+----------------------+--------+---------- 3 | Sleek running shoes | 5 | Footwear 4 | White jogging shoes | 3 | Footwear 5 | Generic shoes | 4 | Footwear 13 | Sturdy hiking boots | 4 | Footwear 23 | Comfortable slippers | 3 | Footwear 33 | Winter woolen socks | 5 | Footwear (6 rows) ``` In the output above, notice that documents matching any of the indexed fields, `description`, `rating`, and `category`, were returned. This is because, by default, all fields present in the index are considered for matching. The only exception is JSON fields, which are not yet supported and are ignored by the more like this query. To find only documents that match on specific fields, provide an array of field names as the second argument: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, ARRAY['description']) ORDER BY id; ``` ```ini Expected Response theme={null} id | description | rating | category ----+---------------------+--------+---------- 3 | Sleek running shoes | 5 | Footwear 4 | White jogging shoes | 3 | Footwear 5 | Generic shoes | 4 | Footwear (3 rows) ``` Because JSON fields are not yet supported for MLT, an error will be returned if a JSON field is passed into the array. ## How It Works Let's look at how the MLT query works under the hood: 1. Stored values for the input document's fields are retrieved. If they are text fields, they are tokenized and filtered in the same way as the field was during [index creation](/documentation/indexing/create-index). 2. A set of representative terms is created from the input document. For example, in the statement above, these terms would be `sleek`, `running`, and `shoes` for the `description` field; `5` for the `rating` field; `footwear` for the `category` field. 3. Documents with at least one term match across any of the fields are considered a match. ## Using a Custom Input Document In addition to providing a key field value, a custom document can also be provided as JSON. The JSON keys are field names and must correspond to field names in the index.
```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this('{"description": "Sleek running shoes", "category": "footwear"}') ORDER BY id; ``` ## Configuration Options ### Term Frequency `min_term_frequency` excludes terms that appear fewer than a certain number of times in the input document, while `max_term_frequency` excludes terms that appear more than that many times. By default, no terms are excluded based on term frequency. For instance, the following query returns no results because no term appears twice in the input document. ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, min_term_frequency => 2) ORDER BY id; ``` ### Document Frequency `min_doc_frequency` excludes terms that appear in fewer than a certain number of documents across the entire index, while `max_doc_frequency` excludes terms that appear in more than that many documents. By default, no terms are excluded based on document frequency. ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, min_doc_frequency => 3) ORDER BY id; ``` ### Max Query Terms By default, only the top 25 terms across all fields are considered for matching. Terms are scored using a combination of inverse document frequency and term frequency (TF-IDF) -- this means that terms that appear frequently in the input document and are rare across the index score the highest. This can be configured with `max_query_terms`: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, max_query_terms => 10) ORDER BY id; ``` ### Term Length `min_word_length` and `max_word_length` can be used to exclude terms that are too short or too long, respectively. By default, no terms are excluded based on length. ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, min_word_length => 5) ORDER BY id; ``` ### Custom Stopwords To exclude terms from being considered, provide a text array to `stopwords`: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, stopwords => ARRAY['the', 'a']) ORDER BY id; ``` # Range Term Source: https://docs.paradedb.com/documentation/query-builder/term/range-term Filters over Postgres range types `range_term` is the equivalent of Postgres' operators over [range types](https://www.postgresql.org/docs/current/rangetypes.html). It supports operations like range containment, overlap, and intersection. ## Term Within In this example, `weight_range` is an `int4range` type. 
The following query finds all rows where `weight_range` contains `1`: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term(1); ``` ## Range Intersects The following query finds all ranges that share at least one common point with the query range: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term('(10, 12]'::int4range, 'Intersects'); ``` ## Range Contains The following query finds all ranges that are contained by the query range: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term('(3, 9]'::int4range, 'Contains'); ``` ## Range Within The following query finds all ranges that contain the query range: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term('(2, 11]'::int4range, 'Within'); ``` # Regex Source: https://docs.paradedb.com/documentation/query-builder/term/regex Searches for terms that match a regex pattern Regex queries search for terms that follow a pattern. For example, the wildcard pattern `key.*` finds all terms that start with `key`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.regex('key.*'); ``` ParadeDB supports all regex constructs of the Rust [regex](https://docs.rs/regex/latest/regex/) crate, with the following exceptions: 1. Lazy quantifiers such as `+?` 2. Word boundaries such as `\b` Otherwise, the full syntax of the [regex](https://docs.rs/regex/latest/regex/) crate is supported, including all Unicode support and relevant flags. A list of regex flags and grouping options can be [found here](https://docs.rs/regex/latest/regex/#grouping-and-flags), which includes: * named and numbered capture groups * case insensitivity flag (`i`) * multi-line mode (`m`) Regex queries operate at the token level. To execute regex over the original text, use the keyword tokenizer. ## Performance Considerations During a regex query, ParadeDB doesn't scan through every single word. Instead, it uses a highly optimized structure called a [finite state transducer (FST)](https://en.wikipedia.org/wiki/Finite-state_transducer) that makes it possible to jump straight to the matching terms. Even if the index contains millions of words, the regex query only looks at the ones that have a chance of matching, skipping everything else. This is why certain regex constructs are not supported -- they are difficult to implement efficiently. # Relevance Tuning Source: https://docs.paradedb.com/documentation/sorting/boost Tune the BM25 score by adjusting the weights of individual queries ## Boosting ParadeDB offers several ways to tune a document's [BM25 score](/documentation/sorting/score). The first is boosting, which increases or decreases the impact of a specific query by multiplying its contribution to the overall BM25 score. To boost a query, cast the query to the `boost` type. In this example, the `shoes` query is weighted twice as heavily as the `footwear` query. ```sql theme={null} SELECT id, pdb.score(id), description, category FROM mock_items WHERE description ||| 'shoes'::pdb.boost(2) OR category ||| 'footwear' ORDER BY score DESC LIMIT 5; ``` `boost` takes a numeric value, which is the multiplicative boost factor. It can be any floating point number between `-2048` and `2048`.
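A factor below `1` can likewise be used to de-emphasize a query. The following sketch mirrors the example above but weights the `shoes` query at half its normal contribution:

```sql theme={null}
-- Down-weight the `shoes` query with a fractional boost
SELECT id, pdb.score(id), description, category
FROM mock_items
WHERE description ||| 'shoes'::pdb.boost(0.5) OR category ||| 'footwear'
ORDER BY score DESC
LIMIT 5;
```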
[Query builder functions](/documentation/query-builder/overview) can also be boosted: ```sql theme={null} SELECT id, description, category, pdb.score(id) FROM mock_items WHERE description @@@ pdb.regex('key.*')::pdb.boost(2) ORDER BY score DESC LIMIT 5; ``` Boost can be used in conjunction with other type casts, like [fuzzy](/documentation/full-text/fuzzy): ```sql theme={null} SELECT id, description, category, pdb.score(id) FROM mock_items WHERE description ||| 'shose'::pdb.fuzzy(2)::pdb.boost(2) ORDER BY score DESC LIMIT 5; ``` ## Constant Scoring Constant scoring assigns the same score to all documents that match a query. To apply a constant score, cast the query to the `const` type with a numeric value. For instance, the following query assigns a score of `1` to all documents matching the query `shoes`. ```sql theme={null} SELECT id, pdb.score(id), description, category FROM mock_items WHERE description ||| 'shoes'::pdb.const(1) ORDER BY score DESC LIMIT 5; ``` # BM25 Scoring Source: https://docs.paradedb.com/documentation/sorting/score BM25 scores sort the result set by relevance BM25 scores measure how relevant a document is for a given query. Higher scores indicate higher relevance. ## Basic Usage The `pdb.score()` function produces a BM25 score and can be added to any query where any of the ParadeDB operators are present. ```sql theme={null} SELECT id, pdb.score(id) FROM mock_items WHERE description ||| 'shoes' ORDER BY pdb.score(id) LIMIT 5; ``` In order for a field to be factored into the BM25 score, it must be present in the BM25 index. For instance, consider this query: ```sql theme={null} SELECT id, pdb.score(id) FROM mock_items WHERE description ||| 'keyboard' OR rating < 2 ORDER BY pdb.score(id) LIMIT 5; ``` While BM25 scores will be returned as long as `description` is indexed, including `rating` in the BM25 index definition will allow results matching `rating < 2` to rank higher than those that do not match. ## Joined Scores First, let's create a second table called `orders` that can be joined with `mock_items`: ```sql theme={null} CALL paradedb.create_bm25_test_table( schema_name => 'public', table_name => 'orders', table_type => 'Orders' ); ALTER TABLE orders ADD CONSTRAINT foreign_key_product_id FOREIGN KEY (product_id) REFERENCES mock_items(id); CREATE INDEX orders_idx ON orders USING bm25 (order_id, product_id, order_quantity, order_total, customer_name) WITH (key_field = 'order_id'); ``` Next, let's compute a "combined BM25 score" over a join across both tables. ```sql theme={null} SELECT o.order_id, o.customer_name, m.description, pdb.score(o.order_id) + pdb.score(m.id) as score FROM orders o JOIN mock_items m ON o.product_id = m.id WHERE o.customer_name ||| 'Johnson' AND m.description ||| 'running shoes' ORDER BY score DESC, o.order_id LIMIT 5; ``` ## Score Refresh The scores generated by the BM25 index may be influenced by dead rows that have not been cleaned up by the `VACUUM` process. Running `VACUUM` on the underlying table will remove all dead rows from the index and ensure that only rows visible to the current transaction are factored into the BM25 score. ```sql theme={null} VACUUM mock_items; ``` This can be automated with [autovacuum](/documentation/performance-tuning/overview). # Top N Source: https://docs.paradedb.com/documentation/sorting/topn ParadeDB is optimized for quickly finding the "Top N" results in a table ParadeDB is highly optimized for quickly returning the Top N results out of the index.
In SQL, this means queries that contain an `ORDER BY...LIMIT`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` In order for a Top N query to be executed by ParadeDB vs. vanilla Postgres, all of the following conditions must be met: 1. All `ORDER BY` fields must be indexed. If they are text fields, they [must use the literal tokenizer](#sorting-by-text). 2. At least one ParadeDB text search operator must be present at the same level as the `ORDER BY...LIMIT`. 3. The query must have a `LIMIT`. 4. With the exception of `lower`, ordering by expressions is not supported -- only the raw fields themselves. To verify that ParadeDB is executing the Top N, look for a `Custom Scan` with a `TopNScanExecState` in the `EXPLAIN` output: ```sql theme={null} EXPLAIN SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` ```csv theme={null} QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=10.00..10.02 rows=3 width=552) -> Custom Scan (ParadeDB Scan) on mock_items (cost=10.00..10.02 rows=3 width=552) Table: mock_items Index: search_idx Segment Count: 1 Exec Method: TopNScanExecState Scores: false TopN Order By: rating asc TopN Limit: 5 Tantivy Query: {"with_index":{"query":{"match":{"field":"description","value":"running shoes","tokenizer":null,"distance":null,"transposition_cost_one":null,"prefix":null,"conjunction_mode":false}}}} (10 rows) ``` If any of the above conditions are not met, the query cannot be fully optimized and you will not see a `TopNScanExecState` in the `EXPLAIN` output. ## Tiebreaker Sorting To guarantee stable sorting in the event of a tie, additional columns can be provided to `ORDER BY`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating, id LIMIT 5; ``` ParadeDB is currently able to handle 3 `ORDER BY` columns. If there are more than 3 columns, the `ORDER BY` will not be efficiently executed by ParadeDB. ## Sorting by Text If a text field is present in the `ORDER BY` clause, it must be indexed with the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer. Sorting by lowercase text using `lower()` is also supported. To enable this, the expression `lower()` must be indexed with either the literal or literal normalized tokenizer. See [indexing expressions](/documentation/indexing/indexing-expressions) for more information. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (lower(description)::pdb.literal)) WITH (key_field='id'); ``` This allows sorting by lowercase to be optimized. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'sleek running shoes' ORDER BY lower(description) LIMIT 5; ``` ## Sorting by JSON Ordering by JSON subfield is on the roadmap but not yet supported. 
For example, this query will not receive an optimized Top N scan: ```sql theme={null} SELECT id, description, metadata FROM mock_items WHERE description ||| 'sleek running shoes' ORDER BY metadata->'weight' LIMIT 5; ``` # Alpha Numeric Only Source: https://docs.paradedb.com/documentation/token-filters/alphanumeric Removes any tokens that contain characters that are not ASCII letters The alpha numeric only filter removes any tokens that contain characters that are not ASCII letters (i.e. `a` to `z` and `A` to `Z`) or digits (i.e. `0` to `9`). It is supported for all tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To enable, append `alpha_num_only=true` to the tokenizer's arguments. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('alpha_num_only=true'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'The café at 9pm!'::pdb.simple::text[], 'The café at 9pm!'::pdb.simple('alpha_num_only=true')::text[]; ``` ```ini Expected Response theme={null} text | text -------------------+-------------- {the,café,at,9pm} | {the,at,9pm} (1 row) ``` # ASCII Folding Source: https://docs.paradedb.com/documentation/token-filters/ascii-folding Strips away diacritical marks like accents The ASCII folding filter strips away diacritical marks (accents, umlauts, tildes, etc.) while leaving the base character intact. It is supported for all tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To enable, append `ascii_folding=true` to the tokenizer's arguments. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('ascii_folding=true'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'Café naïve coöperate'::pdb.simple::text[], 'Café naïve coöperate'::pdb.simple('ascii_folding=true')::text[]; ``` ```ini Expected Response theme={null} text | text ------------------------+------------------------ {café,naïve,coöperate} | {cafe,naive,cooperate} (1 row) ``` # Lowercase Source: https://docs.paradedb.com/documentation/token-filters/lowercase Converts all characters to lowercase The lowercase filter converts all characters to lowercase, allowing for case-insensitive queries. It is enabled by default but can be configured for all tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To disable, append `lowercase=false` to the tokenizer's arguments: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('lowercase=false'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'Tokenize me!'::pdb.simple::text[], 'Tokenize me!'::pdb.simple('lowercase=false')::text[]; ``` ```ini Expected Response theme={null} text | text ---------------+--------------- {tokenize,me} | {Tokenize,me} (1 row) ``` # How Token Filters Work Source: https://docs.paradedb.com/documentation/token-filters/overview Token filters apply additional processing to tokens like lowercasing or stemming After a [tokenizer](/documentation/tokenizers/overview) splits up text into tokens, token filters apply additional processing to each token. 
Common examples include [stemming](/documentation/token-filters/stemming) to reduce words to their root form, or [ASCII folding](/documentation/token-filters/ascii-folding) to remove accents. Token filters can be added to any tokenizer besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer, which by definition must preserve the source text exactly. To add a token filter to a tokenizer, append a configuration string to the argument list: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stemmer=english', 'ascii_folding=true'))) WITH (key_field='id'); ``` # Stemmer Source: https://docs.paradedb.com/documentation/token-filters/stemming Reduces words to their root form for a given language Stemming is the process of reducing words to their root form. In English, for example, the root form of "running" and "runs" is "run". Stemming can be configured for any tokenizer besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To set a stemmer, append `stemmer=` to the tokenizer's arguments. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stemmer=english'))) WITH (key_field='id'); ``` Valid languages are `arabic`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hungarian`, `italian`, `norwegian`, `polish`, `portuguese`, `romanian`, `russian`, `spanish`, `swedish`, `tamil`, and `turkish`. To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'I am running'::pdb.simple::text[], 'I am running'::pdb.simple('stemmer=english')::text[]; ``` ```ini Expected Response theme={null} text | text ----------------+------------ {i,am,running} | {i,am,run} (1 row) ``` # Remove Stopwords Source: https://docs.paradedb.com/documentation/token-filters/stopwords Remove language-specific stopwords from the index Stopwords are words that are so common or semantically insignificant in most contexts that they can be ignored during indexing. In English, for example, stopwords include "a", "and", "or", etc. All tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer can be configured to automatically remove stopwords for one or more languages. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stopwords_language=english'))) WITH (key_field='id'); ``` Valid languages are `Danish`, `Dutch`, `English`, `Finnish`, `French`, `German`, `Hungarian`, `Italian`, `Norwegian`, `Polish`, `Portuguese`, `Russian`, `Spanish`, and `Swedish`. Language names are case-insensitive. 
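For instance, because language names are case-insensitive, the following two casts should produce the same stopword-filtered tokens (a quick sketch using the cast syntax from the examples below):

```sql theme={null}
-- Sketch: 'english' and 'ENGLISH' configure the same English stopword list
SELECT 'The cat in the hat'::pdb.simple('stopwords_language=english')::text[],
       'The cat in the hat'::pdb.simple('stopwords_language=ENGLISH')::text[];
```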
## Multiple Languages For documents containing multiple languages, you can specify multiple stopword languages as a comma-separated list: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stopwords_language=English,French'))) WITH (key_field='id'); ``` ```sql theme={null} SELECT 'the quick fox and le renard et'::pdb.simple('stopwords_language=English,French')::text[]; ``` ```ini Expected Response theme={null} text -------------------- {quick,fox,renard} (1 row) ``` ## Example To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'The cat in the hat'::pdb.simple::text[], 'The cat in the hat'::pdb.simple('stopwords_language=English')::text[]; ``` ```ini Expected Response theme={null} text | text ----------------------+----------- {the,cat,in,the,hat} | {cat,hat} (1 row) ``` # Token Length Source: https://docs.paradedb.com/documentation/token-filters/token-length Remove tokens that are above or below a certain byte length from the index The token length filter automatically removes tokens that are above or below a certain length in bytes. To remove all tokens longer than a certain length, append a `remove_long` configuration to the tokenizer: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('remove_long=100'))) WITH (key_field='id'); ``` To remove all tokens shorter than a length, use `remove_short`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('remove_short=3'))) WITH (key_field='id'); ``` All tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer accept these configurations. To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'A supersupersuperlong token'::pdb.simple::text[], 'A supersupersuperlong token'::pdb.simple('remove_short=2', 'remove_long=10')::text[]; ``` ```ini Expected Response theme={null} text | text -------------------------------+--------- {a,supersupersuperlong,token} | {token} (1 row) ``` # Trim Source: https://docs.paradedb.com/documentation/token-filters/trim Remove trailing and leading whitespace from a token The trim filter removes leading and trailing whitespace from a token (but not whitespace in the middle). If a token consists entirely of whitespace, the token is eliminated entirely. This filter is useful for tokenizers that don't already split on whitespace, like the [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer or certain language-specific tokenizers. 
```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.literal_normalized('trim=true'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT ' token with whitespace '::pdb.literal_normalized::text[], ' token with whitespace '::pdb.literal_normalized('trim=true')::text[]; ``` ```ini Expected Response theme={null} text | text ----------------------------------+--------------------------- {" token with whitespace "} | {"token with whitespace"} (1 row) ``` # Chinese Compatible Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/chinese-compatible A simple tokenizer for Chinese, Japanese, and Korean characters The Chinese compatible tokenizer is like the [simple](/documentation/tokenizers/available-tokenizers/simple) tokenizer -- it lowercases non-CJK characters and splits on any non-alphanumeric character. Additionally, it treats each CJK character as its own token. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.chinese_compatible)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 你好!'::pdb.chinese_compatible::text[]; ``` ```ini Expected Response theme={null} text --------------------- {hello,world,你,好} (1 row) ``` # ICU Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/icu Splits text according to the Unicode standard The ICU (International Components for Unicode) tokenizer breaks down text according to the Unicode standard. It can be used to tokenize most languages and recognizes the nuances in word boundaries across different languages. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.icu)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 你好!'::pdb.icu::text[]; ``` ```ini Expected Response theme={null} text -------------------- {hello,world,你好} (1 row) ``` # Jieba Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/jieba The most advanced Chinese tokenizer that leverages both a dictionary and statistical models The Jieba tokenizer is a tokenizer for Chinese text that leverages both a dictionary and statistical models. It is generally considered to be better at identifying ambiguous Chinese word boundaries compared to the [Chinese Lindera](/documentation/tokenizers/available-tokenizers/lindera) and [Chinese compatible](/documentation/tokenizers/available-tokenizers/chinese-compatible) tokenizers, but the tradeoff is that it is slower. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.jieba)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 
你好!'::pdb.jieba::text[]; ``` ```ini Expected Response theme={null} text -------------------------------- {hello," ",world,!," ",你好,!} (1 row) ``` # Lindera Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/lindera Uses prebuilt dictionaries to tokenize Chinese, Japanese, and Korean text The Lindera tokenizer is a more advanced CJK tokenizer that uses prebuilt Chinese, Japanese, or Korean dictionaries to break text into meaningful tokens (words or phrases) rather than on individual characters. Chinese Lindera uses the CC-CEDICT dictionary, Korean Lindera uses the KoDic dictionary, and Japanese Lindera uses the IPADIC dictionary. By default, non-CJK text is lowercased, but punctuation and whitespace are not ignored. ```sql Chinese Lindera theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.lindera(chinese))) WITH (key_field='id'); ``` ```sql Korean Lindera theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.lindera(korean))) WITH (key_field='id'); ``` ```sql Japanese Lindera theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.lindera(japanese))) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 你好!'::pdb.lindera(chinese)::text[]; ``` ```ini Expected Response theme={null} text -------------------------------- {hello," ",world,!," ",你好,!} (1 row) ``` # Literal Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/literal Indexes the text in its raw form, without any splitting or processing The literal tokenizer is not ideal for text search queries like [match](/documentation/full-text/match) or [phrase](/documentation/full-text/phrase). If you need to do text search over a field that is literal tokenized, consider using [multiple tokenizers](/documentation/tokenizers/multiple-per-field). Because the literal tokenizer preserves the source text exactly, [token filters](/documentation/token-filters/overview) cannot be configured for this tokenizer. The literal tokenizer applies no tokenization to the text, preserving it as-is. It is the default for `uuid` fields (since exact UUID matching is a common use case), and is useful for doing exact string matching over text fields. It is also required if the text field is used as a sort field in a [Top N](/documentation/sorting/topn) query, or as part of an [aggregate](/documentation/aggregates/overview). ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.literal)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.literal::text[]; ``` ```ini Expected Response theme={null} text ------------------ {"Tokenize me!"} (1 row) ``` # Literal Normalized Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/literal-normalized Like the literal tokenizer, but allows for token filters For all patch versions greater than `0.20.8` in the `20` minor version, and all patch versions greater than `0.21.4` in the `21` minor version, fields using the [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer are also columnar indexed. This means that they can be used in [aggregates](/documentation/aggregates/overview) and [Top N queries](/documentation/sorting/topn). 
Indexes created prior to these versions must be reindexed to use this feature. The literal normalized tokenizer is similar to the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer in that it does not split the source text. All text is treated as a single token, regardless of how many words it contains. However, unlike the literal tokenizer, this tokenizer allows [token filters](/documentation/token-filters/overview) to be applied. By default, the literal normalized tokenizer also [lowercases](/documentation/token-filters/lowercase) the text. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.literal_normalized)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.literal_normalized::text[]; ``` ```ini Expected Response theme={null} text ------------------ {"tokenize me!"} (1 row) ``` # Ngram Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/ngrams Splits text into small chunks called grams, useful for partial matching The ngram tokenizer splits text into "grams," where each "gram" is of a certain length. The tokenizer takes two arguments. The first is the minimum character length of a "gram," and the second is the maximum character length. Grams will be generated for all sizes between the minimum and maximum gram size, inclusive. For example, `pdb.ngram(2,5)` will generate tokens of size `2`, `3`, `4`, and `5`. To generate grams of a single fixed length, set the minimum and maximum gram size equal to each other. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.ngram(3,3))) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.ngram(3,3)::text[]; ``` ```ini Expected Response theme={null} text ------------------------------------------------- {tok,oke,ken,eni,niz,ize,"ze ","e m"," me",me!} (1 row) ``` ## Ngram Prefix Only To generate ngram tokens for only the first `n` characters in the text, set `prefix_only` to `true`. ```sql theme={null} SELECT 'Tokenize me!'::pdb.ngram(3,3,'prefix_only=true')::text[]; ``` ```ini Expected Response theme={null} text ------- {tok} (1 row) ``` ## Phrase and Proximity Queries with Ngram Because multiple ngram tokens can overlap, the ngram tokenizer does not store token positions. As a result, queries that rely on token positions like [phrase](/documentation/full-text/phrase), [phrase prefix](/documentation/query-builder/phrase/phrase-prefix), [regex phrase](/documentation/query-builder/phrase/regex-phrase), and [proximity](/documentation/full-text/proximity) are not supported over ngram-tokenized fields. An exception is if the min gram size equals the max gram size, which guarantees unique token positions. In this case, setting `positions=true` enables these queries. ```sql theme={null} SELECT 'Tokenize me!'::pdb.ngram(3,3,'positions=true')::text[]; ``` ### Exact Substring Matching with Phrase Queries With `positions=true`, [phrase queries](/documentation/full-text/phrase) over ngram fields perform exact substring matching. This is faster than using [match conjunction](/documentation/full-text/match#match-conjunction) on an ngram field, which creates a `Must` clause for every ngram token and intersects them independently. A phrase query uses a single positional intersection instead.
The tradeoff is that phrase queries are stricter: they require tokens at consecutive positions within a single field value, while match conjunction only requires all tokens to appear somewhere in the document. ```sql theme={null} CREATE TABLE books (id SERIAL PRIMARY KEY, titles TEXT[]); INSERT INTO books (titles) VALUES (ARRAY['The Dragon Hatchling', 'Wings of Gold']), (ARRAY['Dragon Slayer', 'Hatchling Care']); CREATE INDEX ON books USING bm25 (id, (titles::pdb.ngram(4,4,'positions=true'))) WITH (key_field='id'); -- Phrase: matches exact substring "Dragon Hatchling" — only row 1 SELECT * FROM books WHERE titles ### 'Dragon Hatchling'; -- Match conjunction: matches all ngrams anywhere — also only row 1 here, -- but on larger datasets could match rows where the ngrams are scattered SELECT * FROM books WHERE titles ||| 'Dragon Hatchling'; DROP TABLE books; ``` When constructing queries as [JSON](/documentation/query-builder/json), use `tokenized_phrase` to achieve the same result as the `###` operator. It tokenizes the input string with the field's tokenizer and builds a phrase query from the resulting tokens: ```json theme={null} { "tokenized_phrase": { "field": "titles", "phrase": "Dragon Hatchling" } } ``` # Regex Patterns Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/regex Tokenizes text using a regular expression The `regex_pattern` tokenizer tokenizes text using a regular expression. The regular expression can be specified with the pattern parameter. For instance, the following tokenizer creates tokens only for words starting with the letter `h`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.regex_pattern('(?i)\bh\w*'))) WITH (key_field='id'); ``` The regex tokenizer uses the Rust [regex](https://docs.rs/regex/latest/regex/) crate, which supports all regex constructs with the following exceptions: 1. Lazy quantifiers such as `+?` 2. Word boundaries such as `\b` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world!'::pdb.regex_pattern('(?i)\bh\w*')::text[]; ``` ```ini Expected Response theme={null} text --------- {hello} (1 row) ``` # Simple Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/simple Splits on any non-alphanumeric character The simple tokenizer splits on any non-alphanumeric character (e.g. whitespace, punctuation, symbols). All characters are [lowercased](/documentation/token-filters/lowercase) by default. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.simple::text[]; ``` ```ini Expected Response theme={null} text --------------- {tokenize,me} (1 row) ``` # Source Code Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/source-code Tokenizes text that is actually code The source code tokenizer is intended for tokenizing code. In addition to splitting on whitespace, punctuation, and symbols, it also splits on common casing conventions like camel case and snake case. For instance, text like `my_variable` or `myVariable` would get split into `my` and `variable`. 
```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.source_code)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'let my_variable = 2;'::pdb.source_code::text[]; ``` ```ini Expected Response theme={null} text --------------------- {let,my,variable,2} (1 row) ``` # Unicode Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/unicode The default text tokenizer in ParadeDB The unicode tokenizer splits text according to word boundaries defined by the [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/) rules. All characters are [lowercased](/documentation/token-filters/lowercase) by default. This tokenizer is the default text tokenizer. If no tokenizer is specified for a text field, the unicode tokenizer will be used (unless the text field is the [key field](/documentation/indexing/create-index#choosing-a-key-field), in which case the text is not tokenized). ```sql theme={null} -- The following two configurations are equivalent CREATE INDEX search_idx ON mock_items USING bm25 (id, description) WITH (key_field='id'); CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.unicode_words)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.unicode_words::text[]; ``` ```ini Expected Response theme={null} text --------------- {tokenize,me} (1 row) ``` ## Remove Emojis By default, emojis in the source text are preserved. To remove emojis, set `remove_emojis` to `true`. ```sql theme={null} SELECT 'Tokenize me! 😊'::pdb.unicode_words('remove_emojis=true')::text[]; ``` ```ini Expected Response theme={null} text --------------- {tokenize,me} (1 row) ``` # Whitespace Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/whitespace Tokenizes text by splitting on whitespace The whitespace tokenizer splits only on whitespace. It also [lowercases](/documentation/token-filters/lowercase) characters by default. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.whitespace)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.whitespace::text[]; ``` ```ini Expected Response theme={null} text ---------------- {tokenize,me!} (1 row) ``` # Multiple Tokenizers Per Field Source: https://docs.paradedb.com/documentation/tokenizers/multiple-per-field Apply different token configurations to the same field In many cases, a text field needs to be tokenized multiple ways. For instance, using the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer for search, and the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer for [Top N ordering](/documentation/sorting/topn). To tokenize a field in more than one way, append an `alias=` argument to the additional tokenizer configurations. The alias name can be any string you like. For instance, the following statement tokenizes `description` using both the simple and literal tokenizers. 
```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 ( id, (description::pdb.literal), (description::pdb.simple('alias=description_simple')) ) WITH (key_field='id'); ``` Under the hood, two distinct fields are created in the index: a field called `description`, which uses the literal tokenizer, and an aliased field called `description_simple`, which uses the simple tokenizer. To query against the aliased field, cast it to `pdb.alias('alias_name')`: ```sql theme={null} -- Query against `description_simple` SELECT description, rating, category FROM mock_items WHERE description::pdb.alias('description_simple') ||| 'Sleek running shoes'; -- Query against `description` SELECT description, rating, category FROM mock_items WHERE description ||| 'Sleek running shoes'; ``` If a text field uses multiple tokenizers and one of them is [literal](/documentation/tokenizers/available-tokenizers/literal), we recommend aliasing the other tokenizers and leaving the literal tokenizer un-aliased. This is so queries that `GROUP BY`, `ORDER BY`, or aggregate the text field can reference the field directly: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 ( id, (description::pdb.literal), (description::pdb.simple('alias=description_simple')) ) WITH (key_field='id'); SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' ORDER BY description LIMIT 5; ``` # How Tokenizers Work Source: https://docs.paradedb.com/documentation/tokenizers/overview Tokenizers split large chunks of text into small, searchable units called tokens Before text is indexed, it is first split into searchable units called tokens. The default tokenizer in ParadeDB is the [unicode tokenizer](/documentation/tokenizers/available-tokenizers/unicode). It splits text according to word boundaries defined by the Unicode Standard Annex #29 rules. All characters are lowercased by default. To visualize how this tokenizer works, you can cast a text string to the tokenizer type, and then to `text[]`: ```sql theme={null} SELECT 'Hello world!'::pdb.simple::text[]; ``` ```ini Expected Response theme={null} text --------------- {hello,world} (1 row) ``` On the other hand, the [ngrams](/documentation/tokenizers/available-tokenizers/ngrams) tokenizer splits text into "grams" of size `n`. In this example, `n = 3`: ```sql theme={null} SELECT 'Hello world!'::pdb.ngram(3,3)::text[]; ``` ```ini Expected Response theme={null} text ------------------------------------------------- {hel,ell,llo,"lo ","o w"," wo",wor,orl,rld,ld!} (1 row) ``` Choosing the right tokenizer is crucial to getting the search results you want. For instance, the simple tokenizer works best for whole-word matching like "hello" or "world", while the ngram tokenizer enables partial matching. To configure a tokenizer for a column in the index, simply cast it to the desired tokenizer type: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.ngram(3,3))) WITH (key_field='id'); ``` # AI Agents Source: https://docs.paradedb.com/welcome/ai-agents Integrate ParadeDB with AI coding assistants and LLM-powered agents. ParadeDB provides first-class support for AI coding assistants and LLM-powered agents. Whether you're using Claude, Cursor, VS Code Copilot, or building custom AI workflows, you can give your agent expert knowledge of ParadeDB. ## Agent Skills Agent skills are instruction files that teach AI coding assistants about ParadeDB. 
Once installed, your agent will automatically provide accurate, up-to-date guidance when you ask about BM25 indexing, full-text search, or any ParadeDB feature. The skill uses a **pointer-based approach**—instead of bundling static documentation that can go stale, it instructs agents to fetch the latest docs in real-time. Install the skill with: ```bash theme={null} npx skills add paradedb/agent-skills ``` This installs `paradedb-skill` into your agent's skills directory (for example, Codex uses `$CODEX_HOME/skills/paradedb-skill`). For manual and tool-specific setup instructions, see the [agent-skills repository](https://github.com/paradedb/agent-skills). ## MCP Integration ParadeDB documentation is available via the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) for direct integration with MCP-compatible agents. **MCP Endpoint:** ``` https://docs.paradedb.com/mcp ``` This allows MCP-enabled tools to query ParadeDB documentation programmatically and provide contextual assistance. ## Built-in AI Features Every page in the ParadeDB documentation includes AI-powered features: ### Copy Page Menu Click **"Copy page"** in the top-right corner of any page to access quick integrations: * **Copy page** — Copy the page content as Markdown for LLMs * **Copy MCP Server** — Copy the MCP server URL to your clipboard * **Connect to Cursor** — Install the MCP server directly in Cursor * **Connect to VS Code** — Install the MCP server directly in VS Code * **Open in Perplexity / ChatGPT / Claude** — Ask questions about the current page in your preferred AI assistant # Architecture Source: https://docs.paradedb.com/welcome/architecture A deep dive into how ParadeDB is built on Postgres ParadeDB introduces modern query execution paths and data structures, optimized for high-ingest search and analytics workloads, to Postgres. ## Custom Index Custom Index Architecture In Postgres, indexes provide alternative data structures for accessing the data in a table (which Postgres calls a "heap table") more efficiently. ParadeDB introduces a custom index called the *BM25 index*. When a table row is inserted or updated, the BM25 index is immediately notified. These changes are recorded as part of the current transaction, ensuring that index updates are real-time. ## Data Model Data Model The BM25 index is laid out as an [LSM tree](#lsm-tree), where each segment in the tree consists of both an inverted index and columnar index. The inverted and columnar indexes optimize for fast reads, while the LSM tree optimizes for high-frequency writes. ### Inverted Index An inverted index is a structure that maps each term (i.e., tokenized word) to a list of documents that contain that term (called a "postings list") along with metadata like term frequency and document frequency. [This structure](https://github.com/quickwit-oss/tantivy/blob/main/ARCHITECTURE.md#the-inverted-search-index) allows ParadeDB to efficiently retrieve all documents matching a particular search term or phrase without scanning the entire table. ### Columnar Index Alongside the inverted index, ParadeDB also maintains a structure that stores fields in a column-oriented format. Columnar formats are standard for analytical (i.e. OLAP) databases because they store values contiguously and enable efficient scans over large datasets compared to Postgres' row-oriented layout. 
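For example, an aggregation over an indexed numeric field can be answered from the columnar index rather than the heap. A sketch using the `mock_items` table and the `pdb.agg` function from the aggregates guide, assuming `rating` is part of the BM25 index and that the `avg` metric aggregation is available:

```sql theme={null}
-- Sketch: compute the average rating from the columnar index
SELECT pdb.agg('{"avg": {"field": "rating"}}')
FROM mock_items
WHERE id @@@ pdb.all();
```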
All non-text fields, as well as text fields that use the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer, are stored in the columnstore. In Tantivy these structures are referred to as [fast fields](https://docs.rs/tantivy/latest/tantivy/fastfield/index.html), but they are largely transparent in ParadeDB. ### LSM Tree To support real-time updates, the BM25 index uses a [Log-Structured Merge (LSM) tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree). An LSM tree is a write-optimized data structure commonly used in systems like RocksDB and Cassandra. The core idea behind an LSM tree is to turn random writes into sequential ones. Incoming writes are first stored in an in-memory buffer, which is fast to update. Once the buffer fills up or the current statement finishes, it is flushed to disk as an immutable "segment" file. These segment files are organized by size into layers or levels. Newer data is written to the topmost layer. Over time, data is gradually pushed down into lower levels through a process called merging or compaction, where data from smaller segments is merged, deduplicated, and rewritten into larger segments. In ParadeDB, every `INSERT`/`UPDATE`/`COPY` statement creates a new segment. Each segment has its own inverted index and columnar index, which means that the BM25 index is actually a collection of many inverted/columnar indexes, each of which allows for very dense intersection queries to rapidly filter matches. ## Query Execution ### Custom Operators ParadeDB introduces several new text search operators to Postgres. For example, `|||` is used for [match disjunction](/documentation/full-text/match) queries, whereas `###` is for [phrase](/documentation/full-text/phrase) queries. ```sql theme={null} SELECT * FROM mock_items WHERE description ||| 'running shoes'; ``` ParadeDB’s custom query execution paths are only triggered when at least one of ParadeDB's operators is present in the query. Otherwise, the query is executed entirely by native Postgres. ### Custom Scan Whenever a ParadeDB operator is present in a query, ParadeDB will execute the query using a [custom scan](https://www.postgresql.org/docs/current/custom-scan.html). Custom scans are execution nodes set aside by Postgres that allow extensions to run custom logic during a query. They are more powerful and versatile than typical Postgres index scans because they allow the extension to "take over" large parts of the query, including aggregates, `WHERE`, and even [`GROUP BY` clauses](/welcome/roadmap#analytics). From a performance perspective, custom scans significantly speed up queries by pushing down filters, aggregates, and other operations directly into the index, rather than applying them afterward in separate phases. To understand what kind of scan is used, run `EXPLAIN`: ```sql theme={null} -- Native Postgres scan, no ParadeDB operator EXPLAIN SELECT * FROM mock_items WHERE description = 'running shoes' AND rating <= 5; -- Custom scan, ParadeDB operator used EXPLAIN SELECT * FROM mock_items WHERE description ||| 'running shoes' AND rating <= 5; ``` As a rule of thumb: if `EXPLAIN` shows a custom scan (or, in rare cases, a BM25 index scan), then that part of the query is going through ParadeDB. Otherwise, the query passes through standard Postgres.
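Aggregates can be checked the same way. As a sketch (assuming `description` is indexed), a `COUNT` guarded by a ParadeDB operator should also show up under the custom scan in the plan:

```sql theme={null}
-- Sketch: an aggregate combined with a ParadeDB operator is pushed into the custom scan
EXPLAIN SELECT COUNT(*)
FROM mock_items
WHERE description ||| 'running shoes';
```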
### Parallelization For queries that need to read large amounts of data like [Top N](/documentation/sorting/topn) or aggregate queries, the custom scan automatically spawns additional workers to execute the query in parallel. To see if a query was parallelized, run `EXPLAIN ANALYZE`: ```sql theme={null} -- "Top N" queries may be parallelized EXPLAIN ANALYZE SELECT * FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` Parallelization also depends on the [number of available workers](/documentation/performance-tuning/reads). Parallel workers are another reason why the BM25 index is significantly faster than Postgres' native text search and aggregates, which are mostly not capable of parallelization. ## Design Philosophy * **Keep it Boring**. Use robust extension points in Postgres vs. hacking around the internals. Adopt battle-tested tools, like industry standard file formats and query engine libraries, instead of cutting-edge but less-proven alternatives. * **Behave Exactly Like Postgres**. This extends from user-facing aspects, like the SQL query syntax and ORM compatibility, all the way down to low-level integrations with Postgres' storage system and query planner. * **Works Out of the Box**. Users should be able to get satisfying search results and performance with minimal tuning or configuration. ## Dependencies The two main dependencies of `pg_search` are [`pgrx`](https://github.com/pgcentralfoundation/pgrx/tree/develop), the library for writing Postgres extensions in Rust, and [Tantivy](https://github.com/quickwit-oss/tantivy), a Rust-based search library inspired by [Lucene](https://github.com/apache/lucene). # Guarantees Source: https://docs.paradedb.com/welcome/guarantees ParadeDB ensures ACID compliance, concurrency, data integrity, and replication safety ### ACI(D) All reads and writes go through Postgres’ transaction engine. This means that inserts, updates, and deletes to indexed columns are atomic, consistent, and respect Postgres' [isolation levels](https://www.postgresql.org/docs/current/transaction-iso.html). Durability — the "D" in ACID — means that once a transaction is committed, its changes will survive crashes or failovers. In PostgreSQL, this guarantee is provided by the write-ahead log (WAL), which ensures that all changes are safely recorded before being applied to disk. [ParadeDB Community](https://github.com/paradedb/paradedb) does **not** write to the WAL, and therefore does not guarantee durability in the face of crashes. For production use cases that require full durability, [ParadeDB Enterprise](/deploy/enterprise) — a closed-source fork of ParadeDB for enterprise customers — includes full WAL integration. ### Concurrency ParadeDB is designed to support concurrent reads and writes in the same way that Postgres does — by adhering to Postgres' [multi-version concurrency control (MVCC)](https://www.postgresql.org/docs/current/mvcc.html) rules. We maintain an internal testing suite that rigorously measures the read and write throughput of the BM25 index under concurrent load. Both read and write throughput under concurrent load can be improved by tuning Postgres' settings. 
For instance, read throughput can be improved by configuring the [`max_parallel_worker` pool](/documentation/performance-tuning/reads#raise-parallel-workers) and [buffer cache size](/documentation/performance-tuning/reads#raise-shared-buffers), whereas writes can be improved by increasing [per-statement memory](/documentation/performance-tuning/writes#increase-memory-for-bulk-updates). ### Correctness vs. Performance While ParadeDB optimizes heavily around performance, there are some situations where the database can squeeze more performance by relaxing correctness constraints. In these cases, ParadeDB — like Postgres — will guarantee correctness, even if it comes at the cost of slower query execution. ### Replication Safety ParadeDB distinguishes between logical replication and physical replication. Logical replication refers to replicating changes from a standard Postgres primary (e.g. AWS RDS) into a ParadeDB instance. This is commonly used when ParadeDB acts as a search node built from upstream Postgres changes. Physical replication refers to running ParadeDB itself in a multi-node, high-availability (HA) setup using write-ahead log (WAL) shipping. [ParadeDB Community](https://github.com/paradedb/paradedb) supports logical replication, but not physical replication: * It can act as a logical replica, ingesting changes from a Postgres primary and indexing them transactionally. * The BM25 index does not get physically replicated and won't be available on other nodes in a high availability setup. [ParadeDB Enterprise](/deploy/enterprise) supports both: * It can act as a logical replica, ingesting changes from a Postgres primary and indexing them transactionally. * It supports physical replication and high availability, ensuring that the BM25 index remains consistent and crash-safe across nodes. If your deployment requires high availability, or failover, we recommend using [ParadeDB Enterprise](/deploy/enterprise). ### Data Integrity All data inserted into the BM25 index must conform to the column’s declared type. ParadeDB relies on Postgres’ type system and input/output functions to ensure validity. For example, invalid data will result in a Postgres error at insert time, not at query time. # Simple, Elastic-Quality Search for Postgres Source: https://docs.paradedb.com/welcome/introduction ParadeDB is the modern Elastic alternative built as a Postgres extension. ParadeDB Banner ## Who is ParadeDB for? You are likely a good fit for ParadeDB if you identify with the following: 1. Your **primary database is Postgres**, either managed (e.g. AWS RDS) or self-managed 2. You **have used Postgres' built-in search** capabilities via `tsvector` and the GIN index, but have reached a scale where you're limited by **performance bottlenecks** or **missing features** like BM25 scoring or fuzzy search 3. You are evaluating a search engine like Elasticsearch, but **don't want to introduce another cumbersome dependency** to your stack ## Why ParadeDB? For teams that already use Postgres, ParadeDB is the simplest way to bring Elastic-quality search to your application. ### Zero ETL Required Syncing Postgres with an external search engine like Elastic can be a time-consuming, error-prone process that involves babysitting ETL pipelines and debugging data inconsistency issues. 
ParadeDB eliminates this class of problems because you can: * [Install](/deploy/self-hosted/extension) the ParadeDB extension directly inside your Postgres, if it is self-managed * [Run ParadeDB as a logical replica](/deploy/self-hosted/logical-replication/getting-started) of your primary Postgres, if you use managed Postgres providers like RDS ### Search That Feels Like Postgres In ParadeDB, writing a search query is as simple as writing SQL. ParadeDB supports JOINs, which removes the complexity of denormalizing your existing schema. ### As Reliable As Postgres ParadeDB supports Postgres transactions and ACID guarantees. This means that data is searchable immediately after it's written to ParadeDB, and durable thanks to Postgres write-ahead logging. ## ParadeDB vs. Alternatives People usually compare ParadeDB to two other types of systems: OLTP databases like vanilla Postgres and search engines like Elastic. | | **OLTP database** | **Search engine** | **ParadeDB** | | --------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------- | | **Primary role** | System of record | Search and retrieval engine | System of record **and** search/analytics engine | | **Examples** | Postgres, MySQL | Elasticsearch, OpenSearch | | | **Search features** | Basic FTS (no BM25, weak ranking) | Rich search features (BM25, fuzzy matching, faceting, hybrid search) | Rich search features (BM25, fuzzy matching, faceting, hybrid search) | | **Analytics features** | Not an analytical DB (no column store, batch processing, etc.) | Column store, batch processing, parallelization via sharding | Column store, batch processing, parallelization via Postgres [parallel workers](/documentation/performance-tuning/writes) | | **Lag** | None in a single cluster | At least network, ETL transformation, and indexing time | None in a single cluster | | **Operational complexity** | Simple (single datastore) | Complex (ETL pipelines, managing multiple systems) | Simple (single datastore) | | **Scalability** | Vertical scaling in a single node, horizontal scaling through Kubernetes | Horizontal scaling through sharding | Vertical scaling in a single node, horizontal scaling through [Kubernetes](/deploy/self-hosted/kubernetes) | | **Language** | SQL | Custom DSL | Standard SQL with custom search operators | | **ACID guarantees** | Full ACID compliance, read-after-write guarantees | No transactions, atomic only per-document, eventual consistency, durability not guaranteed until flush | Full ACID compliance, read-after-write guarantees | | **Update & delete support** | Built for fast-changing data | Struggles with updates/deletes | Built for fast-changing data | ## Production Readiness As a company, ParadeDB is over two years old. ParadeDB launched in the [Y Combinator (YC)](https://ycombinator.com) S23 batch and has been validated in production since December 2023. [ParadeDB Community](https://github.com/paradedb/paradedb), the open-source version of ParadeDB, has been deployed over 400,000 times in the past 12 months. ParadeDB Enterprise, the durable and production-hardened edition of ParadeDB, powers core search and analytics use cases at enterprises ranging from Fortune 500s to fast-growing startups. 
A few examples include: * **Alibaba Cloud**, the largest Asia-Pacific cloud provider, uses ParadeDB to power search inside their data warehouse. [Case study available](https://www.paradedb.com/customers/case-study-alibaba). * **Bilt Rewards**, a rent payments technology company that processed over \$36B in payments in 2024. [Case study available](https://www.paradedb.com/customers/case-study-bilt). * **Modern Treasury**1, a financial technology company that automates the full cycle of money movement. * **Span**1, one of the fastest-growing AI developer productivity platforms * **TCDI**1, a giant in the legal software and litigation management space. *1. Case study coming soon* ## Next Steps You're now ready to jump into our guides. Get started with ParadeDB in under five minutes. Learn how ParadeDB is built. API reference for full text search and analytics. Deploy ParadeDB as a Postgres extension or standalone database. # Limitations & Tradeoffs Source: https://docs.paradedb.com/welcome/limitations Understand ParadeDB's key limitations and tradeoffs ## Distributed Workloads ParadeDB is designed to scale vertically on a single Postgres node with potentially many read replicas, and many production deployments comfortably operate in the 1–10TB range. The largest single ParadeDB database we’ve seen in production is 10TB. For datasets that significantly exceed this scale, ParadeDB supports partitioned tables and can be deployed in sharded Postgres configurations. ParadeDB is fully compatible with [Citus](https://github.com/citusdata/citus) for distributed search workloads — you can create BM25 indexes on distributed tables and run search queries across shards. See our [Citus deployment guide](/deploy/citus) for more details. If you're working with very large datasets, please [reach out to us](mailto:support@paradedb.com). We'd be happy to provide guidance and share our roadmap for future distributed query support. ## Join Support ParadeDB supports all PostgreSQL `JOIN`s: * `INNER JOIN` * `LEFT / RIGHT / FULL OUTER JOIN` * `CROSS JOIN` * `LATERAL` * Semi and Anti `JOIN`s For the most part you can mix search and relational queries without changing your SQL. However, `JOIN`s do incur some performance tradeoffs. See the [joins guide](/documentation/performance-tuning/joins) for more details. ## Covering Index The BM25 index in ParadeDB is a covering index, which means it stores all indexed columns inside a single index per table. This decision is intentional -- by colocating all the relevant data, ParadeDB optimizes for fast reads and boolean conditions. However, this means that all columns must be defined up front at index creation time. Adding or removing columns requires a `REINDEX`. ## DDL Replication A commonly known limitation of Postgres logical replication is that DDL (Data Definition Language) statements are not replicated. This includes operations like `CREATE TABLE` or `CREATE INDEX`. If ParadeDB is running as a logical replica of a primary Postgres, DDL statements from the primary must be executed manually on the replica. We recommend version-controlling your schema changes and applying them in a coordinated, repeatable way — either through a migration tool or deployment automation — to keep source and target databases in sync. # Roadmap Source: https://docs.paradedb.com/welcome/roadmap The main features that we are currently working on We're a lean team that likes to ship at [incredibly high velocity](https://github.com/paradedb/paradedb/releases). 
## In Progress ### JOIN Improvements * **Scoring and highlighting across JOINs**. BM25 score and snippet functions can be used in `JOIN` queries. * **Smarter JOIN planning for search indexes**. Apply index-aware optimizations and cost estimation strategies when multiple BM25-indexed tables are joined. * **Faster JOIN performance through predicate pushdown**. Search predicates are selectively pushed down to relevant tables based on indexability and selectivity, improving `JOIN` query speed. ### Ecosystem Integrations * **ORMs**. Official support for popular ORMs, like Django, Prisma, Active Record, and many others is coming. * **AI Frameworks**. Official support for LangChain, LlamaIndex, CrewAI, and others is coming. * **PaaS Providers**. Official tutorials for hosting ParadeDB on platform-as-a-service providers like Railway, DigitalOcean, Porter.run, and many others are coming. ## Long Term ### Deeper Analytics Improvements * **Push Postgres visibility rules into the index**. This is currently a filter applied after the index scan, which adds overhead to large scans. * **Evaluate more industry-standard OLAP tools**. A new file format? Query execution library? ### Vector Search Improvements * Postgres (and by extension, ParadeDB) uses [`pgvector`](https://github.com/pgvector/pgvector) for vector search. Contingent on demand and internal resources, we may investigate what improvements can be made to the known limitations of `pgvector`. ### Managed Cloud * Today, you can [deploy ParadeDB](/deploy/overview) either self-hosted or with ParadeDB BYOC. We are working on a fully managed cloud offering, with a focus on scalability and supporting distributed workloads. ## Completed ### Analytics * **A custom scan node for aggregates**. This will allow "plain SQL" aggregates to go through the same fast execution path as our [aggregate UDFs](/documentation/aggregates/tantivy), further accelerating aggregates like `COUNT`, and SQL clauses like `GROUP BY`. ### Write Throughput * **Background merging**. Improves write performance by merging index segments asynchronously without blocking inserts. * **Pending list**. Buffers recent writes before flushing them to the LSM tree. ### Improved UX * **More intuitive index configuration**. Overhaul the complicated JSON `WITH` index options. * **More ORM friendly**. Overhaul the [query builder functions](/documentation/query-builder/overview) to use actual column references instead of string literals. * **New operators**. In addition to the existing `@@@` operator, introduce new operators for different query types (e.g. phrase, term, conjunction/disjunction). ## We're Hiring We're tackling some of the hardest and (in our opinion) most impactful problems in Postgres. If you want to be a part of it, please check out our [open roles](https://paradedb.notion.site)! # Help and Support Source: https://docs.paradedb.com/welcome/support How to obtain support for ParadeDB For questions regarding enterprise support or commercial licensing, please [contact sales](mailto:sales@paradedb.com). For community support and general questions, please join the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-32abtyjg4-yoYoi~RPh9MSW8tDbl0BQw). ## Ask a Question Use the **"Ask a question..."** bar at the bottom of any page to get instant answers about ParadeDB. The AI assistant has full context of the documentation and can help with queries, troubleshooting, and best practices.
# ParadeDB BYOC Source: https://docs.paradedb.com/deploy/byoc Deploy ParadeDB Bring Your Own Cloud (BYOC) within your cloud environment For access to ParadeDB BYOC, [contact sales](mailto:sales@paradedb.com). ParadeDB BYOC (Bring Your Own Cloud) is a managed deployment of ParadeDB within your cloud environment. It combines the benefits of a managed platform with the security posture of a self-hosted deployment. ParadeDB BYOC is supported on GCP and AWS, including GovCloud regions and airgapped environments. To request access for Azure, Oracle Cloud, or another cloud platform please contact [sales@paradedb.com](mailto:sales@paradedb.com). ## How BYOC Works ParadeDB BYOC provisions a Kubernetes cluster in your cloud environment with [high availability](/deploy/self-hosted/high-availability/configuration) preconfigured. It also configures [logical replication](/deploy/self-hosted/logical-replication/getting-started) with your primary Postgres, backups, connection pooling, monitoring, access control, and audit logging. ParadeDB BYOC can be deployed and managed in one of two ways: * **Fully Managed**: ParadeDB will deploy and manage the ParadeDB BYOC module for you. ParadeDB requires a sub-account or project within your cloud provider via an IAM user or a service account. * **Just-in-Time Managed**: You will deploy the ParadeDB BYOC module and can choose to provide just-in-time access to the ParadeDB team when support is required. This is typically useful for airgapped environments. ParadeDB BYOC Topology ## Getting Started This section assumes that you have received access to the ParadeDB BYOC module and are deploying it yourself on AWS or GCP. In a fully managed deployment, these steps will be performed by ParadeDB on your behalf. ### Install Dependencies First, ensure that you are in the BYOC module repository. Next, install Terraform, Kubectl, PostgreSQL, and the CLI for your desired cloud provider: ```bash macOS theme={null} brew install terraform kubectl postgresql ``` ```bash Ubuntu theme={null} sudo apt-get install -y terraform kubectl postgresql ``` ### Authenticate CLI Install and authenticate with either the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) or [GCP CLI](https://cloud.google.com/sdk/docs/install#deb). ```bash AWS theme={null} aws configure ``` ```bash GCP theme={null} gcloud init gcloud auth application-default login ``` ### Provision ParadeDB Our Terraform project will provision a Kubernetes cluster (EKS or GKE) along with all the necessary infrastructure to run ParadeDB. First, copy either `aws.example.tfvars` or `gcp.example.tfvars` into a new file called `byoc.tfvars`. ```bash AWS theme={null} cp aws.example.tfvars byoc.tfvars ``` ```bash GCP theme={null} cp gcp.example.tfvars byoc.tfvars ``` Next, open and configure `byoc.tfvars`. Configuration instructions can be found directly within the file. ```bash theme={null} open byoc.tfvars || xdg-open byoc.tfvars ``` ### Run Terraform First, initialize Terraform. ```bash AWS theme={null} terraform -chdir=infrastructure/aws init ``` ```bash GCP theme={null} terraform -chdir=infrastructure/gcp init ``` Next, run Terraform `apply`. ```bash AWS theme={null} terraform -chdir=infrastructure/aws apply -var-file=../../byoc.tfvars ``` ```bash GCP theme={null} terraform -chdir=infrastructure/gcp apply -var-file=../../byoc.tfvars ``` It may take up to 30 minutes to provision all the necessary infrastructure. 
When this command is complete, you will see a `kubectl` command printed as Terraform output to the terminal. Run this command, which will add the EKS or GKE cluster configuration to your local `.kubeconfig` file. That's it! You're now ready to connect to ParadeDB. ### Connect to ParadeDB #### Access the Grafana Dashboard First, port-forward the Grafana service to localhost. ```bash theme={null} kubectl --namespace monitoring port-forward service/prometheus-grafana 8080:80 ``` Then, go to `http://localhost:8080`. Your Grafana credentials have been printed in the terminal output of the above Terraform `apply` command. You can find the ParadeDB dashboard by typing `CloudNativePG` in the search bar, and selecting `paradedb` for the Database Namespace. By default, the dashboard will display metrics over the last 7 days. If you've just spun up the cluster, change it to the last 15 minutes to start seeing results immediately. #### Access the ParadeDB Instance First, retrieve the database credentials. ```bash theme={null} kubectl --namespace paradedb get secrets paradedb-superuser -o json | jq -r '.data | map_values(@base64d) | .uri |= sub("\\*"; "paradedb") | .dbname = "paradedb"' ``` Next, port-forward the ParadeDB service to localhost. ```bash theme={null} kubectl --namespace paradedb port-forward service/paradedb-rw 5432:5432 ``` Now you can connect to the ParadeDB instance using the credentials you've retrieved. ```bash theme={null} PGPASSWORD= psql -h localhost -d paradedb -p 5432 -U ``` # GitHub Actions Source: https://docs.paradedb.com/deploy/ci/github-actions How to run ParadeDB in Github Actions CI ## Sample GitHub Actions Workflow ```yaml theme={null} name: ParadeDB in GitHub Actions on: pull_request: branches: - main workflow_dispatch: jobs: paradedb-in-github-actions: name: ParadeDB in GitHub Actions runs-on: ubuntu-latest services: paradedb: # The list of available tags can be found at https://hub.docker.com/r/paradedb/paradedb/tags image: paradedb/paradedb:latest env: POSTGRES_USER: testuser POSTGRES_PASSWORD: testpassword POSTGRES_DB: testdb ports: - 5432:5432 options: >- --health-cmd="pg_isready -U postgres" --health-interval=10s --health-timeout=5s --health-retries=5 steps: - name: Checkout repository uses: actions/checkout@v6 - name: Wait for PostgreSQL to be ready run: | for i in {1..10}; do if psql -h localhost -U testuser -d testdb -c "SELECT 1;" > /dev/null 2>&1; then echo "Database is ready!" break fi echo "Waiting for database..." 
sleep 5 done - name: Run ParadeDB example queries run: | psql -h localhost -U testuser -d testdb -c "CALL paradedb.create_bm25_test_table(schema_name => 'public', table_name => 'mock_items');" psql -h localhost -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items LIMIT 3;" psql -h localhost -U testuser -d testdb -c "CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range) WITH (key_field='id');" psql -h localhost -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' OR category @@@ 'footwear' AND rating @@@ '>2' ORDER BY description LIMIT 5;" ``` # GitLab CI Source: https://docs.paradedb.com/deploy/ci/gitlab-ci How to run ParadeDB in Gitlab CI ## Sample GitLab CI Workflow ```yaml theme={null} paradedb-in-gitlab-ci: # The list of available tags can be found at https://hub.docker.com/r/paradedb/paradedb/tags image: paradedb/paradedb:latest services: - postgres variables: POSTGRES_USER: testuser POSTGRES_DB: testdb POSTGRES_HOST_AUTH_METHOD: trust script: - psql -h "postgres" -U testuser -d testdb -c "CALL paradedb.create_bm25_test_table(schema_name => 'public', table_name => 'mock_items');" - psql -h "postgres" -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items LIMIT 3;" - psql -h "postgres" -U testuser -d testdb -c "CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range) WITH (key_field='id');" - psql -h "postgres" -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' OR category @@@ 'footwear' AND rating @@@ '>2' ORDER BY description LIMIT 5;" ``` # Using ParadeDB with Citus Source: https://docs.paradedb.com/deploy/citus Distributed full-text search with Citus and ParadeDB [Citus](https://github.com/citusdata/citus) transforms PostgreSQL into a distributed database with horizontal sharding. ParadeDB is fully compatible with Citus, enabling distributed full-text search across sharded tables. ## What's Supported * **BM25 indexes on distributed tables** — Create search indexes after distributing tables with `create_distributed_table()` * **Distributed queries with search operators** — Use the `|||` (match disjunction) and `&&&` (match conjunction) operators in queries across sharded tables * **Subqueries with LIMIT** — Complex queries with subqueries and LIMIT clauses work correctly * **JOIN queries** — Search with JOINs across distributed tables ## Installation Both `citus` and `pg_search` must be added to `shared_preload_libraries` in the correct order: ```bash theme={null} # Install Citus first curl https://install.citusdata.com/community/deb.sh | sudo bash apt-get install -y postgresql-17-citus-13.0 # Add both extensions to shared_preload_libraries sed -i "s/^shared_preload_libraries = .*/shared_preload_libraries = 'citus,pg_search'/" /var/lib/postgresql/data/postgresql.conf # Restart PostgreSQL # Then create extensions in your database ``` The order in `shared_preload_libraries` matters. Always list `citus` before `pg_search` to ensure proper planner hook chaining. 
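Once Postgres has restarted with both libraries, you can sanity-check the load order from `psql` using a standard Postgres command (nothing here is ParadeDB-specific):

```sql theme={null}
-- Confirm that both libraries are preloaded, with citus listed before pg_search
SHOW shared_preload_libraries;
-- Expected output: citus,pg_search
```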
## Usage Example Here's a complete example of setting up distributed search with Citus: ```sql theme={null} CREATE EXTENSION citus; CREATE EXTENSION pg_search; -- Create a table with a distribution key CREATE TABLE articles ( id SERIAL, author_id INT NOT NULL, title TEXT, body TEXT, PRIMARY KEY (author_id, id) -- Must include distribution column ); -- Distribute the table across shards SELECT create_distributed_table('articles', 'author_id'); -- Create a BM25 index on the distributed table CREATE INDEX articles_search_idx ON articles USING bm25 (id, title, body) WITH (key_field='id'); -- Insert some data INSERT INTO articles (author_id, title, body) VALUES (1, 'PostgreSQL Performance', 'Optimizing PostgreSQL queries for large datasets'), (1, 'Distributed Databases', 'Understanding sharding and replication strategies'), (2, 'Full-Text Search', 'Building search engines with PostgreSQL'); -- Search across shards SELECT id, title FROM articles WHERE body ||| 'PostgreSQL distributed' ORDER BY id; -- Results: -- id | title -- ----+------------------------ -- 1 | PostgreSQL Performance -- 3 | Full-Text Search ``` ### Verify Distributed Execution You can verify that both ParadeDB and Citus are working together by examining the query plan: ```sql theme={null} EXPLAIN (VERBOSE) SELECT id, title FROM articles WHERE body ||| 'PostgreSQL distributed' ORDER BY id; ``` The plan should show: 1. **Citus Adaptive Custom Scan** — Coordinating distributed query execution across shards 2. **ParadeDB Scan** — Using the BM25 index within each shard 3. **Task Count: 32** — Query distributed across 32 shards (default Citus shard count) ``` Sort (cost=11041.82..11291.82 rows=100000 width=36) Output: remote_scan.id, remote_scan.title Sort Key: remote_scan.id -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=36) Output: remote_scan.id, remote_scan.title Task Count: 32 Tasks Shown: One of 32 -> Task Query: SELECT id, title FROM public.articles_102008 articles WHERE (id OPERATOR(pg_catalog.@@@) ...) 
Node: host=localhost port=5432 dbname=postgres -> Custom Scan (ParadeDB Scan) on public.articles_102008 articles (cost=10.00..10.01 rows=1 width=36) Output: id, title Table: articles_102008 Index: articles_search_idx_102008 Tantivy Query: {"with_index":{"query":{"with_index":{"query":{"match":{"field":"body","value":"PostgreSQL distributed"}}}}}} ``` ## Distributed JOINs with Search ParadeDB search operators work seamlessly with Citus distributed JOINs: ```sql theme={null} -- Create and distribute a second table CREATE TABLE authors ( id INT PRIMARY KEY, name TEXT, bio TEXT ); SELECT create_distributed_table('authors', 'id'); -- JOIN with search operators SELECT a.name, ar.title FROM authors a JOIN articles ar ON a.id = ar.author_id WHERE ar.body ||| 'PostgreSQL' ORDER BY a.name; -- Results: -- name | title -- -------+------------------------ -- Alice | PostgreSQL Performance -- Bob | Full-Text Search ``` ### Verify Distributed JOIN Execution Check the execution plan for distributed JOINs with search: ```sql theme={null} EXPLAIN (VERBOSE) SELECT a.name, ar.title FROM authors a JOIN articles ar ON a.id = ar.author_id WHERE ar.body ||| 'PostgreSQL' ORDER BY a.name; ``` ``` Sort (cost=12067.32..12317.32 rows=100000 width=64) Output: remote_scan.name, remote_scan.title Sort Key: remote_scan.name -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=64) Output: remote_scan.name, remote_scan.title Task Count: 32 Tasks Shown: One of 32 -> Task Query: SELECT a.name, ar.title FROM (public.authors_102040 a JOIN public.articles_102008 ar ON (...)) Node: host=localhost port=5432 dbname=postgres -> Nested Loop (cost=10.15..18.20 rows=1 width=64) Output: a.name, ar.title Inner Unique: true -> Custom Scan (ParadeDB Scan) on public.articles_102008 ar (cost=10.00..10.01 rows=1 width=36) Output: ar.title, ar.author_id Table: articles_102008 Index: articles_search_idx_102008 Tantivy Query: {"with_index":{"query":{"with_index":{"query":{"match":{"field":"body","value":"PostgreSQL"}}}}}} -> Index Scan using authors_pkey_102040 on public.authors_102040 a (cost=0.15..8.17 rows=1 width=36) Output: a.id, a.name, a.bio Index Cond: (a.id = ar.author_id) ``` Key indicators: * `Nested Loop` shows efficient JOIN execution on each shard * `Custom Scan (ParadeDB Scan)` on the outer side of the JOIN uses BM25 for filtering * `Index Scan` on authors table uses the primary key for lookups * JOINs execute **locally on each shard** for optimal performance ## Known Limitations * ❌ **Citus columnar tables** — BM25 indexes and other PostgreSQL indexes (like GiST, GIN) cannot be created on Citus columnar tables due to limitations in Citus's columnar storage implementation. However, you can use regular distributed tables with BM25 indexes alongside columnar tables for analytics. ## Performance Considerations When using ParadeDB with Citus: * **Index creation** happens locally on each shard, enabling parallel index building * **Search queries** execute in parallel across shards and results are merged by the coordinator * **Distribution column** should be chosen based on your query patterns to minimize cross-shard operations For more guidance on optimizing distributed search workloads, please reach out to us in the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-32abtyjg4-yoYoi~RPh9MSW8tDbl0BQw) or via [email](mailto:support@paradedb.com). 
# ParadeDB Enterprise Source: https://docs.paradedb.com/deploy/enterprise Feature comparison between ParadeDB Community and Enterprise If you're a non-profit or a non-commercial open source project and are interested in ParadeDB Enterprise, please [contact sales](mailto:sales@paradedb.com). We provide complimentary access on a case-by-case basis. ParadeDB ships in two versions: ParadeDB Community and ParadeDB Enterprise. [ParadeDB Community](https://github.com/paradedb/paradedb) is our open source product, licensed under [AGPL-3.0](https://github.com/paradedb/paradedb/blob/dev/LICENSE). This license permits free use, modification, and distribution of the software, provided that distributed, derivative works of the software are released under the same license (copyleft provision). In addition to all of the features of ParadeDB Community, ParadeDB Enterprise: 1. Waives the copyleft provision of AGPL-3.0 2. Contains several closed-source features that are recommended for ParadeDB to service enterprise, production workloads For access to ParadeDB Enterprise, please [contact sales](mailto:sales@paradedb.com). ## Feature Comparison | | ParadeDB Community | ParadeDB Enterprise | | ------------------------------------ | ------------------ | ------------------- | | **Index Configuration** | | | | Support for most Postgres types | ✅ | ✅ | | Custom tokenizers and filters | ✅ | ✅ | | Multiple tokenizers per field | ✅ | ✅ | | **Full Text Search and Analytics** | | | | Query builder API | ✅ | ✅ | | Efficient "Top N" ordering | ✅ | ✅ | | BM25 scoring | ✅ | ✅ | | Highlighting | ✅ | ✅ | | Hybrid search | ✅ | ✅ | | Parallelized fast field aggregates | ✅ | ✅ | | **Concurrency and Consistency** | | | | Postgres MVCC-safe1 | ✅ | ✅ | | Concurrent, non-blocking writes | ✅ | ✅ | | Block storage integration | ✅ | ✅ | | Buffer cache integration2 | ✅ | ✅ | | **Deployment** 3 | | | | Maximum cluster size 4 | 1 | Unlimited | | Physical (i.e. WAL) Replication | ❌ | ✅ | | Crash Recovery | ❌ | ✅ | | Point in Time Recovery | ❌ | ✅ | | Logical Replication | ✅ | ✅ | **Footnotes**

1. The BM25 index supports Postgres' multi-version concurrency control (MVCC) rules. The index reflects the current state of the underlying table at all times, changes to the index are atomic, and queries are transactionally consistent with the table. 2. The BM25 index is built on block storage, Postgres' native storage API. This means that it leverages the Postgres buffer cache, which minimizes disk I/O. 3. All listed deployment features and limitations are specific to the BM25 index. For instance, ParadeDB Community supports physical/logical replication, crash recovery, etc. for heap tables and other Postgres indexes like B-Tree. 4. In a primary-replica topology, BM25 indexes in ParadeDB Community are only available on the primary, as the Community edition does not support physical (WAL) replication.

# Deploying ParadeDB Source: https://docs.paradedb.com/deploy/overview Explore the different ways to deploy ParadeDB into production There are two ways to deploy ParadeDB: [self-hosting ParadeDB](/deploy/self-hosted/kubernetes) or using [ParadeDB BYOC](/deploy/byoc) (Bring Your Own Cloud). ## Self-Hosted ParadeDB For hobby, development, or staging environments, [ParadeDB Community](https://github.com/paradedb/paradedb) can be deployed as a [Docker image](/deploy/self-hosted/docker), inside [Kubernetes](/deploy/self-hosted/kubernetes), or as an [extension](/deploy/self-hosted/extension) inside an existing self-hosted Postgres. For production environments, we recommend deploying the [ParadeDB Enterprise](/deploy/enterprise) binary inside Kubernetes. This unlocks support for [logical replication](/deploy/self-hosted/logical-replication/getting-started) and [high availability](/deploy/self-hosted/high-availability/configuration), which must be configured separately. Running ParadeDB Community in a production application that serves paying customers is discouraged. This is because ParadeDB Community [does not have write-ahead log (WAL) support](/deploy/enterprise). Without WALs, data can be lost or corrupted if the server crashes or restarts, which would necessitate a reindex and incur downtime for your application. For more details, see [guarantees](/welcome/guarantees#aci-d). When you are ready to deploy ParadeDB to production, [contact us](mailto:sales@paradedb.com) for access to ParadeDB Enterprise, which has WAL support. ## ParadeDB BYOC [ParadeDB BYOC (Bring Your Own Cloud)](/deploy/byoc) is a managed deployment of ParadeDB Enterprise inside your AWS or GCP account. Please [contact sales](mailto:sales@paradedb.com) for access. # Docker Source: https://docs.paradedb.com/deploy/self-hosted/docker How to run the ParadeDB Docker container We recommend running ParadeDB Enterprise, not Community, in production to maximize uptime. See [overview](/deploy/overview#self-hosted). To deploy ParadeDB via Docker, pull and run the `paradedb/paradedb` image locally. This is the recommended deployment method for testing and hobby deployments. We do not recommend running the ParadeDB Docker image in production, as it does not provide [high availability](/deploy/self-hosted/high-availability/configuration). ```bash theme={null} docker run \ --name paradedb \ -e POSTGRES_USER=myuser \ -e POSTGRES_PASSWORD=mypassword \ -e POSTGRES_DB=mydatabase \ -v paradedb_data:/var/lib/postgresql/ \ -p 5432:5432 \ -d \ paradedb/paradedb:latest ``` You may replace `myuser`, `mypassword`, and `mydatabase` with whatever values you want. These will be your database connection credentials. The `paradedb/paradedb:latest` tag uses Postgres 17. Docker images are available for Postgres 15+. To specify a different Postgres version, please refer to the available tags on [Docker Hub](https://hub.docker.com/r/paradedb/paradedb/tags). To connect to ParadeDB, install the `psql` client and run ```bash theme={null} docker exec -it paradedb psql -U myuser -d mydatabase -W ``` # Extension Source: https://docs.paradedb.com/deploy/self-hosted/extension How to install ParadeDB as an extension inside an existing self-managed Postgres We recommend running ParadeDB Enterprise, not Community, in production to maximize uptime. See [overview](/deploy/overview#self-hosted). If you already self-manage Postgres, you may prefer to install ParadeDB directly within your self-managed Postgres instead of deploying the ParadeDB Helm chart.
This can be done by installing the `pg_search` extension, which powers all of ParadeDB's custom functionalities. ## Prerequisites Ensure that you have superuser access to the Postgres database. ## Install the ParadeDB Postgres Extension ### ParadeDB Community ParadeDB provides prebuilt binaries of our extension for Postgres 15+ on: * Debian 12 (Bookworm) and 13 (Trixie) * Ubuntu 22.04 (Jammy) and 24.04 (Noble) * macOS 14 (Sonoma) and 15 (Sequoia) * Red Hat Enterprise Linux 9 and 10 If you are using a different version of Postgres or a different operating system, you will need to build the extension from source. #### pg\_search The prebuilt releases can be found in [GitHub Releases](https://github.com/paradedb/paradedb/releases). You can replace `0.21.8` with the `pg_search` version you wish to install and `17` with the version of Postgres you are using. ```bash Ubuntu 24.04 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-noble_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash Ubuntu 22.04 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-jammy_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash Debian 13 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-trixie_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash Debian 12 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-bookworm_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash RHEL 10 theme={null} # Available arch versions are x86_64, aarch64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search_17-0.21.8-1PARADEDB.el10.x86_64.rpm" -o /tmp/pg_search.rpm sudo dnf install -y /tmp/*.rpm ``` ```bash RHEL 9 theme={null} # Available arch versions are x86_64, aarch64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search_17-0.21.8-1PARADEDB.el9.x86_64.rpm" -o /tmp/pg_search.rpm sudo dnf install -y /tmp/*.rpm ``` ```bash macOS 15 (Sequoia) theme={null} # Available arch version is arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search@17--0.21.8.arm64_sequoia.pkg" -o ~/Downloads/pg_search.pkg sudo installer -pkg ~/Downloads/pg_search.pkg -target / ``` ```bash macOS 14 (Sonoma) theme={null} # Available arch version is arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search@17--0.21.8.arm64_sonoma.pkg" -o ~/Downloads/pg_search.pkg sudo installer -pkg ~/Downloads/pg_search.pkg -target / ``` ### ParadeDB Enterprise If you are a [ParadeDB Enterprise](/deploy/enterprise) user, you should have received a copy of the enterprise binaries. Please [contact sales](mailto:sales@paradedb.com) for access. ## Update `postgresql.conf` Next, add the extension(s) to `shared_preload_libraries` in `postgresql.conf`. Adding `pg_search` to `shared_preload_libraries` is unnecessary if your Postgres version is 17 or higher. ```ini theme={null} shared_preload_libraries = 'pg_search' ``` Restart the Postgres server for these changes to take effect.
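For example, on a systemd-managed installation the restart might look like the following; the service unit name is an assumption that varies by distribution and Postgres version:

```bash theme={null}
# Restart Postgres so the new shared_preload_libraries value is picked up
sudo systemctl restart postgresql
```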
## Load the Extension Finally, connect to your Postgres database via your client of choice (e.g. `psql`) and run the following command: ```sql theme={null} CREATE EXTENSION pg_search; ``` `pg_search` can be combined with `pgvector` for hybrid search. You can find the instructions for installing `pgvector` [on the `pgvector` GitHub repository](https://github.com/pgvector/pgvector?tab=readme-ov-file#installation). # Configuring High Availability Source: https://docs.paradedb.com/deploy/self-hosted/high-availability/configuration Use read replicas to minimize downtime in production High availability (HA) minimizes downtime in the event of failures and is crucial for production deployments. To achieve high availability, you need to have [ParadeDB Enterprise](/deploy/enterprise) deployed inside a [CNPG Kubernetes cluster](/deploy/self-hosted/kubernetes). ## How High Availability Works In a highly available configuration, ParadeDB deploys as a cluster of Postgres instances. One instance is designated as the **primary** while the other instances are designated as **standby** instances. The primary server sends write-ahead logs (WAL) to the standby servers, which replicate the primary by replaying these logs. If the primary server goes down, a standby server is promoted to become the new primary server. This process is called failover. For a thorough architecture overview, please consult the [CloudNativePG Architecture documentation](https://cloudnative-pg.io/docs/1.28/architecture). ## Enable High Availability Prior to starting the CNPG cluster, modify the `values.yaml` file to increase the number of instances. ```yaml ParadeDB Enterprise theme={null} type: paradedb-enterprise mode: standalone cluster: instances: 3 storage: size: 256Mi ``` The number of replicas is equal to `instances - 1`. Having at least `3` instances guarantees that a standby will be available even while a failover process is occurring. ## Physical Replication Settings Between physical replicas, ParadeDB requires the use of a few settings (which are automatically set by [CNPG](/deploy/self-hosted/kubernetes)) in order to avoid query cancellation due to ongoing reorganization of the data on the primary replica. * `hot_standby_feedback=on` - The [`hot_standby_feedback`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-HOT-STANDBY-FEEDBACK) setting controls whether nodes acting as `hot_standby`s (the replicas in physical replication) send feedback to the leader about their current transaction status. ParadeDB uses this transaction status to determine when it is safe for the primary to garbage collect its segments. * `primary_slot_name=$something` - The [`primary_slot_name`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-PRIMARY-SLOT-NAME) setting declares the name of the replication slot that a replica should use when it connects to the primary. In order for `hot_standby_feedback` to be used and persistent, a replication slot must be used. Without these settings, ParadeDB physical replicas will see much more frequent query cancels, and will report a message recommending that they are used. ## Synchronous Replication By default, ParadeDB ships with asynchronous replication, meaning transactions on the primary **do not** wait for confirmation from the standby instances before committing. **Quorum-based synchronous replication** ensures that a transaction is successfully written to standbys before it completes.
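As a rough sketch of what enabling quorum-based synchronous replication can look like on recent CloudNativePG versions (1.24+), the `Cluster` resource exposes a `synchronous` stanza; the field names and supported options below are assumptions to verify against your CNPG and chart versions:

```yaml theme={null}
# Hypothetical fragment of a CNPG Cluster resource; adapt names and values to your deployment
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: paradedb
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # quorum-based: any of the standbys may acknowledge
      number: 1     # at least one standby must confirm each commit
```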
Please consult the [CloudNativePG Replication documentation](https://cloudnative-pg.io/docs/1.28/replication#synchronous-replication) for details. ## Backup and Disaster Recovery ParadeDB supports backups to cloud object stores (e.g. S3, GCS, etc.) and point-in-time-recovery via [Barman](https://pgbarman.org/). To configure the frequency and location of backups, please consult the [CloudNativePG Backup documentation](https://cloudnative-pg.io/docs/1.28/backup). # Kubernetes Source: https://docs.paradedb.com/deploy/self-hosted/kubernetes How to deploy ParadeDB as a Kubernetes cluster into production Kubernetes is the recommended way to run ParadeDB in production. Both ParadeDB Community and Enterprise binaries can be deployed on Kubernetes. We recommend running ParadeDB Enterprise, not Community, with Kubernetes in production to maximize uptime. See [overview](/deploy/overview#self-hosted). This guide uses the [ParadeDB Helm Chart](https://github.com/paradedb/charts). The chart is also available on [Artifact Hub](https://artifacthub.io/packages/helm/paradedb/paradedb). ## Prerequisites This guide assumes you have installed [Helm](https://helm.sh/docs/intro/install/) and have a Kubernetes cluster running v1.25+. For local testing, we recommend [Minikube](https://minikube.sigs.k8s.io/docs/start/). ## Install the Prometheus Stack The ParadeDB Helm chart supports monitoring via Prometheus and Grafana. To enable this, you need to have the Prometheus CRDs installed before installing the CloudNativePG operator. If you do not yet have the Prometheus CRDs installed on your Kubernetes cluster, you can install it with: ```bash theme={null} helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm upgrade --atomic --install prometheus-community \ --create-namespace \ --namespace prometheus-community \ --values https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/main/docs/src/samples/monitoring/kube-stack-config.yaml \ prometheus-community/kube-prometheus-stack ``` ## Install the CloudNativePG Operator Skip this step if the CloudNativePG operator is already installed in your cluster. If you do not wish to monitor your cluster, omit the `--set` commands. ```bash theme={null} helm repo add cnpg https://cloudnative-pg.github.io/charts helm upgrade --atomic --install cnpg \ --create-namespace \ --namespace cnpg-system \ --set monitoring.podMonitorEnabled=true \ --set monitoring.grafanaDashboard.create=true \ cnpg/cloudnative-pg ``` ## Start a ParadeDB CNPG Cluster Create a `values.yaml` and configure it to your requirements. Here is a basic example: ```yaml ParadeDB Community theme={null} type: paradedb mode: standalone cluster: instances: 1 storage: size: 256Mi ``` ```yaml ParadeDB Enterprise theme={null} type: paradedb-enterprise mode: standalone cluster: instances: 1 storage: size: 256Mi ``` If you are using ParadeDB Enterprise, `instances` should be set to a number greater than `1` for [high availability](/deploy/self-hosted/high-availability/configuration). Next, create a namespace for this step or use an existing namespace. The namespace can be any value. ```bash theme={null} kubectl create namespace ``` For ParadeDB Enterprise, you should have received an enterprise Docker username and personal access token. The following step passes these credentials to Kubernetes and should be skipped if you are deploying ParadeDB Community. 
```bash ParadeDB Enterprise theme={null} kubectl create secret docker-registry paradedb-enterprise-registry-cred --namespace --docker-server="https://index.docker.io/v1/" --docker-username="" --docker-password="" ``` Finally, launch the ParadeDB cluster. ```bash theme={null} helm repo add paradedb https://paradedb.github.io/charts helm upgrade --atomic --install paradedb \ --namespace \ --values values.yaml \ --set cluster.monitoring.enabled=true \ paradedb/paradedb ``` ## Connect to the Cluster The command to connect to the primary instance of the cluster will be printed in your terminal. If you do not modify any settings, it will be: ```bash theme={null} kubectl --namespace paradedb exec --stdin --tty services/paradedb-rw -- bash ``` This will launch a Bash shell inside the instance. You can connect to the ParadeDB database via `psql` with: ```bash theme={null} psql -d paradedb ``` ## Connect to the Grafana Dashboard To connect to the Grafana dashboard for your cluster, we suggest port-forwarding the Kubernetes service running Grafana to localhost: ```bash theme={null} kubectl --namespace prometheus-community port-forward svc/prometheus-community-grafana 3000:80 ``` You can then access the Grafana dashboard at `localhost:3000` using `admin` as the username and `prom-operator` as the password. These default credentials are defined in the [`kube-stack-config.yaml`](https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/main/docs/src/samples/monitoring/kube-stack-config.yaml) file used as the `values.yaml` file in [Install the Prometheus Stack](#install-the-prometheus-stack) and can be modified by providing your own `values.yaml` file. A more detailed guide on monitoring the cluster can be found in the [CloudNativePG documentation](https://cloudnative-pg.io/docs/1.28/monitoring). # Schema Changes Source: https://docs.paradedb.com/deploy/self-hosted/logical-replication/configuration Handle DDL/schema changes when running ParadeDB as a logical replica This section assumes that you have successfully completed the [getting started](/deploy/self-hosted/logical-replication/getting-started) guide. ## Schema Changes ParadeDB leverages PostgreSQL’s built-in logical replication to provide flexible and efficient data synchronization, and is subject to the same limitations. A well-known caveat of logical replication is that schema changes (DDL commands) are not replicated. This means that any changes to the schema on the source database, such as adding new columns or tables, will not be automatically applied to the subscriber.
```sql theme={null} -- On Publisher ALTER TABLE mock_items ADD COLUMN num_stock int; INSERT INTO mock_items (description, category, in_stock, latest_available_time, last_updated_date, metadata, created_at, rating, num_stock) VALUES ('Green running shoes', 'Footwear', true, '14:00:00', '2024-07-09', '{}', '2024-07-09 14:00:00', 2, 900); -- On Subscriber ERROR: logical replication target relation "public.mock_items" is missing some replicated columns ``` To work around this, pause the subscription on the subscriber, manually apply the schema changes, then resume the subscription: ```sql theme={null} -- On Subscriber ALTER SUBSCRIPTION mock_items_sub DISABLE; ALTER TABLE mock_items ADD COLUMN num_stock int; ALTER SUBSCRIPTION mock_items_sub ENABLE; ``` If new tables are added and your publication is not `FOR ALL TABLES`, add them to the publication manually: ```sql theme={null} -- On Publisher ALTER PUBLICATION mock_items_pub ADD TABLE newly_added_table; -- On Subscriber ALTER SUBSCRIPTION mock_items_sub REFRESH PUBLICATION; ``` # Configuring Logical Replication Source: https://docs.paradedb.com/deploy/self-hosted/logical-replication/getting-started Set up ParadeDB as a logical replica of an existing Postgres like Amazon RDS In order for ParadeDB to run as a logical replica, ParadeDB must be using Postgres 17+. Logical replication can be configured to replicate specific tables, column lists, or rows. Please see the [Postgres documentation](https://www.postgresql.org/docs/current/logical-replication.html). In production, ParadeDB is designed to be deployed as a logical replica of your primary Postgres. This allows ParadeDB to stay in sync with managed Postgres providers like AWS RDS, Google CloudSQL, and Azure Postgres with zero additional infrastructure overhead. It also isolates search and analytical workloads from your primary Postgres, which can prevent downtime caused by long-running queries. ParadeDB supports logical replication from any primary Postgres. This could be AWS RDS/Aurora, Azure Database, GCP CloudSQL for PostgreSQL, AlloyDB for PostgreSQL, or a self-hosted Postgres. ## AWS RDS/Aurora Please follow AWS' [tutorial for configuring logical replication](https://aws.amazon.com/blogs/database/using-logical-replication-to-replicate-managed-amazon-rds-for-postgresql-and-amazon-aurora-to-self-managed-postgresql/). In this tutorial, AWS RDS/Aurora is the **publisher** and ParadeDB is the **subscriber**. ## Azure Database Please follow Azure's [tutorial for configuring logical replication](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-logical). In this tutorial, Azure Database is the **publisher** and ParadeDB is the **subscriber**. Azure Cosmos DB for PostgreSQL [does not support logical replication](https://learn.microsoft.com/en-us/answers/questions/1193391/does-azure-cosmos-db-for-postgresql-support-logica). ## GCP CloudSQL for PostgreSQL Please follow Google's [tutorial for configuring logical replication](https://cloud.google.com/sql/docs/postgres/replication/configure-logical-replication#set-up-native-postgresql-logical-replication). In this tutorial, GCP CloudSQL for PostgreSQL is the **publisher** and ParadeDB is the **subscriber**. ## AlloyDB for PostgreSQL Please follow Google's [tutorial for configuring logical replication](https://cloud.google.com/alloydb/docs/omni/replicate-data-omni-other-db). In this tutorial, AlloyDB is the **publisher** and ParadeDB is the **subscriber**.
## Self-Hosted Postgres This guide demonstrates how to configure replication between a self-hosted primary Postgres and ParadeDB. ### Environment Setup We'll use the following environment setup for our demonstration: **Primary Server** * **OS**: Ubuntu 24.04 * **IP**: 192.168.0.30 * **Database Name**: `marketplace` * **Replication User**: `replicator` * **Replication Password**: `passw0rd` **Standby Servers** * **OS**: Ubuntu 24.04 * **IP**: `192.168.0.31`, `192.168.0.32` ### Primary Server Setup #### 1. Edit `postgresql.conf` The `postgresql.conf` file is the main configuration file for PostgreSQL. It contains all the server settings that control the behavior and performance of your PostgreSQL instance. Ensure that your `postgresql.conf` has the following settings applied: ```ini theme={null} listen_addresses = 'localhost,192.168.0.30' max_wal_senders = 10 ``` * `listen_addresses` specifies the IP addresses on which PostgreSQL listens for connections. By default, PostgreSQL only listens on `localhost`. To allow other servers (like your standby servers) to connect for replication, you need to include their IP addresses. * `max_wal_senders` determines the maximum number of concurrent connections that can send WAL (Write-Ahead Log) data. If you are running `pg_search` on the primary server, make sure to add it to `shared_preload_libraries` if your Postgres version is less than 17. If you are installing it only on the standby server as a search replica, you should skip this step. ```ini theme={null} # Include this only if pg_search is installed on the primary shared_preload_libraries = 'pg_search' ``` #### 2. Edit `pg_hba.conf` The `pg_hba.conf` file (PostgreSQL Host-Based Authentication file) controls client authentication, defining who can connect to the PostgreSQL server, how they can connect, and from where. It ensures that only authorized users and systems can access the database. We need to allow the replication user to connect from the standby servers' IP addresses. Add the following lines to allow replication from the local network and localhost. This configuration ensures that the replication user can connect from the specified IP range. ```ini theme={null} local replication all peer host replication all 127.0.0.1/32 scram-sha-256 host replication all ::1/128 scram-sha-256 host replication all 192.168.0.0/24 scram-sha-256 ``` #### 3. Create the Replication User In PostgreSQL, a user is required to handle replication connections securely. This user needs specific permissions to read the WAL and perform replication tasks. Creating a dedicated replication user ensures that the replication process is secure and isolated from other database operations. This user will have the necessary permissions to perform replication but will not have broader administrative privileges. ```bash theme={null} sudo -u postgres createuser --pwprompt --replication replicator ``` `--pwprompt` prompts you to enter a password for the new user. In this tutorial, we will be setting the password to `passw0rd`. `--replication` grants the replication privilege to the new user, allowing it to handle replication tasks. #### 4. 
Create Mock Table on Primary Create a database: ```bash theme={null} sudo -u postgres -H createdb marketplace ``` Insert some data: ```sql theme={null} CREATE TABLE mock_items ( id SERIAL PRIMARY KEY, description TEXT, rating INTEGER CHECK (rating BETWEEN 1 AND 5), category VARCHAR(255), in_stock BOOLEAN, metadata JSONB, created_at TIMESTAMP, last_updated_date DATE, latest_available_time TIME ); INSERT INTO mock_items (description, category, in_stock, latest_available_time, last_updated_date, metadata, created_at, rating) VALUES ('Red sports shoes', 'Footwear', true, '12:00:00', '2024-07-10', '{}', '2024-07-10 12:00:00', 1); ``` If you have installed `pg_search` on the primary, you can build a BM25 index over the table. ```sql theme={null} -- Skip this if pg_search is not installed on the primary CREATE INDEX mock_items_bm25_idx ON public.mock_items USING bm25 (id, description) WITH (key_field='id'); ``` ### Standby Server Setup Standby servers do not require any specific configuration, other than that the tables into which you are ingesting logical replication must start off empty. Configuring replication consists of two steps: * Running pg\_basebackup, which will create a snapshot of your primary onto your standby servers. This makes the initial replication much faster. * Subscribing via logical replication, which will create a subscription that continuously propagates changes on the primary server to the standby. #### 1. Run pg\_basebackup You should run `pg_basebackup` on your standby server to create a physical byte-for-byte replica of your primary cluster. The `--pgdata` directory specifies where the standby cluster will be created. The directory must exist, and must be empty. ```bash theme={null} mkdir -p /var/lib/postgresql/17/main pg_basebackup --create-slot --slot standby1 --host 192.168.0.30 --pgdata /var/lib/postgresql/17/main --progress --username replicator --write-recovery-conf pg_basebackup --create-slot --slot standby2 --host 192.168.0.30 --pgdata /var/lib/postgresql/17/main --progress --username replicator --write-recovery-conf ``` #### 2. Start PostgreSQL on Standby Server Start PostgreSQL on standby servers: ```bash theme={null} sudo systemctl start postgresql ``` #### 3. Verify Replication Now you can verify the replication. The data inserted on the primary server should appear on the standby servers, confirming that streaming replication with ParadeDB is working correctly. ```sql theme={null} SELECT * FROM mock_items WHERE description @@@ 'shoes'; ``` ### Logical Replication Setup Logical replication allows you to replicate changes to specific tables between PostgreSQL instances. After creating a physical replica with `pg_basebackup`, you can set up a subscription to propagate changes on the primary server to the standby. #### 1. Create Publication on Primary ```sql theme={null} CREATE PUBLICATION mock_items_pub FOR TABLE mock_items; ``` #### 2. Create Subscription on Standby ```sql theme={null} CREATE SUBSCRIPTION mock_items_sub CONNECTION 'host=192.168.0.30 port=5432 dbname=marketplace user=replicator password=passw0rd' PUBLICATION mock_items_pub; ``` #### 3. Insert Data on Publisher ```sql theme={null} INSERT INTO mock_items (description, category, in_stock, latest_available_time, last_updated_date, metadata, created_at, rating) VALUES ('Blue running shoes', 'Footwear', true, '14:00:00', '2024-07-10', '{}', '2024-07-10 14:00:00', 2); ``` #### 4.
Verify Replication Verify the replication on the Subscriber server: ```sql theme={null} SELECT * FROM mock_items WHERE description @@@ '"running shoes"'; ``` The data inserted on the Publisher server should appear on the Subscriber server. # Multi-Database Replication for Microservices Source: https://docs.paradedb.com/deploy/self-hosted/logical-replication/multi-database Consolidate multiple microservice databases into a single ParadeDB instance for app-wide search and cross-database joins ## Problem Statement Organizations often have multiple Postgres databases, each connected to a different microservice. The goal is to logically replicate all of these databases into a single ParadeDB instance. This enables: * App-wide search across all microservices * Cross-database joins for analytics and reporting * Centralized data access without modifying individual microservices However, table naming collisions can occur since each microservice and its database operate independently. ## Logical Replication Background Postgres' Logical Replication is designed from the perspective of one source database and one destination database. Logical replication carries `table_name` and `schema_name` as part of the WAL (Write-Ahead Log) being emitted. It does not have native primitives to allow any schema or table name mutations in the middle. ## Solution For logical replication to work, all source database tables need to have a unique signature that avoids name collisions. They also need to be identifiable by their source database. This can be achieved by using a different schema in each database instead of the `public` schema. The schema name should match the database name. ### Architecture The solution involves replicating multiple independent microservice databases into a single ParadeDB instance. Each source database uses a schema named after the database itself, ensuring no naming conflicts. Multi-database replication architecture As shown in the diagram: * Each microservice database (db1, db2, db3) uses a schema matching its database name * All databases replicate to a single ParadeDB instance via logical replication * In ParadeDB, tables are accessible with fully-qualified names (e.g., `db1.table1`, `db2.table1`) * This enables cross-database joins like: `SELECT db1.users.user_id FROM db1.users, db2.orders WHERE db1.users.id = db2.orders.user_id` Instead of having all tables in the `public` schema across multiple databases: ``` Database: users_service Schema: public - users - profiles Database: orders_service Schema: public - orders - payments ``` Reorganize each database to use a dedicated schema: ``` Database: users_service Schema: users_service - users - profiles Database: orders_service Schema: orders_service - orders - payments ``` This approach ensures that when replicated to ParadeDB, all tables have unique fully-qualified names and you can identify the source of each table. ## Zero-Downtime Migration This migration strategy reorganizes tables from the `public` schema into dedicated schemas while maintaining complete backwards compatibility through updatable views. ### Migration Steps For each microservice database, execute the following: ```sql theme={null} BEGIN; -- Create new schema named after the database CREATE SCHEMA IF NOT EXISTS ; -- Move tables to new schema ALTER TABLE public.table1 SET SCHEMA ; ALTER TABLE public.table2 SET SCHEMA ; -- Repeat for all tables... 
-- Create backwards-compatible views in public schema CREATE OR REPLACE VIEW public.table1 AS SELECT * FROM .table1; CREATE OR REPLACE VIEW public.table2 AS SELECT * FROM .table2; -- Repeat for all tables... COMMIT; ``` ### Example For a `users_service` database: ```sql theme={null} BEGIN; -- Create new schema CREATE SCHEMA IF NOT EXISTS users_service; -- Move tables ALTER TABLE public.users SET SCHEMA users_service; ALTER TABLE public.profiles SET SCHEMA users_service; -- Create backwards-compatible views CREATE OR REPLACE VIEW public.users AS SELECT * FROM users_service.users; CREATE OR REPLACE VIEW public.profiles AS SELECT * FROM users_service.profiles; COMMIT; ``` ### Benefits of This Approach * **Zero Downtime**: Existing applications continue to function without modification during the transition period for all queries (SELECT, INSERT, UPDATE, DELETE) * **Gradual Migration**: Application queries can be updated over time to reference the new schema directly * **Rollback Capability**: Each migration step is reversible if needed * **View Cleanup**: Once applications are updated, views in the `public` schema can be safely removed ### Setting Up Logical Replication After completing the schema migration for all source databases: 1. Configure each source database as a publisher following the [getting started guide](/deploy/self-hosted/logical-replication/getting-started) 2. Set up ParadeDB as a subscriber for all source databases 3. Create publications on each source database for their respective schemas: ```sql theme={null} -- On users_service database CREATE PUBLICATION users_pub FOR TABLES IN SCHEMA users_service; -- On orders_service database CREATE PUBLICATION orders_pub FOR TABLES IN SCHEMA orders_service; ``` 4. Create subscriptions on ParadeDB for each source database: ```sql theme={null} -- On ParadeDB instance CREATE SUBSCRIPTION users_sub CONNECTION 'host=users_db port=5432 dbname=users_service user=replicator password=...' PUBLICATION users_pub; CREATE SUBSCRIPTION orders_sub CONNECTION 'host=orders_db port=5432 dbname=orders_service user=replicator password=...' PUBLICATION orders_pub; ``` ## Trade-offs ### Pros * **Multi Database BM25 Search**: Perform full-text search across tables distributed across multiple microservice databases in a single query * **Avoid Distributed Joins in Application**: Execute cross-database joins directly in ParadeDB instead of implementing complex join logic in your application * **Simple Architecture**: Uses standard PostgreSQL logical replication without extra infrastructure * **Namespace Isolation**: Schema-based separation prevents naming conflicts * **No Source Database Changes**: Microservices continue operating independently; ParadeDB acts as a read replica ### Cons * Source databases will access tables from their dedicated schema (e.g., `users_service`) instead of `public` * Requires coordination across microservice teams for initial migration * Existing database tooling may need configuration updates to work with non-public schemas # Installing Third Party Extensions Source: https://docs.paradedb.com/deploy/third-party-extensions How to install additional extensions into ParadeDB [Foreign data wrapper](https://www.postgresql.org/docs/current/ddl-foreign-data.html) extensions can be used to query AWS S3 and other external data stores directly from ParadeDB. Postgres has a rich ecosystem of extensions. ParadeDB is designed to work alongside other PostgreSQL extensions for a complete data platform. 
## Pre-installed Extensions To keep the ParadeDB Docker image size manageable, the following extensions are pre-installed: * **`pg_search`** — Full-text and hybrid search with BM25 * **`pgvector`** — Vector similarity search * **`postgis`** — Geospatial queries and indexing * **`pg_ivm`** — Incremental materialized views * **`pg_cron`** — Scheduled jobs and background tasks `pg_cron` is configured on the default `postgres` database and cannot be changed. ## Compatible Extensions ParadeDB has been tested with and supports the following popular extensions: * **[Citus](/deploy/citus)** — Distributed PostgreSQL for horizontal scaling * **`pg_partman`** — Automated partition management * **`pg_stat_statements`** — Query performance monitoring * **`postgres_fdw`** — Foreign data wrappers for federated queries If you encounter any issues with extension compatibility, please [open an issue](https://github.com/paradedb/paradedb/issues) or reach out to our [community](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-32abtyjg4-yoYoi~RPh9MSW8tDbl0BQw). ## Installing Third Party Extensions The process for installing an extension varies by extension. Generally speaking, it requires: * Download the prebuilt binaries inside ParadeDB * Install the extension binary and any dependencies inside ParadeDB * Add the extension to `shared_preload_libraries` in `postgresql.conf`, if required by the extension * Run `CREATE EXTENSION ` We recommend installing third party extensions from prebuilt binaries to keep the image size small. As an example, let's install [pg\_partman](https://github.com/pgpartman/pg_partman), an extension for managing table partition sets. ### Install Prebuilt Binaries First, enter a shell with root permissions in the ParadeDB image. ```bash theme={null} docker exec -it --user root paradedb bash ``` This command assumes that your ParadeDB container name is `paradedb`. Next, install the [prebuilt binaries](https://pkgs.org/search/?q=partman). Most popular Postgres extensions can be installed with `apt-get install`. ```bash theme={null} apt-get update apt-get install -y --no-install-recommends postgresql-17-partman ``` If the extension is not available with `apt-get install`, you can usually `curl` the prebuilt binary from a GitHub Release page. You will need to first install `curl` via `apt-get install` if you are taking this approach. ### Add to `shared_preload_libraries` If you are installing an extension which requires this step, you can do so via the following command, replacing `` with your extension's name: ```bash theme={null} sed -i "/^shared_preload_libraries/s/'\([^']*\)'/'\1,'/" /var/lib/postgresql/data/postgresql.conf ``` For `pg_partman`, the command is: ```bash theme={null} sed -i "/^shared_preload_libraries/s/'\([^']*\)'/'\1,pg_partman_bgw'/" /var/lib/postgresql/data/postgresql.conf ``` Postgres must be restarted afterwards. We recommend simply restarting the Docker container. ### Create the Extension Connect to ParadeDB via `psql` and create the extension. ```sql theme={null} CREATE EXTENSION pg_partman; ``` `pg_partman` is now ready to use! Note that this is a simple example of installing `pg_partman`. The full list of settings and optional dependencies can be found in the [official installation instructions](https://github.com/pgpartman/pg_partman?tab=readme-ov-file#installation). 
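To confirm which extensions are installed and at what version, you can query the Postgres catalog (a standard catalog query, not specific to ParadeDB):

```sql theme={null}
-- List installed extensions and their versions
SELECT extname, extversion FROM pg_extension ORDER BY extname;
```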
# Upgrading ParadeDB Source: https://docs.paradedb.com/deploy/upgrading How to update ParadeDB to the latest version ## Overview ParadeDB ships its functionality inside a Postgres extension, `pg_search`. Upgrading ParadeDB is as simple as updating the `pg_search` extension. ParadeDB uses `pgvector` for vector search. This extension is not managed by ParadeDB. Please refer to the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#upgrading) for instructions on how to upgrade it. ## Getting the Current Version To inspect the current version of an extension, run the following command. ```sql theme={null} SELECT extversion FROM pg_extension WHERE extname = 'pg_search'; ``` Verify that it matches `paradedb.version_info()`: ```sql theme={null} SELECT * FROM paradedb.version_info(); ``` The reason there are two statements is that `paradedb.version_info()` returns the actual version of `pg_search` that is installed, whereas `pg_extension` is what Postgres' catalog thinks the version of the extension is. If `paradedb.version_info()` is greater than `pg_extension`, it typically means that `ALTER EXTENSION` was not run after the previous upgrade, and that the SQL upgrade scripts were not applied. If `pg_extension` is greater than `paradedb.version_info()`, it means that the extension didn't fully upgrade, and that Postgres needs to be restarted. ## Getting the Latest Version The latest version of `pg_search` is `0.21.8`. Please refer to the [releases](https://github.com/paradedb/paradedb/releases) page for all available versions of `pg_search`. ## Updating ParadeDB ### Helm Chart To upgrade the ParadeDB Helm chart: 1. Update the `paradedb` chart to the latest version. ```bash theme={null} helm repo update ``` 2. Get the latest version of the `paradedb` chart. ```bash theme={null} helm search repo paradedb ``` 3. Get the latest version of the ParadeDB extension, which is the value of `version.paradedb` in the chart [README](https://github.com/paradedb/charts/tree/dev/charts/paradedb#values). 4. Run `helm upgrade` with the latest version of the chart and the latest version of the extension. ```bash theme={null} helm upgrade paradedb paradedb/paradedb --namespace paradedb --reuse-values --version --set version.paradedb= --atomic ``` Replace `` with the latest version of the chart and `` with the latest version of the extension. 5. If you are using [ParadeDB BYOC](/deploy/byoc), an automatic rollout will begin. One by one, the pods will be restarted to apply the new version of the extension. ### Docker Image To upgrade the ParadeDB Docker image while preserving your data volume: 1. Stop the running ParadeDB container via `docker stop paradedb`. 2. Run the following command to pull a specific version of the Docker image. You can set the version number to `latest` to pull the latest Docker image. You can find the full list of available tags on [Docker Hub](https://hub.docker.com/r/paradedb/paradedb/tags). ```bash theme={null} docker pull paradedb/paradedb:0.21.8 ``` The latest version of the Docker image should be `0.21.8`. 3. Remove the stopped container via `docker rm paradedb`, then start a new container from the updated image by re-running the `docker run` command from the [Docker guide](/deploy/self-hosted/docker) with the new tag. Because your data lives in the `paradedb_data` volume, it is preserved across containers. ### Self-Managed Postgres To upgrade the extensions running in a self-managed Postgres: 1. Stop Postgres (e.g. `pg_ctl stop -D `). 2. Download and install the extension you wish to upgrade in the same way that it was initially installed. 3. Start Postgres (e.g. `pg_ctl start -D `).
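As an illustration only, upgrading a self-managed installation on Ubuntu 24.04 (Noble) with Postgres 17 might look like the following; the release URL, package, and data directory are assumptions to adapt to your environment:

```bash theme={null}
# Stop Postgres (data directory assumed to be /var/lib/postgresql/17/main)
pg_ctl stop -D /var/lib/postgresql/17/main

# Download and install the new pg_search release, using the same method as the original install
curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-noble_amd64.deb" -o /tmp/pg_search.deb
sudo apt-get install -y /tmp/pg_search.deb

# Start Postgres again
pg_ctl start -D /var/lib/postgresql/17/main
```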
## Alter Extension After ParadeDB has been upgraded, connect to it and run the following command in all databases that `pg_search` is installed in. This step is required regardless of the environment that ParadeDB is installed in (Helm, Docker, or self-managed Postgres). ```sql theme={null} ALTER EXTENSION pg_search UPDATE TO '0.21.8'; ``` ## Verify the Upgrade After upgrading the extension and restarting Postgres, verify that the version numbers returned by the following commands match: ```sql theme={null} SELECT extversion FROM pg_extension WHERE extname = 'pg_search'; SELECT * FROM paradedb.version_info(); ``` If the two versions do not match, restart Postgres and try again.