# Date Histogram Source: https://docs.paradedb.com/documentation/aggregates/bucket/datehistogram Count the number of occurrences over fixed time intervals The date histogram aggregation constructs a histogram for date fields. ```sql theme={null} SELECT pdb.agg('{"date_histogram": {"field": "created_at", "fixed_interval": "30d"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"buckets": [{"key": 1679616000000.0, "doc_count": 14, "key_as_string": "2023-03-24T00:00:00Z"}, {"key": 1682208000000.0, "doc_count": 27, "key_as_string": "2023-04-23T00:00:00Z"}]} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/bucket/struct.DateHistogramAggregationReq.html) for all available options. # Filters Source: https://docs.paradedb.com/documentation/aggregates/bucket/filters Compute aggregations over multiple filters in one query The filters aggregation allows a single query to return aggregations for multiple search queries at a time. To use this aggregation, pass `pdb.agg` to the left-hand side of `FILTER` and a search query to the right-hand side. For example: ```sql theme={null} SELECT pdb.agg('{"value_count": {"field": "id"}}') FILTER (WHERE category === 'electronics') AS electronics_count, pdb.agg('{"value_count": {"field": "id"}}') FILTER (WHERE category === 'footwear') AS footwear_count FROM mock_items; ``` ```ini Expected Response theme={null} electronics_count | footwear_count -------------------+---------------- {"value": 5.0} | {"value": 6.0} (1 row) ``` # Histogram Source: https://docs.paradedb.com/documentation/aggregates/bucket/histogram Count the number of occurrences over some interval The histogram aggregation dynamically creates buckets for a given `interval` and counts the number of occurrences in each bucket. Each value is rounded down to its bucket. For instance, a rating of `18` with an interval of `5` rounds down to a bucket with key `15`. ```sql theme={null} SELECT pdb.agg('{"histogram": {"field": "rating", "interval": "1"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"buckets": [{"key": 1.0, "doc_count": 1}, {"key": 2.0, "doc_count": 3}, {"key": 3.0, "doc_count": 9}, {"key": 4.0, "doc_count": 16}, {"key": 5.0, "doc_count": 12}]} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/bucket/struct.HistogramAggregation.html) for all available options. # Range Source: https://docs.paradedb.com/documentation/aggregates/bucket/range Count the number of occurrences over user-defined buckets The range aggregation counts the number of occurrences over user-defined buckets. The buckets must be continuous and cannot overlap. 
```sql theme={null} SELECT pdb.agg('{"range": {"field": "rating", "ranges": [{"to": 3.0 }, {"from": 3.0, "to": 6.0} ]}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg ---------------------------------------------------------------------------------------------------------------------------------------------------------------- {"buckets": [{"to": 3.0, "key": "*-3", "doc_count": 4}, {"to": 6.0, "key": "3-6", "from": 3.0, "doc_count": 37}, {"key": "6-*", "from": 6.0, "doc_count": 0}]} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/bucket/struct.RangeAggregation.html) for all available options. # Terms Source: https://docs.paradedb.com/documentation/aggregates/bucket/terms Count the number of occurrences for each value in a result set If a text or JSON field is in the `GROUP BY` or `ORDER BY` clause, it must use the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. A terms aggregation counts the number of occurrences for every unique value in a field. For example, the following query groups the `mock_items` table by `rating`, and calculates the number of items for each unique `rating`. ```sql theme={null} SELECT rating, pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating LIMIT 10; ``` ```ini Expected Response theme={null} rating | agg --------+----------------- 4 | {"value": 16.0} 5 | {"value": 12.0} 3 | {"value": 9.0} 2 | {"value": 3.0} 1 | {"value": 1.0} (5 rows) ``` Ordering by the bucketing field is supported: ```sql theme={null} SELECT rating, pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating ORDER BY rating LIMIT 10; ``` Ordering by the aggregate value is not yet supported. For performance reasons, we strongly recommend adding a `LIMIT` to the `GROUP BY`. Terms aggregations without a `LIMIT` consume more memory and are slower to execute. If a query does not have a limit and more than `65000` unique values are found in a field, an error will be returned. # Facets Source: https://docs.paradedb.com/documentation/aggregates/facets Compute a Top N and aggregate in one query A common pattern in search is to query for both an aggregate and a set of search results. For example, "find the top 10 results, and also count the total number of results." Instead of issuing two separate queries -- one for the search results, and another for the aggregate -- `pdb.agg` allows for these results to be returned in a single "faceted" query. This can significantly improve read throughput, since issuing a single query uses less CPU and disk I/O. For example, this query returns the top 3 search results alongside the total number of results found. ```sql theme={null} SELECT id, description, rating, pdb.agg('{"value_count": {"field": "id"}}') OVER () FROM mock_items WHERE category === 'electronics' ORDER BY rating DESC LIMIT 3; ``` ```ini Expected Response theme={null} id | description | rating | agg ----+-----------------------------+--------+---------------- 12 | Innovative wireless earbuds | 5 | {"value": 5.0} 1 | Ergonomic metal keyboard | 4 | {"value": 5.0} 2 | Plastic Keyboard | 4 | {"value": 5.0} (3 rows) ``` Faceted queries require that `pdb.agg` be used as a window function: `pdb.agg() OVER ()`. ### Faceted Performance Optimization On every query, ParadeDB runs checks to ensure that deleted or updated-away rows are not factored into the result set. 
If your table is not frequently updated or you can tolerate an approximate result, the performance of faceted queries can be improved by disabling these visibility checks. To do so, set the second argument of `pdb.agg` to `false`.

```sql theme={null}
SELECT description, rating, category, pdb.agg('{"value_count": {"field": "id"}}', false) OVER ()
FROM mock_items
WHERE description ||| 'running shoes'
ORDER BY rating
LIMIT 5;
```

Disabling this check can improve query times by 2-4x in some cases (at the expense of correctness).

# Limitations
Source: https://docs.paradedb.com/documentation/aggregates/limitations

Caveats for aggregate support

## ParadeDB Operator

In order for ParadeDB to push down an aggregate, a ParadeDB text search operator must be present in the query.

```sql theme={null}
-- Not pushed down
SELECT COUNT(*) FROM mock_items WHERE rating = 5;

-- Pushed down
SELECT COUNT(*) FROM mock_items WHERE rating = 5 AND id @@@ pdb.all();
```

If your query does not contain a ParadeDB operator, a way to "force" aggregate pushdown is to append the [all query](/documentation/query-builder/compound/all) to the query's `WHERE` clause.

## Join Support

ParadeDB is currently only able to push down aggregates over a single table. JOINs are not yet pushed down but are on the [roadmap](/welcome/roadmap).

## NUMERIC Columns

`NUMERIC` columns do not support aggregate pushdown. Queries with aggregates on `NUMERIC` columns will automatically fall back to PostgreSQL for aggregation. For numeric data that requires aggregate pushdown, use `FLOAT` or `DOUBLE PRECISION` instead:

```sql theme={null}
-- Aggregates can be pushed down
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  price DOUBLE PRECISION
);

-- Aggregates fall back to PostgreSQL
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  price NUMERIC(10,2)
);
```

Filter pushdown (equality and range queries) is fully supported for all `NUMERIC` columns. Only aggregate pushdown is not supported.

# Average
Source: https://docs.paradedb.com/documentation/aggregates/metrics/average

Compute the average value of a field

The following query computes the average value over a specific field:

```sql theme={null}
SELECT pdb.agg('{"avg": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response theme={null}
              agg
-------------------------------
 {"value": 3.8536585365853657}
(1 row)
```

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.AverageAggregation.html) for all available options.

## SQL Average Syntax

SQL's `AVG` syntax is supported in beta. To enable it, first run

```sql theme={null}
SET paradedb.enable_aggregate_custom_scan TO on;
```

With this feature enabled, the following query is equivalent to the above and is executed in the same way.

```sql theme={null}
SELECT AVG(rating) FROM mock_items WHERE id @@@ pdb.all();
```

By default, `AVG` ignores null values. Use `COALESCE` to include them in the final average:

```sql theme={null}
SELECT AVG(COALESCE(rating, 0)) FROM mock_items WHERE id @@@ pdb.all();
```

# Cardinality
Source: https://docs.paradedb.com/documentation/aggregates/metrics/cardinality

Compute the number of distinct values in a field

The cardinality aggregation estimates the number of distinct values in a field.
```sql theme={null}
SELECT pdb.agg('{"cardinality": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response theme={null}
      agg
----------------
 {"value": 5.0}
(1 row)
```

Unlike SQL's `DISTINCT`, which computes an exact count but is computationally expensive, the cardinality aggregation uses the HyperLogLog++ algorithm to closely approximate the number of distinct values.

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.CardinalityAggregationReq.html) for all available options.

# Count
Source: https://docs.paradedb.com/documentation/aggregates/metrics/count

Count the number of values in a field

The following query counts the number of values in a field:

```sql theme={null}
SELECT pdb.agg('{"value_count": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response theme={null}
       agg
-----------------
 {"value": 41.0}
(1 row)
```

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.CountAggregation.html) for all available options.

## SQL Count Syntax

SQL's `COUNT` syntax is supported in beta. To enable it, first run

```sql theme={null}
SET paradedb.enable_aggregate_custom_scan TO on;
```

With this feature enabled, the following query is equivalent to the above and is executed in the same way.

```sql theme={null}
SELECT COUNT(rating) FROM mock_items WHERE id @@@ pdb.all();
```

To count all rows, including rows with null values, use `COUNT(*)`:

```sql theme={null}
SELECT COUNT(*) FROM mock_items WHERE id @@@ pdb.all();
```

# Min/Max
Source: https://docs.paradedb.com/documentation/aggregates/metrics/minmax

Compute the min/max value of a field

`min` and `max` return the smallest and largest values of a column, respectively.

```sql Min theme={null}
SELECT pdb.agg('{"min": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```sql Max theme={null}
SELECT pdb.agg('{"max": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all();
```

```ini Expected Response (Min) theme={null}
      agg
----------------
 {"value": 1.0}
(1 row)
```

```ini Expected Response (Max) theme={null}
      agg
----------------
 {"value": 5.0}
(1 row)
```

See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.MinAggregation.html) for all available options.

## SQL Min/Max Syntax

SQL's `MIN`/`MAX` syntax is supported in beta. To enable it, first run

```sql theme={null}
SET paradedb.enable_aggregate_custom_scan TO on;
```

With this feature enabled, the following query is equivalent to the above and is executed in the same way.

```sql Min theme={null}
SELECT MIN(rating) FROM mock_items WHERE id @@@ pdb.all();
```

```sql Max theme={null}
SELECT MAX(rating) FROM mock_items WHERE id @@@ pdb.all();
```

By default, `MIN`/`MAX` ignore null values. Use `COALESCE` to include them in the final result:

```sql theme={null}
SELECT MIN(COALESCE(rating, 0)) FROM mock_items WHERE id @@@ pdb.all();
```

# Percentiles
Source: https://docs.paradedb.com/documentation/aggregates/metrics/percentiles

Analyze the distribution of a field

The percentiles aggregation computes the values below which a given percentage of the data falls. In this example, the aggregation will return the 50th and 95th percentiles for `rating`.
```sql theme={null} SELECT pdb.agg('{"percentiles": {"field": "rating", "percents": [50, 95]}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg --------------------------------------------------------------------- {"values": {"50.0": 4.014835333028612, "95.0": 5.0028295751107414}} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.PercentilesAggregationReq.html) for all available options. # Stats Source: https://docs.paradedb.com/documentation/aggregates/metrics/stats Compute several metrics at once The stats aggregation returns the count, sum, min, max, and average all at once. ```sql theme={null} SELECT pdb.agg('{"stats": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg -------------------------------------------------------------------------------- {"avg": 3.8536585365853657, "max": 5.0, "min": 1.0, "sum": 158.0, "count": 41} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.StatsAggregation.html) for all available options. # Sum Source: https://docs.paradedb.com/documentation/aggregates/metrics/sum Compute the sum of a field The sum aggregation computes the sum of a field. ```sql theme={null} SELECT pdb.agg('{"sum": {"field": "rating"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` ```ini Expected Response theme={null} agg ------------------ {"value": 158.0} (1 row) ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.SumAggregation.html) for all available options. ## SQL Sum Syntax SQL's `SUM` syntax is supported in beta. To enable it, first run ```sql theme={null} SET paradedb.enable_aggregate_custom_scan TO on; ``` With this feature enabled, the following query is equivalent to the above and is executed in the same way. ```sql theme={null} SELECT SUM(rating) FROM mock_items WHERE id @@@ pdb.all(); ``` By default, `SUM` ignores null values. Use `COALESCE` to include them in the final sum: ```sql theme={null} SELECT SUM(COALESCE(rating, 0)) FROM mock_items WHERE id @@@ pdb.all(); ``` # Top Hits Source: https://docs.paradedb.com/documentation/aggregates/metrics/tophits Compute the top hits for each bucket in a terms aggregation The top hits aggregation is meant to be used in conjunction with the [terms](/documentation/aggregates/bucket/terms) aggregation. It returns the top documents for each bucket of a terms aggregation. For example, the following query answers "what are top 3 results sorted by `created_at` for each `rating` category?" 
```sql theme={null} SELECT pdb.agg('{"top_hits": {"size": 3, "sort": [{"created_at": "desc"}], "docvalue_fields": ["id", "created_at"]}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating; ``` ```ini Expected Response theme={null} agg --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"hits": [{"sort": [10907000251854775808], "docvalue_fields": {"id": [25], "created_at": ["2023-05-09T10:30:15Z"]}}, {"sort": [10906844884854775808], "docvalue_fields": {"id": [26], "created_at": ["2023-05-07T15:20:48Z"]}}, {"sort": [10906666358854775808], "docvalue_fields": {"id": [13], "created_at": ["2023-05-05T13:45:22Z"]}}]} {"hits": [{"sort": [10906756363854775808], "docvalue_fields": {"id": [24], "created_at": ["2023-05-06T14:45:27Z"]}}, {"sort": [10906385295854775808], "docvalue_fields": {"id": [28], "created_at": ["2023-05-02T07:40:59Z"]}}, {"sort": [10906236353854775808], "docvalue_fields": {"id": [29], "created_at": ["2023-04-30T14:18:37Z"]}}]} {"hits": [{"sort": [10906480573854775808], "docvalue_fields": {"id": [17], "created_at": ["2023-05-03T10:08:57Z"]}}, {"sort": [10906315942854775808], "docvalue_fields": {"id": [20], "created_at": ["2023-05-01T12:25:06Z"]}}, {"sort": [10906218361854775808], "docvalue_fields": {"id": [8], "created_at": ["2023-04-30T09:18:45Z"]}}]} {"hits": [{"sort": [10906573359854775808], "docvalue_fields": {"id": [27], "created_at": ["2023-05-04T11:55:23Z"]}}, {"sort": [10905961160854775808], "docvalue_fields": {"id": [15], "created_at": ["2023-04-27T09:52:04Z"]}}, {"sort": [10905202003854775808], "docvalue_fields": {"id": [7], "created_at": ["2023-04-18T14:59:27Z"]}}]} {"hits": [{"sort": [10906586188854775808], "docvalue_fields": {"id": [10], "created_at": ["2023-05-04T15:29:12Z"]}}]} (5 rows) ``` The `sort` value returned by the aggregation is Tantivy's internal sort ID and should be ignored. To get the actual fields, pass a list of fields to `docvalue_fields`. If a text or JSON field is passed to `docvalue_fields`, it must be indexed with the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer. To specify an offset, use `from`: ```sql theme={null} SELECT pdb.agg('{"top_hits": {"size": 3, "from": 1, "sort": [{"created_at": "desc"}], "docvalue_fields": ["id", "created_at"]}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating; ``` If multiple fields are passed into `sort`, the additional fields are used as tiebreakers: ```sql theme={null} SELECT pdb.agg('{"top_hits": {"size": 3, "sort": [{"created_at": "desc"}, {"id": "asc"}], "docvalue_fields": ["id", "created_at"]}}') FROM mock_items WHERE id @@@ pdb.all() GROUP BY rating; ``` See the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/aggregation/metric/struct.TopHitsAggregationReq.html) for all available options. # Aggregate Syntax Source: https://docs.paradedb.com/documentation/aggregates/overview Accelerate aggregates with the ParadeDB index The `pdb.agg` function accepts an Elasticsearch-compatible JSON aggregate query string. 
It executes the aggregate using the [columnar](/welcome/architecture#columnar-index) portion of the ParadeDB index, which can significantly accelerate performance compared to vanilla Postgres. For example, the following query counts the total number of results for a search query. ```sql theme={null} SELECT pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE category === 'electronics'; ``` ```ini Expected Response theme={null} agg ---------------- {"value": 5.0} (1 row) ``` This query counts the number of results for every distinct group: ```sql theme={null} SELECT rating, pdb.agg('{"value_count": {"field": "id"}}') FROM mock_items WHERE category === 'electronics' GROUP BY rating ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} rating | agg --------+---------------- 3 | {"value": 1.0} 4 | {"value": 3.0} 5 | {"value": 1.0} (3 rows) ``` ## Multiple Aggregations To compute multiple aggregations at once, simply include multiple `pdb.agg` functions in the target list: ```sql theme={null} SELECT pdb.agg('{"avg": {"field": "rating"}}') AS avg_rating, pdb.agg('{"value_count": {"field": "id"}}') AS count FROM mock_items WHERE category === 'electronics'; ``` ```ini Expected Response theme={null} avg_rating | count ----------------+---------------- {"value": 4.0} | {"value": 5.0} (1 row) ``` ## JSON Fields If `metadata` is a JSON field with key `color`, use `metadata.color` as the field name: ```sql theme={null} SELECT pdb.agg('{"terms": {"field": "metadata.color"}}') FROM mock_items WHERE id @@@ pdb.all(); ``` If a text or JSON field is used inside `pdb.agg`, it must use the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer. # Performance Tuning Source: https://docs.paradedb.com/documentation/aggregates/tuning Several settings can be tuned to improve the performance of aggregates in ParadeDB ### Configure Parallel Workers ParadeDB uses Postgres parallel workers. By default, Postgres allows two workers per parallel query. Increasing the number of [parallel workers](/documentation/performance-tuning/reads) allows parallel queries to use all of the available hardware on the host machine and can deliver significant speedups. ### Run `VACUUM` `VACUUM` updates the table's [visibility map](https://www.postgresql.org/docs/current/storage-vm.html), which speeds up Postgres' visibility checks. ```sql theme={null} VACUUM mock_items; ``` If the table experiences frequent updates, we recommend configuring [autovacuum](https://www.postgresql.org/docs/current/routine-vacuuming.html). ### Run `pg_prewarm` The `pg_prewarm` extension can be used to preload data from the index into the Postgres buffer cache, which improves the response times of "cold" queries (i.e. the first search query after Postgres has restarted). ```sql theme={null} CREATE EXTENSION pg_prewarm; SELECT pg_prewarm('search_idx'); ``` # Filtering Source: https://docs.paradedb.com/documentation/filtering Filter search results based on metadata from other fields Adding filters to text search is as simple as using PostgreSQL's built-in `WHERE` clauses and operators. For instance, the following query filters out results that do not meet `rating > 2`. 
```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' AND rating > 2; ``` ## Filter Pushdown ### Non-Text Fields While not required, filtering performance over non-text columns can be improved by including them in the BM25 index. When these columns are part of the index, `WHERE` clauses that reference them can be pushed down into the index scan itself. This can result in faster query execution over large datasets. For example, if `rating` and `created_at` are frequently used in filters, they can be added to the BM25 index during index creation: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25(id, description, rating, created_at) WITH (key_field = 'id'); ``` Filter pushdown is currently supported for the following combinations of types and operators: | Operator | Left Operand Type | Right Operand Type | Example | | ------------------------------------------ | ----------------- | ------------------ | -------------------------- | | `=`, `<`, `>`, `<=`, `>=`, `<>`, `BETWEEN` | `int2` | `int2` | `WHERE rating = 2` | | | `int4` | `int4` | | | | `int8` | `int8` | | | | `int2` | `int4` | | | | `int2` | `int8` | | | | `int4` | `int8` | | | | `float4` | `float4` | | | | `float8` | `float8` | | | | `float4` | `float8` | | | | `numeric` | `numeric` | `WHERE price = 99.99` | | | `date` | `date` | | | | `time` | `time` | | | | `timetz` | `timetz` | | | | `timestamp` | `timestamp` | | | | `timestamptz` | `timestamptz` | | | | `uuid` | `uuid` | | | `=` | `bool` | `bool` | `WHERE in_stock = true` | | `IN`, `ANY`, `ALL` | `bool` | `bool[]` | `WHERE rating IN (1,2,3)` | | | `int2` | `int2[]` | | | | `int4` | `int4[]` | | | | `int8` | `int8[]` | | | | `int2` | `int4[]` | | | | `int2` | `int8[]` | | | | `int4` | `int8[]` | | | | `float4` | `float4[]` | | | | `float8` | `float8[]` | | | | `float4` | `float8[]` | | | | `date` | `date[]` | | | | `timetz` | `timetz[]` | | | | `timestamp` | `timestamp[]` | | | | `timestamptz` | `timestamptz[]` | | | | `uuid` | `uuid[]` | | | `IS`, `IS NOT` | `bool` | `bool` | `WHERE in_stock IS true` | | `IS NULL`, `IS NOT NULL` | `bool` | | `WHERE rating IS NOT NULL` | | | `int2` | | | | | `int4` | | | | | `int8` | | | | | `int2` | | | | | `int2` | | | | | `int4` | | | | | `float4` | | | | | `float8` | | | | | `float4` | | | | | `date` | | | | | `time` | | | | | `timetz` | | | | | `timestamp` | | | | | `timestamptz` | | | | | `uuid` | | | ### Text Fields Suppose we have a text filter that looks for an exact string match like `category = 'Footwear'`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' AND category = 'Footwear'; ``` To push down the `category = 'Footwear'` filter, `category` must be indexed using the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25(id, description, (category::pdb.literal)) WITH (key_field = 'id'); ``` Pushdown of set filters over text fields also requires the literal tokenizer: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' AND category IN ('Footwear', 'Apparel'); ``` # Fuzzy Source: https://docs.paradedb.com/documentation/full-text/fuzzy Allow for typos in the query string Fuzziness allows for tokens to be considered a match even if they are not identical, allowing for typos in the query string. 
While fuzzy matching works for non-Latin characters (Chinese, Japanese, Korean, etc.), it may not give the expected results (large result sets may be returned), because Levenshtein distance relies on individual character differences. If you need this functionality, please thumbs-up this [issue](https://github.com/paradedb/paradedb/issues/3782) and leave a comment with your use case.

## Overview

To add fuzziness to a query, cast it to the `fuzzy(n)` type, where `n` is the [edit distance](#how-it-works). Fuzziness is supported for [match](/documentation/full-text/match) and [term](/documentation/full-text/term) queries.

```sql theme={null}
-- Fuzzy match disjunction
SELECT id, description
FROM mock_items
WHERE description ||| 'runing shose'::pdb.fuzzy(2)
LIMIT 5;

-- Fuzzy match conjunction
SELECT id, description
FROM mock_items
WHERE description &&& 'runing shose'::pdb.fuzzy(2)
LIMIT 5;

-- Fuzzy Term
SELECT id, description
FROM mock_items
WHERE description === 'shose'::pdb.fuzzy(2)
LIMIT 5;
```

## How It Works

By default, the [match](/documentation/full-text/match) and [term](/documentation/full-text/term) queries require exact token matches between the query and indexed text. When a query is cast to `fuzzy(n)`, this requirement is relaxed -- tokens are matched if their Levenshtein distance, or edit distance, is less than or equal to `n`.

Edit distance is a measure of how many single-character operations are needed to turn one string into another. The allowed operations are:

* **Insertion** adds a character, e.g. "shoe" → "shoes" (insert "s") has an edit distance of `1`
* **Deletion** removes a character, e.g. "runnning" → "running" (delete one "n") has an edit distance of `1`
* **Substitution** replaces one character with another, e.g. "shoes" → "shows" (replace "e" with "w") has an edit distance of `1`
* **Transposition** swaps two adjacent characters, e.g. "shose" → "shoes" (swap "s" and "e"); by default this counts as two single-character edits, so it has an edit distance of `2`

For performance reasons, the maximum allowed edit distance is `2`. Casting a query to `fuzzy(0)` is the same as an exact token match.

## Fuzzy Prefix

`fuzzy` also supports prefix matching. For instance, "runn" is a prefix of "running" because it matches the beginning of the token exactly. "rann" is a **fuzzy prefix** of "running" because it matches the beginning within an edit distance of `1`.

To treat the query string as a prefix, set the second argument of `fuzzy` to either `t` or `"true"`:

```sql theme={null}
SELECT id, description
FROM mock_items
WHERE description === 'rann'::pdb.fuzzy(1, t)
LIMIT 5;
```

Postgres requires that `true` be double-quoted, i.e. `fuzzy(1, "true")`.

When used with [match](/documentation/full-text/match) queries, fuzzy prefix treats all tokens in the query string as prefixes. For instance, the following query means "find all documents containing the fuzzy prefix `rann` AND the fuzzy prefix `slee`":

```sql theme={null}
SELECT id, description
FROM mock_items
WHERE description &&& 'slee rann'::pdb.fuzzy(1, t)
LIMIT 5;
```

## Transposition Cost

By default, the cost of a transposition (i.e. "shose" → "shoes") is `2`. Setting the third argument of `fuzzy` to `t` lowers the cost of a transposition to `1`:

```sql theme={null}
SELECT id, description
FROM mock_items
WHERE description === 'shose'::pdb.fuzzy(1, f, t)
LIMIT 5;
```

The default value for the second and third arguments of `fuzzy` is `f`, which means `fuzzy(1)` is equivalent to `fuzzy(1, f, f)`.
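Putting these options together, here is a sketch that combines an edit distance of `2`, prefix matching, and cheaper transpositions. The query string is illustrative; the argument order (edit distance, prefix flag, transposition flag) follows the sections above.

```sql theme={null}
-- Edit distance of 2, treat each token as a prefix (second argument),
-- and count a transposition as a single edit (third argument)
SELECT id, description
FROM mock_items
WHERE description ||| 'runn shose'::pdb.fuzzy(2, t, t)
LIMIT 5;
```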
# Highlighting
Source: https://docs.paradedb.com/documentation/full-text/highlight

Generate snippets for portions of the source text that match the query string

Highlighting is an expensive process and can slow down query times. We recommend passing a `LIMIT` to any query where `pdb.snippet` or `pdb.snippets` is called to restrict the number of snippets that need to be generated.

Highlighting is not supported for fuzzy search.

Highlighting refers to the practice of visually emphasizing the portions of a document that match a user's search query.

## Basic Usage

`pdb.snippet()` can be added to any query where an `@@@` operator is present. `pdb.snippet` returns the single best snippet, sorted by relevance score.

The following query generates highlighted snippets against the `description` field.

```sql theme={null}
SELECT id, pdb.snippet(description)
FROM mock_items
WHERE description ||| 'shoes'
LIMIT 5;
```

`pdb.snippet` accepts the following optional parameters:

* `start_tag`: The leading indicator around the highlighted region.
* `end_tag`: The trailing indicator around the highlighted region.
* `max_num_chars`: Max number of characters for a highlighted snippet. A snippet may contain multiple matches if they are close to each other.

By default, `<b></b>` encloses the snippet. This can be configured with `start_tag` and `end_tag`:

```sql theme={null}
SELECT id, pdb.snippet(description, start_tag => '<i>', end_tag => '</i>')
FROM mock_items
WHERE description ||| 'shoes'
LIMIT 5;
```

## Multiple Snippets

`pdb.snippets()` returns an array of snippets, allowing you to retrieve multiple highlighted matches from a document. This is particularly useful when a document has several relevant matches spread throughout its content.

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15)
FROM mock_items
WHERE description ||| 'artistic vase'
LIMIT 5;
```

```ini Expected Response theme={null}
 id |                 snippets
----+------------------------------------------
 19 | {<b>Artistic</b>,"ceramic <b>vase</b>"}
(1 row)
```

`pdb.snippets` accepts the following optional parameters:

* `start_tag`: The leading indicator around the highlighted region.
* `end_tag`: The trailing indicator around the highlighted region.
* `max_num_chars`: Max number of characters for a highlighted snippet. When `max_num_chars` is small, multiple snippets may be generated for a single document.
* `limit`: The maximum number of snippets to return per document.
* `offset`: The number of snippets to skip before returning results. Use with `limit` for pagination.
* `sort_by`: The order in which to sort the snippets. Can be `'score'` (default, sorts by relevance) or `'position'` (sorts by appearance in the document).

### Limiting and Offsetting Snippets

You can control the number and order of snippets returned using the `limit`, `offset`, and `sort_by` parameters. For example, to get only the first snippet:

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15, "limit" => 1)
FROM mock_items
WHERE description ||| 'running'
LIMIT 5;
```

To get the second snippet (by skipping the first one):

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15, "limit" => 1, "offset" => 1)
FROM mock_items
WHERE description ||| 'running'
LIMIT 5;
```

### Sorting Snippets

Snippets can be sorted either by their relevance score (`'score'`) or their position within the document (`'position'`). To sort snippets by their appearance in the document:

```sql theme={null}
SELECT id, pdb.snippets(description, max_num_chars => 15, sort_by => 'position')
FROM mock_items
WHERE description ||| 'artistic vase'
LIMIT 5;
```

## Byte Offsets

`pdb.snippet_positions()` returns the byte offsets in the original text where the snippets would appear.
It returns an array of tuples, where the first element of the tuple is the byte index of the first byte of the highlighted region, and the second element is the byte index after the last byte of the region. ```sql theme={null} SELECT id, pdb.snippet(description), pdb.snippet_positions(description) FROM mock_items WHERE description ||| 'shoes' LIMIT 5; ``` ```ini Expected Response theme={null} id | snippet | snippet_positions ----+----------------------------+------------------- 3 | Sleek running shoes | {"{14,19}"} 4 | White jogging shoes | {"{14,19}"} 5 | Generic shoes | {"{8,13}"} (3 rows) ``` # Match Source: https://docs.paradedb.com/documentation/full-text/match Returns documents that match the provided query string, which is tokenized before matching Match queries are the go-to query type for text search in ParadeDB. There are two types of match queries: [match disjunction](#match-disjunction) and [match conjunction](#match-conjunction). ## Match Disjunction Match disjunction uses the `|||` operator and means "find all documents that contain one or more of the terms tokenized from this text input." To understand what this looks like in practice, let's consider the following query: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes'; ``` This query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear White jogging shoes | 3 | Footwear Generic shoes | 4 | Footwear (3 rows) ``` ### How It Works Let's look at what the `|||` operator does: 1. Retrieves the tokenizer configuration of the `description` column. In this example, let's assume `description` uses the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer. 2. Tokenizes the query string with the same tokenizer. This means `running shoes` becomes two tokens: `running` and `shoes`. 3. Finds all rows where `description` contains **any one** of the tokens, `running` or `shoes`. This is why all results have either `running` or `shoes` tokens in `description`. ### Examples Let's consider a few more hypothetical documents to see whether they would be returned by match disjunction. These examples assume that the index uses the default tokenizer and token filters, and that the query is `running shoes`. | Original Text | Tokens | Match | Reason | Related | | ------------------- | ------------------------- | ----- | --------------------------------------- | --------------------------------------------------------------------- | | Sleek running shoes | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | | | Running shoes sleek | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Phrase](/documentation/full-text/phrase) | | SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Lowercasing](/documentation/indexing/create-index) | | Sleek run shoe | `sleek` `run` `shoe` | ❌ | Contains neither `running` nor `shoes`. | [Stemming](/documentation/indexing/create-index) | | Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌ | Contains neither `running` nor `shoes`. | [Fuzzy](/documentation/full-text/fuzzy) | | White jogging shoes | `white` `jogging` `shoes` | ✅ | Contains `shoes`. | [Match conjunction](/documentation/full-text/match#match-conjunction) | ## Match Conjunction Suppose we want to find rows that contain both `running` **and** `shoes`. This is where the `&&&` match conjunction operator comes in. 
`&&&` means "find all documents that contain all terms tokenized from this text input." ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& 'running shoes'; ``` This query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` Note that `White jogging shoes` and `Generic shoes` are no longer returned because they do not have the token `running`. ### How It Works Match conjunction works exactly like match disjunction, except for one key distinction. Instead of finding documents containing at least one matching token from the query, it finds documents where **all tokens** from the query are a match. ### Examples Let’s consider a few more hypothetical documents to see whether they would be returned by match conjunction. These examples assume that the index uses the default tokenizer and token filters, and that the query is `running shoes`. | Original Text | Tokens | Match | Reason | Related | | ------------------- | ------------------------- | ----- | -------------------------------------------- | --------------------------------------------------------------------- | | Sleek running shoes | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | | | Running shoes sleek | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Phrase](/documentation/full-text/phrase) | | SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅ | Contains both `running` and `shoes`. | [Lowercasing](/documentation/indexing/create-index) | | Sleek run shoe | `sleek` `run` `shoe` | ❌ | Does not contain both `running` and `shoes`. | [Stemming](/documentation/indexing/create-index) | | Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌ | Does not contain both `running` and `shoes`. | [Fuzzy](/documentation/full-text/fuzzy) | | White jogging shoes | `white` `jogging` `shoes` | ❌ | Does not contain both `running` and `shoes`. | [Match conjunction](/documentation/full-text/match#match-conjunction) | If the query string only contains one token, then `|||` and `&&&` are effectively the same: ```sql theme={null} -- These two queries produce the same results SELECT description, rating, category FROM mock_items WHERE description ||| 'shoes'; SELECT description, rating, category FROM mock_items WHERE description &&& 'shoes'; ``` ## Using a Custom Tokenizer By default, the match query automatically tokenizes the query string with the same tokenizer used by the field it's being searched against. This behavior can be overridden by explicitly casting the query to a different tokenizer. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes'::pdb.whitespace; ``` ## Using Pretokenized Text The match operators also accept text arrays. If a text array is provided, each element of the array is treated as an exact token, which means that no further processing is done. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& ARRAY['running', 'shoes']; ``` # How Text Search Works Source: https://docs.paradedb.com/documentation/full-text/overview Understand how ParadeDB uses token matching to efficiently search large corpuses of text Text search in ParadeDB, like Elasticsearch and most search engines, is centered around the concept of **token matching**. Token matching consists of two steps. 
First, at indexing time, text is processed by a tokenizer, which breaks input into discrete units called **tokens** or **terms**. For example, the [default](/documentation/indexing/create-index) tokenizer splits the text `Sleek running shoes` into the tokens `sleek`, `running`, and `shoes`. Second, at query time, the query engine looks for token matches based on the specified query and query type. Some common query types include: * [Match](/documentation/full-text/match): Matches documents containing any or all query tokens * [Phrase](/documentation/full-text/phrase): Matches documents where all tokens appear in the same order as the query * [Term](/documentation/full-text/term): Matches documents containing an exact token * ...and many more [advanced](/documentation/query-builder/overview) query types ## Not Substring Matching While ParadeDB supports substring matching via [regex](/documentation/query-builder/term/regex) queries, it's important to note that token matching is **not** the same as substring matching. Token matching is a much more versatile and powerful technique. It enables relevance scoring, language-specific analysis, typo tolerance, and more expressive query types — capabilities that go far beyond simply looking for a sequence of characters. ## Similarity Search Text search is different than similarity search, also known as vector search. Whereas text search matches based on token matches, similarity search matches based on semantic meaning. ParadeDB currently does not build its own extensions for similarity search. Most ParadeDB users install [pgvector](https://github.com/pgvector/pgvector), the Postgres extension for vector search, for this use case. We have tentative long-term plans in our [roadmap](/welcome/roadmap#vector-search-improvements) to make improvements to Postgres' vector search. If this is useful to you, please [reach out](mailto:support@paradedb.com). # Phrase Source: https://docs.paradedb.com/documentation/full-text/phrase Phrase queries are like match queries, but with order and position of matching tokens enforced Phrase queries work exactly like [match conjunction](/documentation/full-text/match#match-conjunction), but are more strict in that they require the order and position of tokens to be the same. Suppose our query is `running shoes`, and we want to omit results like `running sleek shoes` or `shoes running` — these results contain the right tokens, but not in the exact order and position that the query specifies. Enter the `###` phrase operator: ```sql theme={null} INSERT INTO mock_items (description, rating, category) VALUES ('running sleek shoes', 5, 'Footwear'), ('shoes running', 5, 'Footwear'); SELECT description, rating, category FROM mock_items WHERE description ### 'running shoes'; ``` This query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` Note that `running sleek shoes` and `shoes running` did not match the phrase `running shoes` despite having the tokens `running` and `shoes` because they appear in the wrong order or with other words in between. ## How It Works Let's look at what happens under the hood for the above phrase query: 1. Retrieves the tokenizer configuration of the `description` column. In this example, let's assume `description` uses the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer. 2. Tokenizes the query string with the same tokenizer. 
This means `running shoes` becomes two tokens: `running` and `shoes`.
3. Finds all rows where `description` contains `running` immediately followed by `shoes`.

## Examples

Let’s consider a few more hypothetical documents to see whether they would be returned by the phrase query. These examples assume that the index uses the default tokenizer and token filters, and that the query is `running shoes`.

| Original Text       | Tokens                    | Match | Reason                                               | Related                                             |
| ------------------- | ------------------------- | ----- | ---------------------------------------------------- | --------------------------------------------------- |
| Sleek running shoes | `sleek` `running` `shoes` | ✅     | Contains `running` and `shoes`, in that order.       |                                                     |
| Running shoes sleek | `sleek` `running` `shoes` | ✅     | Contains `running` immediately followed by `shoes`.  |                                                     |
| SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅     | Contains `running` and `shoes`, in that order.       | [Lowercasing](/documentation/indexing/create-index) |
| Sleek run shoe      | `sleek` `run` `shoe`      | ❌     | Does not contain both `running` and `shoes`.         | [Stemming](/documentation/indexing/create-index)    |
| Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌     | Does not contain both `running` and `shoes`.         |                                                     |
| White jogging shoes | `white` `jogging` `shoes` | ❌     | Does not contain both `running` and `shoes`.         |                                                     |

## Adding Slop

Slop allows the token ordering requirement of phrase queries to be relaxed. It specifies how many changes — like extra words in between or transposed word positions — are allowed while still considering the phrase a match:

* An extra word in between (e.g. `sleek shoes` vs. `sleek running shoes`) has a slop of `1`
* A transposition (e.g. `running shoes` vs. `shoes running`) has a slop of `2`

To apply slop to a phrase query, cast the query to `slop(n)`, where `n` is the maximum allowed slop.

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### 'shoes running'::pdb.slop(2);
```

## Using a Custom Tokenizer

The phrase query supports custom query tokenization.

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### 'running shoes'::pdb.whitespace;
```

## Using Pretokenized Text

The phrase operator also accepts a text array as the right-hand side argument. If a text array is provided, each element of the array is treated as an exact token, which means that no further processing is done.

The following query matches documents containing the token `running` immediately followed by `shoes`:

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### ARRAY['running', 'shoes'];
```

Adding slop is supported:

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description ### ARRAY['shoes', 'running']::pdb.slop(2);
```

# Proximity
Source: https://docs.paradedb.com/documentation/full-text/proximity

Match documents based on token proximity within the source document

Proximity queries are used to match documents containing tokens that are within a certain token distance of one another.

## Overview

The following query finds all documents where the token `sleek` is at most `1` token away from `shoes`.

```sql theme={null}
SELECT description, rating, category
FROM mock_items
WHERE description @@@ ('sleek' ## 1 ## 'shoes');
```

Like the [term](/documentation/full-text/term) query, the query string in a proximity query is treated as a finalized token.
`##` does not care about order -- the term on the left-hand side may appear before or after the term on the right-hand side. To ensure that the left-hand term appears before the right-hand term, use `##>`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ ('sleek' ##> 1 ##> 'shoes'); ``` ## Proximity Regex In addition to exact tokens, proximity queries can also match against regex expressions. The following query finds all documents where any token matching the regex query `sl.*` is at most `1` token away from the token `shoes`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_regex('sl.*') ## 1 ## 'shoes'); ``` By default, `pdb.prox_regex` will expand to the first `50` regex matches in each document. This limit can be overridden by providing a second argument: ```sql theme={null} -- Expand up to 100 regex matches SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_regex('sl.*', 100) ## 1 ## 'shoes'); ``` ## Proximity Array `pdb.prox_array` matches against an array of tokens instead of a single token. For example, the following query finds all documents where any of the tokens `sleek` or `white` is within `1` token of `shoes`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_array('sleek', 'white') ## 1 ## 'shoes'); ``` `pdb.prox_array` can also take regex: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ (pdb.prox_array(pdb.prox_regex('sl.*'), 'white') ## 1 ## 'shoes'); ``` # Term Source: https://docs.paradedb.com/documentation/full-text/term Look for exact token matches in the source document, without any further processing of the query string Term queries look for exact token matches. A term query is like an exact string match, but at the token level. Unlike [match](/documentation/full-text/match) or [phrase](/documentation/full-text/phrase) queries, term queries treat the query string as a **finalized token**. This means that the query string is taken as-is, without any further tokenization or filtering. Term queries use the `===` operator. To understand exactly how it works, let's consider the following two term queries: ```sql theme={null} -- Term query 1 SELECT description, rating, category FROM mock_items WHERE description === 'running'; -- Term query 2 SELECT description, rating, category FROM mock_items WHERE description === 'RUNNING'; ``` The first query returns: ```csv theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` However, the second query returns no results. This is because term queries look for exact matches, which includes case sensitivity, and there are no documents in the example dataset containing the token `RUNNING`. All tokenizers besides the literal tokenizer [lowercase](/documentation/token-filters/lowercase) tokens by default. Make sure to account for this when searching for a term. If you are using `===` to do an exact string match on the original text, make sure that the text uses the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. ## How It Works Under the hood, `===` simply finds all documents where any of their tokens are an exact string match against the query token. A document's tokens are determined by the field's tokenizer and token filters, configured at index creation time. 
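For example, if the goal is an exact match against the original text, the column can be indexed with the literal tokenizer so that the entire value becomes a single token. A minimal sketch, assuming `category` is indexed this way (as in the filtering documentation):

```sql theme={null}
-- With the literal tokenizer, the whole value is a single token,
-- so === behaves like an exact, case-sensitive string match
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, (category::pdb.literal))
WITH (key_field = 'id');

SELECT description, rating, category
FROM mock_items
WHERE category === 'Footwear';
```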
## Examples Let’s consider a few more hypothetical documents to see whether they would be returned by the term query. These examples assume that index uses the default tokenizer and token filters, and that the term query is `running`. | Original Text | Tokens | Match | Reason | Related | | ------------------- | ------------------------- | ----- | ------------------------------------- | --------------------------------------------------- | | Sleek running shoes | `sleek` `running` `shoes` | ✅ | Contains the token `running`. | | | Running shoes sleek | `sleek` `running` `shoes` | ✅ | Contains the token `running`. | | | SLeeK RUNNING ShOeS | `sleek` `running` `shoes` | ✅ | Contains the token `running`. | [Lowercasing](/documentation/indexing/create-index) | | Sleek run shoe | `sleek` `run` `shoe` | ❌ | Does not contain the token `running`. | [Stemming](/documentation/indexing/create-index) | | Sleke ruining shoez | `sleke` `ruining` `shoez` | ❌ | Does not contain the token `running`. | [Fuzzy](/documentation/full-text/fuzzy) | | White jogging shoes | `white` `jogging` `shoes` | ❌ | Does not contain the token `running`. | | ## Term Set Passing a text array to the right-hand side of `===` means "find all documents containing any one of these tokens." ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description === ARRAY['shoes', 'running']; ``` # Install ParadeDB Source: https://docs.paradedb.com/documentation/getting-started/install How to run the ParadeDB Docker image The fastest way to install ParadeDB is by pulling the ParadeDB Docker image and running it locally. If your primary Postgres is in a virtual private cloud (VPC), we recommend deploying ParadeDB on a compute instance within your VPC to avoid exposing public IP addresses and needing to provision traffic routing rules. **Note**: ParadeDB supports Postgres 15+, and the `latest` tag ships with Postgres 18. To specify a different Postgres version, please refer to the available tags on [Docker Hub](https://hub.docker.com/r/paradedb/paradedb/tags). ```bash theme={null} docker run \ --name paradedb \ -e POSTGRES_USER=myuser \ -e POSTGRES_PASSWORD=mypassword \ -e POSTGRES_DB=mydatabase \ -v paradedb_data:/var/lib/postgresql/ \ -p 5432:5432 \ -d \ paradedb/paradedb:latest ``` You may replace `myuser`, `mypassword`, and `mydatabase` with whatever values you want. These will be your database connection credentials. To connect to ParadeDB, install the `psql` client and run ```bash theme={null} docker exec -it paradedb psql -U myuser -d mydatabase -W ``` To see all the ways in which you can install ParadeDB, please refer to our [deployment documentation](/deploy/overview). That's it! Next, let's [run a few queries](/documentation/getting-started/quickstart) over mock data with ParadeDB. # Load Data from Postgres Source: https://docs.paradedb.com/documentation/getting-started/load Dump data from an existing Postgres and load into ParadeDB The easiest way to copy data from another Postgres into ParadeDB is with the `pg_dump` and `pg_restore` utilities. These are installed by default when you install `psql`. This approach is ideal for quickly testing ParadeDB. See the [deployment guide](/deploy/overview) for how to deploy ParadeDB into production. ## Create a Dump Run `pg_dump` to create a copy of your database. The `pg_dump` version needs be greater than or equal to that of your Postgres database. You can check the version with `pg_dump --version`. 
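For example, assuming the Postgres client tools are on your `PATH`, you can compare the client and server versions before dumping:

```bash theme={null}
# Version of the pg_dump client
pg_dump --version

# Version of the existing Postgres server
psql -h <host> -U <username> -d <dbname> -c 'SELECT version();'
```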
Below, we use the "custom" format (`-Fc`) for both `pg_dump` and `pg_restore`. Please review the [Postgres `pg_dump` documentation](https://www.postgresql.org/docs/current/app-pgdump.html) for other options that may be more appropriate for your environment.

Replace `host`, `username`, and `dbname` with your existing Postgres database credentials. If you deployed ParadeDB within your VPC, the `host` will be the private IP address of your existing Postgres database.

```bash theme={null}
pg_dump -Fc --no-acl --no-owner \
  -h <host> \
  -U <username> \
  <dbname> > old_db.dump
```

If your database is large, this can take some time. You can speed this up by dumping specific tables.

```bash theme={null}
pg_dump -Fc --no-acl --no-owner \
  -h <host> \
  -U <username> \
  -t <table_1> -t <table_2> \
  <dbname> > old_db.dump
```

## Restore the Dump

Run `pg_restore` to load this data into ParadeDB. The `pg_restore` version needs to be greater than or equal to that of your `pg_dump`. You can check the version with `pg_restore --version`.

Replace `host`, `username`, and `dbname` with your ParadeDB credentials.

```bash theme={null}
pg_restore --verbose --clean --no-acl --no-owner \
  -h <host> \
  -U <username> \
  -d <dbname> \
  -Fc \
  old_db.dump
```

Congratulations! You are now ready to run real queries over your data. To get started, refer to our [full text search documentation](https://docs.paradedb.com/documentation/full-text/overview).

# Quickstart
Source: https://docs.paradedb.com/documentation/getting-started/quickstart

Get started with ParadeDB in five minutes

This guide will walk you through a few queries to give you a feel for ParadeDB.

## Create Example Table

ParadeDB comes with a helpful procedure that creates a table populated with mock data to help you get started. Once connected with `psql`, run the following commands to create and inspect this table.

```sql theme={null}
CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;
```

```ini Expected Response theme={null}
        description        | rating |  category
---------------------------+--------+-------------
 Ergonomic metal keyboard  |      4 | Electronics
 Plastic Keyboard          |      4 | Electronics
 Sleek running shoes       |      5 | Footwear
(3 rows)
```

Next, let's create a BM25 index called `search_idx` on this table. A BM25 index is a covering index, which means that multiple columns can be included in the same index.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range)
WITH (key_field='id');
```

As a general rule of thumb, any columns that you want to filter, `GROUP BY`, `ORDER BY`, or aggregate as part of a full text query should be added to the index for faster performance.

Note the mandatory `key_field` option. See [choosing a key field](/documentation/indexing/create-index#choosing-a-key-field) for more details.

## Match Query

We're now ready to execute a basic text search query. We'll look for matches where `description` matches `running shoes` and `rating` is greater than `2`.
```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' AND rating > 2 ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category ---------------------+--------+---------- White jogging shoes | 3 | Footwear Generic shoes | 4 | Footwear Sleek running shoes | 5 | Footwear (3 rows) ``` `|||` is ParadeDB's custom [match disjunction](/documentation/full-text/match#disjunction) operator, which means "find all documents containing `running OR shoes`." If we want all documents containing `running AND shoes`, we can use ParadeDB's `&&&` [match conjunction](/documentation/full-text/match#conjunction) operator. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& 'running shoes' AND rating > 2 ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category ---------------------+--------+---------- Sleek running shoes | 5 | Footwear (1 row) ``` ## BM25 Scoring Next, let's add [BM25 scoring](/documentation/sorting/score) to the results, which sorts matches by relevance. To do this, we'll use `pdb.score`. ```sql theme={null} SELECT description, pdb.score(id) FROM mock_items WHERE description ||| 'running shoes' AND rating > 2 ORDER BY score DESC LIMIT 5; ``` ```ini Expected Response theme={null} description | score ---------------------+----------- Sleek running shoes | 6.817111 Generic shoes | 3.8772602 White jogging shoes | 3.4849067 (3 rows) ``` ## Highlighting Finally, let's also [highlight](/documentation/full-text/highlight) the relevant portions of the documents that were matched. To do this, we'll use `pdb.snippet`. ```sql theme={null} SELECT description, pdb.snippet(description), pdb.score(id) FROM mock_items WHERE description ||| 'running shoes' AND rating > 2 ORDER BY score DESC LIMIT 5; ``` ```ini Expected Response theme={null} description | snippet | score ---------------------+-----------------------------------+----------- Sleek running shoes | Sleek <b>running</b> <b>shoes</b> | 6.817111 Generic shoes | Generic <b>shoes</b> | 3.8772602 White jogging shoes | White jogging <b>shoes</b> | 3.4849067 (3 rows) ``` ## Top N ParadeDB is highly optimized for quickly returning the [Top N](/documentation/sorting/topn) results out of the index. In SQL, this means queries that contain an `ORDER BY...LIMIT`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category ---------------------+--------+---------- White jogging shoes | 3 | Footwear Generic shoes | 4 | Footwear Sleek running shoes | 5 | Footwear (3 rows) ``` ## Facets [Faceted queries](/documentation/aggregates/facets) allow a single query to return both the Top N results and an aggregate value, which is more CPU-efficient than issuing two separate queries. For example, the following query returns the top 3 results as well as the total number of results matched.
```sql theme={null} SELECT description, rating, category, pdb.agg('{"value_count": {"field": "id"}}') OVER () FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` ```ini Expected Response theme={null} description | rating | category | agg ---------------------+--------+----------+---------------- White jogging shoes | 3 | Footwear | {"value": 3.0} Generic shoes | 4 | Footwear | {"value": 3.0} Sleek running shoes | 5 | Footwear | {"value": 3.0} (3 rows) ``` That's it! Next, let's [load your data](/documentation/getting-started/load) to start running real queries. # Create an Index Source: https://docs.paradedb.com/documentation/indexing/create-index Index a Postgres table for full text search Before a table can be searched, it must be indexed. ParadeDB uses a custom index type called the BM25 index. The following code block creates a BM25 index over several columns in the `mock_items` table. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category) WITH (key_field='id'); ``` By default, text columns are tokenized using the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer, which splits text according to the Unicode segmentation standard. Because index creation is a time-consuming operation, we recommend experimenting with the [available tokenizers](/documentation/tokenizers/overview) to find the most suitable one before running `CREATE INDEX`. For instance, if a column contains multiple languages, the ICU tokenizer may be more appropriate. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.icu), category) WITH (key_field='id'); ``` Only one BM25 index can exist per table. We recommend indexing all columns in a table that may be present in a search query, including columns used for sorting, grouping, filtering, and aggregations. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range) WITH (key_field='id'); ``` Most Postgres types, including text, JSON, numeric, timestamp, range, boolean, and arrays, can be indexed. ## Track Create Index Progress To monitor the progress of a long-running `CREATE INDEX`, open a separate Postgres connection and query `pg_stat_progress_create_index`: ```sql theme={null} SELECT pid, phase, blocks_done, blocks_total FROM pg_stat_progress_create_index; ``` Comparing `blocks_done` to `blocks_total` will provide a good approximation of the progress so far. If `blocks_done` equals `blocks_total`, that means that all rows have been indexed and the index is being flushed to disk. ## Choosing a Key Field In the `CREATE INDEX` statement above, note the mandatory `key_field` option. Every BM25 index needs a `key_field`, which is the name of a column that will function as a row’s unique identifier within the index. The `key_field` must: 1. Have a `UNIQUE` constraint. Usually this means the table's `PRIMARY KEY`. 2. Be the first column in the column list. 3. Be untokenized, if it is a text field. ## Token Filters After tokens are created, [token filters](/documentation/token-filters/overview) can be configured to apply further processing like lowercasing, stemming, or unaccenting. 
For example, the following code block adds English stemming to `description`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stemmer=english')), category) WITH (key_field='id'); ``` # Indexing Text Arrays Source: https://docs.paradedb.com/documentation/indexing/indexing-arrays Add text arrays to the index The BM25 index accepts arrays of type `text[]` or `varchar[]`. ```sql theme={null} CREATE TABLE array_demo (id SERIAL PRIMARY KEY, categories TEXT[]); INSERT INTO array_demo (categories) VALUES ('{"food","groceries and produce"}'), ('{"electronics","computers"}'), ('{"books","fiction","mystery"}'); CREATE INDEX ON array_demo USING bm25 (id, categories) WITH (key_field = 'id'); ``` Under the hood, each element in the array is indexed as a separate entry. This means that an array is considered a match if **any** of its entries is a match. ```sql theme={null} SELECT * FROM array_demo WHERE categories === 'food'; ``` ```ini Expected Response theme={null} id | categories ----+-------------------------------- 1 | {food,"groceries and produce"} (1 row) ``` Text arrays can be [tokenized](/documentation/tokenizers/overview) and [filtered](/documentation/token-filters/overview) in the same way as text fields: ```sql theme={null} CREATE INDEX ON array_demo USING bm25 (id, (categories::pdb.literal)) WITH (key_field = 'id'); ``` # Indexing Expressions Source: https://docs.paradedb.com/documentation/indexing/indexing-expressions Add Postgres expressions to the index In addition to indexing columns, Postgres expressions can also be indexed. ## Indexing Text/JSON Expressions The following statement indexes an expression which concatenates `description` and `category`, which are both text fields: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, ((description || ' ' || category)::pdb.simple('alias=description_concat'))) WITH (key_field='id'); ``` To index a text/JSON expression: 1. Add the expression to the column list. In this example, the expression is `description || ' ' || category`. 2. Cast it to a [tokenizer](/documentation/tokenizers/overview), in this example `pdb.simple`. 3. ParadeDB will try and infer a field name based on the field used in the expression. However, if the field name cannot be inferred (e.g. because the expression involves more than one field), you will be required to add an `alias=` to the tokenizer. Querying against the expression is the same as querying a regular field: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE (description || ' ' || category) &&& 'running shoes'; ``` The expression on the left-hand side of the operator must exactly match the expression that was indexed. ## Indexing Non-Text Expressions To index a non-text expression, cast the expression to `pdb.alias`. For example, the following statement indexes the expression `rating + 1`, which returns an integer: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, ((rating + 1)::pdb.alias('rating'))) WITH (key_field='id'); ``` With the expression indexed, queries containing the expression can be pushed down to the ParadeDB index: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description &&& 'running shoes' AND rating + 1 > 3; ``` # Indexing JSON Source: https://docs.paradedb.com/documentation/indexing/indexing-json Add JSON and JSONB types to the index When indexing JSON, ParadeDB automatically indexes all sub-fields of the JSON object. 
The type of each sub-field is also inferred automatically. For example, consider the following statement where `metadata` is `JSONB`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, metadata) WITH (key_field='id'); ``` A single `metadata` JSON may look like: ```json theme={null} { "color": "Silver", "location": "United States" } ``` ParadeDB will automatically index both `metadata.color` and `metadata.location` as text. By default, all text sub-fields of a JSON object use the same tokenizer. The tokenizer can be configured the same way as text fields: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (metadata::pdb.ngram(2,3))) WITH (key_field='id'); ``` Instead of indexing the entire JSON, sub-fields of the JSON can be indexed individually. This allows for configuring separate tokenizers within a larger JSON: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, ((metadata->>'color')::pdb.ngram(2,3))) WITH (key_field='id'); ``` # Reindexing Source: https://docs.paradedb.com/documentation/indexing/reindexing Rebuild an existing index with zero downtime ## Changing the Schema If an index's schema is changed, it must be rebuilt. This includes: 1. Adding a field to the index 2. Removing a field from the index 3. Renaming an indexed column in the underlying table 4. Changing a field's tokenizer Let's assume the existing index is called `search_idx`, and we want to create a new index called `search_idx_v2`. First, use `CREATE INDEX CONCURRENTLY` to build a new index in the background. ```sql theme={null} CREATE INDEX CONCURRENTLY search_idx_v2 ON mock_items USING bm25 (id, description, category) WITH (key_field = 'id'); ``` The `CONCURRENTLY` clause is required. `CONCURRENTLY` allows the existing index to continue serving queries while the new index is being built. From another session, you can use `pg_stat_progress_create_index` to [track the progress](/documentation/indexing/create-index#track-create-index-progress) of the new index. Once the new index is done building, confirm that it is valid: ```sql theme={null} SELECT ix.indisvalid, ix.indisready, ix.indislive FROM pg_class i JOIN pg_index ix ON ix.indexrelid = i.oid WHERE i.relname = 'search_idx_v2'; ``` ```csv Expected Response theme={null} indisvalid | indisready | indislive ------------+------------+----------- t | t | t (1 row) ``` If all three columns are `true`, the original index can safely be dropped, which will redirect queries to the new index. ```sql theme={null} DROP INDEX search_idx; ``` ## Rebuilding the Index `REINDEX` is used to rebuild an index without changing the schema. The basic syntax for `REINDEX` is: ```sql theme={null} REINDEX INDEX search_idx; ``` This operation takes an exclusive lock on the table, which blocks incoming writes (but not reads) while the new index is being built. To allow for concurrent writes during a reindex, use `REINDEX CONCURRENTLY`: ```sql theme={null} REINDEX INDEX CONCURRENTLY search_idx; ``` The tradeoff is that `REINDEX CONCURRENTLY` is slower than a plain `REINDEX`. Generally speaking, `REINDEX CONCURRENTLY` is recommended for production systems that cannot tolerate temporarily blocked writes. ## Important Caveats Although `CREATE INDEX CONCURRENTLY` and `REINDEX CONCURRENTLY` run in the background, Postgres requires that the session that is executing the command remain open. If the session is closed, Postgres will cancel the operation.
This is relevant if you are using a connection pooler like `pgbouncer`, which may terminate sessions after a certain idle timeout is reached. If `REINDEX CONCURRENTLY` fails or is cancelled, an invalid transient index will be left behind that must be dropped manually. To check for invalid indexes in `psql`, run `\d <table_name>` and look for `INVALID` indexes. # Verify Index Integrity Source: https://docs.paradedb.com/documentation/indexing/verify-index Check BM25 indexes for corruption and structural issues ParadeDB provides `amcheck`-style index verification functions to detect corruption and validate the structural integrity of BM25 indexes. These functions are useful for: * Proactive corruption detection before issues become critical * Validating index health after hardware failures or unexpected shutdowns * Verifying backup integrity * Debugging index-related issues ## Basic Verification The `pdb.verify_index` function performs structural integrity checks on a BM25 index: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx'); ``` This returns a table with three columns: | Column | Type | Description | | ------------ | ------- | --------------------------------------------- | | `check_name` | text | Name of the verification check | | `passed` | boolean | Whether the check passed | | `details` | text | Additional information about the check result | ### Example Output ``` check_name | passed | details ----------------------------------------+--------+----------------------------------------------- search_idx: schema_valid | t | Index schema loaded successfully search_idx: index_readable | t | Index reader opened successfully search_idx: checksums_valid | t | All segment checksums validated successfully search_idx: segment_metadata_valid | t | 3 segments validated successfully ``` ## Heap Reference Validation To verify that all indexed entries still exist in the heap table, use the `heapallindexed` option: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true); ``` This adds an additional check that validates every indexed `ctid` (tuple identifier) references a valid row in the table. This is particularly useful for detecting index entries that reference deleted or non-existent rows. The `heapallindexed` option can be slow on large indexes as it must verify every document. Consider using `sample_rate` for quick spot checks on large indexes. ## Options ### Sampling for Large Indexes For large indexes, you can check a random sample of documents instead of all documents: ```sql theme={null} -- Check 10% of documents SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, sample_rate := 0.1 ); ``` ### Progress Reporting For long-running verifications, enable progress reporting to see status updates: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true ); ``` Progress messages are emitted via PostgreSQL's `WARNING` channel.
### Verbose Mode For detailed logging including segment-by-segment progress and resume hints, enable verbose mode: ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true, verbose := true ); ``` ### Stop on First Error To stop verification immediately when the first error is found (similar to `pg_amcheck --on-error-stop`): ```sql theme={null} SELECT * FROM pdb.verify_index('search_idx', on_error_stop := true); ``` ## Parallel Verification A single `verify_index` call processes segments sequentially within one PostgreSQL backend. For very large indexes, you can distribute verification across multiple database connections by specifying which segments each connection should check using the `segment_ids` parameter. This allows you to utilize multiple CPU cores by running verification in parallel processes. ### Listing Segments First, list all segments in the index: ```sql theme={null} SELECT * FROM pdb.index_segments('search_idx'); ``` ``` partition_name | segment_idx | segment_id | num_docs | num_deleted | max_doc ----------------+-------------+------------+----------+-------------+--------- search_idx | 0 | b7e661af | 10000 | 0 | 10000 search_idx | 1 | b4fc1b40 | 10000 | 0 | 10000 search_idx | 2 | 9894b412 | 10000 | 0 | 10000 search_idx | 3 | 4d0168d6 | 5000 | 0 | 5000 ``` ### Verifying Specific Segments Then verify specific segments using the `segment_ids` parameter: ```sql theme={null} -- Worker 1: Verify even segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[0, 2] ); -- Worker 2: Verify odd segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[1, 3] ); ``` ### Automation Example Distribute verification across N workers: ```sql theme={null} -- Get segments for worker 0 (of 4 workers) SELECT array_agg(segment_idx) AS segments FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0; -- Run verification with those segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ( SELECT array_agg(segment_idx) FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0 ) ); ``` ## Verifying All BM25 Indexes To verify all BM25 indexes in the database at once: ```sql theme={null} SELECT * FROM pdb.verify_all_indexes(); ``` ### Filtering by Pattern Filter indexes by schema or name pattern (using SQL `LIKE` syntax): ```sql theme={null} -- Verify indexes in the 'public' schema only SELECT * FROM pdb.verify_all_indexes(schema_pattern := 'public'); -- Verify indexes matching a name pattern SELECT * FROM pdb.verify_all_indexes(index_pattern := 'search_%'); -- Combine filters SELECT * FROM pdb.verify_all_indexes( schema_pattern := 'app_%', index_pattern := '%_idx', heapallindexed := true ); ``` ## Listing All BM25 Indexes To see all BM25 indexes in the database with summary statistics: ```sql theme={null} SELECT * FROM pdb.indexes(); ``` ``` schemaname | tablename | indexname | indexrelid | num_segments | total_docs ------------+-------------+---------------+------------+--------------+------------ public | products | products_idx | 16421 | 3 | 50000 public | documents | documents_idx | 16435 | 5 | 125000 app | articles | articles_idx | 16448 | 2 | 10000 ``` ## Function Reference ### `pdb.verify_index` Verifies a single BM25 index. 
| Parameter | Type | Default | Description | | ----------------- | -------- | ---------- | ---------------------------------------------------------- | | `index` | regclass | (required) | The index to verify | | `heapallindexed` | boolean | `false` | Check that all indexed ctids exist in the heap | | `sample_rate` | float | `NULL` | Fraction of documents to check (0.0-1.0). NULL = check all | | `report_progress` | boolean | `false` | Emit progress messages | | `verbose` | boolean | `false` | Emit detailed segment-level progress and resume hints | | `on_error_stop` | boolean | `false` | Stop on first error found | | `segment_ids` | int\[] | `NULL` | Specific segment indices to check. NULL = all segments | ### `pdb.verify_all_indexes` Verifies all BM25 indexes in the database. | Parameter | Type | Default | Description | | ----------------- | ------- | ------- | ---------------------------------------------------------- | | `schema_pattern` | text | `NULL` | Filter by schema name (SQL LIKE pattern). NULL = all | | `index_pattern` | text | `NULL` | Filter by index name (SQL LIKE pattern). NULL = all | | `heapallindexed` | boolean | `false` | Check that all indexed ctids exist in the heap | | `sample_rate` | float | `NULL` | Fraction of documents to check (0.0-1.0). NULL = check all | | `report_progress` | boolean | `false` | Emit progress messages | | `on_error_stop` | boolean | `false` | Stop on first error found | ### `pdb.index_segments` Lists all segments in a BM25 index. | Parameter | Type | Default | Description | | --------- | -------- | ---------- | -------------------- | | `index` | regclass | (required) | The index to inspect | Returns: | Column | Type | Description | | ---------------- | ------ | ------------------------------------------------ | | `partition_name` | text | Name of the index partition | | `segment_idx` | int | Segment index (use with `segment_ids` parameter) | | `segment_id` | text | Tantivy segment UUID | | `num_docs` | bigint | Number of live documents | | `num_deleted` | bigint | Number of deleted documents | | `max_doc` | bigint | Maximum document ID | ### `pdb.indexes` Lists all BM25 indexes in the database. Returns: | Column | Type | Description | | -------------- | ------ | ----------------------------------- | | `schemaname` | text | Schema containing the index | | `tablename` | text | Table the index is on | | `indexname` | text | Name of the index | | `indexrelid` | oid | OID of the index | | `num_segments` | int | Number of Tantivy segments | | `total_docs` | bigint | Total documents across all segments | # Index Creation Source: https://docs.paradedb.com/documentation/performance-tuning/create-index Settings to make index creation faster These actions can improve the performance and memory consumption of `CREATE INDEX` and `REINDEX` statements. ### Raise Parallel Indexing Workers ParadeDB uses Postgres' `max_parallel_maintenance_workers` setting to determine the degree of parallelism during `CREATE INDEX`/`REINDEX`. Postgres' default is `2`, which may be too low for large tables. ```sql theme={null} SET max_parallel_maintenance_workers = 8; ``` In order for `max_parallel_maintenance_workers` to take effect, it must be less than or equal to both `max_parallel_workers` and `max_worker_processes`. ### Configure Indexing Memory The default Postgres `maintenance_work_mem` value of `64MB` is quite conservative and can slow down parallel index builds. We recommend at least `64MB` per [parallel indexing worker](#raise-parallel-indexing-workers). 
```sql theme={null} SET maintenance_work_mem = '2GB'; ``` Each worker is required to have at least `15MB` memory. If `maintenance_work_mem` is set too low, an error will be returned. ### Defer Index Creation If possible, creating the BM25 index should be deferred until **after** a table has been populated. To illustrate: ```sql theme={null} -- This is preferred CREATE TABLE test (id SERIAL, data text); INSERT INTO test (data) VALUES ('hello world'), ('many more values'); CREATE INDEX ON test USING bm25 (id, data) WITH (key_field = 'id'); -- ...to this CREATE TABLE test (id SERIAL, data text); CREATE INDEX ON test USING bm25 (id, data) WITH (key_field = 'id'); INSERT INTO test (data) VALUES ('hello world'), ('many more values'); ``` This allows the BM25 index to create a more tightly packed, efficient representation on disk and will lead to faster build times. # Joins Source: https://docs.paradedb.com/documentation/performance-tuning/joins Optimize JOIN queries in ParadeDB ParadeDB supports all PostgreSQL JOIN types and extends them with BM25-powered full-text search. This guide explains how JOINs behave with search, how to identify sub-optimal query plans, and offers strategies to keep queries fast. ## Supported JOIN Types ParadeDB supports all PostgreSQL JOINs: * `INNER JOIN` * `LEFT / RIGHT / FULL OUTER JOIN` * `CROSS JOIN` * `LATERAL` * Semi and Anti JOINs For the most part you can mix search and relational queries without changing your SQL. ## Scoring in JOINs When using `paradedb.score()` or `paradedb.snippet()` inside JOINs: * Scores and snippets are computed **before the JOIN** at the base table level. * JOIN conditions never change the score, they only determine which rows are combined. This design keeps scores predictable and consistent across queries. ## Performance Characteristics ### Fast Cases Queries are efficient when search filters can be applied directly to the underlying tables. In these cases, PostgreSQL can push down the `|||` operators so that each table does its own filtered index scan before the JOIN runs. That means: * Each table only contributes rows that already match the search condition. * The JOIN operates on much smaller intermediate sets. In this query, both `a.bio` and `b.content` are filtered independently. The planner runs efficient index scans on each table and then joins the results. ```sql theme={null} SELECT a.name, b.title, paradedb.score(a.id) FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science fiction' AND b.content ||| 'space travel'; ``` The plan will have this shape: ``` Gather -> Parallel Hash Join Hash Cond: (b.id = a.id) -> Parallel ParadeDB Scan on authors a -> Parallel Hash -> Parallel ParadeDB Scan on books b ``` ### Slower Cases Queries become slower when search conditions span multiple tables in a way that prevents PostgreSQL from pushing them down. The most common example is an `OR` across different tables: ```sql theme={null} SELECT a.name, b.title FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science' OR b.content ||| 'artificial'; ``` Because the condition references both `a` and `b`, PostgreSQL cannot apply it until after the join. As a result, both tables must be scanned in full, joined, and only then filtered. The plan will have this shape: ``` Gather -> Parallel Hash Join Hash Cond: (a.id = b.author_id) Join Filter: (a.bio ||| (...) 
OR b.content ||| (...)) -> Parallel Seq Scan on authors a -> Parallel Hash -> Parallel Seq Scan on books b ``` Note that the `|||` query is in the *Join Filter*, not in the scan. ## Diagnosing Performance Use `EXPLAIN` to check the query plan: ```sql theme={null} EXPLAIN (ANALYZE, BUFFERS) SELECT a.name, b.title, paradedb.score(a.id) FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science' OR b.content ||| 'artificial'; ``` Watch for: * `Custom Scan` nodes with large row counts * ParadeDB operators inside JOIN conditions * `Tantivy Query: all` (full index scan) ## Writing Faster JOIN Queries ### Replace Cross-Table OR with UNION If you don’t need scores/snippets and have a simple JOIN, express the OR as a UNION of two separately filtered joins. This lets PostgreSQL push each search predicate down to a Custom Index Scan and avoid a join-time filter. ```sql theme={null} SELECT a.name, b.title FROM authors a JOIN books b ON a.id = b.author_id WHERE a.bio ||| 'science' UNION SELECT a.name, b.title FROM authors a JOIN books b ON a.id = b.author_id WHERE b.content ||| 'artificial'; ``` ### Use CTEs for Complex Queries Use common table expressions (CTEs) to pre-filter each table with its own search condition, then join the smaller result sets together. If possible, add a `LIMIT` to each CTE to keep the result sets small. ```sql theme={null} WITH matching_authors AS ( SELECT id, name, paradedb.score(id) AS author_score FROM authors WHERE bio ||| 'science' LIMIT 100 ), matching_books AS ( SELECT id, title, author_id, paradedb.score(id) AS book_score FROM books WHERE content ||| 'artificial' LIMIT 100 ) SELECT COALESCE(ma.name, a.name) AS name, COALESCE(mb.title, b.title) AS title, ma.author_score, mb.book_score FROM matching_authors ma FULL JOIN matching_books mb ON ma.id = mb.author_id LEFT JOIN authors a ON mb.author_id = a.id AND ma.id IS NULL LEFT JOIN books b ON ma.id = b.author_id AND mb.id IS NULL; ``` BM25 scores should not simply be added together; if you want to combine scores, consider using [reciprocal rank fusion (RRF)](https://www.paradedb.com/learn/search-concepts/reciprocal-rank-fusion). ## Roadmap We want to remove the need to think about how to structure `JOIN`s in ParadeDB. At the moment we are actively working on: * A `CustomScan Join API` for native join handling * Smarter cost estimation for the PostgreSQL planner # How to Tune ParadeDB Source: https://docs.paradedb.com/documentation/performance-tuning/overview Settings for better read and write performance ParadeDB uses Postgres' settings, which can be found in the `postgresql.conf` file. To find your `postgresql.conf` file, use `SHOW`. ```sql theme={null} SHOW config_file; ``` These settings can be changed in several ways: 1. By editing the `postgresql.conf` file and restarting Postgres. This makes the setting permanent for all sessions. `postgresql.conf` accepts ParadeDB's custom `paradedb.*` settings. 2. By running `SET`. This temporarily changes the setting for the current session. Note that Postgres does not allow all `postgresql.conf` settings to be changed with `SET`. ```sql theme={null} SET maintenance_work_mem = '8GB'; ``` If ParadeDB is deployed with [CloudNativePG](/deploy/self-hosted/kubernetes), these settings should be set in your `.tfvars` file.
```hcl .tfvars theme={null} postgresql = { parameters = { max_worker_processes = 76 max_parallel_workers = 64 # Note that paradedb.* settings must be wrapped in double quotes "paradedb.global_mutable_segment_rows" = 1000 } } ``` # Read Throughput Source: https://docs.paradedb.com/documentation/performance-tuning/reads Settings to improve read performance As a general rule of thumb, the performance of expensive search queries can be greatly improved if they are able to access more parallel Postgres workers and more shared buffer memory. ## Raise Parallel Workers There are three settings that control how many parallel workers ultimately get assigned to a query. First, `max_worker_processes` is a global limit for the number of workers. Next, `max_parallel_workers` is a subset of `max_worker_processes`, and sets the limit for workers used in parallel queries. Finally, `max_parallel_workers_per_gather` limits how many workers a *single query* can receive. ```ini postgresql.conf theme={null} max_worker_processes = 72 max_parallel_workers = 64 max_parallel_workers_per_gather = 4 ``` In the above example, the maximum number of workers that a single query can receive is set to `4`. The `max_parallel_workers` pool is set to `64`, which means that `16` queries can execute simultaneously with `4` workers each. Finally, `max_worker_processes` is set to `72` to give headroom for other workers like autovacuum and replication. In practice, we recommend experimenting with different settings, as the best configuration depends on the underlying hardware, query patterns, and volume of data. If all `max_parallel_workers` are in use, Postgres will still execute additional queries, but those queries will run without parallelism. This means that queries do not fail -- they just may run slower due to lack of parallelism. ## Raise Shared Buffers `shared_buffers` controls how much memory is available to the Postgres buffer cache. We recommend allocating no more than 40% of total memory to `shared_buffers`. ```ini postgresql.conf theme={null} shared_buffers = 8GB ``` The `pg_prewarm` extension can be used to load the BM25 index into the buffer cache after Postgres restarts. A higher `shared_buffers` value allows more of the index to be stored in the buffer cache. ```sql theme={null} CREATE EXTENSION pg_prewarm; SELECT pg_prewarm('search_idx'); ``` ## Configure Autovacuum If an index experiences frequent writes, the search performance of some queries like [sorting](/documentation/sorting/score) or [aggregates](/documentation/aggregates/overview) can degrade if `VACUUM` has not been recently run. This is because writes can cause parts of Postgres' visibility map to go out of date, and `VACUUM` updates the visibility map. To determine if search performance is degraded by lack of `VACUUM`, run `EXPLAIN ANALYZE` over a query. A `Parallel Custom Scan` in the query plan with a large number of `Heap Fetches` typically means that `VACUUM` should be run. Postgres can be configured to automatically vacuum a table when a certain number of rows have been updated. Autovacuum settings can be set globally in `postgresql.conf` or for a specific table. ```sql theme={null} ALTER TABLE mock_items SET (autovacuum_vacuum_threshold = 500); ``` There are several [autovacuum settings](https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM), but the important ones to note are: 1. `autovacuum_vacuum_scale_factor` triggers an autovacuum if a certain percentage of rows in a table have been updated. 2.
`autovacuum_vacuum_threshold` triggers an autovacuum if an absolute number of rows have been updated. 3. `autovacuum_naptime` ensures that vacuum does not run too frequently. This means that setting `autovacuum_vacuum_scale_factor` to `0` and `autovacuum_vacuum_threshold` to `100000` will trigger an autovacuum for every `100000` row updates. As a general rule of thumb, we recommend autovacuuming at least once every `100000` single-row updates. ## Adjust Target Segment Count By default, `CREATE INDEX`/`REINDEX` will create as many segments as there are CPUs on the host machine. This can be changed using the `target_segment_count` index option. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, description, rating) WITH (key_field = 'id', target_segment_count = 32, ...); ``` This property is attached to the index so that during `REINDEX`, the same value will be used. It can be changed with ALTER INDEX, like so: ```sql theme={null} ALTER INDEX search_idx SET (target_segment_count = 8); ``` However, a `REINDEX` is required to rebalance the index to that segment count. For optimal performance, the segment count should equal the number of parallel workers that a query can receive, which is controlled by [`max_parallel_workers_per_gather`](/documentation/performance-tuning/reads#raise-parallel-workers). If `max_parallel_workers_per_gather` is greater than the number of CPUs on the host machine, then increasing the target segment count to match `max_parallel_workers_per_gather` can improve query performance. `target_segment_count` is merely a suggestion. While `pg_search` will endeavor to ensure the created index will have exactly this many segments, it is possible for it to have less or more. Mostly this depends on the distribution of work across parallel builder processes, memory constraints, and table size. # Write Throughput Source: https://docs.paradedb.com/documentation/performance-tuning/writes Settings to improve write performance These actions can improve the throughput of `INSERT`/`UPDATE`/`COPY` statements to the BM25 index. ## Ensure Merging Happens in the Background During every `INSERT`/`UPDATE`/`COPY`/`VACUUM`, the BM25 index runs a compaction process that looks for opportunities to merge segments together. The goal is to consolidate smaller segments into larger ones, reducing the total number of segments and improving query performance. Segments become candidates for merging if their combined size meets or exceeds one of several **configurable layer thresholds**. These thresholds define target segment sizes — such as `10KB`, `100KB`, `1MB`, etc. For each layer, the compactor checks if there are enough smaller segments whose total size adds up to the threshold. The default layer sizes are `100KB`, `1MB`, `100MB`, `1GB`, and `10GB` but can be configured. ```sql theme={null} ALTER INDEX search_idx SET (background_layer_sizes = '100MB, 1GB'); ``` By default, merging happens in the background so that writes are not blocked. The `layer_sizes` option allows merging to happen in the foreground. This is not typically recommended because it slows down writes, but can be used to apply back pressure to writes if segments are being created faster than they can be merged down. ```sql theme={null} ALTER INDEX search_idx SET (layer_sizes = '100KB, 1MB'); ``` Setting `layer_sizes` to `0` disables foreground merging, and setting `background_layer_sizes` to `0` disables background merging. 
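For example, to rely entirely on background merging, foreground merging can be turned off (a minimal sketch, assuming the option accepts `'0'` in the same string form as the layer sizes shown above):

```sql theme={null}
-- Disable foreground merging; segments are merged only by the background process
ALTER INDEX search_idx SET (layer_sizes = '0');
```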
## Increase Work Memory for Bulk Updates `work_mem` controls how much memory to allocate to a single `INSERT`/`UPDATE`/`COPY` statement. Each statement that writes to a BM25 index is required to have at least `15MB` memory. If `work_mem` is below `15MB`, it will be ignored and `15MB` will be used. If your typical update pattern is large bulk updates (not single-row updates), a larger value may be better. ```sql theme={null} SET work_mem = '64MB'; ``` Since many write operations can be running concurrently, this value should be raised more conservatively than `maintenance_work_mem`. ## Increase Mutable Segment Size The `mutable_segment_rows` setting enables use of mutable segments, which buffer new rows in order to amortize the cost of indexing them. By default, it is set to `1000`, which means that 1000 writes are buffered before being flushed. ```sql theme={null} ALTER INDEX search_idx SET (mutable_segment_rows = 1000); ``` A higher value generally improves write throughput at the expense of read performance, since the mutable data structure is slower to search. Additionally, the mutable data structure is read into memory, so higher values cause reads to consume more RAM. Alternatively, this setting can be set to apply to all indexes in the database: ```sql theme={null} SET paradedb.global_mutable_segment_rows = 1000; ``` If both a per-index setting and global setting exist, the global `paradedb.global_mutable_segment_rows` will be used. To ignore the global setting, set `paradedb.global_mutable_segment_rows` to `-1` (this is the default). ```sql theme={null} SET paradedb.global_mutable_segment_rows = -1; ``` # All Source: https://docs.paradedb.com/documentation/query-builder/compound/all Search all rows in the index The all query means "search all rows in the index." The primary use case for the all query is to force the query to be executed by the ParadeDB index instead of Postgres' other execution methods. Because ParadeDB executes a query only when a ParadeDB operator is present in the query, the all query injects an operator into the query without changing the query's meaning. To use it, pass the [key field](/documentation/indexing/create-index#choosing-a-key-field) to the left-hand side of `@@@` and `pdb.all()` to the right-hand side. ```sql theme={null} -- Top N executed by standard Postgres SELECT * FROM mock_items WHERE rating IS NOT NULL ORDER BY rating LIMIT 5; -- Top N executed by ParadeDB SELECT * FROM mock_items WHERE rating IS NOT NULL AND id @@@ pdb.all() ORDER BY rating LIMIT 5; ``` This is useful for cases where queries that don't contain a ParadeDB operator can be more efficiently executed by ParadeDB vs. standard Postgres, like [Top N](/documentation/sorting/topn) or [aggregate](/documentation/aggregates/overview) queries. # Query Parser Source: https://docs.paradedb.com/documentation/query-builder/compound/query-parser Accept raw user-provided query strings The parse query accepts a [Tantivy query string](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html). The intended use case is for accepting raw query strings provided by the end user. To use it, pass the [key field](/documentation/indexing/create-index#choosing-a-key-field) to the left-hand side of `@@@` and `pdb.parse('<query string>')` to the right-hand side.
```sql theme={null} SELECT description, rating, category FROM mock_items WHERE id @@@ pdb.parse('description:(sleek shoes) AND rating:>3'); ``` Please refer to the [Tantivy docs](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) for an overview of the query string language. ## Lenient Parsing By default, strict syntax parsing is used. This means that if any part of the query does not conform to Tantivy’s query string syntax, the query fails. For instance, a valid field name must be provided before every query (i.e. `category:footwear`). By setting `lenient` to `true`, the query is executed on a best-effort basis. For example, if no field names are provided, the query is executed over all fields in the index. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE id @@@ pdb.parse('description:(sleek shoes) AND rating:>3', lenient => true); ``` ## Conjunction Mode By default, terms in the query string are `OR`ed together. With `conjunction_mode` set to `true`, they are instead `AND`ed together. For instance, the following query returns documents containing both `sleek` and `shoes`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE id @@@ pdb.parse('description:(sleek shoes)', conjunction_mode => true); ``` # How Advanced Query Functions Work Source: https://docs.paradedb.com/documentation/query-builder/overview ParadeDB's query builder functions provide advanced query types In addition to basic [match](/documentation/full-text/match), [phrase](/documentation/full-text/phrase), and [term](/documentation/full-text/term) queries, additional advanced query types are exposed as query builder functions. Query builder functions use the `@@@` operator. `@@@` takes a column on the left-hand side and a query builder function on the right-hand side. It means "find all rows where the column matches the given query." For example: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.regex('key.*rd'); ``` ```ini Expected Response theme={null} description | rating | category --------------------------+--------+------------- Ergonomic metal keyboard | 4 | Electronics Plastic Keyboard | 4 | Electronics (2 rows) ``` This uses the [regex](/documentation/query-builder/term/regex) builder function to match all rows where `description` matches the regex expression `key.*rd`. # Phrase Prefix Source: https://docs.paradedb.com/documentation/query-builder/phrase/phrase-prefix Finds documents containing a phrase followed by a term prefix Phrase prefix identifies documents containing a phrase followed by a term prefix. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.phrase_prefix(ARRAY['running', 'sh']); ```
The phrase prefix query takes an `ARRAY` of tokens to match, where the final element is treated as a term prefix rather than a complete term. The `max_expansions` option limits the number of term variations that the prefix can expand to during the search. This helps in controlling the breadth of the search by setting a cap on how many different terms the prefix can match. ## Performance Considerations Expanding a prefix might lead to thousands of matching terms, which impacts search times. With `max_expansions`, the prefix term is expanded to at most `max_expansions` terms in lexicographic order. For instance, if `sh` matches `shall`, `share`, `shoe`, and `shore` but `max_expansions` is set to 3, `sh` will only be expanded to `shall`, `share`, and `shoe`. # Regex Phrase Source: https://docs.paradedb.com/documentation/query-builder/phrase/regex-phrase Matches a specific sequence of regex queries Regex phrase matches a specific sequence of regex queries. Think of it like a conjunction of [regex](/documentation/query-builder/term/regex) queries, with positions and ordering of tokens enforced. For example, the regex phrase query for `ru.* shoes` will match `running shoes`, but will not match `shoes running`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.regex_phrase(ARRAY['ru.*', 'shoes']); ```
The regex phrase query takes an `ARRAY` of regex expressions that form the search phrase. These expressions must appear in the specified order within the document for a match to occur, although some flexibility is allowed based on the `slop` parameter: a slop of `0` requires the terms to appear exactly as they are in the phrase and adjacent to each other, while higher slop values allow for transpositions and distance between terms. Please see [regex](/documentation/query-builder/term/regex) for allowed regex constructs. The total number of terms that the regex phrase query can expand to is also capped; if this number is exceeded, an error will be returned. # More Like This Source: https://docs.paradedb.com/documentation/query-builder/specialized/more-like-this Finds documents that are "like" another document. The more like this (MLT) query finds documents that are "like" another document. To use this query, pass the [key field](/documentation/indexing/create-index#choosing-a-key-field) value of the input document to `pdb.more_like_this`. For instance, the following query finds documents that are "like" a document with an `id` of `3`: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3) ORDER BY id; ``` ```ini Expected Response theme={null} id | description | rating | category ----+----------------------+--------+---------- 3 | Sleek running shoes | 5 | Footwear 4 | White jogging shoes | 3 | Footwear 5 | Generic shoes | 4 | Footwear 13 | Sturdy hiking boots | 4 | Footwear 23 | Comfortable slippers | 3 | Footwear 33 | Winter woolen socks | 5 | Footwear (6 rows) ``` In the output above, notice that documents matching any of the indexed fields, `description`, `rating`, and `category`, were returned. This is because, by default, all fields present in the index are considered for matching. The only exception is JSON fields, which are not yet supported and are ignored by the more like this query. To find only documents that match on specific fields, provide an array of field names as the second argument: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, ARRAY['description']) ORDER BY id; ``` ```ini Expected Response theme={null} id | description | rating | category ----+---------------------+--------+---------- 3 | Sleek running shoes | 5 | Footwear 4 | White jogging shoes | 3 | Footwear 5 | Generic shoes | 4 | Footwear (3 rows) ``` Because JSON fields are not yet supported for MLT, an error will be returned if a JSON field is passed into the array. ## How It Works Let's look at how the MLT query works under the hood: 1. Stored values for the input document's fields are retrieved. If they are text fields, they are tokenized and filtered in the same way as the field was during [index creation](/documentation/indexing/create-index). 2. A set of representative terms is created from the input document. For example, in the statement above, these terms would be `sleek`, `running`, and `shoes` for the `description` field; `5` for the `rating` field; `footwear` for the `category` field. 3. Documents with at least one term match across any of the fields are considered a match. ## Using a Custom Input Document In addition to providing a key field value, a custom document can also be provided as JSON. The JSON keys are field names and must correspond to field names in the index.
```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this('{"description": "Sleek running shoes", "category": "footwear"}') ORDER BY id; ``` ## Configuration Options ### Term Frequency `min_term_frequency` excludes terms that appear fewer than a certain number of times in the input document, while `max_term_frequency` excludes terms that appear more than that many times. By default, no terms are excluded based on term frequency. For instance, the following query returns no results because no term appears twice in the input document. ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, min_term_frequency => 2) ORDER BY id; ``` ### Document Frequency `min_doc_frequency` excludes terms that appear in fewer than a certain number of documents across the entire index, while `max_doc_frequency` excludes terms that appear in more than that many documents. By default, no terms are excluded based on document frequency. ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, min_doc_frequency => 3) ORDER BY id; ``` ### Max Query Terms By default, only the top 25 terms across all fields are considered for matching. Terms are scored using a combination of inverse document frequency and term frequency (TF-IDF) -- this means that terms that appear frequently in the input document and are rare across the index score the highest. This can be configured with `max_query_terms`: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, max_query_terms => 10) ORDER BY id; ``` ### Term Length `min_word_length` and `max_word_length` can be used to exclude terms that are too short or too long, respectively. By default, no terms are excluded based on length. ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, min_word_length => 5) ORDER BY id; ``` ### Custom Stopwords To exclude terms from being considered, provide a text array to `stopwords`: ```sql theme={null} SELECT id, description, rating, category FROM mock_items WHERE id @@@ pdb.more_like_this(3, stopwords => ARRAY['the', 'a']) ORDER BY id; ``` # Range Term Source: https://docs.paradedb.com/documentation/query-builder/term/range-term Filters over Postgres range types `range_term` is the equivalent of Postgres' operators over [range types](https://www.postgresql.org/docs/current/rangetypes.html). It supports operations like range containment, overlap, and intersection. ## Term Within In this example, `weight_range` is an `int4range` type. 
The following query finds all rows where `weight_range` contains `1`: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term(1); ``` ## Range Intersects The following query finds all ranges that share at least one common point with the query range: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term('(10, 12]'::int4range, 'Intersects'); ``` ## Range Contains The following query finds all ranges that are contained by the query range: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term('(3, 9]'::int4range, 'Contains'); ``` ## Range Within The following query finds all ranges that contain the query range: ```sql theme={null} SELECT id, weight_range FROM mock_items WHERE weight_range @@@ pdb.range_term('(2, 11]'::int4range, 'Within'); ``` # Regex Source: https://docs.paradedb.com/documentation/query-builder/term/regex Searches for terms that match a regex pattern Regex queries search for terms that follow a pattern. For example, the wildcard pattern `key.*` finds all terms that start with `key`. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description @@@ pdb.regex('key.*'); ``` ParadeDB supports all regex constructs of the Rust [regex](https://docs.rs/regex/latest/regex/) crate, with the following exceptions: 1. Lazy quantifiers such as `+?` 2. Word boundaries such as `\b` Otherwise, the full syntax of the [regex](https://docs.rs/regex/latest/regex/) crate is supported, including all Unicode support and relevant flags. A list of regex flags and grouping options can be [found here](https://docs.rs/regex/latest/regex/#grouping-and-flags), which includes: * named and numbered capture groups * case insensitivity flag (`i`) * multi-line mode (`m`) Regex queries operate at the token level. To execute regex over the original text, use the keyword tokenizer. ## Performance Considerations During a regex query, ParadeDB doesn't scan through every single word. Instead, it uses a highly optimized structure called a [finite state transducer (FST)](https://en.wikipedia.org/wiki/Finite-state_transducer) that makes it possible to jump straight to the matching terms. Even if the index contains millions of words, the regex query only looks at the ones that have a chance of matching, skipping everything else. This is why certain regex constructs are not supported -- they are difficult to implement efficiently. # Relevance Tuning Source: https://docs.paradedb.com/documentation/sorting/boost Tune the BM25 score by adjusting the weights of individual queries ## Boosting ParadeDB offers several ways to tune a document's [BM25 score](/documentation/sorting/score). The first is boosting, which increases or decreases the impact of a specific query by multiplying its contribution to the overall BM25 score. To boost a query, cast the query to the `boost` type. In this example, the `shoes` query is weighted twice as heavily as the `footwear` query. ```sql theme={null} SELECT id, pdb.score(id), description, category FROM mock_items WHERE description ||| 'shoes'::pdb.boost(2) OR category ||| 'footwear' ORDER BY score DESC LIMIT 5; ``` `boost` takes a numeric value, which is the multiplicative boost factor. It can be any floating point number between `-2048` and `2048`.
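A factor below `1` can likewise be used to de-emphasize a query. The following sketch mirrors the example above but weights the `shoes` query at half its normal contribution:

```sql theme={null}
-- Down-weight the `shoes` query with a fractional boost
SELECT id, pdb.score(id), description, category
FROM mock_items
WHERE description ||| 'shoes'::pdb.boost(0.5) OR category ||| 'footwear'
ORDER BY score DESC
LIMIT 5;
```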
[Query builder functions](/documentation/query-builder/overview) can also be boosted: ```sql theme={null} SELECT id, description, category, pdb.score(id) FROM mock_items WHERE description @@@ pdb.regex('key.*')::pdb.boost(2) ORDER BY score DESC LIMIT 5; ``` Boost can be used in conjunction with other type casts, like [fuzzy](/documentation/full-text/fuzzy): ```sql theme={null} SELECT id, description, category, pdb.score(id) FROM mock_items WHERE description ||| 'shose'::pdb.fuzzy(2)::pdb.boost(2) ORDER BY score DESC LIMIT 5; ``` ## Constant Scoring Constant scoring assigns the same score to all documents that match a query. To apply a constant score, cast the query to the `const` type with a numeric value. For instance, the following query assigns a score of `1` to all documents matching the query `shoes`. ```sql theme={null} SELECT id, pdb.score(id), description, category FROM mock_items WHERE description ||| 'shoes'::pdb.const(1) ORDER BY score DESC LIMIT 5; ``` # BM25 Scoring Source: https://docs.paradedb.com/documentation/sorting/score BM25 scores sort the result set by relevance BM25 scores measure how relevant a document is for a given query. Higher scores indicate higher relevance. ## Basic Usage The `pdb.score()` function produces a BM25 score and can be added to any query where any of the ParadeDB operators are present. ```sql theme={null} SELECT id, pdb.score(id) FROM mock_items WHERE description ||| 'shoes' ORDER BY pdb.score(id) LIMIT 5; ``` In order for a field to be factored into the BM25 score, it must be present in the BM25 index. For instance, consider this query: ```sql theme={null} SELECT id, pdb.score(id) FROM mock_items WHERE description ||| 'keyboard' OR rating < 2 ORDER BY pdb.score(id) LIMIT 5; ``` While BM25 scores will be returned as long as `description` is indexed, including `rating` in the BM25 index definition will allow results matching `rating < 2` to rank higher than those that do not match. ## Joined Scores First, let's create a second table called `orders` that can be joined with `mock_items`: ```sql theme={null} CALL paradedb.create_bm25_test_table( schema_name => 'public', table_name => 'orders', table_type => 'Orders' ); ALTER TABLE orders ADD CONSTRAINT foreign_key_product_id FOREIGN KEY (product_id) REFERENCES mock_items(id); CREATE INDEX orders_idx ON orders USING bm25 (order_id, product_id, order_quantity, order_total, customer_name) WITH (key_field = 'order_id'); ``` Next, let's compute a "combined BM25 score" over a join across both tables. ```sql theme={null} SELECT o.order_id, o.customer_name, m.description, pdb.score(o.order_id) + pdb.score(m.id) as score FROM orders o JOIN mock_items m ON o.product_id = m.id WHERE o.customer_name ||| 'Johnson' AND m.description ||| 'running shoes' ORDER BY score DESC, o.order_id LIMIT 5; ``` ## Score Refresh The scores generated by the BM25 index may be influenced by dead rows that have not been cleaned up by the `VACUUM` process. Running `VACUUM` on the underlying table will remove all dead rows from the index and ensure that only rows visible to the current transaction are factored into the BM25 score. ```sql theme={null} VACUUM mock_items; ``` This can be automated with [autovacuum](/documentation/performance-tuning/overview). # Top N Source: https://docs.paradedb.com/documentation/sorting/topn ParadeDB is optimized for quickly finding the "Top N" results in a table ParadeDB is highly optimized for quickly returning the Top N results out of the index.
In SQL, this means queries that contain an `ORDER BY...LIMIT`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` In order for a Top N query to be executed by ParadeDB vs. vanilla Postgres, all of the following conditions must be met: 1. All `ORDER BY` fields must be indexed. If they are text fields, they [must use the literal tokenizer](#sorting-by-text). 2. At least one ParadeDB text search operator must be present at the same level as the `ORDER BY...LIMIT`. 3. The query must have a `LIMIT`. 4. With the exception of `lower`, ordering by expressions is not supported -- only the raw fields themselves. To verify that ParadeDB is executing the Top N, look for a `Custom Scan` with a `TopNScanExecState` in the `EXPLAIN` output: ```sql theme={null} EXPLAIN SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` ```csv theme={null} QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=10.00..10.02 rows=3 width=552) -> Custom Scan (ParadeDB Scan) on mock_items (cost=10.00..10.02 rows=3 width=552) Table: mock_items Index: search_idx Segment Count: 1 Exec Method: TopNScanExecState Scores: false TopN Order By: rating asc TopN Limit: 5 Tantivy Query: {"with_index":{"query":{"match":{"field":"description","value":"running shoes","tokenizer":null,"distance":null,"transposition_cost_one":null,"prefix":null,"conjunction_mode":false}}}} (10 rows) ``` If any of the above conditions are not met, the query cannot be fully optimized and you will not see a `TopNScanExecState` in the `EXPLAIN` output. ## Tiebreaker Sorting To guarantee stable sorting in the event of a tie, additional columns can be provided to `ORDER BY`: ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating, id LIMIT 5; ``` ParadeDB is currently able to handle 3 `ORDER BY` columns. If there are more than 3 columns, the `ORDER BY` will not be efficiently executed by ParadeDB. ## Sorting by Text If a text field is present in the `ORDER BY` clause, it must be indexed with the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer. Sorting by lowercase text using `lower()` is also supported. To enable this, the expression `lower()` must be indexed with either the literal or literal normalized tokenizer. See [indexing expressions](/documentation/indexing/indexing-expressions) for more information. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (lower(description)::pdb.literal)) WITH (key_field='id'); ``` This allows sorting by lowercase to be optimized. ```sql theme={null} SELECT description, rating, category FROM mock_items WHERE description ||| 'sleek running shoes' ORDER BY lower(description) LIMIT 5; ``` ## Sorting by JSON Ordering by JSON subfield is on the roadmap but not yet supported. 
For example, this query will not receive an optimized Top N scan: ```sql theme={null} SELECT id, description, metadata FROM mock_items WHERE description ||| 'sleek running shoes' ORDER BY metadata->'weight' LIMIT 5; ``` # Alpha Numeric Only Source: https://docs.paradedb.com/documentation/token-filters/alphanumeric Removes any tokens that contain characters that are not ASCII letters The alpha numeric only filter removes any tokens that contain characters that are not ASCII letters (i.e. `a` to `z` and `A` to `Z`) or digits (i.e. `0` to `9`). It is supported for all tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To enable, append `alpha_num_only=true` to the tokenizer's arguments. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('alpha_num_only=true'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'The café at 9pm!'::pdb.simple::text[], 'The café at 9pm!'::pdb.simple('alpha_num_only=true')::text[]; ``` ```ini Expected Response theme={null} text | text -------------------+-------------- {the,café,at,9pm} | {the,at,9pm} (1 row) ``` # ASCII Folding Source: https://docs.paradedb.com/documentation/token-filters/ascii-folding Strips away diacritical marks like accents The ASCII folding filter strips away diacritical marks (accents, umlauts, tildes, etc.) while leaving the base character intact. It is supported for all tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To enable, append `ascii_folding=true` to the tokenizer's arguments. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('ascii_folding=true'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'Café naïve coöperate'::pdb.simple::text[], 'Café naïve coöperate'::pdb.simple('ascii_folding=true')::text[]; ``` ```ini Expected Response theme={null} text | text ------------------------+------------------------ {café,naïve,coöperate} | {cafe,naive,cooperate} (1 row) ``` # Lowercase Source: https://docs.paradedb.com/documentation/token-filters/lowercase Converts all characters to lowercase The lowercase filter converts all characters to lowercase, allowing for case-insensitive queries. It is enabled by default but can be configured for all tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To disable, append `lowercase=false` to the tokenizer's arguments: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('lowercase=false'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'Tokenize me!'::pdb.simple::text[], 'Tokenize me!'::pdb.simple('lowercase=false')::text[]; ``` ```ini Expected Response theme={null} text | text ---------------+--------------- {tokenize,me} | {Tokenize,me} (1 row) ``` # How Token Filters Work Source: https://docs.paradedb.com/documentation/token-filters/overview Token filters apply additional processing to tokens like lowercasing or stemming After a [tokenizer](/documentation/tokenizers/overview) splits up text into tokens, token filters apply additional processing to each token. 
Common examples include [stemming](/documentation/token-filters/stemming) to reduce words to their root form, or [ASCII folding](/documentation/token-filters/ascii-folding) to remove accents. Token filters can be added to any tokenizer besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer, which by definition must preserve the source text exactly. To add a token filter to a tokenizer, append a configuration string to the argument list: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stemmer=english', 'ascii_folding=true'))) WITH (key_field='id'); ``` # Stemmer Source: https://docs.paradedb.com/documentation/token-filters/stemming Reduces words to their root form for a given language Stemming is the process of reducing words to their root form. In English, for example, the root form of "running" and "runs" is "run". Stemming can be configured for any tokenizer besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer. To set a stemmer, append `stemmer=` to the tokenizer's arguments. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stemmer=english'))) WITH (key_field='id'); ``` Valid languages are `arabic`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hungarian`, `italian`, `norwegian`, `polish`, `portuguese`, `romanian`, `russian`, `spanish`, `swedish`, `tamil`, and `turkish`. To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'I am running'::pdb.simple::text[], 'I am running'::pdb.simple('stemmer=english')::text[]; ``` ```ini Expected Response theme={null} text | text ----------------+------------ {i,am,running} | {i,am,run} (1 row) ``` # Remove Stopwords Source: https://docs.paradedb.com/documentation/token-filters/stopwords Remove language-specific stopwords from the index Stopwords are words that are so common or semantically insignificant in most contexts that they can be ignored during indexing. In English, for example, stopwords include "a", "and", "or", etc. All tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer can be configured to automatically remove stopwords for one or more languages. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stopwords_language=english'))) WITH (key_field='id'); ``` Valid languages are `Danish`, `Dutch`, `English`, `Finnish`, `French`, `German`, `Hungarian`, `Italian`, `Norwegian`, `Polish`, `Portuguese`, `Russian`, `Spanish`, and `Swedish`. Language names are case-insensitive. 
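For instance, because language names are case-insensitive, the following two casts should produce the same stopword-filtered tokens (a quick sketch using the cast syntax from the examples below):

```sql theme={null}
-- Sketch: 'english' and 'ENGLISH' configure the same English stopword list
SELECT 'The cat in the hat'::pdb.simple('stopwords_language=english')::text[],
       'The cat in the hat'::pdb.simple('stopwords_language=ENGLISH')::text[];
```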
## Multiple Languages For documents containing multiple languages, you can specify multiple stopword languages as a comma-separated list: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('stopwords_language=English,French'))) WITH (key_field='id'); ``` ```sql theme={null} SELECT 'the quick fox and le renard et'::pdb.simple('stopwords_language=English,French')::text[]; ``` ```ini Expected Response theme={null} text -------------------- {quick,fox,renard} (1 row) ``` ## Example To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'The cat in the hat'::pdb.simple::text[], 'The cat in the hat'::pdb.simple('stopwords_language=English')::text[]; ``` ```ini Expected Response theme={null} text | text ----------------------+----------- {the,cat,in,the,hat} | {cat,hat} (1 row) ``` # Token Length Source: https://docs.paradedb.com/documentation/token-filters/token-length Remove tokens that are above or below a certain byte length from the index The token length filter automatically removes tokens that are above or below a certain length in bytes. To remove all tokens longer than a certain length, append a `remove_long` configuration to the tokenizer: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('remove_long=100'))) WITH (key_field='id'); ``` To remove all tokens shorter than a length, use `remove_short`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple('remove_short=3'))) WITH (key_field='id'); ``` All tokenizers besides the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer accept these configurations. To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT 'A supersupersuperlong token'::pdb.simple::text[], 'A supersupersuperlong token'::pdb.simple('remove_short=2', 'remove_long=10')::text[]; ``` ```ini Expected Response theme={null} text | text -------------------------------+--------- {a,supersupersuperlong,token} | {token} (1 row) ``` # Trim Source: https://docs.paradedb.com/documentation/token-filters/trim Remove trailing and leading whitespace from a token The trim filter removes leading and trailing whitespace from a token (but not whitespace in the middle). If a token consists entirely of whitespace, the token is eliminated entirely. This filter is useful for tokenizers that don't already split on whitespace, like the [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer or certain language-specific tokenizers. 
```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.literal_normalized('trim=true'))) WITH (key_field='id'); ``` To demonstrate this token filter, let's compare the output of the following two statements: ```sql theme={null} SELECT ' token with whitespace '::pdb.literal_normalized::text[], ' token with whitespace '::pdb.literal_normalized('trim=true')::text[]; ``` ```ini Expected Response theme={null} text | text ----------------------------------+--------------------------- {" token with whitespace "} | {"token with whitespace"} (1 row) ``` # Chinese Compatible Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/chinese-compatible A simple tokenizer for Chinese, Japanese, and Korean characters The Chinese compatible tokenizer is like the [simple](/documentation/tokenizers/available-tokenizers/simple) tokenizer -- it lowercases non-CJK characters and splits on any non-alphanumeric character. Additionally, it treats each CJK character as its own token. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.chinese_compatible)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 你好!'::pdb.chinese_compatible::text[]; ``` ```ini Expected Response theme={null} text --------------------- {hello,world,你,好} (1 row) ``` # ICU Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/icu Splits text according to the Unicode standard The ICU (International Components for Unicode) tokenizer breaks down text according to the Unicode standard. It can be used to tokenize most languages and recognizes the nuances in word boundaries across different languages. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.icu)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 你好!'::pdb.icu::text[]; ``` ```ini Expected Response theme={null} text -------------------- {hello,world,你好} (1 row) ``` # Jieba Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/jieba The most advanced Chinese tokenizer that leverages both a dictionary and statistical models The Jieba tokenizer is a tokenizer for Chinese text that leverages both a dictionary and statistical models. It is generally considered to be better at identifying ambiguous Chinese word boundaries compared to the [Chinese Lindera](/documentation/tokenizers/available-tokenizers/lindera) and [Chinese compatible](/documentation/tokenizers/available-tokenizers/chinese-compatible) tokenizers, but the tradeoff is that it is slower. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.jieba)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 
你好!'::pdb.jieba::text[]; ``` ```ini Expected Response theme={null} text -------------------------------- {hello," ",world,!," ",你好,!} (1 row) ``` # Lindera Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/lindera Uses prebuilt dictionaries to tokenize Chinese, Japanese, and Korean text The Lindera tokenizer is a more advanced CJK tokenizer that uses prebuilt Chinese, Japanese, or Korean dictionaries to break text into meaningful tokens (words or phrases) rather than on individual characters. Chinese Lindera uses the CC-CEDICT dictionary, Korean Lindera uses the KoDic dictionary, and Japanese Lindera uses the IPADIC dictionary. By default, non-CJK text is lowercased, but punctuation and whitespace are not ignored. ```sql Chinese Lindera theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.lindera(chinese))) WITH (key_field='id'); ``` ```sql Korean Lindera theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.lindera(korean))) WITH (key_field='id'); ``` ```sql Japanese Lindera theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.lindera(japanese))) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world! 你好!'::pdb.lindera(chinese)::text[]; ``` ```ini Expected Response theme={null} text -------------------------------- {hello," ",world,!," ",你好,!} (1 row) ``` # Literal Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/literal Indexes the text in its raw form, without any splitting or processing The literal tokenizer is not ideal for text search queries like [match](/documentation/full-text/match) or [phrase](/documentation/full-text/phrase). If you need to do text search over a field that is literal tokenized, consider using [multiple tokenizers](/documentation/tokenizers/multiple-per-field). Because the literal tokenizer preserves the source text exactly, [token filters](/documentation/token-filters/overview) cannot be configured for this tokenizer. The literal tokenizer applies no tokenization to the text, preserving it as-is. It is the default for `uuid` fields (since exact UUID matching is a common use case), and is useful for doing exact string matching over text fields. It is also required if the text field is used as a sort field in a [Top N](/documentation/sorting/topn) query, or as part of an [aggregate](/documentation/aggregates/overview). ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.literal)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.literal::text[]; ``` ```ini Expected Response theme={null} text ------------------ {"Tokenize me!"} (1 row) ``` # Literal Normalized Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/literal-normalized Like the literal tokenizer, but allows for token filters For all patch versions greater than `0.20.8` in the `20` minor version, and all patch versions greater than `0.21.4` in the `21` minor version, fields using the [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer are also columnar indexed. This means that they can be used in [aggregates](/documentation/aggregates/overview) and [Top N queries](/documentation/sorting/topn). 
Indexes created prior to these versions must be reindexed to use this feature. The literal normalized tokenizer is similar to the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer in that it does not split the source text. All text is treated as a single token, regardless of how many words it contains. However, unlike the literal tokenizer, this tokenizer allows [token filters](/documentation/token-filters/overview) to be applied. By default, the literal normalized tokenizer also [lowercases](/documentation/token-filters/lowercase) the text. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.literal_normalized)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.literal_normalized::text[]; ``` ```ini Expected Response theme={null} text ------------------ {"tokenize me!"} (1 row) ``` # Ngram Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/ngrams Splits text into small chunks called grams, useful for partial matching The ngram tokenizer splits text into "grams," where each "gram" is of a certain length. The tokenizer takes two arguments. The first is the minimum character length of a "gram," and the second is the maximum character length. Grams will be generated for all sizes between the minimum and maximum gram size, inclusive. For example, `pdb.ngram(2,5)` will generate tokens of size `2`, `3`, `4`, and `5`. To generate grams of a single fixed length, set the minimum and maximum gram size equal to each other. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.ngram(3,3))) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.ngram(3,3)::text[]; ``` ```ini Expected Response theme={null} text ------------------------------------------------- {tok,oke,ken,eni,niz,ize,"ze ","e m"," me",me!} (1 row) ``` ## Ngram Prefix Only To generate ngram tokens for only the first `n` characters in the text, set `prefix_only` to `true`. ```sql theme={null} SELECT 'Tokenize me!'::pdb.ngram(3,3,'prefix_only=true')::text[]; ``` ```ini Expected Response theme={null} text ------- {tok} (1 row) ``` ## Phrase and Proximity Queries with Ngram Because multiple ngram tokens can overlap, the ngram tokenizer does not store token positions. As a result, queries that rely on token positions like [phrase](/documentation/full-text/phrase), [phrase prefix](/documentation/query-builder/phrase/phrase-prefix), [regex phrase](/documentation/query-builder/phrase/regex-phrase), and [proximity](/documentation/full-text/proximity) are not supported over ngram-tokenized fields. An exception is if the min gram size equals the max gram size, which guarantees unique token positions. In this case, setting `positions=true` enables these queries. ```sql theme={null} SELECT 'Tokenize me!'::pdb.ngram(3,3,'positions=true')::text[]; ``` ### Exact Substring Matching with Phrase Queries With `positions=true`, [phrase queries](/documentation/full-text/phrase) over ngram fields perform exact substring matching. This is faster than using [match conjunction](/documentation/full-text/match#match-conjunction) on an ngram field, which creates a `Must` clause for every ngram token and intersects them independently. A phrase query uses a single positional intersection instead.
The tradeoff is that phrase queries are stricter: they require tokens at consecutive positions within a single field value, while match conjunction only requires all tokens to appear somewhere in the document. ```sql theme={null} CREATE TABLE books (id SERIAL PRIMARY KEY, titles TEXT[]); INSERT INTO books (titles) VALUES (ARRAY['The Dragon Hatchling', 'Wings of Gold']), (ARRAY['Dragon Slayer', 'Hatchling Care']); CREATE INDEX ON books USING bm25 (id, (titles::pdb.ngram(4,4,'positions=true'))) WITH (key_field='id'); -- Phrase: matches exact substring "Dragon Hatchling" — only row 1 SELECT * FROM books WHERE titles ### 'Dragon Hatchling'; -- Match conjunction: matches all ngrams anywhere — also only row 1 here, -- but on larger datasets could match rows where the ngrams are scattered SELECT * FROM books WHERE titles ||| 'Dragon Hatchling'; DROP TABLE books; ``` When constructing queries as [JSON](/documentation/query-builder/json), use `tokenized_phrase` to achieve the same result as the `###` operator. It tokenizes the input string with the field's tokenizer and builds a phrase query from the resulting tokens: ```json theme={null} { "tokenized_phrase": { "field": "titles", "phrase": "Dragon Hatchling" } } ``` # Regex Patterns Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/regex Tokenizes text using a regular expression The `regex_pattern` tokenizer tokenizes text using a regular expression. The regular expression can be specified with the pattern parameter. For instance, the following tokenizer creates tokens only for words starting with the letter `h`: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.regex_pattern('(?i)\bh\w*'))) WITH (key_field='id'); ``` The regex tokenizer uses the Rust [regex](https://docs.rs/regex/latest/regex/) crate, which supports all regex constructs with the following exceptions: 1. Lazy quantifiers such as `+?` 2. Word boundaries such as `\b` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Hello world!'::pdb.regex_pattern('(?i)\bh\w*')::text[]; ``` ```ini Expected Response theme={null} text --------- {hello} (1 row) ``` # Simple Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/simple Splits on any non-alphanumeric character The simple tokenizer splits on any non-alphanumeric character (e.g. whitespace, punctuation, symbols). All characters are [lowercased](/documentation/token-filters/lowercase) by default. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.simple)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.simple::text[]; ``` ```ini Expected Response theme={null} text --------------- {tokenize,me} (1 row) ``` # Source Code Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/source-code Tokenizes text that is actually code The source code tokenizer is intended for tokenizing code. In addition to splitting on whitespace, punctuation, and symbols, it also splits on common casing conventions like camel case and snake case. For instance, text like `my_variable` or `myVariable` would get split into `my` and `variable`. 
```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.source_code)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'let my_variable = 2;'::pdb.source_code::text[]; ``` ```ini Expected Response theme={null} text --------------------- {let,my,variable,2} (1 row) ``` # Unicode Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/unicode The default text tokenizer in ParadeDB The unicode tokenizer splits text according to word boundaries defined by the [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/) rules. All characters are [lowercased](/documentation/token-filters/lowercase) by default. This tokenizer is the default text tokenizer. If no tokenizer is specified for a text field, the unicode tokenizer will be used (unless the text field is the [key field](/documentation/indexing/create-index#choosing-a-key-field), in which case the text is not tokenized). ```sql theme={null} -- The following two configurations are equivalent CREATE INDEX search_idx ON mock_items USING bm25 (id, description) WITH (key_field='id'); CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.unicode_words)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.unicode_words::text[]; ``` ```ini Expected Response theme={null} text --------------- {tokenize,me} (1 row) ``` ## Remove Emojis By default, emojis in the source text are preserved. To remove emojis, set `remove_emojis` to `true`. ```sql theme={null} SELECT 'Tokenize me! 😊'::pdb.unicode_words('remove_emojis=true')::text[]; ``` ```ini Expected Response theme={null} text --------------- {tokenize,me} (1 row) ``` # Whitespace Source: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/whitespace Tokenizes text by splitting on whitespace The whitespace tokenizer splits only on whitespace. It also [lowercases](/documentation/token-filters/lowercase) characters by default. ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.whitespace)) WITH (key_field='id'); ``` To get a feel for this tokenizer, run the following command and replace the text with your own: ```sql theme={null} SELECT 'Tokenize me!'::pdb.whitespace::text[]; ``` ```ini Expected Response theme={null} text ---------------- {tokenize,me!} (1 row) ``` # Multiple Tokenizers Per Field Source: https://docs.paradedb.com/documentation/tokenizers/multiple-per-field Apply different token configurations to the same field In many cases, a text field needs to be tokenized multiple ways. For instance, using the [unicode](/documentation/tokenizers/available-tokenizers/unicode) tokenizer for search, and the [literal](/documentation/tokenizers/available-tokenizers/literal) tokenizer for [Top N ordering](/documentation/sorting/topn). To tokenize a field in more than one way, append an `alias=` argument to the additional tokenizer configurations. The alias name can be any string you like. For instance, the following statement tokenizes `description` using both the simple and literal tokenizers. 
```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 ( id, (description::pdb.literal), (description::pdb.simple('alias=description_simple')) ) WITH (key_field='id'); ``` Under the hood, two distinct fields are created in the index: a field called `description`, which uses the literal tokenizer, and an aliased field called `description_simple`, which uses the simple tokenizer. To query against the aliased field, cast it to `pdb.alias('alias_name')`: ```sql theme={null} -- Query against `description_simple` SELECT description, rating, category FROM mock_items WHERE description::pdb.alias('description_simple') ||| 'Sleek running shoes'; -- Query against `description` SELECT description, rating, category FROM mock_items WHERE description ||| 'Sleek running shoes'; ``` If a text field uses multiple tokenizers and one of them is [literal](/documentation/tokenizers/available-tokenizers/literal), we recommend aliasing the other tokenizers and leaving the literal tokenizer un-aliased. This is so queries that `GROUP BY`, `ORDER BY`, or aggregate the text field can reference the field directly: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 ( id, (description::pdb.literal), (description::pdb.simple('alias=description_simple')) ) WITH (key_field='id'); SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' ORDER BY description LIMIT 5; ``` # How Tokenizers Work Source: https://docs.paradedb.com/documentation/tokenizers/overview Tokenizers split large chunks of text into small, searchable units called tokens Before text is indexed, it is first split into searchable units called tokens. The default tokenizer in ParadeDB is the [unicode tokenizer](/documentation/tokenizers/available-tokenizers/unicode). It splits text according to word boundaries defined by the Unicode Standard Annex #29 rules. All characters are lowercased by default. To visualize how this tokenizer works, you can cast a text string to the tokenizer type, and then to `text[]`: ```sql theme={null} SELECT 'Hello world!'::pdb.simple::text[]; ``` ```ini Expected Response theme={null} text --------------- {hello,world} (1 row) ``` On the other hand, the [ngrams](/documentation/tokenizers/available-tokenizers/ngrams) tokenizer splits text into "grams" of size `n`. In this example, `n = 3`: ```sql theme={null} SELECT 'Hello world!'::pdb.ngram(3,3)::text[]; ``` ```ini Expected Response theme={null} text ------------------------------------------------- {hel,ell,llo,"lo ","o w"," wo",wor,orl,rld,ld!} (1 row) ``` Choosing the right tokenizer is crucial to getting the search results you want. For instance, the simple tokenizer works best for whole-word matching like "hello" or "world", while the ngram tokenizer enables partial matching. To configure a tokenizer for a column in the index, simply cast it to the desired tokenizer type: ```sql theme={null} CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.ngram(3,3))) WITH (key_field='id'); ``` # AI Agents Source: https://docs.paradedb.com/welcome/ai-agents Integrate ParadeDB with AI coding assistants and LLM-powered agents. ParadeDB provides first-class support for AI coding assistants and LLM-powered agents. Whether you're using Claude, Cursor, VS Code Copilot, or building custom AI workflows, you can give your agent expert knowledge of ParadeDB. ## Agent Skills Agent skills are instruction files that teach AI coding assistants about ParadeDB. 
Once installed, your agent will automatically provide accurate, up-to-date guidance when you ask about BM25 indexing, full-text search, or any ParadeDB feature. The skill uses a **pointer-based approach**—instead of bundling static documentation that can go stale, it instructs agents to fetch the latest docs in real-time. Install the skill with: ```bash theme={null} npx skills add paradedb/agent-skills ``` This installs `paradedb-skill` into your agent's skills directory (for example, Codex uses `$CODEX_HOME/skills/paradedb-skill`). For manual and tool-specific setup instructions, see the [agent-skills repository](https://github.com/paradedb/agent-skills). ## MCP Integration ParadeDB documentation is available via the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) for direct integration with MCP-compatible agents. **MCP Endpoint:** ``` https://docs.paradedb.com/mcp ``` This allows MCP-enabled tools to query ParadeDB documentation programmatically and provide contextual assistance. ## Built-in AI Features Every page in the ParadeDB documentation includes AI-powered features: ### Copy Page Menu Click **"Copy page"** in the top-right corner of any page to access quick integrations: * **Copy page** — Copy the page content as Markdown for LLMs * **Copy MCP Server** — Copy the MCP server URL to your clipboard * **Connect to Cursor** — Install the MCP server directly in Cursor * **Connect to VS Code** — Install the MCP server directly in VS Code * **Open in Perplexity / ChatGPT / Claude** — Ask questions about the current page in your preferred AI assistant # Architecture Source: https://docs.paradedb.com/welcome/architecture A deep dive into how ParadeDB is built on Postgres ParadeDB introduces modern query execution paths and data structures, optimized for high-ingest search and analytics workloads, to Postgres. ## Custom Index Custom Index Architecture In Postgres, indexes provide alternative data structures for accessing the data in a table (which Postgres calls a "heap table") more efficiently. ParadeDB introduces a custom index called the *BM25 index*. When a table row is inserted or updated, the BM25 index is immediately notified. These changes are recorded as part of the current transaction, ensuring that index updates are real-time. ## Data Model Data Model The BM25 index is laid out as an [LSM tree](#lsm-tree), where each segment in the tree consists of both an inverted index and columnar index. The inverted and columnar indexes optimize for fast reads, while the LSM tree optimizes for high-frequency writes. ### Inverted Index An inverted index is a structure that maps each term (i.e., tokenized word) to a list of documents that contain that term (called a "postings list") along with metadata like term frequency and document frequency. [This structure](https://github.com/quickwit-oss/tantivy/blob/main/ARCHITECTURE.md#the-inverted-search-index) allows ParadeDB to efficiently retrieve all documents matching a particular search term or phrase without scanning the entire table. ### Columnar Index Alongside the inverted index, ParadeDB also maintains a structure that stores fields in a column-oriented format. Columnar formats are standard for analytical (i.e. OLAP) databases because they store values contiguously and enable efficient scans over large datasets compared to Postgres' row-oriented layout. 
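For example, an aggregation over an indexed numeric field can be answered from the columnar index rather than the heap. A sketch using the `mock_items` table and the `pdb.agg` function from the aggregates guide, assuming `rating` is part of the BM25 index and that the `avg` metric aggregation is available:

```sql theme={null}
-- Sketch: compute the average rating from the columnar index
SELECT pdb.agg('{"avg": {"field": "rating"}}')
FROM mock_items
WHERE id @@@ pdb.all();
```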
All non-text fields, as well as text fields that use the [literal](/documentation/tokenizers/available-tokenizers/literal) or [literal normalized](/documentation/tokenizers/available-tokenizers/literal-normalized) tokenizer, are stored in the columnstore. In Tantivy these structures are referred to as [fast fields](https://docs.rs/tantivy/latest/tantivy/fastfield/index.html), but they are largely transparent in ParadeDB. ### LSM Tree To support real-time updates, the BM25 index uses a [Log-Structured Merge (LSM) tree](https://en.wikipedia.org/wiki/Log-structured_merge-tree). An LSM tree is a write-optimized data structure commonly used in systems like RocksDB and Cassandra. The core idea behind an LSM tree is to turn random writes into sequential ones. Incoming writes are first stored in an in-memory buffer, which is fast to update. Once the buffer fills up or the current statement finishes, it is flushed to disk as an immutable "segment" file. These segment files are organized by size into layers or levels. Newer data is written to the topmost layer. Over time, data is gradually pushed down into lower levels through a process called merging or compaction, where data from smaller segments is merged, deduplicated, and rewritten into larger segments. In ParadeDB, every `INSERT`/`UPDATE`/`COPY` statement creates a new segment. Each segment has its own inverted index and columnar index, which means that the BM25 index is actually a collection of many inverted/columnar indexes, each of which allows for very dense intersection queries to rapidly filter matches. ## Query Execution ### Custom Operators ParadeDB introduces several new text search operators to Postgres. For example, `|||` is used for [match disjunction](/documentation/full-text/match) queries, whereas `###` is for [phrase](/documentation/full-text/phrase) queries. ```sql theme={null} SELECT * FROM mock_items WHERE description ||| 'running shoes'; ``` ParadeDB’s custom query execution paths are only triggered when at least one of ParadeDB's operators is present in the query. Otherwise, the query is executed entirely by native Postgres. ### Custom Scan Whenever a ParadeDB operator is present in a query, ParadeDB will execute the query using a [custom scan](https://www.postgresql.org/docs/current/custom-scan.html). Custom scans are execution nodes set aside by Postgres that allow extensions to run custom logic during a query. They are more powerful and versatile than typical Postgres index scans because they allow the extension to "take over" large parts of the query, including aggregates, `WHERE`, and even [`GROUP BY` clauses](/welcome/roadmap#analytics). From a performance perspective, custom scans significantly speed up queries by pushing down filters, aggregates, and other operations directly into the index, rather than applying them afterward in separate phases. To understand what kind of scan is used, run `EXPLAIN`: ```sql theme={null} -- Native Postgres scan, no ParadeDB operator EXPLAIN SELECT * FROM mock_items WHERE description = 'running shoes' AND rating <= 5; -- Custom scan, ParadeDB operator used EXPLAIN SELECT * FROM mock_items WHERE description ||| 'running shoes' AND rating <= 5; ``` As a rule of thumb: if `EXPLAIN` shows a custom scan (or, in rare cases, a BM25 index scan), then that part of the query is going through ParadeDB. Otherwise, the query passes through standard Postgres.
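Aggregates can be checked the same way. As a sketch (assuming `description` is indexed), a `COUNT` guarded by a ParadeDB operator should also show up under the custom scan in the plan:

```sql theme={null}
-- Sketch: an aggregate combined with a ParadeDB operator is pushed into the custom scan
EXPLAIN SELECT COUNT(*)
FROM mock_items
WHERE description ||| 'running shoes';
```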
### Parallelization For queries that need to read large amounts of data like [Top N](/documentation/sorting/topn) or aggregate queries, the custom scan automatically spawns additional workers to execute the query in parallel. To see if a query was parallelized, run `EXPLAIN ANALYZE`: ```sql theme={null} -- "Top N" queries may be parallelized EXPLAIN ANALYZE SELECT * FROM mock_items WHERE description ||| 'running shoes' ORDER BY rating LIMIT 5; ``` Parallelization also depends on the [number of available workers](/documentation/performance-tuning/reads). Parallel workers are another reason why the BM25 index is significantly faster than Postgres' native text search and aggregates, which are mostly not capable of parallelization. ## Design Philosophy * **Keep it Boring**. Use robust extension points in Postgres vs. hacking around the internals. Adopt battle-tested tools, like industry standard file formats and query engine libraries, instead of cutting-edge but less-proven alternatives. * **Behave Exactly Like Postgres**. This extends from user-facing aspects, like the SQL query syntax and ORM compatibility, all the way down to low-level integrations with Postgres' storage system and query planner. * **Works Out of the Box**. Users should be able to get satisfying search results and performance with minimal tuning or configuration. ## Dependencies The two main dependencies of `pg_search` are [`pgrx`](https://github.com/pgcentralfoundation/pgrx/tree/develop), the library for writing Postgres extensions in Rust, and [Tantivy](https://github.com/quickwit-oss/tantivy), a Rust-based search library inspired by [Lucene](https://github.com/apache/lucene). # Guarantees Source: https://docs.paradedb.com/welcome/guarantees ParadeDB ensures ACID compliance, concurrency, data integrity, and replication safety ### ACI(D) All reads and writes go through Postgres’ transaction engine. This means that inserts, updates, and deletes to indexed columns are atomic, consistent, and respect Postgres' [isolation levels](https://www.postgresql.org/docs/current/transaction-iso.html). Durability — the "D" in ACID — means that once a transaction is committed, its changes will survive crashes or failovers. In PostgreSQL, this guarantee is provided by the write-ahead log (WAL), which ensures that all changes are safely recorded before being applied to disk. [ParadeDB Community](https://github.com/paradedb/paradedb) does **not** write to the WAL, and therefore does not guarantee durability in the face of crashes. For production use cases that require full durability, [ParadeDB Enterprise](/deploy/enterprise) — a closed-source fork of ParadeDB for enterprise customers — includes full WAL integration. ### Concurrency ParadeDB is designed to support concurrent reads and writes in the same way that Postgres does — by adhering to Postgres' [multi-version concurrency control (MVCC)](https://www.postgresql.org/docs/current/mvcc.html) rules. We maintain an internal testing suite that rigorously measures the read and write throughput of the BM25 index under concurrent load. Both read and write throughput under concurrent load can be improved by tuning Postgres' settings. 
For instance, read throughput can be improved by configuring the [`max_parallel_worker` pool](/documentation/performance-tuning/reads#raise-parallel-workers) and [buffer cache size](/documentation/performance-tuning/reads#raise-shared-buffers), whereas writes can be improved by increasing [per-statement memory](/documentation/performance-tuning/writes#increase-memory-for-bulk-updates). ### Correctness vs. Performance While ParadeDB optimizes heavily around performance, there are some situations where the database can squeeze more performance by relaxing correctness constraints. In these cases, ParadeDB — like Postgres — will guarantee correctness, even if it comes at the cost of slower query execution. ### Replication Safety ParadeDB distinguishes between logical replication and physical replication. Logical replication refers to replicating changes from a standard Postgres primary (e.g. AWS RDS) into a ParadeDB instance. This is commonly used when ParadeDB acts as a search node built from upstream Postgres changes. Physical replication refers to running ParadeDB itself in a multi-node, high-availability (HA) setup using write-ahead log (WAL) shipping. [ParadeDB Community](https://github.com/paradedb/paradedb) supports logical replication, but not physical replication: * It can act as a logical replica, ingesting changes from a Postgres primary and indexing them transactionally. * The BM25 index does not get physically replicated and won't be available on other nodes in a high availability setup. [ParadeDB Enterprise](/deploy/enterprise) supports both: * It can act as a logical replica, ingesting changes from a Postgres primary and indexing them transactionally. * It supports physical replication and high availability, ensuring that the BM25 index remains consistent and crash-safe across nodes. If your deployment requires high availability, or failover, we recommend using [ParadeDB Enterprise](/deploy/enterprise). ### Data Integrity All data inserted into the BM25 index must conform to the column’s declared type. ParadeDB relies on Postgres’ type system and input/output functions to ensure validity. For example, invalid data will result in a Postgres error at insert time, not at query time. # Simple, Elastic-Quality Search for Postgres Source: https://docs.paradedb.com/welcome/introduction ParadeDB is the modern Elastic alternative built as a Postgres extension. ParadeDB Banner ## Who is ParadeDB for? You are likely a good fit for ParadeDB if you identify with the following: 1. Your **primary database is Postgres**, either managed (e.g. AWS RDS) or self-managed 2. You **have used Postgres' built-in search** capabilities via `tsvector` and the GIN index, but have reached a scale where you're limited by **performance bottlenecks** or **missing features** like BM25 scoring or fuzzy search 3. You are evaluating a search engine like Elasticsearch, but **don't want to introduce another cumbersome dependency** to your stack ## Why ParadeDB? For teams that already use Postgres, ParadeDB is the simplest way to bring Elastic-quality search to your application. ### Zero ETL Required Syncing Postgres with an external search engine like Elastic can be a time-consuming, error-prone process that involves babysitting ETL pipelines and debugging data inconsistency issues. 
ParadeDB eliminates this class of problems because you can: * [Install](/deploy/self-hosted/extension) the ParadeDB extension directly inside your Postgres, if it is self-managed * [Run ParadeDB as a logical replica](/deploy/self-hosted/logical-replication/getting-started) of your primary Postgres, if you use managed Postgres providers like RDS ### Search That Feels Like Postgres In ParadeDB, writing a search query is as simple as writing SQL. ParadeDB supports JOINs, which removes the complexity of denormalizing your existing schema. ### As Reliable As Postgres ParadeDB supports Postgres transactions and ACID guarantees. This means that data is searchable immediately after it's written to ParadeDB, and durable thanks to Postgres write-ahead logging. ## ParadeDB vs. Alternatives People usually compare ParadeDB to two other types of systems: OLTP databases like vanilla Postgres and search engines like Elastic. | | **OLTP database** | **Search engine** | **ParadeDB** | | --------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------- | | **Primary role** | System of record | Search and retrieval engine | System of record **and** search/analytics engine | | **Examples** | Postgres, MySQL | Elasticsearch, OpenSearch | | | **Search features** | Basic FTS (no BM25, weak ranking) | Rich search features (BM25, fuzzy matching, faceting, hybrid search) | Rich search features (BM25, fuzzy matching, faceting, hybrid search) | | **Analytics features** | Not an analytical DB (no column store, batch processing, etc.) | Column store, batch processing, parallelization via sharding | Column store, batch processing, parallelization via Postgres [parallel workers](/documentation/performance-tuning/writes) | | **Lag** | None in a single cluster | At least network, ETL transformation, and indexing time | None in a single cluster | | **Operational complexity** | Simple (single datastore) | Complex (ETL pipelines, managing multiple systems) | Simple (single datastore) | | **Scalability** | Vertical scaling in a single node, horizontal scaling through Kubernetes | Horizontal scaling through sharding | Vertical scaling in a single node, horizontal scaling through [Kubernetes](/deploy/self-hosted/kubernetes) | | **Language** | SQL | Custom DSL | Standard SQL with custom search operators | | **ACID guarantees** | Full ACID compliance, read-after-write guarantees | No transactions, atomic only per-document, eventual consistency, durability not guaranteed until flush | Full ACID compliance, read-after-write guarantees | | **Update & delete support** | Built for fast-changing data | Struggles with updates/deletes | Built for fast-changing data | ## Production Readiness As a company, ParadeDB is over two years old. ParadeDB launched in the [Y Combinator (YC)](https://ycombinator.com) S23 batch and has been validated in production since December 2023. [ParadeDB Community](https://github.com/paradedb/paradedb), the open-source version of ParadeDB, has been deployed over 400,000 times in the past 12 months. ParadeDB Enterprise, the durable and production-hardened edition of ParadeDB, powers core search and analytics use cases at enterprises ranging from Fortune 500s to fast-growing startups. 
A few examples include: * **Alibaba Cloud**, the largest Asia-Pacific cloud provider, uses ParadeDB to power search inside their data warehouse. [Case study available](https://www.paradedb.com/customers/case-study-alibaba). * **Bilt Rewards**, a rent payments technology company that processed over \$36B in payments in 2024. [Case study available](https://www.paradedb.com/customers/case-study-bilt). * **Modern Treasury**1, a financial technology company that automates the full cycle of money movement. * **Span**1, one of the fastest-growing AI developer productivity platforms * **TCDI**1, a giant in the legal software and litigation management space. *1. Case study coming soon* ## Next Steps You're now ready to jump into our guides. Get started with ParadeDB in under five minutes. Learn how ParadeDB is built. API reference for full text search and analytics. Deploy ParadeDB as a Postgres extension or standalone database. # Limitations & Tradeoffs Source: https://docs.paradedb.com/welcome/limitations Understand ParadeDB's key limitations and tradeoffs ## Distributed Workloads ParadeDB is designed to scale vertically on a single Postgres node with potentially many read replicas, and many production deployments comfortably operate in the 1–10TB range. The largest single ParadeDB database we’ve seen in production is 10TB. For datasets that significantly exceed this scale, ParadeDB supports partitioned tables and can be deployed in sharded Postgres configurations. ParadeDB is fully compatible with [Citus](https://github.com/citusdata/citus) for distributed search workloads — you can create BM25 indexes on distributed tables and run search queries across shards. See our [Citus deployment guide](/deploy/citus) for more details. If you're working with very large datasets, please [reach out to us](mailto:support@paradedb.com). We'd be happy to provide guidance and share our roadmap for future distributed query support. ## Join Support ParadeDB supports all PostgreSQL `JOIN`s: * `INNER JOIN` * `LEFT / RIGHT / FULL OUTER JOIN` * `CROSS JOIN` * `LATERAL` * Semi and Anti `JOIN`s For the most part you can mix search and relational queries without changing your SQL. However, `JOIN`s do incur some performance tradeoffs. See the [joins guide](/documentation/performance-tuning/joins) for more details. ## Covering Index The BM25 index in ParadeDB is a covering index, which means it stores all indexed columns inside a single index per table. This decision is intentional -- by colocating all the relevant data, ParadeDB optimizes for fast reads and boolean conditions. However, this means that all columns must be defined up front at index creation time. Adding or removing columns requires a `REINDEX`. ## DDL Replication A commonly known limitation of Postgres logical replication is that DDL (Data Definition Language) statements are not replicated. This includes operations like `CREATE TABLE` or `CREATE INDEX`. If ParadeDB is running as a logical replica of a primary Postgres, DDL statements from the primary must be executed manually on the replica. We recommend version-controlling your schema changes and applying them in a coordinated, repeatable way — either through a migration tool or deployment automation — to keep source and target databases in sync. # Roadmap Source: https://docs.paradedb.com/welcome/roadmap The main features that we are currently working on We're a lean team that likes to ship at [incredibly high velocity](https://github.com/paradedb/paradedb/releases). 
## In Progress ### JOIN Improvements * **Scoring and highlighting across JOINs**. BM25 score and snippet functions can be used in `JOIN` queries. * **Smarter JOIN planning for search indexes**. Apply index-aware optimizations and cost estimation strategies when multiple BM25-indexed tables are joined. * **Faster JOIN performance through predicate pushdown**. Search predicates are selectively pushed down to relevant tables based on indexability and selectivity, improving `JOIN` query speed. ### Ecosystem Integrations * **ORMs**. Official support for popular ORMs, like Django, Prisma, Active Record, and many others is coming. * **AI Frameworks**. Official support for LangChain, LlamaIndex, CrewAI, and others is coming. * **PaaS Providers**. Official tutorials for hosting ParadeDB on platform-as-a-service providers like Railway, DigitalOcean, Porter.run, and many others are coming. ## Long Term ### Deeper Analytics Improvements * **Push Postgres visibility rules into the index**. This is currently a filter applied after the index scan, which adds overhead to large scans. * **Evaluate more industry-standard OLAP tools**. A new file format? Query execution library? ### Vector Search Improvements * Postgres (and by extension, ParadeDB) uses [`pgvector`](https://github.com/pgvector/pgvector) for vector search. Contingent on demand and internal resources, we may investigate what improvements can be made to the known limitations of `pgvector`. ### Managed Cloud * Today, you can [deploy ParadeDB](/deploy/overview) either self-hosted or with ParadeDB BYOC. We are working on a fully managed cloud offering, with a focus on scalability and supporting distributed workloads. ## Completed ### Analytics * **A custom scan node for aggregates**. This will allow "plain SQL" aggregates to go through the same fast execution path as our [aggregate UDFs](/documentation/aggregates/tantivy), further accelerating aggregates like `COUNT`, and SQL clauses like `GROUP BY`. ### Write Throughput * **Background merging**. Improves write performance by merging index segments asynchronously without blocking inserts. * **Pending list**. Buffers recent writes before flushing them to the LSM tree. ### Improved UX * **More intuitive index configuration**. Overhaul the complicated JSON `WITH` index options. * **More ORM friendly**. Overhaul the [query builder functions](/documentation/query-builder/overview) to use actual column references instead of string literals. * **New operators**. In addition to the existing `@@@` operator, introduce new operators for different query types (e.g. phrase, term, conjunction/disjunction). ## We're Hiring We're tackling some of the hardest and (in our opinion) most impactful problems in Postgres. If you want to be a part of it, please check out our [open roles](https://paradedb.notion.site)! # Help and Support Source: https://docs.paradedb.com/welcome/support How to obtain support for ParadeDB For questions regarding enterprise support or commercial licensing, please [contact sales](mailto:sales@paradedb.com). For community support and general questions, please join the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-32abtyjg4-yoYoi~RPh9MSW8tDbl0BQw). ## Ask a Question Use the **"Ask a question..."** bar at the bottom of any page to get instant answers about ParadeDB. The AI assistant has full context of the documentation and can help with queries, troubleshooting, and best practices.
# ParadeDB BYOC Source: https://docs.paradedb.com/deploy/byoc Deploy ParadeDB Bring Your Own Cloud (BYOC) within your cloud environment For access to ParadeDB BYOC, [contact sales](mailto:sales@paradedb.com). ParadeDB BYOC (Bring Your Own Cloud) is a managed deployment of ParadeDB within your cloud environment. It combines the benefits of a managed platform with the security posture of a self-hosted deployment. ParadeDB BYOC is supported on GCP and AWS, including GovCloud regions and airgapped environments. To request access for Azure, Oracle Cloud, or another cloud platform please contact [sales@paradedb.com](mailto:sales@paradedb.com). ## How BYOC Works ParadeDB BYOC provisions a Kubernetes cluster in your cloud environment with [high availability](/deploy/self-hosted/high-availability/configuration) preconfigured. It also configures [logical replication](/deploy/self-hosted/logical-replication/getting-started) with your primary Postgres, backups, connection pooling, monitoring, access control, and audit logging. ParadeDB BYOC can be deployed and managed in one of two ways: * **Fully Managed**: ParadeDB will deploy and manage the ParadeDB BYOC module for you. ParadeDB requires a sub-account or project within your cloud provider via an IAM user or a service account. * **Just-in-Time Managed**: You will deploy the ParadeDB BYOC module and can choose to provide just-in-time access to the ParadeDB team when support is required. This is typically useful for airgapped environments. ParadeDB BYOC Topology ## Getting Started This section assumes that you have received access to the ParadeDB BYOC module and are deploying it yourself on AWS or GCP. In a fully managed deployment, these steps will be performed by ParadeDB on your behalf. ### Install Dependencies First, ensure that you are in the BYOC module repository. Next, install Terraform, Kubectl, PostgreSQL, and the CLI for your desired cloud provider: ```bash macOS theme={null} brew install terraform kubectl postgresql ``` ```bash Ubuntu theme={null} sudo apt-get install -y terraform kubectl postgresql ``` ### Authenticate CLI Install and authenticate with either the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) or [GCP CLI](https://cloud.google.com/sdk/docs/install#deb). ```bash AWS theme={null} aws configure ``` ```bash GCP theme={null} gcloud init gcloud auth application-default login ``` ### Provision ParadeDB Our Terraform project will provision a Kubernetes cluster (EKS or GKE) along with all the necessary infrastructure to run ParadeDB. First, copy either `aws.example.tfvars` or `gcp.example.tfvars` into a new file called `byoc.tfvars`. ```bash AWS theme={null} cp aws.example.tfvars byoc.tfvars ``` ```bash GCP theme={null} cp gcp.example.tfvars byoc.tfvars ``` Next, open and configure `byoc.tfvars`. Configuration instructions can be found directly within the file. ```bash theme={null} open byoc.tfvars || xdg-open byoc.tfvars ``` ### Run Terraform First, initialize Terraform. ```bash AWS theme={null} terraform -chdir=infrastructure/aws init ``` ```bash GCP theme={null} terraform -chdir=infrastructure/gcp init ``` Next, run Terraform `apply`. ```bash AWS theme={null} terraform -chdir=infrastructure/aws apply -var-file=../../byoc.tfvars ``` ```bash GCP theme={null} terraform -chdir=infrastructure/gcp apply -var-file=../../byoc.tfvars ``` It may take up to 30 minutes to provision all the necessary infrastructure. 
When this command is complete, you will see a `kubectl` command printed as Terraform output to the terminal. Run this command, which will add the EKS or GKE cluster configuration to your local `.kubeconfig` file. That's it! You're now ready to connect to ParadeDB. ### Connect to ParadeDB #### Access the Grafana Dashboard First, port-forward the Grafana service to localhost. ```bash theme={null} kubectl --namespace monitoring port-forward service/prometheus-grafana 8080:80 ``` Then, go to `http://localhost:8080`. Your Grafana credentials have been printed in the terminal output of the above Terraform `apply` command. You can find the ParadeDB dashboard by typing `CloudNativePG` in the search bar, and selecting `paradedb` for the Database Namespace. By default, the dashboard will display metrics over the last 7 days. If you've just spun up the cluster, change it to the last 15 minutes to start seeing results immediately. #### Access the ParadeDB Instance First, retrieve the database credentials. ```bash theme={null} kubectl --namespace paradedb get secrets paradedb-superuser -o json | jq -r '.data | map_values(@base64d) | .uri |= sub("\\*"; "paradedb") | .dbname = "paradedb"' ``` Next, port-forward the ParadeDB service to localhost. ```bash theme={null} kubectl --namespace paradedb port-forward service/paradedb-rw 5432:5432 ``` Now you can connect to the ParadeDB instance using the credentials you've retrieved. ```bash theme={null} PGPASSWORD= psql -h localhost -d paradedb -p 5432 -U ``` # GitHub Actions Source: https://docs.paradedb.com/deploy/ci/github-actions How to run ParadeDB in Github Actions CI ## Sample GitHub Actions Workflow ```yaml theme={null} name: ParadeDB in GitHub Actions on: pull_request: branches: - main workflow_dispatch: jobs: paradedb-in-github-actions: name: ParadeDB in GitHub Actions runs-on: ubuntu-latest services: paradedb: # The list of available tags can be found at https://hub.docker.com/r/paradedb/paradedb/tags image: paradedb/paradedb:latest env: POSTGRES_USER: testuser POSTGRES_PASSWORD: testpassword POSTGRES_DB: testdb ports: - 5432:5432 options: >- --health-cmd="pg_isready -U postgres" --health-interval=10s --health-timeout=5s --health-retries=5 steps: - name: Checkout repository uses: actions/checkout@v6 - name: Wait for PostgreSQL to be ready run: | for i in {1..10}; do if psql -h localhost -U testuser -d testdb -c "SELECT 1;" > /dev/null 2>&1; then echo "Database is ready!" break fi echo "Waiting for database..." 
sleep 5 done - name: Run ParadeDB example queries run: | psql -h localhost -U testuser -d testdb -c "CALL paradedb.create_bm25_test_table(schema_name => 'public', table_name => 'mock_items');" psql -h localhost -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items LIMIT 3;" psql -h localhost -U testuser -d testdb -c "CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range) WITH (key_field='id');" psql -h localhost -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' OR category @@@ 'footwear' AND rating @@@ '>2' ORDER BY description LIMIT 5;" ``` # GitLab CI Source: https://docs.paradedb.com/deploy/ci/gitlab-ci How to run ParadeDB in Gitlab CI ## Sample GitLab CI Workflow ```yaml theme={null} paradedb-in-gitlab-ci: # The list of available tags can be found at https://hub.docker.com/r/paradedb/paradedb/tags image: paradedb/paradedb:latest services: - postgres variables: POSTGRES_USER: testuser POSTGRES_DB: testdb POSTGRES_HOST_AUTH_METHOD: trust script: - psql -h "postgres" -U testuser -d testdb -c "CALL paradedb.create_bm25_test_table(schema_name => 'public', table_name => 'mock_items');" - psql -h "postgres" -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items LIMIT 3;" - psql -h "postgres" -U testuser -d testdb -c "CREATE INDEX search_idx ON mock_items USING bm25 (id, description, category, rating, in_stock, created_at, metadata, weight_range) WITH (key_field='id');" - psql -h "postgres" -U testuser -d testdb -c "SELECT description, rating, category FROM mock_items WHERE description @@@ 'shoes' OR category @@@ 'footwear' AND rating @@@ '>2' ORDER BY description LIMIT 5;" ``` # Using ParadeDB with Citus Source: https://docs.paradedb.com/deploy/citus Distributed full-text search with Citus and ParadeDB [Citus](https://github.com/citusdata/citus) transforms PostgreSQL into a distributed database with horizontal sharding. ParadeDB is fully compatible with Citus, enabling distributed full-text search across sharded tables. ## What's Supported * **BM25 indexes on distributed tables** — Create search indexes after distributing tables with `create_distributed_table()` * **Distributed queries with search operators** — Use the `|||` (match disjunction) and `&&&` (match conjunction) operators in queries across sharded tables * **Subqueries with LIMIT** — Complex queries with subqueries and LIMIT clauses work correctly * **JOIN queries** — Search with JOINs across distributed tables ## Installation Both `citus` and `pg_search` must be added to `shared_preload_libraries` in the correct order: ```bash theme={null} # Install Citus first curl https://install.citusdata.com/community/deb.sh | sudo bash apt-get install -y postgresql-17-citus-13.0 # Add both extensions to shared_preload_libraries sed -i "s/^shared_preload_libraries = .*/shared_preload_libraries = 'citus,pg_search'/" /var/lib/postgresql/data/postgresql.conf # Restart PostgreSQL # Then create extensions in your database ``` The order in `shared_preload_libraries` matters. Always list `citus` before `pg_search` to ensure proper planner hook chaining. 
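Once Postgres has restarted with both libraries, you can sanity-check the load order from `psql` using a standard Postgres command (nothing here is ParadeDB-specific):

```sql theme={null}
-- Confirm that both libraries are preloaded, with citus listed before pg_search
SHOW shared_preload_libraries;
-- Expected output: citus,pg_search
```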
## Usage Example Here's a complete example of setting up distributed search with Citus: ```sql theme={null} CREATE EXTENSION citus; CREATE EXTENSION pg_search; -- Create a table with a distribution key CREATE TABLE articles ( id SERIAL, author_id INT NOT NULL, title TEXT, body TEXT, PRIMARY KEY (author_id, id) -- Must include distribution column ); -- Distribute the table across shards SELECT create_distributed_table('articles', 'author_id'); -- Create a BM25 index on the distributed table CREATE INDEX articles_search_idx ON articles USING bm25 (id, title, body) WITH (key_field='id'); -- Insert some data INSERT INTO articles (author_id, title, body) VALUES (1, 'PostgreSQL Performance', 'Optimizing PostgreSQL queries for large datasets'), (1, 'Distributed Databases', 'Understanding sharding and replication strategies'), (2, 'Full-Text Search', 'Building search engines with PostgreSQL'); -- Search across shards SELECT id, title FROM articles WHERE body ||| 'PostgreSQL distributed' ORDER BY id; -- Results: -- id | title -- ----+------------------------ -- 1 | PostgreSQL Performance -- 3 | Full-Text Search ``` ### Verify Distributed Execution You can verify that both ParadeDB and Citus are working together by examining the query plan: ```sql theme={null} EXPLAIN (VERBOSE) SELECT id, title FROM articles WHERE body ||| 'PostgreSQL distributed' ORDER BY id; ``` The plan should show: 1. **Citus Adaptive Custom Scan** — Coordinating distributed query execution across shards 2. **ParadeDB Scan** — Using the BM25 index within each shard 3. **Task Count: 32** — Query distributed across 32 shards (default Citus shard count) ``` Sort (cost=11041.82..11291.82 rows=100000 width=36) Output: remote_scan.id, remote_scan.title Sort Key: remote_scan.id -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=36) Output: remote_scan.id, remote_scan.title Task Count: 32 Tasks Shown: One of 32 -> Task Query: SELECT id, title FROM public.articles_102008 articles WHERE (id OPERATOR(pg_catalog.@@@) ...) 
Node: host=localhost port=5432 dbname=postgres -> Custom Scan (ParadeDB Scan) on public.articles_102008 articles (cost=10.00..10.01 rows=1 width=36) Output: id, title Table: articles_102008 Index: articles_search_idx_102008 Tantivy Query: {"with_index":{"query":{"with_index":{"query":{"match":{"field":"body","value":"PostgreSQL distributed"}}}}}} ``` ## Distributed JOINs with Search ParadeDB search operators work seamlessly with Citus distributed JOINs: ```sql theme={null} -- Create and distribute a second table CREATE TABLE authors ( id INT PRIMARY KEY, name TEXT, bio TEXT ); SELECT create_distributed_table('authors', 'id'); -- JOIN with search operators SELECT a.name, ar.title FROM authors a JOIN articles ar ON a.id = ar.author_id WHERE ar.body ||| 'PostgreSQL' ORDER BY a.name; -- Results: -- name | title -- -------+------------------------ -- Alice | PostgreSQL Performance -- Bob | Full-Text Search ``` ### Verify Distributed JOIN Execution Check the execution plan for distributed JOINs with search: ```sql theme={null} EXPLAIN (VERBOSE) SELECT a.name, ar.title FROM authors a JOIN articles ar ON a.id = ar.author_id WHERE ar.body ||| 'PostgreSQL' ORDER BY a.name; ``` ``` Sort (cost=12067.32..12317.32 rows=100000 width=64) Output: remote_scan.name, remote_scan.title Sort Key: remote_scan.name -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=64) Output: remote_scan.name, remote_scan.title Task Count: 32 Tasks Shown: One of 32 -> Task Query: SELECT a.name, ar.title FROM (public.authors_102040 a JOIN public.articles_102008 ar ON (...)) Node: host=localhost port=5432 dbname=postgres -> Nested Loop (cost=10.15..18.20 rows=1 width=64) Output: a.name, ar.title Inner Unique: true -> Custom Scan (ParadeDB Scan) on public.articles_102008 ar (cost=10.00..10.01 rows=1 width=36) Output: ar.title, ar.author_id Table: articles_102008 Index: articles_search_idx_102008 Tantivy Query: {"with_index":{"query":{"with_index":{"query":{"match":{"field":"body","value":"PostgreSQL"}}}}}} -> Index Scan using authors_pkey_102040 on public.authors_102040 a (cost=0.15..8.17 rows=1 width=36) Output: a.id, a.name, a.bio Index Cond: (a.id = ar.author_id) ``` Key indicators: * `Nested Loop` shows efficient JOIN execution on each shard * `Custom Scan (ParadeDB Scan)` on the outer side of the JOIN uses BM25 for filtering * `Index Scan` on authors table uses the primary key for lookups * JOINs execute **locally on each shard** for optimal performance ## Known Limitations * ❌ **Citus columnar tables** — BM25 indexes and other PostgreSQL indexes (like GiST, GIN) cannot be created on Citus columnar tables due to limitations in Citus's columnar storage implementation. However, you can use regular distributed tables with BM25 indexes alongside columnar tables for analytics. ## Performance Considerations When using ParadeDB with Citus: * **Index creation** happens locally on each shard, enabling parallel index building * **Search queries** execute in parallel across shards and results are merged by the coordinator * **Distribution column** should be chosen based on your query patterns to minimize cross-shard operations For more guidance on optimizing distributed search workloads, please reach out to us in the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-32abtyjg4-yoYoi~RPh9MSW8tDbl0BQw) or via [email](mailto:support@paradedb.com). 
# ParadeDB Enterprise Source: https://docs.paradedb.com/deploy/enterprise Feature comparison between ParadeDB Community and Enterprise If you're a non-profit or a non-commercial open source project and are interested in ParadeDB Enterprise, please [contact sales](mailto:sales@paradedb.com). We provide complimentary access on a case-by-case basis. ParadeDB ships in two versions: ParadeDB Community and ParadeDB Enterprise. [ParadeDB Community](https://github.com/paradedb/paradedb) is our open source product, licensed under [AGPL-3.0](https://github.com/paradedb/paradedb/blob/dev/LICENSE). This license permits free use, modification, and distribution of the software, provided that distributed, derivative works of the software are released under the same license (copyleft provision). In addition to all of the features of ParadeDB Community, ParadeDB Enterprise: 1. Waives the copyleft provision of AGPL-3.0 2. Contains several closed-source features that are recommended for ParadeDB to service enterprise, production workloads For access to ParadeDB Enterprise, please [contact sales](mailto:sales@paradedb.com). ## Feature Comparison | | ParadeDB Community | ParadeDB Enterprise | | ------------------------------------ | ------------------ | ------------------- | | **Index Configuration** | | | | Support for most Postgres types | ✅ | ✅ | | Custom tokenizers and filters | ✅ | ✅ | | Multiple tokenizers per field | ✅ | ✅ | | **Full Text Search and Analytics** | | | | Query builder API | ✅ | ✅ | | Efficient "Top N" ordering | ✅ | ✅ | | BM25 scoring | ✅ | ✅ | | Highlighting | ✅ | ✅ | | Hybrid search | ✅ | ✅ | | Parallelized fast field aggregates | ✅ | ✅ | | **Concurrency and Consistency** | | | | Postgres MVCC-safe1 | ✅ | ✅ | | Concurrent, non-blocking writes | ✅ | ✅ | | Block storage integration | ✅ | ✅ | | Buffer cache integration2 | ✅ | ✅ | | **Deployment** 3 | | | | Maximum cluster size 4 | 1 | Unlimited | | Physical (i.e. WAL) Replication | ❌ | ✅ | | Crash Recovery | ❌ | ✅ | | Point in Time Recovery | ❌ | ✅ | | Logical Replication | ✅ | ✅ | **Footnotes**

1. The BM25 index supports Postgres' multi-version concurrency control (MVCC) rules. The index reflects the current state of the underlying table at all times, changes to the index are atomic, and queries are transactionally consistent with the table. 2. The BM25 index is built on block storage, Postgres' native storage API. This means that it leverages the Postgres buffer cache, which minimizes disk I/O. 3. All listed deployment features and limitations are specific to the BM25 index. For instance, ParadeDB Community supports physical/logical replication, crash recovery, etc. for heap tables and other Postgres indexes like B-Tree. 4. In a primary-replica topology, BM25 indexes in ParadeDB Community are only available on the primary, as the Community edition does not support physical (WAL) replication.

# Deploying ParadeDB Source: https://docs.paradedb.com/deploy/overview Explore the different ways to deploy ParadeDB into production There are two ways to deploy ParadeDB: [self-hosting ParadeDB](/deploy/self-hosted/kubernetes) or using [ParadeDB BYOC](/deploy/byoc) (Bring Your Own Cloud). ## Self-Hosted ParadeDB For hobby, development, or staging environments, [ParadeDB Community](https://github.com/paradedb/paradedb) can be deployed as a [Docker image](/deploy/self-hosted/docker), inside [Kubernetes](/deploy/self-hosted/kubernetes), or as an [extension](/deploy/self-hosted/extension) inside an existing self-hosted Postgres. For production environments, we recommend deploying the [ParadeDB Enterprise](/deploy/enterprise) binary inside Kubernetes. This unlocks support for [logical replication](/deploy/self-hosted/logical-replication/getting-started) and [high availability](/deploy/self-hosted/high-availability/configuration), which must be configured separately. Running ParadeDB Community in a production application that serves paying customers is discouraged. This is because ParadeDB Community [does not have write-ahead log (WAL) support](/deploy/enterprise). Without WALs, data can be lost or corrupted if the server crashes or restarts, which would necessitate a reindex and incur downtime for your application. For more details, see [guarantees](/welcome/guarantees#aci-d). When you are ready to deploy ParadeDB to production, [contact us](mailto:sales@paradedb.com) for access to ParadeDB Enterprise, which has WAL support. ## ParadeDB BYOC [ParadeDB BYOC (Bring Your Own Cloud)](/deploy/byoc) is a managed deployment of ParadeDB Enterprise inside your AWS or GCP account. Please [contact sales](mailto:sales@paradedb.com) for access. # Docker Source: https://docs.paradedb.com/deploy/self-hosted/docker How to run the ParadeDB Docker container We recommend running ParadeDB Enterprise, not Community, in production to maximize uptime. See [overview](/deploy/overview#self-hosted). To deploy ParadeDB via Docker, pull and run the `paradedb/paradedb` image locally. This is the recommended deployment method for testing and hobby deployments. We do not recommend running the ParadeDB Docker image in production, as it does not provide [high availability](/deploy/self-hosted/high-availability/configuration). ```bash theme={null} docker run \ --name paradedb \ -e POSTGRES_USER=myuser \ -e POSTGRES_PASSWORD=mypassword \ -e POSTGRES_DB=mydatabase \ -v paradedb_data:/var/lib/postgresql/ \ -p 5432:5432 \ -d \ paradedb/paradedb:latest ``` You may replace `myuser`, `mypassword`, and `mydatabase` with whatever values you want. These will be your database connection credentials. The `paradedb/paradedb:latest` tag uses Postgres 17. Docker images are available for Postgres 15+. To specify a different Postgres version, please refer to the available tags on [Docker Hub](https://hub.docker.com/r/paradedb/paradedb/tags). To connect to ParadeDB, install the `psql` client and run ```bash theme={null} docker exec -it paradedb psql -U myuser -d mydatabase -W ``` # Extension Source: https://docs.paradedb.com/deploy/self-hosted/extension How to install ParadeDB as an extension inside an existing self-managed Postgres We recommend running ParadeDB Enterprise, not Community, in production to maximize uptime. See [overview](/deploy/overview#self-hosted). If you already self-manage Postgres, you may prefer to install ParadeDB directly within your self-managed Postgres instead of deploying the ParadeDB Helm chart.
This can be done by installing the `pg_search` extension, which powers all of ParadeDB's custom functionalities. ## Prerequisites Ensure that you have superuser access to the Postgres database. ## Install the ParadeDB Postgres Extension ### ParadeDB Community ParadeDB provides prebuilt binaries of our extension for Postgres 15+ on: * Debian 12 (Bookworm) and 13 (Trixie) * Ubuntu 22.04 (Jammy) and 24.04 (Noble) * macOS 14 (Sonoma) and 15 (Sequoia) * Red Hat Enterprise Linux 9 and 10 If you are using a different version of Postgres or a different operating system, you will need to build the extension from source. #### pg\_search The prebuilt releases can be found in [GitHub Releases](https://github.com/paradedb/paradedb/releases). You can replace `0.21.8` with the `pg_search` version you wish to install and `17` with the version of Postgres you are using. ```bash Ubuntu 24.04 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-noble_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash Ubuntu 22.04 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-jammy_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash Debian 13 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-trixie_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash Debian 12 theme={null} # Available arch versions are amd64, arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-bookworm_amd64.deb" -o /tmp/pg_search.deb sudo apt-get install -y /tmp/*.deb ``` ```bash RHEL 10 theme={null} # Available arch versions are x86_64, aarch64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search_17-0.21.8-1PARADEDB.el10.x86_64.rpm" -o /tmp/pg_search.rpm sudo dnf install -y /tmp/*.rpm ``` ```bash RHEL 9 theme={null} # Available arch versions are x86_64, aarch64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search_17-0.21.8-1PARADEDB.el9.x86_64.rpm" -o /tmp/pg_search.rpm sudo dnf install -y /tmp/*.rpm ``` ```bash macOS 15 (Sequoia) theme={null} # Available arch version is arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search@17--0.21.8.arm64_sequoia.pkg" -o ~/Downloads/pg_search.pkg sudo installer -pkg ~/Downloads/pg_search.pkg -target / ``` ```bash macOS 14 (Sonoma) theme={null} # Available arch version is arm64 curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/pg_search@17--0.21.8.arm64_sonoma.pkg" -o ~/Downloads/pg_search.pkg sudo installer -pkg ~/Downloads/pg_search.pkg -target / ``` ### ParadeDB Enterprise If you are a [ParadeDB Enterprise](/deploy/enterprise) user, you should have received a copy of the enterprise binaries. Please [contact sales](mailto:sales@paradedb.com) for access. ## Update `postgresql.conf` Next, add the extension(s) to `shared_preload_libraries` in `postgresql.conf`. Adding `pg_search` to `shared_preload_libraries` is unnecessary if your Postgres version is 17 or higher. ```ini theme={null} shared_preload_libraries = 'pg_search' ``` Restart the Postgres server for these changes to take effect.
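For example, on a systemd-managed installation the restart might look like the following; the service unit name is an assumption that varies by distribution and Postgres version:

```bash theme={null}
# Restart Postgres so the new shared_preload_libraries value is picked up
sudo systemctl restart postgresql
```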
## Load the Extension Finally, connect to your Postgres database via your client of choice (e.g. `psql`) and run the following command: ```sql theme={null} CREATE EXTENSION pg_search; ``` `pg_search` can be combined with `pgvector` for hybrid search. You can find the instructions for installing `pgvector` [on the `pgvector` GitHub repository](https://github.com/pgvector/pgvector?tab=readme-ov-file#installation). # Configuring High Availability Source: https://docs.paradedb.com/deploy/self-hosted/high-availability/configuration Use read replicas to minimize downtime in production High availability (HA) minimizes downtime in the event of failures and is crucial for production deployments. To achieve high availability, you need to have [ParadeDB Enterprise](/deploy/enterprise) deployed inside a [CNPG Kubernetes cluster](/deploy/self-hosted/kubernetes). ## How High Availability Works In a highly available configuration, ParadeDB deploys as a cluster of Postgres instances. One instance is designated as the **primary** while the other instances are designated as **standby** instances. The primary server sends write-ahead logs (WAL) to the standby servers, which replicate the primary by replaying these logs. If the primary server goes down, a standby server is promoted to become the new primary server. This process is called failover. For a thorough architecture overview, please consult the [CloudNativePG Architecture documentation](https://cloudnative-pg.io/docs/1.28/architecture). ## Enable High Availability Prior to starting the CNPG cluster, modify the `values.yaml` file to increase the number of instances. ```yaml ParadeDB Enterprise theme={null} type: paradedb-enterprise mode: standalone cluster: instances: 3 storage: size: 256Mi ``` The number of replicas is equal to `instances - 1`. Having at least `3` instances guarantees that a standby will be available even while a failover process is occurring. ## Physical Replication Settings Between physical replicas, ParadeDB requires the use of a few settings (which are automatically set by [CNPG](/deploy/self-hosted/kubernetes)) in order to avoid query cancellation due to ongoing reorganization of the data on the primary replica. * `hot_standby_feedback=on` - The [`hot_standby_feedback`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-HOT-STANDBY-FEEDBACK) setting controls whether nodes acting as `hot_standby`s (the replicas in physical replication) send feedback to the leader about their current transaction status. ParadeDB uses this transaction status to determine when it is safe for the primary to garbage collect its segments. * `primary_slot_name=$something` - The [`primary_slot_name`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-PRIMARY-SLOT-NAME) setting declares the name of the replication slot that a replica should use when it connects to the primary. In order for `hot_standby_feedback` to be used and persistent, a replication slot must be used. Without these settings, ParadeDB physical replicas will see much more frequent query cancels, and will report a message recommending that they are used. ## Synchronous Replication By default, ParadeDB ships with asynchronous replication, meaning transactions on the primary **do not** wait for confirmation from the standby instances before committing. **Quorum-based synchronous replication** ensures that a transaction is successfully written to standbys before it completes.
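As a rough sketch of what enabling quorum-based synchronous replication can look like on recent CloudNativePG versions (1.24+), the `Cluster` resource exposes a `synchronous` stanza; the field names and supported options below are assumptions to verify against your CNPG and chart versions:

```yaml theme={null}
# Hypothetical fragment of a CNPG Cluster resource; adapt names and values to your deployment
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: paradedb
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # quorum-based: any of the standbys may acknowledge
      number: 1     # at least one standby must confirm each commit
```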
Please consult the [CloudNativePG Replication documentation](https://cloudnative-pg.io/docs/1.28/replication#synchronous-replication) for details. ## Backup and Disaster Recovery ParadeDB supports backups to cloud object stores (e.g. S3, GCS, etc.) and point-in-time-recovery via [Barman](https://pgbarman.org/). To configure the frequency and location of backups, please consult the [CloudNativePG Backup documentation](https://cloudnative-pg.io/docs/1.28/backup). # Kubernetes Source: https://docs.paradedb.com/deploy/self-hosted/kubernetes How to deploy ParadeDB as a Kubernetes cluster into production Kubernetes is the recommended way to run ParadeDB in production. Both ParadeDB Community and Enterprise binaries can be deployed on Kubernetes. We recommend running ParadeDB Enterprise, not Community, with Kubernetes in production to maximize uptime. See [overview](/deploy/overview#self-hosted). This guide uses the [ParadeDB Helm Chart](https://github.com/paradedb/charts). The chart is also available on [Artifact Hub](https://artifacthub.io/packages/helm/paradedb/paradedb). ## Prerequisites This guide assumes you have installed [Helm](https://helm.sh/docs/intro/install/) and have a Kubernetes cluster running v1.25+. For local testing, we recommend [Minikube](https://minikube.sigs.k8s.io/docs/start/). ## Install the Prometheus Stack The ParadeDB Helm chart supports monitoring via Prometheus and Grafana. To enable this, you need to have the Prometheus CRDs installed before installing the CloudNativePG operator. If you do not yet have the Prometheus CRDs installed on your Kubernetes cluster, you can install it with: ```bash theme={null} helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm upgrade --atomic --install prometheus-community \ --create-namespace \ --namespace prometheus-community \ --values https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/main/docs/src/samples/monitoring/kube-stack-config.yaml \ prometheus-community/kube-prometheus-stack ``` ## Install the CloudNativePG Operator Skip this step if the CloudNativePG operator is already installed in your cluster. If you do not wish to monitor your cluster, omit the `--set` commands. ```bash theme={null} helm repo add cnpg https://cloudnative-pg.github.io/charts helm upgrade --atomic --install cnpg \ --create-namespace \ --namespace cnpg-system \ --set monitoring.podMonitorEnabled=true \ --set monitoring.grafanaDashboard.create=true \ cnpg/cloudnative-pg ``` ## Start a ParadeDB CNPG Cluster Create a `values.yaml` and configure it to your requirements. Here is a basic example: ```yaml ParadeDB Community theme={null} type: paradedb mode: standalone cluster: instances: 1 storage: size: 256Mi ``` ```yaml ParadeDB Enterprise theme={null} type: paradedb-enterprise mode: standalone cluster: instances: 1 storage: size: 256Mi ``` If you are using ParadeDB Enterprise, `instances` should be set to a number greater than `1` for [high availability](/deploy/self-hosted/high-availability/configuration). Next, create a namespace for this step or use an existing namespace. The namespace can be any value. ```bash theme={null} kubectl create namespace ``` For ParadeDB Enterprise, you should have received an enterprise Docker username and personal access token. The following step passes these credentials to Kubernetes and should be skipped if you are deploying ParadeDB Community. 
```bash ParadeDB Enterprise theme={null} kubectl create secret docker-registry paradedb-enterprise-registry-cred --namespace --docker-server="https://index.docker.io/v1/" --docker-username="" --docker-password="" ``` Finally, launch the ParadeDB cluster. ```bash theme={null} helm repo add paradedb https://paradedb.github.io/charts helm upgrade --atomic --install paradedb \ --namespace \ --values values.yaml \ --set cluster.monitoring.enabled=true \ paradedb/paradedb ``` ## Connect to the Cluster The command to connect to the primary instance of the cluster will be printed in your terminal. If you do not modify any settings, it will be: ```bash theme={null} kubectl --namespace paradedb exec --stdin --tty services/paradedb-rw -- bash ``` This will launch a Bash shell inside the instance. You can connect to the ParadeDB database via `psql` with: ```bash theme={null} psql -d paradedb ``` ## Connect to the Grafana Dashboard To connect to the Grafana dashboard for your cluster, we suggest port-forwarding the Kubernetes service running Grafana to localhost: ```bash theme={null} kubectl --namespace prometheus-community port-forward svc/prometheus-community-grafana 3000:80 ``` You can then access the Grafana dashboard at `localhost:3000` using `admin` as the username and `prom-operator` as the password. These default credentials are defined in the [`kube-stack-config.yaml`](https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/main/docs/src/samples/monitoring/kube-stack-config.yaml) file used as the `values.yaml` file in [Install the Prometheus Stack](#install-the-prometheus-stack) and can be modified by providing your own `values.yaml` file. A more detailed guide on monitoring the cluster can be found in the [CloudNativePG documentation](https://cloudnative-pg.io/docs/1.28/monitoring). # Schema Changes Source: https://docs.paradedb.com/deploy/self-hosted/logical-replication/configuration Handle DDL/schema changes when running ParadeDB as a logical replica This section assumes that you have successfully completed the [getting started](/deploy/self-hosted/logical-replication/getting-started) guide. ## Schema Changes ParadeDB leverages PostgreSQL’s built-in logical replication to provide flexible and efficient data synchronization, and is subject to the same limitations. A well-known caveat of logical replication is that schema changes (DDL commands) are not replicated. This means that any changes to the schema on the source database, such as adding new columns or tables, will not be automatically applied to the subscriber.
```sql theme={null} -- On Publisher ALTER TABLE mock_items ADD COLUMN num_stock int; INSERT INTO mock_items (description, category, in_stock, latest_available_time, last_updated_date, metadata, created_at, rating, num_stock) VALUES ('Green running shoes', 'Footwear', true, '14:00:00', '2024-07-09', '{}', '2024-07-09 14:00:00', 2, 900); -- On Subscriber ERROR: logical replication target relation "public.mock_items" is missing some replicated columns ``` To work around this, pause the subscription on the subscriber, manually apply the schema changes, then resume the subscription: ```sql theme={null} -- On Subscriber ALTER SUBSCRIPTION mock_items_sub DISABLE; ALTER TABLE mock_items ADD COLUMN num_stock int; ALTER SUBSCRIPTION mock_items_sub ENABLE; ``` If new tables are added and your publication is not `FOR ALL TABLES`, add them to the publication manually: ```sql theme={null} -- On Publisher ALTER PUBLICATION mock_items_pub ADD TABLE newly_added_table; -- On Subscriber ALTER SUBSCRIPTION mock_items_sub REFRESH PUBLICATION; ``` # Configuring Logical Replication Source: https://docs.paradedb.com/deploy/self-hosted/logical-replication/getting-started Set up ParadeDB as a logical replica of an existing Postgres like Amazon RDS In order for ParadeDB to run as a logical replica, ParadeDB must be using Postgres 17+. Logical replication can be configured to replicate specific tables, column lists, or rows. Please see the [Postgres documentation](https://www.postgresql.org/docs/current/logical-replication.html). In production, ParadeDB is designed to be deployed as a logical replica of your primary Postgres. This allows ParadeDB to stay in sync with managed Postgres providers like AWS RDS, Google CloudSQL, and Azure Postgres with zero additional infrastructure overhead. It also isolates search and analytical workloads from your primary Postgres, which can prevent downtime caused by long-running queries. ParadeDB supports logical replication from any primary Postgres. This could be AWS RDS/Aurora, Azure Database, GCP CloudSQL for PostgreSQL, AlloyDB for PostgreSQL, or a self-hosted Postgres. ## AWS RDS/Aurora Please follow AWS' [tutorial for configuring logical replication](https://aws.amazon.com/blogs/database/using-logical-replication-to-replicate-managed-amazon-rds-for-postgresql-and-amazon-aurora-to-self-managed-postgresql/). In this tutorial, AWS RDS/Aurora is the **publisher** and ParadeDB is the **subscriber**. ## Azure Database Please follow Azure's [tutorial for configuring logical replication](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-logical). In this tutorial, Azure Database is the **publisher** and ParadeDB is the **subscriber**. Azure Cosmos DB for PostgreSQL [does not support logical replication](https://learn.microsoft.com/en-us/answers/questions/1193391/does-azure-cosmos-db-for-postgresql-support-logica). ## GCP CloudSQL for PostgreSQL Please follow Google's [tutorial for configuring logical replication](https://cloud.google.com/sql/docs/postgres/replication/configure-logical-replication#set-up-native-postgresql-logical-replication). In this tutorial, GCP CloudSQL for PostgreSQL is the **publisher** and ParadeDB is the **subscriber**. ## AlloyDB for PostgreSQL Please follow Google's [tutorial for configuring logical replication](https://cloud.google.com/alloydb/docs/omni/replicate-data-omni-other-db). In this tutorial, AlloyDB is the **publisher** and ParadeDB is the **subscriber**.
## Self-Hosted Postgres This guide demonstrates how to configure replication between a self-hosted primary Postgres and ParadeDB. ### Environment Setup We'll use the following environment setup for our demonstration: **Primary Server** * **OS**: Ubuntu 24.04 * **IP**: 192.168.0.30 * **Database Name**: `marketplace` * **Replication User**: `replicator` * **Replication Password**: `passw0rd` **Standby Servers** * **OS**: Ubuntu 24.04 * **IP**: `192.168.0.31`, `192.168.0.32` ### Primary Server Setup #### 1. Edit `postgresql.conf` The `postgresql.conf` file is the main configuration file for PostgreSQL. It contains all the server settings that control the behavior and performance of your PostgreSQL instance. Ensure that your `postgresql.conf` has the following settings applied: ```ini theme={null} listen_addresses = 'localhost,192.168.0.30' max_wal_senders = 10 ``` * `listen_addresses` specifies the IP addresses on which PostgreSQL listens for connections. By default, PostgreSQL only listens on `localhost`. To allow other servers (like your standby servers) to connect for replication, you need to include their IP addresses. * `max_wal_senders` determines the maximum number of concurrent connections that can send WAL (Write-Ahead Log) data. If you are running `pg_search` on the primary server, make sure to add it to `shared_preload_libraries` if your Postgres version is less than 17. If you are installing it only on the standby server as a search replica, you should skip this step. ```ini theme={null} # Include this only if pg_search is installed on the primary shared_preload_libraries = 'pg_search' ``` #### 2. Edit `pg_hba.conf` The `pg_hba.conf` file (PostgreSQL Host-Based Authentication file) controls client authentication, defining who can connect to the PostgreSQL server, how they can connect, and from where. It ensures that only authorized users and systems can access the database. We need to allow the replication user to connect from the standby servers' IP addresses. Add the following lines to allow replication from the local network and localhost. This configuration ensures that the replication user can connect from the specified IP range. ```ini theme={null} local replication all peer host replication all 127.0.0.1/32 scram-sha-256 host replication all ::1/128 scram-sha-256 host replication all 192.168.0.0/24 scram-sha-256 ``` #### 3. Create the Replication User In PostgreSQL, a user is required to handle replication connections securely. This user needs specific permissions to read the WAL and perform replication tasks. Creating a dedicated replication user ensures that the replication process is secure and isolated from other database operations. This user will have the necessary permissions to perform replication but will not have broader administrative privileges. ```bash theme={null} sudo -u postgres createuser --pwprompt --replication replicator ``` `--pwprompt` prompts you to enter a password for the new user. In this tutorial, we will be setting the password to `passw0rd`. `--replication` grants the replication privilege to the new user, allowing it to handle replication tasks. #### 4. 
Create Mock Table on Primary Create a database: ```bash theme={null} sudo -u postgres -H createdb marketplace ``` Insert some data: ```sql theme={null} CREATE TABLE mock_items ( id SERIAL PRIMARY KEY, description TEXT, rating INTEGER CHECK (rating BETWEEN 1 AND 5), category VARCHAR(255), in_stock BOOLEAN, metadata JSONB, created_at TIMESTAMP, last_updated_date DATE, latest_available_time TIME ); INSERT INTO mock_items (description, category, in_stock, latest_available_time, last_updated_date, metadata, created_at, rating) VALUES ('Red sports shoes', 'Footwear', true, '12:00:00', '2024-07-10', '{}', '2024-07-10 12:00:00', 1); ``` If you have installed `pg_search` on the primary, you can build a BM25 index over the table. ```sql theme={null} -- Skip this if pg_search is not installed on the primary CREATE INDEX mock_items_bm25_idx ON public.mock_items USING bm25 (id, description) WITH (key_field='id'); ``` ### Standby Server Setup Standby servers do not require any specific configuration, other than that the tables into which you are ingesting logical replication must start off empty. Configuring replication consists of two steps: * Running pg\_basebackup, which will create a snapshot of your primary onto your standby servers. This makes the initial replication much faster. * Subscribing via logical replication, which will create a subscription that continuously propagates changes on the primary server to the standby. #### 1. Run pg\_basebackup You should run `pg_basebackup` on your standby server to create a physical byte-for-byte replica of your primary cluster. The `--pgdata` directory specifies where the standby cluster will be created. The directory must exist, and must be empty. ```bash theme={null} mkdir -p /var/lib/postgresql/17/main pg_basebackup --create-slot --slot standby1 --host 192.168.0.30 --pgdata /var/lib/postgresql/17/main --progress --username replicator --write-recovery-conf pg_basebackup --create-slot --slot standby2 --host 192.168.0.30 --pgdata /var/lib/postgresql/17/main --progress --username replicator --write-recovery-conf ``` #### 2. Start PostgreSQL on Standby Server Start PostgreSQL on standby servers: ```bash theme={null} sudo systemctl start postgresql ``` #### 3. Verify Replication Now you can verify the replication. The data inserted on the primary server should appear on the standby servers, confirming that streaming replication with ParadeDB is working correctly. ```sql theme={null} SELECT * FROM mock_items WHERE description @@@ 'shoes'; ``` ### Logical Replication Setup Logical replication allows you to replicate changes to specific tables between PostgreSQL instances. After creating a physical replica with `pg_basebackup`, you can set up a subscription to propagate changes on the primary server to the standby. #### 1. Create Publication on Primary ```sql theme={null} CREATE PUBLICATION mock_items_pub FOR TABLE mock_items; ``` #### 2. Create Subscription on Standby ```sql theme={null} CREATE SUBSCRIPTION mock_items_sub CONNECTION 'host=192.168.0.30 port=5432 dbname=marketplace user=replicator password=passw0rd' PUBLICATION mock_items_pub; ``` #### 3. Insert Data on Publisher ```sql theme={null} INSERT INTO mock_items (description, category, in_stock, latest_available_time, last_updated_date, metadata, created_at, rating) VALUES ('Blue running shoes', 'Footwear', true, '14:00:00', '2024-07-10', '{}', '2024-07-10 14:00:00', 2); ``` #### 4.
Verify Replication Verify the replication on the Subscriber server: ```sql theme={null} SELECT * FROM mock_items WHERE description @@@ '"running shoes"'; ``` The data inserted on the Publisher server should appear on the Subscriber server. # Multi-Database Replication for Microservices Source: https://docs.paradedb.com/deploy/self-hosted/logical-replication/multi-database Consolidate multiple microservice databases into a single ParadeDB instance for app-wide search and cross-database joins ## Problem Statement Organizations often have multiple Postgres databases, each connected to a different microservice. The goal is to logically replicate all of these databases into a single ParadeDB instance. This enables: * App-wide search across all microservices * Cross-database joins for analytics and reporting * Centralized data access without modifying individual microservices However, table naming collisions can occur since each microservice and its database operate independently. ## Logical Replication Background Postgres' Logical Replication is designed from the perspective of one source database and one destination database. Logical replication carries `table_name` and `schema_name` as part of the WAL (Write-Ahead Log) being emitted. It does not have native primitives to allow any schema or table name mutations in the middle. ## Solution For logical replication to work, all source database tables need to have a unique signature that avoids name collisions. They also need to be identifiable by their source database. This can be achieved by using a different schema in each database instead of the `public` schema. The schema name should match the database name. ### Architecture The solution involves replicating multiple independent microservice databases into a single ParadeDB instance. Each source database uses a schema named after the database itself, ensuring no naming conflicts. Multi-database replication architecture As shown in the diagram: * Each microservice database (db1, db2, db3) uses a schema matching its database name * All databases replicate to a single ParadeDB instance via logical replication * In ParadeDB, tables are accessible with fully-qualified names (e.g., `db1.table1`, `db2.table1`) * This enables cross-database joins like: `SELECT db1.users.user_id FROM db1.users, db2.orders WHERE db1.users.id = db2.orders.user_id` Instead of having all tables in the `public` schema across multiple databases: ``` Database: users_service Schema: public - users - profiles Database: orders_service Schema: public - orders - payments ``` Reorganize each database to use a dedicated schema: ``` Database: users_service Schema: users_service - users - profiles Database: orders_service Schema: orders_service - orders - payments ``` This approach ensures that when replicated to ParadeDB, all tables have unique fully-qualified names and you can identify the source of each table. ## Zero-Downtime Migration This migration strategy reorganizes tables from the `public` schema into dedicated schemas while maintaining complete backwards compatibility through updatable views. ### Migration Steps For each microservice database, execute the following: ```sql theme={null} BEGIN; -- Create new schema named after the database CREATE SCHEMA IF NOT EXISTS ; -- Move tables to new schema ALTER TABLE public.table1 SET SCHEMA ; ALTER TABLE public.table2 SET SCHEMA ; -- Repeat for all tables... 
-- Create backwards-compatible views in public schema CREATE OR REPLACE VIEW public.table1 AS SELECT * FROM .table1; CREATE OR REPLACE VIEW public.table2 AS SELECT * FROM .table2; -- Repeat for all tables... COMMIT; ``` ### Example For a `users_service` database: ```sql theme={null} BEGIN; -- Create new schema CREATE SCHEMA IF NOT EXISTS users_service; -- Move tables ALTER TABLE public.users SET SCHEMA users_service; ALTER TABLE public.profiles SET SCHEMA users_service; -- Create backwards-compatible views CREATE OR REPLACE VIEW public.users AS SELECT * FROM users_service.users; CREATE OR REPLACE VIEW public.profiles AS SELECT * FROM users_service.profiles; COMMIT; ``` ### Benefits of This Approach * **Zero Downtime**: Existing applications continue to function without modification during the transition period for all queries (SELECT, INSERT, UPDATE, DELETE) * **Gradual Migration**: Application queries can be updated over time to reference the new schema directly * **Rollback Capability**: Each migration step is reversible if needed * **View Cleanup**: Once applications are updated, views in the `public` schema can be safely removed ### Setting Up Logical Replication After completing the schema migration for all source databases: 1. Configure each source database as a publisher following the [getting started guide](/deploy/self-hosted/logical-replication/getting-started) 2. Set up ParadeDB as a subscriber for all source databases 3. Create publications on each source database for their respective schemas: ```sql theme={null} -- On users_service database CREATE PUBLICATION users_pub FOR TABLES IN SCHEMA users_service; -- On orders_service database CREATE PUBLICATION orders_pub FOR TABLES IN SCHEMA orders_service; ``` 4. Create subscriptions on ParadeDB for each source database: ```sql theme={null} -- On ParadeDB instance CREATE SUBSCRIPTION users_sub CONNECTION 'host=users_db port=5432 dbname=users_service user=replicator password=...' PUBLICATION users_pub; CREATE SUBSCRIPTION orders_sub CONNECTION 'host=orders_db port=5432 dbname=orders_service user=replicator password=...' PUBLICATION orders_pub; ``` ## Trade-offs ### Pros * **Multi Database BM25 Search**: Perform full-text search across tables distributed across multiple microservice databases in a single query * **Avoid Distributed Joins in Application**: Execute cross-database joins directly in ParadeDB instead of implementing complex join logic in your application * **Simple Architecture**: Uses standard PostgreSQL logical replication without extra infrastructure * **Namespace Isolation**: Schema-based separation prevents naming conflicts * **No Source Database Changes**: Microservices continue operating independently; ParadeDB acts as a read replica ### Cons * Source databases will access tables from their dedicated schema (e.g., `users_service`) instead of `public` * Requires coordination across microservice teams for initial migration * Existing database tooling may need configuration updates to work with non-public schemas # Installing Third Party Extensions Source: https://docs.paradedb.com/deploy/third-party-extensions How to install additional extensions into ParadeDB [Foreign data wrapper](https://www.postgresql.org/docs/current/ddl-foreign-data.html) extensions can be used to query AWS S3 and other external data stores directly from ParadeDB. Postgres has a rich ecosystem of extensions. ParadeDB is designed to work alongside other PostgreSQL extensions for a complete data platform. 
## Pre-installed Extensions To keep the ParadeDB Docker image size manageable, the following extensions are pre-installed: * **`pg_search`** — Full-text and hybrid search with BM25 * **`pgvector`** — Vector similarity search * **`postgis`** — Geospatial queries and indexing * **`pg_ivm`** — Incremental materialized views * **`pg_cron`** — Scheduled jobs and background tasks `pg_cron` is configured on the default `postgres` database and cannot be changed. ## Compatible Extensions ParadeDB has been tested with and supports the following popular extensions: * **[Citus](/deploy/citus)** — Distributed PostgreSQL for horizontal scaling * **`pg_partman`** — Automated partition management * **`pg_stat_statements`** — Query performance monitoring * **`postgres_fdw`** — Foreign data wrappers for federated queries If you encounter any issues with extension compatibility, please [open an issue](https://github.com/paradedb/paradedb/issues) or reach out to our [community](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-32abtyjg4-yoYoi~RPh9MSW8tDbl0BQw). ## Installing Third Party Extensions The process for installing an extension varies by extension. Generally speaking, it requires: * Download the prebuilt binaries inside ParadeDB * Install the extension binary and any dependencies inside ParadeDB * Add the extension to `shared_preload_libraries` in `postgresql.conf`, if required by the extension * Run `CREATE EXTENSION ` We recommend installing third party extensions from prebuilt binaries to keep the image size small. As an example, let's install [pg\_partman](https://github.com/pgpartman/pg_partman), an extension for managing table partition sets. ### Install Prebuilt Binaries First, enter a shell with root permissions in the ParadeDB image. ```bash theme={null} docker exec -it --user root paradedb bash ``` This command assumes that your ParadeDB container name is `paradedb`. Next, install the [prebuilt binaries](https://pkgs.org/search/?q=partman). Most popular Postgres extensions can be installed with `apt-get install`. ```bash theme={null} apt-get update apt-get install -y --no-install-recommends postgresql-17-partman ``` If the extension is not available with `apt-get install`, you can usually `curl` the prebuilt binary from a GitHub Release page. You will need to first install `curl` via `apt-get install` if you are taking this approach. ### Add to `shared_preload_libraries` If you are installing an extension which requires this step, you can do so via the following command, replacing `` with your extension's name: ```bash theme={null} sed -i "/^shared_preload_libraries/s/'\([^']*\)'/'\1,'/" /var/lib/postgresql/data/postgresql.conf ``` For `pg_partman`, the command is: ```bash theme={null} sed -i "/^shared_preload_libraries/s/'\([^']*\)'/'\1,pg_partman_bgw'/" /var/lib/postgresql/data/postgresql.conf ``` Postgres must be restarted afterwards. We recommend simply restarting the Docker container. ### Create the Extension Connect to ParadeDB via `psql` and create the extension. ```sql theme={null} CREATE EXTENSION pg_partman; ``` `pg_partman` is now ready to use! Note that this is a simple example of installing `pg_partman`. The full list of settings and optional dependencies can be found in the [official installation instructions](https://github.com/pgpartman/pg_partman?tab=readme-ov-file#installation). 
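To confirm which extensions are installed and at what version, you can query the Postgres catalog (a standard catalog query, not specific to ParadeDB):

```sql theme={null}
-- List installed extensions and their versions
SELECT extname, extversion FROM pg_extension ORDER BY extname;
```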
# Upgrading ParadeDB Source: https://docs.paradedb.com/deploy/upgrading How to update ParadeDB to the latest version ## Overview ParadeDB ships its functionality inside a Postgres extension, `pg_search`. Upgrading ParadeDB is as simple as updating the `pg_search` extension. ParadeDB uses `pgvector` for vector search. This extension is not managed by ParadeDB. Please refer to the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#upgrading) for instructions on how to upgrade it. ## Getting the Current Version To inspect the current version of an extension, run the following command. ```sql theme={null} SELECT extversion FROM pg_extension WHERE extname = 'pg_search'; ``` Verify that it matches `paradedb.version_info()`: ```sql theme={null} SELECT * FROM paradedb.version_info(); ``` The reason there are two statements is that `paradedb.version_info()` returns the actual version of `pg_search` that is installed, whereas `pg_extension` is what Postgres' catalog thinks the version of the extension is. If `paradedb.version_info()` is greater than `pg_extension`, it typically means that `ALTER EXTENSION` was not run after the previous upgrade, and that the SQL upgrade scripts were not applied. If `pg_extension` is greater than `paradedb.version_info()`, it means that the extension didn't fully upgrade, and that Postgres needs to be restarted. ## Getting the Latest Version The latest version of `pg_search` is `0.21.8`. Please refer to the [releases](https://github.com/paradedb/paradedb/releases) page for all available versions of `pg_search`. ## Updating ParadeDB ### Helm Chart To upgrade the ParadeDB Helm chart: 1. Update the `paradedb` chart to the latest version. ```bash theme={null} helm repo update ``` 2. Get the latest version of the `paradedb` chart. ```bash theme={null} helm search repo paradedb ``` 3. Get the latest version of the ParadeDB extension, which is the value of `version.paradedb` in the chart [README](https://github.com/paradedb/charts/tree/dev/charts/paradedb#values). 4. Run `helm upgrade` with the latest version of the chart and the latest version of the extension. ```bash theme={null} helm upgrade paradedb paradedb/paradedb --namespace paradedb --reuse-values --version --set version.paradedb= --atomic ``` Replace `` with the latest version of the chart and `` with the latest version of the extension. 5. If you are using [ParadeDB BYOC](/deploy/byoc), an automatic rollout will begin. One by one, the pods will be restarted to apply the new version of the extension. ### Docker Image To upgrade the ParadeDB Docker image while preserving your data volume: 1. Stop the running ParadeDB container via `docker stop paradedb`. 2. Run the following command to pull a specific version of the Docker image. You can set the version number to `latest` to pull the latest Docker image. You can find the full list of available tags on [Docker Hub](https://hub.docker.com/r/paradedb/paradedb/tags). ```bash theme={null} docker pull paradedb/paradedb:0.21.8 ``` The latest version of the Docker image should be `0.21.8`. 3. Remove the stopped container via `docker rm paradedb`, then start a new container from the updated image by re-running the `docker run` command from the [Docker guide](/deploy/self-hosted/docker) with the new tag. Because your data lives in the `paradedb_data` volume, it is preserved across containers. ### Self-Managed Postgres To upgrade the extensions running in a self-managed Postgres: 1. Stop Postgres (e.g. `pg_ctl stop -D `). 2. Download and install the extension you wish to upgrade in the same way that it was initially installed. 3. Start Postgres (e.g. `pg_ctl start -D `).
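As an illustration only, upgrading a self-managed installation on Ubuntu 24.04 (Noble) with Postgres 17 might look like the following; the release URL, package, and data directory are assumptions to adapt to your environment:

```bash theme={null}
# Stop Postgres (data directory assumed to be /var/lib/postgresql/17/main)
pg_ctl stop -D /var/lib/postgresql/17/main

# Download and install the new pg_search release, using the same method as the original install
curl -L "https://github.com/paradedb/paradedb/releases/download/v0.21.8/postgresql-17-pg-search_0.21.8-1PARADEDB-noble_amd64.deb" -o /tmp/pg_search.deb
sudo apt-get install -y /tmp/pg_search.deb

# Start Postgres again
pg_ctl start -D /var/lib/postgresql/17/main
```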
## Alter Extension After ParadeDB has been upgraded, connect to it and run the following command in all databases that `pg_search` is installed in. This step is required regardless of the environment that ParadeDB is installed in (Helm, Docker, or self-managed Postgres). ```sql theme={null} ALTER EXTENSION pg_search UPDATE TO '0.21.8'; ``` ## Verify the Upgrade After upgrading the extension and restarting Postgres, verify that the version numbers returned by the following commands match: ```sql theme={null} SELECT extversion FROM pg_extension WHERE extname = 'pg_search'; SELECT * FROM paradedb.version_info(); ``` If the two versions do not match, restart Postgres and try again.