Basic Usage

You can change how individual fields are tokenized and stored by passing JSON strings to the WITH clause of CREATE INDEX. For instance, the following statement configures an ngram tokenizer for the description field.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 3, "prefix_only": false}
        }
    }'
);

The key(s) of the JSON string correspond to field names, and the values are the configuration options. If a configuration option or field name is not specified, the default values are used (see all configuration options).

Configure Multiple Fields

To configure multiple fields, simply pass more keys to the JSON string. For instance, the following statement specifies tokenizers for both the description and category fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 3, "prefix_only": false}
        },
        "category": {
            "tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 3, "prefix_only": false}
        }
    }'
);

All Configuration Options

Text Fields

Options for columns of type VARCHAR, TEXT, UUID, and their corresponding array types should be passed to text_fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "fast": true,
          "tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 3, "prefix_only": false}
        }
    }'
);

The nested configuration JSON for text_fields accepts the following keys.

fast
default: false

See fast fields for when this option should be set to true.

tokenizer

See tokenizers for how to configure the tokenizer.

record
default: "position"

See record for a list of available record types.

normalizer

See normalizers for how to configure the normalizer.

JSON Fields

Options for columns of type JSON and JSONB should be passed to json_fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, metadata)
WITH (
  key_field = 'id',
  json_fields = '{
    "metadata": {
      "fast": true
    }
  }'
);

The nested configuration JSON for json_fields accepts the following keys.

fast
default: false

See fast fields for when this option should be set to true.

tokenizer

See tokenizers for how to configure the tokenizer.

record
default: "position"

See record for a list of available record types.

normalizer

See normalizers for how to configure the normalizer.

expand_dots
default: true

If true, JSON keys containing a . will be expanded. For instance, if expand_dots is true, {"metadata.color": "red"} will be indexed as if it was {"metadata": {"color": "red"}}.

Advanced Options

In addition to text and JSON, ParadeDB exposes options for numeric, datetime, boolean, range, and enum fields. For most use cases, it is not necessary to change these options.

Numeric Fields

Options for columns of type SMALLINT, INTEGER, BIGINT, OID, REAL, DOUBLE PRECISION, NUMERIC, and their corresponding array types should be passed to numeric_fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, rating)
WITH (
    key_field = 'id',
    numeric_fields = '{
        "rating": {"fast": true}
    }'
);

Boolean Fields

Options for columns of type BOOLEAN and BOOLEAN[] should be passed to boolean_fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, in_stock)
WITH (
  key_field = 'id',
  boolean_fields = '{
      "in_stock": {"fast": true}
  }'
);

CREATE_INDEX accepts several configuration options for boolean_fields:

Datetime Fields

Options for columns of type DATE, TIMESTAMP, TIMESTAMPTZ, TIME, TIMETZ, and their corresponding array types should be passed to datetime_fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, created_at)
WITH (
  key_field = 'id',
  datetime_fields = '{
      "created_at": {"fast": true}
  }'
);

CREATE INDEX accepts several configuration options for datetime_fields:

Range Fields

Options for columns of type int4range, int8range, numrange, tsrange, and tstzrange should be passed to range_fields. The range term query is used to filter over these fields.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, weight_range)
WITH (
  key_field = 'id',
  range_fields = '{
    "weight_range": {"stored": true}
  }'
);

CREATE INDEX accepts several configuration options for range_fields:

Enumerated Types

Options for custom Postgres enums should be passed to numeric_fields. Enums should be queried with term queries.

If the ordering of the enum is changed with ADD VALUE ... [ BEFORE | AFTER ], the BM25 index should be dropped and recreated to account for the new enum ordinal values.

Deprecated Syntax Migration

In ParadeDB v0.13.0, the old paradedb.create_bm25 function was deprecated in favor of the CREATE INDEX syntax. To make migration to the new CREATE INDEX syntax easier, a new paradedb.format_create_bm25 function has been introduced. This function accepts the same arguments as the deprecated paradedb.create_bm25 function and outputs an equivalent CREATE INDEX statement which can be copy, pasted, and executed.

SELECT * FROM paradedb.format_create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category'),
  numeric_fields => paradedb.field('rating'),
  boolean_fields => paradedb.field('in_stock'),
  datetime_fields => paradedb.field('created_at'),
  json_fields => paradedb.field('metadata'),
  range_fields => paradedb.field('weight_range')
);

paradedb.format_create_bm25 does not create the index. It simply outputs a CREATE INDEX statement for you to run.