Basic Usage

The following code block creates a BM25 index called search_idx over multiple columns of the mock_items table.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category'),
  numeric_fields => paradedb.field('rating'),
  boolean_fields => paradedb.field('in_stock'),
  datetime_fields => paradedb.field('created_at'),
  json_fields => paradedb.field('metadata'),
  range_fields => paradedb.field('weight_range')
);
index_name
required

The name of the index being created.

table_name
required

The name of the table being indexed.

key_field
required

The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table.

schema_name
default: "CURRENT SCHEMA"

The name of the schema, or namespace, of the table.

Indexing Settings

Several settings exist to control the amount of resources to allocate during index creation. While they have sensible defaults, overall indexing performance can be drastically improved if configured for the host system.

Field Types

In addition to text fields, numeric, datetime, boolean, and JSON fields can also be indexed. For optimal performance, we recommend indexing all fields that are relevant to your search query.

For instance, by indexing both description and rating, filters over rating can be pushed down to the full text query for optimal query speed.

SELECT id, description, rating
FROM mock_items
WHERE description @@@ 'shoes' AND rating @@@ '>3'
ORDER BY id;

Text Fields

Columns of type VARCHAR, TEXT, UUID, and their corresponding array types can be indexed as text fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description', indexed => true)
);

paradedb.field accepts the following configuration options for text fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: false

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

fieldnorms
default: true

Fieldnorms store information about the length of the text field. Must be true to calculate the BM25 score.

tokenizer

See tokenizers for how to configure the tokenizer.

record
default: "position"

Describes the amount of information indexed. See record for a list of available record types.

normalizer

The name of the tokenizer used for fast fields. This field is ignored unless fast=true. See normalizers for a list of available normalizers.

Numeric Fields

Columns of type SMALLINT, INTEGER, BIGINT, OID, REAL, DOUBLE PRECISION, NUMERIC, and their corresponding array types can be indexed as numeric_fields. The main reason to index a numeric field is if it is used for filtering or aggregations as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('rating', indexed => true)
);

paradedb.field accepts the following configuration options numeric fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

Boolean Fields

Columns of type BOOLEAN and BOOLEAN[] can be indexed as boolean_fields. Indexing a boolean field is useful if it is used for filtering as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  boolean_fields => paradedb.field('in_stock')
);

paradedb.field accepts several configuration options for boolean fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

JSON Fields

Columns of type JSON and JSONB can be indexed as json_fields. Once indexed, search can be performed on nested text fields within JSON values.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  json_fields => paradedb.field('metadata')
);

paradedb.field accepts several configuration options for JSON fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: false

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

expand_dots
default: true

If true, JSON keys containing a . will be expanded. For instance, if expand_dots is true, {"metadata.color": "red"} will be indexed as if it was {"metadata": {"color": "red"}}.

tokenizer

See tokenizers for how to configure the tokenizer.

record
default: "position"

Describes the amount of information indexed. See record for a list of available record types.

normalizer

The name of the tokenizer used for fast fields. This field is ignored unless fast=true. See normalizers for a list of available normalizers.

Datetime Fields

Columns of type DATE, TIMESTAMP, TIMESTAMPTZ, TIME, TIMETZ, and their corresponding array types can be indexed as datetime_fields. Indexing a datetime field is useful if it is used for filtering as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  datetime_fields => paradedb.field('created_at')
);

paradedb.field accepts several configuration options for boolean fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

Array Fields

text_fields, numeric_fields, boolean_fields, and datetime_fields support array types. For instance, columns of type TEXT[] can be passed into text_fields, and columns of type INT[] can be passed into numeric_fields.

The only exception are JSON[] and JSONB[] types, which are not yet supported.

Range Fields

Columns of type int4range, int8range, numrange, tsrange, and tstzrange can be indexed as range_fields. The range term query is used to filter over these fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  range_fields => paradedb.field('weight_range')
);
stored
default: true

Whether the original value of the field is stored.

Enumerated Types

Custom Postgres enums should be indexed as numeric_fields. This is because enums are stored as float values in Postgres.

CREATE TYPE color AS ENUM ('red', 'green', 'blue');
ALTER TABLE mock_items ADD COLUMN color color;

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('color')
);

Enums should be queried with term queries.

If the ordering of the enum is changed with ADD VALUE ... [ BEFORE | AFTER ], the BM25 index should be dropped and recreated to account for the new enum ordinal values.

Multiple Fields

The || operator can be used to index multiple fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category')
);

Choosing a Key Field

While the key field can be any unique text, numeric, or datetime value, an integer key field will be the most performant. The Postgres SERIAL type is an easy way to create a unique integer column.

Additionally, the key field is not intended to be tokenized. For instance, the following configuration is not allowed:

-- This will throw an error
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'description',
  text_fields => paradedb.field('description')
);

Legacy Syntax

The paradedb.field and paradedb.tokenizer functions were introduced in 0.8.6. These functions are syntactic sugar for generating JSONB. Users that prefer the old syntax can still pass jsonb into _fields.

-- These two queries are equivalent
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => '{"description": {"tokenizer": {"type": "default"}}}'
);

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    'description',
    tokenizer => paradedb.tokenizer('default')
  )
);