Basic Usage

The following code block creates a BM25 index called search_idx over multiple columns of the mock_items table.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category'),
  numeric_fields => paradedb.field('rating'),
  boolean_fields => paradedb.field('in_stock'),
  datetime_fields => paradedb.field('created_at'),
  json_fields => paradedb.field('metadata')
);
index_name
required

The name of the index. The index name can be anything, as long as doesn’t conflict with an existing index or schema. A new schema with associated query functions will be created with this name.

table_name
required

The name of the table being indexed.

key_field
required

The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table.

schema_name
default: "CURRENT SCHEMA"

The name of the schema, or namespace, of the table.

Field Types

In addition to text fields, numeric, datetime, boolean, and JSON fields can also be indexed. For optimal performance, we recommend indexing all fields that are relevant to your search query.

In the query above, both description and rating were indexed to efficiently find rows where description matches a query and rating is above a certain number.

SELECT id, description, rating
FROM search_idx.search('description:shoes AND rating:>3');

Text Fields

Columns of type VARCHAR, TEXT, UUID, and their corresponding array types can be indexed as text fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description', indexed => true)
);

paradedb.field accepts the following configuration options for text fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: false

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

fieldnorms
default: true

Fieldnorms store information about the length of the text field. Must be true to calculate the BM25 score.

tokenizer

See tokenizers for how to configure the tokenizer.

record
default: "position"

Describes the amount of information indexed. See record for a list of available record types.

normalizer

The name of the tokenizer used for fast fields. This field is ignored unless fast=true. See normalizers for a list of available normalizers.

Numeric Fields

Columns of type SMALLINT, INTEGER, BIGINT, OID, REAL, DOUBLE PRECISION, NUMERIC, and their corresponding array types can be indexed as numeric_fields. The main reason to index a numeric field is if it is used for filtering or aggregations as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('rating', indexed => true)
);

paradedb.field accepts the following configuration options numeric fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

Boolean Fields

Columns of type BOOLEAN and BOOLEAN[] can be indexed as boolean_fields. Indexing a boolean field is useful if it is used for filtering as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  boolean_fields => paradedb.field('in_stock')
);

paradedb.field accepts several configuration options for boolean fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

JSON Fields

Columns of type JSON and JSONB can be indexed as json_fields. Once indexed, search can be performed on nested text fields within JSON values.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  json_fields => paradedb.field('metadata')
);

paradedb.field accepts several configuration options for JSON fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: false

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

expand_dots
default: true

If true, JSON keys containing a . will be expanded. For instance, if expand_dots is true, {"metadata.color": "red"} will be indexed as if it was {"metadata": {"color": "red"}}.

tokenizer

See tokenizers for how to configure the tokenizer.

record
default: "position"

Describes the amount of information indexed. See record for a list of available record types.

normalizer

The name of the tokenizer used for fast fields. This field is ignored unless fast=true. See normalizers for a list of available normalizers.

Datetime Fields

Columns of type DATE, TIMESTAMP, TIMESTAMPTZ, TIME, TIMETZ, and their corresponding array types can be indexed as datetime_fields. Indexing a datetime field is useful if it is used for filtering as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  datetime_fields => paradedb.field('created_at')
);

paradedb.field accepts several configuration options for boolean fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

Array Fields

text_fields, numeric_fields, boolean_fields, and datetime_fields support array types. For instance, columns of type TEXT[] can be passed into text_fields, and columns of type INT[] can be passed into numeric_fields.

The only exception are JSON[] and JSONB[] types, which are not yet supported.

Multiple Fields

The || operator can be used to index multiple fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category')
);

Choosing a Key Field

The key_field option is used to uniquely identify documents within an index and cannot be tokenized. For instance, the following configuration is not allowed:

-- This will throw an error
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'description',
  text_fields => paradedb.field('description')
);

Legacy Syntax

The paradedb.field and paradedb.tokenizer functions were introduced in 0.8.6. These functions are syntactic sugar for generating JSONB. Users that prefer the old syntax can still pass jsonb into _fields.

-- These two queries are equivalent
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => '{"description": {"tokenizer": {"type": "default"}}}'
);

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    'description',
    tokenizer => paradedb.tokenizer('default')
  )
);