Overview

A BM25 index must be created over table’s columns before they can be searched. This index is strongly consistent, which means that new data is immediately searchable across all connections. Once an index is created, it automatically stays in sync with the underlying table as the data changes.

Basic Usage

The following code block creates an index called search_idx over the description and rating columns of the mock_items table, which was created in the quickstart.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description'),
  numeric_fields => paradedb.field('rating')
);
index_name
required

The name of the index. The index name can be anything, as long as doesn’t conflict with an existing index or schema. A new schema with associated query functions will be created with this name.

table_name
required

The name of the table being indexed.

key_field
required

The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table. In version 0.7.5 and earlier, only integer IDs were supported. From version 0.7.6 onwards, non-integer IDs are also supported..

schema_name
default: "CURRENT SCHEMA"

The name of the schema, or namespace, of the table.

This example query will create a schema called search_idx, which contains a search function.

SELECT * FROM search_idx.search('description:keyboard');

Field Types

Text Fields

Columns of type VARCHAR, TEXT, UUID, VARCHAR[], and TEXT[] can be indexed as text fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description', indexed => true)
);

paradedb.field accepts the following configuration options for text fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: false

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

fieldnorms
default: true

Fieldnorms store information about the length of the text field. Must be true to calculate the BM25 score.

tokenizer
default: "default"

A JSONB produced by paradedb.tokenizer which specifies the tokenizer and tokenizer configuration options. See tokenizers for a list of available tokenizers.

record
default: "position"

Describes the amount of information indexed. See record for a list of available record types.

normalizer
default: "raw"

The name of the tokenizer used for fast fields. This field is ignored unless fast=true. See normalizers for a list of available normalizers.

Numeric Fields

Columns of type SMALLINT, INTEGER, BIGINT, OID, REAL, DOUBLE PRECISION, and NUMERIC can be indexed as numeric_fields. The main reason to index a numeric field is if it is used for filtering or aggregations as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('rating', indexed => true)
);

paradedb.field accepts the following configuration options numeric fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

Boolean Fields

Columns of type BOOLEAN can be indexed as boolean_fields. Indexing a boolean field is useful if it is used for filtering as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  boolean_fields => paradedb.field('in_stock')
);

paradedb.field accepts several configuration options for boolean fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

JSON Fields

Columns of type JSON and JSONB can be indexed as json_fields. Once indexed, search can be performed on nested text fields within JSON values.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  json_fields => paradedb.field('metadata')
);

paradedb.field accepts several configuration options for JSON fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: false

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

expand_dots
default: true

If true, JSON keys containing a . will be expanded. For instance, if expand_dots is true, {"metadata.color": "red"} will be indexed as if it was {"metadata": {"color": "red"}}.

tokenizer
default: "default"

The name of the tokenizer. See tokenizers for a list of available tokenizers.

record
default: "position"

Describes the amount of information indexed. See record for a list of available record types.

normalizer
default: "raw"

The name of the tokenizer used for fast fields. This field is ignored unless fast=true. See normalizers for a list of available normalizers.

Datetime Fields

Columns of type DATE, TIMESTAMP, TIMESTAMPTZ, TIME, and TIMETZ can be indexed as datetime_fields. Indexing a datetime field is useful if it is used for filtering as part of the search query.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  datetime_fields => paradedb.field('created_at')
);

paradedb.field accepts several configuration options for boolean fields:

indexed
default: true

Whether the field is indexed. Must be true in order for the field to be tokenized and searchable.

stored
default: true

Whether the original value of the field is stored.

fast
default: true

Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast set to true. Fast fields are also useful for accelerated scoring and filtering.

Multiple Fields

The || operator can be used to index multiple fields.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category')
);

Configuration Options

Tokenizers

Tokenizers are responsible for breaking down text fields into smaller, searchable components called tokens. Once a field is tokenized, all search queries against that field are automatically tokenized the same way.

The following code block demonstrates the syntax for specifying a tokenizer for a text field.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    name => 'description',
    tokenizer => paradedb.tokenizer('en_stem')
  )
);

Some tokenizers like the stem and ngram tokenizer accept configuration options. They can be passed to the paradedb.tokenizer function.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    name => 'description',
    tokenizer => paradedb.tokenizer('stem', language => 'French')
  )
);

The following tokenizer names are accepted by the paradedb.tokenizer function.

default

Chops the text on according to whitespace and punctuation, removes tokens that are too long, and converts to lowercase. Filters out tokens larger than 255 bytes.

raw

Does not process nor tokenize text. Filters out tokens larger than 255 bytes.

en_stem

Like default, but also applies stemming on the resulting tokens. Filters out tokens larger than 255 bytes.

whitespace

Tokenizes the text by splitting on whitespaces.

ngram

Tokenizes text by splitting words into overlapping substrings based on the specified parameters.

chinese_compatible

Tokenizes text considering Chinese character nuances. Splits based on whitespace and punctuation. Filters out tokens larger than 255 bytes.

chinese_lindera

Tokenizes text using the Lindera tokenizer, which uses the CC-CEDICT dictionary to segment and tokenize text.

korean_lindera

Tokenizes text using the Lindera tokenizer, which uses the KoDic dictionary to segment and tokenize text.

japanese_lindera

Tokenizes text using the Lindera tokenizer, which uses the IPADIC dictionary to segment and tokenize text.

icu

Tokenizes text using the ICU tokenizer, which uses Unicode Text Segmentation and is suitable for tokenizing most languages.

stem

Applies multi-language stemming to tokens, filtering out those larger than 255 bytes. The language can be specified with the language parameter.

Fast Fields

A field that is indexed as fast is stored in a column-oriented fashion. Fast fields are necessary for aggregations/faceting. They can also improve the query times of filters and BM25 scoring.

The following code block demonstrates how to specify a fast field.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  numeric_fields => paradedb.field('rating', fast => true)
);

Normalizers

Normalizers specify how text and JSON fast fields should be processed. This option is ignored over any other type of field.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('category', fast => true, normalizer => 'raw')
);
raw

Does not process nor tokenize text. Filters out tokens larger than 255 bytes.

lowercase

Applies a lowercase transformation on the text. Filters token larger than 255 bytes.

Record

The record option specifies how much information is recorded with an indexed field.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description', record => 'position')
);
basic
Records only the document IDs.
freq

Records the document IDs as well as term frequency. This is useful for BM25 scoring.

position

Records the document ID, term frequency and positions of occurrences. Required to run a phrase query.

Deleting a BM25 Index

The following command deletes a BM25 index, as well as its associated schema and query functions:

CALL paradedb.drop_bm25(index_name => '<index_name>');
index_name
required

The name of the index you wish to delete.

schema_name
default: "CURRENT SCHEMA"

The name of the schema that the index was created in.

Inspecting a BM25 Index

The schema function returns a table with information about the index schema.

SELECT * FROM <index_name>.schema();
index_name
required

The name of the index.

Partial BM25 Index

The following code block demonstrates how to pass predicates to create_bm25 to construct a partial index. Partial indexes are useful for reducing index size on disk and improving update speeds over non-indexed rows.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description'),
  predicates => 'category = ''Electronics'' AND rating > 2'
);

Legacy Syntax

The paradedb.field and paradedb.tokenizer functions were introduced in 0.8.6. These functions are syntactic sugar for generating JSONB. Users that prefer the old syntax can still pass jsonb into _fields.

-- These two queries are equivalent
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => '{"description": {"tokenizer": {"type": "en_stem"}}}'
);

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    'description',
    tokenizer => paradedb.tokenizer('en_stem')
  )
);