Overview

A BM25 index must be created over a table before it can be searched. This index is strongly consistent, which means that new data is immediately searchable across all connections. Once an index is created, it automatically stays in sync with the underlying table as the data changes.

Creating a BM25 Index

The following command creates a BM25 index over a table along with a new schema containing query functions.

CALL paradedb.create_bm25(
  index_name => '<index_name>',
  table_name => '<table_name>',
  key_field => '<key_field>',
  text_fields => '<text_fields>',
  numeric_fields => '<numeric_fields>',
  boolean_fields => '<boolean_fields>',
  json_fields => '<json_fields>',
  datetime_fields => '<datetime_fields>'
);

The index_name input will become the name of the new schema that is created. Querying that schema makes use of an “object.method” syntax, with methods like search, rank, and highlight.
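For example, with an index named search_idx, queries are issued against functions in the schema of the same name (the index, table, and field names here are illustrative):

-- Full-text search via the generated schema's search method
SELECT * FROM search_idx.search('description:keyboard');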

Each _fields input to create_bm25() accepts a JSON5-formatted string. Keys don’t need to be quoted, and trailing commas and comments are allowed. JSON5 is backwards-compatible, so standard JSON works too.
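As a sketch, a call indexing a hypothetical mock_items table might look like the following (column names are illustrative; the config options accepted for each field are described below):

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  -- JSON5: unquoted keys and trailing commas are allowed
  text_fields => '{description: {}, category: {},}',
  numeric_fields => '{rating: {}}'
);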

index_name
required

The name of the index. The index name can be anything, as long as it doesn't conflict with an existing index or schema. A new schema with associated query functions will be created with this name.

table_name
required

The name of the table being indexed.

key_field
required

The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table. Currently, only integer IDs are supported; if necessary, you can create a dedicated column for this purpose, e.g. ALTER TABLE mock_items ADD COLUMN bm25_id SERIAL.

schema_name
default: "CURRENT SCHEMA"

The name of the schema, or namespace, of the table.

text_fields

A JSON5 string which specifies which text columns should be indexed and how they should be indexed. Keys are the names of columns, and values are config options. Accepts columns of type varchar, text, varchar[], and text[].
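For instance, a text field's config can select a tokenizer from the Tokenizers section below (names here are illustrative):

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  -- en_stem applies English stemming to each token (see Tokenizers)
  text_fields => '{description: {tokenizer: {type: "en_stem"}}}'
);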

numeric_fields

A JSON5 string which specifies which numeric columns should be indexed and how they should be indexed. Keys are the names of columns, and values are config options. Accepts columns of type int2, int4, int8, oid, xid, float4, float8, and numeric.

boolean_fields

A JSON5 string which specifies which boolean columns should be indexed and how they should be indexed. Keys are the names of columns, and values are config options. Accepts columns of type boolean.

json_fields

A JSON5 string which specifies which JSON columns should be indexed and how they should be indexed. Keys are the names of columns, and values are config options. Accepts columns of type json and jsonb. Once indexed, search can be performed on nested text fields within JSON values.
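For instance, if an indexed jsonb column named metadata contains values like {"color": "white"}, the nested field can be queried with a dotted path (a sketch with illustrative names; the exact path syntax is covered in the query documentation):

SELECT * FROM search_idx.search('metadata.color:white');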

datetime_fields

A JSON5 string which specifies which datetime columns should be indexed and how they should be indexed. Keys are the names of columns, and values are config options. Accepts columns of type date, timestamp, timestamptz, time, and timetz. Search terms default to UTC if no time zone is specified, and must be in RFC 3339 format when passed to the search function.

-- To demonstrate time zones, all of these queries are equivalent
SELECT * FROM bm25_search.search('created_at:"2023-05-01T09:12:34Z"');
SELECT * FROM bm25_search.search('created_at:"2023-05-01T04:12:34-05:00"');
SELECT * FROM bm25_search.search(
  query => paradedb.term(
    field => 'created_at',
    value => TIMESTAMP '2023-05-01 09:12:34'
  )
);
SELECT * FROM bm25_search.search(
  query => paradedb.term(
    field => 'created_at',
    value => TIMESTAMP WITH TIME ZONE '2023-05-01 04:12:34 EST'
  )
);

Deleting a BM25 Index

The following command deletes a BM25 index, as well as its associated schema and query functions:

CALL paradedb.drop_bm25('<index_name>');
index_name
required

The name of the index you wish to delete.

Recreating a BM25 Index

A BM25 index only needs to be recreated if the underlying table schema changes — for instance, if a new column is added or the name of a column changes. To recreate the index, simply delete the index and create a new one using the commands provided above.
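For example, assuming an index named search_idx over a hypothetical mock_items table, recreating it is a drop followed by a fresh create:

-- Drop the existing index, its schema, and its query functions
CALL paradedb.drop_bm25('search_idx');

-- Recreate it against the updated table schema
CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => '{description: {}}'
);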

Getting Info on a BM25 Index

The schema function returns a table with information about the index schema.

SELECT * FROM <index_name>.schema();
index_name
required

The name of the index.

Tokenizers

default

Splits the text on whitespace and punctuation, removes tokens that are too long, and converts to lowercase. Filters out tokens larger than 255 bytes.

raw

Does not process or tokenize the text. Filters out tokens larger than 255 bytes.

en_stem

Like default, but also applies stemming on the resulting tokens. Filters out tokens larger than 255 bytes.

whitespace

Tokenizes the text by splitting on whitespace.

ngram

Tokenizes text by splitting words into overlapping substrings based on the specified parameters:

min_gram: Defines the minimum length for the n-grams. For instance, if set to 2, the smallest token created would be of length 2 characters.

max_gram: Determines the maximum length of the n-grams. If set to 5, the largest token produced would be of length 5 characters.

prefix_only: When set to true, the tokenizer generates n-grams that start from the beginning of the word only, ensuring a prefix progression. If false, n-grams are created from all possible character combinations within the min_gram and max_gram range.
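As a worked example: with min_gram set to 2, max_gram set to 3, and prefix_only false, the word "key" produces the tokens "ke", "ey", and "key"; with prefix_only true, only "ke" and "key" remain. A hypothetical configuration (index and column names are illustrative):

CALL paradedb.create_bm25(
  index_name => 'ngram_idx',
  table_name => 'mock_items',
  key_field => 'id',
  -- Index 2- and 3-character n-grams from every position in each word
  text_fields => '{description: {tokenizer: {type: "ngram", min_gram: 2, max_gram: 3, prefix_only: false}}}'
);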

chinese_compatible

Tokenizes text considering Chinese character nuances. Splits based on whitespace and punctuation. Filters out tokens larger than 255 bytes.

chinese_lindera

Tokenizes text using the Lindera tokenizer, which uses the CC-CEDICT dictionary to segment and tokenize text.

korean_lindera

Tokenizes text using the Lindera tokenizer, which uses the KoDic dictionary to segment and tokenize text.

japanese_lindera

Tokenizes text using the Lindera tokenizer, which uses the IPADIC dictionary to segment and tokenize text.

icu

Tokenizes text using the ICU tokenizer, which uses Unicode Text Segmentation and is suitable for tokenizing most languages.

Normalizers

raw

Does not process or tokenize the text. Filters out tokens larger than 255 bytes.

lowercase

Applies a lowercase transformation on the text. Filters out tokens larger than 255 bytes.

Records

basic

Records only the document IDs.

freq

Records the document IDs as well as term frequencies.

position

Records the document IDs, term frequencies, and positions of occurrences.
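Assuming the record strategy is set per text field via a record config key (a sketch; the option name is illustrative), choosing position stores the token positions needed for position-aware queries such as phrase matching:

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  -- Store document IDs, term frequencies, and token positions for this field
  text_fields => '{description: {record: "position"}}'
);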