Index Creation
Overview
A BM25 index must be created over table’s columns before they can be searched. This index is strongly consistent, which means that new data is immediately searchable across all connections. Once an index is created, it automatically stays in sync with the underlying table as the data changes.
Basic Usage
The following code block creates an index called search_idx
over the description
and rating
columns of the mock_items
table, which was created in the quickstart.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description'),
numeric_fields => paradedb.field('rating')
);
The name of the index. The index name can be anything, as long as doesn’t conflict with an existing index or schema. A new schema with associated query functions will be created with this name.
The name of the table being indexed.
The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table. In version 0.7.5 and earlier, only integer IDs were supported. From version 0.7.6 onwards, non-integer IDs are also supported..
The name of the schema, or namespace, of the table.
This example query will create a schema called search_idx
, which contains a search
function.
SELECT * FROM search_idx.search('description:keyboard');
Field Types
Text Fields
Columns of type VARCHAR
, TEXT
, UUID
, VARCHAR[]
, and TEXT[]
can be indexed as text fields.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description', indexed => true)
);
paradedb.field
accepts the following configuration options for text fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Fieldnorms store information about the length of the text field. Must be
true
to calculate the BM25 score.
A JSONB
produced by paradedb.tokenizer
which specifies the tokenizer and
tokenizer configuration options. See tokenizers for a list of
available tokenizers.
Describes the amount of information indexed. See record for a list of available record types.
The name of the tokenizer used for fast fields. This field is ignored unless
fast=true
. See normalizers for a list of available
normalizers.
Numeric Fields
Columns of type SMALLINT
, INTEGER
, BIGINT
, OID
, REAL
, DOUBLE PRECISION
, and NUMERIC
can be indexed
as numeric_fields
. The main reason to index a numeric field is if it is used for filtering
or aggregations as part of the search query.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
numeric_fields => paradedb.field('rating', indexed => true)
);
paradedb.field
accepts the following configuration options numeric fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Boolean Fields
Columns of type BOOLEAN
can be indexed as boolean_fields
. Indexing a boolean field is useful if
it is used for filtering as part of the search query.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
boolean_fields => paradedb.field('in_stock')
);
paradedb.field
accepts several configuration options for boolean fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
JSON Fields
Columns of type JSON
and JSONB
can be indexed as json_fields
. Once indexed, search can be
performed on nested text fields within JSON values.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
json_fields => paradedb.field('metadata')
);
paradedb.field
accepts several configuration options for JSON fields:
Whether the field is indexed. Must be true
in order for the field to be tokenized and
searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast
set to true
.
Fast fields are also useful for accelerated scoring and filtering.
If true
, JSON keys containing a .
will be expanded. For instance, if expand_dots
is true
,
{"metadata.color": "red"}
will be indexed as if it was {"metadata": {"color": "red"}}
.
The name of the tokenizer. See tokenizers for a list of available tokenizers.
Describes the amount of information indexed. See record for a list of available record types.
The name of the tokenizer used for fast fields. This field is ignored unless fast=true
. See
normalizers for a list of available normalizers.
Datetime Fields
Columns of type DATE
, TIMESTAMP
, TIMESTAMPTZ
, TIME
, and TIMETZ
can be indexed as datetime_fields
.
Indexing a datetime field is useful if it is used for filtering as part of the search query.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
datetime_fields => paradedb.field('created_at')
);
paradedb.field
accepts several configuration options for boolean fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Multiple Fields
The ||
operator can be used to index multiple fields.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description') || paradedb.field('category')
);
Configuration Options
Tokenizers
Tokenizers are responsible for breaking down text fields into smaller, searchable components called tokens. Once a field is tokenized, all search queries against that field are automatically tokenized the same way.
The following code block demonstrates the syntax for specifying a tokenizer for a text field.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field(
name => 'description',
tokenizer => paradedb.tokenizer('en_stem')
)
);
Some tokenizers like the stem
and ngram
tokenizer accept configuration options. They can be passed to the
paradedb.tokenizer
function.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field(
name => 'description',
tokenizer => paradedb.tokenizer('stem', language => 'French')
)
);
The following tokenizer names are accepted by the paradedb.tokenizer
function.
Chops the text on according to whitespace and punctuation, removes tokens that are too long, and converts to lowercase. Filters out tokens larger than 255 bytes.
Does not process nor tokenize text. Filters out tokens larger than 255 bytes.
Like default
, but also applies stemming on the resulting tokens. Filters out
tokens larger than 255 bytes.
Tokenizes the text by splitting on whitespaces.
Tokenizes text by splitting words into overlapping substrings based on the specified parameters.
Tokenizes text considering Chinese character nuances. Splits based on whitespace and punctuation. Filters out tokens larger than 255 bytes.
Tokenizes text using the Lindera tokenizer, which uses the CC-CEDICT dictionary to segment and tokenize text.
Tokenizes text using the Lindera tokenizer, which uses the KoDic dictionary to segment and tokenize text.
Tokenizes text using the Lindera tokenizer, which uses the IPADIC dictionary to segment and tokenize text.
Tokenizes text using the ICU tokenizer, which uses Unicode Text Segmentation and is suitable for tokenizing most languages.
Applies multi-language stemming to tokens, filtering out those larger than 255 bytes. The language can be specified with the
language
parameter.
Fast Fields
A field that is indexed as fast
is stored in a column-oriented fashion. Fast fields are necessary for
aggregations/faceting. They can also improve the query times of filters
and BM25 scoring.
The following code block demonstrates how to specify a fast field.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
numeric_fields => paradedb.field('rating', fast => true)
);
Normalizers
Normalizers specify how text and JSON fast fields should be processed. This option is ignored over any other type of field.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('category', fast => true, normalizer => 'raw')
);
Does not process nor tokenize text. Filters out tokens larger than 255 bytes.
Applies a lowercase transformation on the text. Filters token larger than 255 bytes.
Record
The record
option specifies how much information is recorded with an indexed field.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description', record => 'position')
);
Records the document IDs as well as term frequency. This is useful for BM25 scoring.
Records the document ID, term frequency and positions of occurrences. Required to run a phrase query.
Deleting a BM25 Index
The following command deletes a BM25 index, as well as its associated schema and query functions:
CALL paradedb.drop_bm25(index_name => '<index_name>');
The name of the index you wish to delete.
The name of the schema that the index was created in.
Inspecting a BM25 Index
The schema
function returns a table with information about the index schema.
SELECT * FROM <index_name>.schema();
The name of the index.
Partial BM25 Index
The following code block demonstrates how to pass predicates to create_bm25
to construct a partial index. Partial
indexes are useful for reducing index size on disk and improving update speeds over non-indexed rows.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description'),
predicates => 'category = ''Electronics'' AND rating > 2'
);
Legacy Syntax
The paradedb.field
and paradedb.tokenizer
functions were introduced in 0.8.6
. These functions are
syntactic sugar for generating JSONB
. Users that prefer the old syntax can still pass jsonb
into _fields
.
-- These two queries are equivalent
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => '{"description": {"tokenizer": {"type": "en_stem"}}}'
);
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field(
'description',
tokenizer => paradedb.tokenizer('en_stem')
)
);
Was this page helpful?