Create an Index
Basic Usage
The following code block creates a BM25 index called search_idx
over multiple columns of the mock_items
table.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description') || paradedb.field('category'),
numeric_fields => paradedb.field('rating'),
boolean_fields => paradedb.field('in_stock'),
datetime_fields => paradedb.field('created_at'),
json_fields => paradedb.field('metadata')
);
The name of the index. The index name can be anything, as long as doesn’t conflict with an existing index or schema. A new schema with associated query functions will be created with this name.
The name of the table being indexed.
The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table.
The name of the schema, or namespace, of the table.
Field Types
In addition to text fields, numeric, datetime, boolean, and JSON fields can also be indexed. For optimal performance, we recommend indexing all fields that are relevant to your search query.
In the query above, both description
and rating
were indexed to efficiently find rows where description
matches a query and rating
is above a certain number.
SELECT id, description, rating
FROM search_idx.search('description:shoes AND rating:>3');
Text Fields
Columns of type VARCHAR
, TEXT
, UUID
, and their corresponding array types can be indexed as text fields.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description', indexed => true)
);
paradedb.field
accepts the following configuration options for text fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Fieldnorms store information about the length of the text field. Must be
true
to calculate the BM25 score.
See tokenizers for how to configure the tokenizer.
Describes the amount of information indexed. See record for a list of available record types.
The name of the tokenizer used for fast fields. This field is ignored unless
fast=true
. See
normalizers for a list of
available normalizers.
Numeric Fields
Columns of type SMALLINT
, INTEGER
, BIGINT
, OID
, REAL
, DOUBLE PRECISION
, NUMERIC
, and their corresponding array types can be indexed
as numeric_fields
. The main reason to index a numeric field is if it is used for filtering or aggregations as part of the search query.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
numeric_fields => paradedb.field('rating', indexed => true)
);
paradedb.field
accepts the following configuration options numeric fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Boolean Fields
Columns of type BOOLEAN
and BOOLEAN[]
can be indexed as boolean_fields
. Indexing a boolean field is useful if
it is used for filtering as part of the search query.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
boolean_fields => paradedb.field('in_stock')
);
paradedb.field
accepts several configuration options for boolean fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
JSON Fields
Columns of type JSON
and JSONB
can be indexed as json_fields
. Once indexed, search can be
performed on nested text fields within JSON values.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
json_fields => paradedb.field('metadata')
);
paradedb.field
accepts several configuration options for JSON fields:
Whether the field is indexed. Must be true
in order for the field to be tokenized and
searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast
set to true
.
Fast fields are also useful for accelerated scoring and filtering.
If true
, JSON keys containing a .
will be expanded. For instance, if expand_dots
is true
,
{"metadata.color": "red"}
will be indexed as if it was {"metadata": {"color": "red"}}
.
See tokenizers for how to configure the tokenizer.
Describes the amount of information indexed. See record for a list of available record types.
The name of the tokenizer used for fast fields. This field is ignored unless
fast=true
. See normalizers for a list of available
normalizers.
Datetime Fields
Columns of type DATE
, TIMESTAMP
, TIMESTAMPTZ
, TIME
, TIMETZ
, and their corresponding array types can be indexed as datetime_fields
.
Indexing a datetime field is useful if it is used for filtering as part of the search query.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
datetime_fields => paradedb.field('created_at')
);
paradedb.field
accepts several configuration options for boolean fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Array Fields
text_fields
, numeric_fields
, boolean_fields
, and datetime_fields
support array types. For instance, columns of
type TEXT[]
can be passed into text_fields
, and columns of type INT[]
can be passed into numeric_fields
.
The only exception are JSON[]
and JSONB[]
types, which are not yet
supported.
Multiple Fields
The ||
operator can be used to index multiple fields.
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description') || paradedb.field('category')
);
Choosing a Key Field
The key_field
option is used to uniquely identify documents within an index and cannot be tokenized. For instance,
the following configuration is not allowed:
-- This will throw an error
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'description',
text_fields => paradedb.field('description')
);
Legacy Syntax
The paradedb.field
and paradedb.tokenizer
functions were introduced in 0.8.6
. These functions are
syntactic sugar for generating JSONB
. Users that prefer the old syntax can still pass jsonb
into _fields
.
-- These two queries are equivalent
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => '{"description": {"tokenizer": {"type": "default"}}}'
);
CALL paradedb.create_bm25(
index_name => 'search_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field(
'description',
tokenizer => paradedb.tokenizer('default')
)
);
Was this page helpful?