Create an Index
Basic Usage
The following code block creates a BM25 index called search_idx
over multiple columns of the mock_items
table.
The name of the index being created.
The name of the table being indexed.
The name of a column in the table that represents a unique identifier for each record. Usually, this is the same column that is the primary key of the table.
The name of the schema, or namespace, of the table.
Indexing Settings
Several settings exist to control the amount of resources to allocate during index creation. While they have sensible defaults, overall indexing performance can be drastically improved if configured for the host system.
Field Types
In addition to text fields, numeric, datetime, boolean, and JSON fields can also be indexed. For optimal performance, we recommend indexing all fields that are relevant to your search query.
For instance, by indexing both description
and rating
, filters over rating
can be pushed down to the full text query for optimal query speed.
Text Fields
Columns of type VARCHAR
, TEXT
, UUID
, and their corresponding array types can be indexed as text fields.
paradedb.field
accepts the following configuration options for text fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Fieldnorms store information about the length of the text field. Must be
true
to calculate the BM25 score.
See tokenizers for how to configure the tokenizer.
Describes the amount of information indexed. See record for a list of available record types.
The name of the tokenizer used for fast fields. This field is ignored unless
fast=true
. See
normalizers for a list of
available normalizers.
Numeric Fields
Columns of type SMALLINT
, INTEGER
, BIGINT
, OID
, REAL
, DOUBLE PRECISION
, NUMERIC
, and their corresponding array types can be indexed
as numeric_fields
. The main reason to index a numeric field is if it is used for filtering or aggregations as part of the search query.
paradedb.field
accepts the following configuration options numeric fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Boolean Fields
Columns of type BOOLEAN
and BOOLEAN[]
can be indexed as boolean_fields
. Indexing a boolean field is useful if
it is used for filtering as part of the search query.
paradedb.field
accepts several configuration options for boolean fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
JSON Fields
Columns of type JSON
and JSONB
can be indexed as json_fields
. Once indexed, search can be
performed on nested text fields within JSON values.
paradedb.field
accepts several configuration options for JSON fields:
Whether the field is indexed. Must be true
in order for the field to be tokenized and
searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must have fast
set to true
.
Fast fields are also useful for accelerated scoring and filtering.
If true
, JSON keys containing a .
will be expanded. For instance, if expand_dots
is true
,
{"metadata.color": "red"}
will be indexed as if it was {"metadata": {"color": "red"}}
.
See tokenizers for how to configure the tokenizer.
Describes the amount of information indexed. See record for a list of available record types.
The name of the tokenizer used for fast fields. This field is ignored unless
fast=true
. See normalizers for a list of available
normalizers.
Datetime Fields
Columns of type DATE
, TIMESTAMP
, TIMESTAMPTZ
, TIME
, TIMETZ
, and their corresponding array types can be indexed as datetime_fields
.
Indexing a datetime field is useful if it is used for filtering as part of the search query.
paradedb.field
accepts several configuration options for boolean fields:
Whether the field is indexed. Must be true
in order for the field to be
tokenized and searchable.
Whether the original value of the field is stored.
Fast fields can be random-accessed rapidly. Fields used for aggregation must
have fast
set to true
. Fast fields are also useful for accelerated scoring
and filtering.
Array Fields
text_fields
, numeric_fields
, boolean_fields
, and datetime_fields
support array types. For instance, columns of
type TEXT[]
can be passed into text_fields
, and columns of type INT[]
can be passed into numeric_fields
.
The only exception are JSON[]
and JSONB[]
types, which are not yet
supported.
Range Fields
Columns of type int4range
, int8range
, numrange
, tsrange
, and tstzrange
can be indexed as range_fields
.
The range term query is used to filter over these fields.
Whether the original value of the field is stored.
Enumerated Types
Custom Postgres enums should be indexed as numeric_fields
.
This is because enums are stored as float values in Postgres.
Enums should be queried with term queries.
If the ordering of the enum is changed with ADD VALUE ... [ BEFORE | AFTER ]
, the BM25 index should be dropped
and recreated to account for the new enum ordinal values.
Multiple Fields
The ||
operator can be used to index multiple fields.
Choosing a Key Field
While the key field can be any unique text, numeric, or datetime value, an integer key field will be the most
performant. The Postgres SERIAL
type is an easy way to create a unique integer column.
Additionally, the key field is not intended to be tokenized. For instance, the following configuration is not allowed:
Legacy Syntax
The paradedb.field
and paradedb.tokenizer
functions were introduced in 0.8.6
. These functions are
syntactic sugar for generating JSONB
. Users that prefer the old syntax can still pass jsonb
into _fields
.
Was this page helpful?