`lore ipsum dolor`. If we tokenize this phrase by splitting on whitespace, users can find this phrase if they search for `lore`, `ipsum`, or `dolor`.
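As a rough sketch of what that means in practice, a query for any one of those tokens matches the row containing the phrase. The table and column names below (`mock_items`, `description`) are illustrative assumptions, not taken from this page:

```sql
-- Assuming "description" is tokenized by whitespace, searching for a single
-- token matches the row containing 'lore ipsum dolor'.
SELECT id, description
FROM mock_items
WHERE description @@@ 'ipsum';
```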
The `text_fields` and `json_fields` JSON configs accept a `tokenizer` key. If no `tokenizer` is specified, the `default` tokenizer is used. The `paradedb.tokenizers()` function returns a list of all available tokenizer names.
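For example, the available tokenizer names can be listed directly:

```sql
-- Returns one row per available tokenizer name
SELECT * FROM paradedb.tokenizers();
```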
The `whitespace` tokenizer is like `default`, but splits based on whitespace only. It filters out tokens that are larger than 255 bytes and converts them to lowercase.
When a field is tokenized with `keyword`, queries using equality operators such as `=`, `<=`, `<`, `>`, `>=`, and `<>` with that field can be pushed down to the index. The `keyword` tokenizer is supported for both `TEXT` and `VARCHAR` fields.
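A minimal sketch of such a configuration is shown below; the table, column, and index names are placeholders, and the exact JSON layout may differ between versions:

```sql
-- Hypothetical table: "category" is a VARCHAR column tokenized as a keyword
CREATE INDEX category_idx ON mock_items
USING bm25 (id, category)
WITH (
    key_field = 'id',
    text_fields = '{"category": {"tokenizer": {"type": "keyword"}}}'
);

-- Equality predicates on the keyword-tokenized field can be pushed down to the index
SELECT * FROM mock_items WHERE category = 'Electronics';
```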
The `regex` tokenizer splits text according to the regular expression provided in the `pattern` parameter. For instance, `\\W+` splits on non-word characters.
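As a sketch, a regex-tokenized field might be configured like this; the table and field names are placeholders for illustration:

```sql
-- Splits "sku" values on runs of non-word characters,
-- e.g. 'ABC-123/XL' -> 'ABC', '123', 'XL'
CREATE INDEX sku_idx ON mock_items
USING bm25 (id, sku)
WITH (
    key_field = 'id',
    text_fields = '{"sku": {"tokenizer": {"type": "regex", "pattern": "\\W+"}}}'
);
```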
The ngram tokenizer splits text into overlapping character sequences. For instance, with 3-grams it splits `cheese` into `che`, `hee`, `ees`, and `ese`.

During search, an ngram-tokenized query is considered a match only if all its ngram tokens match. For instance, the 3-grams of `chse` do not match against `cheese` because the token `hse` does not match with any of the tokens of `cheese`. However, the query `hees` matches because all of its 3-grams match against those of `cheese`.
If `prefix_only` is `true`, the tokenizer generates n-grams that start from the beginning of the word only, ensuring a prefix progression. If `false`, n-grams are created from all possible character combinations within the `min_gram` and `max_gram` range.
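The effect of these parameters can be checked with `paradedb.tokenize`, described later in this section. The named-argument form below is a sketch and may differ slightly between versions:

```sql
-- With prefix_only => false, all overlapping 3-grams are produced: che, hee, ees, ese
SELECT * FROM paradedb.tokenize(
    paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false),
    'cheese'
);

-- With prefix_only => true, only prefix n-grams are produced
-- (just "che" when min_gram = max_gram = 3)
SELECT * FROM paradedb.tokenize(
    paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => true),
    'cheese'
);
```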
The `chinese_compatible` tokenizer performs simple character splitting by treating each CJK (Chinese, Japanese, Korean) character as a single token and grouping non-CJK characters as a single token. Non-alphanumeric characters like punctuation are ignored and not included in any token.
The Lindera tokenizers use language-specific dictionaries: `chinese_lindera` uses the CC-CEDICT dictionary, `korean_lindera` uses the KoDic dictionary, and `japanese_lindera` uses the IPADIC dictionary. Chinese text can be tokenized with either `chinese_lindera` or `chinese_compatible`.
Tokenizers can be tested with `paradedb.tokenize`. This function is useful for comparing different tokenizers or passing tokens directly into a term-level query.
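For example, the output of two tokenizers can be compared side by side on the same input; this is a sketch using the helper described above:

```sql
-- Default tokenizer: splits on whitespace and punctuation, lowercases
SELECT * FROM paradedb.tokenize(paradedb.tokenizer('default'), 'Hello, World!');

-- Whitespace tokenizer: splits on whitespace only, so the comma stays
-- attached to the first token
SELECT * FROM paradedb.tokenize(paradedb.tokenizer('whitespace'), 'Hello, World!');
```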
The same column can be indexed with multiple tokenizers by defining additional fields in the `WITH` options to `CREATE INDEX`. The configuration should contain a `"column"` key that points to the table column containing the data for that field. Here’s an example of how to create a BM25 index with multiple tokenizers for the same field:
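The sketch below assumes a `mock_items` table with `id` and `description` columns; the alias field name `description_ngram` and the exact JSON layout are illustrative rather than taken from this page:

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {"tokenizer": {"type": "whitespace"}},
        "description_ngram": {
            "column": "description",
            "tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3, "prefix_only": false}
        }
    }'
);
```

Here, `description_ngram` is an extra index field whose `"column"` key points back to the `description` column, so the same data is tokenized twice, once by whitespace and once into 3-grams.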