Tokenizers
Tokenizers determine how text is split up when indexed. Picking the right tokenizer is crucial for returning the results that you want. Different tokenizers are optimized for different query types and languages.
For instance, consider the phrase "lore ipsum dolor". If we tokenize this phrase by splitting on whitespace, users can find this phrase if they search for "lore", "ipsum", or "dolor".
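As a minimal sketch of the idea (plain Python, not ParadeDB's implementation), the phrase can be split on whitespace and a single-word query checked against the resulting tokens. The function names whitespace_tokenize and matches are hypothetical and exist only for this illustration.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split text into tokens on runs of whitespace."""
    return text.split()


# Tokens that would be stored for the example phrase.
indexed_tokens = set(whitespace_tokenize("lore ipsum dolor"))


def matches(query: str) -> bool:
    """In this sketch, a one-word query matches only if it equals an indexed token."""
    return query in indexed_tokens


for query in ("lore", "ipsum", "dolor", "ips"):
    print(query, matches(query))
```

In this simplified model the partial word "ips" does not match, because no indexed token is exactly equal to it. This is the kind of behavior that the choice of tokenizer changes.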
Basic Usage
Both the text_fields and json_fields JSON configs accept a tokenizer key. If no tokenizer is specified, the default tokenizer is used.
Available Tokenizers
The paradedb.tokenizers() function returns a list of all available tokenizer names.
Default
Tokenizes the text by splitting on whitespace and punctuation, filters out tokens that are larger than 255 bytes, and converts to lowercase.
Whitespace
Like default, but splits based on whitespace only. Filters out tokens that are larger than 255 bytes and converts to lowercase.
Raw
Treats the entire text as a single token. Filters out tokens larger than 255 bytes and converts to lowercase.
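The difference between the default, whitespace, and raw tokenizers is easiest to see side by side. The following is a rough Python approximation, not ParadeDB's code. It assumes that "whitespace and punctuation" can be modeled as any run of non-alphanumeric characters and that token length is measured in UTF-8 bytes. The function names are hypothetical.

```python
import re

MAX_TOKEN_BYTES = 255  # length limit described above


def _finish(tokens):
    """Drop tokens longer than 255 bytes (UTF-8 assumed), then lowercase the rest."""
    return [t.lower() for t in tokens if len(t.encode("utf-8")) <= MAX_TOKEN_BYTES]


def default_like(text):
    # Assumption: punctuation and whitespace are any non-alphanumeric characters.
    return _finish(re.findall(r"[^\W_]+", text))


def whitespace_like(text):
    # Split on whitespace only, so punctuation stays attached to tokens.
    return _finish(text.split())


def raw_like(text):
    # The whole input becomes one token.
    return _finish([text] if text else [])


sample = "Hello, World! full-text"
print(default_like(sample))
print(whitespace_like(sample))
print(raw_like(sample))
```

With this sample, the default-style split yields hello, world, full, text. The whitespace-style split keeps the punctuation attached (hello, and world! and full-text), and the raw-style version returns the whole lowercased string as one token.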
Regex
Tokenizes text using a regular expression. The regular expression can be specified with the pattern parameter. For instance, \\W+ splits on non-word characters.
pattern: The regular expression pattern used to tokenize the text.
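To illustrate what splitting on \W+ does, here is a small Python sketch. It assumes, as the description above states, that the pattern acts as a delimiter. It also uses Python's re module, whose regex dialect may differ from the one ParadeDB uses. The doubled backslash in \\W+ is JSON string escaping, so the pattern itself is \W+. The helper name regex_split is hypothetical.

```python
import re


def regex_split(text: str, pattern: str) -> list[str]:
    """Split text wherever the pattern matches and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


# \W+ matches one or more characters that are not letters, digits, or underscore.
print(regex_split("full-text search, fast!", r"\W+"))
```

For this input the sketch yields full, text, search, and fast. No lowercasing or length filtering is applied here, since the description above does not mention either for this tokenizer.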
Ngram
Tokenizes text by splitting words into overlapping substrings based on the specified parameters. For instance, a 3-gram tokenizer splits the word "cheese" into "che", "hee", "ees", and "ese".
During search, an ngram-tokenized query is considered a match only if all its ngram tokens match. For instance, the 3-grams of "chse" do not match against "cheese" because the token "hse" does not match any of the tokens of "cheese". However, the query "hees" matches because all of its 3-grams match against those of "cheese".
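The matching rule can be checked with a few lines of Python. This is a sketch of the logic as described above, not ParadeDB's search code. The names ngrams and ngram_match are hypothetical.

```python
def ngrams(word: str, n: int) -> list[str]:
    """Return every overlapping substring of length n, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_match(query: str, indexed: str, n: int = 3) -> bool:
    """True only if the query has n-grams and every one appears in the indexed word."""
    query_grams = ngrams(query, n)
    indexed_grams = set(ngrams(indexed, n))
    return bool(query_grams) and all(g in indexed_grams for g in query_grams)


print(ngrams("cheese", 3))
print(ngram_match("chse", "cheese"))
print(ngram_match("hees", "cheese"))
```

The query "chse" produces the 3-grams chs and hse, neither of which is a 3-gram of "cheese", so it does not match. The query "hees" produces hee and ees, both of which are, so it matches.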
min_gram: Defines the minimum length for the n-grams. For instance, if set to 2, the smallest token created would be 2 characters long.
max_gram: Determines the maximum length of the n-grams. If set to 5, the largest token produced would be 5 characters long.
When set to true, the tokenizer generates n-grams that start from the beginning of the word only, ensuring a prefix progression. If false, n-grams are created from all possible character combinations within the min_gram and max_gram range.
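The effect of these three settings can be sketched in Python. The names min_gram and max_gram come from the text above. The flag name prefix_only, the function name ngram_tokens, and the order in which tokens are emitted are assumptions made for this illustration.

```python
def ngram_tokens(word: str, min_gram: int, max_gram: int, prefix_only: bool = False) -> list[str]:
    """Generate n-grams whose length lies between min_gram and max_gram inclusive."""
    tokens = []
    # With prefix_only, every n-gram starts at the first character of the word.
    starts = [0] if prefix_only else range(len(word))
    for start in starts:
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                tokens.append(word[start:start + size])
    return tokens


print(ngram_tokens("cheese", 2, 3))
print(ngram_tokens("cheese", 2, 3, prefix_only=True))
```

With min_gram set to 2 and max_gram set to 3, the word "cheese" produces nine tokens in this sketch (ch, che, he, hee, ee, ees, es, ese, se). With the prefix flag enabled only ch and che remain, which is the prefix progression described above.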
Source Code
Tokenizes the text by splitting based on casing conventions commonly used in code, such as camelCase or PascalCase. Filters out tokens that exceed 255 bytes, and converts them to lowercase with ASCII folding.
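A casing-based split can be approximated with a regular expression. The following Python sketch is an illustration under stated assumptions, not ParadeDB's implementation. It folds accents before splitting, treats a run of capitals as one acronym token, and measures token length in UTF-8 bytes. The names ascii_fold and source_code_like are hypothetical.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def source_code_like(text: str) -> list[str]:
    """Split identifiers on case boundaries, then filter by length and lowercase."""
    # Alternatives: an acronym run, a word with an optional leading capital, or digits.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", ascii_fold(text))
    return [p.lower() for p in parts if len(p.encode("utf-8")) <= 255]


print(source_code_like("camelCase"))
print(source_code_like("PascalCase"))
```

Under these assumptions, camelCase becomes camel and case, and PascalCase becomes pascal and case.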
Chinese Compatible
The chinese_compatible tokenizer performs simple character splitting by treating each CJK (Chinese, Japanese, Korean) character as a single token and grouping non-CJK characters as a single token. Non-alphanumeric characters like punctuation are ignored and not included in any token.
Lindera
The Lindera tokenizer is a more advanced CJK tokenizer that uses prebuilt Chinese, Japanese, or Korean dictionaries to break text into meaningful tokens (words or phrases) rather than on individual characters.
chinese_lindera uses the CC-CEDICT dictionary, korean_lindera uses the KoDic dictionary, and japanese_lindera uses the IPADIC dictionary.
ICU
The ICU (International Components for Unicode) tokenizer breaks down text according to the Unicode standard. It can be used to tokenize most languages and recognizes the nuances in word boundaries across different languages.
Tokenizing a Query
To manually tokenize input text with a specified tokenizer, use paradedb.tokenize. This function is useful for comparing different tokenizers or passing tokens directly into a term-level query.
Multiple Tokenizers
ParadeDB supports using multiple tokenizers for the same field within a single BM25 index. This allows the same data to be queried with different matching strategies.
To set up a field with multiple tokenizers, configure it with an alias in the WITH options to CREATE INDEX. The configuration should contain a "column" key that points to the table column containing the data for that field.
Here’s an example of how to create a BM25 index with multiple tokenizers for the same field:
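One way to assemble such a configuration is sketched below in Python, which builds the JSON string that would be passed as text_fields. The "tokenizer" and "column" keys come from the text above. The column name description, the alias description_ngram, the "type" key, and the ngram parameter names (in particular prefix_only) are assumptions for illustration and should be checked against the ParadeDB version in use.

```python
import json

# Hypothetical column "description" indexed twice: once under its own name
# and once under an alias that points back to it through the "column" key.
text_fields = {
    "description": {
        "tokenizer": {"type": "whitespace"},
    },
    "description_ngram": {
        "column": "description",
        "tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "prefix_only": False,
        },
    },
}

# JSON string intended for the text_fields option in the WITH clause.
text_fields_json = json.dumps(text_fields, indent=2)
print(text_fields_json)
```

Queries against description would then use whitespace tokens, while queries against the alias description_ngram would use 3-grams of the same column.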