Tokenizers determine how text is split up when indexed. Picking the right tokenizer is crucial for returning the results that you want. Different tokenizers are optimized for different query types and languages.

For instance, consider the phrase lorem ipsum dolor. If we tokenize this phrase by splitting on whitespace, users can find this phrase if they search for lorem, ipsum, or dolor.
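
The paradedb.tokenize function, covered later in this section, can be used to inspect the tokens a tokenizer produces. As a sketch, splitting the phrase on whitespace looks like this:

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('whitespace'),
  'lorem ipsum dolor'
);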

Basic Usage

A tokenizer can be provided to any text or JSON field. If no tokenizer is specified, the default tokenizer is used, which splits on whitespace and punctuation, filters out tokens that are larger than 255 bytes, and lowercases all tokens.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field(
    name => 'description',
    tokenizer => paradedb.tokenizer('default')
  )
);

Queries are automatically tokenized in the same way as the underlying field. For instance, in the following query, the word keyboard is tokenized with the default tokenizer.

SELECT * FROM search_idx.search('description:keyboard');

Available Tokenizers

The paradedb.tokenizers() function returns a list of all available tokenizer names.

SELECT * FROM paradedb.tokenizers();

Default

Tokenizes the text by splitting on whitespace and punctuation, filters out tokens that are larger than 255 bytes, and converts to lowercase.

paradedb.tokenizer('default')
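
As a sketch, the behavior described above can be verified with paradedb.tokenize (covered below); given a punctuated phrase such as the illustrative string here, the default tokenizer should produce the lowercased tokens ergonomic, metal, and keyboard.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('default'),
  'Ergonomic, Metal Keyboard!'
);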

Whitespace

Like default, but splits based on whitespace only. Filters out tokens that are larger than 255 bytes and converts to lowercase.

paradedb.tokenizer('whitespace')
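
Because only whitespace acts as a delimiter, punctuation stays attached to its neighboring word. The sketch below tokenizes the same illustrative phrase for comparison; ergonomic, (with the comma) and keyboard! (with the exclamation mark) should remain single tokens.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('whitespace'),
  'Ergonomic, Metal Keyboard!'
);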

Raw

Treats the entire text as a single token. Filters out tokens larger than 255 bytes and converts to lowercase.

paradedb.tokenizer('raw')
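
Because the raw tokenizer emits the whole input as one lowercased token, it suits values that should only match exactly, such as identifiers. A sketch with an illustrative value; per the description above, the output should be the single token sku-12345-xyz.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('raw'),
  'SKU-12345-XYZ'
);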

Regex

Tokenizes text using a regular expression. The regular expression can be specified with the pattern parameter. For instance, \\W+ splits on non-word characters.

paradedb.tokenizer('regex', pattern => '\\W+')
pattern

The regular expression pattern used to tokenize the text.
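
For instance, with the \\W+ pattern, runs of non-word characters such as hyphens, commas, and spaces all act as delimiters. A sketch with an illustrative string:

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('regex', pattern => '\\W+'),
  'plastic-keyboard, white'
);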

Ngram

Tokenizes text by splitting words into overlapping substrings based on the specified parameters. For instance, a 3-gram tokenizer splits the word cheese into che, hee, ees, and ese.

During search, an ngram-tokenized query is considered a match only if all its ngram tokens match. For instance, the 3-grams of chse do not match against cheese because the token hse does not match with any of the tokens of cheese. However, the query hees matches because all of its 3-grams match against those of cheese.

paradedb.tokenizer('ngram', min_gram => 2, max_gram => 4, prefix_only => false)
min_gram

Defines the minimum length of the n-grams. For instance, if set to 2, the smallest token produced is 2 characters long.

max_gram

Determines the maximum length of the n-grams. If set to 5, the largest token produced is 5 characters long.

prefix_only

When set to true, the tokenizer generates n-grams that start from the beginning of the word only, ensuring a prefix progression. If false, n-grams are created from all possible character combinations within the min_gram and max_gram range.
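
Putting the parameters together, the sketch below reproduces the cheese example from above. With min_gram and max_gram both set to 3 and prefix_only set to false, the output should be the tokens che, hee, ees, and ese; setting prefix_only => true instead would keep only the leading gram che.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false),
  'cheese'
);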

Source Code

Tokenizes the text by splitting based on casing conventions commonly used in code, such as camelCase or PascalCase. Filters out tokens that exceed 255 bytes, and converts them to lowercase with ASCII folding.

paradedb.tokenizer('source_code')
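
As a sketch, tokenizing an illustrative camelCase identifier should split it at the case boundaries and lowercase the parts, yielding get, user, and name.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('source_code'),
  'getUserName'
);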

Chinese Compatible

The chinese_compatible tokenizer performs simple character splitting by treating each CJK (Chinese, Japanese, Korean) character as a single token and grouping non-CJK characters as a single token. Non-alphanumeric characters like punctuation are ignored and not included in any token.

paradedb.tokenizer('chinese_compatible')
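
As a sketch with an illustrative mixed string, each CJK character below should become its own token, while the Latin word is grouped into a single token.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('chinese_compatible'),
  '电脑 computer'
);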

Lindera

The Lindera tokenizer is a more advanced CJK tokenizer that uses prebuilt Chinese, Japanese, or Korean dictionaries to break text into meaningful tokens (words or phrases) rather than into individual characters.

chinese_lindera uses the CC-CEDICT dictionary.

paradedb.tokenizer('chinese_lindera')

korean_lindera uses the KoDic dictionary.

paradedb.tokenizer('korean_lindera')

japanese_lindera uses the IPADIC dictionary.

paradedb.tokenizer('japanese_lindera')
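
A sketch of dictionary-based tokenization with an illustrative Japanese phrase; unlike chinese_compatible, the output should be dictionary words rather than individual characters.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('japanese_lindera'),
  '日本語の形態素解析'
);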

ICU

The ICU (International Components for Unicode) tokenizer breaks down text according to the Unicode standard. It can be used to tokenize most languages and recognizes the nuances in word boundaries across different languages.

paradedb.tokenizer('icu')
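
A sketch that applies the ICU tokenizer to an illustrative mixed-language string; word boundaries should be detected in each script according to the Unicode segmentation rules.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('icu'),
  'Hello 世界 bonjour'
);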

Tokenizing a Query

To manually tokenize input text with a specified tokenizer, use paradedb.tokenize. This function is useful for comparing different tokenizers or passing tokens directly into a term-level query.

SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false),
  'keyboard'
);

Multiple Tokenizers

In ParadeDB, only one tokenizer is allowed per field and only one BM25 index is allowed per table. However, some scenarios may require the same field to be tokenized in multiple ways. For instance, suppose we wish to use both the whitespace and ngram tokenizers on the description field. One way to achieve this is to use generated columns.

ALTER TABLE mock_items
ADD COLUMN description_ngrams TEXT GENERATED ALWAYS AS (description) STORED;

-- Optional: Drop existing search_idx if exists
CALL paradedb.drop_bm25('search_idx');

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields =>
    paradedb.field('description', tokenizer => paradedb.tokenizer('whitespace')) ||
    paradedb.field('description_ngrams', tokenizer => paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false))
);
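
With both columns indexed, each tokenization of the same text can be targeted through its own field. The query below is a sketch that searches the ngram-tokenized column; the search term is illustrative.

SELECT * FROM search_idx.search('description_ngrams:boa');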