Before text is indexed, it is first split into searchable units called tokens. The default tokenizer in ParadeDB is the unicode_words tokenizer. It splits text according to word boundaries defined by the Unicode Standard Annex #29 rules. All characters are lowercased by default. To visualize how this tokenizer works, you can cast a text string to the tokenizer type, and then to text[]:
SELECT 'Hello world!'::pdb.unicode_words::text[];
Expected Response
     text
---------------
 {hello,world}
(1 row)
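Because lowercasing happens as part of tokenization, mixed-case input produces the same tokens. A quick check, following the same casting pattern as above:

```sql
-- unicode_words lowercases as it tokenizes, so mixed-case input
-- should yield the same lowercase tokens as 'Hello world!'
SELECT 'HELLO World!'::pdb.unicode_words::text[];
```

This should also return `{hello,world}`.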
By contrast, the ngram tokenizer splits text into overlapping character sequences ("grams") of size n. In this example, n = 3:
SELECT 'Hello world!'::pdb.ngram(3,3)::text[];
Expected Response
                      text
-------------------------------------------------
 {hel,ell,llo,"lo ","o w"," wo",wor,orl,rld,ld!}
(1 row)
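The two arguments to `pdb.ngram` appear to be the minimum and maximum gram lengths (an assumption inferred from the `(3,3)` call above, which emits only 3-grams). Setting them to different values should emit every gram length in that range, for example:

```sql
-- With a minimum gram length of 2 and a maximum of 3,
-- each window of the input produces both 2-grams and 3-grams
SELECT 'Hello'::pdb.ngram(2,3)::text[];
```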
Choosing the right tokenizer is crucial to getting the search results you want. For instance, the unicode_words tokenizer works best for whole-word matching like "hello" or "world", while the ngram tokenizer enables partial matching. To configure a tokenizer for a column in the index, cast the column to the desired tokenizer type:
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
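With the index in place, partial terms can match inside longer words because the column is tokenized into 3-grams. A hypothetical query using ParadeDB's `@@@` full-text search operator (not shown above):

```sql
-- 'sho' is itself a 3-gram, so it can match rows whose
-- description contains words such as 'shoes' or 'shovel'
SELECT id, description
FROM mock_items
WHERE description @@@ 'sho'
LIMIT 5;
```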