Tokenizers split large chunks of text into small, searchable units called tokens. Different tokenizers use different strategies to split text. The default tokenizer in ParadeDB is the simple tokenizer, which splits text on whitespace and punctuation and lowercases the resulting tokens. To visualize how this tokenizer works, cast a text string to the tokenizer type, and then to text[]:
SELECT 'Hello world!'::pdb.simple::text[];
Expected Response
     text
---------------
 {hello,world}
(1 row)
On the other hand, the ngram tokenizer splits text into overlapping character sequences (“grams”) of size n. Grams can include whitespace and punctuation, and are lowercased. In this example, n = 3:
SELECT 'Hello world!'::pdb.ngram(3,3)::text[];
Expected Response
                      text
-------------------------------------------------
 {hel,ell,llo,"lo ","o w"," wo",wor,orl,rld,ld!}
(1 row)
Choosing the right tokenizer is crucial to getting the search results you want. For instance, the simple tokenizer works best for whole-word matching like “hello” or “world”, while the ngram tokenizer enables partial matching. To configure a tokenizer for a column in the index, simply cast it to the desired tokenizer type:
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
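To see why the ngram tokenizer enables partial matching, it can help to tokenize a short fragment with the same settings as the index and compare the result with the tokens of a full string. The sketch below reuses only the casts shown earlier on this page. The fragment 'wor' and the column aliases are illustrative choices, not part of ParadeDB.

```sql
-- Tokenize a fragment a user might type, using the same
-- ngram settings as the index definition above.
SELECT 'wor'::pdb.ngram(3,3)::text[] AS fragment_tokens;

-- Tokenize the same full string with both tokenizers to compare
-- which tokens each one would make searchable.
SELECT
    'Hello world!'::pdb.simple::text[]     AS simple_tokens,
    'Hello world!'::pdb.ngram(3,3)::text[] AS ngram_tokens;
```

The gram wor appears in the ngram output for 'Hello world!' but not in the simple tokenizer's output of {hello,world}. This is why a column indexed with the ngram tokenizer can match on part of a word, while one indexed with the simple tokenizer matches whole words only.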