The ngram tokenizer splits text into “grams”: overlapping character sequences of configurable length. It takes two arguments: the minimum and the maximum character length of a gram. Grams are generated for every size between the minimum and maximum, inclusive. For example, pdb.ngram(2,5) generates tokens of length 2, 3, 4, and 5. To produce grams of a single fixed length, set the minimum and maximum equal to each other.
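The sliding-window behavior described above can be sketched in a few lines of Python. This is an illustrative model, not ParadeDB's actual implementation; it assumes the input has already been lowercased, as the tokenizer's output below suggests:

```python
def ngrams(text, min_gram, max_gram):
    """Sketch of ngram tokenization: emit every substring ("gram")
    whose length is between min_gram and max_gram, inclusive.
    Illustrative only -- not ParadeDB's implementation."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # Slide a window of the current size across the text.
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams

# Fixed-length 3-grams, as in pdb.ngram(3,3):
print(ngrams("tokenize me!", 3, 3))
# ['tok', 'oke', 'ken', 'eni', 'niz', 'ize', 'ze ', 'e m', ' me', 'me!']
```

Note that grams may cross word boundaries and include spaces and punctuation; the whole field is treated as one character stream.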
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
To get a feel for this tokenizer, run the following command, replacing the text with your own:
SELECT 'Tokenize me!'::pdb.ngram(3,3)::text[];
Expected Response
                      text
-------------------------------------------------
 {tok,oke,ken,eni,niz,ize,"ze ","e m"," me",me!}
(1 row)