The edge ngram tokenizer first splits text into words at character-class boundaries, then generates n-grams anchored to the beginning of each word. This makes it ideal for “search-as-you-type” functionality, where users find matches as they type partial words. The tokenizer takes two required arguments: the minimum and maximum gram length. For each word, it emits prefix tokens from
min_gram to max_gram characters long (clamped to the word length). Words shorter than min_gram are skipped.
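The prefix-emission rule above can be sketched in a few lines of Python. This is an illustrative stand-in, not ParadeDB's implementation: it splits on non-alphanumeric characters (a rough ASCII approximation of the default character classes) and emits prefixes from min_gram to max_gram, clamped to the word length.

```python
import re

def edge_ngrams(text, min_gram, max_gram):
    """Sketch of edge n-gram tokenization: prefixes anchored to each word start."""
    tokens = []
    for word in re.split(r"[^A-Za-z0-9]+", text):
        if len(word) < min_gram:
            continue  # words shorter than min_gram are skipped
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

print(edge_ngrams("quick fox", 2, 4))
# ['qu', 'qui', 'quic', 'fo', 'fox']
```

Note that "fox" is shorter than max_gram, so its longest emitted token is the whole word; a one-letter word would produce no tokens at all with min_gram set to 2.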
Token Chars
By default, the edge ngram tokenizer treats letters and digits as token content and everything else (spaces, punctuation, symbols) as word delimiters. You can customize this with token_chars, which accepts a comma-separated list of character classes: letter, digit, whitespace, punctuation, symbol. Character classification uses Unicode general categories, matching Elasticsearch’s behavior.
For example, including punctuation keeps hyphens as part of words:
Expected Response
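To illustrate the effect, here is a rough sketch of how token_chars changes word splitting. The class definitions below are simplified ASCII stand-ins for the Unicode general categories, and the function is hypothetical, not the actual ParadeDB API:

```python
import re

def edge_ngrams(text, min_gram, max_gram, token_chars=("letter", "digit")):
    # Simplified ASCII stand-ins for the Unicode character classes;
    # anything outside the selected classes delimits words.
    classes = {
        "letter": "A-Za-z",
        "digit": "0-9",
        "punctuation": re.escape("-._,;:!?'\""),
    }
    content = "".join(classes[c] for c in token_chars)
    tokens = []
    for word in re.findall(f"[{content}]+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

# Default classes: the hyphen splits "full-text" into two words.
print(edge_ngrams("full-text", 3, 5))
# ['ful', 'full', 'tex', 'text']

# With punctuation included, the hyphen stays inside the word,
# so longer grams can span it.
print(edge_ngrams("full-text", 3, 5, token_chars=("letter", "digit", "punctuation")))
# ['ful', 'full', 'full-']
```

Keeping punctuation as token content is useful for hyphenated identifiers and part numbers, where a user typing "full-" should still match "full-text".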