The edge ngram tokenizer first splits text into words at character-class boundaries, then generates n-grams anchored to the beginning of each word. This makes it ideal for “search-as-you-type” functionality, where users find matches as they type partial words. The tokenizer takes two required arguments: the minimum and maximum gram length (min_gram and max_gram). For each word, it emits lowercased prefix tokens from min_gram to max_gram characters long (clamped to the word length). Words shorter than min_gram are skipped.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.edge_ngram(2,5)))
WITH (key_field='id');
To get a feel for this tokenizer, run the following command and replace the text with your own:
SELECT 'Quick Fox'::pdb.edge_ngram(2,5)::text[];
Expected Response
            text
-----------------------------
 {qu,qui,quic,quick,fo,fox}
(1 row)
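With an index like the one above, a query can match against the stored prefixes as soon as the typed fragment reaches min_gram characters. The following is a minimal sketch, assuming the mock_items table from the index example contains descriptions with words like “shoes”, and that the @@@ full-text operator is used for matching:

-- Hypothetical search-as-you-type lookup: the user has typed "sho" so far.
-- The fragment matches the 3-character edge ngram generated from "shoes"
-- at index time.
SELECT id, description
FROM mock_items
WHERE description @@@ 'sho'
LIMIT 10;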

Token Chars

By default, the edge ngram tokenizer treats letters and digits as token content and everything else (spaces, punctuation, symbols) as word delimiters. You can customize this with token_chars, which accepts a comma-separated list of character classes: letter, digit, whitespace, punctuation, symbol. Character classification uses Unicode general categories, matching Elasticsearch’s behavior. For example, including punctuation keeps hyphens as part of words:
SELECT 'Quick-Fox'::pdb.edge_ngram(2,5,'token_chars=letter,digit,punctuation')::text[];
Expected Response
          text
-------------------------
 {qu,qui,quic,quick}
(1 row)
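For contrast, the same input under the default character classes splits at the hyphen, so the tokenizer emits prefixes of both words, exactly as in the 'Quick Fox' example above:

SELECT 'Quick-Fox'::pdb.edge_ngram(2,5)::text[];
Expected Response
            text
-----------------------------
 {qu,qui,quic,quick,fo,fox}
(1 row)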