The edge ngram tokenizer first splits text into words at character-class boundaries, then generates n-grams anchored to the beginning of each word. This makes it ideal for “search-as-you-type” functionality, where users find matches as they type partial words. The tokenizer takes two required arguments: the minimum and maximum gram length. For each word, it emits prefix tokens from
min_gram to max_gram characters long (clamped to the word length). Words shorter than min_gram are skipped.
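The prefix-emission rule above can be sketched in a few lines of Python. This is an illustrative stand-in, not ParadeDB's implementation: it splits on non-alphanumeric characters (a rough ASCII approximation of the default character classes) and emits prefixes from min_gram to max_gram, clamped to the word length.

```python
import re

def edge_ngrams(text, min_gram, max_gram):
    """Sketch of edge n-gram tokenization: prefixes anchored to each word start."""
    tokens = []
    for word in re.split(r"[^A-Za-z0-9]+", text):
        if len(word) < min_gram:
            continue  # words shorter than min_gram are skipped
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

print(edge_ngrams("quick fox", 2, 4))
# ['qu', 'qui', 'quic', 'fo', 'fox']
```

Note that "fox" is shorter than max_gram, so its longest emitted token is the whole word; a one-letter word would produce no tokens at all with min_gram set to 2.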
Token Chars
By default, the edge ngram tokenizer treats letters and digits as token content and everything else (spaces, punctuation, symbols) as word delimiters. You can customize this with token_chars, which accepts a comma-separated list of character classes: letter, digit, whitespace, punctuation, symbol. Character classification uses Unicode general categories, matching Elasticsearch’s behavior.
For example, including punctuation keeps hyphens as part of words:
Expected Response
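To illustrate the effect, here is a rough sketch of how token_chars changes word splitting. The class definitions below are simplified ASCII stand-ins for the Unicode general categories, and the function is hypothetical, not the actual ParadeDB API:

```python
import re

def edge_ngrams(text, min_gram, max_gram, token_chars=("letter", "digit")):
    # Simplified ASCII stand-ins for the Unicode character classes;
    # anything outside the selected classes delimits words.
    classes = {
        "letter": "A-Za-z",
        "digit": "0-9",
        "punctuation": re.escape("-._,;:!?'\""),
    }
    content = "".join(classes[c] for c in token_chars)
    tokens = []
    for word in re.findall(f"[{content}]+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

# Default classes: the hyphen splits "full-text" into two words.
print(edge_ngrams("full-text", 3, 5))
# ['ful', 'full', 'tex', 'text']

# With punctuation included, the hyphen stays inside the word,
# so longer grams can span it.
print(edge_ngrams("full-text", 3, 5, token_chars=("letter", "digit", "punctuation")))
# ['ful', 'full', 'full-']
```

Keeping punctuation as token content is useful for hyphenated identifiers and part numbers, where a user typing "full-" should still match "full-text".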