- Index time (edge ngram): "shoes" → s, sh, sho, shoe, shoes
- Search time (unicode): "sho" → sho
Without a search-time tokenizer, the query "sho" would itself be edge-ngrammed into s, sh, sho, matching far too many documents.
Usage
Set `search_tokenizer` as a `WITH` option on the index to define a default search-time tokenizer for all text and JSON fields:
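A minimal sketch of what this might look like, assuming a hypothetical `products` table with an `id` key column and a `title` text column indexed with ParadeDB's `bm25` access method; the `pdb.ngram(2, 5)` cast and its parameters are illustrative only, and the exact cast name for an edge-ngram tokenizer may differ:

```sql
-- Sketch: prefix tokens for title at index time,
-- unicode_words as the default search-time tokenizer.
CREATE INDEX products_search_idx ON products
USING bm25 (id, (title::pdb.ngram(2, 5)))
WITH (key_field = 'id', search_tokenizer = 'unicode_words');
```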
- Index time: `title` is tokenized with edge ngram to create prefix tokens
- Search time: queries against `title` automatically use the unicode tokenizer
The `search_tokenizer` value can include parameters, e.g. `search_tokenizer='simple(lowercase=false)'`.
Because `search_tokenizer` only affects query-time behavior, you can change it without reindexing:
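A sketch of such a change, assuming ParadeDB exposes `search_tokenizer` as an ordinary index storage parameter so that Postgres's `ALTER INDEX ... SET` applies (the index name `products_search_idx` is hypothetical):

```sql
-- Switch the default search-time tokenizer; no reindex needed,
-- since this only changes how query strings are tokenized.
ALTER INDEX products_search_idx SET (search_tokenizer = 'whitespace');
```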
Example
Without `search_tokenizer`, the query 'sho' would be edge-ngrammed into s, sh, sho and match every title starting with s, not just those starting with sho.
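As an illustration, a prefix search under the defaults described above might look like this; the `@@@` search operator is ParadeDB's full-text operator, while the table and column names are assumptions carried over from the earlier sketch:

```sql
-- With search_tokenizer='unicode_words', 'sho' stays one token and
-- matches only titles with a word starting with 'sho' (via the
-- prefix tokens created at index time).
SELECT id, title FROM products WHERE title @@@ 'sho';
```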
Overriding at Query Time
You can still override the search tokenizer for a specific query by casting the query string.

Priority
When resolving which tokenizer to use at search time, ParadeDB checks in this order:

1. Query-level cast, e.g. `'sho'::pdb.ngram(...)` (highest priority)
2. Index-level `WITH` option, e.g. `WITH (search_tokenizer='unicode_words')`
3. Index-time tokenizer: the tokenizer used to build the index (fallback)
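The query-level cast, which takes highest priority, might be used like this. This is a sketch: the table and column names are assumptions, and `pdb.literal` is chosen from the supported-tokenizer list below:

```sql
-- Force the query string to be treated as a single literal token,
-- overriding the index's search_tokenizer setting for this query.
SELECT id, title FROM products WHERE title @@@ 'sho'::pdb.literal;
```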
Supported Tokenizers
Any available tokenizer can be used as a `search_tokenizer`:
`unicode_words`, `simple`, `whitespace`, `ngram`, `literal`, `literal_normalized`, `chinese_compatible`, `lindera`, `icu`, `jieba`, `source_code`.