By default, ParadeDB uses the same tokenizer at both index time and search time. This makes sense for most cases — you want queries tokenized the same way the data was indexed. But sometimes you need different tokenizers. The classic example is autocomplete:
  • Index time (edge ngram): "shoes" → s, sh, sho, shoe, shoes
  • Search time (unicode): "sho" → sho
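If you used edge ngram at search time too, typing "sho" would produce s, sh, and sho. The short tokens s and sh would then match every document containing a word that starts with those letters, which is far too many documents.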
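The two tokenizations above can be sketched in a few lines of Python. This is a toy model for illustration only, not ParadeDB's implementation: `edge_ngrams` and `whole_word` are hypothetical helpers, and the toy handles a single word at a time.

```python
def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 10) -> list[str]:
    """Return the prefixes of `text` whose length is between min_gram and max_gram."""
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


def whole_word(text: str) -> list[str]:
    """Toy stand-in for a word tokenizer: split on whitespace, keep words intact."""
    return text.split()


# Index time: every prefix of "shoes" becomes a searchable token.
print(edge_ngrams("shoes"))   # ['s', 'sh', 'sho', 'shoe', 'shoes']

# Search time with a word tokenizer: the partial query stays a single token.
print(whole_word("sho"))      # ['sho']

# Search time with edge n-grams: the query also yields the very short prefixes.
print(edge_ngrams("sho"))     # ['s', 'sh', 'sho']
```

The asymmetry is the point: prefixes are generated once on the indexed side, so the query side only needs to supply the text typed so far as a single token.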

Usage

Set search_tokenizer as a WITH option on the index to define a default search-time tokenizer for all text and JSON fields:
CREATE INDEX search_idx ON products
USING bm25 (
  id,
  (title::pdb.ngram(1, 10, 'prefix_only=true'))
) WITH (key_field='id', search_tokenizer='unicode_words');
With this configuration:
  • Index time: title is tokenized with edge ngram to create prefix tokens
  • Search time: queries against title automatically use the unicode tokenizer (unicode_words)
The search_tokenizer value can include parameters, e.g. search_tokenizer='simple(lowercase=false)'. Because search_tokenizer only affects query-time behavior, you can change it without reindexing:
ALTER INDEX search_idx SET (search_tokenizer = 'simple(lowercase=false)');
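Why no reindex is needed can be seen in a toy model where the stored tokens and the query-side tokenizer are separate pieces of state. Everything here is hypothetical (`ToyIndex`, `lowercase_words`, `case_sensitive_words`), and the assumption that indexed tokens are stored lowercased belongs to the toy, not to ParadeDB.

```python
def lowercase_words(text):
    """Toy search tokenizer: split on whitespace and lowercase each word."""
    return text.lower().split()


def case_sensitive_words(text):
    """Toy search tokenizer: split on whitespace, keep the original case."""
    return text.split()


class ToyIndex:
    def __init__(self, titles, search_tokenizer):
        # Index-time work happens once: store lowercased prefixes of each title.
        self.postings = {}
        for row_id, title in titles.items():
            word = title.lower()
            for n in range(1, len(word) + 1):
                self.postings.setdefault(word[:n], set()).add(row_id)
        # The search tokenizer is only a setting consulted when a query runs.
        self.search_tokenizer = search_tokenizer

    def search(self, query):
        """Return ids of rows matching any token produced from the query."""
        ids = set()
        for token in self.search_tokenizer(query):
            ids |= self.postings.get(token, set())
        return sorted(ids)


index = ToyIndex({1: "shoes", 2: "shirt", 3: "socks"}, lowercase_words)
stored = index.postings            # keep a reference to the stored tokens

print(index.search("SHO"))         # [1]

# Swap only the query-side setting; the postings are not rebuilt.
index.search_tokenizer = case_sensitive_words
print(index.search("SHO"))         # []
print(index.postings is stored)    # True
```

In this sketch the same query returns different rows before and after the swap, while the stored postings object is untouched.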

Example

CREATE TABLE products (
    id serial8 NOT NULL PRIMARY KEY,
    title text
);
INSERT INTO products (title) VALUES
    ('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks');

CREATE INDEX idx_products ON products USING bm25
    (id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
    WITH (key_field = 'id', search_tokenizer = 'unicode_words');

-- "sho" stays as one token → matches shoes, shorts, shoelaces
SELECT id, title FROM products WHERE title ||| 'sho' ORDER BY id;

-- "s" stays as one token → matches all five titles
SELECT id, title FROM products WHERE title ||| 's' ORDER BY id;
Without search_tokenizer, the query 'sho' would be edge-ngrammed into s, sh, sho and match every title starting with s — not just those starting with sho.
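The difference between the two outcomes can be reproduced with a small in-memory model of the example table. This is a sketch under stated assumptions: `build_index` and `search` are hypothetical helpers, and the model assumes a row matches when any one of the query tokens is present in the index for that row.

```python
from collections import defaultdict

TITLES = {1: "shoes", 2: "shirt", 3: "shorts", 4: "shoelaces", 5: "socks"}


def edge_ngrams(word, min_gram=1, max_gram=10):
    """Prefixes of `word` with lengths min_gram..max_gram."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


def build_index(titles):
    """Map each index-time token to the set of row ids containing it."""
    postings = defaultdict(set)
    for row_id, title in titles.items():
        for token in edge_ngrams(title):
            postings[token].add(row_id)
    return postings


def search(postings, query_tokens):
    """Return titles matching ANY query token, ordered by id."""
    ids = set()
    for token in query_tokens:
        ids |= postings.get(token, set())
    return [TITLES[i] for i in sorted(ids)]


postings = build_index(TITLES)

# Whole-word search tokenizer: "sho" is one token.
print(search(postings, ["sho"]))             # ['shoes', 'shorts', 'shoelaces']

# Edge n-gram search tokenizer: "sho" becomes s, sh, sho.
print(search(postings, edge_ngrams("sho")))  # all five titles
```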

Overriding at Query Time

You can still override the search tokenizer for a specific query by casting the query string:
-- Force edge ngram tokenization at query time
SELECT id, title FROM products WHERE title ||| 'sho'::pdb.ngram(1, 10, 'prefix_only=true') ORDER BY id;

Priority

When resolving which tokenizer to use at search time, ParadeDB checks in this order:
  1. Query-level cast — e.g. 'sho'::pdb.ngram(...) (highest priority)
  2. Index-level WITH option — e.g. WITH (search_tokenizer='unicode_words')
  3. Index-time tokenizer — the tokenizer used to build the index (fallback)
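The resolution order above amounts to "first configured value wins". A minimal sketch, assuming the three sources are available as optional strings; `resolve_search_tokenizer` is a hypothetical name for illustration, not a ParadeDB function.

```python
from typing import Optional


def resolve_search_tokenizer(
    query_cast: Optional[str],
    index_option: Optional[str],
    index_time_tokenizer: str,
) -> str:
    """Pick the search-time tokenizer using the three-step priority order."""
    if query_cast is not None:        # 1. explicit cast on the query string
        return query_cast
    if index_option is not None:      # 2. search_tokenizer set on the index
        return index_option
    return index_time_tokenizer       # 3. fall back to the index-time tokenizer


NGRAM = "ngram(1, 10, 'prefix_only=true')"

print(resolve_search_tokenizer(NGRAM, "unicode_words", NGRAM))  # the cast wins
print(resolve_search_tokenizer(None, "unicode_words", NGRAM))   # unicode_words
print(resolve_search_tokenizer(None, None, NGRAM))              # index-time ngram
```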

Supported Tokenizers

Any available tokenizer can be used as a search_tokenizer: unicode_words, simple, whitespace, ngram, literal, literal_normalized, chinese_compatible, lindera, icu, jieba, source_code.