# Search Tokenizer

> Use a different tokenizer at search time than at index time

By default, ParadeDB uses the same tokenizer at both index time and search time. This makes sense for most cases: queries should be
tokenized the same way the data was indexed.

But sometimes you need different tokenizers. The classic example is **autocomplete**:

* **Index time** — edge ngram: `"shoes"` → `s`, `sh`, `sho`, `shoe`, `shoes`
* **Search time** — unicode: `"sho"` → `sho`

If you used edge ngram at search time too, typing `"sho"` would produce `s`, `sh`, `sho`, and the query would match every document containing any of those prefixes, not just those starting with `sho`.
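The index-time expansion can be reproduced in plain PostgreSQL, without ParadeDB, to see which prefix tokens a word yields. This is only an illustrative sketch: it assumes a minimum gram length of 1 and a maximum of 10, and it does not call any ParadeDB tokenizer.

```sql theme={null}
-- Plain PostgreSQL sketch of edge ngram expansion (not ParadeDB's tokenizer).
-- Emits every prefix of 'shoes' from length 1 up to min(length, 10).
SELECT n AS prefix_length,
       left('shoes', n) AS prefix_token
FROM generate_series(1, least(length('shoes'), 10)) AS n
ORDER BY n;
-- prefix_token: s, sh, sho, shoe, shoes
```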

## Usage

Set `search_tokenizer` as a `WITH` option on the index to define a default search-time tokenizer for all text and JSON fields:

```sql theme={null}
CREATE INDEX search_idx ON products
USING bm25 (
  id,
  (title::pdb.ngram(1, 10, 'prefix_only=true'))
) WITH (key_field='id', search_tokenizer='unicode_words');
```

With this configuration:

* **Index time**: `title` is tokenized with edge ngram to create prefix tokens
* **Search time**: queries against `title` automatically use the `unicode_words` tokenizer

The `search_tokenizer` value can include parameters, e.g. `search_tokenizer='simple(lowercase=false)'`.

Because `search_tokenizer` only affects query-time behavior, you can change it without reindexing:

```sql theme={null}
ALTER INDEX search_idx SET (search_tokenizer = 'simple(lowercase=false)');
```
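To check which value is currently set, one option is to inspect the index's storage parameters in the PostgreSQL catalog. This sketch assumes ParadeDB records `search_tokenizer` in `pg_class.reloptions` like other index `WITH` options, which this page does not confirm.

```sql theme={null}
-- Assumption: WITH options on search_idx are visible in pg_class.reloptions.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'search_idx';
```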

## Example

```sql theme={null}
CREATE TABLE products (
    id serial8 NOT NULL PRIMARY KEY,
    title text
);
INSERT INTO products (title) VALUES
    ('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks');

CREATE INDEX idx_products ON products USING bm25
    (id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
    WITH (key_field = 'id', search_tokenizer = 'unicode_words');

-- "sho" stays as one token → matches shoes, shorts, shoelaces
SELECT id, title FROM products WHERE title ||| 'sho' ORDER BY id;

-- "s" stays as one token → matches all five titles
SELECT id, title FROM products WHERE title ||| 's' ORDER BY id;
```

Without `search_tokenizer`, the query `'sho'` would be edge-ngrammed into `s`, `sh`, `sho` and match
every title starting with `s` (including `shirt` and `socks`), not just those starting with `sho`.
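The difference can be simulated in plain PostgreSQL, without a BM25 index. The sketch below is a simplified model, not ParadeDB's implementation: it builds prefix tokens for each title by hand and assumes a row matches when any query token equals any indexed token, which is consistent with the behavior described above. The names `index_tokens` and `query_tokens` are hypothetical.

```sql theme={null}
-- Simplified model of the example above; does not use ParadeDB.
WITH titles(title) AS (
    VALUES ('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks')
),
index_tokens AS (
    -- Edge ngrams of each title, lengths 1 through 10
    SELECT title, left(title, n) AS tok
    FROM titles, generate_series(1, least(length(title), 10)) AS n
),
query_tokens(tokenizer, tok) AS (
    VALUES ('unicode_words', 'sho'),                          -- query kept whole
           ('edge ngram', 's'), ('edge ngram', 'sh'), ('edge ngram', 'sho')
)
SELECT q.tokenizer,
       array_agg(DISTINCT i.title ORDER BY i.title) AS matched_titles
FROM query_tokens q
JOIN index_tokens i USING (tok)
GROUP BY q.tokenizer
ORDER BY q.tokenizer;
-- edge ngram    -> all five titles
-- unicode_words -> shoelaces, shoes, shorts
```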

## Overriding at Query Time

You can still override the search tokenizer for a specific query by casting the query string:

```sql theme={null}
-- Force edge ngram tokenization at query time
SELECT id, title FROM products WHERE title ||| 'sho'::pdb.ngram(1, 10, 'prefix_only=true') ORDER BY id;
```

## Priority

When resolving which tokenizer to use at search time, ParadeDB checks in this order:

1. **Query-level cast** — e.g. `'sho'::pdb.ngram(...)` (highest priority)
2. **Index-level WITH option** — e.g. `WITH (search_tokenizer='unicode_words')`
3. **Index-time tokenizer** — the tokenizer used to build the index (fallback)
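The resolution order above behaves like a `COALESCE` over the three levels. The following plain PostgreSQL query is only an illustration of that fallback logic, with `NULL` standing for "not specified at that level"; the column names are hypothetical and it says nothing about how ParadeDB implements the lookup internally.

```sql theme={null}
-- Illustration of the fallback order only; not ParadeDB internals.
SELECT scenario,
       COALESCE(query_cast, index_option, index_tokenizer) AS effective_tokenizer
FROM (VALUES
    ('cast on the query string', 'whitespace', 'unicode_words', 'ngram'),
    ('index option only',        NULL,         'unicode_words', 'ngram'),
    ('neither set',              NULL,         NULL,            'ngram')
) AS t(scenario, query_cast, index_option, index_tokenizer);
-- effective_tokenizer: whitespace, unicode_words, ngram
```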

## Supported Tokenizers

Any [available tokenizer](/documentation/tokenizers/overview) can be used as a `search_tokenizer`:
`unicode_words`, `simple`, `whitespace`, `ngram`, `literal`, `literal_normalized`, `chinese_compatible`,
`lindera`, `icu`, `jieba`, `source_code`.
