# Tokenizers

<Danger>
  **Legacy Docs:** This page describes our legacy API. It will be deprecated in
  a future version. Please use the [v2 API](/) where possible.
</Danger>

Tokenizers determine how text is split up when indexed. Picking the right tokenizer is crucial for returning the results that you want.
Different tokenizers are optimized for different query types and languages.

For instance, consider the phrase `lore ipsum dolor`. If we tokenize this phrase by splitting on whitespace, users can find this phrase if
they search for `lore`, `ipsum`, or `dolor`.
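
To see this splitting directly, the phrase can be passed through `paradedb.tokenize`, which is covered under "Tokenizing a Query" on this page. This is a minimal sketch: it assumes `paradedb.tokenizer` can be called with only a tokenizer name and falls back to defaults for its other parameters.

```sql theme={null}
-- Assumption: paradedb.tokenizer('whitespace') needs no further arguments.
-- Per the description above, this should yield the tokens lore, ipsum, and dolor.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('whitespace'),
  'lore ipsum dolor'
);
```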

## Basic Usage

Both the `text_fields` and `json_fields` JSON configs accept a `tokenizer` key.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "whitespace"}}
    }'
);
```

<Note>If no `tokenizer` is specified, the `default` tokenizer is used.</Note>

## Available Tokenizers

The `paradedb.tokenizers()` function returns a list of all available tokenizer names.

```sql theme={null}
SELECT * FROM paradedb.tokenizers();
```

### Default

Tokenizes the text by splitting on whitespace and punctuation, filters out tokens that are larger than 255 bytes, and converts to lowercase.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "default"}
        }
    }'
);
```

### Whitespace

Like `default`, but splits based on whitespace only. Filters out tokens that are larger than 255 bytes and converts to lowercase.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "whitespace"}
        }
    }'
);
```
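
One way to compare `default` and `whitespace` is to run the same text through both with `paradedb.tokenize`. The sample string below is made up for illustration, and the sketch assumes `paradedb.tokenizer` accepts a bare tokenizer name.

```sql theme={null}
-- Hypothetical sample text containing punctuation.
-- Going by the descriptions above, `default` should split on the hyphen,
-- comma, and exclamation mark as well as on whitespace.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('default'),
  'Plastic-keyboard, wireless!'
);

-- `whitespace` should split on the space only, leaving punctuation
-- attached to the tokens.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('whitespace'),
  'Plastic-keyboard, wireless!'
);
```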

### Raw

Treats the entire text as a single token. Filters out tokens larger than 255 bytes and converts to lowercase.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "raw"}
        }
    }'
);
```

### Keyword

Treats the entire text as a single token, as-is. If a field is tokenized as `keyword`, equality and comparison operators such as `=`, `<=`, `<`, `>`, `>=`, and `<>` on that field can be pushed down to the index.

The `keyword` tokenizer is supported for both `TEXT` and `VARCHAR` fields.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "keyword"}
        }
    }'
);
```
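
The sketch below shows the kind of query that the pushdown described above applies to, once `description` is indexed with the `keyword` tokenizer. The literal `'Sleek running shoes'` is an illustrative value and may not exist in your table. Whether the filter is actually handled by the index depends on your ParadeDB version and the query plan, so check the `EXPLAIN` output rather than assuming it.

```sql theme={null}
-- Illustrative value; replace it with one that exists in your data.
-- A keyword-tokenized field is stored as-is, so the comparison is exact
-- and case-sensitive.
SELECT id, description
FROM mock_items
WHERE description = 'Sleek running shoes';

-- Inspect the plan to see whether the filter was pushed down to the index.
EXPLAIN
SELECT id, description
FROM mock_items
WHERE description = 'Sleek running shoes';
```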

### Regex

Tokenizes text using a regular expression. The regular expression can be specified with the `pattern` parameter.
For instance, `\\W+` splits on non-word characters.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "regex", "pattern": "\\W+"}
        }
    }'
);
```

<ParamField body="pattern" required>
  The regular expression pattern used to tokenize the text.
</ParamField>
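
To check what a given pattern produces before building an index, the text can be run through `paradedb.tokenize`. This sketch assumes `paradedb.tokenizer` accepts a `pattern` named argument that mirrors the JSON `pattern` key. The sample text is made up.

```sql theme={null}
-- Assumption: `pattern` is accepted as a named argument here.
-- The backslash is written once in this SQL string literal (with the
-- PostgreSQL default standard_conforming_strings = on). Inside the JSON
-- index config it must be doubled, as in "\\W+".
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('regex', pattern => '\W+'),
  'hello, world: foo_bar'
);
```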

### Ngram

Tokenizes text by splitting words into overlapping substrings based on the specified parameters. For instance, a 3-gram tokenizer splits the word `cheese`
into `che`, `hee`, `ees`, and `ese`.

During search, an ngram-tokenized query is considered a match only if **all** its ngram tokens match. For instance, the 3-grams of `chse` (`chs` and `hse`) do not match against `cheese`
because neither token appears among the 3-grams of `cheese`. However, the query `hees` matches because both of its 3-grams, `hee` and `ees`, appear among
those of `cheese`.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 3, "prefix_only": false}
        }
    }'
);
```

<ParamField body="min_gram" required>
  Defines the minimum length for the n-grams. For instance, if set to 2, the
  smallest token created would be of length 2 characters.
</ParamField>

<ParamField body="max_gram" required>
  Determines the maximum length of the n-grams. If set to 5, the largest token
  produced would be of length 5 characters.
</ParamField>

<ParamField body="prefix_only" required>
  When set to `true`, the tokenizer generates only n-grams that start at the
  beginning of the word, producing a progression of prefixes. If `false`,
  n-grams are generated from every position in the word, at every length
  within the `min_gram` to `max_gram` range.
</ParamField>
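
The effect of these parameters can be inspected with `paradedb.tokenize`. The first query reproduces the `cheese` example from this section. The second query switches to `prefix_only => true` so the two outputs can be compared. Its expected output is not listed here, so run it to see what your version returns.

```sql theme={null}
-- 3-grams of "cheese"; per the description above: che, hee, ees, ese.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false),
  'cheese'
);

-- Prefix-only n-grams of length 2 to 3 for the same word.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('ngram', min_gram => 2, max_gram => 3, prefix_only => true),
  'cheese'
);
```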

### Source Code

Tokenizes the text by splitting based on casing conventions commonly used in code, such as camelCase or PascalCase. Filters out tokens that exceed 255 bytes, and converts them to lowercase with ASCII folding.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "source_code"}
        }
    }'
);
```
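
To see how identifiers are split, an identifier can be run through `paradedb.tokenize`. The identifier `getUserById` is a made-up sample, and the sketch assumes `paradedb.tokenizer` accepts `'source_code'` with no other arguments.

```sql theme={null}
-- Hypothetical camelCase identifier.
-- Per the description above, it should be split at case boundaries and
-- the resulting tokens lowercased.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('source_code'),
  'getUserById'
);
```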

### Chinese Compatible

The `chinese_compatible` tokenizer performs simple character splitting by treating each CJK (Chinese, Japanese, Korean) character as a single token and grouping non-CJK characters as a single token. Non-alphanumeric characters like punctuation are ignored and not included in any token.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "chinese_compatible"}
        }
    }'
);
```

### Lindera

The Lindera tokenizer is a more advanced CJK tokenizer that uses prebuilt Chinese, Japanese, or Korean dictionaries to break text into meaningful tokens (words or phrases) rather than
individual characters.

`chinese_lindera` uses the CC-CEDICT dictionary, `korean_lindera` uses the KoDic dictionary, and `japanese_lindera` uses the IPADIC dictionary.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "chinese_lindera"}
        }
    }'
);
```

### Jieba

[Jieba](https://github.com/jiegec/tantivy-jieba) is a Chinese language tokenizer that uses a dictionary-based approach for word segmentation. It generally segments Chinese text more accurately than `chinese_lindera` or `chinese_compatible`, but may run significantly slower than either.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "jieba"}
        }
    }'
);
```
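
When choosing between the Chinese tokenizers, it can help to run a representative sentence through each one and compare the tokens. The sample sentence below is for illustration only. The sketch assumes each tokenizer name can be passed to `paradedb.tokenizer` without further arguments.

```sql theme={null}
-- Character-level splitting: one token per CJK character.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('chinese_compatible'), '我爱北京天安门'
);

-- Dictionary-based segmentation with Lindera (CC-CEDICT).
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('chinese_lindera'), '我爱北京天安门'
);

-- Dictionary-based segmentation with Jieba.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('jieba'), '我爱北京天安门'
);
```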

### ICU

The ICU (International Components for Unicode) tokenizer breaks down text according to the Unicode standard. It can be used to tokenize most languages
and recognizes the nuances in word boundaries across different languages.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
          "tokenizer": {"type": "icu"}
        }
    }'
);
```

## Tokenizing a Query

To manually tokenize input text with a specified tokenizer, use `paradedb.tokenize`. This function is useful for comparing different tokenizers or
passing tokens directly into a [term-level query](/legacy/advanced/term/term).

```sql theme={null}
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false),
  'keyboard'
);
```

## Multiple Tokenizers

ParadeDB supports using multiple tokenizers for the same field within a single BM25 index. This feature allows for more flexible and powerful querying capabilities, enabling you to employ various strategies to match against an index term.

To set up a field with multiple tokenizers, configure it under an alias in the `WITH` options of `CREATE INDEX`. The alias configuration must contain a `"column"` key that points to the table column containing the data for that field.

Here's an example of how to create a BM25 index with multiple tokenizers for the same field:

```sql theme={null}
CREATE INDEX search_idx ON public.mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "whitespace"}},
        "description_ngram": {"tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3, "prefix_only": false}, "column": "description"},
        "description_stem": {"tokenizer": {"type": "default", "stemmer": "English"}, "column": "description"}
    }'
);

-- Example queries
SELECT * FROM mock_items WHERE id @@@ paradedb.parse('description_ngram:cam AND description_stem:digitally');
SELECT * FROM mock_items WHERE id @@@ paradedb.parse('description:"Soft cotton" OR description_stem:shirts');
```
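
To understand why a query matches one alias but not another, the same text can be tokenized the way each alias is configured. This is a sketch: it assumes `paradedb.tokenizer` accepts a `stemmer` named argument that mirrors the JSON `"stemmer"` key, and the sample text is made up.

```sql theme={null}
-- Tokens as indexed under "description" (whitespace).
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('whitespace'),
  'Digitally enhanced camera'
);

-- Tokens as indexed under "description_ngram" (3-grams).
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('ngram', min_gram => 3, max_gram => 3, prefix_only => false),
  'Digitally enhanced camera'
);

-- Tokens as indexed under "description_stem" (default with English stemming).
-- Assumption: `stemmer` is accepted as a named argument here.
SELECT * FROM paradedb.tokenize(
  paradedb.tokenizer('default', stemmer => 'English'),
  'Digitally enhanced camera'
);
```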
