> ## Documentation Index
> Fetch the complete documentation index at: https://docs.paradedb.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Token Filters

<Danger>
  **Legacy Docs:** This page describes our legacy API. It will be deprecated in
  a future version. Please use the [v2 API](/) where possible.
</Danger>

Token filters apply additional processing to tokens after they have been created.

## Stemmer

Stemming is the process of reducing words to their root form. In English, for example, the root form of `running` and `runs` is `run`. The `stemmer` filter can
be applied to any tokenizer.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stemmer": "English"}}
    }'
);
```

<ParamField body="stemmer">
  Available stemmers are `Arabic`, `Czech`, `Danish`, `Dutch`, `English`,
  `Finnish`, `French`, `German`, `Greek`, `Hungarian`, `Italian`, `Norwegian`,
  `Polish`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`,
  `Tamil`, and `Turkish`.
</ParamField>

## Remove Long

The `remove_long` filter removes all tokens longer than a fixed number of bytes. If not specified,
`remove_long` defaults to `255`.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "remove_long": 255}}
    }'
);
```

## Lowercase

The `lowercase` filter lowercases all tokens. If not specified, `lowercase` defaults to `true`.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "lowercase": false}}
    }'
);
```

## Stopwords Language

<Note>This filter is not supported for the ngram tokenizer.</Note>

`stopwords_language` removes common "stop words" for a specific language from the original text before tokenization.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stopwords_language": "English"}}
    }'
);
```

<ParamField body="stopwords_language">
  Available languages are `Czech`, `Danish`, `Dutch`, `English`, `Finnish`,
  `French`, `German`, `Hungarian`, `Italian`, `Norwegian`, `Polish`,
  `Portuguese`, `Russian`, `Spanish`, and `Swedish`.
</ParamField>

## Custom Stopwords

<Note>This filter is not supported for the ngram tokenizer.</Note>

`stopwords` removes custom words from the original text before tokenization.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stopwords": ["shoes", "boots"]}}
    }'
);
```

## ASCII Folding

The ASCII folding filter strips away diacritical marks (accents, umlauts, tildes, etc.) while leaving the base character intact.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "ascii_folding": true}}
    }'
);
```
