Token Filters - ParadeDB

Token filters apply additional processing to tokens after they have been created.

Stemmer

Stemming is the process of reducing words to their root form. In English, for example, "running" and "runs" both reduce to the root form "run". The stemmer filter can be applied to any tokenizer.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stemmer": "English"}}
    }'
);

The stemmer option accepts one of the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish.
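To see the effect of stemming at query time, a search along the following lines may help. This is only a sketch: it assumes the `@@@` search operator available in recent ParadeDB releases, and that query terms are passed through the same stemmer as the indexed text. Neither assumption is stated on this page.

```sql
-- Hypothetical query against an index built with "stemmer": "English".
-- Assumption: the @@@ operator is available and query terms are
-- analyzed with the field's tokenizer, including the stemmer.
SELECT id, description
FROM mock_items
WHERE description @@@ 'run';
-- With English stemming, rows containing "running" or "runs" would be
-- expected to match too, since all three forms reduce to "run".
```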

Remove Long

The remove_long filter removes all tokens longer than a fixed number of bytes. If not specified, remove_long defaults to 255.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "remove_long": 255}}
    }'
);
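Because the limit is measured in bytes rather than characters, tokens containing non-ASCII text reach it sooner than their character count suggests. The difference can be illustrated with plain PostgreSQL functions (no ParadeDB involved), assuming a UTF-8 database encoding:

```sql
-- U&'caf\00E9' is "café" written with an explicit precomposed é (U+00E9).
-- Assumption: the database encoding is UTF-8.
SELECT
    char_length(U&'caf\00E9')  AS characters,  -- 4 characters
    octet_length(U&'caf\00E9') AS bytes;       -- 5 bytes: é takes 2 bytes in UTF-8
```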

Lowercase

The lowercase filter lowercases all tokens. If not specified, lowercase defaults to true.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "lowercase": false}}
    }'
);

Stopwords Language

The stopwords_language filter removes common “stop words” for a specific language from the original text before tokenization. This filter is not supported for the ngram tokenizer.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stopwords_language": "English"}}
    }'
);

The stopwords_language option accepts one of the following languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.

Custom Stopwords

The stopwords filter removes a custom list of words from the original text before tokenization. This filter is not supported for the ngram tokenizer.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stopwords": ["shoes", "boots"]}}
    }'
);

ASCII Folding

The ASCII folding filter (ascii_folding) strips away diacritical marks (accents, umlauts, tildes, etc.) while leaving the base character intact. For example, “café” becomes “cafe”.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "ascii_folding": true}}
    }'
);
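Each filter on this page is shown in isolation. The sketch below combines several of them in a single tokenizer configuration. It assumes the options can be set side by side in the same tokenizer object, which this page does not state explicitly. It reuses only option names from the examples on this page.

```sql
-- Sketch: several token filters on one field.
-- Assumption: filter options may be combined in one tokenizer object.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {
            "tokenizer": {
                "type": "default",
                "lowercase": true,
                "stemmer": "English",
                "stopwords_language": "English",
                "ascii_folding": true
            }
        }
    }'
);
```

If an index named search_idx already exists from an earlier example, it would need to be dropped first (DROP INDEX search_idx;) before this statement can run.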
