Token filters apply additional processing to tokens after they have been created.

Stemmer

Stemming is the process of reducing words to their root form. In English, for example, the root form of running and runs is run. The stemmer filter can be applied to any tokenizer.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "stemmer": "English"}}
    }'
);
stemmer

Available stemmers are Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish.

Remove Long

The remove_long filter removes all tokens longer than a fixed number of bytes. If not specified, remove_long defaults to 255.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "remove_long": 255}}
    }'
);

Lowercase

The lowercase filter lowercases all tokens. If not specified, lowercase defaults to true.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field='id',
    text_fields='{
        "description": {"tokenizer": {"type": "default", "lowercase": false}}
    }'
);