Token Filters
Token filters apply additional processing to tokens after they have been created.
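As a rough sketch of the idea (the names below are invented for illustration and are not this engine's API), each filter can be thought of as a function applied in order to the token stream a tokenizer emits:

```python
# Illustration only: token filters modeled as plain functions chained over a token list.
def apply_filters(tokens, filters):
    for f in filters:
        tokens = f(tokens)
    return tokens

lowercase = lambda toks: [t.lower() for t in toks]
print(apply_filters(["Running", "Quickly"], [lowercase]))
# ['running', 'quickly']
```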
Stemmer
Stemming is the process of reducing words to their root form. In English, for example, the root form of running and runs is run. The stemmer filter can be applied to any tokenizer.
Available stemmers are Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish.
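As a rough illustration of the behavior (not this engine's own stemmer implementation), NLTK's Snowball stemmers reduce English words to the same kind of root form:

```python
# Illustrative only: NLTK's Snowball stemmer shows the root-form reduction
# described above; it is not the stemmer filter's actual implementation.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["running", "runs", "ran"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# runs -> run
# ran -> ran   (a suffix stemmer does not map irregular forms)
```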
Remove Long
The remove_long filter removes all tokens longer than a fixed number of bytes. If not specified, remove_long defaults to 255.
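A minimal Python sketch of this behavior, assuming the limit is measured against the token's UTF-8 encoding (the function name here is hypothetical):

```python
# Hypothetical sketch: drop any token whose UTF-8 encoding exceeds the
# byte limit, which defaults to 255 as documented above.
def remove_long(tokens, limit=255):
    return [t for t in tokens if len(t.encode("utf-8")) <= limit]

print(remove_long(["ok", "x" * 300]))  # ['ok'] -- the 300-byte token is dropped
```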
Lowercase
The lowercase filter lowercases all tokens. If not specified, lowercase defaults to true.
Stopwords Language
stopwords_language removes common “stop words” for a specific language from the original text before tokenization.
Available languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
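As an illustration of the effect (not this engine's implementation), NLTK ships stop word lists for several of the same languages; removing them from a piece of text looks like this (the stopwords corpus must be downloaded once with nltk.download("stopwords")):

```python
# Illustrative only: remove English stop words from the input text before it
# would be tokenized further. Requires nltk.download("stopwords") once.
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
text = "the quick brown fox jumps over the lazy dog"
print(" ".join(w for w in text.split() if w not in stop))
# quick brown fox jumps lazy dog
```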
Custom Stopwords
stopwords removes custom words from the original text before tokenization.
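A small sketch of the same idea with a user-supplied list (the list and function name here are hypothetical):

```python
# Hypothetical example: strip a custom set of words from the text before
# it is handed to the tokenizer.
CUSTOM_STOPWORDS = {"foo", "bar"}

def remove_custom_stopwords(text, stop=CUSTOM_STOPWORDS):
    return " ".join(w for w in text.split() if w not in stop)

print(remove_custom_stopwords("foo keeps bar only the useful words"))
# keeps only the useful words
```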