running and runs is run. The stemmer filter can be applied to any tokenizer.
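
To make the idea of reducing words to a root form concrete, here is a toy sketch in Python. It is not the algorithm the stemmer filter uses; real stemmers apply far more rules per language. The function name `toy_stem` and its two suffix rules are purely illustrative assumptions.

```python
def toy_stem(token: str) -> str:
    """Toy suffix stripper for illustration only (hypothetical helper,
    not the filter's actual stemming algorithm)."""
    # Rule 1: strip "-ing" from longer words, then collapse a doubled
    # final consonant left behind ("runn" -> "run").
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
            stem = stem[:-1]
        return stem
    # Rule 2: strip a plural or third-person "-s" (but not "-ss").
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


stems = [toy_stem(t) for t in ["running", "runs", "run"]]
```

With these two rules, all three example words map to the same stem, which is what lets a query for one form match documents containing the others.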
Supported stemmer languages are Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish.

The remove_long filter removes all tokens longer than a fixed number of bytes. If not specified, remove_long defaults to 255.
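
Because the limit is measured in bytes rather than characters, tokens containing multi-byte characters reach it sooner. The following Python sketch illustrates that behavior under two assumptions that the text above does not state: that tokens are measured in their UTF-8 encoding, and that a token of exactly the limit is kept. The function name `remove_long` here is a stand-in for illustration, not the filter's implementation.

```python
def remove_long(tokens, max_bytes=255):
    """Keep tokens whose UTF-8 encoding is at most max_bytes long.

    Assumptions: UTF-8 byte length, and an inclusive limit.
    """
    return [t for t in tokens if len(t.encode("utf-8")) <= max_bytes]


tokens = [
    "short",
    "é" * 200,  # 200 characters, but 400 bytes in UTF-8
    "x" * 255,  # exactly at the limit
    "x" * 256,  # one byte over the limit
]
kept = remove_long(tokens)
```

Here the 200-character accented token is dropped even though its character count is under 255, since each `é` occupies two bytes.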

The lowercase filter lowercases all tokens. If not specified, lowercase defaults to true.

stopwords_language removes common “stop words” for a specific language from the original text before tokenization. Supported languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.

stopwords removes custom words from the original text before tokenization.
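
As a rough illustration of removing custom words from text, consider the Python sketch below. The helper `remove_stopwords`, its case-insensitive matching, and its word-splitting rule are all assumptions made for the example; the actual filter may match and split differently.

```python
import re


def remove_stopwords(text, stopwords):
    """Drop every word of `text` that appears in `stopwords`.

    Assumptions: matching ignores case, and words are runs of
    word characters, so punctuation is discarded by this sketch.
    """
    blocked = {w.lower() for w in stopwords}
    words = re.findall(r"\w+", text)
    return " ".join(w for w in words if w.lower() not in blocked)


cleaned = remove_stopwords("The quick fox and the lazy dog", ["the", "and"])
```

After this step, only the remaining words ("quick fox lazy dog") would be indexed, which keeps very common words from dominating the index.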