Stemmer - ParadeDB

Stemming reduces words to their root form. In English, for example, “running” and “runs” both reduce to “run”. Stemming can be configured for any tokenizer except the literal tokenizer. Stemmers in ParadeDB are based on the stemming algorithms published by the Snowball project.

To set a stemmer, append stemmer=<language> to the tokenizer’s arguments:

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('stemmer=english')))
WITH (key_field='id');
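
Because the indexed text and the query text are reduced to the same stems, a search for “run” can then match rows whose description contains “running”. The query below is a minimal sketch: it assumes that mock_items is the sample table from the ParadeDB quickstart and that the ||| match operator is available in your ParadeDB version. Neither is defined on this page, so check both against your setup.

-- Assumes search_idx from the statement above has been created.
-- Assumption: ||| tokenizes 'run' with the indexed column's tokenizer
-- and matches rows containing any of the resulting tokens.
SELECT id, description
FROM mock_items
WHERE description ||| 'run'
LIMIT 5;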

Valid languages are arabic, czech, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, polish, portuguese, romanian, russian, spanish, swedish, tamil, and turkish.

To demonstrate this token filter, let’s compare the output of the same text tokenized without and with the stemmer:

SELECT
  'I am running'::pdb.simple::text[],
  'I am running'::pdb.simple('stemmer=english')::text[];

Expected Response

      text      |    text
----------------+------------
 {i,am,running} | {i,am,run}
(1 row)
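
The same comparison works for any other language in the list of valid languages. The statement below is a sketch using french. The sample sentence is only an illustration, and the output is not shown here because it depends on the Snowball rules for French.

-- Left column: plain pdb.simple tokens. Right column: the same tokens after French stemming.
-- The sentence is an arbitrary example, not taken from the ParadeDB docs.
SELECT
  'les chevaux courent'::pdb.simple::text[],
  'les chevaux courent'::pdb.simple('stemmer=french')::text[];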