Stemming is the process of reducing words to their root form. In English, for example, the root form of “running” and “runs” is “run”. Stemming can be configured for any tokenizer except the literal tokenizer. Stemmers in ParadeDB are based on the stemming algorithms published on the official Snowball website. To set a stemmer, append stemmer=<language> to the tokenizer’s arguments, as in the following index definition:
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('stemmer=english')))
WITH (key_field='id');
Valid languages are arabic, czech, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, polish, portuguese, romanian, russian, spanish, swedish, tamil, and turkish.

To demonstrate this token filter, the following statement tokenizes the same text twice, first without a stemmer and then with the English stemmer:
SELECT
  'I am running'::pdb.simple::text[],
  'I am running'::pdb.simple('stemmer=english')::text[];
Expected Response
      text      |    text
----------------+------------
 {i,am,running} | {i,am,run}
(1 row)
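
As a further sketch, the same cast syntax can be used to check how other inputs or languages are stemmed before building an index. The sample strings below are illustrative and not taken from the ParadeDB documentation; only the stemmer=<language> argument and the pdb.simple tokenizer come from the examples above. Based on the description at the top of this page, the first two columns should both reduce their verb to “run”, but the exact tokens are best confirmed by running the statement against your own ParadeDB instance.

```sql
-- Illustrative only: the input strings are made up for this sketch.
-- Columns 1 and 2 apply the English stemmer to two forms of the same verb.
-- Column 3 swaps in another language from the list of valid languages.
SELECT
  'she runs'::pdb.simple('stemmer=english')::text[]        AS english_runs,
  'she is running'::pdb.simple('stemmer=english')::text[]  AS english_running,
  'die Häuser'::pdb.simple('stemmer=german')::text[]       AS german_sample;
```

Because the stemmer is set per tokenizer, the language chosen here should match the one used in the index definition so that test output reflects what the index actually stores.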