Remove Stopwords - ParadeDB

Stopwords are words that are so common or semantically insignificant in most contexts that they can be ignored during indexing. In English, for example, stopwords include “a”, “and”, and “or”. All tokenizers except the literal tokenizer can be configured to automatically remove stopwords for one or more languages.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('stopwords_language=english')))
WITH (key_field='id');

Valid languages are Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, and Swedish. Language names are case-insensitive.
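
As a quick check of the case-insensitivity noted above, the following sketch casts the same string with two spellings of the language name. It reuses only the `pdb.simple` cast syntax shown elsewhere on this page. The sample sentence and the column aliases are arbitrary choices for illustration, and both columns should contain the same tokens if the names are treated case-insensitively.

```sql
-- Same input, two spellings of the language name.
-- Assumption: both columns come back identical, because language names
-- are documented as case-insensitive.
SELECT
  'The cat in the hat'::pdb.simple('stopwords_language=english')::text[] AS lowercase_name,
  'The cat in the hat'::pdb.simple('stopwords_language=ENGLISH')::text[] AS uppercase_name;
```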

Multiple Languages

For documents containing multiple languages, you can specify multiple stopword languages as a comma-separated list:

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('stopwords_language=English,French')))
WITH (key_field='id');

SELECT 'the quick fox and le renard et'::pdb.simple('stopwords_language=English,French')::text[];

Expected Response

        text
--------------------
 {quick,fox,renard}
(1 row)
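
To see what the second language contributes, one could run the same sentence with only English configured and compare. This is a sketch that reuses the cast syntax from the statement above. The column aliases are illustrative, and the exact tokens depend on the built-in stopword lists. The French words "le" and "et" would be expected to survive in the first column, assuming they are not on the English list.

```sql
-- Compare a single stopword language against the comma-separated list.
-- Assumption: "le" and "et" are removed only when French is included.
SELECT
  'the quick fox and le renard et'::pdb.simple('stopwords_language=English')::text[] AS english_only,
  'the quick fox and le renard et'::pdb.simple('stopwords_language=English,French')::text[] AS english_and_french;
```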

Example

To demonstrate this token filter, let’s compare the output of the same string tokenized without and with stopword removal:

SELECT
  'The cat in the hat'::pdb.simple::text[],
  'The cat in the hat'::pdb.simple('stopwords_language=English')::text[];

Expected Response

         text         |   text
----------------------+-----------
 {the,cat,in,the,hat} | {cat,hat}
(1 row)
