Documentation Index
Fetch the complete documentation index at: https://docs.paradedb.com/llms.txt
Use this file to discover all available pages before exploring further.
The token length filter removes tokens whose byte length falls outside a configured range: an upper bound drops tokens that are too long, and a lower bound drops tokens that are too short.
To remove all tokens longer than a certain length, append a remove_long configuration to the tokenizer:
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('remove_long=100')))
WITH (key_field='id');
To remove all tokens shorter than a certain length, use remove_short:
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.simple('remove_short=3')))
WITH (key_field='id');
All tokenizers besides the literal tokenizer accept these configurations.
To demonstrate this token filter, let's compare the output of the following two expressions, one without the filter and one with it:
SELECT
'A supersupersuperlong token'::pdb.simple::text[],
'A supersupersuperlong token'::pdb.simple('remove_short=2', 'remove_long=10')::text[];
text | text
-------------------------------+---------
{a,supersupersuperlong,token} | {token}
(1 row)
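The filtering rule above can be sketched in plain Python (an illustrative sketch only, not ParadeDB's implementation; note that lengths are measured in bytes, so the token is UTF-8 encoded before measuring):

```python
def length_filter(tokens, remove_short, remove_long):
    """Keep tokens whose byte length is at least remove_short
    and at most remove_long; drop everything else."""
    return [
        t for t in tokens
        if remove_short <= len(t.encode("utf-8")) <= remove_long
    ]

# Mirrors the SQL example: 'a' (1 byte) is shorter than 2,
# 'supersupersuperlong' (19 bytes) is longer than 10.
tokens = ["a", "supersupersuperlong", "token"]
print(length_filter(tokens, remove_short=2, remove_long=10))  # → ['token']
```

This matches the {token} result returned by the second cast above.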