The ngram tokenizer splits text into “grams,” where each “gram” is of a certain length. The tokenizer takes two arguments. The first is the minimum character length of a “gram,” and the second is the maximum character length. Grams will be generated for all sizes between the minimum and maximum gram size, inclusive. For example, pdb.ngram(2,5) will generate tokens of size 2, 3, 4, and 5. To generate grams of a single fixed length, set the minimum and maximum gram size equal to each other.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
To get a feel for this tokenizer, run the following command and replace the text with your own:
SELECT 'Tokenize me!'::pdb.ngram(3,3)::text[];
Expected Response
                      text
-------------------------------------------------
 {tok,oke,ken,eni,niz,ize,"ze ","e m"," me",me!}
(1 row)
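The sliding-window behavior above can be sketched in Python. This is a minimal model of how the tokens in the output are produced (including the lowercasing visible in the expected response), not ParadeDB's actual implementation:

```python
def ngrams(text, min_gram, max_gram):
    """Emit every character n-gram of each length from min_gram to max_gram, inclusive."""
    text = text.lower()  # the expected response above shows lowercased tokens
    tokens = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the text, one character at a time.
        tokens.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return tokens

# Reproduces the expected response for pdb.ngram(3,3):
ngrams("Tokenize me!", 3, 3)
# With pdb.ngram(2,5), grams of sizes 2 through 5 are all generated.
```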

Ngram Prefix Only

To generate ngram tokens for only the first n characters of the text, set prefix_only to true.
SELECT 'Tokenize me!'::pdb.ngram(3,3,'prefix_only=true')::text[];
Expected Response
 text
-------
 {tok}
(1 row)
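The prefix-only mode can be sketched the same way: only the grams anchored at the start of the text are emitted, one per gram size. A minimal model, not ParadeDB's implementation:

```python
def ngram_prefix(text, min_gram, max_gram):
    """Emit only the leading n-gram for each size from min_gram to max_gram (prefix_only=true)."""
    text = text.lower()
    return [text[:n] for n in range(min_gram, max_gram + 1) if len(text) >= n]

# Reproduces the expected response for pdb.ngram(3,3,'prefix_only=true'):
ngram_prefix("Tokenize me!", 3, 3)
```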

Phrase and Proximity Queries with Ngram

Because multiple ngram tokens can overlap, the ngram tokenizer does not store token positions. As a result, queries that rely on token positions, such as phrase, phrase prefix, regex phrase, and proximity, are not supported over ngram-tokenized fields. The exception is when the minimum gram size equals the maximum gram size, which guarantees unique token positions. In that case, setting positions=true enables these queries.
SELECT 'Tokenize me!'::pdb.ngram(3,3,'positions=true')::text[];

Exact Substring Matching with Phrase Queries

With positions=true, phrase queries over ngram fields perform exact substring matching. This is faster than using match conjunction on an ngram field, which creates a Must clause for every ngram token and intersects them independently. A phrase query uses a single positional intersection instead. The tradeoff is that phrase queries are stricter: they require tokens at consecutive positions within a single field value, while match conjunction only requires all tokens to appear somewhere in the document.
CREATE TABLE books (id SERIAL PRIMARY KEY, titles TEXT[]);
INSERT INTO books (titles) VALUES
    (ARRAY['The Dragon Hatchling', 'Wings of Gold']),
    (ARRAY['Dragon Slayer', 'Hatchling Care']);

CREATE INDEX ON books
USING bm25 (id, (titles::pdb.ngram(4,4,'positions=true')))
WITH (key_field='id');

-- Phrase: matches exact substring "Dragon Hatchling" — only row 1
SELECT * FROM books WHERE titles ### 'Dragon Hatchling';

-- Match conjunction: matches all ngrams anywhere — also only row 1 here,
-- but on larger datasets could match rows where the ngrams are scattered
SELECT * FROM books WHERE titles ||| 'Dragon Hatchling';

DROP TABLE books;
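The distinction between the two query types can be sketched in Python. This is a simplified model of the semantics described above (a phrase query behaving as an exact substring match, a match conjunction requiring every ngram to appear somewhere), not ParadeDB's query execution:

```python
def ngrams(text, n):
    """Fixed-size character n-grams, lowercased."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def phrase_match(field_values, query):
    # Phrase query over an ngram(n,n) field with positions=true:
    # the query's ngrams must occur at consecutive positions, which is
    # equivalent to the query being a substring of a single field value.
    return any(query.lower() in value.lower() for value in field_values)

def match_conjunction(field_values, query, n=4):
    # Match conjunction: every query ngram must appear somewhere in the
    # document, in any field value and at any position.
    doc_tokens = {tok for value in field_values for tok in ngrams(value, n)}
    return all(tok in doc_tokens for tok in ngrams(query, n))

row1 = ["The Dragon Hatchling", "Wings of Gold"]
row2 = ["Dragon Slayer", "Hatchling Care"]

phrase_match(row1, "Dragon Hatchling")       # the exact substring is present
phrase_match(row2, "Dragon Hatchling")       # ngrams exist, but not consecutively
match_conjunction(row1, "Dragon Hatchling")  # all query ngrams appear
```

On a larger dataset, match_conjunction could return true for rows like row2 if every query ngram happened to appear scattered across the document, which is exactly the strictness gap the section describes.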
When constructing queries as JSON, use tokenized_phrase to achieve the same result as the ### operator. It tokenizes the input string with the field’s tokenizer and builds a phrase query from the resulting tokens:
{ "tokenized_phrase": { "field": "titles", "phrase": "Dragon Hatchling" } }