# Ngram

> Splits text into small chunks called grams, useful for partial matching

The ngram tokenizer splits text into overlapping character sequences called "grams," where each gram contains a fixed number of characters.

The tokenizer takes two arguments. The first is the minimum character length of a "gram," and the second is the maximum character length. Grams will be generated for all sizes between
the minimum and maximum gram size, inclusive. For example, `pdb.ngram(2,5)` will generate tokens of size `2`, `3`, `4`, and `5`.
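To see the effect of a size range, the same cast used elsewhere on this page can be run with a different minimum and maximum. This is a minimal sketch: `unnest` and `cardinality` are standard PostgreSQL array functions, and the output is left out because this page does not specify the order in which grams of different lengths are returned.

```sql
-- List every gram of length 2 through 5, one per row
SELECT unnest('Tokenize me!'::pdb.ngram(2,5)::text[]) AS gram;

-- Count how many grams the range produces
SELECT cardinality('Tokenize me!'::pdb.ngram(2,5)::text[]) AS gram_count;
```

Comparing `gram_count` across a few ranges gives a rough sense of how quickly a wider range increases the number of indexed tokens.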

To generate grams of a single fixed length, set the minimum and maximum gram size equal to each other.

```sql theme={null}
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
```

To get a feel for this tokenizer, run the following command and replace the text with your own:

```sql theme={null}
SELECT 'Tokenize me!'::pdb.ngram(3,3)::text[];
```

```ini Expected Response theme={null}
                      text
-------------------------------------------------
 {tok,oke,ken,eni,niz,ize,"ze ","e m"," me",me!}
(1 row)
```

## Ngram Prefix Only

To generate ngram tokens for only the first `n` characters in the text, set `prefix_only` to `true`.

```sql theme={null}
SELECT 'Tokenize me!'::pdb.ngram(3,3,'prefix_only=true')::text[];
```

```ini Expected Response theme={null}
 text
-------
 {tok}
(1 row)
```
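The option can also be tried with a size range. The sketch below assumes `prefix_only` behaves with a range the same way it does with a fixed size, that is, grams are taken only from the start of the text. The output is omitted because that behavior is not confirmed on this page.

```sql
-- Assumption: with a range, each gram size yields only the gram
-- that starts at the beginning of the text
SELECT unnest('Tokenize me!'::pdb.ngram(2,4,'prefix_only=true')::text[]) AS gram;
```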

## Phrase and Proximity Queries with Ngram

Because multiple ngram tokens can overlap, the ngram tokenizer does not store token positions by default. As a result,
queries that rely on token positions, such as [phrase](/documentation/full-text/phrase), [phrase prefix](/documentation/query-builder/phrase/phrase-prefix), [regex phrase](/documentation/query-builder/phrase/regex-phrase), and [proximity](/documentation/full-text/proximity), are not supported over ngram-tokenized
fields.

The exception is when the minimum gram size equals the maximum gram size, which guarantees unique token positions. In this case, setting
`positions=true` enables these queries.

```sql theme={null}
SELECT 'Tokenize me!'::pdb.ngram(3,3,'positions=true')::text[];
```

### Exact Substring Matching with Phrase Queries

With `positions=true`, [phrase queries](/documentation/full-text/phrase) over ngram fields perform exact substring matching.
This is faster than using [match conjunction](/documentation/full-text/match#match-conjunction) on an ngram field, which
creates a `Must` clause for every ngram token and intersects them independently. A phrase query uses a single positional
intersection instead.

The tradeoff is that phrase queries are stricter: they require tokens at consecutive positions within a single field value,
while match conjunction only requires all tokens to appear somewhere in the document.

```sql theme={null}
CREATE TABLE books (id SERIAL PRIMARY KEY, titles TEXT[]);
INSERT INTO books (titles) VALUES
    (ARRAY['The Dragon Hatchling', 'Wings of Gold']),
    (ARRAY['Dragon Slayer', 'Hatchling Care']);

CREATE INDEX ON books
USING bm25 (id, (titles::pdb.ngram(4,4,'positions=true')))
WITH (key_field='id');

-- Phrase: matches the exact substring "Dragon Hatchling", so only row 1
SELECT * FROM books WHERE titles ### 'Dragon Hatchling';

-- Match conjunction: matches all ngrams anywhere; also only row 1 here,
-- but on larger datasets could match rows where the ngrams are scattered
SELECT * FROM books WHERE titles &&& 'Dragon Hatchling';

DROP TABLE books;
```
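To see a case where the two query types diverge, consider a value that contains every 4-gram of "Dragon Hatchling" but not as one contiguous run. The following is a sketch under stated assumptions: `scattered_titles` is a hypothetical table, and `&&&` is taken to be the match conjunction operator. If the semantics described above hold, the phrase query should return no rows, because the grams `drag` through ` hat` and the grams `hatc` through `ling` are separated by the grams of "Moon", while match conjunction should return the row.

```sql
-- Hypothetical table for illustration only
CREATE TABLE scattered_titles (id SERIAL PRIMARY KEY, title TEXT);
INSERT INTO scattered_titles (title) VALUES
    ('Dragon Hat Moon Hatchling');

CREATE INDEX ON scattered_titles
USING bm25 (id, (title::pdb.ngram(4,4,'positions=true')))
WITH (key_field='id');

-- Phrase: requires the grams of "Dragon Hatchling" at consecutive positions
SELECT * FROM scattered_titles WHERE title ### 'Dragon Hatchling';

-- Match conjunction: only requires every gram to appear somewhere
SELECT * FROM scattered_titles WHERE title &&& 'Dragon Hatchling';

DROP TABLE scattered_titles;
```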

When constructing queries as JSON, use `tokenized_phrase` to achieve the same
result as the `###` operator. It tokenizes the input string with the field's tokenizer and builds
a phrase query from the resulting tokens:

```json theme={null}
{ "tokenized_phrase": { "field": "titles", "phrase": "Dragon Hatchling" } }
```
