Columnar Storage - ParadeDB

By default, all non-text and non-JSON fields are indexed using ParadeDB’s columnar format. This enables fast filtering pushdown, Top K ordering, and aggregates over these fields. For example, in the following index definition, rating and id are columnar indexed because they are integers, whereas description is not because it is text.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, rating)
WITH (key_field = 'id');
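
As a rough sketch of the kind of query that benefits from this, the statement below combines a text match with a filter and a Top K ordering on the columnar rating field. It assumes the search_idx index above exists on mock_items. The search term 'shoes' and the rating threshold are arbitrary illustrative values, and whether the filter and ordering are actually pushed down depends on the ParadeDB version and the query planner.

```sql
-- Assumes the search_idx index defined above on mock_items.
-- 'shoes' and the threshold 4 are illustrative values, not taken from the docs.
SELECT id, description, rating
FROM mock_items
WHERE description @@@ 'shoes'  -- full-text match on the tokenized text field
  AND rating >= 4              -- filter on a columnar integer field
ORDER BY rating DESC           -- Top K ordering on the same columnar field
LIMIT 5;
```

Running the statement under EXPLAIN is one way to check how the planner handles the filter and the ordering in a given installation.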

To enable columnar indexing for text and JSON fields, cast the field to a tokenizer with columnar set to true.

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.unicode_words('columnar=true')), rating)
WITH (key_field = 'id');

The columnar option for tokenizers is available in versions 0.22.0 and above.

Columnar defaults to false for all tokenizers except literal and literal normalized, which default to true and need no explicit setting. Tokenized fields can hold large documents that would be expensive to store column-wise, whereas literal and literal normalized fields typically hold a single short value and are much more compact.

The columnar field stores the raw text value regardless of the tokenizer. For example, if Hello world is split into the tokens hello and world, the columnar value remains Hello world. This matters because operations like filtering and sorting require the original field value, not the tokens.
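
To make the distinction concrete, here is a minimal sketch of a query that matches on tokens but sorts on the raw value. It assumes the index from the previous example, where description is cast to pdb.unicode_words with columnar set to true. The search term is illustrative, and whether the sort is served from the columnar field depends on the ParadeDB version and the planner.

```sql
-- Assumes search_idx was created with
-- (description::pdb.unicode_words('columnar=true')) as shown earlier.
-- 'shoes' is an illustrative search term.
SELECT id, description
FROM mock_items
WHERE description @@@ 'shoes'  -- matching uses the tokens
ORDER BY description           -- sorting uses the original, untokenized text
LIMIT 5;
```

The rows come back ordered by the full description string (for example Hello world), not by any individual token.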

Internally, Tantivy refers to columnar fields as fast fields. Our legacy docs also refer to these fields as fast.
