This quickstart covers the latest version of ParadeDB, which is a significant API refactor compared to versions of ParadeDB below v0.4. Users below v0.4 must upgrade to the latest version, as the old documentation has been deprecated.

This guide will walk you through the following steps to get started with ParadeDB for search:

  1. Full text search
  2. Similarity (i.e. vector) search
  3. Hybrid search

ParadeDB comes with a helpful procedure that creates a table populated with mock data to help you get started. Once connected with psql, run the following commands to create and inspect this table.

CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;

Next, let’s create a BM25 index called search_idx on this table. We’ll index the description and category fields and configure the tokenizer on the description field.

CALL paradedb.create_bm25(
        index_name => 'search_idx',
        schema_name => 'public',
        table_name => 'mock_items',
        key_field => 'id',
        text_fields => '{description: {tokenizer: {type: "en_stem"}}, category: {}}'
);

Note the mandatory key_field option. Every BM25 index needs a key_field, which should be the name of a column that will function as a row’s unique identifier within the index. Usually, the key_field can just be the name of your table’s primary key column.

Currently, only integer IDs are supported in the key_field column, so if your primary key is a composite or a string, just add another column with serial IDs to your table and use that as your key_field.

ALTER TABLE mock_items
ADD COLUMN bm25_id SERIAL;

We’re now ready to execute a full-text search. We’ll look for rows where description matches the shoes or category matches electronics.

SELECT description, rating, category
FROM search_idx.search('description:keyboard OR category:electronics')
LIMIT 5;

Next, let’s see how ParadeDB handles a phrase query like bluetooth speaker. Let’s also surface results even if there is a word between bluetooth and speaker.

SELECT description, rating, category
FROM search_idx.search('description:"bluetooth speaker"~1')
LIMIT 5;

Note that phrases must be wrapped in double quotes. Also note our use of the ~1 slop operator, which tells ParadeDB to return matches even if they are separated by 1 word.

Next, let’s match against a partial word like blue. To do this, we’ll create a new index that uses the ngrams tokenizer, which splits text into chunks of size n.

CALL paradedb.create_bm25(
        index_name => 'ngrams_idx',
        schema_name => 'public',
        table_name => 'mock_items',
        key_field => 'id',
        text_fields => '{description: {tokenizer: {type: "ngram", min_gram: 4, max_gram: 4, prefix_only: false}}, category: {}}'
);

SELECT description, rating, category
FROM ngrams_idx.search('description:blue');

Finally, let’s examine the BM25 scores and generate highlighted snippets for our results.

SELECT s.description, h.highlight_bm25, r.rank_bm25 FROM ngrams_idx.search('description:blue') as s
LEFT JOIN ngrams_idx.highlight('description:blue', highlight_field => 'description') as h ON s.id = h.id
LEFT JOIN ngrams_idx.rank('description:blue') as r ON s.id = r.id;

ParadeDB supports a variety of additional full text search features like fuzzy search, autocomplete,and multi-language tokenizers.

For vector similarity search, let’s first generate a vector embeddings column. For the sake of this tutorial, we’ll randomly generate these embeddings.

ALTER TABLE mock_items ADD COLUMN embedding vector(3);

UPDATE mock_items m
SET embedding = ('[' ||
    ((m.id + 1) % 10 + 1)::integer || ',' ||
    ((m.id + 2) % 10 + 1)::integer || ',' ||
    ((m.id + 3) % 10 + 1)::integer || ']')::vector;

SELECT description, rating, category, embedding
FROM mock_items
LIMIT 3;

Next, let’s create an HNSW index on the embedding column of our table. While not required, an HNSW index can drastically improve query performance over very large datasets.

CREATE INDEX on mock_items
USING hnsw (embedding vector_l2_ops);

Next, let’s query our table with a vector and order the results by L2 distance:

SELECT description, category, rating, embedding
FROM mock_items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 3;

Finally, let’s implement hybrid search, which combines BM25-based full text scores with vector-based similarity scores. Hybrid search is especially useful in scenarios where you want to match by both exact keywords and semantic meaning.

The rank_hybrid function accepts a BM25 query and a similarity query. It applies minmax normalization to the BM25 and similarity scores and combines them using a weighted average.

SELECT * FROM search_idx.rank_hybrid(
    bm25_query => 'description:keyboard OR category:electronics',
    similarity_query => '''[1,2,3]'' <-> embedding',
    bm25_weight => 0.9,
    similarity_weight => 0.1
) LIMIT 5;

Let’s join this result with our mock_items table to see the full results of our hybrid search:

SELECT m.description, m.category, m.embedding, s.rank_hybrid
FROM mock_items m
LEFT JOIN (
    SELECT * FROM search_idx.rank_hybrid(
        bm25_query => 'description:keyboard OR category:electronics',
        similarity_query => '''[1,2,3]'' <-> embedding',
        bm25_weight => 0.9,
        similarity_weight => 0.1
    )
) s
ON m.id = s.id
LIMIT 5;

As we can see, results with the word keyboard scored higher than results with an embedding of [1,2,3] because we placed a weight of 0.9 on the BM25 scores.

Minmax normalization transforms the lowest and highest scores in a dataset to 0 and 1, respectively. Because of this, the lowest-ranking BM25 or similarity score may be overlooked, as it is transformed to 0.

Congratulations!

You’ve now seen how ParadeDB serves as both a Postgres database and search engine.