This guide will walk you through the following steps to get started with ParadeDB Search:

  1. Full text search
  2. Similarity (i.e. vector) search
  3. Hybrid search

ParadeDB comes with a helpful procedure that creates a table populated with mock data to help you get started. Once connected with psql, run the following commands to create and inspect this table.

CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;

Next, let’s create a BM25 index called search_idx on this table. We’ll index the description and category fields and configure the tokenizer on the description field.

CALL paradedb.create_bm25(
        index_name => 'search_idx',
        schema_name => 'public',
        table_name => 'mock_items',
        key_field => 'id',
        text_fields => '{description: {tokenizer: {type: "en_stem"}}, category: {}}',
        numeric_fields => '{rating: {}}'
);

Note the mandatory key_field option. Every BM25 index needs a key_field, which should be the name of a column that will function as a row’s unique identifier within the index. Usually, the key_field can just be the name of your table’s primary key column.

Currently, only integer IDs are supported in the key_field column, so if your primary key is a composite or a string, just add another column with serial IDs to your table and use that as your key_field.

ALTER TABLE mock_items
ADD COLUMN bm25_id SERIAL;

We’re now ready to execute a full-text search. We’ll look for rows with a rating greater than 2 where description matches shoes or category matches electronics.

SELECT description, rating, category
FROM search_idx.search(
  '(description:keyboard OR category:electronics) AND rating:>2',
  limit_rows => 5
);

Note the usage of limit_rows instead of the SQL LIMIT clause. For optimal performance, we recommend always using limit_rows and offset_rows instead of LIMIT and OFFSET.

Similarly, the rating column was indexed and the rating:>2 filter was used instead of the SQL WHERE clause for efficient filtering.

Next, let’s see how ParadeDB handles a phrase query like bluetooth speaker. Let’s also surface results even if there is a word between bluetooth and speaker.

SELECT description, rating, category
FROM search_idx.search('description:"bluetooth speaker"~1');

Note that phrases must be wrapped in double quotes. Also note our use of the ~1 slop operator, which tells ParadeDB to return matches even if they are separated by 1 word.

Next, let’s match against a partial word like blue. To do this, we’ll create a new index that uses the ngrams tokenizer, which splits text into chunks of size n.

CALL paradedb.create_bm25(
        index_name => 'ngrams_idx',
        schema_name => 'public',
        table_name => 'mock_items',
        key_field => 'id',
        text_fields => '{description: {tokenizer: {type: "ngram", min_gram: 4, max_gram: 4, prefix_only: false}}, category: {}}'
);

SELECT description, rating, category
FROM ngrams_idx.search('description:blue');

Finally, let’s examine the BM25 scores and generate highlighted snippets for our results.

SELECT description, paradedb.highlight(id, field => 'description'), paradedb.rank_bm25(id)
FROM ngrams_idx.search('description:blue');

Please refer to our advanced search queries guide for all available query types.

For vector similarity search, let’s first generate a vector embeddings column. For the sake of this tutorial, we’ll randomly generate these embeddings.

ALTER TABLE mock_items ADD COLUMN embedding vector(3);

UPDATE mock_items m
SET embedding = ('[' ||
    ((m.id + 1) % 10 + 1)::integer || ',' ||
    ((m.id + 2) % 10 + 1)::integer || ',' ||
    ((m.id + 3) % 10 + 1)::integer || ']')::vector;

SELECT description, rating, category, embedding
FROM mock_items
LIMIT 3;

Next, let’s create an HNSW index on the embedding column of our table. While not required, an HNSW index can drastically improve query performance over very large datasets.

CREATE INDEX on mock_items
USING hnsw (embedding vector_l2_ops);

Next, let’s query our table with a vector and order the results by L2 distance:

SELECT description, category, rating, embedding
FROM mock_items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 3;

Finally, let’s implement hybrid search, which combines BM25-based full text scores with vector-based similarity scores. Hybrid search is especially useful in scenarios where you want to match by both exact keywords and semantic meaning.

The rank_hybrid function accepts a BM25 query and a similarity query. It applies minmax normalization to the BM25 and similarity scores and combines them using a weighted average.

SELECT * FROM search_idx.rank_hybrid(
    bm25_query => 'description:keyboard OR category:electronics',
    similarity_query => '''[1,2,3]'' <-> embedding',
    bm25_weight => 0.9,
    similarity_weight => 0.1
) LIMIT 5;

Let’s join this result with our mock_items table to see the full results of our hybrid search:

SELECT m.description, m.category, m.embedding, s.rank_hybrid
FROM mock_items m
LEFT JOIN (
    SELECT * FROM search_idx.rank_hybrid(
        bm25_query => 'description:keyboard OR category:electronics',
        similarity_query => '''[1,2,3]'' <-> embedding',
        bm25_weight => 0.9,
        similarity_weight => 0.1
    )
) s
ON m.id = s.id
LIMIT 5;

As we can see, results with the word keyboard scored higher than results with an embedding of [1,2,3] because we placed a weight of 0.9 on the BM25 scores.

Minmax normalization transforms the lowest and highest scores in a dataset to 0 and 1, respectively. Because of this, the lowest-ranking BM25 or similarity score may be overlooked, as it is transformed to 0.

For Further Assistance

The paradedb.help function opens a Github Discussion that the ParadeDB team will respond to.

SELECT paradedb.help(
  subject => $$Something isn't working$$,
  body => $$Issue description$$
);