This guide will walk you through the following steps to get started with ParadeDB:

  1. Full text search
  2. Similarity (i.e. vector) search
  3. Hybrid search

ParadeDB comes with a helpful procedure that creates a table populated with mock data to help you get started. Once connected with psql, run the following commands to create and inspect this table.

CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;

Next, let’s create a BM25 index called default_idx on this table. We’ll index the description, category, and rating fields.

CALL paradedb.create_bm25(
  index_name => 'default_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category'),
  numeric_fields => paradedb.field('rating')
);

Note the mandatory key_field option. Every BM25 index needs a key_field, which should be the name of a column that will function as a row’s unique identifier within the index. Usually, the key_field can just be the name of your table’s primary key column.

We’re now ready to execute a full-text search. We’ll look for rows with a rating greater than 2 where description matches keyboard or category matches electronics.

SELECT description, rating, category
FROM default_idx.search(
  '(description:keyboard OR category:electronics) AND rating:>2',
  limit_rows => 5
);

Note the usage of limit_rows instead of the SQL LIMIT clause. For optimal performance, we recommend always using limit_rows and offset_rows instead of LIMIT and OFFSET.

Similarly, the rating column was indexed and the rating:>2 filter was used instead of the SQL WHERE clause for efficient filtering.

Next, let’s see how ParadeDB handles a phrase query like bluetooth speaker. Let’s also surface results even if there is a word between bluetooth and speaker.

SELECT description, rating, category
FROM default_idx.search('description:"bluetooth speaker"~1');

Note that phrases must be wrapped in double quotes. Also note our use of the ~1 slop operator, which tells ParadeDB to return matches even if they are separated by one word.

Finally, let’s use the snippet function to examine the BM25 scores and generate highlighted snippets for our results.

SELECT * FROM default_idx.snippet(
  'description:bluetooth',
  highlight_field => 'description'
);

Let’s join this result with our mock_items table to see the BM25 scores and highlighted snippets next to the original data:

WITH snippet AS (
    SELECT * FROM default_idx.snippet(
      'description:bluetooth',
      highlight_field => 'description'
    )
)
SELECT description, snippet, score_bm25
FROM snippet
LEFT JOIN mock_items ON snippet.id = mock_items.id;

For more on full text search, refer to the full text search API section.

For vector similarity search, let’s first generate a vector embeddings column. For the sake of this tutorial, we’ll randomly generate these embeddings.

ALTER TABLE mock_items ADD COLUMN embedding vector(3);

UPDATE mock_items m
SET embedding = ('[' ||
    ((m.id + 1) % 10 + 1)::integer || ',' ||
    ((m.id + 2) % 10 + 1)::integer || ',' ||
    ((m.id + 3) % 10 + 1)::integer || ']')::vector;

SELECT description, rating, category, embedding
FROM mock_items
LIMIT 3;

Next, let’s create an HNSW index on the embedding column of our table. While not required, an HNSW index can drastically improve query performance over very large datasets.

CREATE INDEX on mock_items
USING hnsw (embedding vector_l2_ops);

Next, let’s query our table with a vector and order the results by L2 distance:

SELECT description, category, rating, embedding
FROM mock_items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 3;

Finally, let’s implement hybrid search, which combines BM25-based full text scores with vector-based similarity scores. Hybrid search is especially useful in scenarios where you want to match by both exact keywords and semantic meaning.

The score_hybrid function accepts a BM25 query and a similarity query. It applies minmax normalization to the BM25 and similarity scores and combines them using a weighted average.

SELECT * FROM default_idx.score_hybrid(
    bm25_query => 'description:keyboard OR category:electronics',
    similarity_query => '''[1,2,3]'' <-> embedding',
    bm25_weight => 0.9,
    similarity_weight => 0.1
) LIMIT 5;

Let’s join this result with our mock_items table to see the full results of our hybrid search:

SELECT m.description, m.category, m.embedding, s.score_hybrid
FROM mock_items m
LEFT JOIN (
    SELECT * FROM default_idx.score_hybrid(
        bm25_query => 'description:keyboard OR category:electronics',
        similarity_query => '''[1,2,3]'' <-> embedding',
        bm25_weight => 0.9,
        similarity_weight => 0.1
    )
) s
ON m.id = s.id
LIMIT 5;

As we can see, results with the word keyboard scored higher than results with an embedding of [1,2,3] because we placed a weight of 0.9 on the BM25 scores.

Minmax normalization transforms the lowest and highest scores in a dataset to 0 and 1, respectively. Because of this, the lowest-ranking BM25 or similarity score may be overlooked, as it is transformed to 0.

For Further Assistance

The paradedb.help function opens a GitHub Discussion that the ParadeDB team will respond to.

SELECT paradedb.help(
  subject => $$Something isn't working$$,
  body => $$Issue description$$
);