Quickstart
This guide will walk you through the following steps to get started with ParadeDB:
- Full text search
- Similarity (i.e. vector) search
- Hybrid search
Full Text Search
ParadeDB comes with a helpful procedure that creates a table populated with mock data to help
you get started. Once connected with psql
, run the following commands to create and inspect
this table.
CALL paradedb.create_bm25_test_table(
schema_name => 'public',
table_name => 'mock_items'
);
SELECT description, rating, category
FROM mock_items
LIMIT 3;
Next, let’s create a BM25 index called default_idx
on this table. We’ll index the
description
, category
, and rating
fields.
CALL paradedb.create_bm25(
index_name => 'default_idx',
table_name => 'mock_items',
key_field => 'id',
text_fields => paradedb.field('description') || paradedb.field('category'),
numeric_fields => paradedb.field('rating')
);
Note the mandatory key_field
option. Every BM25 index needs a key_field
, which should
be the name of a column that will function as a row’s unique identifier within the index. Usually, the key_field
can just be the name of your table’s primary key column.
We’re now ready to execute a full-text search. We’ll look for rows with a rating
greater than 2
where description
matches keyboard
or category
matches electronics
.
SELECT description, rating, category
FROM default_idx.search(
'(description:keyboard OR category:electronics) AND rating:>2',
limit_rows => 5
);
Note the usage of limit_rows
instead of the SQL LIMIT
clause. For optimal performance, we recommend always using
limit_rows
and offset_rows
instead of LIMIT
and OFFSET
.
Similarly, the rating
column was indexed and the rating:>2
filter was used instead of the SQL WHERE
clause
for efficient filtering.
Next, let’s see how ParadeDB handles a phrase query like bluetooth speaker
. Let’s also surface results even
if there is a word between bluetooth
and speaker
.
SELECT description, rating, category
FROM default_idx.search('description:"bluetooth speaker"~1');
Note that phrases must be wrapped in double quotes. Also note our use of the ~1
slop operator,
which tells ParadeDB to return matches even if they are separated by one word.
Finally, let’s use the snippet
function to examine the BM25 scores and generate
highlighted snippets for our results.
SELECT * FROM default_idx.snippet(
'description:bluetooth',
highlight_field => 'description'
);
Let’s join this result with our mock_items
table to see the BM25 scores and highlighted snippets
next to the original data:
WITH snippet AS (
SELECT * FROM default_idx.snippet(
'description:bluetooth',
highlight_field => 'description'
)
)
SELECT description, snippet, score_bm25
FROM snippet
LEFT JOIN mock_items ON snippet.id = mock_items.id;
For more on full text search, refer to the full text search API section.
Similarity Search
For vector similarity search, let’s first generate a vector embeddings column. For the sake of this tutorial, we’ll randomly generate these embeddings.
ALTER TABLE mock_items ADD COLUMN embedding vector(3);
UPDATE mock_items m
SET embedding = ('[' ||
((m.id + 1) % 10 + 1)::integer || ',' ||
((m.id + 2) % 10 + 1)::integer || ',' ||
((m.id + 3) % 10 + 1)::integer || ']')::vector;
SELECT description, rating, category, embedding
FROM mock_items
LIMIT 3;
Next, let’s create an HNSW index on the embedding
column of our table.
While not required, an HNSW index can drastically improve query performance over very large datasets.
CREATE INDEX on mock_items
USING hnsw (embedding vector_l2_ops);
Next, let’s query our table with a vector and order the results by L2 distance:
SELECT description, category, rating, embedding
FROM mock_items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 3;
Hybrid Search
Finally, let’s implement hybrid search, which combines BM25-based full text scores with vector-based similarity scores. Hybrid search is especially useful in scenarios where you want to match by both exact keywords and semantic meaning.
The score_hybrid
function accepts a BM25 query and a similarity query. It applies minmax normalization to the BM25 and
similarity scores and combines them using a weighted average.
SELECT * FROM default_idx.score_hybrid(
bm25_query => 'description:keyboard OR category:electronics',
similarity_query => '''[1,2,3]'' <-> embedding',
bm25_weight => 0.9,
similarity_weight => 0.1
) LIMIT 5;
Let’s join this result with our mock_items
table to see the full results of our hybrid search:
SELECT m.description, m.category, m.embedding, s.score_hybrid
FROM mock_items m
LEFT JOIN (
SELECT * FROM default_idx.score_hybrid(
bm25_query => 'description:keyboard OR category:electronics',
similarity_query => '''[1,2,3]'' <-> embedding',
bm25_weight => 0.9,
similarity_weight => 0.1
)
) s
ON m.id = s.id
LIMIT 5;
As we can see, results with the word keyboard
scored higher than results with an embedding of
[1,2,3]
because we placed a weight of 0.9
on the BM25 scores.
Minmax normalization transforms the lowest and highest scores in a dataset to
0
and 1
, respectively. Because of this, the lowest-ranking BM25 or
similarity score may be overlooked, as it is transformed to 0
.
For Further Assistance
The paradedb.help
function opens a GitHub Discussion that the ParadeDB team will respond to.
SELECT paradedb.help(
subject => $$Something isn't working$$,
body => $$Issue description$$
);
Was this page helpful?