This guide will walk you through the following steps to get started with ParadeDB:

  1. Full text search
  2. Similarity (i.e. vector) search

ParadeDB comes with a helpful procedure that creates a table populated with mock data to help you get started. Once connected with psql, run the following commands to create and inspect this table.

CALL paradedb.create_bm25_test_table(
  schema_name => 'public',
  table_name => 'mock_items'
);

SELECT description, rating, category
FROM mock_items
LIMIT 3;

Next, let’s create a BM25 index called search_idx on this table. A BM25 index is a covering index, which means that multiple columns can be included in the same index. The following code block demonstrates the various Postgres types that can be combined inside a single index.

CALL paradedb.create_bm25(
  index_name => 'search_idx',
  table_name => 'mock_items',
  key_field => 'id',
  text_fields => paradedb.field('description') || paradedb.field('category'),
  numeric_fields => paradedb.field('rating'),
  boolean_fields => paradedb.field('in_stock'),
  datetime_fields => paradedb.field('created_at'),
  json_fields => paradedb.field('metadata')
);

Note the mandatory key_field option. Every BM25 index needs a key_field, which should be the name of a column that will function as a row’s unique identifier within the index. Usually, the key_field can just be the name of your table’s primary key column.

The indexing documentation provides an in-depth explanation of what each of these options means.

We’re now ready to execute a full-text search. We’ll look for rows with a rating greater than 2 where description matches keyboard or category matches electronics.

SELECT description, rating, category
FROM search_idx.search(
  '(description:keyboard OR category:electronics) AND rating:>2',
  limit_rows => 5
);

Note the usage of limit_rows instead of the SQL LIMIT clause. For optimal performance, we recommend always using limit_rows and offset_rows instead of LIMIT and OFFSET.

Similarly, the rating column was indexed and the rating:>2 filter was used instead of the SQL WHERE clause for efficient filtering.

Next, let’s see how ParadeDB handles a phrase query like bluetooth speaker. Let’s also surface results even if there is a word between bluetooth and speaker.

SELECT description, rating, category
FROM search_idx.search('description:"bluetooth speaker"~1');

Note that phrases must be wrapped in double quotes. Also note our use of the ~1 slop operator, which tells ParadeDB to return matches even if they are separated by one word.

Finally, let’s use the snippet function to examine the BM25 scores and generate highlighted snippets for our results.

SELECT * FROM search_idx.snippet(
  'description:bluetooth',
  highlight_field => 'description'
);

Let’s join this result with our mock_items table to see the BM25 scores and highlighted snippets next to the original data:

WITH snippet AS (
    SELECT * FROM search_idx.snippet(
      'description:bluetooth',
      highlight_field => 'description'
    )
)
SELECT description, snippet, score_bm25
FROM snippet
LEFT JOIN mock_items ON snippet.id = mock_items.id;

For more on full text search, refer to the full text search API section.

For vector similarity search, let’s first generate a vector embeddings column. For the sake of this tutorial, we’ll randomly generate these embeddings.

ALTER TABLE mock_items ADD COLUMN embedding vector(3);

UPDATE mock_items m
SET embedding = ('[' ||
    ((m.id + 1) % 10 + 1)::integer || ',' ||
    ((m.id + 2) % 10 + 1)::integer || ',' ||
    ((m.id + 3) % 10 + 1)::integer || ']')::vector;

SELECT description, rating, category, embedding
FROM mock_items
LIMIT 3;

Next, let’s create an HNSW index on the embedding column of our table. While not required, an HNSW index can drastically improve query performance over very large datasets.

CREATE INDEX on mock_items
USING hnsw (embedding vector_l2_ops);

Next, let’s query our table with a vector and order the results by L2 distance:

SELECT description, category, rating, embedding
FROM mock_items
ORDER BY embedding <-> '[1,2,3]'
LIMIT 3;

For Further Assistance

The paradedb.help function opens a GitHub Discussion that the ParadeDB team will respond to.

SELECT paradedb.help(
  subject => $$Something isn't working$$,
  body => $$Issue description$$
);