Indexing
Index Settings
These settings can be set in postgresql.conf
and affect system resources used by the paradedb.create_bm25()
function
and normal INSERT/UPDATE/COPY
statements.
Indexing Threads
paradedb.create_index_parallelism
sets the number of threads used during CREATE INDEX
. The default is 8
. A value of zero will
automatically detect “available parallelism” of the host computer.
For best performance it is recommended to use one less than the total number of host computer cores.
Indexing Memory
paradedb.create_index_memory_budget
sets the amount of memory to dedicate per indexing thread before the index segment needs to be
written to disk.
If unset, Postgres’ maintenance_work_mem
value is used, but is first divided by the total parallelism (paradedb.create_index_parallelism
).
In terms of raw indexing performance, larger is generally better.
The value is measured in megabytes. A value of 1024
is the same as 1GB
.
Indexing Progress
When set to true
, paradedb.log_create_index_progress
creates Postgres LOG:
entries every 100,000
rows with information on the indexing rate (in rows per second).
Defaults to false
.
This can be useful to monitor the progress of a long-running CREATE INDEX
statement. This setting can be changed by any user.
Statement Parallelism
paradedb.statement_parallelism
controls the number of indexing threads used during INSERT/UPDATE/COPY
statements. The default is 8
. A value of
zero will automatically detect “available parallelism” of the host computer.
If your typical update patterns are single-row atomic INSERT
s or UPDATE
s, then a value of 1
is ideal. If
your update patterns typically update many thousands of rows at once, a larger value might be preferred for greater indexing
concurrency.
Statement Memory Budget
paradedb.statement_memory_budget
sets the amount of memory to dedicate per indexing thread before the index segment needs to be
written to disk.
Like indexing memory, it defaults to maintenance_work_mem
divided by the total parallelism, and is measured in megabytes.
If your typical update patterns are single-row atomic INSERT
s or UPDATE
s, then a value of 15MB
is ideal. If
your update patterns typically update many thousands of rows, a larger value might be preferred for greater indexing
concurrency.
Generally speaking, for statement level indexing, the parallelism value dictates the number of new index
segments that will be created, so there’s a trade-off between finishing the INSERT/UPDATE
statement quickly and the
longer-term search overhead of having many additional index segments until VACUUM
runs.
Segment Settings
A segment is an individual unit of an index. Each BM25 index is comprised of multiple segments. Every atomic insert or update creates a new segment.
Target Segment Count
Having too many segments can negatively impact search performance. During a VACUUM
, smaller segments can be merged into larger segments.
target_segment_count
controls the number of segments that should exist and defaults to the number of available CPUs
on the host machine. This allows multiple threads to read from segments in parallel during a search.
While ParadeDB attempts to create target_segment_count
segments, the actual
number of segments may not always equal target_segment_count
. For instance,
it could exceed this number of VACUUM
is not run, or fall below this number
if the table contains a small number of rows.
Merge on Insert
merge_on_insert
indicates if ParadeDB should look for merge opportunities during INSERT
, UPDATE
, or DELETE
operations in addition to VACUUM
s. The default is true
. Setting this to false
can improve INSERT/UPDATE/COPY
throughput at the expense of creating more segments that will later be merged by VACUUM
.
Was this page helpful?