How Text Search Works

Text search in ParadeDB, like Elasticsearch and most search engines, is centered on the concept of token matching. Token matching consists of two steps. First, at indexing time, text is processed by a tokenizer, which breaks input into discrete units called tokens or terms. For example, the default tokenizer splits the text Sleek running shoes into the tokens sleek, running, and shoes. Second, at query time, the query engine looks for token matches based on the query type. For example, a match disjunction query for running shoes will match any document that contains at least one of the tokens running or shoes. In contrast, a match conjunction query requires all tokens to be present: both running and shoes must appear in the document.
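The two steps above can be sketched in a few lines of Python. This is a toy illustration, not ParadeDB's implementation: the tokenizer here simply lowercases and splits on non-alphanumeric characters, which approximates a default tokenizer's behavior.

```python
import re

def tokenize(text):
    # Toy default tokenizer: lowercase, then split on non-alphanumeric runs.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def match_disjunction(query, document):
    # True if the document contains at least one of the query's tokens.
    doc_tokens = set(tokenize(document))
    return any(t in doc_tokens for t in tokenize(query))

def match_conjunction(query, document):
    # True only if the document contains every query token.
    doc_tokens = set(tokenize(document))
    return all(t in doc_tokens for t in tokenize(query))

doc = "Sleek running shoes"
print(tokenize(doc))                            # ['sleek', 'running', 'shoes']
print(match_disjunction("running boots", doc))  # True: 'running' matches
print(match_conjunction("running boots", doc))  # False: 'boots' is missing
```

In a real engine the index is inverted (token to document list) so these checks run without scanning every document, but the matching semantics are the same.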

Not Substring Matching

While ParadeDB supports substring matching via regex queries, it's important to note that token matching is not the same as substring matching. Token matching is a much more versatile and powerful technique. It enables relevance scoring, language-specific analysis, typo tolerance, and more expressive query types, capabilities that go far beyond simply looking for a sequence of characters. For example, a substring search for runs would miss running, while a token-based match query will correctly match both if the tokenizer includes stemming, since both reduce to the stem run. This makes token matching a good fit for search and discovery use cases where users expect flexible, intelligent results.

Text search is also different from similarity search, also known as vector search. Whereas text search matches based on shared tokens, similarity search matches based on semantic meaning. ParadeDB currently does not build its own extensions for similarity search; most ParadeDB users install pgvector, the Postgres extension for vector search, for this use case. We have tentative long-term plans in our roadmap to make improvements to Postgres' vector search. If this is useful to you, please reach out.
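The stemming example can be made concrete with a short sketch. The stemmer below is a deliberately naive assumption for illustration (real engines use language-specific stemmers such as Snowball): it strips a trailing "ing" (collapsing a doubled consonant) or a plural "s", so both runs and running reduce to the stem run.

```python
import re

def stem(token):
    # Toy stemmer (assumption: real tokenizers use proper language-specific
    # stemmers). Strips "ing" with doubled-consonant collapse, or plural "s".
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if token[-1] == token[-2]:
            token = token[:-1]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def tokenize(text):
    # Lowercase, split on non-alphanumerics, then stem each token.
    return [stem(t) for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

doc = "She enjoys running every morning"

# Substring search: "runs" never appears as a character sequence in the text.
print("runs" in doc.lower())          # False

# Token match with stemming: "runs" and "running" share the stem "run".
print(stem("runs") in tokenize(doc))  # True
```

The substring check fails because it only compares raw characters, while the stemmed token match succeeds because both query and document are normalized to the same term before comparison.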