How Text Search Works
Text search in ParadeDB, like Elasticsearch and most search engines, is centered around the concept of token matching. Token matching consists of two steps. First, at indexing time, text is processed by a tokenizer, which breaks input into discrete units called tokens or terms. For example, the default tokenizer splits the text `Sleek running shoes` into the tokens `sleek`, `running`, and `shoes`.
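As a rough illustration of the indexing step, the Python sketch below mimics this behavior. It is a toy example, not ParadeDB's actual tokenizer (ParadeDB's tokenizers are configurable and considerably more sophisticated); the `tokenize` function and its lowercase-and-split rule are assumptions chosen to reproduce the example above.

```python
import re

def tokenize(text: str) -> list[str]:
    """Toy 'default' tokenizer: lowercase the text, then split on
    anything that is not a letter or digit."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

print(tokenize("Sleek running shoes"))  # ['sleek', 'running', 'shoes']
```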
Second, at query time, the query engine looks for token matches based on the specified query and query type. Some common query types, illustrated in the sketch after this list, include:
- Match: Matches documents containing any or all query tokens
- Phrase: Matches documents where all tokens appear in the same order as the query
- Term: Matches documents containing an exact token
- …and many more advanced query types
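To see how these query types differ, the sketch below builds a tiny inverted index over pre-tokenized documents and implements toy versions of term, match, and phrase matching. The corpus, function names, and data layout are invented for illustration; this is not how ParadeDB or its underlying engine represents an index.

```python
from collections import defaultdict

# Tiny corpus, already tokenized (e.g. by a tokenizer like the one sketched earlier).
documents = {
    1: ["sleek", "running", "shoes"],
    2: ["running", "socks"],
    3: ["leather", "shoes"],
}

# Inverted index: token -> {doc_id: [positions]}.
index: dict[str, dict[int, list[int]]] = defaultdict(dict)
for doc_id, tokens in documents.items():
    for position, token in enumerate(tokens):
        index[token].setdefault(doc_id, []).append(position)

def term(token: str) -> set[int]:
    """Term query: documents containing this exact token."""
    return set(index.get(token, {}))

def match(tokens: list[str], require_all: bool = False) -> set[int]:
    """Match query: documents containing any (or all) of the query tokens."""
    hits = [term(t) for t in tokens]
    return set.intersection(*hits) if require_all else set.union(*hits)

def phrase(tokens: list[str]) -> set[int]:
    """Phrase query: the query tokens appear consecutively, in order."""
    results = set()
    for doc_id in match(tokens, require_all=True):
        for start in index[tokens[0]][doc_id]:
            if all(start + i in index[tok][doc_id] for i, tok in enumerate(tokens)):
                results.add(doc_id)
                break
    return results

print(term("shoes"))                 # {1, 3}
print(match(["running", "shoes"]))   # {1, 2, 3}
print(phrase(["running", "shoes"]))  # {1}
```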
Not Substring Matching
While ParadeDB supports substring matching via regex queries, it's important to note that token matching is not the same as substring matching. Token matching is a much more versatile and powerful technique. It enables relevance scoring, language-specific analysis, typo tolerance, and more expressive query types: capabilities that go far beyond simply looking for a sequence of characters. For example, a substring search for `running` would miss a document containing `runs`, while a token-based match query will correctly match it if the tokenizer applies stemming, since both words reduce to the token `run`. This makes token matching a good fit for search and discovery use cases where users expect flexible, intelligent results.
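The snippet below makes the `running`/`runs` example concrete. The `naive_stem` function is a deliberately crude stand-in for a real stemming algorithm (such as Porter or Snowball) and is not how ParadeDB's analyzers work; it exists only to show why stemmed tokens match where a raw substring check does not.

```python
def naive_stem(token: str) -> str:
    """Toy stemmer: collapse a couple of inflected forms onto a shared root.
    Real stemmers (e.g. Porter/Snowball) handle far more cases correctly."""
    if token.endswith("ning"):
        return token[:-4]              # running -> run
    if token.endswith("s") and len(token) > 3:
        return token[:-1]              # runs -> run, shoes -> shoe
    return token

document = "Sleek shoes for anyone who runs"
query = "running"

# Substring matching: the literal character sequence 'running' is absent, so no hit.
print(query in document.lower())  # False

# Token matching with stemming: 'runs' and 'running' both reduce to the token 'run'.
document_tokens = {naive_stem(token) for token in document.lower().split()}
print(naive_stem(query) in document_tokens)  # True
```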