The unicode tokenizer splits text at the word boundaries defined by Unicode Standard Annex #29. All characters are lowercased by default.

This tokenizer is the default text tokenizer: if no tokenizer is specified for a text field, the unicode tokenizer is used
(unless the text field is the key field, in which case the text is not tokenized).
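To build intuition for this behavior, here is a rough Python sketch of word-boundary splitting plus lowercasing. Note that it uses a simple `\w+` pattern, which only approximates the full UAX #29 rules (the real tokenizer handles cases such as apostrophes and scripts without spaces that this pattern does not):

```python
import re

def approximate_unicode_words(text: str) -> list[str]:
    # Rough approximation of the unicode tokenizer: lowercase the input,
    # then split on runs of word characters. The actual tokenizer follows
    # the full UAX #29 word-boundary rules, which differ in edge cases.
    return re.findall(r"\w+", text.lower())

print(approximate_unicode_words("Hello, World! It's 2024."))
```
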
```sql
-- The following two configurations are equivalent
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (key_field='id');

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.unicode_words))
WITH (key_field='id');
```
To get a feel for this tokenizer, run the following command and replace the text with your own: