The Chinese compatible tokenizer behaves like the simple tokenizer: it lowercases non-CJK characters and splits on whitespace and punctuation. In addition, it treats each CJK character as its own token.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.chinese_compatible))
WITH (key_field='id');
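Once the index exists, it can serve full-text queries through ParadeDB's @@@ search operator. As a minimal sketch, assuming the mock_items table from the quickstart holds CJK text in description (the exact query form may vary across pg_search versions):

SELECT id, description
FROM mock_items
WHERE description @@@ '你好'
LIMIT 5;

Because each CJK character is its own token, a query string like '你好' is tokenized into 你 and 好, so it can match rows containing those characters.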
To get a feel for this tokenizer, run the following command and replace the text with your own:
SELECT 'Hello world! 你好!'::pdb.chinese_compatible::text[];
Expected Response
        text
---------------------
 {hello,world,你,好}
(1 row)
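A mixed-script input shows both behaviors at once: non-CJK words are lowercased and split on whitespace and punctuation, while each CJK character becomes a separate token. A sketch, with the tokens inferred from the rules above (verify against your installation):

SELECT 'Tokenizer 分词器 TEST!'::pdb.chinese_compatible::text[];

This should produce {tokenizer,分,词,器,test}.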