Expected Response
Chinese Compatible
A simple tokenizer for Chinese, Japanese, and Korean characters
The Chinese compatible tokenizer behaves like the simple tokenizer: it lowercases non-CJK characters and splits on
any non-alphanumeric character. In addition, it emits each CJK character as its own token.
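The rules above can be illustrated with a small Python sketch. This is not the tokenizer's actual implementation. The function names `is_cjk` and `tokenize` are hypothetical, and the Unicode ranges used to detect CJK characters are a rough assumption covering only a few common blocks.

```python
def is_cjk(ch: str) -> bool:
    """Rough CJK check over a few common Unicode blocks (an assumption)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def tokenize(text: str) -> list[str]:
    """Illustrative tokenizer following the rules described above."""
    tokens: list[str] = []
    current: list[str] = []  # characters of the non-CJK token being built

    def flush() -> None:
        # Emit the pending non-CJK token, lowercased, if there is one.
        if current:
            tokens.append("".join(current).lower())
            current.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)   # each CJK character is its own token
        elif ch.isalnum():
            current.append(ch)  # extend the current non-CJK token
        else:
            flush()             # any non-alphanumeric character splits
    flush()
    return tokens


print(tokenize("Hello world! 你好!"))
```

In this sketch the Latin words are lowercased and split on spaces and punctuation, while the two Chinese characters come out as separate tokens.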
To get a feel for this tokenizer, run the following command and replace the text with your own: