The Lindera tokenizer is an advanced CJK tokenizer that uses prebuilt Chinese, Japanese, or Korean dictionaries to break text into meaningful tokens (words or phrases) rather than splitting on individual characters. Chinese Lindera uses the CC-CEDICT dictionary, Korean Lindera uses the KoDic dictionary, and Japanese Lindera uses the IPADIC dictionary. By default, non-CJK text is lowercased and punctuation is not ignored. As of version 0.22.4, whitespace is removed by default; on earlier versions it is preserved.
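As a sketch of how a Lindera tokenizer might be selected when creating a search index, assuming a JSON-based `text_fields` configuration (the `chinese_lindera` type name, the table and column names, and the exact option shape are illustrative assumptions; consult the index-creation reference for the precise spelling):

```sql
-- Hypothetical sketch: index a description column with the Chinese Lindera
-- tokenizer so queries match dictionary words instead of single characters.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
            "tokenizer": {"type": "chinese_lindera"}
        }
    }'
);
```

The same pattern applies for Japanese or Korean text by swapping in the corresponding Lindera tokenizer type.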
Keep Whitespace
By default, whitespace is not tokenized. To include it, set `keep_whitespace` to `true`.
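A sketch of enabling whitespace tokens, under the assumption that `keep_whitespace` is passed alongside the tokenizer `type` in the field configuration (the placement of the option and the surrounding names are illustrative):

```sql
-- Hypothetical sketch: preserve whitespace tokens for a Japanese Lindera field.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (
    key_field = 'id',
    text_fields = '{
        "description": {
            "tokenizer": {"type": "japanese_lindera", "keep_whitespace": true}
        }
    }'
);
```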