The unicode tokenizer splits text according to word boundaries defined by the Unicode Standard Annex #29 rules. All characters are lowercased by default. This tokenizer is the default text tokenizer. If no tokenizer is specified for a text field, the unicode tokenizer will be used (unless the text field is the key field, in which case the text is not tokenized).
Expected Response
Remove Emojis
By default, emojis in the source text are preserved. To remove emojis, set remove_emojis to true.
Expected Response
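For intuition only, the behavior described on this page (split into words, lowercase, optionally drop emojis) can be approximated outside the database. The Python sketch below is not ParadeDB's implementation and does not implement the full Unicode Standard Annex #29 rules. The function name tokenize, the regular expressions, and the emoji code point ranges are assumptions made for illustration. Only the remove_emojis name comes from this page.

```python
import re

# Rough emoji code point ranges. This is an assumption for the sketch,
# not an exhaustive definition of what counts as an emoji.
_EMOJI_CLASS = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
_EMOJI_RE = re.compile(_EMOJI_CLASS)

# A token is either a run of word characters (optionally joined by
# apostrophes, loosely mimicking UAX #29 treatment of "don't") or a
# single emoji character.
_TOKEN_RE = re.compile(r"\w+(?:['\u2019]\w+)*|" + _EMOJI_CLASS)


def tokenize(text: str, remove_emojis: bool = False) -> list[str]:
    """Approximate word tokenization: lowercase, split, optionally drop emojis."""
    tokens = _TOKEN_RE.findall(text.lower())
    if remove_emojis:
        # Keep only tokens that are not a single emoji character.
        tokens = [t for t in tokens if not _EMOJI_RE.fullmatch(t)]
    return tokens


if __name__ == "__main__":
    print(tokenize("Tokenize me! 😊"))                      # ['tokenize', 'me', '😊']
    print(tokenize("Tokenize me! 😊", remove_emojis=True))  # ['tokenize', 'me']
```

The sketch keeps emojis as standalone tokens by default and filters them only when remove_emojis is set, which mirrors the default described above. Real UAX #29 segmentation handles many more cases (numbers with separators, scripts without spaces, multi-code-point emoji sequences) that this approximation ignores.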