# Unicode

> The default text tokenizer in ParadeDB

The unicode tokenizer splits text at the word boundaries defined by [Unicode Standard Annex #29](https://www.unicode.org/reports/tr29/).
All characters are [lowercased](/documentation/token-filters/lowercase) by default.

This is the default text tokenizer: if no tokenizer is specified for a text field, the unicode tokenizer is used
(unless the text field is the [key field](/documentation/indexing/create-index#choosing-a-key-field), in which case the text is not tokenized).

```sql theme={null}
-- The following two configurations are equivalent.
-- Create only one of them: both use the index name search_idx.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (key_field='id');

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.unicode_words))
WITH (key_field='id');
```

To get a feel for this tokenizer, run the following command and replace the text with your own:

```sql theme={null}
SELECT 'Tokenize me!'::pdb.unicode_words::text[];
```

```ini Expected Response theme={null}
     text
---------------
 {tokenize,me}
(1 row)
```
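
As a further sketch of the word-boundary rules, the query below passes the tokenizer mixed-case text containing an apostrophe and a decimal number. Under UAX #29, an apostrophe between letters and a period between digits do not normally break a word, so `can't` and `3.14` would be expected to stay whole. The input string is an arbitrary illustration, so run the query to confirm how your ParadeDB version handles these cases.

```sql theme={null}
-- Illustrative input only: mixed case, an apostrophe, and a decimal number
SELECT 'I can''t pay 3.14 Today'::pdb.unicode_words::text[];
```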

## Remove Emojis

By default, emojis in the source text are preserved. To remove emojis, set `remove_emojis` to `true`.

```sql theme={null}
SELECT 'Tokenize me! 😊'::pdb.unicode_words('remove_emojis=true')::text[];
```

```ini Expected Response theme={null}
     text
---------------
 {tokenize,me}
(1 row)
```
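
For comparison, here is a minimal sketch of the default behavior: the same input cast without the `remove_emojis` option. Per the description above, the emoji should remain in the output. Run the query to see the exact form of the result.

```sql theme={null}
-- Default configuration: no remove_emojis option, so the emoji is preserved
SELECT 'Tokenize me! 😊'::pdb.unicode_words::text[];
```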
