≫ NLP and tokenization

Manticore doesn't store text exactly as it is for full-text searching. Instead, it breaks the text into words (called tokens) and builds several internal structures to enable fast full-text searches. These structures include a dictionary that helps quickly check if a word exists in the index. Other structures track which documents and fields contain the word, and even where exactly in the field it appears. These are all used during a search to find relevant results.

The process of splitting and handling text like this is called tokenization. Tokenization happens both when adding data to the index and when running a search. It works at both the character and word level.

At the character level, only certain characters are allowed. This is controlled by the charset_table. Any other characters are replaced with a space (which is treated as a word separator). The charset_table also supports things like turning characters into lowercase or replacing one character with another. It can also define characters to be ignored, blended, or treated as a phrase boundary.
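As a rough illustration of the character-level step, the Python sketch below models a very small character table as a dictionary: listed characters are kept (and uppercase letters are remapped to lowercase), while everything else becomes a space. The names ALLOWED, normalize_chars and tokenize are hypothetical, and the table is far simpler than a real charset_table; ignored, blended and phrase-boundary characters are not modeled.

```python
import string

# Hypothetical, tiny stand-in for a charset_table: digits, "_" and ASCII
# letters, with A..Z remapped to a..z. This is not Manticore's actual table.
ALLOWED = {c: c for c in string.digits + "_" + string.ascii_lowercase}
ALLOWED.update({c: c.lower() for c in string.ascii_uppercase})


def normalize_chars(text: str) -> str:
    """Keep (and remap) allowed characters; replace all others with a space."""
    return "".join(ALLOWED.get(ch, " ") for ch in text)


def tokenize(text: str) -> list[str]:
    """Split the normalized text on whitespace to obtain tokens."""
    return normalize_chars(text).split()


print(tokenize("Hello, World! x_1"))
```

Because the same function would run on both documents and queries, "Hello" in a query and "hello" in a document end up as the same token in this model.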

At the word level, the engine uses the min_word_len setting to decide the minimum word length (in characters) that should be indexed.
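The effect of a minimum word length can be pictured with a one-line filter. This is only a sketch: apply_min_word_len is a hypothetical helper, and it simply counts Python characters (code points), which is an assumption about how length is measured.

```python
def apply_min_word_len(tokens: list[str], min_word_len: int = 1) -> list[str]:
    """Drop tokens that are shorter than min_word_len characters."""
    return [t for t in tokens if len(t) >= min_word_len]


# With a minimum length of 3, one- and two-character tokens are not kept.
print(apply_min_word_len(["a", "to", "cat", "horse"], min_word_len=3))
```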

Manticore also supports matching words with different forms. For example, to treat "car" and "cars" as the same word, you can use morphology processors.

If you want different words to be treated as the same (for example, "USA" and "United States"), you can define them using the word forms feature.
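Conceptually, word forms act as a lookup from one or more source tokens to a single target token. The sketch below is a toy model of that idea, assuming tokens are already lowercased; the WORD_FORMS table and apply_word_forms function are hypothetical and do not reflect Manticore's word forms file format or its internals.

```python
# Hypothetical mapping from source forms (as token tuples) to one target token.
WORD_FORMS = {
    ("united", "states"): "usa",
    ("us",): "usa",
}
MAX_SOURCE_LEN = max(len(source) for source in WORD_FORMS)


def apply_word_forms(tokens: list[str]) -> list[str]:
    """Replace the longest matching source form found at each position."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word forms win.
        for n in range(min(MAX_SOURCE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in WORD_FORMS:
                out.append(WORD_FORMS[key])
                i += n
                break
        else:
            # No source form starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


print(apply_word_forms(["the", "united", "states", "of", "america"]))
```

Applying the same mapping to documents and queries is what lets a search for one form match text that contains the other.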

Very common words (like "the", "and", "is") can slow down searches and increase index size. You can filter them out using stop words. This can make searches faster and the index smaller.
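In its simplest form, stop word filtering is a set lookup, as in the sketch below. The STOP_WORDS set and remove_stop_words function are hypothetical; the three words are the examples from the paragraph above, not the contents of any bundled stopwords file.

```python
# Hypothetical stop word list taken from the examples in the text.
STOP_WORDS = frozenset({"the", "and", "is"})


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Return only the tokens that are not stop words."""
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words(["the", "car", "is", "red", "and", "fast"]))
```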

A more advanced filtering method is bigrams, which creates special tokens by combining a common word with an uncommon one. This can significantly speed up phrase searches when common words are involved.
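The idea behind bigram tokens can be sketched as follows. This toy version emits a combined token for every adjacent pair that contains at least one frequent word; that pairing rule, the FREQUENT list and the bigram_tokens name are assumptions made for illustration, and Manticore's actual pairing rules are configurable and not reproduced here.

```python
# Hypothetical list of frequent words.
FREQUENT = frozenset({"the", "of", "a"})


def bigram_tokens(tokens: list[str]) -> list[str]:
    """Emit one combined token per adjacent pair that has a frequent word."""
    pairs = zip(tokens, tokens[1:])
    return [f"{a} {b}" for a, b in pairs if a in FREQUENT or b in FREQUENT]


print(bigram_tokens(["lord", "of", "the", "rings"]))
```

A phrase search can then look up a rarer combined token such as "the rings" instead of intersecting the long document list of "the" with that of "rings".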

If you're indexing HTML, it's usually best not to include the HTML tags in the index, since they add a lot of unnecessary content. You can use HTML stripping to remove the tags, but still index certain tag attributes or skip specific elements entirely.
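To make the three behaviors concrete (drop tags, keep selected attributes, skip whole elements), here is a minimal sketch built on Python's standard html.parser module. It is not how Manticore strips HTML; the Stripper class, the strip_html function, and the choices in INDEX_ATTRS and SKIP_ELEMENTS are hypothetical.

```python
from html.parser import HTMLParser


class Stripper(HTMLParser):
    """Toy stripper: keeps text, selected attributes, and skips some elements."""

    INDEX_ATTRS = {("img", "alt"), ("a", "title")}  # hypothetical choices
    SKIP_ELEMENTS = {"script", "style"}             # hypothetical choices

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_ELEMENTS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        for name, value in attrs:
            if (tag, name) in self.INDEX_ATTRS and value:
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP_ELEMENTS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the indexable text of an HTML fragment, whitespace-collapsed."""
    parser = Stripper()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(strip_html('<p>Red <b>car</b></p><script>var x=1;</script>'
                 '<img src="a.png" alt="side view">'))
```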

Keep in mind that Manticore has a maximum token length of 42 bytes after token normalization. Any token longer than this will be truncated. This limit applies during both indexing and searching, so both indexed data and search queries are affected.
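One practical consequence is that two long tokens sharing the same first 42 bytes become indistinguishable. The sketch below illustrates this with a hypothetical truncate_token helper that cuts a UTF-8 string to at most 42 bytes on a character boundary; the exact way Manticore cuts multi-byte characters is an assumption here.

```python
MAX_TOKEN_BYTES = 42  # regular token limit stated in the text above


def truncate_token(token: str, limit: int = MAX_TOKEN_BYTES) -> str:
    """Cut a token to at most `limit` UTF-8 bytes, on a character boundary."""
    raw = token.encode("utf-8")
    if len(raw) <= limit:
        return token
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return raw[:limit].decode("utf-8", errors="ignore")


prefix = "x" * 42
# Both tokens collapse to the same 42-byte token after truncation.
print(truncate_token(prefix + "abc") == truncate_token(prefix + "xyz"))
```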

Tables that use dict=keywords_32k are the exception. They can index normalized tokens up to 32768 bytes in plain and RT tables. Tokens longer than 32768 bytes are skipped with a warning instead of being indexed as truncated terms.

Long keywords_32k tokens are indexed as original tokens. Tokens longer than 42 bytes after normalization do not go through stemming or lemmatization. Tokens up to 42 bytes can still use the configured morphology.

Snippets and highlighting still use the regular token limit. Tokens up to 42 bytes can be highlighted; longer keywords_32k tokens are skipped by snippet/highlight processing.
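The rules for dict=keywords_32k tables described above can be summarized as a small decision function. This is only a restatement of the text in code form, assuming the token has already been normalized; classify_token and its result keys are hypothetical names.

```python
REGULAR_LIMIT = 42    # bytes: regular token limit
LONG_LIMIT = 32768    # bytes: keywords_32k token limit


def classify_token(token: str) -> dict:
    """Describe how a normalized token is handled in a keywords_32k table."""
    size = len(token.encode("utf-8"))
    if size > LONG_LIMIT:
        # Skipped with a warning; not indexed as a truncated term.
        return {"indexed": False, "morphology": False, "highlightable": False}
    within_regular = size <= REGULAR_LIMIT
    return {
        "indexed": True,
        # Stemming/lemmatization and snippets apply only up to 42 bytes.
        "morphology": within_regular,
        "highlightable": within_regular,
    }


for length in (42, 43, 32768, 32769):
    print(length, classify_token("a" * length))
```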

Supported languages

Last modified: June 17, 2026

Manticore supports a wide range of languages. Basic support for most of them is enabled via charset_table = non_cont, which is the default value. The non_cjk option, an alias for non_cont, can be used as well: charset_table = non_cjk.

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for a few languages. It can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable basic support (column "Supported"), stopwords (column "Stopwords file name"), and advanced morphology (column "Advanced morphology").

Language	Supported	Stopwords file name	Advanced morphology	Notes
Afrikaans	charset_table=non_cont	af	-
Arabic	charset_table=non_cont	ar	morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian	charset_table=non_cont	hy	-
Assamese	specify charset_table manually	-	-
Basque	charset_table=non_cont	eu	-
Bengali	charset_table=non_cont	bn	-
Bishnupriya	specify charset_table manually	-	-
Buhid	specify charset_table manually	-	-
Bulgarian	charset_table=non_cont	bg	-
Catalan	charset_table=non_cont	ca	morphology=libstemmer_ca
Chinese using ICU	charset_table=chinese	zh	morphology=icu_chinese	More accurate than using ngrams
Chinese using Jieba	charset_table=chinese	zh	morphology=jieba_chinese, requires package `manticore-language-packs`	More accurate than using ngrams
Chinese using ngrams	ngram_chars=chinese	zh	ngram_chars=chinese ngram_len=1	Faster indexing, but the search performance might not be as good
Croatian	charset_table=non_cont	hr	-
Kurdish	charset_table=non_cont	ckb	-
Czech	charset_table=non_cont	cz	morphology=stem_cz (Czech stemmer)
Danish	charset_table=non_cont	da	morphology=libstemmer_da
Dutch	charset_table=non_cont	nl	morphology=libstemmer_nl
English	charset_table=non_cont	en	morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto	charset_table=non_cont	eo	-
Estonian	charset_table=non_cont	et	-
Finnish	charset_table=non_cont	fi	morphology=libstemmer_fi
French	charset_table=non_cont	fr	morphology=libstemmer_fr
Galician	charset_table=non_cont	gl	-
Garo	specify charset_table manually	-	-
German	charset_table=non_cont	de	morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek	charset_table=non_cont	el	morphology=libstemmer_el
Hebrew	charset_table=non_cont	he	-
Hindi	charset_table=non_cont	hi	morphology=libstemmer_hi
Hmong	specify charset_table manually	-	-
Ho	specify charset_table manually	-	-
Hungarian	charset_table=non_cont	hu	morphology=libstemmer_hu
Indonesian	charset_table=non_cont	id	morphology=libstemmer_id
Irish	charset_table=non_cont	ga	morphology=libstemmer_ga
Italian	charset_table=non_cont	it	morphology=libstemmer_it
Japanese	ngram_chars=japanese	-	ngram_chars=japanese ngram_len=1	Requires ngram-based segmentation
Komi	specify charset_table manually	-	-
Korean	ngram_chars=korean	-	ngram_chars=korean ngram_len=1	Requires ngram-based segmentation
Large Flowery Miao	specify charset_table manually	-	-
Latin	charset_table=non_cont	la	-
Latvian	charset_table=non_cont	lv	-
Lithuanian	charset_table=non_cont	lt	morphology=libstemmer_lt
Maba	specify charset_table manually	-	-
Maithili	specify charset_table manually	-	-
Marathi	charset_table=non_cont	mr	-
Mende	specify charset_table manually	-	-
Mru	specify charset_table manually	-	-
Myene	specify charset_table manually	-	-
Nepali	specify charset_table manually	-	morphology=libstemmer_ne
Ngambay	specify charset_table manually	-	-
Norwegian	charset_table=non_cont	no	morphology=libstemmer_no
Odia	specify charset_table manually	-	-
Persian	charset_table=non_cont	fa	-
Polish	charset_table=non_cont	pl	-
Portuguese	charset_table=non_cont	pt	morphology=libstemmer_pt
Romanian	charset_table=non_cont	ro	morphology=libstemmer_ro
Russian	charset_table=non_cont	ru	morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali	specify charset_table manually	-	-
Sindhi	specify charset_table manually	-	-
Slovak	charset_table=non_cont	sk	-
Slovenian	charset_table=non_cont	sl	-
Somali	charset_table=non_cont	so	-
Sotho	charset_table=non_cont	st	-
Spanish	charset_table=non_cont	es	morphology=libstemmer_es
Swahili	charset_table=non_cont	sw	-
Swedish	charset_table=non_cont	sv	morphology=libstemmer_sv
Sylheti	specify charset_table manually	-	-
Tamil	specify charset_table manually	-	morphology=libstemmer_ta
Thai	charset_table=thai	th	-
Turkish	charset_table=non_cont	tr	morphology=libstemmer_tr
Ukrainian	charset_table=non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491	-	morphology=lemmatize_uk (single root form); morphology=lemmatize_uk_all (all root forms)	Override `charset_table` to preserve `і`, `ї`, and `ґ`
Vietnamese	charset_table=non_cont	-	-	Uses Latin script. Vietnamese diacritics (ă, â, ê, ô, ơ, ư, đ, and tone marks) are automatically mapped to their base Latin characters by default, so "tiếng" matches "tieng" without additional configuration.
Yoruba	charset_table=non_cont	yo	-
Zulu	charset_table=non_cont	zu	-
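The effect described in the Vietnamese row, where "tiếng" matches "tieng", can be reproduced outside Manticore with Unicode decomposition. The sketch below only illustrates the resulting mapping: Manticore achieves it through its character tables, not through this code, and fold_diacritics is a hypothetical helper.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Map accented Latin letters to their base letters."""
    # NFD splits a letter such as "ế" into "e" plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ" and "Đ" have no canonical decomposition, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


print(fold_diacritics("tiếng"), fold_diacritics("đường"))
```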

Languages with continuous scripts

Last modified: April 30, 2026

Manticore provides built-in support for indexing languages with continuous scripts (i.e., languages that do not use spaces or other marks between words or sentences). You can process texts in these languages in the following ways:

Precise segmentation using the ICU library. Currently, only Chinese is supported.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cont',
            'morphology' => 'icu_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", Some(true)).await;

CONFIG:

table products {
  charset_table = cont
  morphology = icu_chinese
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cont',
            'morphology' => 'jieba_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", Some(true)).await;

CONFIG:

table products {
  charset_table = cont
  morphology = jieba_chinese
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Basic support using the N-gram options ngram_len and ngram_chars. For each language using a continuous script, there is a separate character set table (chinese, korean, japanese, thai). Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk character set table to include only the CJK languages.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'"
/* Or, alternatively */
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'non_cont',
            'ngram_len' => '1',
            'ngram_chars' => 'cont'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", Some(true)).await;

CONFIG:

table products {
  charset_table = non_cont
  ngram_len = 1
  ngram_chars = cont
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}
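To show what N-gram segmentation with a length of 1 means in practice, the sketch below splits every character from a continuous script into its own token while keeping other alphanumeric runs together as words. It is a simplified model: is_cont_char covers only a few illustrative Unicode ranges (no Thai, no extension blocks), and both function names are hypothetical.

```python
def is_cont_char(ch: str) -> bool:
    """Rough stand-in for ngram_chars: a few CJK-related Unicode ranges."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF    # Hangul Syllables
    )


def ngram_tokenize(text: str) -> list[str]:
    """Emit each continuous-script character as a token; keep other words whole."""
    tokens, word = [], []

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_cont_char(ch):
            flush()
            tokens.append(ch)   # 1-character N-gram
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            flush()             # separator ends the current word
    flush()
    return tokens


print(ngram_tokenize("Manticore 搜索引擎 v2"))
```

Since every character becomes a token, indexing needs no dictionary, but a multi-character word is only found by matching its characters in sequence, which is why the table above notes that search performance may be lower than with ICU or Jieba segmentation.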

Additionally, there is built-in support for Chinese stopwords with the alias zh.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", Some(true)).await;

CONFIG:

table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Low-level tokenization

Last modified: August 28, 2025
