
Manticore supports a wide range of languages. Basic support for most of them is enabled via charset_table = non_cont, which is the default value. The non_cjk option, an alias for non_cont, can be used as well: charset_table = non_cjk.

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for a few languages. It can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable:

basic support (column "Supported")
stopwords (column "Stopwords file name")
advanced morphology (column "Advanced morphology")

Language	Supported	Stopwords file name	Advanced morphology	Notes
Afrikaans	charset_table=non_cont	af	-
Arabic	charset_table=non_cont	ar	morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian	charset_table=non_cont	hy	-
Assamese	specify charset_table manually	-	-
Basque	charset_table=non_cont	eu	-
Bengali	charset_table=non_cont	bn	-
Bishnupriya	specify charset_table manually	-	-
Buhid	specify charset_table manually	-	-
Bulgarian	charset_table=non_cont	bg	-
Catalan	charset_table=non_cont	ca	morphology=libstemmer_ca
Chinese using ICU	charset_table=chinese	zh	morphology=icu_chinese	More accurate than using ngrams
Chinese using Jieba	charset_table=chinese	zh	morphology=jieba_chinese, requires package `manticore-language-packs`	More accurate than using ngrams
Chinese using ngrams	ngram_chars=chinese	zh	ngram_chars=chinese ngram_len=1	Faster indexing, but the search performance might not be as good
Croatian	charset_table=non_cont	hr	-
Kurdish	charset_table=non_cont	ckb	-
Czech	charset_table=non_cont	cz	morphology=stem_cz (Czech stemmer)
Danish	charset_table=non_cont	da	morphology=libstemmer_da
Dutch	charset_table=non_cont	nl	morphology=libstemmer_nl
English	charset_table=non_cont	en	morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto	charset_table=non_cont	eo	-
Estonian	charset_table=non_cont	et	-
Finnish	charset_table=non_cont	fi	morphology=libstemmer_fi
French	charset_table=non_cont	fr	morphology=libstemmer_fr
Galician	charset_table=non_cont	gl	-
Garo	specify charset_table manually	-	-
German	charset_table=non_cont	de	morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek	charset_table=non_cont	el	morphology=libstemmer_el
Hebrew	charset_table=non_cont	he	-
Hindi	charset_table=non_cont	hi	morphology=libstemmer_hi
Hmong	specify charset_table manually	-	-
Ho	specify charset_table manually	-	-
Hungarian	charset_table=non_cont	hu	morphology=libstemmer_hu
Indonesian	charset_table=non_cont	id	morphology=libstemmer_id
Irish	charset_table=non_cont	ga	morphology=libstemmer_ga
Italian	charset_table=non_cont	it	morphology=libstemmer_it
Japanese	ngram_chars=japanese	-	ngram_chars=japanese ngram_len=1	Requires ngram-based segmentation
Komi	specify charset_table manually	-	-
Korean	ngram_chars=korean	-	ngram_chars=korean ngram_len=1	Requires ngram-based segmentation
Large Flowery Miao	specify charset_table manually	-	-
Latin	charset_table=non_cont	la	-
Latvian	charset_table=non_cont	lv	-
Lithuanian	charset_table=non_cont	lt	morphology=libstemmer_lt
Maba	specify charset_table manually	-	-
Maithili	specify charset_table manually	-	-
Marathi	specify charset_table manually	-	-
Marathi	charset_table=non_cont	mr	-
Mende	specify charset_table manually	-	-
Mru	specify charset_table manually	-	-
Myene	specify charset_table manually	-	-
Nepali	specify charset_table manually	-	morphology=libstemmer_ne
Ngambay	specify charset_table manually	-	-
Norwegian	charset_table=non_cont	no	morphology=libstemmer_no
Odia	specify charset_table manually	-	-
Persian	charset_table=non_cont	fa	-
Polish	charset_table=non_cont	pl	-
Portuguese	charset_table=non_cont	pt	morphology=libstemmer_pt
Romanian	charset_table=non_cont	ro	morphology=libstemmer_ro
Russian	charset_table=non_cont	ru	morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali	specify charset_table manually	-	-
Sindhi	specify charset_table manually	-	-
Slovak	charset_table=non_cont	sk	-
Slovenian	charset_table=non_cont	sl	-
Somali	charset_table=non_cont	so	-
Sotho	charset_table=non_cont	st	-
Spanish	charset_table=non_cont	es	morphology=libstemmer_es
Swahili	charset_table=non_cont	sw	-
Swedish	charset_table=non_cont	sv	morphology=libstemmer_sv
Sylheti	specify charset_table manually	-	-
Tamil	specify charset_table manually	-	morphology=libstemmer_ta
Thai	charset_table=thai	th	-
Turkish	charset_table=non_cont	tr	morphology=libstemmer_tr
Ukrainian	charset_table=non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491	-	morphology=lemmatize_uk_all	Requires installation of UK lemmatizer
Yoruba	charset_table=non_cont	yo	-
Zulu	charset_table=non_cont	zu	-
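
As an illustration of how the three columns combine, the sketch below assembles a CREATE TABLE statement from a language's table options. The helper name build_create_table is hypothetical and not part of any Manticore client. It only formats a string, performs no escaping, and assumes the option values come from the table above (here the German row: charset_table=non_cont, stopwords de, morphology=libstemmer_de).

```python
def build_create_table(name, columns, **options):
    """Return a CREATE TABLE statement with table options appended.

    Hypothetical helper for illustration only: values are inserted
    verbatim, so they must not contain single quotes.
    """
    cols = ", ".join(f"{col} {typ}" for col, typ in columns.items())
    opts = " ".join(f"{key} = '{value}'" for key, value in options.items())
    return f"CREATE TABLE {name}({cols}) {opts}".strip()


# Option values taken from the German row of the table above.
german_sql = build_create_table(
    "products",
    {"title": "text", "price": "float"},
    charset_table="non_cont",
    stopwords="de",
    morphology="libstemmer_de",
)
print(german_sql)
```

The resulting string can then be passed to whichever client call you use to run SQL, such as the utilsApi.sql calls shown in the examples on this page.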

Languages with continuous scripts

Manticore provides built-in support for indexing languages with continuous scripts (i.e., languages that do not use spaces or other marks between words or sentences). You can process texts in these languages in the following ways:

Precise segmentation using the ICU library. Currently, only Chinese is supported.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cont',
            'morphology' => 'icu_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", Some(true)).await;

CONFIG:

table products {
  charset_table = cont
  morphology = icu_chinese

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cont',
            'morphology' => 'jieba_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", Some(true)).await;

CONFIG:

table products {
  charset_table = cont
  morphology = jieba_chinese

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Basic support using the N-gram options ngram_len and ngram_chars. For each language using a continuous script, there is a separate character set table (chinese, korean, japanese, thai) that can be used. Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk character set table to include the CJK languages only.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'

/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'"

/* Or, alternatively */
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'non_cont',
            'ngram_len' => '1',
            'ngram_chars' => 'cont'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", Some(true)).await;

CONFIG:

table products {
  charset_table = non_cont
  ngram_len = 1
  ngram_chars = cont

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}
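
To give a rough idea of what ngram_len = 1 means in practice, the following sketch splits every character from a continuous script into its own token while keeping space-delimited words whole. It is a conceptual illustration only, not Manticore's tokenizer: the function name ngram_tokenize and the CONT_RANGES list are assumptions made for this example, and the ranges are a simplified stand-in for the real cont character set table.

```python
# Simplified, assumed code point ranges for continuous scripts.
# The real `cont` character set table in Manticore is more complete.
CONT_RANGES = [
    (0x0E00, 0x0E7F),  # Thai
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul syllables
]


def is_cont_char(ch):
    """Return True if the character falls in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CONT_RANGES)


def ngram_tokenize(text):
    """Emit each continuous-script character as a 1-gram token.

    Other alphanumeric runs are lowercased and kept as whole words;
    everything else acts as a separator.
    """
    tokens, word = [], []
    for ch in text:
        if is_cont_char(ch):
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(ngram_tokenize("Manticore 搜索引擎"))
```

Because every continuous-script character becomes a separate token, no dictionary is needed at indexing time, which is in line with the "faster indexing" note for the ngram rows in the table. The trade-off is that word boundaries are not recovered, unlike with ICU or Jieba segmentation.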

Additionally, there is built-in support for Chinese stopwords with the alias zh.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", Some(true)).await;

CONFIG:

table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}
