
NLP and tokenization

Manticore doesn't search the stored text as is when performing a full-text search. Instead, it extracts words from the text and builds several structures that allow fast full-text searching. From the words found, a dictionary is built, which allows a quick lookup to check whether a word is present in the index. In addition, other structures record the documents and fields in which each word was found, as well as its position inside the field. All of these are used when a full-text match is performed.

The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and search time, and it operates at both the character level and the word level.

At the character level, the engine allows only certain characters to pass. These are defined by the charset_table. Anything else is replaced with whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
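
The character-level step can be pictured with a small Python sketch. It is only an illustration of the idea, not Manticore's implementation: the `allowed` set and the `mappings` dictionary are hypothetical stand-ins for what a charset_table definition expresses.

```python
import string


def apply_charset(text, allowed, mappings):
    """Map characters first, then replace anything not allowed with a space.

    `allowed` and `mappings` are hypothetical stand-ins for a charset_table
    definition; they are not Manticore data structures.
    """
    out = []
    for ch in text:
        ch = mappings.get(ch, ch)                 # e.g. 'A' -> 'a'
        out.append(ch if ch in allowed else " ")  # everything else separates words
    return "".join(out)


# Assumed toy character table: lowercase Latin letters and digits,
# with uppercase letters mapped to lowercase.
allowed = set(string.ascii_lowercase + string.digits)
mappings = {u: u.lower() for u in string.ascii_uppercase}

normalized = apply_charset("Hello, World-2023!", allowed, mappings)
tokens = normalized.split()
print(tokens)
```

Because punctuation is not in the allowed set, it turns into whitespace, so "World-2023" ends up as two separate tokens.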

At the word level, the base setting is min_word_len, which defines the minimum length, in characters, that a word must have to be accepted into the index. A common request is to match singular and plural forms of words. For this, morphology processors can be used.
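
As a rough sketch of these two word-level steps, the Python below drops short tokens and then applies a deliberately naive plural rule. The function names and the rule are assumptions made for illustration; real morphology processors (stemmers and lemmatizers) are considerably more careful.

```python
def filter_short(tokens, min_word_len=1):
    """Keep only tokens that are at least `min_word_len` characters long."""
    return [t for t in tokens if len(t) >= min_word_len]


def toy_stem(token):
    """Toy plural rule (an assumption): strip one trailing 's', but not 'ss'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


tokens = ["a", "to", "cats", "cat", "glass"]
kept = filter_short(tokens, min_word_len=3)
stems = [toy_stem(t) for t in kept]
print(kept)
print(stems)
```

After stemming, "cats" and "cat" share the same indexed form, which is what lets a query for one match documents containing the other.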

Going further, we might want a word to be matched as another one because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.
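
Conceptually, word forms behave like a lookup applied to tokens at both indexing and search time. The sketch below assumes a hypothetical `WORDFORMS` dictionary and handles only single-word sources; it does not reflect the format of an actual word forms file.

```python
# Hypothetical mapping, for illustration only.
WORDFORMS = {"automobile": "car", "vehicle": "car", "cars": "car"}


def normalize(tokens, wordforms=WORDFORMS):
    """Replace each token with its mapped form, if one is defined."""
    return [wordforms.get(t, t) for t in tokens]


doc_terms = normalize(["red", "automobile"])
query_terms = normalize(["red", "car"])
print(doc_terms == query_terms)
```

Since the same mapping is applied on both sides, the document and the query end up with identical terms.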

Very common words can have unwanted effects on searching, mostly because their high frequency means that processing their document and hit lists requires a lot of computation. They can be blacklisted with the stop words functionality. This helps not only to speed up queries but also to decrease the index size.
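
The effect on index size can be seen in a toy inverted index. This is a minimal sketch under simplifying assumptions (whitespace tokenization, a hypothetical `STOPWORDS` set, hits stored as (document ID, position) pairs), not a description of Manticore's internal structures.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of"}  # hypothetical stop word list


def build_index(docs, stopwords=frozenset()):
    """Map each word to a list of (doc_id, position) hits, skipping stop words."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word in stopwords:
                continue  # the position is still counted, but no hit is stored
            index[word].append((doc_id, pos))
    return index


docs = {1: "the king of the hill", 2: "the hill"}
full = build_index(docs)
trimmed = build_index(docs, STOPWORDS)

full_hits = sum(len(hits) for hits in full.values())
trimmed_hits = sum(len(hits) for hits in trimmed.values())
print(full_hits, trimmed_hits)
```

In this example more than half of the stored hits belong to stop words, so skipping them shrinks the hit lists that a query has to walk.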

A more advanced form of blacklisting is bigrams, which allows creating a special token from a "bigram" (common) word and an adjacent uncommon word. This can make phrase searches that contain common words several times faster.

When indexing HTML content, it's important not to index the HTML tags, as they can introduce a lot of "noise" into the index. HTML stripping can be used for this, and it can be configured to strip the markup while still indexing certain tag attributes, or to completely ignore the content of certain HTML elements.
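
The three behaviors described above (drop the tags, keep selected attribute values, skip whole elements) can be sketched with Python's standard html.parser module. The class, function, and parameter names (`Stripper`, `strip_html`, `index_attrs`, `remove_elements`) are made up for this illustration and are not Manticore settings.

```python
from html.parser import HTMLParser


class Stripper(HTMLParser):
    """Collect indexable text from HTML (illustrative sketch only)."""

    def __init__(self, index_attrs=None, remove_elements=()):
        super().__init__()
        self.index_attrs = index_attrs or {}         # e.g. {"img": {"alt"}}
        self.remove_elements = set(remove_elements)  # e.g. {"style", "script"}
        self.skip_depth = 0                          # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_elements:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        for name, value in attrs:
            # Keep only the attribute values explicitly selected for this tag.
            if value and name in self.index_attrs.get(tag, ()):
                self.parts.append(value)

    def handle_endtag(self, tag):
        if tag in self.remove_elements and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html, index_attrs=None, remove_elements=()):
    parser = Stripper(index_attrs, remove_elements)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join(" ".join(parser.parts).split())


page = '<p>Fast <b>search</b></p><img src="x.png" alt="logo"><style>p{color:red}</style>'
text = strip_html(page, index_attrs={"img": {"alt"}}, remove_elements={"style"})
print(text)
```

Here the `alt` text of the image is kept for indexing, while the `src` value and the whole content of the `style` element are discarded.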


Last modified: March 01, 2023

Supported languages

Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cjk (which is the default value).

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for some languages. It can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable:

basic support (column "Supported")
stopwords (column "Stopwords file name")
advanced morphology (column "Advanced morphology")

Language	Supported	Stopwords file name	Advanced morphology	Notes
Afrikaans	charset_table=non_cjk	af	-
Arabic	charset_table=non_cjk	ar	morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian	charset_table=non_cjk	hy	-
Assamese	specify charset_table manually	-	-
Basque	charset_table=non_cjk	eu	-
Bengali	charset_table=non_cjk	bn	-
Bishnupriya	specify charset_table manually	-	-
Buhid	specify charset_table manually	-	-
Bulgarian	charset_table=non_cjk	bg	-
Catalan	charset_table=non_cjk	ca	morphology=libstemmer_ca
Chinese	charset_table=chinese or ngram_chars=chinese	zh	morphology=icu_chinese or ngram_len=1 correspondingly	ICU dictionary-based segmentation is much more accurate than ngram-based segmentation
Croatian	charset_table=non_cjk	hr	-
Central Kurdish	charset_table=non_cjk	ckb	-
Czech	charset_table=non_cjk	cz	morphology=stem_cz (Czech stemmer)
Danish	charset_table=non_cjk	da	morphology=libstemmer_da
Dutch	charset_table=non_cjk	nl	morphology=libstemmer_nl
English	charset_table=non_cjk	en	morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto	charset_table=non_cjk	eo	-
Estonian	charset_table=non_cjk	et	-
Finnish	charset_table=non_cjk	fi	morphology=libstemmer_fi
French	charset_table=non_cjk	fr	morphology=libstemmer_fr
Galician	charset_table=non_cjk	gl	-
Garo	specify charset_table manually	-	-
German	charset_table=non_cjk	de	morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek	charset_table=non_cjk	el	morphology=libstemmer_el
Hebrew	charset_table=non_cjk	he	-
Hindi	charset_table=non_cjk	hi	morphology=libstemmer_hi
Hmong	specify charset_table manually	-	-
Ho	specify charset_table manually	-	-
Hungarian	charset_table=non_cjk	hu	morphology=libstemmer_hu
Indonesian	charset_table=non_cjk	id	morphology=libstemmer_id
Irish	charset_table=non_cjk	ga	morphology=libstemmer_ga
Italian	charset_table=non_cjk	it	morphology=libstemmer_it
Japanese	ngram_chars=japanese	-	ngram_chars=japanese ngram_len=1	Requires ngram-based segmentation
Komi	specify charset_table manually	-	-
Korean	ngram_chars=korean	-	ngram_chars=korean ngram_len=1	Requires ngram-based segmentation
Large Flowery Miao	specify charset_table manually	-	-
Latin	charset_table=non_cjk	la	-
Latvian	charset_table=non_cjk	lv	-
Lithuanian	charset_table=non_cjk	lt	morphology=libstemmer_lt
Maba	specify charset_table manually	-	-
Maithili	specify charset_table manually	-	-
Marathi	specify charset_table manually	-	-
Marathi	charset_table=non_cjk	mr	-
Mende	specify charset_table manually	-	-
Mru	specify charset_table manually	-	-
Myene	specify charset_table manually	-	-
Nepali	specify charset_table manually	-	morphology=libstemmer_ne
Ngambay	specify charset_table manually	-	-
Norwegian	charset_table=non_cjk	no	morphology=libstemmer_no
Odia	specify charset_table manually	-	-
Persian	charset_table=non_cjk	fa	-
Polish	charset_table=non_cjk	pl	-
Portuguese	charset_table=non_cjk	pt	morphology=libstemmer_pt
Romanian	charset_table=non_cjk	ro	morphology=libstemmer_ro
Russian	charset_table=non_cjk	ru	morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali	specify charset_table manually	-	-
Sindhi	specify charset_table manually	-	-
Slovak	charset_table=non_cjk	sk	-
Slovenian	charset_table=non_cjk	sl	-
Somali	charset_table=non_cjk	so	-
Sotho	charset_table=non_cjk	st	-
Spanish	charset_table=non_cjk	es	morphology=libstemmer_es
Swahili	charset_table=non_cjk	sw	-
Swedish	charset_table=non_cjk	sv	morphology=libstemmer_sv
Sylheti	specify charset_table manually	-	-
Tamil	specify charset_table manually	-	morphology=libstemmer_ta
Thai	charset_table=non_cjk	th	-
Turkish	charset_table=non_cjk	tr	morphology=libstemmer_tr
Ukrainian	charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491	-	morphology=lemmatize_uk_all	Requires installation of UK lemmatizer
Yoruba	charset_table=non_cjk	yo	-
Zulu	charset_table=non_cjk	zu	-


Last modified: April 14, 2023

Chinese, Japanese and Korean (CJK) languages

Manticore provides built-in support for indexing CJK texts, which can be processed in two different ways:

Precise segmentation using the ICU library. Currently, only Chinese is supported.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cjk',
            'morphology' => 'icu_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");

CONFIG:

table products {
  charset_table = cjk
  morphology = icu_chinese

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Basic support using the N-gram options ngram_len and ngram_chars. For each CJK language, there is a separate character set table (chinese, korean, japanese) that can be used, or you can use the common cjk character set table.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'non_cjk',
            'ngram_len' => '1',
            'ngram_chars' => 'cjk'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

CONFIG:

table products {
  charset_table = non_cjk
  ngram_len = 1
  ngram_chars = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}
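
To give an intuition for what N-gram indexing with a length of 1 does, the Python sketch below emits every CJK character as its own token while keeping other alphanumeric runs together as ordinary words. It is an approximation for illustration only: the `is_cjk` helper is an assumption that covers just the main CJK Unified Ideographs block, which is much narrower than a real CJK character set table.

```python
def is_cjk(ch):
    """Rough assumption: only the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    """Emit each CJK character as a 1-gram and group other alphanumerics into words."""
    tokens, word = [], []

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)        # every CJK character is a token of its own
        elif ch.isalnum():
            word.append(ch.lower())  # non-CJK characters build up regular words
        else:
            flush()                  # anything else acts as a separator
    flush()
    return tokens


tokens = tokenize("Manticore 搜索引擎 2023")
print(tokens)
```

Splitting into single characters needs no dictionary, which is why it works as basic support for any CJK language, at the cost of less precise matching than dictionary-based segmentation.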

Additionally, there is built-in support for Chinese stopwords with the alias zh.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");

CONFIG:

table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh

  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}


Last modified: May 15, 2023
