Creating a table > NLP and tokenization > CJK

Manticore doesn't store text as is for performing full-text searching on it. Instead, it extracts words and creates several structures that allow fast full-text searching. From the found words, a dictionary is built, which allows a quick look to discover if the word is present or not in the index. In addition, other structures record the documents and fields in which the word was found (as well as the position of it inside a field). All these are used when a full-text match is performed.

The process of demarcating and classifying words is called tokenization. The tokenization is applied at both indexing and searching, and it operates at the character and word level.

On the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, defined as a phrase boundary.

At the word level, the base setting is the min_word_len which defines the minimum word length in characters to be accepted in the index. A common request is to match singular with plural forms of words. For this, morphology processors can be used.

Going further, we might want a word to be matched as another one because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.

Very common words can have some unwanted effects on searching, mostly because of their frequency they require lots of computing to process their doc/hit lists. They can be blacklisted with the stop words functionality. This helps not only in speeding up queries but also in decreasing the index size.

A more advanced blacklisting is bigrams, which allows creating a special token between a "bigram" (common) word and an uncommon word. This can speed up several times when common words are used in phrase searches.

In case of indexing HTML content, it's important not to index the HTML tags, as they can introduce a lot of "noise" in the index. HTML stripping can be used and can be configured to strip, but index certain tag attributes or completely ignore the content of certain HTML elements.

Supported languages

Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cjk (which is the default value).

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for a few languages that can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable:

basic support (column "Supported")
stopwords (column "Stopwords file name")
advanced morphology (column "Advanced morphology")

Language	Supported	Stopwords file name	Advanced morphology	Notes
Afrikaans	charset_table=non_cjk	af	-
Arabic	charset_table=non_cjk	ar	morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian	charset_table=non_cjk	hy	-
Assamese	specify charset_table specify charset_table manually	-	-
Basque	charset_table=non_cjk	eu	-
Bengali	charset_table=non_cjk	bn	-
Bishnupriya	specify charset_table manually	-	-
Buhid	specify charset_table manually	-	-
Bulgarian	charset_table=non_cjk	bg	-
Catalan	charset_table=non_cjk	ca	morphology=libstemmer_ca
Chinese	charset_table=chinese or ngram_chars=chinese	zh	morphology=icu_chinese or ngram_chars=1 correspondingly	ICU dictionary based segmentation is much more accurate than ngram-based
Croatian	charset_table=non_cjk	hr	-
Kurdish	charset_table=non_cjk	ckb	-
Czech	charset_table=non_cjk	cz	morphology=stem_cz (Czech stemmer)
Danish	charset_table=non_cjk	da	morphology=libstemmer_da
Dutch	charset_table=non_cjk	nl	morphology=libstemmer_nl
English	charset_table=non_cjk	en	morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto	charset_table=non_cjk	eo	-
Estonian	charset_table=non_cjk	et	-
Finnish	charset_table=non_cjk	fi	morphology=libstemmer_fi
French	charset_table=non_cjk	fr	morphology=libstemmer_fr
Galician	charset_table=non_cjk	gl	-
Garo	specify charset_table manually	-	-
German	charset_table=non_cjk	de	morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek	charset_table=non_cjk	el	morphology=libstemmer_el
Hebrew	charset_table=non_cjk	he	-
Hindi	charset_table=non_cjk	hi	morphology=libstemmer_hi
Hmong	specify charset_table manually	-	-
Ho	specify charset_table manually	-	-
Hungarian	charset_table=non_cjk	hu	morphology=libstemmer_hu
Indonesian	charset_table=non_cjk	id	morphology=libstemmer_id
Irish	charset_table=non_cjk	ga	morphology=libstemmer_ga
Italian	charset_table=non_cjk	it	morphology=libstemmer_it
Japanese	ngram_chars=japanese	-	ngram_chars=japanese ngram_len=1	Requires ngram-based segmentation
Komi	specify charset_table manually	-	-
Korean	ngram_chars=korean	-	ngram_chars=korean ngram_len=1	Requires ngram-based segmentation
Large Flowery Miao	specify charset_table manually	-	-
Latin	charset_table=non_cjk	la	-
Latvian	charset_table=non_cjk	lv	-
Lithuanian	charset_table=non_cjk	lt	morphology=libstemmer_lt
Maba	specify charset_table manually	-	-
Maithili	specify charset_table manually	-	-
Marathi	specify charset_table manually	-	-
Marathi	charset_table=non_cjk	mr	-
Mende	specify charset_table manually	-	-
Mru	specify charset_table manually	-	-
Myene	specify charset_table manually	-	-
Nepali	specify charset_table manually	-	morphology=libstemmer_ne
Ngambay	specify charset_table manually	-	-
Norwegian	charset_table=non_cjk	no	morphology=libstemmer_no
Odia	specify charset_table manually	-	-
Persian	charset_table=non_cjk	fa	-
Polish	charset_table=non_cjk	pl	-
Portuguese	charset_table=non_cjk	pt	morphology=libstemmer_pt
Romanian	charset_table=non_cjk	ro	morphology=libstemmer_ro
Russian	charset_table=non_cjk	ru	morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali	specify charset_table manually	-	-
Sindhi	specify charset_table manually	-	-
Slovak	charset_table=non_cjk	sk	-
Slovenian	charset_table=non_cjk	sl	-
Somali	charset_table=non_cjk	so	-
Sotho	charset_table=non_cjk	st	-
Spanish	charset_table=non_cjk	es	morphology=libstemmer_es
Swahili	charset_table=non_cjk	sw	-
Swedish	charset_table=non_cjk	sv	morphology=libstemmer_sv
Sylheti	specify charset_table manually	-	-
Tamil	specify charset_table manually	-	morphology=libstemmer_ta
Thai	charset_table=thai	th	-
Turkish	charset_table=non_cjk	tr	morphology=libstemmer_tr
Ukrainian	charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491	-	morphology=lemmatize_uk_all	Requires installation of UK lemmatizer
Yoruba	charset_table=non_cjk	yo	-
Zulu	charset_table=non_cjk	zu	-

Data tokenization CJK

Manticore provides built-in support for indexing CJK texts, allowing you to process CJK texts in two different ways:

Precise segmentation using the ICU library. Currently, only Chinese is supported.

‹›

SQL
JSON
PHP
Python
Javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cjk',
            'morphology' => 'icu_chinese'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'');

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");

table products {
  charset_table = cjk
  morphology = icu_chinese

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Basic support using the N-gram options ngram_len and ngram_chars For each CJK language, there are separate character set tables (chinese, korean, japanese) that can be used, or you can use the common cjk character set table.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'charset_table' => 'non_cjk',
             'ngram_len' => '1',
             'ngram_chars' => 'cjk'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

table products {
  charset_table = non_cjk
  ngram_len = 1
  ngram_chars = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Additionally, there is built-in support for Chinese stopwords with the alias zh.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");

table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Supported languages Low-level tokenization

When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.

To perform these operations correctly, Manticore must know:

the encoding of the source text (which should always be UTF-8)
which characters are considered letters and which are not
which letters should be folded to other letters

You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.

The default character set is non_cjk, which includes most languages.

You can also define text pattern replacement rules. For example, with the following rules:

regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR

The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.

You can learn more about regexp_filter here.

# default
charset_table = non_cjk

# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

# english charset defined with alias
charset_table = 0..9, english, _

# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF

charset_table specifies an array that maps letter characters to their case folded versions (or any other characters if you like). The default character set is non_cjk which includes most non-CJK languages.

charset_table is a workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls what characters are accepted as valid and how they should be transformed (e.g. whether case should be removed or not).

By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.

charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:

Single char mapping: A->a. Declares the source character 'A' as allowed within keywords and maps it to the destination character 'a' (but does not declare 'a' as allowed).
Range mapping: A..Z->a..z. Declares all characters in the source range as allowed and maps them to the destination range. Does not declare the destination range as allowed. Checks the lengths of both ranges.
Stray char mapping: a. Declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
Stray range mapping: a..z. Declares all characters in the range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
Checkerboard range mapping: A..Z/2. Maps every pair of characters to the second character. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for Unicode blocks where uppercase and lowercase letters go in an interleaved order.

For characters with codes from 0 to 32, and those in the range of 127 to 8-bit ASCII and Unicode characters, Manticore always treats them as separators. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.

If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cjk array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:

charset_table = non_cjk,U+00E4,U+00C4

for case sensitive search or

charset_table = non_cjk,U+00E4,U+00C4->U+00E4

for case insensitive search.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'');

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");

table products {
  charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:

english
russian
non_cjk
cjk

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => '0..9, english, _'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'');

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");

table products {
  charset_table = 0..9, english, _

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cjk and cjk, that cover non-CJK and CJK (Chinese, Japanese, Korean) languages respectively. In most cases, these charsets should be sufficient for your needs.

Please note that the following languages are currently not supported:

Assamese
Bishnupriya
Buhid
Garo
Hmong
Ho
Komi
Large Flowery Miao
Maba
Maithili
Marathi
Mende
Mru
Myene
Ngambay
Odia
Santali
Sindhi
Sylheti

All other languages listed in the Unicode languages list are supported by default.

To work with both cjk and non-cjk languages, set the options in your configuration file as shown below (with an exception for Chinese):

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'charset_table' => 'non_cjk',
             'ngram_len' => '1',
             'ngram_chars' => 'cjk'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

table products {
  charset_table       = non_cjk
  ngram_len           = 1
  ngram_chars         = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you do not require support for cjk-languages, you can simply exclude the ngram_len and ngram_chars options. For more information on these options, refer to the corresponding documentation sections.

To map one character to multiple characters or vice versa, you can use regexp_filter can be helpful.

blend_chars = +, &, U+23
blend_chars = +, &->+

Blended characters list. Optional, default is empty.

Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed, at&t, at and t.

Additionally, blended characters can influence indexing in such a way that keywords are indexed as if the blended characters were not typed at all. This behavior is particularly evident when blend_mode = trim_all is specified. For example, the phrase some_thing will be indexed as some, something, and thing with blend_mode = trim_all.

Care should be taken when using blended characters as defining a character as blended means that it is no longer a separator.

Therefore, if you put a comma to the blend_chars and search for dog,cat, it will treat that as a single token dog,cat. If dog,cat was not indexed as dog,cat, but left as dog cat only, then it will not match.
Hence, this behavior should be controlled with the blend_mode setting.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs in the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".

Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.

Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'

POST /cli -d "
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'blend_chars' => '+, &, U+23, @->_'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'');

utilsApi.sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");

utilsApi.Sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");

table products {
  blend_chars = +, &, U+23, @->_

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure

The blended tokens indexing mode is enabled by the blend_mode directive.

By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.

blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.

If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.

The options are:

trim_none - Index the entire token
trim_head - Trim heading blended characters, and index the resulting token
trim_tail - Trim trailing blended characters, and index the resulting token
trim_both- Trim both heading and trailing blended characters, and index the resulting token
trim_all - Trim heading, trailing, and middle blended characters, and index the resulting token
skip_pure - Do not index the token if it is purely blended, that is, consists of blended characters only

Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended two) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.

Default behavior is to index the entire token, equivalent to blend_mode = trim_none.

Be aware that using blend modes limits your search, even with the default mode trim_none if you assume . is a blended character:

.dog. will become .dog. dog during indexing
and you won't be able to find it by dog..

Using more modes increases the chance your keyword will match something.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'

POST /cli -d "
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'blend_mode' => 'trim_tail, skip_pure',
            'blend_chars' => '+, &'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'');

utilsApi.sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");

utilsApi.Sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");

table products {
  blend_mode = trim_tail, skip_pure
  blend_chars = +, &

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

min_word_len = length

min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.

Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) min_word_len = '4'

POST /cli -d "
CREATE TABLE products(title text, price float) min_word_len = '4'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_word_len' => '4'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'');

utilsApi.sql("CREATE TABLE products(title text, price float) min_word_len = '4'");

utilsApi.Sql("CREATE TABLE products(title text, price float) min_word_len = '4'");

table products {
  min_word_len = 4

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ngram_len = 1

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.

N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there may be no clear separators between the words. In some cases, you may not want to use dictionary-based segmentation the one available for Chinese. In those cases, n-gram segmentation might work well too.

When this feature is enabled, streams of CJK (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some CJK characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.

Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.

Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram CJK words can mean closer multi-character word matches) to the top.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'cjk',
             'ngram_len' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

table products {
  ngram_chars = cjk
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ngram_chars = cjk

ngram_chars = cjk, U+3000..U+2FA1F

N-gram characters list. Optional, default is empty.

To be used in conjunction with in ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'U+3000..U+2FA1F',
             'ngram_len' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");

table products {
  ngram_chars = U+3000..U+2FA1F
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Also you can use an alias for our default N-gram table as in the example. It should be sufficient in most cases.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'cjk',
             'ngram_len' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

table products {
  ngram_chars = cjk
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ignore_chars = U+AD

Ignored characters list. Optional, default is empty.

Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.

The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'

POST /cli -d "
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'ignore_chars' => 'U+AD'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'');

utilsApi.sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");

table products {
  ignore_chars = U+AD

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

bigram_index = {none|all|first_freq|both_freq}

Bigram indexing mode. Optional, default is none.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

bigram_index controls the selection of specific word pairs. The known modes are:

all, index every single word pair
first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing "alone in the dark" text will result in "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword (either "in" or "the" respectively), but "alone in" would not be indexed, because "in" is a second word in that pair.
both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.

For most use cases, both_freq would be the best mode, but your mileage may vary.

It's important to note that bigram_index works only at the tokenization level and doesn't account for transformations like morphology, wordforms or stopwords. This means the tokens it creates are very straightforward, which makes searching phrases more exact and strict. While this can improve the accuracy of phrase matching, it also makes the system less able to recognize different forms of words or variations in how words appear.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'

POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'bigram_freq_words' => 'the, a, you, i',
            'bigram_index' => 'both_freq'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'');

utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");

utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");

table products {
  bigram_index = both_freq
  bigram_freq_words = the, a, you, i

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

bigram_freq_words = the, a, you, i

A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.

Some of the bigram indexing modes (see bigram_index) require to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated when both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.

bigram_freq_words lets you define a list of such keywords.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'

POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'bigram_freq_words' => 'the, a, you, i',
            'bigram_index' => 'first_freq'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'');

utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");

utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");

table products {
  bigram_freq_words = the, a, you, i
  bigram_index = first_freq

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

dict = {keywords|crc}

The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.

Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.

CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:

Firstly, there's a risk of control sum collisions between different keywords pairs. This risk grows in proportion to the number of unique keywords in the index. Nonetheless, this concern is minor as the probability of a single FNV64 collision in a dictionary of 1 billion entries is roughly 1 in 16, or 6.25 percent. Most dictionaries will have far fewer than a billion keywords given that a typical spoken human language has between 1 and 10 million word forms.
Secondly, and more crucially, it's not straightforward to perform substring searches with control sums. Manticore addressed this issue by pre-indexing all possible substrings as separate keywords (see min_prefix_len, min_infix_len directives). This method even has an added advantage of matching substrings in the fastest way possible. Yet, pre-indexing all substrings significantly increases the index size (often by factors of 3-10x or more) and subsequently affects the indexing time, making substring searches on large indexes rather impractical.

The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.

For substring (infix) searches, extended wildcards can be used. Special characters such as ? and % are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that the wildcards operators and the REGEX only function with dict=keywords.

Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing - yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1..10% percent. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.

In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) dict = 'keywords'

POST /cli -d "
CREATE TABLE products(title text, price float) dict = 'keywords'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'dict' => 'keywords'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'');

utilsApi.sql("CREATE TABLE products(title text, price float) dict = 'keywords'");

utilsApi.Sql("CREATE TABLE products(title text, price float) dict = 'keywords'");

table products {
  dict = keywords

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

embedded_limit = size

Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.

When you create a table the above mentioned files can be either saved externally along with the table or embedded directly into the table. Files sized under embedded_limit get stored into the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by just copying a single file.

With smaller files, such embedding reduces the number of the external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.

‹›

CONFIG

CONFIG

📋

table products {
  embedded_limit = 32K

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

global_idf = /path/to/global.idf

The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).

On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as BM25 family of factors, the results might be ranked slightly differently depending on what cluster node they reside.

The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection documents counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.

IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.

You can build an .idf file using indextool utility, by dumping dictionaries using --dumpdict dict.txt --stats switch first, then converting those to .idf format using --buildidf, then merging all the .idf files across cluster using --mergeidf.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'

POST /cli -d "
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'global_idf' => '/usr/local/manticore/var/global.idf'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'');

utilsApi.sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");

utilsApi.Sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");

table products {
  global_idf = /usr/local/manticore/var/global.idf

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

hitless_words = {all|path/to/file}

Hitless words list. Optional, allowed values are 'all', or a list file name.

By default, Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as hitlist). Hitlists enables phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that can not be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.

hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.

Hitless index will generally use less space than the respective regular full-text index (about 1.5x can be expected). Both indexing and searching should be faster, at a cost of missing positional query and ranking support.

If used in positional queries (e.g. phrase queries) the hitless words are taken out from them and used as operand without a position. For example if "hello" and "world" are hitless and "simon" and "says" are not hitless, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.

A positional query than contains only hitless words will result in an empty phrase node, therefore the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all) only boolean matching can be used on the respective index.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) hitless_words = 'all'

POST /cli -d "
CREATE TABLE products(title text, price float) hitless_words = 'all'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'hitless_words' => 'all'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'');

utilsApi.sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");

utilsApi.Sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");

table products {
  hitless_words = all

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_field_lengths = {0|1}

Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).

When index_field_lengths is set to 1 Manticore will:

create a respective length attribute for every full-text field, sharing the same name but with __len suffix
compute a field length (counted in keywords) for every document and store in to a respective attribute
compute the per-index averages. The lengths attributes will have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and their values are generally accessible.

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There's also support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) index_field_lengths = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) index_field_lengths = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_field_lengths' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");

table products {
  index_field_lengths = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_token_filter = my_lib.so:custom_blend:chars=@#&

Index-time token filter for full-text indexing. Optional, default is empty.

The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'

POST /cli -d "
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_token_filter' => 'my_lib.so:custom_blend:chars=@#&'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");

table products {
  index_token_filter = my_lib.so:custom_blend:chars=@#&

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

overshort_step = {0|1}

Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) overshort_step = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) overshort_step = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'overshort_step' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) overshort_step = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) overshort_step = '1'");

table products {
  overshort_step = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis

Phrase boundary characters list. Optional, default is empty.

This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.

On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.

Phrase boundary condition will be raised if and only if such character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'

POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'phrase_boundary' => '., ?, !, U+2026',
             'phrase_boundary_step' => '10'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'');

utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");

utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");

table products {
  phrase_boundary = ., ?, !, U+2026
  phrase_boundary_step = 10

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

phrase_boundary_step = 100

Phrase boundary word position increment. Optional, default is 0.

On phrase boundary, current word position will be additionally incremented by this number.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'

POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'phrase_boundary_step' => '100',
             'phrase_boundary' => '., ?, !, U+2026'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'');

utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");

utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");

table products {
  phrase_boundary_step = 100
  phrase_boundary = ., ?, !, U+2026

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch

# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regular expressions. The regular expressions engine used by Manticore Search is Google's RE2, which is known for its speed and safety. For detailed information on the syntax supported by RE2, you can visit the RE2 syntax guide.

In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.

Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.

Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing (including exceptions), even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.

‹›

SQL
JSON
PHP
Python
javascript
Java
C#
CONFIG

📋

CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'

POST /cli -d "
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'regexp_filter' => '(blue|red) => color'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'');

utilsApi.sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");

utilsApi.Sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");

table products {
  # index '13"' as '13inch'
  regexp_filter = \b(\d+)\" => \1inch

  # index 'blue' or 'red' as 'color'
  regexp_filter = (blue|red) => color

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

CJK Wildcard searching settings

Data tokenization

Supported languages

Chinese, Japanese and Korean (CJK) languages

Low-level tokenization

Index configuration options

charset_table

blend_chars

blend_mode

min_word_len

ngram_len

ngram_chars

ignore_chars

bigram_index

bigram_freq_words

dict

embedded_limit

global_idf

hitless_words

index_field_lengths

index_token_filter

overshort_step

phrase_boundary

phrase_boundary_step

regexp_filter