Creating a table > NLP and tokenization > Morphology

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The word forms dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the word forms file.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", Some(true)).await;

table products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Word forms support in Manticore is designed to handle large dictionaries well. They moderately affect indexing speed; for example, a dictionary with 1 million entries slows down full-text indexing by about 1.5 times. Searching speed is not affected at all. The additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB word forms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms in UTF-8 encoding, separated by a 'greater than' sign. The rules from the charset_table will be applied when the file is loaded. Therefore, if you do not modify charset_table, your word forms will be case-insensitive, similar to your other full-text indexed data. Below is a sample of the file contents:

‹›

Example

Example

📋

walks > walk
walked > walk
walking > walk

There is a bundled utility called Spelldump that helps you create a dictionary file in a format that Manticore can read. The utility can read from source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.

You can map several source words to a single destination word. The process happens on tokens, not the source text, so differences in whitespace and markup are ignored.

You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology, instead of before (note that only a single source and destination word are supported in this case).

‹›

Example

Example

📋

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

If you need to use >, = or ~ as normal characters, you can escape them by preceding each with a backslash (\). Both > and = should be escaped in this manner. Here's an example:

‹›

Example

Example

📋

a\> > abc
\>b > bcd
c\=\> => cde
\=\>d => def
\=\>a \> f \> => foo
\~g => bar

You can specify multiple destination tokens:

‹›

Example

Example

📋

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

You can specify multiple files, not just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order:

In the RT mode, only absolute paths are allowed.

If multi-byte codepages are used and file names include non-latin characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in multiple files, the latter one is used and overrides previous definitions.

‹›

SQL
Config

📋

create table tbl1 ... wordforms='/tmp/wf*'
create table tbl2 ... wordforms='/tmp/wf, /tmp/wf2'

Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty. In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
\=\>abc\> => abc

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the at&t text will be tokenized as two keywords at and t due to lowercase letters. On the other hand, AT&T will match exactly and produce a single AT&T keyword.

If you need to use > or = as normal characters, you can escape them by preceding each with a backslash (\). Both > and = should be escaped in this manner.

Note that the map-to keywords:

are always interpreted as a single word
are both case and space sensitive

In the above sample, ms windows query will not match the document with MS Windows text. The query will be interpreted as a query for two keywords, ms and windows. The mapping for MS Windows is a single keyword ms windows, with a space in the middle. On the other hand, standartenfuhrer will retrieve documents with Standarten Fuhrer or Standarten Fuehrer contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., staNdarTenfUhreR. (It won't catch standarten fuhrer, however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, the AT & T map-from token will match AT & T text, whatever the amount of space in both map-from part and the indexed text. Such text will, therefore, be indexed as a special AT&T keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat + as a valid character, but still want to be able to search for some exceptions from this rule such as C++. The sample above will do just that, totally independent of what characters are in the table and what are not.

When using a plain table, it is necessary to rotate the table to incorporate changes from the exceptions file. In the case of a real-time table, changes will only apply to new documents.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", Some(true)).await;

table products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Word forms Morphology

Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and improve segmentation. For example, an English stemmer can normalize "dogs" and "dog" to "dog", resulting in identical search results for both keywords.

Manticore has four built-in morphology preprocessors:

Lemmatizer: reduces a word to its root or lemma. For example, "running" can be reduced to "run" and "octopi" can be reduced to "octopus". Note that some words may have multiple corresponding root forms. For example, "dove" can be either the past tense of "dive" or a noun meaning a bird, as in "A white dove flew over the cuckoo's nest." In this case, a lemmatizer can generate all the possible root forms.
Stemmer: reduces a word to its stem by removing or replacing certain known suffixes. The resulting stem may not necessarily be a valid word. For example, the Porter English stemmer reduces "running" to "run", "business" to "busi" (not a valid word), and does not reduce "octopi" at all.
Phonetic algorithms: replace words with phonetic codes that are the same even if the words are different but phonetically close.
Word breaking algorithms: split text into words. Currently available only for Chinese.

morphology = morphology1[, morphology2, ...]

The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This is an optional setting, with the default being no preprocessor applied.

Manticore comes with built-in morphological preprocessors for:

English, Russian, and German lemmatizers
English, Russian, Arabic, and Czech stemmers
SoundEx and MetaPhone phonetic algorithms
Chinese word breaking algorithm
Snowball (libstemmer) stemmers for more than 15 other languages are also available.

Lemmatizers require dictionary .pak files that can be installed using the manticore-language-packs packages or downloaded from the Manticore website. In the latter case the dictionaries need to be put in the directory specified by lemmatizer_base.

Additionally, the lemmatizer_cache setting can be used to speed up lemmatizing by spending more RAM for an uncompressed dictionary cache.

The Chinese language segmentation can be done using ICU or Jieba (requires package manticore-language-packs). Both libraries provide more accurate segmentation than n-grams, but are slightly slower. The charset_table must include all Chinese characters, which can be done using the cont, cjk or chinese character sets. When you set morphology=icu_chinese or morphology=jieba_chinese, the documents are first pre-processed by ICU or Jieba. Then, the tokenizer processes the result according to the charset_table, and finally, other morphology processors from the morphology option are applied. Only those parts of the text that contain Chinese are passed to ICU/Jieba for segmentation, while the other parts can be modified by different means such as different morphologies or charset_table.

Built-in English and Russian stemmers are faster than their libstemmer counterparts but may produce slightly different results

Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.

To use the morphology option, specify one or multiple of the built-in options, including:

none: do not perform any morphology processing
lemmatize_ru - apply Russian lemmatizer and pick a single root form
lemmatize_uk - apply Ukrainian lemmatizer and pick a single root form (install it first in Centos or Ubuntu/Debian). For correct work of the lemmatizer make sure specific Ukrainian characters are preserved in your charset_table since by default they are not. For that override them, like this: charset_table='non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the urkainian lemmatizer.
lemmatize_en - apply English lemmatizer and pick a single root form
lemmatize_de - apply German lemmatizer and pick a single root form
lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms
lemmatize_uk_all - apply Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
lemmatize_en_all - apply English lemmatizer and index all possible root forms
lemmatize_de_all - apply German lemmatizer and index all possible root forms
stem_en - apply Porter's English stemmer
stem_ru - apply Porter's Russian stemmer
stem_enru - apply Porter's English and Russian stemmers
stem_cz - apply Czech stemmer
stem_ar - apply Arabic stemmer
soundex - replace keywords with their SOUNDEX code
metaphone - replace keywords with their METAPHONE code
icu_chinese - apply Chinese text segmentation using ICU
jieba_chinese - apply Chinese text segmentation using Jieba (requires package manticore-language-packs)
libstemmer_* . Refer to the list of supported languages for details

Multiple stemmers can be specified, separated by commas. They will be applied to incoming words in the order they are listed, and the processing will stop once one of the stemmers modifies the word. Additionally, when wordforms feature is enabled, the word will be looked up in the word forms dictionary first. If there is a matching entry in the dictionary, stemmers will not be applied at all. wordforms сan be used to implement stemming exceptions.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

POST /cli -d "CREATE TABLE products(title text, price float)  morphology = 'stem_en, libstemmer_sv'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology' => 'stem_en, libstemmer_sv'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", Some(true)).await;

table products {
  morphology = stem_en, libstemmer_sv

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

morphology_skip_fields = field1[, field2, ...]

A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology_skip_fields' => 'name',
            'morphology' => 'stem_en'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", Some(true)).await;

table products {
  morphology_skip_fields = name
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_field = name
  rt_attr_uint = price
}

min_stemming_len = length

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect, and might sometimes produce undesired results. For instance, running "gps" keyword through Porter stemmer for English results in "gp", which is not really the intent. min_stemming_len feature lets you suppress stemming based on the source word length, ie. to avoid stemming too short words. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For more finely grained control, refer to wordforms feature.

‹›

SQL
JSON
PHP
Python
Python-asycnio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'min_stemming_len' => '4',
             'morphology' => 'stem_en'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');

utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", true);

utils_api.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", Some(true)).await;

table products {
  min_stemming_len = 4
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_exact_words = {0|1}

This option allows for the indexing of original keywords along with their morphologically modified versions. However, original keywords that are remapped by the wordforms and exceptions cannot be indexed. The default value is 0, indicating that this feature is disabled by default.

This allows the use of the exact form operator in the query language. Enabling this feature will increase the full-text index size and indexing time, but will not impact search performance.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'index_exact_words' => '1',
             'morphology' => 'stem_en'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", true);

utils_api.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", Some(true)).await;

table products {
  index_exact_words = 1
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

jieba_hmm = {0|1}

Enable or disable HMM in the Jieba segmentation tool. Optional; the default is 1.

In Jieba, the HMM (Hidden Markov Model) option refers to an algorithm used for word segmentation. Specifically, it allows Jieba to perform Chinese word segmentation by recognizing unknown words, especially those not present in its dictionary.

Jieba primarily uses a dictionary-based method for segmenting known words, but when the HMM option is enabled, it applies a statistical model to identify probable word boundaries for words or phrases that are not in its dictionary. This is particularly useful for segmenting new or rare words, names, and slang.

In summary, the jieba_hmm option helps improve segmentation accuracy at the expense of indexing performance. It must be used with morphology = jieba_chinese, see Chinese, Japanese and Korean (CJK) and Thai languages.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'

POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'morphology' => 'jieba_chinese',
             'jieba_hmm'='1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", Some(true)).await;

table products {
  morphology = jieba_chinese
  jieba_hmm = 0

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

jieba_mode = {accurate|full|search}

Jieba segmentation mode. Optional; the default is accurate.

In accurate mode, Jieba splits the sentence into the most precise words using dictionary matching. This mode focuses on precision, ensuring that the segmentation is as accurate as possible.

In full mode, Jieba tries to split the sentence into every possible word combination, aiming to include all potential words. This mode focuses on maximizing recall, meaning it identifies as many words as possible, even if some of them overlap or are less commonly used. It returns all the words found in its dictionary.

In search mode, Jieba breaks the text into both whole words and smaller parts, combining precise segmentation with extra detail by providing overlapping word fragments. This mode balances precision and recall, making it useful for search engines.

jieba_mode should be used with morphology = jieba_chinese. See Chinese, Japanese, Korean (CJK) and Thai languages.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'

POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'morphology' => 'jieba_chinese',
             'jieba_mode'='full'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", Some(true)).await;

table products {
  morphology = jieba_chinese
  jieba_mode = full

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

jieba_user_dict_path = path/to/stopwords/file

Path to the Jieba user dictionary. Optional.

Jieba, a Chinese text segmentation library, uses dictionary files to assist with word segmentation. The format of these dictionary files is as follows: each line contains a word, split into three parts separated by spaces — the word itself, word frequency, and part of speech (POS) tag. The word frequency and POS tag are optional and can be omitted. The dictionary file must be UTF-8 encoded.

Example:

创新办 3 i
云计算 5
凱特琳 nz
台中

jieba_user_dict_path should be used with morphology = jieba_chinese. For more details, see Chinese, Japanese, Korean (CJK), and Thai languages.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'morphology' => 'jieba_chinese',
             'jieba_user_dict_path' = '/usr/local/manticore/data/user-dict.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", Some(true)).await;

table products {
  morphology = jieba_chinese
  jieba_user_dict_path = /usr/local/manticore/data/user-dict.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Exceptions Advanced HTML tokenization

html_strip = {0|1}

This option determines whether HTML markup should be stripped from the incoming full-text data. The default value is 0, which disables stripping. To enable stripping, set the value to 1.

HTML tags and entities are considered as markup and will be processed.

HTML tags are removed, while the contents between them (e.g. everything between <p> and </p>) are left intact. You can choose to keep and index tag attributes (e.g. HREF attribute in an A tag or ALT in an IMG tag). Some well-known inline tags, such as A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, and TT, are completely removed. All other tags are treated as block level and are replaced with whitespace. For example, the text te<strong>st</strong> will be indexed as a single keyword 'test', while te<p>st</p> will be indexed as two keywords 'te' and 'st'.

HTML entities are decoded and replaced with their corresponding UTF-8 characters. The stripper supports both numeric forms (e.g. ï) and text forms (e.g. ó or  ) of entities, and supports all entities specified by the HTML4 standard.

The stripper is designed to work with properly formed HTML and XHTML, but may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Please note that only the tags themselves, as well as HTML comments, are stripped. To strip the contents of the tags, including embedded scripts, see the html_remove_elements option. There are no restrictions on tag names, meaning that everything that looks like a valid tag start, end, or comment will be stripped.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) html_strip = '1'");

utils_api.sql("CREATE TABLE products(title text, price float) html_strip = '1'", Some(true)).await;

table products {
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_index_attrs = img=alt,title; a=title;

The html_index_attrs option allows you to specify which HTML markup attributes should be indexed even though other HTML markup is stripped. The default value is empty, meaning no attributes will be indexed. The format of the option is a per-tag enumeration of indexable attributes, as demonstrated in the example above. The contents of the specified attributes will be retained and indexed, providing a way to extract additional information from your full-text data.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_index_attrs' => 'img=alt,title; a=title;',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");

utils_api.sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'", Some(true)).await;

table products {
  html_index_attrs = img=alt,title; a=title;
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_remove_elements = element1[, element2, ...]

A list of HTML elements whose contents, along with the elements themselves, will be stripped. Optional, the default is an empty string (do not strip contents of any elements).

This option allows you to remove the contents of elements, meaning everything between the opening and closing tags. It is useful for removing embedded scripts, CSS, etc. The short tag form for empty elements (e.g.
) is properly supported, and the text following such a tag will not be removed.

The value is a comma-separated list of element (tag) names, the contents of which should be removed. Tag names are case-insensitive.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_remove_elements' => 'style, script',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");

utils_api.sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'", Some(true)).await;

table products {
  html_remove_elements = style, script
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_sp = {0|1}

Controls detection and indexing of sentence and paragraph boundaries. Optional, default is 0 (no detection or indexing).

This directive enables the detection and indexing of sentence and paragraph boundaries, making it possible for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis, and only requires setting index_sp = 1 to enable it. Paragraph detection, however, relies on HTML markup and occurs during the HTML stripping process. As such, to index paragraph boundaries, both the index_sp directive and the html_strip directive must be set to 1.

The following rules are used to determine sentence boundaries:

Question marks (?) and exclamation marks (!) always indicate a sentence boundary.
Trailing dots (.) indicate a sentence boundary, except in the following cases:
- When followed by a letter. This is considered part of an abbreviation (e.g. "S.T.A.L.K.E.R." or "Goldman Sachs S.p.A.").
- When followed by a comma. This is considered an abbreviation followed by a comma (e.g. "Telecom Italia S.p.A., founded in 1994").
- When followed by a space and a lowercase letter. This is considered an abbreviation within a sentence (e.g. "News Corp. announced in February").
- When preceded by a space and an uppercase letter, and followed by a space. This is considered a middle initial (e.g. "John D. Doe").

Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_sp' => '1',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'", Some(true)).await;

table products {
  index_sp = 1
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_zones = h*, th, title

A list of HTML/XML zones within a field to be indexed. The default is an empty string (no zones will be indexed).

A "zone" is defined as everything between an opening and a matching closing tag, and all spans sharing the same tag name are referred to as a "zone." For example, everything between <H1> and </H1> in a document field belongs to the H1 zone.

The index_zones directive enables zone indexing, but the HTML stripper must also be enabled (by setting html_strip = 1). The value of index_zones should be a comma-separated list of tag names and wildcards (ending with a star) to be indexed as zones.

Zones can be nested and overlap, as long as every opening tag has a matching tag. Zones can also be used for matching with the ZONE operator, as described in the extended_query_syntax.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_zones' => 'h*,th,title',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'", Some(true)).await;

table products {
  index_zones = h*, th, title
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Morphology Creating a distributed table

Word forms

wordforms

Exceptions

exceptions

Advanced morphology

morphology

morphology_skip_fields

min_stemming_len

index_exact_words

jieba_hmm

jieba_mode

jieba_user_dict_path

Advanced HTML tokenization

Stripping HTML tags

html_strip

html_index_attrs

html_remove_elements

Extracting important parts from HTML

index_sp

index_zones