Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty. In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the "at&t" text will be tokenized as two keywords "at" and "t" due to lowercase letters. On the other hand, "AT&T" will match exactly and produce a single "AT&T" keyword.

Note that this map-to keyword:

is always interpreted as a single word
is both case and space sensitive

In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". The mapping for "MS Windows" is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-from list will match any other amount of whitespace in the indexed document or query. For instance, the "at & t" map-from token will match "at   &   t" text, whatever the amount of space in both the map-from part and the indexed text. Such text will, therefore, be indexed as a special "at&t" keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able to search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.
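The matching rules described above can be modeled in a few lines of Python. This is only a simplified sketch, not Manticore's tokenizer: the helper names parse_exceptions and lookup are hypothetical, and the sketch compares a whole string against the map-from list instead of scanning a text stream.

```python
def parse_exceptions(text):
    """Parse 'map-from-tokens => map-to-token' lines into a dict.

    Keys are tuples of map-from tokens, so the amount of whitespace
    between tokens is irrelevant. Case is deliberately NOT folded.
    """
    mapping = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue  # skip blank or malformed lines
        src, dst = line.split("=>", 1)
        mapping[tuple(src.split())] = dst.strip()
    return mapping


def lookup(mapping, text):
    """Return the map-to keyword for text, or None if no exception matches."""
    return mapping.get(tuple(text.split()))


SAMPLE = """\
at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
"""

rules = parse_exceptions(SAMPLE)
print(lookup(rules, "at   &   t"))          # at&t
print(lookup(rules, "Standarten Fuehrer"))  # standartenfuhrer
print(lookup(rules, "standarten fuehrer"))  # None (case differs)
print(lookup(rules, "MS Windows"))          # ms windows (one keyword with a space)
print(lookup(rules, "C++"))                 # cplusplus
```

Keying the dictionary on a tuple of tokens is what makes the amount of whitespace irrelevant while keeping case significant.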

With a plain table, you must rotate the table to pick up changes in the exceptions file.

SQL:

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

JavaScript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");

CONFIG:

table products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Morphology

Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and improve segmentation. For example, an English stemmer can normalize "dogs" and "dog" to "dog", resulting in identical search results for both keywords.

Manticore has four built-in morphology preprocessors:

Lemmatizer: reduces a word to its root or lemma. For example, "running" can be reduced to "run" and "octopi" can be reduced to "octopus". Note that some words may have multiple corresponding root forms. For example, "dove" can be either the past tense of "dive" or a noun meaning a bird, as in "A white dove flew over the cuckoo's nest." In this case, a lemmatizer can generate all the possible root forms.
Stemmer: reduces a word to its stem by removing or replacing certain known suffixes. The resulting stem may not necessarily be a valid word. For example, the Porter English stemmer reduces "running" to "run", "business" to "busi" (not a valid word), and does not reduce "octopi" at all.
Phonetic algorithms: replace words with phonetic codes that are the same even if the words are different but phonetically close.
Word breaking algorithms: split text into words. Currently available only for Chinese.
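The "dove" example above, where one word has several possible root forms, can be illustrated with a toy lookup table. This is a hedged sketch only: TOY_LEMMAS and both function names are hypothetical, and taking the first candidate as the "single" root form is an assumption of this sketch, not a statement about how Manticore chooses.

```python
# Hypothetical miniature lemma dictionary, used purely for illustration.
TOY_LEMMAS = {
    "running": ["run"],
    "octopi": ["octopus"],
    "dove": ["dive", "dove"],  # past tense of "dive" or the bird
}


def lemmatize(word):
    """Return one root form (this sketch simply takes the first candidate)."""
    return TOY_LEMMAS.get(word, [word])[0]


def lemmatize_all(word):
    """Return every known root form of the word."""
    return list(TOY_LEMMAS.get(word, [word]))


print(lemmatize("octopi"))    # octopus
print(lemmatize("dove"))      # dive
print(lemmatize_all("dove"))  # ['dive', 'dove']
```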

morphology = morphology1[, morphology2, ...]

The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This is an optional setting, with the default being no preprocessor applied.

Manticore ships with the following built-in morphology preprocessors:

English, Russian, and German lemmatizers
English, Russian, Arabic, and Czech stemmers
SoundEx and MetaPhone phonetic algorithms
Chinese word breaking algorithm
Snowball (libstemmer) stemmers for more than 15 other languages are also available.

Lemmatizers require dictionary .pak files that can be downloaded from the Manticore website. The dictionaries need to be put in the directory specified by lemmatizer_base. Additionally, the lemmatizer_cache setting can be used to speed up lemmatizing by spending more RAM for an uncompressed dictionary cache.

The Chinese language segmentation can be performed using ICU. It provides more precise segmentation compared to n-grams but is slightly slower. The charset_table must include all Chinese characters, which can be done by using the "cjk" alias. When "morphology=icu_chinese" is specified, the documents are first pre-processed by ICU. Then, the result is processed by the tokenizer according to the charset_table, and finally, other morphology processors specified in the "morphology" option are applied. Only those parts of texts that contain Chinese are passed to ICU for segmentation, while others can be modified by different means such as different morphologies or charset_table.

The built-in English and Russian stemmers are faster than their libstemmer counterparts but may produce slightly different results.

The Soundex implementation matches that of MySQL. The Metaphone implementation is based on the Double Metaphone algorithm and indexes the primary code.

To use the morphology option, specify one or multiple of the built-in options, including:

none - do not perform any morphology processing
lemmatize_ru - apply Russian lemmatizer and pick a single root form
lemmatize_uk - apply Ukrainian lemmatizer and pick a single root form (install it first in CentOS or Ubuntu/Debian). For the lemmatizer to work correctly, make sure the Ukrainian-specific characters are preserved in your charset_table, since by default they are not. To do that, override them like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the Ukrainian lemmatizer.
lemmatize_en - apply English lemmatizer and pick a single root form
lemmatize_de - apply German lemmatizer and pick a single root form
lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms
lemmatize_uk_all - apply Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
lemmatize_en_all - apply English lemmatizer and index all possible root forms
lemmatize_de_all - apply German lemmatizer and index all possible root forms
stem_en - apply Porter's English stemmer
stem_ru - apply Porter's Russian stemmer
stem_enru - apply Porter's English and Russian stemmers
stem_cz - apply Czech stemmer
stem_ar - apply Arabic stemmer
soundex - replace keywords with their SOUNDEX code
metaphone - replace keywords with their METAPHONE code
icu_chinese - apply Chinese text segmentation using ICU
libstemmer_* - apply the corresponding libstemmer (Snowball) stemmer. Refer to the list of supported languages for details

Multiple stemmers can be specified, separated by commas. They are applied to incoming words in the order they are listed, and processing stops once one of the stemmers modifies the word. Additionally, when the wordforms feature is enabled, the word is looked up in the wordforms dictionary first. If there is a matching entry in the dictionary, stemmers are not applied at all. This means wordforms can be used to implement stemming exceptions.
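The order of operations described above (wordforms lookup first, then stemmers in the listed order, stopping at the first one that changes the word) can be sketched as follows. The two toy stemmers and the function apply_morphology are hypothetical stand-ins for illustration; they are not the Porter or libstemmer algorithms.

```python
def toy_plural_stemmer(word):
    """Hypothetical stemmer: strips a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def toy_ing_stemmer(word):
    """Hypothetical stemmer: strips a trailing 'ing' from longer words."""
    return word[:-3] if len(word) > 5 and word.endswith("ing") else word


def apply_morphology(word, stemmers, wordforms=None):
    # 1. A wordforms entry wins outright; stemmers are skipped entirely.
    if wordforms and word in wordforms:
        return wordforms[word]
    # 2. Otherwise try the stemmers in order and stop at the first one
    #    that actually modifies the word.
    for stem in stemmers:
        result = stem(word)
        if result != word:
            return result
    return word


chain = [toy_plural_stemmer, toy_ing_stemmer]
print(apply_morphology("walking", chain))   # walk
print(apply_morphology("meetings", chain))  # meeting (second stemmer never runs)
print(apply_morphology("news", chain))      # new
print(apply_morphology("news", chain, wordforms={"news": "news"}))  # news
```

The last two calls show the stemming-exception use: mapping a word to itself in wordforms shields it from the stemmers.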

SQL:

CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

JSON:

POST /cli -d "CREATE TABLE products(title text, price float)  morphology = 'stem_en, libstemmer_sv'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology' => 'stem_en, libstemmer_sv'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

JavaScript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");

CONFIG:

table products {
  morphology = stem_en, libstemmer_sv

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

morphology_skip_fields = field1[, field2, ...]

A list of fields for which morphology preprocessing is skipped. Optional, default is empty (preprocessors are applied to all fields).
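As a rough illustration of the effect, the sketch below stems every field except those named in a skip list. The toy_stem function and the preprocess helper are hypothetical; real preprocessing happens inside Manticore at indexing time.

```python
def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strips a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def preprocess(doc, skip_fields=()):
    """Stem the words of every field unless the field is listed in skip_fields."""
    out = {}
    for field, text in doc.items():
        words = text.split()
        if field not in skip_fields:
            words = [toy_stem(w) for w in words]
        out[field] = words
    return out


doc = {"title": "red shoes", "name": "Woods"}
print(preprocess(doc))                         # {'title': ['red', 'shoe'], 'name': ['Wood']}
print(preprocess(doc, skip_fields=("name",)))  # {'title': ['red', 'shoe'], 'name': ['Woods']}
```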

SQL:

CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

JSON:

POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'name'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology_skip_fields' => 'name',
            'morphology' => 'stem_en'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

JavaScript:

res = await utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");

CONFIG:

table products {
  morphology_skip_fields = name
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_field = name
  rt_attr_uint = price
}

min_stemming_len = length

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect and sometimes produce undesired results. For instance, running the "gps" keyword through the Porter stemmer for English results in "gp", which is not really the intent. The min_stemming_len feature lets you suppress stemming based on the source word length, i.e., to avoid stemming words that are too short. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So, in order to avoid stemming 3-character keywords, you should specify 4 for the value. For more fine-grained control, refer to the wordforms feature.
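The threshold rule can be sketched in a few lines. The toy_stem function is a hypothetical stand-in (it just strips a trailing "s"), and stem_with_threshold is an illustrative name, not a Manticore API.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def stem_with_threshold(word, min_stemming_len=1):
    # Words strictly shorter than the threshold are left untouched;
    # words exactly as long as the threshold are still stemmed.
    if len(word) < min_stemming_len:
        return word
    return toy_stem(word)


print(stem_with_threshold("gps"))      # gp   (default 1: stem everything)
print(stem_with_threshold("gps", 4))   # gps  (3 < 4, stemming suppressed)
print(stem_with_threshold("dogs", 4))  # dog  (4 is not less than 4)
```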

SQL:

CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_stemming_len' => '4',
            'morphology' => 'stem_en'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

JavaScript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");

CONFIG:

table products {
  min_stemming_len = 4
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_exact_words = {0|1}

This option enables indexing of the original keywords along with their morphologically modified versions. However, original keywords that are remapped by wordforms or exceptions cannot be indexed. The default value is 0 (disabled).

This allows the use of the exact form operator in the query language. Enabling this feature will increase the full-text index size and indexing time, but will not impact search performance.
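The trade-off can be seen in a toy inverted index that stores the original word next to its stemmed form. Everything here is an illustrative assumption: toy_stem is a hypothetical stemmer, and keying exact forms with a leading "=" is just a convention of this sketch, not a description of Manticore's on-disk format.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a stemmer: strips a trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_index(docs, index_exact_words=False):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
            if index_exact_words:
                # Also store the unmodified word under a separate key.
                index["=" + word].add(doc_id)
    return index


docs = {1: "dogs bark", 2: "dog barks"}
plain = build_index(docs)
exact = build_index(docs, index_exact_words=True)

print(sorted(plain["dog"]))     # [1, 2]  stemmed lookup finds both documents
print(sorted(exact["=dogs"]))   # [1]     exact lookup finds only the literal form
print(len(plain), len(exact))   # 2 6     the index with exact forms is larger
```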

SQL:

CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_exact_words' => '1',
            'morphology' => 'stem_en'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

JavaScript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");

CONFIG:

table products {
  index_exact_words = 1
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
