Creating a table > NLP and tokenization > Wordforms

Wildcard searching is a common text search type. In Manticore, it is performed at the dictionary level. By default, both plain and RT tables use a dictionary type called dict. In this mode, words are stored as they are, so enabling wildcarding does not affect the size of the table. When a wildcard search is performed, the dictionary is searched to find all possible expansions of the wildcarded word. This expansion can be problematic in terms of computation at query time when the expanded word provides many expansions or expansions that have huge hitlists, especially in the case of infixes where the wildcard is added at the start and end of the word. To avoid such problems, the expansion_limit can be used.

min_prefix_len = length

This setting determines the minimum word prefix length to index and search. By default, it is set to 0, meaning prefixes are not allowed.

Prefixes allow for wildcard searching by wordstart* wildcards.

For example, if the word "example" is indexed with min_prefix_len=3, it can be found by searching for "exa", "exam", "examp", "exampl", as well as the full word.

Note that with dict=crc min_prefix_len will affect the size of the full-text index since each word expansion will be stored additionally.

Manticore can differentiate perfect word matches from prefix matches and rank the former higher if the following conditions are met:

dict=keywords (on by default)
index_exact_words=1 (off by default),
expand_keywords=1 (also off by default)

Note that with either dict=crc mode or any of the above options disabled, it is not possible to differentiate between prefixes and full words, and perfect word matches cannot be ranked higher.

When the minimum infix length is set to a positive number, the minimum prefix length is always considered 1.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) min_prefix_len = '3'

POST /cli -d "
CREATE TABLE products(title text, price float) min_prefix_len = '3'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_prefix_len' => '3'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'');

utilsApi.sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'", true);

utils_api.sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'", Some(true)).await;

table products {
  min_prefix_len = 3

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

min_infix_len = length

The min_infix_len setting determines the minimum length of an infix prefix to index and search. It is optional and its default value is 0, which means that infixes are not allowed. The minimum allowed non-zero value is 2.

When enabled, infixes allow for wildcard searching with term patterns like start*, *end, *middle*, , and so on. It also allows you to disable too short wildcards if they are too expensive to search for.

If the following conditions are met, Manticore can differentiate perfect word matches from infix matches and rank the former higher:

dict=keywords (on by default)
index_exact_words=1 (off by default),
expand_keywords=1 (also off by default)

Note that with the dict=crc mode or any of the above options disabled, there is no way to differentiate between infixes and full words, and thus perfect word matches cannot be ranked higher.

Infix wildcard search query time can vary greatly, depending on how many keywords the substring will actually expand to. Short and frequent syllables like *in* or *ti* might expand to way too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches, you would set min_infix_len to 2. To limit the impact from wildcard searches with too short wildcards, you might set it higher.

Infixes must be at least 2 characters long, and wildcards like *a* are not allowed for performance reasons.

When min_infix_len is set to a positive number, the minimum prefix length is considered 1. For dict word infixing and prefixing cannot be both enabled at the same time. For dict and other fields to have prefixes declared with prefix_fields, it is forbidden to declare the same field in both lists.

If dict=keywords, besides the wildcard * two other wildcard characters can be used:

? can match any (one) character: t?st will match test, but not teast
% can match zero or one character: tes% will match tes or test, but not testing

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) min_infix_len = '3'

POST /cli -d "
CREATE TABLE products(title text, price float) min_infix_len = '3'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_infix_len' => '3'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'');

utilsApi.sql("CREATE TABLE products(title text, price float) min_infix_len = '3'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) min_infix_len = '3'", true);

utils_api.sql("CREATE TABLE products(title text, price float) min_infix_len = '3'", Some(true)).await;

table products {
  min_infix_len = 3

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

prefix_fields = field1[, field2, ...]

The prefix_fields setting is used to limit prefix indexing to specific full-text fields in dict=crc mode. By default, all fields are indexed in prefix mode, but because prefix indexing can affect both indexing and searching performance, it may be desired to limit it to certain fields.

To limit prefix indexing to specific fields, use the prefix_fields setting followed by a comma-separated list of field names. If prefix_fields is not set, then all fields will be indexed in prefix mode.

‹›

CONFIG

CONFIG

📋

table products {
  prefix_fields = title, name
  min_prefix_len = 3
  dict = crc

infix_fields = field1[, field2, ...]

The infix_fields setting allows you to specify a list of full-text fields to limit infix indexing to. This applies to dict=crc only and is optional, with the default being to index all fields in infix mode. This setting is similar to prefix_fields, but instead allows you to limit infix indexing to specific fields.

‹›

CONFIG

CONFIG

📋

table products {
  infix_fields = title, name
  min_infix_len = 3
  dict = crc

max_substring_len = length

The max_substring_len directive sets the maximum substring length to be indexed for either prefix or infix searches. This setting is optional, and its default value is 0 (which means that all possible substrings are indexed). It only applies to dict.

By default, substring indexing in dict indexes all possible substrings as separate keywords, which can result in an overly large full-text index. Therefore, the max_substring_len directive allows you to skip too-long substrings that will probably never be searched for.

For example, a test table of 10,000 blog posts takes up a different amount of disk space depending on the settings:

6.4 MB baseline (no substrings)
24.3 MB (3.8x) with min_prefix_len = 3
22.2 MB (3.5x) with min_prefix_len = 3, max_substring_len = 8
19.3 MB (3.0x) with min_prefix_len = 3, max_substring_len = 6
94.3 MB (14.7x) with min_infix_len = 3
84.6 MB (13.2x) with min_infix_len = 3, max_substring_len = 8
70.7 MB (11.0x) with min_infix_len = 3, max_substring_len = 6

Therefore, limiting the max substring length can save 10-15% of the table size.

When using dict=keywords mode, there is no performance impact associated with substring length. Therefore, this directive is not applicable and is intentionally forbidden in that case. However, if required, you can still limit the length of a substring that you search for in the application code.

‹›

CONFIG

CONFIG

📋

table products {
  max_substring_len = 12
  min_infix_len = 3
  dict = crc

expand_keywords = {0|1|exact|star}

This setting expands keywords with their exact forms and/or with stars when possible. The supported values are:

1 - expand to both the exact form and the form with the stars. For instance,running will become (running | *running* | =running)
exact - - augment the keyword with only its exact form. For instance, running will become (running | =running)
star - augment the keyword by adding * around it. For instance, running will become (running | *running*) This setting is optional, and the default value is 0 (keywords are not expanded).

Queries against tables with expand_keywords feature enabled are internally expanded as follows: if the table was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of the keyword itself and a respective prefix or infix (keyword with stars). If the table was built with both stemming and index_exact_words enabled, exact form is also added.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) expand_keywords = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) expand_keywords = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'expand_keywords' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) expand_keywords = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) expand_keywords = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) expand_keywords = '1'", Some(true)).await;

table products {
  expand_keywords = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Expanded queries take naturally longer to complete, but can possibly improve the search quality, as the documents with exact form matches should be ranked generally higher than documents with stemmed or infix matches.

Note that the existing query syntax does not allow to emulate this kind of expansion, because internal expansion works on keyword level and expands keywords within phrase or quorum operators too (which is not possible through the query syntax). Take a look at the examples and how expand_keywords affects the search result weights and how "runsy" is found by "runs" w/o the need to add a star:

‹›

expand_keywords_enabled
expand_keywords_disabled

📋

mysql> create table t(f text) min_infix_len='2' expand_keywords='1' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    2 | runs    |     1560 |
|    1 | running |     1500 |
|    3 | runsy   |     1500 |
+------+---------+----------+
3 rows in set (0.01 sec)

mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t(f text) min_infix_len='2' expand_keywords='exact' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1590 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

This directive does not affect indexer in any way, it only affects searchd.

expansion_limit = number

Maximum number of expanded keywords for a single wildcard. Details are here.

Ignoring stop words

Stop words are words that are ignored during indexing and searching, typically due to their high frequency and low value to search results.

Manticore Search applies stemming to stop words by default, which can lead to undesired results, but this can be turned off using the stopwords_unstemmed.

Small stop word files are stored in the table header, and there is a limit to the size of files that can be embedded, as defined by the embedded_limit option.

Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.

stopwords=path/to/stopwords/file[ path/to/another/file ...]

The stopwords setting is optional and by default empty. It allows you to specify the path to one or more stop word files, separated by spaces. All the files will be loaded. In the real-time mode, only absolute paths are allowed.

The stop word file format is simple plain text with UTF-8 encoding. The file data will be tokenized with respect to the charset_table settings, so you can use the same separators as in the indexed data.

Stop word files can be created manually or semi-automatically. The indexer provides a mode that creates a frequency dictionary of the table, sorted by the keyword frequency. Top keywords from that dictionary can usually be used as stop words. See --buildstops and --buildfreqs switch for details. Top keywords from that dictionary can usually be used as stop words.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'", Some(true)).await;

table products {
  stopwords = /usr/local/manticore/data/stopwords.txt
  stopwords = stopwords-ru.txt stopwords-en.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:

af - Afrikaans
ar - Arabic
bg - Bulgarian
bn - Bengali
ca - Catalan
ckb- Kurdish
cz - Czech
da - Danish
de - German
el - Greek
en - English
eo - Esperanto
es - Spain
et - Estonian
eu - Basque
fa - Persian
fi - Finnish
fr - French
ga - Irish
gl - Galician
hi - Hindi
he - Hebrew
hr - Croatian
hu - Hungarian
hy - Armenian
id - Indonesian
it - Italian
ja - Japanese
ko - Korean
la - Latin
lt - Lithuanian
lv - Latvian
mr - Marathi
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sl - Slovenian
so - Somali
st - Sotho
sv - Swedish
sw - Swahili
th - Thai
tr - Turkish
yo - Yoruba
zh - Chinese
zu - Zulu

For example, to use stop words for Italian language just put the following line in your config file:

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) stopwords = 'it'

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'it'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'it'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'');

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'it'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = 'it'", true);

utils_api.sql("CREATE TABLE products(title text, price float) stopwords = 'it'", Some(true)).await;

table products {
  stopwords = it

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you need to use stop words for multiple languages you should list all their aliases, separated with commas (RT mode) or spaces (plain mode):

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'');

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'", true);

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'", true);

utils_api.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'", Some(true)).await;

table products {
  stopwords = en it ru

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

stopword_step={0|1}

The position_increment setting on stopwords is optional, and the allowed values are 0 and 1, with the default being 1.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru',
            'stopword_step' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'", true);

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'", true);

utils_api.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'", Some(true)).await;

table products {
  stopwords = en
  stopword_step = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

stopwords_unstemmed={0|1}

Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).

By default, stop words are stemmed themselves, and then applied to tokens after stemming (or any other morphology processing). This means that a token is stopped when stem(token) is equal to stem(stopword). This default behavior can lead to unexpected results when a token is erroneously stemmed to a stopped root. For example, "Andes" might get stemmed to "and", so when "and" is a stopword, "Andes" is also skipped.

However, you can change this behavior by enabling the stopwords_unstemmed directive. When this is enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when the token is equal to the stopword.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru',
            'stopwords_unstemmed' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'", true);

utils_api.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'", Some(true)).await;

table products {
  stopwords = en
  stopwords_unstemmed = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Wildcard searching settings Word forms

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The word forms dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the word forms file.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", Some(true)).await;

table products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Word forms support in Manticore is designed to handle large dictionaries well. They moderately affect indexing speed; for example, a dictionary with 1 million entries slows down full-text indexing by about 1.5 times. Searching speed is not affected at all. The additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB word forms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms in UTF-8 encoding, separated by a 'greater than' sign. The rules from the charset_table will be applied when the file is loaded. Therefore, if you do not modify charset_table, your word forms will be case-insensitive, similar to your other full-text indexed data. Below is a sample of the file contents:

‹›

Example

Example

📋

walks > walk
walked > walk
walking > walk

There is a bundled utility called Spelldump that helps you create a dictionary file in a format that Manticore can read. The utility can read from source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.

You can map several source words to a single destination word. The process happens on tokens, not the source text, so differences in whitespace and markup are ignored.

You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology, instead of before (note that only a single source and destination word are supported in this case).

‹›

Example

Example

📋

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

If you need to use >, = or ~ as normal characters, you can escape them by preceding each with a backslash (\). Both > and = should be escaped in this manner. Here's an example:

‹›

Example

Example

📋

a\> > abc
\>b > bcd
c\=\> => cde
\=\>d => def
\=\>a \> f \> => foo
\~g => bar

You can specify multiple destination tokens:

‹›

Example

Example

📋

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

You can specify multiple files, not just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order:

In the RT mode, only absolute paths are allowed.

If multi-byte codepages are used and file names include non-latin characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in multiple files, the latter one is used and overrides previous definitions.

‹›

SQL
Config

📋

create table tbl1 ... wordforms='/tmp/wf*'
create table tbl2 ... wordforms='/tmp/wf, /tmp/wf2'

Ignoring stop words Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty. In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
\=\>abc\> => abc

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the at&t text will be tokenized as two keywords at and t due to lowercase letters. On the other hand, AT&T will match exactly and produce a single AT&T keyword.

If you need to use > or = as normal characters, you can escape them by preceding each with a backslash (\). Both > and = should be escaped in this manner.

Note that the map-to keywords:

are always interpreted as a single word
are both case and space sensitive

In the above sample, ms windows query will not match the document with MS Windows text. The query will be interpreted as a query for two keywords, ms and windows. The mapping for MS Windows is a single keyword ms windows, with a space in the middle. On the other hand, standartenfuhrer will retrieve documents with Standarten Fuhrer or Standarten Fuehrer contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., staNdarTenfUhreR. (It won't catch standarten fuhrer, however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, the AT & T map-from token will match AT & T text, whatever the amount of space in both map-from part and the indexed text. Such text will, therefore, be indexed as a special AT&T keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat + as a valid character, but still want to be able to search for some exceptions from this rule such as C++. The sample above will do just that, totally independent of what characters are in the table and what are not.

When using a plain table, it is necessary to rotate the table to incorporate changes from the exceptions file. In the case of a real-time table, changes will only apply to new documents.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", Some(true)).await;

table products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Word forms Morphology