Wildcard searching settings

Wildcard searching is a common text search type. In Manticore it is performed at the dictionary level. By default, both plain and RT indexes use the keywords dictionary type (dict=keywords). In this mode words are stored as they are, so enabling wildcard search does not affect the size of the index. When a wildcard search is performed, the dictionary is looked up to find all possible expansions of the wildcarded word. This expansion can be computationally expensive at query time when the wildcarded word has many expansions or when the expansions have huge hitlists. The penalty is higher for infixes, where a wildcard is added at both the start and the end of the word. Use expansion_limit to avoid such problems.

min_prefix_len = length

Minimum word prefix length to index and search. Optional, default is 0 (do not allow prefixes).

Prefix indexing lets you implement wildcard searching with wordstart* wildcards.

For instance, if you index the word "example" with min_prefix_len=3, you will be able to find it by the prefixes "exa", "exam", "examp", and "exampl", as well as by the word itself.
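The prefix rule can be illustrated with a few lines of Python. This is only a sketch of the behavior described above, not Manticore's indexing code; the function name `searchable_prefixes` is hypothetical.

```python
def searchable_prefixes(word: str, min_prefix_len: int) -> list:
    """List the strings a word can be found by when prefix indexing is on.

    Illustrative model only: prefixes shorter than min_prefix_len are
    skipped, and the full word is always included.
    """
    if min_prefix_len <= 0:
        return [word]  # 0 means prefixes are not allowed at all
    prefixes = [word[:n] for n in range(min_prefix_len, len(word))]
    return prefixes + [word]


print(searchable_prefixes("example", 3))
```

With min_prefix_len=3 the list starts at "exa", so a query such as ex* falls below the minimum length in this model.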

Be aware that with dict=crc, min_prefix_len also affects index size, because each word expansion is stored additionally.

Manticore can differentiate perfect word matches from prefix matches and rank the former higher if the following conditions are met:

dict=keywords (on by default)
index_exact_words=1 (off by default)
expand_keywords=1 (also off by default)

Note that with dict=crc mode, or with any of the above options disabled, there is no way to differentiate between prefixes and full words, so perfect word matches can't be ranked higher.

When minimum infix length is set to a positive number, minimum prefix length is always considered 1.

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) min_prefix_len = '3'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) min_prefix_len = '3'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_prefix_len' => '3'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_prefix_len = \'3\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_prefix_len = \'3\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) min_prefix_len = '3'");

index products {
  min_prefix_len = 3

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

min_infix_len = length

Minimum infix length to index and search. Optional, default is 0 (do not allow infixes), and the minimum allowed non-zero value is 2.

The infix length setting enables wildcard searches with term patterns like start*, *end, *middle*, and so on. It also lets you disable wildcards that are too short if those are too expensive to search for.

Manticore can differentiate perfect word matches from infix matches and rank the former higher if the following conditions are met:

dict=keywords (on by default)
index_exact_words=1 (off by default)
expand_keywords=1 (also off by default)

Note that with dict=crc mode, or with any of the above options disabled, there is no way to differentiate between infixes and full words, so perfect word matches can't be ranked higher.

Infix wildcard search query time can vary greatly, depending on how many keywords the substring actually expands to. Short and frequent syllables like *in* or *ti* might expand to far too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches you would set min_infix_len to 2; to limit the impact of wildcard searches with wildcards that are too short, you might set it higher.

Infixes must be at least 2 characters long; wildcards like *a* are not allowed for performance reasons.

When minimum infix length is set to a positive number, minimum prefix length is considered 1. For dict=keywords, word infixing and prefixing cannot both be enabled at the same time. For dict=crc, it is possible to specify only some fields to have infixes, declared with infix_fields, and other fields to have prefixes, declared with prefix_fields, but it is forbidden to declare the same field in both lists.

With dict=keywords, besides the wildcard *, two other wildcard characters can be used:

? matches any single character: t?st will match test, but not teast
% matches zero or one character: tes% will match tes or test, but not testing
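One way to reason about the three wildcard characters is to translate a pattern into a regular expression. The sketch below is a hypothetical helper for experimenting in application code, not how Manticore evaluates wildcards, and it assumes that * matches any run of zero or more characters.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate *, ? and % wildcards into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # assumption: any number of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "%":
            parts.append(".?")   # zero or one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_matches(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


for pattern, word in [("t?st", "test"), ("t?st", "teast"), ("tes%", "testing")]:
    print(pattern, word, wildcard_matches(pattern, word))
```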

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) min_infix_len = '3'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) min_infix_len = '3'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_infix_len' => '3'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_infix_len = \'3\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_infix_len = \'3\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) min_infix_len = '3'");

index products {
  min_infix_len = 3

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

prefix_fields = field1[, field2, ...]

List of full-text fields to limit prefix indexing to. Applies to dict=crc only. Optional, default is empty (index all fields in prefix mode).

Because prefix indexing impacts both indexing and searching performance, it might be desired to limit it to specific full-text fields only: for instance, to provide prefix searching through URLs, but not through page contents. prefix_fields specifies what fields will be prefix-indexed; all other fields will be indexed in normal mode. The value format is a comma-separated list of field names.

Example (CONFIG):

index products {
  prefix_fields = title, name
  min_prefix_len = 3
  dict = crc
}

infix_fields = field1[, field2, ...]

The list of full-text fields to limit infix indexing to. Applies to dict=crc only. Optional, default is empty (index all fields in infix mode).

Similar to prefix_fields, but lets you limit infix-indexing to given fields.

Example (CONFIG):

index products {
  infix_fields = title, name
  min_infix_len = 3
  dict = crc
}

max_substring_len = length

Maximum substring (either prefix or infix) length to index. Optional, default is 0 (do not limit indexed substrings). Applies to dict=crc only.

By default, substring (either prefix or infix) indexing in dict=crc mode indexes all possible substrings as separate keywords. That might result in an overly large index. This directive lets you limit the impact of substring indexing by skipping substrings that are too long (which, chances are, will never be searched for anyway).
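To get a feel for how many extra keywords a single word can produce, the following sketch enumerates infixes under a minimum and maximum length. It is a simplified model that assumes the full word is always kept; the function name `indexed_substrings` is hypothetical and the real indexer may differ in details.

```python
def indexed_substrings(word: str, min_infix_len: int, max_substring_len: int = 0) -> set:
    """Enumerate the substrings of a word that would become separate keywords.

    Simplified model: every substring of at least min_infix_len characters
    is kept, capped at max_substring_len when it is non-zero.
    """
    limit = max_substring_len or len(word)  # 0 means no upper limit
    substrings = {word}                     # assumption: the full word is always indexed
    for start in range(len(word)):
        longest = min(limit, len(word) - start)
        for length in range(min_infix_len, longest + 1):
            substrings.add(word[start:start + length])
    return substrings


print(len(indexed_substrings("example", 3)))      # all infixes of 3+ characters
print(len(indexed_substrings("example", 3, 4)))   # capped at 4 characters
```

Even a 7-letter word yields more than a dozen keywords without a cap, which is why longer words dominate the size increase.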

For example, a test index of 10,000 blog posts takes this much disk space depending on the settings:

6.4 MB baseline (no substrings)
24.3 MB (3.8x) with min_prefix_len = 3
22.2 MB (3.5x) with min_prefix_len = 3, max_substring_len = 8
19.3 MB (3.0x) with min_prefix_len = 3, max_substring_len = 6
94.3 MB (14.7x) with min_infix_len = 3
84.6 MB (13.2x) with min_infix_len = 3, max_substring_len = 8
70.7 MB (11.0x) with min_infix_len = 3, max_substring_len = 6

So in this test, limiting the maximum substring length saved roughly 9-25% of the index size, depending on the limit.
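The relative savings can be recomputed directly from the sizes listed above. The snippet only does arithmetic on those published figures; the helper name `saving_percent` is made up for this illustration.

```python
def saving_percent(before_mb: float, after_mb: float) -> float:
    """Percentage of index size saved by going from before_mb to after_mb."""
    return round(100 * (before_mb - after_mb) / before_mb, 1)


# Sizes in MB taken from the test index figures above.
prefix_sizes = {"no limit": 24.3, "max_substring_len = 8": 22.2, "max_substring_len = 6": 19.3}
infix_sizes = {"no limit": 94.3, "max_substring_len = 8": 84.6, "max_substring_len = 6": 70.7}

for label, sizes in (("min_prefix_len = 3", prefix_sizes), ("min_infix_len = 3", infix_sizes)):
    base = sizes["no limit"]
    for setting, size in sizes.items():
        if setting != "no limit":
            print(f"{label}, {setting}: {saving_percent(base, size)}% smaller")
```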

There is no performance impact associated with substring length when using dict=keywords mode, so this directive is not applicable and intentionally forbidden in that case. If required, you can still limit the length of a substring that you search for in the application code.

Example (CONFIG):

index products {
  max_substring_len = 12
  min_infix_len = 3
  dict = crc
}

expand_keywords = {0|1|exact|star}

Expands keywords with their exact forms (i.e. the forms of the keywords before applying any morphological modifications) and/or stars when possible. The supported values are:

1 - expand to both the exact form and the form with stars. running will become (running | *running* | =running)
exact - augment the keyword with only its exact form. running will become (running | =running)
star - augment the keyword by adding * around it. running will become (running | *running*)

Optional, default is 0 (do not expand keywords).

Queries against indexes with expand_keywords feature enabled are internally expanded as follows: if the index was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of the keyword itself and a respective prefix or infix (keyword with stars). If the index was built with both stemming and index_exact_words enabled, exact form is also added.
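The rewriting described above can be mimicked with a small string helper. This is only a sketch of the resulting query text for each expand_keywords value; the real expansion happens inside searchd and also depends on the prefix, infix, and index_exact_words settings. The function name `expand_keyword` is hypothetical.

```python
def expand_keyword(keyword: str, mode: str) -> str:
    """Build the expanded form of a keyword for a given expand_keywords value."""
    variants = [keyword]
    if mode in ("1", "star"):
        variants.append(f"*{keyword}*")   # form with stars
    if mode in ("1", "exact"):
        variants.append(f"={keyword}")    # exact form, before morphology
    if len(variants) == 1:
        return keyword                    # mode 0: no expansion
    return "(" + " | ".join(variants) + ")"


for mode in ("0", "1", "exact", "star"):
    print(mode, "->", expand_keyword("running", mode))
```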

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) expand_keywords = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) expand_keywords = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'expand_keywords' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) expand_keywords = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) expand_keywords = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) expand_keywords = '1'");

index products {
  expand_keywords = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Expanded queries naturally take longer to complete, but can improve the search quality, as documents with exact form matches should generally be ranked higher than documents with stemmed or infix matches.

Note that the existing query syntax does not allow emulating this kind of expansion, because internal expansion works at the keyword level and also expands keywords within phrase or quorum operators (which is not possible through the query syntax). The examples below show how expand_keywords affects the search result weights and how "runsy" is found by "runs" without the need to add a star:

mysql> create table t(f text) min_infix_len='2' expand_keywords='1' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    2 | runs    |     1560 |
|    1 | running |     1500 |
|    3 | runsy   |     1500 |
+------+---------+----------+
3 rows in set (0.01 sec)

mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t(f text) min_infix_len='2' expand_keywords='exact' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1590 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

This directive does not affect indexer in any way; it only affects searchd.

expansion_limit = number

Maximum number of expanded keywords for a single wildcard. See the description of the expansion_limit searchd setting for details.

Ignoring stop words

Stop words are words that are skipped during indexing and searching. Typically you'd put the most frequent words on the stop words list, because they do not add much value to search results but consume a lot of resources to process.

By default, stemming is applied when parsing the stop words file. That might, however, lead to undesired results. You can turn that off with stopwords_unstemmed.

Small enough files are stored in the index header, see embedded_limit for details.

While stop words are not indexed, they still affect keyword positions. For instance, assume that "the" is a stop word, that document 1 contains the line "in office", and that document 2 contains "in the office". Searching for "in office" as an exact phrase will only return the first document, as expected, even though "the" in the second one is skipped as a stop word. That behavior can be tweaked through the stopword_step directive.
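The position behavior can be modeled in a few lines. The sketch below is a toy model, not Manticore's tokenizer: it assumes whitespace tokenization and that each skipped stop word advances the position counter by the stopword_step value. All function names are hypothetical.

```python
def keyword_positions(text, stopwords, stopword_step=1):
    """Assign positions to indexed keywords, skipping stop words."""
    positions = []
    pos = 0
    for token in text.lower().split():
        if token in stopwords:
            pos += stopword_step  # a stop word is not stored, but may shift positions
            continue
        pos += 1
        positions.append((token, pos))
    return positions


def phrase_matches(document, phrase, stopwords, stopword_step=1):
    """Check whether the phrase keywords occur at matching relative positions."""
    doc = keyword_positions(document, stopwords, stopword_step)
    query = keyword_positions(phrase, stopwords, stopword_step)
    if not query:
        return False
    first_token, first_pos = query[0]
    for token, pos in doc:
        if token != first_token:
            continue
        offset = pos - first_pos
        if all((q_token, q_pos + offset) in doc for q_token, q_pos in query):
            return True
    return False


stop = {"the"}
print(phrase_matches("in office", "in office", stop))            # document 1
print(phrase_matches("in the office", "in office", stop))        # document 2, stopword_step=1
print(phrase_matches("in the office", "in office", stop, 0))     # document 2, stopword_step=0
```

In this model, setting stopword_step to 0 closes the gap left by "the", so the phrase query also matches the second document.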

stopwords=path/to/stopwords/file[ path/to/another/file ...]

Stop word files list (space separated). Optional, default is empty. You can specify several file names, separated by spaces. All the files will be loaded. In RT mode only absolute paths are allowed.

Stop words file format is simple plain text. The encoding must be UTF-8. File data will be tokenized with respect to charset_table settings, so you can use the same separators as in the indexed data.

Stop word files can be created either manually or semi-automatically. indexer provides a mode that creates a frequency dictionary of the index, sorted by keyword frequency; see the --buildstops and --buildfreqs switches for details. The top keywords from that dictionary can usually be used as stop words.

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");

index products {
  stopwords = /usr/local/manticore/data/stopwords.txt
  stopwords = stopwords-ru.txt stopwords-en.txt

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:

af - Afrikaans
ar - Arabic
bg - Bulgarian
bn - Bengali
ca - Catalan
ckb - Kurdish
cz - Czech
da - Danish
de - German
el - Greek
en - English
eo - Esperanto
es - Spanish
et - Estonian
eu - Basque
fa - Persian
fi - Finnish
fr - French
ga - Irish
gl - Galician
hi - Hindi
he - Hebrew
hr - Croatian
hu - Hungarian
hy - Armenian
id - Indonesian
it - Italian
ja - Japanese
ko - Korean
la - Latin
lt - Lithuanian
lv - Latvian
mr - Marathi
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sl - Slovenian
so - Somali
st - Sotho
sv - Swedish
sw - Swahili
th - Thai
tr - Turkish
yo - Yoruba
zh - Chinese
zu - Zulu

For example, to use stop words for the Italian language, just put the following line in your config file:

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) stopwords = 'it'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) stopwords = 'it'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'it'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'it\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'it\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) stopwords = 'it'");

index products {
  stopwords = it

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

If you need to use stop words for multiple languages you should list all their aliases, separated with commas (in RT mode) or spaces (plain mode):

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");

index products {
  stopwords = en it ru

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

stopword_step={0|1}

Position increment on stopwords. Optional, allowed values are 0 and 1, default is 1.

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en',
            'stopword_step' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'");

index products {
  stopwords = en
  stopword_step = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

stopwords_unstemmed={0|1}

Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).

By default, stop words are stemmed themselves, and applied to tokens after stemming (or any other morphology processing). In other words, by default, a token is stopped when stem(token) is equal to stem(stopword). That can lead to unexpected results when a token gets (erroneously) stemmed to a stopped root. For example, 'Andes' might get stemmed to 'and', so when 'and' is a stopword, 'Andes' is also skipped.

The stopwords_unstemmed directive changes this behavior. When it is enabled, stop words are applied before stemming (and therefore to the original word forms), and a token is skipped when it is equal to a stop word.
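The difference between the two modes can be shown with a toy stemmer. The sketch below is not a real stemmer and not Manticore code: `TOY_STEMS` is a hypothetical lookup table that reproduces the 'Andes' to 'and' case described above.

```python
# Hypothetical stemmer output, just enough to reproduce the example above.
TOY_STEMS = {"andes": "and", "mountains": "mountain"}


def toy_stem(token):
    """Return the (pretend) stem of a token."""
    return TOY_STEMS.get(token, token)


def index_tokens(tokens, stopwords, stopwords_unstemmed=0):
    """Return the keywords that survive stop word filtering."""
    stemmed_stopwords = {toy_stem(word) for word in stopwords}
    kept = []
    for token in tokens:
        if stopwords_unstemmed:
            if token in stopwords:                   # compare original word forms
                continue
        elif toy_stem(token) in stemmed_stopwords:   # compare stems
            continue
        kept.append(toy_stem(token))
    return kept


tokens = ["andes", "and", "mountains"]
print(index_tokens(tokens, {"and"}))                          # 'andes' is lost
print(index_tokens(tokens, {"and"}, stopwords_unstemmed=1))   # 'andes' survives
```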

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en',
            'stopwords_unstemmed' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'");

index products {
  stopwords = en
  stopwords_unstemmed = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Word forms

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The dictionaries are used to normalize incoming words both during indexing and searching. Therefore, for a plain index to pick up changes in the wordforms file, the index must be rotated.

Word forms support in Manticore is designed to support big dictionaries well. They moderately affect indexing speed: for instance, a dictionary with 1 million entries slows down indexing about 1.5 times. Searching speed is not affected at all. Additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across indexes: i.e. if the very same 50 MB wordforms file is specified for 10 different indexes, additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by a greater-than sign (>). Rules from the charset_table will be applied when the file is loaded. So it is as case sensitive as your other full-text indexed data, i.e. typically case insensitive. Here is a sample of the file contents:

walks > walk
walked > walk
walking > walk

The bundled spelldump utility helps you create a dictionary file in a format Manticore can read, using source .dict and .aff dictionary files in ispell or MySpell format (as bundled with OpenOffice).

You can map several source words to a single destination word. Because the work happens on tokens, not the source text, differences in whitespace and markup are ignored.

You can use => instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology instead of before (only a single source word is supported).

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

You can specify multiple destination tokens:

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3
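The line format described above is simple enough to parse by hand, which can be handy for validating a wordforms file before deploying it. The parser below is a rough sketch under stated assumptions: it splits on whitespace only and does not apply charset_table rules, and the function name `parse_wordform_line` is hypothetical.

```python
def parse_wordform_line(line):
    """Parse one wordforms line into (source_tokens, destination_tokens, after_morphology).

    Returns None for blank lines, comment-only lines, and lines without a separator.
    """
    line = line.split("#", 1)[0].strip()        # drop trailing comments
    if not line:
        return None
    after_morphology = line.startswith("~")     # ~ means: apply after morphology
    if after_morphology:
        line = line[1:]
    separator = "=>" if "=>" in line else ">"
    source, found, destination = line.partition(separator)
    if not found or not source.split() or not destination.split():
        return None
    return source.split(), destination.split(), after_morphology


sample = [
    "core 2 duo > c2d",
    "core 2duo => c2d # Some people write '2duo' together...",
    "~run > walk",
    "s02e02 > season 2 episode 2",
]
for entry in sample:
    print(parse_wordform_line(entry))
```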

You can specify several files rather than just one. Masks can be used as patterns, and all matching files will be processed in simple ascending order.

In RT mode only absolute paths are allowed.

If multi-byte codepages are used, and file names can include foreign characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in several files, the latter one is used, and it overrides previous definitions.

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");

index products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping, but have a number of important differences.

Short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, default is empty. In RT mode only absolute paths are allowed.

The expected file format is also plain text, with one line per exception, and the line format is as follows:

map-from-tokens => map-to-token

Example file:

AT & T => AT&T
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive: they will not be processed by charset_table rules. Thus, with the example exceptions file above, "at&t" text will be tokenized as two keywords "at" and "t", because of lowercase letters. On the other hand, "AT&T" will match exactly and produce single "AT&T" keyword.

Note that this map-to keyword is:

always interpreted as a single word
and is both case and space sensitive

In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". And what "MS Windows" gets mapped to is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g. "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity, and gets indexed as two separate keywords.)

Whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-from list will match any other amount of whitespace in the indexed document or query. For instance, the "AT & T" map-from token will match "AT & T" text, whatever the amount of space in both the map-from part and the indexed text. Such text will therefore be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able to search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.
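The matching rules described above (case sensitive, whitespace required but of any amount, special characters allowed) can be approximated with regular expressions. The sketch below is a simplified model for experimentation, not Manticore's tokenizer: it ignores token boundaries and uses only a few entries in the style of the sample file. The function names are hypothetical.

```python
import re

SAMPLE_EXCEPTIONS = """\
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
MS Windows => ms windows
C++ => cplusplus
"""


def load_exceptions(text):
    """Turn 'map-from => map-to' lines into (compiled pattern, keyword) pairs."""
    rules = []
    for line in text.splitlines():
        source, found, target = line.partition("=>")
        parts = source.split()
        if not found or not parts:
            continue
        # Whitespace between map-from tokens is required, but any amount matches.
        pattern = r"\s+".join(re.escape(part) for part in parts)
        rules.append((re.compile(pattern), target.strip()))  # case sensitive by default
    return rules


def matched_keywords(text, rules):
    """Return the map-to keywords whose map-from tokens occur in the text."""
    return [keyword for pattern, keyword in rules if pattern.search(text)]


rules = load_exceptions(SAMPLE_EXCEPTIONS)
print(matched_keywords("Standarten \t Fuehrer", rules))
print(matched_keywords("standarten fuehrer", rules))
print(matched_keywords("I like C++", rules))
```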

Exceptions are applied to raw incoming document and query data during indexing and searching respectively. Therefore, for a plain index to pick up changes in the file, you must reindex and restart searchd.

Examples (SQL, HTTP, PHP, Python, JavaScript, Java, CONFIG):

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");

index products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}
