Creating a table > NLP and tokenization > Exceptions

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The word forms dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the word forms file.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", Some(true)).await;

table products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Word forms support in Manticore is designed to handle large dictionaries well. They moderately affect indexing speed; for example, a dictionary with 1 million entries slows down full-text indexing by about 1.5 times. Searching speed is not affected at all. The additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB word forms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms in UTF-8 encoding, separated by a 'greater than' sign. The rules from the charset_table will be applied when the file is loaded. Therefore, if you do not modify charset_table, your word forms will be case-insensitive, similar to your other full-text indexed data. Below is a sample of the file contents:

‹›

Example

Example

📋

walks > walk
walked > walk
walking > walk

There is a bundled utility called Spelldump that helps you create a dictionary file in a format that Manticore can read. The utility can read from source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.

You can map several source words to a single destination word. The process happens on tokens, not the source text, so differences in whitespace and markup are ignored.

You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology, instead of before (note that only a single source and destination word are supported in this case).

‹›

Example

Example

📋

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

If you need to use >, = or ~ as normal characters, you can escape them by preceding each with a backslash (\). Both > and = should be escaped in this manner. Here's an example:

‹›

Example

Example

📋

a\> > abc
\>b > bcd
c\=\> => cde
\=\>d => def
\=\>a \> f \> => foo
\~g => bar

You can specify multiple destination tokens:

‹›

Example

Example

📋

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

You can specify multiple files, not just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order:

In the RT mode, only absolute paths are allowed.

If multi-byte codepages are used and file names include non-latin characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in multiple files, the latter one is used and overrides previous definitions.

‹›

SQL
Config

📋

create table tbl1 ... wordforms='/tmp/wf*'
create table tbl2 ... wordforms='/tmp/wf, /tmp/wf2'

Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty. In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
\=\>abc\> => abc

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the at&t text will be tokenized as two keywords at and t due to lowercase letters. On the other hand, AT&T will match exactly and produce a single AT&T keyword.

If you need to use > or = as normal characters, you can escape them by preceding each with a backslash (\). Both > and = should be escaped in this manner.

Note that the map-to keywords:

are always interpreted as a single word
are both case and space sensitive

In the above sample, ms windows query will not match the document with MS Windows text. The query will be interpreted as a query for two keywords, ms and windows. The mapping for MS Windows is a single keyword ms windows, with a space in the middle. On the other hand, standartenfuhrer will retrieve documents with Standarten Fuhrer or Standarten Fuehrer contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., staNdarTenfUhreR. (It won't catch standarten fuhrer, however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, the AT & T map-from token will match AT & T text, whatever the amount of space in both map-from part and the indexed text. Such text will, therefore, be indexed as a special AT&T keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat + as a valid character, but still want to be able to search for some exceptions from this rule such as C++. The sample above will do just that, totally independent of what characters are in the table and what are not.

When using a plain table, it is necessary to rotate the table to incorporate changes from the exceptions file. In the case of a real-time table, changes will only apply to new documents.

‹›

SQL
JSON
PHP
Python
Python-asyncio
javascript
Java
C#
Rust
CONFIG

📋

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", Some(true)).await;

table products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Word forms Morphology