Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions Word forms
Case sensitive Case insensitive
Can use special characters that are not in charset_table Fully obey charset_table
Underperform on huge dictionaries Designed to handle millions of entries

exceptions

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty. In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the "at&t" text will be tokenized as two keywords "at" and "t" due to lowercase letters. On the other hand, "AT&T" will match exactly and produce a single "AT&T" keyword.

Note that this map-to keyword:

  • is always interpreted as a single word
  • is both case and space sensitive

In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". The mapping for "MS Windows" is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, the "AT & T" map-from token will match "AT & T" text, whatever the amount of space in both map-from part and the indexed text. Such text will, therefore, be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able to search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.

Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the exceptions file.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

Advanced morphology

Morphology preprocessors can be applied to the words being indexed to replace different forms of the same word with the base, normalized form or improve segmentation. For instance, English stemmer will normalize both "dogs" and "dog" to "dog", making search results for the both keywords the same.

There are 4 different morphology preprocessors that Manticore implements.

  • Lemmatizer reduces a keyword form to a so-called lemma, a proper normal form, or in other words, a valid natural language root word. For example, "running" could be reduced to "run", the infinitive verb form, and "octopi" would be reduced to "octopus", the singular noun form. Note that sometimes a word form can have multiple corresponding root words. For instance, by looking at "dove" it is not possible to tell whether this is a past tense of "dive" the verb as in "He dove into a pool.", or "dove" the noun as in "White dove flew over the cuckoo's nest." In this case lemmatizer can generate all the possible root forms.
  • Stemmer reduces a keyword form to a so-called stem by removing and/or replacing certain well-known suffixes. The resulting stem is however not guaranteed to be a valid word on itself. For instance, with a Porter English stemmers "running" would still reduce to "run", which is fine, but "business" would reduce to "busi", which is not a word, and "octopi" would not reduce at all. Stemmers are essentially (much) simpler but still pretty good replacements of full-blown lemmatizers.
  • Phonetic algorithms replace the words with specially crafted phonetic codes that are equal even when the words original are different, but phonetically close.
  • Word breaking algorithms split text into words. Currently available only for Chinese.

morphology

morphology = morphology1[, morphology2, ...]

A list of morphology preprocessors to apply. Optional, default is empty (do not apply any preprocessor).

The morphology processors that come with our own built-in Manticore implementations are:

  • English, Russian, and German lemmatizers
  • English, Russian, Arabic, and Czech stemmers
  • SoundEx and MetaPhone phonetic algorithms
  • Chinese word breaking algorithm
  • Snowball (libstemmer) stemmers for more than 15 other languages.

Lemmatizers require dictionary .pak files that you can download from the website. The dictionaries needs to be put in the directory specified by lemmatizer_base. Also, there is the lemmatizer_cache setting which lets you speed up lemmatizing (and therefore indexing) by spending more RAM for, basically, an uncompressed dictionary cache.

The Chinese language segmentation using ICU is also available. It is a much more precise, but a little bit slower way (compared to n-grams) to segment Chinese documents. charset_table must contain all Chinese characters (you can use alias "cjk"). In case of "morphology=icu_chinese" documents are first pre-processed by ICU, then the result is processed by the tokenizer (according to your charset_table) and then other morphology processors specified in the "morphology" option are applied. When the documents are processed by ICU, only those parts of texts that contain Chinese are passed to ICU for segmentation, others can be modified by other means (different morphologies, charset_table etc.)

Built-in English and Russian stemmers should be faster than their libstemmer counterparts, but can produce slightly different results, because they are based on an older version.

Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.

Built-in values that are available for use in the morphology option are as follows:

  • none - do not perform any morphology processing
  • lemmatize_ru - apply Russian lemmatizer and pick a single root form
  • lemmatize_uk - apply Ukrainian lemmatizer and pick a single root form (install it first in Centos or Ubuntu/Debian). For correct work of the lemmatizer make sure specific Ukrainian characters are preserved in your charset_table since by default they are not. For that override them, like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the urkainian lemmatizer.
  • lemmatize_en - apply English lemmatizer and pick a single root form
  • lemmatize_de - apply German lemmatizer and pick a single root form
  • lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms
  • lemmatize_uk_all - apply Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
  • lemmatize_en_all - apply English lemmatizer and index all possible root forms
  • lemmatize_de_all - apply German lemmatizer and index all possible root forms
  • stem_en - apply Porter's English stemmer
  • stem_ru - apply Porter's Russian stemmer
  • stem_enru - apply Porter's English and Russian stemmers
  • stem_cz - apply Czech stemmer
  • stem_ar - apply Arabic stemmer
  • soundex - replace keywords with their SOUNDEX code
  • metaphone - replace keywords with their METAPHONE code
  • icu_chinese - apply Chinese text segmentation using ICU
  • libstemmer_* . Refer to the list of supported languages for details

Several stemmers can be specified at once comma-separated. They will be applied to incoming words in the order they are listed, and the processing will stop once one of the stemmers actually modifies the word. Also when wordforms feature is enabled the word will be looked up in word forms dictionary first, and if there is a matching entry in the dictionary, stemmers will not be applied at all. Or in other words, wordforms can be used to implement stemming exceptions.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

morphology_skip_fields

morphology_skip_fields = field1[, field2, ...]

A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

min_stemming_len

min_stemming_len = length

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect, and might sometimes produce undesired results. For instance, running "gps" keyword through Porter stemmer for English results in "gp", which is not really the intent. min_stemming_len feature lets you suppress stemming based on the source word length, ie. to avoid stemming too short words. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For more finely grained control, refer to wordforms feature.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

index_exact_words

index_exact_words = {0|1}

Whether to index the original keywords along with the stemmed/remapped versions. Optional, default is 0 (do not index).

When enabled, index_exact_words forces indexation to put the raw keywords in the full-text index along with the stemmed versions. That, in turn, enables exact form operator in the query language to work. This impacts the full-text index size and the indexing time. However, searching performance is not impacted at all.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

Advanced HTML tokenization

Stripping HTML tags

html_strip

html_strip = {0|1}

This option determines whether HTML markup should be stripped from the incoming full-text data. The default value is 0, which disables stripping. To enable stripping, set the value to 1.

HTML tags and entities are considered as markup and will be processed.

HTML tags are removed, while the contents between them (e.g. everything between <p> and </p>) are left intact. You can choose to keep and index tag attributes (e.g. HREF attribute in an A tag or ALT in an IMG tag). Some well-known inline tags, such as A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, and TT, are completely removed. All other tags are treated as block level and are replaced with whitespace. For example, the text te<strong>st</strong> will be indexed as a single keyword 'test', while te<p>st</p> will be indexed as two keywords 'te' and 'st'.

HTML entities are decoded and replaced with their corresponding UTF-8 characters. The stripper supports both numeric forms (e.g. &#239;) and text forms (e.g. &oacute; or &nbsp;) of entities, and supports all entities specified by the HTML4 standard.

The stripper is designed to work with properly formed HTML and XHTML, but may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Please note that only the tags themselves, as well as HTML comments, are stripped. To strip the contents of the tags, including embedded scripts, see the html_remove_elements option. There are no restrictions on tag names, meaning that everything that looks like a valid tag start, end, or comment will be stripped.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) html_strip = '1'

html_index_attrs

html_index_attrs = img=alt,title; a=title;

The html_index_attrs option allows you to specify which HTML markup attributes should be indexed even though other HTML markup is stripped. The default value is empty, meaning no attributes will be indexed. The format of the option is a per-tag enumeration of indexable attributes, as demonstrated in the example above. The contents of the specified attributes will be retained and indexed, providing a way to extract additional information from your full-text data.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'

html_remove_elements

html_remove_elements = element1[, element2, ...]

A list of HTML elements whose contents, along with the elements themselves, will be stripped. Optional, the default is an empty string (do not strip contents of any elements).

This option allows you to remove the contents of elements, meaning everything between the opening and closing tags. It is useful for removing embedded scripts, CSS, etc. The short tag form for empty elements (e.g.
) is properly supported, and the text following such a tag will not be removed.

The value is a comma-separated list of element (tag) names, the contents of which should be removed. Tag names are case-insensitive.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'

Extracting important parts from HTML

index_sp

index_sp = {0|1}

Controls detection and indexing of sentence and paragraph boundaries. Optional, default is 0 (no detection or indexing).

This directive enables the detection and indexing of sentence and paragraph boundaries, making it possible for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis, and only requires setting index_sp = 1 to enable it. Paragraph detection, however, relies on HTML markup and occurs during the [HTML stripping process](../../Creating_a_table/NLP_and_tokenization/Advanced_HTML_tokenization.md#html_strip. As such, to index paragraph boundaries, both the index_sp directive and the html_strip directive must be set to 1.

The following rules are used to determine sentence boundaries:

  • Question marks (?) and exclamation marks (!) always indicate a sentence boundary.
  • Trailing dots (.) indicate a sentence boundary, except in the following cases:
    • When followed by a letter. This is considered part of an abbreviation (e.g. "S.T.A.L.K.E.R." or "Goldman Sachs S.p.A.").
    • When followed by a comma. This is considered an abbreviation followed by a comma (e.g. "Telecom Italia S.p.A., founded in 1994").
    • When followed by a space and a lowercase letter. This is considered an abbreviation within a sentence (e.g. "News Corp. announced in February").
    • When preceded by a space and an uppercase letter, and followed by a space. This is considered a middle initial (e.g. "John D. Doe").

Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'

index_zones

index_zones = h*, th, title

A list of HTML/XML zones within a field to be indexed. The default is an empty string (no zones will be indexed).

A "zone" is defined as everything between an opening and a matching closing tag, and all spans sharing the same tag name are referred to as a "zone." For example, everything between <H1> and </H1> in a document field belongs to the H1 zone.

The index_zones directive enables zone indexing, but the HTML stripper must also be enabled (by setting html_strip = 1). The value of index_zones should be a comma-separated list of tag names and wildcards (ending with a star) to be indexed as zones.

Zones can be nested and overlap, as long as every opening tag has a matching tag. Zones can also be used for matching with the ZONE operator, as described in the extended_query_syntax.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • CONFIG
📋
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'