Creating an index > NLP and tokenization > Morphology

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain index to pick up changes in wordforms file it's required to rotate the index.

Word forms support in Manticore is designed to support big dictionaries well. They moderately affect indexing speed: for instance, a dictionary with 1 million entries slows down indexing about 1.5 times. Searching speed is not affected at all. Additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across indexes: i.e. if the very same 50 MB wordforms file is specified for 10 different indexes, additional searchd RAM usage will be about 50 MB.

Dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by "greater" sign. Rules from the charset_table will be applied when the file is loaded. So basically it's as case sensitive as your other full-text indexed data, ie. typically case insensitive. Here's the file contents sample:

walks > walk
walked > walk
walking > walk

There is a bundled Spelldump utility that helps you create a dictionary file in the format Manticore can read from source .dict and .aff dictionary files in ispell or MySpell format (as bundled with OpenOffice).

You can map several source words to a single destination word. Because the work happens on tokens, not the source text, differences in whitespace and markup are ignored.

You can use => instead of >. Comments (starting with # are also allowed. Finally, if a line starts with a tilde (~) the wordform will be applied after morphology, instead of before (only single source word is supported).

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

You can specify multiple destination tokens:

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

You can specify several files and not only just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order.

In RT mode only absolute paths are allowed.

If multi-byte codepages are used, and file names can include foreign characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in several files, the latter one is used, and it overrides previous definitions.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");

index products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt
  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Exceptions

Last modified: September 08, 2021

Exceptions (also known as synonyms) allow to map one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping, but have a number of important differences.

Short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, default is empty. In RT mode only absolute paths are allowed.

The expected file format is also plain text, with one line per exception, and the line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive: they will not be processed by charset_table rules. Thus, with the example exceptions file above, "at&t" text will be tokenized as two keywords "at" and "t", because of lowercase letters. On the other hand, "AT&T" will match exactly and produce single "AT&T" keyword.

Note that this map-to keyword is:

always interpreted as a single word
and is both case and space sensitive

In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". And what "MS Windows" gets mapped to is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, eg. "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity, and gets indexed as two separate keywords.)

Whitespace in the map-from tokens list matters, but its amount does not. Any amount of the whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, "AT & T" map-from token will match "AT & T" text, whatever the amount of space in both map-from part and the indexed text. Such text will therefore be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.

Exceptions also allow to capture special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.

Exceptions are applied to raw incoming document and query data during indexing and searching respectively. Therefore, when it comes to a plain index to pick up changes in the file it's required to reindex and restart searchd.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");

index products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Word forms Morphology

Last modified: May 04, 2021

Morphology preprocessors can be applied to the words being indexed to replace different forms of the same word with the base, normalized form or improve segmentation. For instance, English stemmer will normalize both "dogs" and "dog" to "dog", making search results for the both keywords the same.

There are 4 different morphology preprocessors that Manticore implements.

Lemmatizer reduces a keyword form to a so-called lemma, a proper normal form, or in other words, a valid natural language root word. For example, "running" could be reduced to "run", the infinitive verb form, and "octopi" would be reduced to "octopus", the singular noun form. Note that sometimes a word form can have multiple corresponding root words. For instance, by looking at "dove" it is not possible to tell whether this is a past tense of "dive" the verb as in "He dove into a pool.", or "dove" the noun as in "White dove flew over the cuckoo's nest." In this case lemmatizer can generate all the possible root forms.
Stemmer reduces a keyword form to a so-called stem by removing and/or replacing certain well-known suffixes. The resulting stem is however not guaranteed to be a valid word on itself. For instance, with a Porter English stemmers "running" would still reduce to "run", which is fine, but "business" would reduce to "busi", which is not a word, and "octopi" would not reduce at all. Stemmers are essentially (much) simpler but still pretty good replacements of full-blown lemmatizers.
Phonetic algorithms replace the words with specially crafted phonetic codes that are equal even when the words original are different, but phonetically close.
Word breaking algorithms split text into words. Currently available only for Chinese.

morphology = morphology1[, morphology2, ...]

A list of morphology preprocessors to apply. Optional, default is empty (do not apply any preprocessor).

The morphology processors that come with our own built-in Manticore implementations are:

English, Russian, and German lemmatizers
English, Russian, Arabic, and Czech stemmers
SoundEx and MetaPhone phonetic algorithms
Chinese word breaking algorithm

You can also link with libstemmer library for even more stemmers (see details below). With libstemmer, Manticore also supports morphological processing for more than 15 other languages. Binary packages should come prebuilt with libstemmer support, too.

Lemmatizers require a dictionary that needs to be additionally downloaded from the Manticore website. That dictionary needs to be installed in a directory specified by lemmatizer_base directive. Also, there is a lemmatizer_cache directive that lets you speed up lemmatizing (and therefore indexing) by spending more RAM for, basically, an uncompressed cache of a dictionary.

Chinese segmentation using ICU is also available. It is a much more precise, but a little bit slower way (compared to n-grams) to segment Chinese documents. charset_table must contain all Chinese characters (you can use alias "cjk"). In case of "morphology=icu_chinese" documents are first pre-processed by ICU, then the result is processed by the tokenizer (according to your charset_table) and then other morphology processors specified in the "morphology" option are applied. When the documents are processed by ICU, only those parts of texts that contain Chinese are passed to ICU for segmentation, others can be modified by other means (different morphologies, charset_table etc.)

Built-in English and Russian stemmers should be faster than their libstemmer counterparts, but can produce slightly different results, because they are based on an older version.

Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.

Built-in values that are available for use in the morphology option are as follows:

none - do not perform any morphology processing
lemmatize_ru - apply Russian lemmatizer and pick a single root form
lemmatize_uk - apply Ukrainian lemmatizer and pick a single root form (install it first in Centos or Ubuntu/Debian). For correct work of the lemmatizer make sure specific Ukrainian characters are preserved in your charset_table since by default they are not. For that override them, like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the urkainian lemmatizer.
lemmatize_en - apply English lemmatizer and pick a single root form
lemmatize_de - apply German lemmatizer and pick a single root form
lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms
lemmatize_uk_all - apply Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
lemmatize_en_all - apply English lemmatizer and index all possible root forms
lemmatize_de_all - apply German lemmatizer and index all possible root forms
stem_en - apply Porter's English stemmer
stem_ru - apply Porter's Russian stemmer
stem_enru - apply Porter's English and Russian stemmers
stem_cz - apply Czech stemmer
stem_ar - apply Arabic stemmer
soundex - replace keywords with their SOUNDEX code
metaphone - replace keywords with their METAPHONE code
icu_chinese - apply Chinese text segmentation using ICU
libstemmer_* . Refer to the list of supported languages for details

Several stemmers can be specified at once comma-separated. They will be applied to incoming words in the order they are listed, and the processing will stop once one of the stemmers actually modifies the word. Also when wordforms feature is enabled the word will be looked up in word forms dictionary first, and if there is a matching entry in the dictionary, stemmers will not be applied at all. Or in other words, wordforms can be used to implement stemming exceptions.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

POST /sql -d "mode=raw&query=CREATE TABLE products(title text, price float)  morphology = 'stem_en, libstemmer_sv'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology' => 'stem_en, libstemmer_sv'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");

index products {
  morphology = stem_en, libstemmer_sv
  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

morphology_skip_fields = field1[, field2, ...]

A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology_skip_fields' => 'name',
            'morphology' => 'stem_en'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");

index products {
  morphology_skip_fields = name
  morphology = stem_en
  type = rt
  path = idx
  rt_field = title
  rt_field = name
  rt_attr_uint = price
}

min_stemming_len = length

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect, and might sometimes produce undesired results. For instance, running "gps" keyword through Porter stemmer for English results in "gp", which is not really the intent. min_stemming_len feature lets you suppress stemming based on the source word length, ie. to avoid stemming too short words. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For more finely grained control, refer to wordforms feature.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'min_stemming_len' => '4',
             'morphology' => 'stem_en'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");

index products {
  min_stemming_len = 4
  morphology = stem_en
  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

index_exact_words = {0|1}

Whether to index the original keywords along with the stemmed/remapped versions. Optional, default is 0 (do not index).

When enabled, index_exact_words forces indexation to put the raw keywords in the index along with the stemmed versions. That, in turn, enables exact form operator in the query language to work. This impacts the index size and the indexing time. However, searching performance is not impacted at all.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'index_exact_words' => '1',
             'morphology' => 'stem_en'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");

index products {
  index_exact_words = 1
  morphology = stem_en
  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Exceptions Advanced HTML tokenization

Last modified: May 11, 2021

html_strip = {0|1}

Whether to strip HTML markup from incoming full-text data. Optional, default is 0. Known values are 0 (disable stripping) and 1 (enable stripping).

Both HTML tags and entities and considered markup and get processed.

HTML tags are removed, their contents (i.e., everything between <p> and </p>) are left intact by default. You can choose to keep and index attributes of the tags (e.g., HREF attribute in an A tag, or ALT in an IMG one). Several well-known inline tags are completely removed, all other tags are treated as block level and replaced with whitespace. For example, te<b>st</b> text will be indexed as a single keyword 'test', however, te<p>st</p> will be indexed as two keywords 'te' and 'st'. Known inline tags are as follows: A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, TT.

HTML entities get decoded and replaced with corresponding UTF-8 characters. Stripper supports both numeric forms (such as ï) and text forms (such as ó or  ). All entities as specified by HTML4 standard are supported.

Stripping should work with properly formed HTML and XHTML, but, just as most browsers, may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Only the tags themselves, and also HTML comments, are stripped. To strip the contents of the tags too (eg. to strip embedded scripts), see html_remove_elements option. There are no restrictions on tag names; ie. everything that looks like a valid tag start, or end, or a comment will be stripped.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) html_strip = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_strip' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) html_strip = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) html_strip = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) html_strip = '1'");

index products {
  html_strip = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

html_index_attrs = img=alt,title; a=title;

A list of markup attributes to index when stripping HTML. Optional, default is empty (do not index markup attributes).

Specifies HTML markup attributes whose contents should be retained and indexed even though other HTML markup is stripped. The format is per-tag enumeration of indexable attributes, as shown above.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_index_attrs' => 'img=alt,title; a=title;',
            'html_strip' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");

index products {
  html_index_attrs = img=alt,title; a=title;
  html_strip = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

html_remove_elements = element1[, element2, ...]

A list of HTML elements for which to strip contents along with the elements themselves. Optional, default is empty string (do not strip contents of any elements).

This feature allows to strip element contents, ie. everything that is between the opening and the closing tags. It is useful to remove embedded scripts, CSS, etc. Short tag form for empty elements (ie. <br/>) is properly supported; ie. the text that follows such tag will not be removed.

The value is a comma-separated list of element (tag) names whose contents should be removed. Tag names are case insensitive.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_remove_elements' => 'style, script',
            'html_strip' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");

index products {
  html_remove_elements = style, script
  html_strip = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

index_sp = {0|1}

Whether to detect and index sentence and paragraph boundaries. Optional, default is 0 (do not detect and index).

This directive enables sentence and paragraph boundary indexing. It's required for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis, so you only need to set index_sp = 1 to enable it. Paragraph detection is however based on HTML markup, and happens in the HTML stripper. So to index paragraph locations you also need to enable the stripper by specifying html_strip = 1. Both types of boundaries are detected based on a few built-in rules enumerated just below.

Sentence boundary detection rules are as follows.

Question and exclamation signs (? and !) are always a sentence boundary.
Trailing dot (.) is a sentence boundary, except:
- When followed by a letter. That's considered a part of an abbreviation (as in "S.T.A.L.K.E.R" or "Goldman Sachs S.p.A.").
- When followed by a comma. That's considered an abbreviation followed by a comma (as in "Telecom Italia S.p.A., founded in 1994").
- When followed by a space and a small letter. That's considered an abbreviation within a sentence (as in "News Corp. announced in February").
- When preceded by a space and a capital letter, and followed by a space. That's considered a middle initial (as in "John D. Doe").

Paragraph boundaries are inserted at every block-level HTML tag. Namely, those are (as taken from HTML 4 standard) ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_sp' => '1',
            'html_strip' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");

index products {
  index_sp = 1
  html_strip = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

index_zones = h*, th, title

A list of in-field HTML/XML zones to index. Optional, default is empty (do not index zones).

Zones can be formally defined as follows. Everything between an opening and a matching closing tag is called a span, and the aggregate of all spans corresponding sharing the same tag name is called a zone. For instance, everything between the occurrences of <H1> and </H1> in the document field belongs to H1 zone.

Zone indexing, enabled by index_zones directive, is an optional extension of the HTML stripper. So it will also require that the stripper is enabled (with html_strip = 1). The value of the index_zones should be a comma-separated list of those tag names and wildcards (ending with a star) that should be indexed as zones.

Zones can nest and overlap arbitrarily. The only requirement is that every opening tag has a matching tag. You can also have an arbitrary number of both zones (as in unique zone names, such as H1) and spans (all the occurrences of those H1 tags) in a document. Once indexed, zones can then be used for matching with the ZONE operator, see extended_query_syntax.

‹›

SQL
HTTP
PHP
Python
javascript
Java
CONFIG

📋

CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'

POST /sql -d "mode=raw&query=
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_zones' => 'h*,th,title',
            'html_strip' => '1'
        ]);

utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'')

res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'');

utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'");

index products {
  index_zones = h*, th, title
  html_strip = 1

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}

Morphology Creating a distributed index

Last modified: February 10, 2021

Word forms

wordforms

Exceptions

exceptions

Advanced morphology

morphology

morphology_skip_fields

min_stemming_len

index_exact_words

Advanced HTML tokenization

Stripping HTML tags

html_strip

html_index_attrs

html_remove_elements

Extracting important parts from HTML

index_sp

index_zones