Supported languages

Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cjk (which is the default value).

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for a few languages that can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable:

  • basic support (column "Supported")
  • stopwords (column "Stopwords file name")
  • advanced morphology (column "Advanced morphology")
Language Supported Stopwords file name Advanced morphology Notes
Afrikaans charset_table=non_cjk af -
Arabic charset_table=non_cjk ar morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian charset_table=non_cjk hy -
Assamese specify charset_table specify charset_table manually - -
Basque charset_table=non_cjk eu -
Bengali charset_table=non_cjk bn -
Bishnupriya specify charset_table manually - -
Buhid specify charset_table manually - -
Bulgarian charset_table=non_cjk bg -
Catalan charset_table=non_cjk ca morphology=libstemmer_ca
Chinese charset_table=chinese or ngram_chars=chinese zh morphology=icu_chinese or ngram_chars=1 correspondingly ICU dictionary based segmentation is much more accurate than ngram-based
Croatian charset_table=non_cjk hr -
Kurdish charset_table=non_cjk ckb -
Czech charset_table=non_cjk cz morphology=stem_cz (Czech stemmer)
Danish charset_table=non_cjk da morphology=libstemmer_da
Dutch charset_table=non_cjk nl morphology=libstemmer_nl
English charset_table=non_cjk en morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto charset_table=non_cjk eo -
Estonian charset_table=non_cjk et -
Finnish charset_table=non_cjk fi morphology=libstemmer_fi
French charset_table=non_cjk fr morphology=libstemmer_fr
Galician charset_table=non_cjk gl -
Garo specify charset_table manually - -
German charset_table=non_cjk de morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek charset_table=non_cjk el morphology=libstemmer_el
Hebrew charset_table=non_cjk he -
Hindi charset_table=non_cjk hi morphology=libstemmer_hi
Hmong specify charset_table manually - -
Ho specify charset_table manually - -
Hungarian charset_table=non_cjk hu morphology=libstemmer_hu
Indonesian charset_table=non_cjk id morphology=libstemmer_id
Irish charset_table=non_cjk ga morphology=libstemmer_ga
Italian charset_table=non_cjk it morphology=libstemmer_it
Japanese ngram_chars=japanese - ngram_chars=japanese ngram_len=1 Requires ngram-based segmentation
Komi specify charset_table manually - -
Korean ngram_chars=korean - ngram_chars=korean ngram_len=1 Requires ngram-based segmentation
Large Flowery Miao specify charset_table manually - -
Latin charset_table=non_cjk la -
Latvian charset_table=non_cjk lv -
Lithuanian charset_table=non_cjk lt morphology=libstemmer_lt
Maba specify charset_table manually - -
Maithili specify charset_table manually - -
Marathi specify charset_table manually - -
Marathi charset_table=non_cjk mr -
Mende specify charset_table manually - -
Mru specify charset_table manually - -
Myene specify charset_table manually - -
Nepali specify charset_table manually - morphology=libstemmer_ne
Ngambay specify charset_table manually - -
Norwegian charset_table=non_cjk no morphology=libstemmer_no
Odia specify charset_table manually - -
Persian charset_table=non_cjk fa -
Polish charset_table=non_cjk pl -
Portuguese charset_table=non_cjk pt morphology=libstemmer_pt
Romanian charset_table=non_cjk ro morphology=libstemmer_ro
Russian charset_table=non_cjk ru morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali specify charset_table manually - -
Sindhi specify charset_table manually - -
Slovak charset_table=non_cjk sk -
Slovenian charset_table=non_cjk sl -
Somali charset_table=non_cjk so -
Sotho charset_table=non_cjk st -
Spanish charset_table=non_cjk es morphology=libstemmer_es
Swahili charset_table=non_cjk sw -
Swedish charset_table=non_cjk sv morphology=libstemmer_sv
Sylheti specify charset_table manually - -
Tamil specify charset_table manually - morphology=libstemmer_ta
Thai charset_table=non_cjk th -
Turkish charset_table=non_cjk tr morphology=libstemmer_tr
Ukrainian charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 - morphology=lemmatize_uk_all Requires installation of UK lemmatizer
Yoruba charset_table=non_cjk yo -
Zulu charset_table=non_cjk zu -

Chinese, Japanese and Korean (CJK) languages

Manticore provides built-in support for indexing CJK texts, allowing you to process CJK texts in two different ways:

  1. Precise segmentation using the ICU library. Currently, only Chinese is supported.
‹›
  • SQL
  • JSON
  • PHP
  • Python
  • Javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'
  1. Basic support using the N-gram options ngram_len and ngram_chars For each CJK language, there are separate character set tables (chinese, korean, japanese) that can be used, or you can use the common cjk character set table.
‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

Additionally, there is built-in support for Chinese stopwords with the alias zh.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

Low-level tokenization

When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.

To perform these operations correctly, Manticore must know:

  • the encoding of the source text (which should always be UTF-8)
  • which characters are considered letters and which are not
  • which letters should be folded to other letters

You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.

The default character set is non_cjk, which includes most languages.

You can also define text pattern replacement rules. For example, with the following rules:

regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR

The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.

You can learn more about regexp_filter here.

Index configuration options

charset_table

# default
charset_table = non_cjk

# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

# english charset defined with alias
charset_table = 0..9, english, _

# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF

charset_table specifies an array that maps letter characters to their case folded versions (or any other characters if you like). The default character set is non_cjk which includes most non-CJK languages.

charset_table is a workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls what characters are accepted as valid and how they should be transformed (e.g. whether case should be removed or not).

By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.

charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:

  • Single char mapping: A->a. Declares the source character 'A' as allowed within keywords and maps it to the destination character 'a' (but does not declare 'a' as allowed).
  • Range mapping: A..Z->a..z. Declares all characters in the source range as allowed and maps them to the destination range. Does not declare the destination range as allowed. Checks the lengths of both ranges.
  • Stray char mapping: a. Declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
  • Stray range mapping: a..z. Declares all characters in the range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
  • Checkerboard range mapping: A..Z/2. Maps every pair of characters to the second character. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for Unicode blocks where uppercase and lowercase letters go in an interleaved order.

For characters with codes from 0 to 32, and those in the range of 127 to 8-bit ASCII and Unicode characters, Manticore always treats them as separators. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.

If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cjk array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:

charset_table = non_cjk,U+00E4,U+00C4

for case sensitive search or

charset_table = non_cjk,U+00E4,U+00C4->U+00E4

for case insensitive search.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'

Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:

  • english
  • russian
  • non_cjk
  • cjk
‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'

If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cjk and cjk, that cover non-CJK and CJK (Chinese, Japanese, Korean) languages respectively. In most cases, these charsets should be sufficient for your needs.

Please note that the following languages are currently not supported:

  • Assamese
  • Bishnupriya
  • Buhid
  • Garo
  • Hmong
  • Ho
  • Komi
  • Large Flowery Miao
  • Maba
  • Maithili
  • Marathi
  • Mende
  • Mru
  • Myene
  • Ngambay
  • Odia
  • Santali
  • Sindhi
  • Sylheti

All other languages listed in the Unicode languages list are supported by default.

To work with both cjk and non-cjk languages, set the options in your configuration file as shown below (with an exception for Chinese):

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

If you do not require support for cjk-languages, you can simply exclude the ngram_len and ngram_chars options. For more information on these options, refer to the corresponding documentation sections.

To map one character to multiple characters or vice versa, you can use regexp_filter can be helpful.

blend_chars

blend_chars = +, &, U+23
blend_chars = +, &->+

Blended characters list. Optional, default is empty.

Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed, at&t, at and t.

Care should be taken when using blended characters as defining a character as blended means that it is no longer a separator.

  • Therefore, if you put a comma to the blend_chars and search for dog,cat, it will treat that as a single token dog,cat. If dog,cat was not indexed as dog,cat, but left as dog cat only, then it will not match.
  • Hence, this behavior should be controlled with the blend_mode setting.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs in the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".

Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.

Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'

blend_mode

blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure

The blended tokens indexing mode is enabled by the blend_mode directive.

By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.

blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.

If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.

The options are:

  • trim_none - Index the entire token
  • trim_head - Trim heading blended characters, and index the resulting token
  • trim_tail - Trim trailing blended characters, and index the resulting token
  • trim_both- Trim both heading and trailing blended characters, and index the resulting token
  • trim_all - Trim heading, trailing, and middle blended characters, and index the resulting token
  • skip_pure - Do not index the token if it is purely blended, that is, consists of blended characters only

Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended two) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.

Default behavior is to index the entire token, equivalent to blend_mode = trim_none.

Be aware that using blend modes limits your search, even with the default mode trim_none if you assume . is a blended character:

  • .dog. will become dog. dog during indexing
  • and you won't be able to find it by dog..

Using more modes increases the chance your keyword will match something.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'

min_word_len

min_word_len = length

min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.

Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) min_word_len = '4'

ngram_len

ngram_len = 1

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.

N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there may be no clear separators between the words. In some cases, you may not want to use dictionary-based segmentation the one available for Chinese. In those cases, n-gram segmentation might work well too.

When this feature is enabled, streams of CJK (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some CJK characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.

Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.

Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram CJK words can mean closer multi-character word matches) to the top.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'

ngram_chars

ngram_chars = cjk

ngram_chars = cjk, U+3000..U+2FA1F

N-gram characters list. Optional, default is empty.

To be used in conjunction with in ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'

Also you can use an alias for our default N-gram table as in the example. It should be sufficient in most cases.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'

ignore_chars

ignore_chars = U+AD

Ignored characters list. Optional, default is empty.

Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.

The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'

bigram_index

bigram_index = {none|all|first_freq|both_freq}

Bigram indexing mode. Optional, default is none.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

bigram_index controls the selection of specific word pairs. The known modes are:

  • all, index every single word pair
  • first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing "alone in the dark" text will result in "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword (either "in" or "the" respectively), but "alone in" would not be indexed, because "in" is a second word in that pair.
  • both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.

For most use cases, both_freq would be the best mode, but your mileage may vary.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'

bigram_freq_words

bigram_freq_words = the, a, you, i

A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.

Some of the bigram indexing modes (see bigram_index) require to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated when both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.

bigram_freq_words lets you define a list of such keywords.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'

dict

dict = {keywords|crc}

The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.

Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.

CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:

  • Firstly, there's a risk of control sum collisions between different keywords pairs. This risk grows in proportion to the number of unique keywords in the index. Nonetheless, this concern is minor as the probability of a single FNV64 collision in a dictionary of 1 billion entries is roughly 1 in 16, or 6.25 percent. Most dictionaries will have far fewer than a billion keywords given that a typical spoken human language has between 1 and 10 million word forms.
  • Secondly, and more crucially, it's not straightforward to perform substring searches with control sums. Manticore addressed this issue by pre-indexing all possible substrings as separate keywords (see min_prefix_len, min_infix_len directives). This method even has an added advantage of matching substrings in the fastest way possible. Yet, pre-indexing all substrings significantly increases the index size (often by factors of 3-10x or more) and subsequently affects the indexing time, making substring searches on large indexes rather impractical.

The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.

For substring (infix) searches, extended wildcards can be used. Special characters such as '?' and '%' are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that these wildcards only function with dict=keywords, and not elsewhere.

Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing - yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1..10% percent. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.

In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) dict = 'keywords'

embedded_limit

embedded_limit = size

Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.

When you create a table the above mentioned files can be either saved externally along with the table or embedded directly into the table. Files sized under embedded_limit get stored into the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by just copying a single file.

With smaller files, such embedding reduces the number of the external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.

‹›
  • CONFIG
CONFIG
📋
table products {
  embedded_limit = 32K

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

global_idf

global_idf = /path/to/global.idf

The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).

On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as BM25 family of factors, the results might be ranked slightly differently depending on what cluster node they reside.

The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection documents counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.

IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.

You can build an .idf file using indextool utility, by dumping dictionaries using --dumpdict dict.txt --stats switch first, then converting those to .idf format using --buildidf, then merging all the .idf files across cluster using --mergeidf.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'

hitless_words

hitless_words = {all|path/to/file}

Hitless words list. Optional, allowed values are 'all', or a list file name.

By default, Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as hitlist). Hitlists enables phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that can not be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.

hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.

Hitless index will generally use less space than the respective regular full-text index (about 1.5x can be expected). Both indexing and searching should be faster, at a cost of missing positional query and ranking support.

If used in positional queries (e.g. phrase queries) the hitless words are taken out from them and used as operand without a position. For example if "hello" and "world" are hitless and "simon" and "says" are not hitless, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.

A positional query than contains only hitless words will result in an empty phrase node, therefore the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all) only boolean matching can be used on the respective index.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) hitless_words = 'all'

index_field_lengths

index_field_lengths = {0|1}

Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).

When index_field_lengths is set to 1 Manticore will:

  • create a respective length attribute for every full-text field, sharing the same name but with __len suffix
  • compute a field length (counted in keywords) for every document and store in to a respective attribute
  • compute the per-index averages. The lengths attributes will have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and their values are generally accessible.

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There's also support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) index_field_lengths = '1'

index_token_filter

index_token_filter = my_lib.so:custom_blend:chars=@#&

Index-time token filter for full-text indexing. Optional, default is empty.

The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'

overshort_step

overshort_step = {0|1}

Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) overshort_step = '1'

phrase_boundary

phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis

Phrase boundary characters list. Optional, default is empty.

This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.

On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.

Phrase boundary condition will be raised if and only if such character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'

phrase_boundary_step

phrase_boundary_step = 100

Phrase boundary word position increment. Optional, default is 0.

On phrase boundary, current word position will be additionally incremented by this number.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'

regexp_filter

# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch

# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regexps.

In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.

Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.

Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing, even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'