Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cont (which is the default value). The non_cjk option, which is an alias for non_cont, can be used as well: charset_table = non_cjk.
For many languages, Manticore provides a stopwords file that can be used to improve search relevance.
Additionally, advanced morphology is available for a few languages that can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.
The table below lists all supported languages and indicates how to enable:
- basic support (column "Supported")
- stopwords (column "Stopwords file name")
- advanced morphology (column "Advanced morphology")
Language | Supported | Stopwords file name | Advanced morphology | Notes |
---|---|---|---|---|
Afrikaans | charset_table=non_cont | af | - | |
Arabic | charset_table=non_cont | ar | morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar | |
Armenian | charset_table=non_cont | hy | - | |
Assamese | specify charset_table manually | - | - | |
Basque | charset_table=non_cont | eu | - | |
Bengali | charset_table=non_cont | bn | - | |
Bishnupriya | specify charset_table manually | - | - | |
Buhid | specify charset_table manually | - | - | |
Bulgarian | charset_table=non_cont | bg | - | |
Catalan | charset_table=non_cont | ca | morphology=libstemmer_ca | |
Chinese using ICU | charset_table=chinese | zh | morphology=icu_chinese | More accurate than using ngrams |
Chinese using Jieba | charset_table=chinese | zh | morphology=jieba_chinese | More accurate than using ngrams |
Chinese using ngrams | ngram_chars=chinese | zh | ngram_chars=chinese ngram_len=1 | Faster indexing, but the search performance might not be as good |
Croatian | charset_table=non_cont | hr | - | |
Kurdish | charset_table=non_cont | ckb | - | |
Czech | charset_table=non_cont | cz | morphology=stem_cz (Czech stemmer) | |
Danish | charset_table=non_cont | da | morphology=libstemmer_da | |
Dutch | charset_table=non_cont | nl | morphology=libstemmer_nl | |
English | charset_table=non_cont | en | morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer) | |
Esperanto | charset_table=non_cont | eo | - | |
Estonian | charset_table=non_cont | et | - | |
Finnish | charset_table=non_cont | fi | morphology=libstemmer_fi | |
French | charset_table=non_cont | fr | morphology=libstemmer_fr | |
Galician | charset_table=non_cont | gl | - | |
Garo | specify charset_table manually | - | - | |
German | charset_table=non_cont | de | morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de | |
Greek | charset_table=non_cont | el | morphology=libstemmer_el | |
Hebrew | charset_table=non_cont | he | - | |
Hindi | charset_table=non_cont | hi | morphology=libstemmer_hi | |
Hmong | specify charset_table manually | - | - | |
Ho | specify charset_table manually | - | - | |
Hungarian | charset_table=non_cont | hu | morphology=libstemmer_hu | |
Indonesian | charset_table=non_cont | id | morphology=libstemmer_id | |
Irish | charset_table=non_cont | ga | morphology=libstemmer_ga | |
Italian | charset_table=non_cont | it | morphology=libstemmer_it | |
Japanese | ngram_chars=japanese | - | ngram_chars=japanese ngram_len=1 | Requires ngram-based segmentation |
Komi | specify charset_table manually | - | - | |
Korean | ngram_chars=korean | - | ngram_chars=korean ngram_len=1 | Requires ngram-based segmentation |
Large Flowery Miao | specify charset_table manually | - | - | |
Latin | charset_table=non_cont | la | - | |
Latvian | charset_table=non_cont | lv | - | |
Lithuanian | charset_table=non_cont | lt | morphology=libstemmer_lt | |
Maba | specify charset_table manually | - | - | |
Maithili | specify charset_table manually | - | - | |
Marathi | specify charset_table manually | - | - | |
Marathi | charset_table=non_cont | mr | - | |
Mende | specify charset_table manually | - | - | |
Mru | specify charset_table manually | - | - | |
Myene | specify charset_table manually | - | - | |
Nepali | specify charset_table manually | - | morphology=libstemmer_ne | |
Ngambay | specify charset_table manually | - | - | |
Norwegian | charset_table=non_cont | no | morphology=libstemmer_no | |
Odia | specify charset_table manually | - | - | |
Persian | charset_table=non_cont | fa | - | |
Polish | charset_table=non_cont | pl | - | |
Portuguese | charset_table=non_cont | pt | morphology=libstemmer_pt | |
Romanian | charset_table=non_cont | ro | morphology=libstemmer_ro | |
Russian | charset_table=non_cont | ru | morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer) | |
Santali | specify charset_table manually | - | - | |
Sindhi | specify charset_table manually | - | - | |
Slovak | charset_table=non_cont | sk | - | |
Slovenian | charset_table=non_cont | sl | - | |
Somali | charset_table=non_cont | so | - | |
Sotho | charset_table=non_cont | st | - | |
Spanish | charset_table=non_cont | es | morphology=libstemmer_es | |
Swahili | charset_table=non_cont | sw | - | |
Swedish | charset_table=non_cont | sv | morphology=libstemmer_sv | |
Sylheti | specify charset_table manually | - | - | |
Tamil | specify charset_table manually | - | morphology=libstemmer_ta | |
Thai | charset_table=thai | th | - | |
Turkish | charset_table=non_cont | tr | morphology=libstemmer_tr | |
Ukrainian | charset_table=non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 | - | morphology=lemmatize_uk_all | Requires installation of UK lemmatizer |
Yoruba | charset_table=non_cont | yo | - | |
Zulu | charset_table=non_cont | zu | - | |
Manticore provides built-in support for indexing languages with continuous scripts (i.e., languages that do not use spaces or other marks between words or sentences). This allows you to process texts in these languages in the following ways:
- Precise segmentation using the ICU library. Currently, only Chinese is supported.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'
- Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'
- Basic support using the N-gram options ngram_len and ngram_chars.
For each language using a continuous script, there are separate character set tables (chinese, korean, japanese, thai) that can be used. Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk charset to include all CJK languages only.
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'
Additionally, there is built-in support for Chinese stopwords with the alias zh.
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'
When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.
To perform these operations correctly, Manticore must know:
- the encoding of the source text (which should always be UTF-8)
- which characters are considered letters and which are not
- which letters should be folded to other letters
You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.
The default character set is non_cont, which includes most languages.
You can also define text pattern replacement rules. For example, with the following rules:
regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR
The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.
You can learn more about regexp_filter here.
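The ordering of the rules and their effect on queries can be tried out with a minimal Python sketch. It uses Python's re module as a stand-in for Manticore's regular expression engine, and apply_regexp_filters is a hypothetical helper, not a Manticore API. Case folding is a separate step driven by charset_table, so the sketch leaves letter case untouched.

```python
import re

# The two rules from the example, in the order they are declared.
RULES = [
    (r'\**(\d+)\"', r'\1 inch'),
    (r'(BLUE|RED)', 'COLOR'),
]

def apply_regexp_filters(text, rules=RULES):
    """Apply each (pattern, replacement) rule in order.

    The output of one rule is the input of the next, which is why
    the declaration order matters.
    """
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

# Documents and queries go through the same rewriting.
print(apply_regexp_filters('RED TUBE 5" LONG'))  # COLOR TUBE 5 inch LONG
print(apply_regexp_filters('PLANK 2" x 4"'))     # PLANK 2 inch x 4 inch
print(apply_regexp_filters('BLUE TUBE'))         # COLOR TUBE
```

Because the same rewriting is applied on both sides, a query for BLUE TUBE and a document containing RED TUBE end up with the same keywords.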
# default
charset_table = non_cont
# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
# english charset defined with alias
charset_table = 0..9, english, _
# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cont, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF
charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters if you prefer). The default character set is non_cont, which includes most languages with non-continuous scripts.
charset_table is the workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls which characters are accepted as valid and how they should be transformed (e.g. whether case should be folded or not).
By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.
charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:
- Single char mapping: A->a. Declares the source character 'A' as allowed within keywords and maps it to the destination character 'a' (but does not declare 'a' as allowed).
- Range mapping: A..Z->a..z. Declares all characters in the source range as allowed and maps them to the destination range. Does not declare the destination range as allowed. Checks the lengths of both ranges.
- Stray char mapping: a. Declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
- Stray range mapping: a..z. Declares all characters in the range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
- Checkerboard range mapping: A..Z/2. Maps every pair of characters to the second character. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for Unicode blocks where uppercase and lowercase letters go in an interleaved order.
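The mapping shortcuts above can be illustrated with a small Python sketch that expands a mapping list into a lookup table and then tokenizes text with it. This is not Manticore's implementation: parse_charset_table and tokenize are hypothetical helpers, aliases such as english or non_cont are not handled, and a literal comma cannot appear as a character in this simplified parser.

```python
def _code_point(token):
    """Turn 'a' or 'U+430' into an integer code point."""
    token = token.strip()
    if token.upper().startswith("U+") and len(token) > 2:
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError(f"unsupported token: {token!r}")
    return ord(token)

def _bounds(text):
    """Turn 'a' into (a, a) and 'a..z' into (a, z)."""
    low, sep, high = text.partition("..")
    first = _code_point(low)
    return (first, _code_point(high)) if sep else (first, first)

def parse_charset_table(spec):
    """Expand a mapping list into {source code point: destination code point}."""
    table = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith("/2"):
            # Checkerboard range: every pair maps to its second character.
            low, high = _bounds(item[:-2])
            if (high - low + 1) % 2:
                raise ValueError(f"odd-length checkerboard range: {item!r}")
            for cp in range(low, high, 2):
                table[cp] = table[cp + 1] = cp + 1
            continue
        source, arrow, destination = item.partition("->")
        src_low, src_high = _bounds(source)
        # A stray char or range maps to itself.
        dst_low, dst_high = _bounds(destination) if arrow else (src_low, src_high)
        if src_high - src_low != dst_high - dst_low:
            raise ValueError(f"range lengths differ: {item!r}")
        for offset in range(src_high - src_low + 1):
            table[src_low + offset] = dst_low + offset
    return table

def tokenize(text, table):
    """Fold mapped characters and split on every character missing from the table."""
    tokens, current = [], []
    for ch in text:
        mapped = table.get(ord(ch))
        if mapped is None:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(chr(mapped))
    if current:
        tokens.append("".join(current))
    return tokens

table = parse_charset_table("0..9, A..Z->a..z, _, a..z")
print(tokenize("Abc ABC-abc", table))                 # ['abc', 'abc', 'abc']
# 'A->a' alone allows 'A' but not 'a', so 'a' acts as a separator here.
print(tokenize("Aa", parse_charset_table("A->a")))    # ['a']
```

Only source characters become keys of the table, which mirrors the rule that a mapping declares its source as allowed but not its destination.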
Characters with codes from 0 to 32 are always treated as separators. Characters with codes from 127 upward (8-bit ASCII and Unicode characters) are treated as separators unless they are explicitly mapped. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.
If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cont array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:
charset_table = non_cont,U+00E4,U+00C4
for case sensitive search or
charset_table = non_cont,U+00E4,U+00C4->U+00E4
for case insensitive search.
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:
- chinese
- cjk
- cont
- english
- japanese
- korean
- non_cont (non_cjk)
- russian
- thai
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'
If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cont and cont, that cover languages with non-continuous and continuous (Chinese, Japanese, Korean, Thai) scripts, respectively. In most cases, these charsets should be sufficient for your needs.
Please note that the following languages are currently not supported:
- Assamese
- Bishnupriya
- Buhid
- Garo
- Hmong
- Ho
- Komi
- Large Flowery Miao
- Maba
- Maithili
- Marathi
- Mende
- Mru
- Myene
- Ngambay
- Odia
- Santali
- Sindhi
- Sylheti
All other languages listed in the Unicode languages list are supported by default.
To work with both cont and non-cont languages, set the options in your configuration file as shown below (with an exception for Chinese):
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
If you do not require support for continuous-script languages, you can simply exclude the ngram_len and ngram_chars options. For more information on these options, refer to the corresponding documentation sections.
To map one character to multiple characters or vice versa, regexp_filter can be helpful.
blend_chars = +, &, U+23
blend_chars = +, &->+
Blended characters list. Optional, default is empty.
Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed: at&t, at and t.
Additionally, blended characters can influence indexing in such a way that keywords are indexed as if the blended characters were not typed at all. This behavior is particularly evident when blend_mode = trim_all is specified. For example, the phrase some_thing will be indexed as some, something, and thing with blend_mode = trim_all.
Care should be taken when using blended characters, as defining a character as blended means that it is no longer a separator.
- Therefore, if you add a comma to blend_chars and search for dog,cat, it will be treated as a single token dog,cat. If dog,cat was not indexed as dog,cat, but left as dog cat only, then it will not match.
- Hence, this behavior should be controlled with the blend_mode setting.
Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs at the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".
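The position assignment described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Manticore's tokenizer: index_with_blend_chars is a hypothetical helper, text is split on whitespace only, and lowercasing stands in for charset_table case folding.

```python
import re

def index_with_blend_chars(text, blend_chars):
    """Return (position, keyword) pairs for whitespace-separated text."""
    if not blend_chars:
        raise ValueError("blend_chars must not be empty in this sketch")
    splitter = re.compile("[" + re.escape(blend_chars) + "]+")
    postings, position = [], 1
    for raw in text.lower().split():
        # Regular keywords: blended characters act as separators.
        parts = [part for part in splitter.split(raw) if part]
        if parts != [raw]:
            # The blended token shares the position of its first part.
            postings.append((position, raw))
        for part in parts:
            postings.append((position, part))
            position += 1
        if not parts:
            # A purely blended token still occupies a position.
            position += 1
    return postings

print(index_with_blend_chars("AT&T company", "&"))
# [(1, 'at&t'), (1, 'at'), (2, 't'), (3, 'company')]
```

Since at and t land on consecutive positions, a phrase query for "AT T" matches, and since at&t shares position 1 with at, a phrase query for "AT&T company" can still be evaluated against the regular keyword stream.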
Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.
Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'
blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure
The blended tokens indexing mode is enabled by the blend_mode directive.
By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation mark are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.
blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.
If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.
The options are:
- trim_none - Index the entire token
- trim_head - Trim heading blended characters, and index the resulting token
- trim_tail - Trim trailing blended characters, and index the resulting token
- trim_both - Trim both heading and trailing blended characters, and index the resulting token
- trim_all - Trim heading, trailing, and middle blended characters, and index the resulting token
- skip_pure - Do not index the token if it is purely blended, that is, consists of blended characters only
Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended too) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.
Default behavior is to index the entire token, equivalent to blend_mode = trim_none.
Be aware that using blend modes limits your search, even with the default mode trim_none. If you assume . is a blended character:
- .dog. will become .dog. dog during indexing
- and you won't be able to find it by searching for dog. (with the trailing dot).
Using more modes increases the chance your keyword will match something.
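To see which variants each option produces, here is a minimal Python sketch. It is not Manticore's code: blended_variants and regular_keywords are hypothetical helpers, case folding is omitted, and falling back to trim_none when only skip_pure is given is an assumption of this sketch.

```python
def blended_variants(token, blend_chars, modes=("trim_none",)):
    """Return the extra token variants indexed for a token containing blended characters."""
    blended = set(blend_chars)
    if "skip_pure" in modes and all(ch in blended for ch in token):
        return set()  # purely blended token: nothing is indexed
    # Assumption: with no trim option given, behave like trim_none.
    trims = [mode for mode in modes if mode != "skip_pure"] or ["trim_none"]
    variants = set()
    for mode in trims:
        if mode == "trim_none":
            variant = token
        elif mode == "trim_head":
            variant = token.lstrip(blend_chars)
        elif mode == "trim_tail":
            variant = token.rstrip(blend_chars)
        elif mode == "trim_both":
            variant = token.strip(blend_chars)
        elif mode == "trim_all":
            variant = "".join(ch for ch in token if ch not in blended)
        else:
            raise ValueError(f"unknown blend_mode option: {mode}")
        if variant:
            variants.add(variant)
    return variants

def regular_keywords(token, blend_chars):
    """Keywords obtained by treating every blended character as a separator."""
    blended = set(blend_chars)
    return "".join(" " if ch in blended else ch for ch in token).split()

print(sorted(blended_variants("@dude!", "@!", ("trim_head", "trim_tail"))))  # ['@dude', 'dude!']
print(blended_variants("@U.S.A.", "@.", ("trim_both",)))                     # {'U.S.A'}
print(blended_variants("some_thing", "_", ("trim_all",)))                    # {'something'}
print(regular_keywords("some_thing", "_"))                                   # ['some', 'thing']
```

With the default mode, blended_variants(".dog.", ".") yields only .dog. and regular_keywords yields only dog, so nothing in the index corresponds to dog. with a single trailing dot, which is the limitation noted above.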
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'
min_word_len = length
min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.
Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.
CREATE TABLE products(title text, price float) min_word_len = '4'
ngram_len = 1
N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.
N-grams provide basic support for continuous-script languages in unsegmented texts. The issue with searching in languages using continuous scripts is the absence of clear separators between words. In some cases, you may not want to use dictionary-based segmentation, such as the one available for Chinese. In those instances, n-gram segmentation might also work well.
When this feature is enabled, streams of such languages (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some language characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.
Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in a "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.
Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram words can mean closer multi-character word matches) to the top.
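The 1-gram splitting of both the document text and the quoted query can be mimicked with a short Python sketch. It reuses the placeholder letters A to F from the explanation above in place of real continuous-script characters; split_ngrams is a hypothetical helper and only models ngram_len = 1.

```python
def split_ngrams(text, ngram_chars):
    """Insert a space wherever an N-gram character touches another keyword character."""
    ngram = set(ngram_chars)
    out, prev = [], ""
    for ch in text:
        boundary = (ch in ngram and prev.isalnum()) or (prev in ngram and ch.isalnum())
        if prev and boundary:
            out.append(" ")
        out.append(ch)
        prev = ch
    return "".join(out)

NGRAM_CHARS = "ABCDEF"  # placeholders standing in for continuous-script characters

print(split_ngrams("ABCDEF", NGRAM_CHARS))        # A B C D E F
print(split_ngrams('"BC" "DEF"', NGRAM_CHARS))    # "B C" "D E F"
print(split_ngrams("abcDEF", NGRAM_CHARS))        # abc D E F
```

The document becomes the keyword sequence A B C D E F, and the query becomes two phrases, "B C" and "D E F", both of which occur as adjacent keywords in that sequence.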
CREATE TABLE products(title text, price float) ngram_chars = 'cont' ngram_len = '1'
ngram_chars = cont
ngram_chars = cont, U+3000..U+2FA1F
N-gram characters list. Optional, default is empty.
To be used in conjunction with ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words composed of other characters will not be affected by the N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'
You can also use an alias for the default N-gram table, as in the example below. It should be sufficient in most cases.
CREATE TABLE products(title text, price float) ngram_chars = 'cont' ngram_len = '1'
ignore_chars = U+AD
Ignored characters list. Optional, default is empty.
Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.
The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.
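The difference between a separator and an ignored character can be shown with a minimal Python sketch. The tokenize helper below is hypothetical and much simpler than Manticore's tokenizer: valid_chars plays the role of charset_table (without any folding) and ignore_chars the role of the option described here.

```python
import string

def tokenize(text, valid_chars, ignore_chars=""):
    """Split text into keywords; ignored characters vanish without ending a keyword."""
    valid, ignored = set(valid_chars), set(ignore_chars)
    tokens, current = [], []
    for ch in text:
        if ch in ignored:
            continue  # dropped entirely, the current keyword keeps growing
        if ch in valid:
            current.append(ch)
        elif current:
            tokens.append("".join(current))  # any other character is a separator
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

letters = string.ascii_lowercase
print(tokenize("abc-def", letters))                          # ['abc', 'def']
print(tokenize("abc-def", letters, ignore_chars="-"))        # ['abcdef']
print(tokenize("co\u00adoperate", letters, ignore_chars="\u00ad"))  # ['cooperate']
```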
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'
bigram_index = {none|all|first_freq|both_freq}
Bigram indexing mode. Optional, default is none.
Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent word pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.
bigram_index controls the selection of specific word pairs. The known modes are:
- all, index every single word pair
- first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing "alone in the dark" text will result in "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword (either "in" or "the" respectively), but "alone in" would not be indexed, because "in" is a second word in that pair.
- both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.
For most use cases, both_freq would be the best mode, but your mileage may vary.
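The pair selection rules for each mode can be reproduced with a few lines of Python. This is only a sketch of the selection logic described above, with select_bigrams as a hypothetical helper; it says nothing about how Manticore stores the bigrams.

```python
def select_bigrams(words, mode, freq_words=()):
    """Return the adjacent word pairs that the given bigram_index mode would store."""
    frequent = set(freq_words)
    pairs = []
    for first, second in zip(words, words[1:]):
        if mode == "all":
            keep = True
        elif mode == "first_freq":
            keep = first in frequent
        elif mode == "both_freq":
            keep = first in frequent and second in frequent
        elif mode == "none":
            keep = False
        else:
            raise ValueError(f"unknown bigram_index mode: {mode}")
        if keep:
            pairs.append(f"{first} {second}")
    return pairs

words = "alone in the dark".split()
freq = ["the", "in", "i", "a"]
print(select_bigrams(words, "all"))               # ['alone in', 'in the', 'the dark']
print(select_bigrams(words, "first_freq", freq))  # ['in the', 'the dark']
print(select_bigrams(words, "both_freq", freq))   # ['in the']
```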
It's important to note that bigram_index works only at the tokenization level and doesn't account for transformations like morphology, wordforms or stopwords. This means the tokens it creates are very straightforward, which makes searching phrases more exact and strict. While this can improve the accuracy of phrase matching, it also makes the system less able to recognize different forms of words or variations in how words appear.
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'
bigram_freq_words = the, a, you, i
A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.
Some of the bigram indexing modes (see bigram_index) require you to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated during both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.
bigram_freq_words lets you define a list of such keywords.
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'
dict = {keywords|crc}
The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.
Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.
CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:
- Firstly, there's a risk of control sum collisions between different keyword pairs. This risk grows with the number of unique keywords in the index. Nonetheless, this concern is minor: by the birthday approximation n^2 / 2^65, the probability of at least one FNV64 collision in a dictionary of 1 billion entries is roughly 2.7 percent, or about 1 in 37. Most dictionaries will have far fewer than a billion keywords given that a typical spoken human language has between 1 and 10 million word forms.
- Secondly, and more crucially, it's not straightforward to perform substring searches with control sums. Manticore addressed this issue by pre-indexing all possible substrings as separate keywords (see min_prefix_len, min_infix_len directives). This method even has an added advantage of matching substrings in the fastest way possible. Yet, pre-indexing all substrings significantly increases the index size (often by factors of 3-10x or more) and subsequently affects the indexing time, making substring searches on large indexes rather impractical.
The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.
For substring (infix) searches, extended wildcards can be used. Special characters such as ? and % are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that the wildcard operators and REGEX only function with dict=keywords.
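Search-time wildcard expansion against a keywords dictionary can be sketched in Python as follows. The helpers wildcard_to_regex and expand are hypothetical, the dictionary is a made-up word list, and the wildcard semantics used here (* for any sequence, ? for exactly one character, % for zero or one character) are an assumption of the sketch. The limit handling is also simplified: it just truncates an alphabetical list.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters
        elif ch == "?":
            parts.append(".")    # exactly one character (assumed)
        elif ch == "%":
            parts.append(".?")   # zero or one character (assumed)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def expand(pattern, dictionary, expansion_limit=0):
    """Return the dictionary keywords a wildcard term expands into (0 means no limit)."""
    regex = wildcard_to_regex(pattern)
    matches = [kw for kw in sorted(dictionary) if regex.fullmatch(kw)]
    return matches[:expansion_limit] if expansion_limit else matches

# Hypothetical dictionary contents.
keywords = ["test", "tests", "testing", "toast", "tst", "run", "runs", "abcd", "xabcx"]

print(" | ".join(expand("test*", keywords)))  # test | testing | tests
print(expand("t?st*", keywords))              # ['test', 'testing', 'tests']
print(expand("run%", keywords))               # ['run', 'runs']
print(expand("*abc*", keywords))              # ['abcd', 'xabcx']
```

The more dictionary entries a pattern matches, the longer the expanded query becomes, which is why worst-case search time depends on the expansion size.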
Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing, yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1 to 10 percent. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.
In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).
CREATE TABLE products(title text, price float) dict = 'keywords'
embedded_limit = size
Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.
When you create a table, the above-mentioned files can be either saved externally along with the table or embedded directly into the table. Files sized under embedded_limit get stored into the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by just copying a single file.
With smaller files, such embedding reduces the number of external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.
- CONFIG
table products {
embedded_limit = 32K
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
global_idf = /path/to/global.idf
The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).
On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as the BM25 family of factors, the results might be ranked slightly differently depending on which cluster node they reside on.
The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection document counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.
IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.
You can build an .idf file using the indextool utility, by first dumping dictionaries using the --dumpdict dict.txt --stats switch, then converting those to .idf format using --buildidf, then merging all the .idf files across the cluster using --mergeidf.
- SQL
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'
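The effect described above can be illustrated with a minimal Python sketch. It assumes a textbook IDF of log(N / n), which is not necessarily the exact formula Manticore uses, and the node names and counts are hypothetical. It only shows why per-node IDFs diverge and how pooled counts give every node the same value.

```python
import math

# Hypothetical per-node statistics: total documents and per-keyword document frequency.
NODES = {
    "node_a": {"docs": 1000, "df": {"laptop": 10}},
    "node_b": {"docs": 1000, "df": {"laptop": 500}},
}

def idf(total_docs, docs_with_term):
    # Textbook IDF; an assumption, not Manticore's exact formula.
    return math.log(total_docs / docs_with_term)

def local_idf(node, term):
    stats = NODES[node]
    return idf(stats["docs"], stats["df"][term])

def global_idf(term):
    # Pool document counts and keyword frequencies across all nodes.
    total_docs = sum(s["docs"] for s in NODES.values())
    total_df = sum(s["df"].get(term, 0) for s in NODES.values())
    return idf(total_docs, total_df)

print(local_idf("node_a", "laptop"), local_idf("node_b", "laptop"), global_idf("laptop"))
```

With local statistics the same keyword gets a very different weight on each node; with pooled statistics every node would use the same weight.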
hitless_words = {all|path/to/file}
Hitless words list. Optional, allowed values are 'all', or a list file name.
By default, a Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as a hitlist). Hitlists enable phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that cannot be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.
hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.
A hitless index will generally use less space than the respective regular full-text index (a reduction of about 1.5x can be expected). Both indexing and searching should be faster, at the cost of missing positional query and ranking support.
If used in positional queries (e.g. phrase queries), the hitless words are taken out of them and used as operands without a position. For example, if "hello" and "world" are hitless and "simon" and "says" are not hitless, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.
A positional query that contains only hitless words will result in an empty phrase node, therefore the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all), only boolean matching can be used on the respective index.
- SQL
CREATE TABLE products(title text, price float) hitless_words = 'all'
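The phrase rewriting described above can be modeled with a short Python sketch. The function name rewrite_phrase is hypothetical, and the model is a simplification: it keeps all non-hitless words in one phrase and turns each hitless word into a plain AND operand, which reproduces the documented example but is not the engine's actual query parser.

```python
def rewrite_phrase(phrase, hitless):
    """Model how a phrase query changes when some of its words are hitless."""
    words = phrase.split()
    positional = [w for w in words if w not in hitless]
    loose = [w for w in words if w in hitless]
    if not positional:
        # Only hitless words: the phrase node is empty, so the query matches nothing.
        return None
    parts = ['"' + " ".join(positional) + '"'] + loose
    return "(" + " & ".join(parts) + ")"

print(rewrite_phrase("simon says hello world", {"hello", "world"}))
```

In this model a return value of None stands for the empty result (with a warning) that a phrase made only of hitless words produces.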
index_field_lengths = {0|1}
Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).
When index_field_lengths is set to 1, Manticore will:
- create a respective length attribute for every full-text field, sharing the same name but with the __len suffix
- compute a field length (counted in keywords) for every document and store it in the respective attribute
- compute the per-index averages
The length attributes will have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and their values are generally accessible.
The BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There is also support for both a complete variant of BM25 and its extension towards multiple fields, called BM25F. They require per-document lengths and per-field lengths, respectively. Hence the additional directive.
- SQL
CREATE TABLE products(title text, price float) index_field_lengths = '1'
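To show why the complete BM25 needs these lengths, here is a minimal Python sketch of the textbook BM25 term-frequency component. The sample documents, the whitespace tokenization, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration; this is not Manticore's BM25A() implementation.

```python
# Hypothetical documents with a single full-text field "title".
DOCS = {
    1: "red laptop",
    2: "red laptop with a large red carrying case and spare charger",
}

# Per-document field length counted in keywords (what a title__len attribute would hold),
# using naive whitespace tokenization as a stand-in.
title_len = {doc_id: len(text.split()) for doc_id, text in DOCS.items()}
avg_title_len = sum(title_len.values()) / len(title_len)

def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # Textbook BM25 term-frequency saturation with length normalization.
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

for doc_id, text in DOCS.items():
    tf = text.split().count("laptop")
    print(doc_id, title_len[doc_id], bm25_tf(tf, title_len[doc_id], avg_title_len))
```

Both documents contain "laptop" once, yet the shorter one scores higher because its length is below the average. Setting b=0 removes the length term, which corresponds to a variant that ignores document length.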
index_token_filter = my_lib.so:custom_blend:chars=@#&
Index-time token filter for full-text indexing. Optional, default is empty.
The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.
- SQL
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'
overshort_step = {0|1}
Position increment on overshort (shorter than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.
- SQL
CREATE TABLE products(title text, price float) overshort_step = '1'
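As a rough illustration of what the position increment means, the following Python sketch assigns word positions while skipping overshort keywords. The function name and the whitespace-split input are hypothetical simplifications, not the engine's tokenizer.

```python
def assign_positions(words, min_word_len=3, overshort_step=1):
    """Return (word, position) pairs for words that are long enough to be indexed."""
    pos = 0
    indexed = []
    for word in words:
        if len(word) < min_word_len:
            # Overshort keyword: not indexed, but it may still consume a position.
            pos += overshort_step
            continue
        pos += 1
        indexed.append((word, pos))
    return indexed

print(assign_positions("cat in hat".split(), overshort_step=1))
print(assign_positions("cat in hat".split(), overshort_step=0))
```

Under this model, overshort_step = 1 leaves a gap where the skipped word was, while overshort_step = 0 makes the surrounding words adjacent.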
phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
Phrase boundary characters list. Optional, default is empty.
This list controls which characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.
At a phrase boundary, an additional word position increment (specified by phrase_boundary_step) will be added to the current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other, so a proximity search within that distance will be equivalent to a phrase-level search.
A phrase boundary condition will be raised if and only if such a character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.
- SQL
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'
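The position adjustment can be sketched in Python as follows. This is a simplified model under stated assumptions: text is split on whitespace only, a boundary character counts only when it ends a whitespace-delimited chunk (that is, when it is followed by a separator), and the function name is hypothetical.

```python
def assign_positions(text, boundary_chars=".?!\u2026", boundary_step=100):
    """Return (word, position) pairs, adding boundary_step after each phrase boundary."""
    result = []
    pos = 0
    pending_boundary = False
    for chunk in text.split():
        # Only trailing boundary characters count; dots inside S.T.A.L.K.E.R do not.
        word = chunk.rstrip(boundary_chars)
        if word:
            pos += 1 + (boundary_step if pending_boundary else 0)
            result.append((word, pos))
            pending_boundary = False
        if len(word) < len(chunk):
            pending_boundary = True
    return result

print(assign_positions("Hello world. Play S.T.A.L.K.E.R now"))
```

In this model "world" and "Play" end up 101 positions apart, so a proximity query with a distance of up to 100 cannot span the two phrases.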
phrase_boundary_step = 100
Phrase boundary word position increment. Optional, default is 0.
At a phrase boundary, the current word position will be additionally incremented by this number.
- SQL
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'
# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regular expressions. The regular expressions engine used by Manticore Search is Google's RE2, which is known for its speed and safety. For detailed information on the syntax supported by RE2, you can visit the RE2 syntax guide.
In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.
Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.
Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing (including exceptions), even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.
- SQL
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'
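The ordered application of rules can be tried out with a small Python sketch. It uses Python's re module as a stand-in for RE2, which is an assumption: the two engines differ in some syntax, although these two patterns are simple enough to behave the same way. The function name is hypothetical.

```python
import re

# The two rules from the configuration example above, as (pattern, replacement) pairs.
RULES = [
    (r'\b(\d+)"', r"\1inch"),   # index '13"' as '13inch'
    (r"(blue|red)", "color"),   # index 'blue' or 'red' as 'color'
]

def apply_regexp_filters(text, rules=RULES):
    """Apply each rule to the raw text, in the order the rules are listed."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

print(apply_regexp_filters('13" red laptop'))
```

Running the same function over both the source field and the query text mirrors the description above: the raw text is rewritten first, and only then is it tokenized.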