Manticore provides built-in support for indexing languages with continuous scripts (i.e., languages that do not use spaces or other marks between words or sentences). This allows you to process texts in these languages in three different ways:
- Precise segmentation using the ICU library. Currently, only Chinese is supported.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'
- Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'
- Basic support using the N-gram options ngram_len and ngram_chars
For each language using a continuous script, there are separate character set tables (chinese, korean, japanese, thai) that can be used. Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk charset to include all CJK languages only.
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'
Additionally, there is built-in support for Chinese stopwords with the alias zh.
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'
When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.
To perform these operations correctly, Manticore must know:
- the encoding of the source text (which should always be UTF-8)
- which characters are considered letters and which are not
- which letters should be folded to other letters
You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.
The default character set is non_cont, which includes most languages.
You can also define text pattern replacement rules. For example, with the following rules:
regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR
The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.
You can learn more about regexp_filter in its own section below.
# default
charset_table = non_cont
# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
# english charset defined with alias
charset_table = 0..9, english, _
# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cont, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF
charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters if you prefer). The default character set is non_cont, which includes most languages with non-continuous scripts.
charset_table is the workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls which characters are accepted as valid and how they should be transformed (e.g. whether case should be folded or not).
By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.
charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:
- Single char mapping: A->a. Declares the source character 'A' as allowed within keywords and maps it to the destination character 'a' (but does not declare 'a' as allowed).
- Range mapping: A..Z->a..z. Declares all characters in the source range as allowed and maps them to the destination range. Does not declare the destination range as allowed. Checks the lengths of both ranges.
- Stray char mapping: a. Declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
- Stray range mapping: a..z. Declares all characters in the range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
- Checkerboard range mapping: A..Z/2. Maps every pair of characters to the second character. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for Unicode blocks where uppercase and lowercase letters go in an interleaved order.
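The mapping shortcuts above can be illustrated with a small Python sketch. This is not Manticore's parser: the function names (parse_code, parse_range, build_table, tokenize) are hypothetical, aliases such as english or non_cont are not handled, and the sketch only mirrors the rules as described in the list.

```python
def parse_code(tok):
    """Turn a literal character ('A') or a U+XXX code into an integer code point."""
    tok = tok.strip()
    if tok.upper().startswith("U+") and len(tok) > 2:
        return int(tok[2:], 16)
    if len(tok) != 1:
        raise ValueError(f"bad character spec: {tok!r}")
    return ord(tok)


def parse_range(spec):
    """Parse 'a' or 'a..z' into an inclusive (start, end) pair of code points."""
    if ".." in spec:
        lo, hi = spec.split("..", 1)
        return parse_code(lo), parse_code(hi)
    code = parse_code(spec)
    return code, code


def build_table(charset_table):
    """Build {source code point: destination code point} from a charset_table string."""
    table = {}
    for item in charset_table.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith("/2"):
            # Checkerboard range: every pair maps to its second character.
            start, end = parse_range(item[:-2])
            if (end - start + 1) % 2:
                raise ValueError(f"checkerboard range needs an even length: {item}")
            for code in range(start, end + 1, 2):
                table[code] = code + 1
                table[code + 1] = code + 1
        elif "->" in item:
            # Single char or range mapping: only the source side becomes allowed.
            src, dst = item.split("->", 1)
            s0, s1 = parse_range(src)
            d0, d1 = parse_range(dst)
            if s1 - s0 != d1 - d0:
                raise ValueError(f"range lengths differ: {item}")
            for offset in range(s1 - s0 + 1):
                table[s0 + offset] = d0 + offset
        else:
            # Stray char or stray range: allowed and mapped to itself.
            s0, s1 = parse_range(item)
            for code in range(s0, s1 + 1):
                table[code] = code
    return table


def tokenize(text, table):
    """Split text into keywords; characters missing from the table act as separators."""
    words, current = [], []
    for ch in text:
        mapped = table.get(ord(ch))
        if mapped is None:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(chr(mapped))
    if current:
        words.append("".join(current))
    return words


table = build_table("0..9, A..Z->a..z, _, a..z")
print(tokenize("Abc ABC-abc_1", table))
```

Note how a table built from A->a alone accepts 'A' but still treats 'a' as a separator, which is why the examples in this section list both A..Z->a..z and a..z.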
Characters with codes from 0 to 32 are always treated as separators. Characters with codes from 33 to 127 (7-bit ASCII) can be used in the mappings as is. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.
If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cont array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:
charset_table = non_cont,U+00E4,U+00C4
for case sensitive search or
charset_table = non_cont,U+00E4,U+00C4->U+00E4
for case insensitive search.
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:
chinese
cjk
cont
english
japanese
korean
non_cont (non_cjk)
russian
thai
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'
If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cont and cont, that cover languages with non-continuous and continuous (Chinese, Japanese, Korean, Thai) scripts, respectively. In most cases, these charsets should be sufficient for your needs.
Please note that the following languages are currently not supported:
- Assamese
- Bishnupriya
- Buhid
- Garo
- Hmong
- Ho
- Komi
- Large Flowery Miao
- Maba
- Maithili
- Marathi
- Mende
- Mru
- Myene
- Ngambay
- Odia
- Santali
- Sindhi
- Sylheti
All other languages listed in the Unicode languages list are supported by default.
To work with both cont and non-cont languages, set the options in your configuration file as shown below (with an exception for Chinese):
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
If you do not require support for continuous-script languages, you can simply omit the ngram_len and ngram_chars options. For more information on these options, refer to the corresponding documentation sections.
To map one character to multiple characters or vice versa, regexp_filter can be helpful.
blend_chars = +, &, U+23
blend_chars = +, &->+
Blended characters list. Optional, default is empty.
Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed: at&t, at and t.
Additionally, blended characters can influence indexing in such a way that keywords are indexed as if the blended characters were not typed at all. This behavior is particularly evident when blend_mode = trim_all is specified. For example, the phrase some_thing will be indexed as some, something, and thing with blend_mode = trim_all.
Care should be taken when using blended characters, because defining a character as blended means that it is no longer a separator.
- If you add a comma to blend_chars and search for dog,cat, the query will be treated as the single token dog,cat.
- If dog,cat was not indexed as dog,cat but only as dog cat, the query will not match.
- Hence, this behavior should be controlled with the blend_mode setting.
Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs in the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".
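The position assignment described above can be sketched in a few lines of Python. This is an illustration only, assuming that whitespace is the only regular separator and that every other character is a valid keyword character; the function name index_with_blend_chars is hypothetical.

```python
def index_with_blend_chars(text, blend_chars):
    """Return (position, keyword) pairs for a whitespace-separated text."""
    # Regular keywords are obtained by treating blended characters as separators.
    to_space = str.maketrans({ch: " " for ch in blend_chars})
    hits = []
    pos = 0
    for chunk in text.lower().split():
        parts = chunk.translate(to_space).split()
        if not parts:
            # Purely blended chunk: indexed as is, and it takes up a position.
            pos += 1
            hits.append((pos, chunk))
            continue
        start = pos + 1
        for part in parts:
            pos += 1
            hits.append((pos, part))
        if chunk != parts[0]:
            # The chunk mixes blended and regular characters: index the whole
            # token as well, at the position of its first regular keyword.
            hits.append((start, chunk))
    return hits


print(sorted(index_with_blend_chars("AT&T company", "&")))
```

Because at&t shares position 1 with at, both a plain keyword query and a query for the blended form can match at that position.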
Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.
Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'
blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure
The blended tokens indexing mode is enabled by the blend_mode directive.
By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.
blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.
If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.
The options are:
- trim_none - Index the entire token
- trim_head - Trim heading blended characters, and index the resulting token
- trim_tail - Trim trailing blended characters, and index the resulting token
- trim_both - Trim both heading and trailing blended characters, and index the resulting token
- trim_all - Trim heading, trailing, and middle blended characters, and index the resulting token
- skip_pure - Do not index the token if it is purely blended, that is, consists of blended characters only
Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended too) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.
Default behavior is to index the entire token, equivalent to blend_mode = trim_none.
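As a rough illustration of the options, the Python sketch below computes which blended variants of a single token each mode would add. It is a simplified model rather than Manticore's implementation: the function name blended_variants is hypothetical, regular keywords (which are always indexed) are left out, and a purely blended token is simply kept as is unless skip_pure is set.

```python
def blended_variants(token, blend_chars, modes=("trim_none",)):
    """Return the blended token variants that the given blend_mode options add."""
    if token and all(ch in blend_chars for ch in token):
        # Purely blended token: dropped only when skip_pure is requested.
        return [] if "skip_pure" in modes else [token]

    by_mode = {
        "trim_none": token,
        "trim_head": token.lstrip(blend_chars),
        "trim_tail": token.rstrip(blend_chars),
        "trim_both": token.strip(blend_chars),
        "trim_all": "".join(ch for ch in token if ch not in blend_chars),
    }
    variants = []
    for mode in modes:
        variant = by_mode.get(mode)  # skip_pure has no variant of its own
        if variant and variant not in variants:
            variants.append(variant)
    return variants


print(blended_variants("@dude!", "@!", ("trim_head", "trim_tail")))
print(blended_variants("@U.S.A.", "@.", ("trim_both",)))
print(blended_variants("some_thing", "_", ("trim_all",)))
```

Listing several modes yields several variants of the same token, which is what makes more query spellings match.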
Be aware that using blend modes limits your search, even with the default mode trim_none. If you assume . is a blended character:
- .dog. will become .dog. dog during indexing
- you won't be able to find it by searching for dog. (with the trailing dot)
Using more modes increases the chance your keyword will match something.
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'
min_word_len = length
min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.
Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.
CREATE TABLE products(title text, price float) min_word_len = '4'
ngram_len = 1
N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.
N-grams provide basic support for continuous-script languages in unsegmented texts. The issue with searching in languages using continuous scripts is the absence of clear separators between words. In some cases, you may not want to use dictionary-based segmentation, such as the one available for Chinese. In those instances, n-gram segmentation might also work well.
When this feature is enabled, streams of such languages (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some language characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.
Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.
Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram words can mean closer multi-character word matches) to the top.
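The 1-gram splitting and the query wrapping described above can be sketched as follows. This is a Python illustration, not Manticore code: the function names are hypothetical, and the letters A to F stand in for continuous-script characters listed in ngram_chars, as in the example in the text.

```python
def split_1grams(text, ngram_chars):
    """Surround every character from ngram_chars with separators (ngram_len = 1)."""
    pieces = [f" {ch} " if ch in ngram_chars else ch for ch in text]
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


def phrase_query(segmented_query, ngram_chars):
    """Wrap each word of a segmented query in quotes, then split it into 1-grams."""
    return " ".join(
        '"' + split_1grams(word, ngram_chars) + '"'
        for word in segmented_query.split()
    )


NGRAM_CHARS = set("ABCDEF")  # placeholder for real ngram_chars content

print(split_1grams("ABCDEF", NGRAM_CHARS))
print(phrase_query("BC DEF", NGRAM_CHARS))
```

The quoted groups keep the characters of each original word adjacent and in order, which is how the phrase operator recovers word-level matches from character-level tokens.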
CREATE TABLE products(title text, price float) ngram_chars = 'cont' ngram_len = '1'
ngram_chars = cont
ngram_chars = cont, U+3000..U+2FA1F
N-gram characters list. Optional, default is empty.
Used in conjunction with ngram_len, this list defines the characters whose sequences are subject to N-gram extraction. Words comprised of other characters will not be affected by the N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'
You can also use the cont alias for the default N-gram table, as in the example below. It should be sufficient in most cases.
CREATE TABLE products(title text, price float) ngram_chars = 'cont' ngram_len = '1'
ignore_chars = U+AD
Ignored characters list. Optional, default is empty.
Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.
The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'
bigram_index = {none|all|first_freq|both_freq}
Bigram indexing mode. Optional, default is none.
Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent word pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.
bigram_index controls the selection of specific word pairs. The known modes are:
- all, index every single word pair
- first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing "alone in the dark" text will result in "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword (either "in" or "the" respectively), but "alone in" would not be indexed, because "in" is a second word in that pair.
- both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.
For most use cases, both_freq would be the best mode, but your mileage may vary.
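The pair selection performed by each mode can be reproduced with a short Python sketch. The function name select_bigrams is hypothetical and the code only models which adjacent pairs get stored, using the "alone in the dark" example from the list of modes.

```python
def select_bigrams(words, mode, freq_words=()):
    """Return the adjacent word pairs that a given bigram_index mode would store."""
    freq = set(freq_words)
    pairs = list(zip(words, words[1:]))
    if mode == "none":
        return []
    if mode == "all":
        return pairs
    if mode == "first_freq":
        return [pair for pair in pairs if pair[0] in freq]
    if mode == "both_freq":
        return [pair for pair in pairs if pair[0] in freq and pair[1] in freq]
    raise ValueError(f"unknown bigram_index mode: {mode}")


words = "alone in the dark".split()
freq_words = ["the", "in", "i", "a"]
for mode in ("all", "first_freq", "both_freq"):
    print(mode, select_bigrams(words, mode, freq_words))
```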
It's important to note that bigram_index works only at the tokenization level and doesn't account for transformations like morphology, wordforms or stopwords. This means the tokens it creates are very straightforward, which makes searching phrases more exact and strict. While this can improve the accuracy of phrase matching, it also makes the system less able to recognize different forms of words or variations in how words appear.
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'
bigram_freq_words = the, a, you, i
A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.
Some of the bigram indexing modes (see bigram_index) require you to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated during both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.
bigram_freq_words lets you define a list of such keywords.
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'
dict = {keywords|crc}
The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.
Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.
CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:
- Firstly, there's a risk of control sum collisions between different keyword pairs. This risk grows with the number of unique keywords in the index. Nonetheless, this concern is minor: by the birthday approximation for an ideal 64-bit hash, the probability of at least one FNV64 collision in a dictionary of 1 billion entries is roughly 1 in 37, or about 2.7 percent. Most dictionaries will have far fewer than a billion keywords, given that a typical spoken human language has between 1 and 10 million word forms.
- Secondly, and more crucially, it's not straightforward to perform substring searches with control sums. Manticore addressed this issue by pre-indexing all possible substrings as separate keywords (see min_prefix_len, min_infix_len directives). This method even has an added advantage of matching substrings in the fastest way possible. Yet, pre-indexing all substrings significantly increases the index size (often by factors of 3-10x or more) and subsequently affects the indexing time, making substring searches on large indexes rather impractical.
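The collision risk from the first point can be estimated with the birthday approximation. The Python sketch below assumes an ideal, uniformly distributed 64-bit hash (an assumption: real FNV64 output is only approximately uniform), and the function name collision_probability is illustrative.

```python
import math


def collision_probability(n_keywords, hash_bits=64):
    """Birthday approximation: P(at least one collision) = 1 - exp(-n(n-1) / 2N)."""
    space = 2.0 ** hash_bits
    exponent = n_keywords * (n_keywords - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-exponent)


for n in (10_000_000, 1_000_000_000):
    print(f"{n:>13,} keywords: {collision_probability(n):.6%}")
```

Under these assumptions the estimate for one billion keywords is a few percent, and for ten million keywords it drops to a few in a million, which is why collisions are rarely a practical concern.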
The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.
For substring (infix) searches, extended wildcards can be used. Special characters such as ? and % are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that the wildcard operators and REGEX only function with dict=keywords.
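Search-time wildcard expansion can be pictured as matching a pattern against the stored dictionary. The Python sketch below is a conceptual model only, not how Manticore implements it. It assumes that * matches any run of characters, ? matches exactly one character, and % matches zero or one character; the function names and the simple truncation used for expansion_limit are illustrative assumptions.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters (assumed semantics)
        elif ch == "?":
            parts.append(".")    # exactly one character (assumed semantics)
        elif ch == "%":
            parts.append(".?")   # zero or one character (assumed semantics)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand(pattern, dictionary, expansion_limit=0):
    """Return dictionary keywords matching the pattern; 0 means no limit."""
    regex = wildcard_to_regex(pattern)
    matches = [word for word in sorted(dictionary) if regex.fullmatch(word)]
    return matches[:expansion_limit] if expansion_limit else matches


dictionary = {"test", "tests", "testing", "toast", "tst", "run", "runs", "running"}
print(expand("test*", dictionary))
print(expand("run%", dictionary))
print(expand("*st*", dictionary, expansion_limit=2))
```

The more keywords a pattern expands into, the more work the resulting query does, which is the cost that an expansion limit is meant to bound.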
Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing, yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1 to 10 percent. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.
In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).
CREATE TABLE products(title text, price float) dict = 'keywords'
embedded_limit = size
Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.
When you create a table, the above-mentioned files can be either saved externally along with the table or embedded directly into the table. Files sized under embedded_limit get stored into the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by just copying a single file.
With smaller files, such embedding reduces the number of the external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.
- CONFIG
table products {
embedded_limit = 32K
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}
global_idf = /path/to/global.idf
The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).
On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as the BM25 family of factors, the results might be ranked slightly differently depending on which cluster node they reside on.
The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection documents counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.
IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.
You can build an .idf file using the indextool utility, by dumping dictionaries using the --dumpdict dict.txt --stats switch first, then converting those to .idf format using --buildidf, then merging all the .idf files across the cluster using --mergeidf.
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'
hitless_words = {all|path/to/file}
Hitless words list. Optional, allowed values are 'all', or a list file name.
By default, a Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as a hitlist). Hitlists enable phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that cannot be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.
hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.
Hitless index will generally use less space than the respective regular full-text index (about 1.5x can be expected). Both indexing and searching should be faster, at a cost of missing positional query and ranking support.
If used in positional queries (e.g. phrase queries), the hitless words are taken out from them and used as operands without a position. For example, if "hello" and "world" are hitless and "simon" and "says" are not hitless, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.
A positional query that contains only hitless words will result in an empty phrase node, therefore the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all), only boolean matching can be used on the respective index.
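The phrase rewriting described above can be modelled with a small Python sketch. It only reproduces the documented example: the function name rewrite_phrase is hypothetical, and how the engine treats hitless words that sit between positional ones is not covered here.

```python
def rewrite_phrase(phrase, hitless_words):
    """Move hitless words out of a phrase and attach them as plain AND operands."""
    words = phrase.split()
    positional = [w for w in words if w not in hitless_words]
    hitless = [w for w in words if w in hitless_words]
    if not positional:
        # Only hitless words: the phrase node is empty, so nothing can match.
        return None
    parts = ['"' + " ".join(positional) + '"'] + hitless
    return "(" + " & ".join(parts) + ")"


print(rewrite_phrase("simon says hello world", {"hello", "world"}))
print(rewrite_phrase("hello world", {"hello", "world"}))
```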
CREATE TABLE products(title text, price float) hitless_words = 'all'
index_field_lengths = {0|1}
Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).
When index_field_lengths is set to 1, Manticore will:
- create a respective length attribute for every full-text field, sharing the same name but with the __len suffix
- compute a field length (counted in keywords) for every document and store it in the respective attribute
- compute the per-index averages
The length attributes will have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and their values are generally accessible.
BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There's also support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.
CREATE TABLE products(title text, price float) index_field_lengths = '1'
index_token_filter = my_lib.so:custom_blend:chars=@#&
Index-time token filter for full-text indexing. Optional, default is empty.
The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'
overshort_step = {0|1}
Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.
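To make the effect of this option concrete, the Python sketch below assigns keyword positions while skipping words shorter than min_word_len. It is an illustrative model under the description above, not Manticore code, and the function name assign_positions is hypothetical.

```python
def assign_positions(words, min_word_len=1, overshort_step=1):
    """Return (position, word) pairs, skipping words shorter than min_word_len."""
    hits = []
    pos = 0
    for word in words:
        if len(word) < min_word_len:
            # Overshort word: not indexed, but it may still consume a position.
            pos += overshort_step
            continue
        pos += 1
        hits.append((pos, word))
    return hits


words = "the cat in the hat".split()
print(assign_positions(words, min_word_len=3, overshort_step=1))
print(assign_positions(words, min_word_len=3, overshort_step=0))
```

With a step of 1 the gap left by the skipped word is preserved, so a phrase query spanning it does not treat the neighbouring words as adjacent; with a step of 0 the neighbours become adjacent.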
CREATE TABLE products(title text, price float) overshort_step = '1'
phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
Phrase boundary characters list. Optional, default is empty.
This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.
On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.
Phrase boundary condition will be raised if and only if such character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.
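The position arithmetic can be sketched as follows. This Python model is a simplification: it splits on whitespace only, keeps inner punctuation inside the word, and uses the hypothetical function name positions_with_boundaries. A boundary is raised only when a boundary character ends a chunk, that is, when it is followed by a separator or the end of the text.

```python
def positions_with_boundaries(text, boundary_chars=".?!", step=100):
    """Return (position, word) pairs, adding `step` after each phrase boundary."""
    hits = []
    pos = 0
    for raw in text.split():
        word = raw.rstrip(boundary_chars)
        if word:
            pos += 1
            hits.append((pos, word.lower()))
        if word != raw:
            # Trailing boundary character followed by a separator: new phrase.
            pos += step
    return hits


print(positions_with_boundaries("Hello world. Next phrase", step=10))
print(positions_with_boundaries("S.T.A.L.K.E.R game", step=10))
```

Words on opposite sides of a boundary end up more than step positions apart, so a proximity query with a smaller distance stays within a single phrase.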
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'
phrase_boundary_step = 100
Phrase boundary word position increment. Optional, default is 0.
On phrase boundary, current word position will be additionally incremented by this number.
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'
# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch
# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color
Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regular expressions. The regular expressions engine used by Manticore Search is Google's RE2, which is known for its speed and safety. For detailed information on the syntax supported by RE2, you can visit the RE2 syntax guide.
In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.
Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.
Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing (including exceptions), even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.
- SQL
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'
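The ordered, pre-tokenization nature of these rules can be tried out with a small Python sketch. Note the assumptions: it uses Python's `re` module rather than RE2 (the two dialects differ, although these particular patterns are valid in both), and the names `RULES` and `apply_regexp_filters` are made up for the example.

```python
import re

# The two example rules from this section, in the order they are listed.
RULES = [
    (r'\b(\d+)"', r"\1inch"),   # index '13"' as '13inch'
    (r"(blue|red)", "color"),   # index 'blue' or 'red' as 'color'
]


def apply_regexp_filters(text, rules=RULES):
    """Apply each (pattern, replacement) rule to the raw text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_regexp_filters('13" blue laptop'))
print(apply_regexp_filters("bored"))
```

The second call shows a side effect worth keeping in mind: because `(blue|red)` has no word boundaries, this sketch also rewrites the "red" inside "bored". Adding `\b` around the alternation would restrict the rule to whole words.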
Wildcard searching is a common text search type. In Manticore, it is performed at the dictionary level. By default, both plain and RT tables use the keywords dictionary type (dict=keywords). In this mode, words are stored as they are, so enabling wildcarding does not affect the size of the table. When a wildcard search is performed, the dictionary is searched to find all possible expansions of the wildcarded word. This expansion can be computationally expensive at query time when the wildcarded word has many expansions or when the expansions have huge hitlists, especially in the case of infixes, where a wildcard is added at both the start and the end of the word. To avoid such problems, the expansion_limit setting can be used.
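The following Python sketch illustrates the idea of dictionary-level expansion with a cap. It is only a model: the dictionary is a plain `dict` of keyword to document count, the helper `expand_wildcard` is hypothetical, and both the "0 means no limit" convention and the choice to keep the most frequent keywords when the limit is reached are assumptions of this sketch rather than statements about Manticore's internals.

```python
from fnmatch import fnmatchcase


def expand_wildcard(pattern, dictionary, expansion_limit=0):
    """Return dictionary keywords matching a '*' wildcard pattern.

    dictionary maps keyword -> number of documents containing it.
    expansion_limit=0 is treated as "no limit" (assumption).
    """
    matched = [word for word in dictionary if fnmatchcase(word, pattern)]
    # Assumption: prefer the most frequent keywords, ties broken alphabetically.
    matched.sort(key=lambda word: (-dictionary[word], word))
    if expansion_limit > 0:
        matched = matched[:expansion_limit]
    return matched


toy_dictionary = {"run": 50, "runs": 30, "running": 20, "runsy": 1, "rug": 5}
print(expand_wildcard("run*", toy_dictionary))
print(expand_wildcard("*un*", toy_dictionary, expansion_limit=2))
```

Every keyword returned here would have to be matched at query time, which is why a short infix pattern over a large dictionary can become expensive.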
min_prefix_len = length
This setting determines the minimum word prefix length to index and search. By default, it is set to 0, meaning prefixes are not allowed.
Prefixes allow for wildcard searching with wordstart* wildcards.
For example, if the word "example" is indexed with min_prefix_len=3, it can be found by searching for "exa", "exam", "examp", "exampl", as well as the full word.
Note that with dict=crc min_prefix_len will affect the size of the full-text index since each word expansion will be stored additionally.
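As a quick illustration of which prefixes the "example" case covers, here is a small Python sketch. The function `prefixes` is a hypothetical helper. It enumerates the strings involved and does not claim to reproduce how Manticore stores them.

```python
def prefixes(word, min_prefix_len):
    """List the prefixes of word that are at least min_prefix_len long.

    The full word is always included; min_prefix_len <= 0 means prefixes
    are disabled, so only the word itself is returned.
    """
    if min_prefix_len <= 0:
        return [word]
    result = [word[:n] for n in range(min_prefix_len, len(word))]
    result.append(word)
    return result


print(prefixes("example", 3))
```

With dict=crc each of these strings would be stored as an additional entry, which is where the index growth mentioned above comes from.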
Manticore can differentiate perfect word matches from prefix matches and rank the former higher if the following conditions are met:
- dict=keywords (on by default)
- index_exact_words=1 (off by default)
- expand_keywords=1 (also off by default)
Note that with either dict=crc mode or any of the above options disabled, it is not possible to differentiate between prefixes and full words, and perfect word matches cannot be ranked higher.
When the minimum infix length is set to a positive number, the minimum prefix length is always considered 1.
- SQL
CREATE TABLE products(title text, price float) min_prefix_len = '3'
min_infix_len = length
The min_infix_len setting determines the minimum length of an infix to index and search. It is optional, and its default value is 0, which means that infixes are not allowed. The minimum allowed non-zero value is 2.
When enabled, infixes allow for wildcard searching with term patterns like start*, *end, *middle*, and so on. It also allows you to disable too-short wildcards if they are too expensive to search for.
If the following conditions are met, Manticore can differentiate perfect word matches from infix matches and rank the former higher:
- dict=keywords (on by default)
- index_exact_words=1 (off by default)
- expand_keywords=1 (also off by default)
Note that with the dict=crc mode or any of the above options disabled, there is no way to differentiate between infixes and full words, and thus perfect word matches cannot be ranked higher.
Infix wildcard search query time can vary greatly, depending on how many keywords the substring will actually expand to. Short and frequent syllables like *in* or *ti* might expand to way too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches, you would set min_infix_len to 2. To limit the impact of wildcard searches with too-short wildcards, you might set it higher.
Infixes must be at least 2 characters long, and wildcards like *a* are not allowed for performance reasons.
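To get a feel for how quickly infixes multiply compared with prefixes, the sketch below enumerates every substring of a word that meets the minimum length. The helper name `infixes` is hypothetical, and the enumeration is only meant to illustrate the combinatorics, not Manticore's storage format.

```python
def infixes(word, min_infix_len):
    """Return all distinct substrings of word with at least min_infix_len characters."""
    if min_infix_len < 2:
        raise ValueError("the minimum allowed non-zero min_infix_len is 2")
    found = set()
    for start in range(len(word)):
        for end in range(start + min_infix_len, len(word) + 1):
            found.add(word[start:end])
    return sorted(found)


print(infixes("test", 2))
print(len(infixes("example", 3)))
```

For the seven-letter word "example" with a minimum length of 3, this yields 15 substrings, compared with 5 prefixes of the same minimum length.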
When min_infix_len is set to a positive number, the minimum prefix length is considered 1. With dict=keywords, word infixing and prefixing cannot both be enabled at the same time. With dict=crc, some fields can have infixes declared with infix_fields and other fields can have prefixes declared with prefix_fields, but it is forbidden to declare the same field in both lists.
If dict=keywords, besides the wildcard *, two other wildcard characters can be used:
- ? can match any (one) character: t?st will match test, but not teast
- % can match zero or one character: tes% will match tes or test, but not testing
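The semantics of the three wildcard characters can be modelled by translating a pattern into a regular expression. The sketch below does this in Python. The helper names `wildcard_to_regex` and `matches` are assumptions for illustration, and the translation reflects only the rules stated above.

```python
import re


def wildcard_to_regex(pattern):
    """Translate *, ? and % wildcards into an equivalent compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any number of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "%":
            parts.append(".?")   # zero or one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


for pattern, word in [("t?st", "test"), ("t?st", "teast"),
                      ("tes%", "tes"), ("tes%", "test"), ("tes%", "testing")]:
    print(pattern, word, matches(pattern, word))
```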
- SQL
CREATE TABLE products(title text, price float) min_infix_len = '3'
prefix_fields = field1[, field2, ...]
The prefix_fields setting is used to limit prefix indexing to specific full-text fields in dict=crc mode. By default, all fields are indexed in prefix mode, but because prefix indexing can affect both indexing and searching performance, it may be desired to limit it to certain fields.
To limit prefix indexing to specific fields, use the prefix_fields setting followed by a comma-separated list of field names. If prefix_fields is not set, then all fields will be indexed in prefix mode.
- CONFIG
table products {
  prefix_fields = title, name
  min_prefix_len = 3
  dict = crc
}
infix_fields = field1[, field2, ...]
The infix_fields setting allows you to specify a list of full-text fields to limit infix indexing to. This applies to dict=crc only and is optional, with the default being to index all fields in infix mode. This setting is similar to prefix_fields, but instead allows you to limit infix indexing to specific fields.
- CONFIG
table products {
  infix_fields = title, name
  min_infix_len = 3
  dict = crc
}
max_substring_len = length
The max_substring_len directive sets the maximum substring length to be indexed for either prefix or infix searches. This setting is optional, and its default value is 0 (which means that all possible substrings are indexed). It only applies to dict=crc.
By default, substring indexing with dict=crc indexes all possible substrings as separate keywords, which can result in an overly large full-text index. The max_substring_len directive therefore allows you to skip substrings that are too long and will probably never be searched for.
For example, a test table of 10,000 blog posts takes up a different amount of disk space depending on the settings:
- 6.4 MB baseline (no substrings)
- 24.3 MB (3.8x) with min_prefix_len = 3
- 22.2 MB (3.5x) with min_prefix_len = 3, max_substring_len = 8
- 19.3 MB (3.0x) with min_prefix_len = 3, max_substring_len = 6
- 94.3 MB (14.7x) with min_infix_len = 3
- 84.6 MB (13.2x) with min_infix_len = 3, max_substring_len = 8
- 70.7 MB (11.0x) with min_infix_len = 3, max_substring_len = 6
In this test, limiting the maximum substring length reduces the table size by roughly 9% to 25% relative to the unlimited setting, depending on how low the limit is.
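The multipliers and savings for the figures listed above can be recomputed with a few lines of Python. The numbers are taken directly from the list; the variable and function names are illustrative only.

```python
BASELINE_MB = 6.4

# Table sizes in MB, keyed by (substring mode, max_substring_len); None = no limit.
SIZES_MB = {
    ("prefix", None): 24.3,
    ("prefix", 8): 22.2,
    ("prefix", 6): 19.3,
    ("infix", None): 94.3,
    ("infix", 8): 84.6,
    ("infix", 6): 70.7,
}


def growth_factor(size_mb):
    """How many times larger the table is than the no-substring baseline."""
    return round(size_mb / BASELINE_MB, 1)


def saving_percent(mode, max_len):
    """Size reduction relative to the same mode without max_substring_len."""
    unlimited = SIZES_MB[(mode, None)]
    limited = SIZES_MB[(mode, max_len)]
    return round(100 * (unlimited - limited) / unlimited, 1)


for (mode, max_len), size in SIZES_MB.items():
    print(mode, max_len, growth_factor(size))
for mode in ("prefix", "infix"):
    for max_len in (8, 6):
        print(mode, max_len, saving_percent(mode, max_len))
```

For these figures the savings come out at about 8.6% and 20.6% for prefixes, and about 10.3% and 25.0% for infixes, with limits of 8 and 6 respectively.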
When using dict=keywords mode, there is no performance impact associated with substring length. Therefore, this directive is not applicable and is intentionally forbidden in that case. However, if required, you can still limit the length of a substring that you search for in the application code.
- CONFIG
table products {
  max_substring_len = 12
  min_infix_len = 3
  dict = crc
}
expand_keywords = {0|1|exact|star}
This setting expands keywords with their exact forms and/or with stars when possible. The supported values are:
- 1 - expand to both the exact form and the form with the stars. For instance, running will become (running | *running* | =running)
- exact - augment the keyword with only its exact form. For instance, running will become (running | =running)
- star - augment the keyword by adding * around it. For instance, running will become (running | *running*)
This setting is optional, and the default value is 0 (keywords are not expanded).
Queries against tables with the expand_keywords feature enabled are internally expanded as follows: if the table was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of the keyword itself and a respective prefix or infix (keyword with stars). If the table was built with both stemming and index_exact_words enabled, the exact form is also added.
- SQL
CREATE TABLE products(title text, price float) expand_keywords = '1'
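The rewriting rules for each expand_keywords value can be written down as a tiny Python function. This is a sketch of the textual expansion only: the name `expand_keyword` is hypothetical, and the function ignores the table-level conditions (prefix or infix indexing, stemming, index_exact_words) that determine which parts of the expansion actually apply.

```python
def expand_keyword(keyword, mode):
    """Return the expanded query text for one keyword under a given mode.

    mode is one of 0, 1, "exact", "star" (numeric values may be strings).
    """
    mode = str(mode)
    if mode == "0":
        return keyword
    if mode == "1":
        return f"({keyword} | *{keyword}* | ={keyword})"
    if mode == "exact":
        return f"({keyword} | ={keyword})"
    if mode == "star":
        return f"({keyword} | *{keyword}*)"
    raise ValueError(f"unsupported expand_keywords value: {mode!r}")


for mode in (0, 1, "exact", "star"):
    print(mode, expand_keyword("running", mode))
```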
Expanded queries naturally take longer to complete, but can improve search quality, as documents with exact form matches should generally be ranked higher than documents with stemmed or infix matches.
Note that the existing query syntax does not allow emulating this kind of expansion, because internal expansion works at the keyword level and expands keywords within phrase or quorum operators too (which is not possible through the query syntax). Take a look at the examples to see how expand_keywords affects the search result weights and how "runsy" is found by "runs" without the need to add a star:
- expand_keywords_enabled
- expand_keywords_disabled
mysql> create table t(f text) min_infix_len='2' expand_keywords='1' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)
mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    2 | runs    |     1560 |
|    1 | running |     1500 |
|    3 | runsy   |     1500 |
+------+---------+----------+
3 rows in set (0.01 sec)
mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)
mysql> create table t(f text) min_infix_len='2' expand_keywords='exact' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)
mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1590 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)
This directive does not affect the indexer in any way; it only affects searchd.
expansion_limit = number
Maximum number of expanded keywords for a single wildcard. Details are here.