Supported languages

Manticore supports many languages. Basic support for most of them is enabled by default via charset_table = non_cjk (which is the default value).

For many languages we provide a stopwords file (you can also use your own), which you can use to improve search relevance.

For some languages advanced morphology is available, which can improve search relevance significantly through better segmentation and normalization using dictionary-based lemmatization or stemming algorithms.

The table below contains a complete list of supported languages. You can use it to find out how to enable:

  • basic support (column "Supported")
  • stopwords (column "Stopwords")
  • advanced morphology (column "Advanced morphology")
| Language | Supported | Stopwords file name | Advanced morphology | Notes |
|---|---|---|---|---|
| Afrikaans | charset_table=non_cjk | af | - | |
| Arabic | charset_table=non_cjk | ar | morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar | |
| Armenian | charset_table=non_cjk | hy | - | |
| Assamese | specify charset_table manually | - | - | |
| Basque | charset_table=non_cjk | eu | - | |
| Bengali | charset_table=non_cjk | bn | - | |
| Bishnupriya | specify charset_table manually | - | - | |
| Buhid | specify charset_table manually | - | - | |
| Bulgarian | charset_table=non_cjk | bg | - | |
| Catalan | charset_table=non_cjk | ca | morphology=libstemmer_ca | |
| Chinese | charset_table=chinese or ngram_chars=chinese | zh | morphology=icu_chinese or ngram_len=1 correspondingly | ICU dictionary-based segmentation is much more accurate than ngram-based |
| Croatian | charset_table=non_cjk | hr | - | |
| Kurdish | charset_table=non_cjk | ckb | - | |
| Czech | charset_table=non_cjk | cz | morphology=stem_cz (Czech stemmer) | |
| Danish | charset_table=non_cjk | da | morphology=libstemmer_da | |
| Dutch | charset_table=non_cjk | nl | morphology=libstemmer_nl | |
| English | charset_table=non_cjk | en | morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer) | |
| Esperanto | charset_table=non_cjk | eo | - | |
| Estonian | charset_table=non_cjk | et | - | |
| Finnish | charset_table=non_cjk | fi | morphology=libstemmer_fi | |
| French | charset_table=non_cjk | fr | morphology=libstemmer_fr | |
| Galician | charset_table=non_cjk | gl | - | |
| Garo | specify charset_table manually | - | - | |
| German | charset_table=non_cjk | de | morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de | |
| Greek | charset_table=non_cjk | el | morphology=libstemmer_el | |
| Hebrew | charset_table=non_cjk | he | - | |
| Hindi | charset_table=non_cjk | hi | morphology=libstemmer_hi | |
| Hmong | specify charset_table manually | - | - | |
| Ho | specify charset_table manually | - | - | |
| Hungarian | charset_table=non_cjk | hu | morphology=libstemmer_hu | |
| Indonesian | charset_table=non_cjk | id | morphology=libstemmer_id | |
| Irish | charset_table=non_cjk | ga | morphology=libstemmer_ga | |
| Italian | charset_table=non_cjk | it | morphology=libstemmer_it | |
| Japanese | ngram_chars=japanese | - | ngram_chars=japanese ngram_len=1 | Requires ngram-based segmentation |
| Komi | specify charset_table manually | - | - | |
| Korean | ngram_chars=korean | - | ngram_chars=korean ngram_len=1 | Requires ngram-based segmentation |
| Large Flowery Miao | specify charset_table manually | - | - | |
| Latin | charset_table=non_cjk | la | - | |
| Latvian | charset_table=non_cjk | lv | - | |
| Lithuanian | charset_table=non_cjk | lt | morphology=libstemmer_lt | |
| Maba | specify charset_table manually | - | - | |
| Maithili | specify charset_table manually | - | - | |
| Marathi | specify charset_table manually | - | - | |
| Marathi | charset_table=non_cjk | mr | - | |
| Mende | specify charset_table manually | - | - | |
| Mru | specify charset_table manually | - | - | |
| Myene | specify charset_table manually | - | - | |
| Nepali | specify charset_table manually | - | morphology=libstemmer_ne | |
| Ngambay | specify charset_table manually | - | - | |
| Norwegian | charset_table=non_cjk | no | morphology=libstemmer_no | |
| Odia | specify charset_table manually | - | - | |
| Persian | charset_table=non_cjk | fa | - | |
| Polish | charset_table=non_cjk | pl | - | |
| Portuguese | charset_table=non_cjk | pt | morphology=libstemmer_pt | |
| Romanian | charset_table=non_cjk | ro | morphology=libstemmer_ro | |
| Russian | charset_table=non_cjk | ru | morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer) | |
| Santali | specify charset_table manually | - | - | |
| Sindhi | specify charset_table manually | - | - | |
| Slovak | charset_table=non_cjk | sk | - | |
| Slovenian | charset_table=non_cjk | sl | - | |
| Somali | charset_table=non_cjk | so | - | |
| Sotho | charset_table=non_cjk | st | - | |
| Spanish | charset_table=non_cjk | es | morphology=libstemmer_es | |
| Swahili | charset_table=non_cjk | sw | - | |
| Swedish | charset_table=non_cjk | sv | morphology=libstemmer_sv | |
| Sylheti | specify charset_table manually | - | - | |
| Tamil | specify charset_table manually | - | morphology=libstemmer_ta | |
| Thai | charset_table=non_cjk | th | - | |
| Turkish | charset_table=non_cjk | tr | morphology=libstemmer_tr | |
| Yoruba | charset_table=non_cjk | yo | - | |
| Zulu | charset_table=non_cjk | zu | - | |
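For example, combining the columns of the table for English, you could enable basic support, English stopwords and English lemmatization at once. A minimal sketch (assuming the English lemmatizer dictionary is installed):

create table products(title text, price float) charset_table = 'non_cjk' stopwords = 'en' morphology = 'lemmatize_en'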

Chinese, Japanese and Korean (CJK) languages

Manticore has built-in support for indexing CJK texts. There are two ways CJK text can be processed:

  • Precise segmentation using ICU library (only Chinese is supported for now)
create table products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'
  • Basic support with the N-gram options ngram_len and ngram_chars. There are separate charset tables (chinese, korean, japanese) that can be used for each CJK language or, alternatively, the common cjk charset table can be applied.
create table products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

There are also built-in stopwords for Chinese, available under the alias zh.

create table products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

Low-level tokenization

When indexing text, Manticore splits it into words and applies case folding so that, for example, "Abc", "ABC" and "abc" are treated as the same word.

To do that properly Manticore needs to know:

  • what encoding the source text is in (it should always be UTF-8)
  • what characters are letters and what are not
  • what letters should be folded to other letters

This can be configured on a per-index basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or to any other characters, if you like). Characters that are not in the array are considered non-letters and will be treated as word separators when indexing or searching through this index.

The default character set is non_cjk and includes most languages.

You can also specify text pattern replacement rules. For example, given the rules:

regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR

the text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" as PLANK 2 INCH x 4 INCH. The rules are applied in the given order. Text in queries is also replaced; a search for BLUE TUBE would actually become a search for COLOR TUBE.

Read more about regexp_filter here.

Index configuration options


# default
charset_table = non_cjk

# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

# english charset defined with alias
charset_table = 0..9, english, _

Accepted characters array, with case folding rules. Optional, default value is non_cjk.

charset_table is the main workhorse of Manticore's tokenization process, i.e. the process of extracting keywords from document text or query text. It controls which characters are accepted as valid and which are not, and how the accepted characters should be transformed (e.g. whether case should be folded or not).

You can think of charset_table as a big table or array that has a mapping for each and every of the 100K+ characters in Unicode. By default, every character maps to 0, which means that it does not occur within keywords and should be treated as a separator. Once mentioned in the table, a character is mapped to some other character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.

The expected value format is a comma-separated list of mappings. The two simplest mappings declare a character as valid and map a single character to another single character, respectively. But specifying the whole table in such a form would result in a bloated and barely manageable specification. So there are several syntax shortcuts that let you map ranges of characters at once. The complete list is as follows:

  • A->a - Single char mapping, declares source char 'A' as allowed to occur within keywords and maps it to destination char 'a' (but does not declare 'a' as allowed).
  • A..Z->a..z - Range mapping, declares all chars in source range as allowed and maps them to the destination range. Does not declare destination range as allowed. Also checks range's lengths (the lengths must be equal).
  • a - Stray char mapping, declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
  • a..z - Stray range mapping, declares all characters in range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
  • A..Z/2 - Checkerboard range map. Maps every pair of chars to the second char. More formally, declares odd characters in range as allowed and maps them to the even ones; also declares even characters as allowed and maps them to themselves. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for a number of Unicode blocks where uppercase and lowercase letters go in such interleaved order instead of contiguous chunks.

Control characters with codes from 0 to 32 are always treated as separators. Characters with codes 33 to 127, i.e. 7-bit ASCII characters, can be used in the mappings as is. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+xxx form, where xxx is the hexadecimal codepoint number. This form can also be used for 7-bit ASCII characters to encode special ones: e.g. use U+2E to encode a dot, U+2C to encode a comma.

create table products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
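As a sketch of the checkerboard shortcut, the Latin Extended-A block interleaves uppercase and lowercase letters, so a single /2 range can case-fold it (the U+100..U+137 range here is an assumption; verify it against the Unicode chart for your data):

charset_table = 0..9, A..Z->a..z, _, a..z, U+100..U+137/2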

Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:

  • english
  • russian
  • non_cjk
  • cjk
create table products(title text, price float) charset_table = '0..9, english, _'

So if you want your search to support different languages, you will need to define sets of valid characters and folding rules for all of them, which can be quite a laborious task. We have performed this task for you by preparing default charset tables, non_cjk and cjk, that cover non-CJK and CJK languages respectively. These charsets should be sufficient in most cases.

The languages that are currently not supported are:

  • Assamese
  • Bishnupriya
  • Buhid
  • Garo
  • Hmong
  • Ho
  • Komi
  • Large Flowery Miao
  • Maba
  • Maithili
  • Marathi
  • Mende
  • Mru
  • Myene
  • Ngambay
  • Odia
  • Santali
  • Sindhi
  • Sylheti

All other languages listed in the Unicode languages list are supported by default.

To be able to work with both CJK and non-CJK languages, set the options as shown below:

create table products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

If you don't need support for CJK languages, you can simply omit the ngram_len and ngram_chars options. For more information on those, see the corresponding documentation sections.

If you need to map one character to multiple characters, or vice versa, regexp_filter can be helpful.


blend_chars = +, &, U+23
blend_chars = +, &->+

Blended characters list. Optional, default is empty.

Blended characters are indexed both as separators and valid characters. For instance, assume that & is configured as blended and AT&T occurs in an indexed document. Three different keywords will get indexed, namely at&t, treating blended characters as valid, plus at and t, treating them as separators.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, so regular keywords will be indexed just as if there was no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs at the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1 ("blending" with the opening regular keyword). Thus, querying for either AT&T or just AT will match that document, and querying for "AT T" as a phrase will also match it. Last but not least, a phrase query for "AT&T company" will also match it, despite the gap in positions.

Blended characters can overlap with special characters used in query syntax (think of T-Mobile or @twitter). Where possible, the query parser will automatically handle a blended character as blended. For instance, "hello @twitter" within quotes (a phrase operator) will handle the @-sign as blended, because the @-syntax for the field operator is not allowed within phrases. Otherwise, the character will be handled as an operator. So you might want to escape the keywords.

Blended characters can be remapped, so that multiple different blended characters could be normalized into just one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.

create table products(title text, price float) blend_chars = '+, &, U+23, @->_'


blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | skip_pure

Blended tokens indexing mode. Optional, default is trim_none.

By default, tokens that mix blended and non-blended characters get indexed in their entirety. For instance, when both the at-sign and the exclamation mark are in blend_chars, @dude! will result in two tokens being indexed: @dude! (with all the blended characters) and dude (without any). Therefore, a query for @dude will not match it.

The blend_mode directive adds flexibility to this indexing behavior. It takes a comma-separated list of options.

Options specify token indexing variants. If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with whitespace) are always indexed.

  • trim_none - Index the entire token
  • trim_head - Trim heading blended characters, and index the resulting token
  • trim_tail - Trim trailing blended characters, and index the resulting token
  • trim_both - Trim both heading and trailing blended characters, and index the resulting token
  • skip_pure - Do not index the token if it's purely blended, that is, consists of blended characters only

Returning to the @dude! example above, setting blend_mode = trim_head, trim_tail will result in two tokens being indexed: @dude and dude!. In this particular example, trim_both would have no effect, because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that the dot is blended too) would result in U.S.A being indexed. Last but not least, skip_pure enables you to fully ignore sequences of blended characters only. For example, one @@@ two would be indexed exactly as one two, and would match that as a phrase. That is not the case by default, because a fully blended token gets indexed and offsets the second keyword position.
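A sketch matching the @dude! walkthrough above, blending the at-sign and the exclamation mark and enabling both trimming variants:

create table products(title text, price float) blend_chars = '@, !' blend_mode = 'trim_head, trim_tail'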

Default behavior is to index the entire token, equivalent to blend_mode = trim_none.

create table products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'


min_word_len = length

Minimum indexed word length. Optional, default is 1 (index everything).

Only those words that are not shorter than this minimum will be indexed. For instance, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.

create table products(title text, price float) min_word_len = '4'


ngram_len = 1

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.

N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there may be no clear separators between the words. In some cases you might not want to use dictionary-based segmentation such as the one available for Chinese. In those cases n-gram segmentation can work well too.

When this feature is enabled, streams of CJK characters (or any others defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some CJK characters) and ngram_len is 1, it will be indexed as if it was "A B C D E F". Only ngram_len=1 is supported at the moment. Only those characters that are listed in the ngram_chars table will be split this way; others will not be affected.

Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in a "B C" "D E F" query, still with quotes acting as the phrase matching operator. And it will match the text even though there were no separators in the text.
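A sketch of such a segmented query in SQL, with BC and DEF standing in for CJK words; each quoted word is internally split into 1-grams and matched as a phrase:

SELECT * FROM products WHERE MATCH('"BC" "DEF"');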

Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase based ranking: it will pull closer phrase matches (which in case of N-gram CJK words can mean closer multi-character word matches) to the top.

create table products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'


ngram_chars = cjk

ngram_chars = cjk, U+3000..U+2FA1F

N-gram characters list. Optional, default is empty.

Used in conjunction with ngram_len, this list defines the characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by the N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in charset_table.

create table products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'

You can also use an alias for our default N-gram table, as in the example. It should be sufficient in most cases.

create table products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'


ignore_chars = U+AD

Ignored characters list. Optional, default is empty.

Useful in cases when some characters, such as the soft hyphenation mark (U+00AD), should not just be treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, the "abc-def" text will be indexed as the "abc" and "def" keywords. On the contrary, if '-' is added to the ignore_chars list, the same text will be indexed as a single "abcdef" keyword.

The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.

create table products(title text, price float) ignore_chars = 'U+AD'


bigram_index = {none|all|first_freq|both_freq}

Bigram indexing mode. Optional, default is none.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent word pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

bigram_index controls the selection of specific word pairs. The known modes are:

  • all, index every single word pair
  • first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing the "alone in the dark" text will result in the "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword ("in" or "the" respectively), but "alone in" would not be indexed, because "in" is the second word in that pair.
  • both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.

For most use cases, both_freq would be the best mode, but your mileage may vary.

create table products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'


bigram_freq_words = the, a, you, i

A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.

Some of the bigram indexing modes (see bigram_index) require a list of frequent keywords to be defined. These are not to be confused with stop words. Stop words are completely eliminated during both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index the current word pair or not.

bigram_freq_words lets you define a list of such keywords.

create table products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'


dict = {keywords|crc}

The keywords dictionary type. Known values are 'crc' and 'keywords'. Optional, default is 'keywords'.

Keywords dictionary mode (dict=keywords) greatly reduces indexing impact and enables substring searches on huge collections. This mode is supported both for plain and RT indexes.

CRC dictionaries never store the original keyword text in the index. Instead, keywords are replaced with their control sum value (calculated using FNV64) both when searching and indexing, and that value is used internally in the index.

That approach has two drawbacks. First, there is a chance of control sum collisions between several pairs of different keywords, growing quadratically with the number of unique keywords in the index. However, it is not a big concern, as the chance of a single FNV64 collision in a dictionary of 1 billion entries is approximately 1:16, or 6.25 percent, and most dictionaries will be much more compact than a billion keywords, as a typical spoken human language has in the region of 1 to 10 million word forms. Second, and more importantly, substring searches are not directly possible with control sums. Manticore alleviates that by pre-indexing all the possible substrings as separate keywords (see the min_prefix_len and min_infix_len directives). That actually has an added benefit of matching substrings in the quickest way possible. But at the same time pre-indexing all substrings grows the index size a lot (factors of 3-10x and even more would not be unusual) and impacts indexing time accordingly, rendering substring searches on big indexes rather impractical.

The keywords dictionary fixes both these drawbacks. It stores the keywords in the index and performs search-time wildcard expansion. For example, a search for a 'test*' prefix could internally expand to a 'test|tests|testing' query based on the dictionary contents. That expansion is fully transparent to the application, except that the separate per-keyword statistics for all the actually matched keywords will now also be reported.

For substring (infix) searches, extended wildcards may be used: special symbols like '?' and '%' are supported (e.g. "t?st*", "run%", "*abc*"). Note, however, that these wildcards only work with dict=keywords, and not elsewhere.
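A sketch of such a wildcard search, assuming min_infix_len is also set so that infix expansion is available:

create table products(title text, price float) dict = 'keywords' min_infix_len = '2'
SELECT * FROM products WHERE MATCH('t?st*');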

Indexing with a keywords dictionary should be 1.1x to 1.3x slower compared to regular, non-substring indexing, but several times faster compared to substring indexing (either prefix or infix). Index size should only be slightly bigger than that of the regular non-substring index, with a 1-10% total difference. Regular keyword searching time should be very close or identical across all three discussed index kinds (CRC non-substring, CRC substring, keywords). Substring searching time can vary greatly depending on how many actual keywords match the given substring (in other words, into how many keywords the search term expands). The maximum number of keywords matched is restricted by the expansion_limit directive.

Essentially, keywords and CRC dictionaries represent two different trade-offs for substring searching. You can choose to either sacrifice indexing time and index size in favor of top-speed worst-case searches (CRC dictionary), or only slightly impact indexing time but sacrifice worst-case searching time when the prefix expands into very many keywords (keywords dictionary).

create table products(title text, price float) dict = 'keywords'


embedded_limit = size

Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.

When you create an index, the above-mentioned files can either be saved externally along with the index or embedded directly into the index. Files sized under embedded_limit get stored in the index. For bigger files, only the file names are stored. This also simplifies moving index files to a different machine; you may get by with just copying a single file.

With smaller files, such embedding reduces the number of the external files on which the index depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta index. So there needs to be a size threshold, and embedded_limit is that threshold.

index products {
  embedded_limit = 32K

  type = rt
  path = idx
  rt_field = title
  rt_attr_uint = price
}


global_idf = /path/to/global.idf

The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).

On a multi-index cluster, per-keyword frequencies are quite likely to differ across different indexes. That means that when the ranking function uses TF-IDF based values, such as the BM25 family of factors, the results might be ranked slightly differently depending on which cluster node they reside on.

The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given index and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection documents counts from the global_idf file, rather than just the local index. That way, IDFs and the values that depend on them will stay consistent across the cluster.

IDF files can be shared across multiple indexes. Only a single copy of an IDF file will be loaded by searchd, even when many indexes refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.

You can build an .idf file using the indextool utility: first dump the dictionaries using the --dumpdict dict.txt --stats switch, then convert them to .idf format using --buildidf, and finally merge all the .idf files across the cluster using --mergeidf.
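A command-line sketch of that pipeline; the exact switch spelling (in particular --out) is an assumption to verify against your indextool version:

indextool --dumpdict products --stats > products.dict.txt  # dump the 'products' dictionary with stats (output redirection assumed)
indextool --buildidf products.dict.txt --out products.idf  # convert the dump to .idf format
indextool --mergeidf node1.idf node2.idf --out global.idf  # merge per-node .idf files into one cluster-wide file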

create table products(title text, price float) global_idf = '/usr/local/sphinx/var/global.idf'


hitless_words = {all|path/to/file}

Hitless words list. Optional, allowed values are 'all', or a list file name.

By default, a Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as a hitlist). Hitlists enable phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that cannot be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) or phrase ranking.

hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.

A hitless index will generally use less space than the respective regular index (about 1.5x savings can be expected). Both indexing and searching should be faster, at the cost of missing positional query and ranking support. When searching, positional queries (e.g. phrase queries) will be automatically converted to respective non-positional (document-level) or combined queries. For instance, if the keywords "hello" and "world" are hitless, the "hello world" phrase query will be converted to the (hello & world) bag-of-words query, matching all documents that mention either of the keywords but not necessarily the exact phrase. And if, in addition, the keywords "simon" and "says" are not hitless, "simon says hello world" will be converted to the ("simon says" & hello & world) query, matching all documents that contain "hello" and "world" anywhere in the document, and also "simon says" as an exact phrase.

create table products(title text, price float) hitless_words = 'all'
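To skip positions only for specific keywords rather than all of them, point the directive at a plain-text word list (the path below is hypothetical):

create table products(title text, price float) hitless_words = '/usr/local/manticore/hitless_words.txt'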


index_field_lengths = {0|1}

Enables computing and storing of field lengths (both per-document and average per-index values) into the index. Optional, default is 0 (do not compute and store).

When index_field_lengths is set to 1 Manticore will:

  • create a respective length attribute for every full-text field, sharing the same name but with __len suffix
  • compute a field length (counted in keywords) for every document and store it into the respective attribute
  • compute the per-index averages. The length attributes will have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and they are generally accessible.

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There's also support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.

create table products(title text, price float) index_field_lengths = '1'
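A sketch of using the stored lengths at search time via the expression ranker; the bm25f(k1, b) factor call is assumed from the ranking documentation and requires index_field_lengths to be enabled:

SELECT *, WEIGHT() FROM products WHERE MATCH('hello') OPTION ranker=expr('10000 * bm25f(1.2, 0.75)');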


index_token_filter =

Index-time token filter for index. Optional, default is empty.

The index-time token filter gets created by indexer when indexing source data into a plain index, or by an RT index when processing INSERT or REPLACE statements, and lets you implement a custom tokenizer that produces tokens according to custom rules. Plugins are defined as library name:plugin name:optional string of settings.

create table products(title text, price float) index_token_filter = ''
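For instance, a sketch following the library:plugin:settings pattern with hypothetical plugin and library names:

create table products(title text, price float) index_token_filter = 'mylib.so:email_process:field=email'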


overshort_step = {0|1}

Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.

create table products(title text, price float) overshort_step = '1'


phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis

Phrase boundary characters list. Optional, default is empty.

This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.

On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.

The phrase boundary condition will only be raised if such a character is followed by a separator; this is to prevent abbreviations such as S.T.A.L.K.E.R. or URLs from being treated as several phrases.

create table products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'
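With boundaries and a step of 10 in place, a proximity query narrower than the step effectively behaves as a same-phrase search (a sketch using the standard proximity operator):

SELECT * FROM products WHERE MATCH('"red tube"~5');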


phrase_boundary_step = 100

Phrase boundary word position increment. Optional, default is 0.

On phrase boundary, current word position will be additionally incremented by this number.

create table products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'


# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch

# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

Regular expressions (regexps) to filter the fields and queries with. Optional, multi-value, default is an empty list of regexps.

In certain applications (like product search) there can be many different ways to refer to a model, a product, a property, and so on. For instance, 'iphone 3gs' and 'iphone 3 gs' (or even 'iphone3 gs') are very likely to mean the same product. Or, for a trickier example, '13-inch', '13 inch', '13"', and '13in' in laptop screen size descriptions all mean the same thing.

Regexps provide you with a mechanism to specify a number of rules specific to your application to handle such cases. In the first 'iphone 3gs' example, you could possibly get away with a wordforms file tailored to handle a handful of iPhone models. However, even in the comparatively simple second '13-inch' example there are just way too many individual forms, and you are better off specifying rules that normalize both '13-inch' and '13in' to something identical.

Regular expressions listed in regexp_filter are applied in the order they are listed. That happens at the earliest stage possible, before any other processing, even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.

create table products(title text, price float) regexp_filter = '(blue|red) => color'