Manticore does not store text as-is for full-text searching. Instead, it extracts words and builds several structures that make full-text searching fast. From the extracted words it builds a dictionary, which allows a quick lookup of whether a word is present in the index. Other structures record the documents and fields in which each word was found, as well as its position within the field. All of these are used when a full-text match is performed.
The process of demarcating and classifying words is called tokenization. Tokenization is applied both at indexing time and at search time, and it operates at the character level and at the word level.
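One way to observe tokenization, as a minimal sketch: the CALL KEYWORDS statement returns the tokens that a piece of text produces under a given table's settings. The table name `products` is an assumption borrowed from the examples later in this document, and the table must already exist.

```sql
/* Show how the text is split into tokens under the settings of the
   (assumed, pre-existing) table 'products' */
CALL KEYWORDS('The quick brown foxes', 'products');
```

Running it against tables created with different settings is a quick way to compare how each setting described below changes the resulting tokens.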
At the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. In addition, characters can be ignored, blended, or defined as a phrase boundary.
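The character-level settings can be sketched as follows. The table and column names reuse the `products` example from later in this document, and the specific character lists are illustrative assumptions, not recommended values.

```sql
/* Illustrative values only:
   charset_table   - keep digits, underscore and Latin letters; map A..Z to a..z
   ignore_chars    - drop the soft hyphen (U+AD) without splitting the word
   blend_chars     - treat '+' and '&' as both separators and valid characters
   phrase_boundary - treat '.', '?' and '!' as phrase boundaries */
CREATE TABLE products(title text, price float)
  charset_table = '0..9, A..Z->a..z, _, a..z'
  ignore_chars = 'U+AD'
  blend_chars = '+, &'
  phrase_boundary = '., ?, !'
```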
At the word level, the basic setting is min_word_len, which defines the minimum length, in characters, a word must have to be accepted into the index. A common request is to match singular and plural forms of words. For this, morphology processors can be used.
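As a small sketch of the word-level settings, the statement below combines min_word_len with the English stemmer listed in the language table further down. The value 2 and the choice of stem_en are assumptions made for illustration.

```sql
/* Skip one-character words and apply Porter's English stemmer,
   so that singular and plural forms can match each other */
CREATE TABLE products(title text, price float)
  min_word_len = '2'
  morphology = 'stem_en'
```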
Going further, we might want one word to match another because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.
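A hedged sketch of enabling word forms: the file path below is a placeholder, and the mappings shown in the comment are made-up examples of what such a file could contain.

```sql
/* The file at this hypothetical path would hold one mapping per line, e.g.:
     walks > walk
     core 2 duo > c2d
*/
CREATE TABLE products(title text, price float)
  wordforms = '/var/lib/manticore/wordforms.txt'
```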
Very common words can have unwanted effects on searching, mostly because their high frequency means a lot of computation is needed to process their doc/hit lists. They can be blacklisted with the stop words functionality. This not only speeds up queries but also decreases the index size.
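For instance, assuming English content, the built-in stopwords alias from the language table below could be enabled like this (table and columns are the same illustrative `products` example used elsewhere in this document):

```sql
/* Use the bundled English stop word list via its alias */
CREATE TABLE products(title text, price float) stopwords = 'en'
```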
A more advanced form of blacklisting is bigrams, which create a special token from a "bigram" (common) word paired with an uncommon word. This can make phrase searches that contain common words several times faster.
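A minimal sketch of a bigram setup follows. The list of frequent words and the choice of the first_freq mode are assumptions for illustration; the appropriate list depends on the data.

```sql
/* Index word pairs whose first word is in the (illustrative) frequent-word list */
CREATE TABLE products(title text, price float)
  bigram_freq_words = 'the, a, you, i'
  bigram_index = 'first_freq'
```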
When indexing HTML content, it's important not to index the HTML tags, as they can introduce a lot of "noise" into the index. HTML stripping can be used for this, and it can be configured to strip the tags but still index certain tag attributes, or to completely ignore the content of certain HTML elements.
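The HTML options might be combined as in the sketch below. Which attributes to index and which elements to drop are assumptions chosen for illustration.

```sql
/* Strip markup, but keep the alt/title attributes of <img> and the title of <a>;
   discard the entire contents of <style> and <script> elements */
CREATE TABLE products(title text, price float)
  html_strip = '1'
  html_index_attrs = 'img=alt,title; a=title;'
  html_remove_elements = 'style, script'
```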
Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cont (which is the default value). The non_cjk option, which is an alias for non_cont, can be used as well: charset_table = non_cjk.
For many languages, Manticore provides a stopwords file that can be used to improve search relevance.
Additionally, advanced morphology is available for a few languages. It can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.
The table below lists all supported languages and indicates how to enable:
- basic support (column "Supported")
- stopwords (column "Stopwords file name")
- advanced morphology (column "Advanced morphology")
Language | Supported | Stopwords file name | Advanced morphology | Notes |
---|---|---|---|---|
Afrikaans | charset_table=non_cont | af | - | |
Arabic | charset_table=non_cont | ar | morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar | |
Armenian | charset_table=non_cont | hy | - | |
Assamese | specify charset_table manually | - | - | |
Basque | charset_table=non_cont | eu | - | |
Bengali | charset_table=non_cont | bn | - | |
Bishnupriya | specify charset_table manually | - | - | |
Buhid | specify charset_table manually | - | - | |
Bulgarian | charset_table=non_cont | bg | - | |
Catalan | charset_table=non_cont | ca | morphology=libstemmer_ca | |
Chinese using ICU | charset_table=chinese | zh | morphology=icu_chinese | More accurate than using ngrams |
Chinese using Jieba | charset_table=chinese | zh | morphology=jieba_chinese | More accurate than using ngrams |
Chinese using ngrams | ngram_chars=chinese | zh | ngram_chars=chinese ngram_len=1 | Faster indexing, but the search performance might not be as good |
Croatian | charset_table=non_cont | hr | - | |
Kurdish | charset_table=non_cont | ckb | - | |
Czech | charset_table=non_cont | cz | morphology=stem_cz (Czech stemmer) | |
Danish | charset_table=non_cont | da | morphology=libstemmer_da | |
Dutch | charset_table=non_cont | nl | morphology=libstemmer_nl | |
English | charset_table=non_cont | en | morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer) | |
Esperanto | charset_table=non_cont | eo | - | |
Estonian | charset_table=non_cont | et | - | |
Finnish | charset_table=non_cont | fi | morphology=libstemmer_fi | |
French | charset_table=non_cont | fr | morphology=libstemmer_fr | |
Galician | charset_table=non_cont | gl | - | |
Garo | specify charset_table manually | - | - | |
German | charset_table=non_cont | de | morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de | |
Greek | charset_table=non_cont | el | morphology=libstemmer_el | |
Hebrew | charset_table=non_cont | he | - | |
Hindi | charset_table=non_cont | hi | morphology=libstemmer_hi | |
Hmong | specify charset_table manually | - | - | |
Ho | specify charset_table manually | - | - | |
Hungarian | charset_table=non_cont | hu | morphology=libstemmer_hu | |
Indonesian | charset_table=non_cont | id | morphology=libstemmer_id | |
Irish | charset_table=non_cont | ga | morphology=libstemmer_ga | |
Italian | charset_table=non_cont | it | morphology=libstemmer_it | |
Japanese | ngram_chars=japanese | - | ngram_chars=japanese ngram_len=1 | Requires ngram-based segmentation |
Komi | specify charset_table manually | - | - | |
Korean | ngram_chars=korean | - | ngram_chars=korean ngram_len=1 | Requires ngram-based segmentation |
Large Flowery Miao | specify charset_table manually | - | - | |
Latin | charset_table=non_cont | la | - | |
Latvian | charset_table=non_cont | lv | - | |
Lithuanian | charset_table=non_cont | lt | morphology=libstemmer_lt | |
Maba | specify charset_table manually | - | - | |
Maithili | specify charset_table manually | - | - | |
Marathi | specify charset_table manually | - | - | |
Marathi | charset_table=non_cont | mr | - | |
Mende | specify charset_table manually | - | - | |
Mru | specify charset_table manually | - | - | |
Myene | specify charset_table manually | - | - | |
Nepali | specify charset_table manually | - | morphology=libstemmer_ne | |
Ngambay | specify charset_table manually | - | - | |
Norwegian | charset_table=non_cont | no | morphology=libstemmer_no | |
Odia | specify charset_table manually | - | - | |
Persian | charset_table=non_cont | fa | - | |
Polish | charset_table=non_cont | pl | - | |
Portuguese | charset_table=non_cont | pt | morphology=libstemmer_pt | |
Romanian | charset_table=non_cont | ro | morphology=libstemmer_ro | |
Russian | charset_table=non_cont | ru | morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer) | |
Santali | specify charset_table manually | - | - | |
Sindhi | specify charset_table manually | - | - | |
Slovak | charset_table=non_cont | sk | - | |
Slovenian | charset_table=non_cont | sl | - | |
Somali | charset_table=non_cont | so | - | |
Sotho | charset_table=non_cont | st | - | |
Spanish | charset_table=non_cont | es | morphology=libstemmer_es | |
Swahili | charset_table=non_cont | sw | - | |
Swedish | charset_table=non_cont | sv | morphology=libstemmer_sv | |
Sylheti | specify charset_table manually | - | - | |
Tamil | specify charset_table manually | - | morphology=libstemmer_ta | |
Thai | charset_table=thai | th | - | |
Turkish | charset_table=non_cont | tr | morphology=libstemmer_tr | |
Ukrainian | charset_table=non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 | - | morphology=lemmatize_uk_all | Requires installation of UK lemmatizer |
Yoruba | charset_table=non_cont | yo | - | |
Zulu | charset_table=non_cont | zu | - | |
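To show how the three columns combine, here is a sketch for German using the values from its table row. The table and column names are illustrative, and the libstemmer-based morphology is assumed to be available in the installed build.

```sql
/* German: basic support + bundled stop words + libstemmer morphology */
CREATE TABLE products(title text, price float)
  charset_table = 'non_cont'
  stopwords = 'de'
  morphology = 'libstemmer_de'
```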
Manticore provides built-in support for indexing languages with continuous scripts (i.e., languages that do not use spaces or other marks between words or sentences). This allows you to process texts in these languages in the following ways:
- Precise segmentation using the ICU library. Currently, only Chinese is supported.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'
- Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'
- Basic support using the N-gram options ngram_len and ngram_chars.
For each language using a continuous script, there are separate character set tables (chinese, korean, japanese, thai) that can be used. Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk charset to include all CJK languages only.
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'
Additionally, there is built-in support for Chinese stopwords with the alias zh.
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'