Stop words are the words that are skipped during indexing and searching. Typically you'd put most frequent words to the stop words list, because they do not add much value to search results but consume a lot of resources to process.
Small enough files are stored in the index header, see embedded_limit for details.
While stop words are not indexed, they still do affect the keyword positions. For instance, assume that "the" is a stop word, that document 1 contains the line "in office", and that document 2 contains "in the office". Searching for "in office" as for an exact phrase will only return the first document, as expected, even though "the" in the second one is skipped as a stop word. That behavior can be tweaked through the stopword_step directive.
stopwords=path/to/stopwords/file[ path/to/another/file ...]
Stop word files list (space separated). Optional, default is empty. You can specify several file names, separated by spaces. All the files will be loaded. In RT mode only absolute paths are allowed.
Stop words file format is simple plain text. The encoding must be UTF-8. File data will be tokenized with respect to charset_table settings, so you can use the same separators as in the indexed data.
Stop word files can either be created manually, or semi-automatically. indexer provides a mode that creates a frequency dictionary of the index, sorted by the keyword frequency, see --buildstops and --buildfreqs switch for details. Top keywords from that dictionary can usually be used as stop words.
create table products(title text, price float) stopwords = '/usr/local/sphinx/data/stopwords.txt /usr/local/sphinx/data/stopwords-ru.txt /usr/local/sphinx/data/stopwords-en.txt'
Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:
- af - Afrikaans
- ar - Arabic
- bg - Bulgarian
- bn - Bengali
- ca - Catalan
- ckb- Kurdish
- cz - Czech
- da - Danish
- de - German
- el - Greek
- en - English
- eo - Esperanto
- es - Spain
- et - Estonian
- eu - Basque
- fa - Persian
- fi - Finnish
- fr - French
- ga - Irish
- gl - Galician
- hi - Hindi
- he - Hebrew
- hr - Croatian
- hu - Hungarian
- hy - Armenian
- id - Indonesian
- it - Italian
- ja - Japanese
- ko - Korean
- la - Latin
- lt - Lithuanian
- lv - Latvian
- mr - Marathi
- nl - Dutch
- no - Norwegian
- pl - Polish
- pt - Portuguese
- ro - Romanian
- ru - Russian
- sk - Slovak
- sl - Slovenian
- so - Somali
- st - Sotho
- sv - Swedish
- sw - Swahili
- th - Thai
- tr - Turkish
- yo - Yoruba
- zh - Chinese
- zu - Zulu
For example, to use stop words for Italian language just put the following line in your config file:
create table products(title text, price float) stopwords = 'it'
If you need to use stop words for multiple languages you should list all their aliases, separated with commas (in RT mode) or spaces (plain mode):
create table products(title text, price float) stopwords = 'en, it, ru'
Position increment on stopwords. Optional, allowed values are 0 and 1, default is 1.
create table products(title text, price float) stopwords = 'en' stopword_step = '1'
Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).
By default, stop words are stemmed themselves, and applied to tokens after stemming (or any other morphology processing). In other words, by default, a token is stopped when stem(token) is equal to stem(stopword). That can lead to unexpected results when a token gets (erroneously) stemmed to a stopped root. For example, 'Andes' might get stemmed to 'and', so when 'and' is a stopword, 'Andes' is also skipped.
stopwords_unstemmed directive changed this behaviour. When it's enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when token is equal to stopword.
create table products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'
Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.
wordforms = path/to/wordforms.txt wordforms = path/to/alternateforms.txt wordforms = path/to/dict*.txt
Word forms dictionary. Optional, default is empty.
The dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain index to pick up changes in wordforms file it's required to rotate the index.
Word forms support in Manticore is designed to support big dictionaries well. They moderately affect indexing speed: for instance, a dictionary with 1 million entries slows down indexing about 1.5 times. Searching speed is not affected at all. Additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across indexes: i.e. if the very same 50 MB wordforms file is specified for 10 different indexes, additional
searchd RAM usage will be about 50 MB.
Dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by "greater" sign. Rules from the charset_table will be applied when the file is loaded. So basically it's as case sensitive as your other full-text indexed data, ie. typically case insensitive. Here's the file contents sample:
walks > walk walked > walk walking > walk
There is a bundled Spelldump utility that helps you create a dictionary file in the format Manticore can read from source
.aff dictionary files in
MySpell format (as bundled with OpenOffice).
You can map several source words to a single destination word. Because the work happens on tokens, not the source text, differences in whitespace and markup are ignored.
You can use
=> instead of
>. Comments (starting with
# are also allowed. Finally, if a line starts with a tilde (
~) the wordform will be applied after morphology, instead of before.
core 2 duo > c2d e6600 > c2d core 2duo => c2d # Some people write '2duo' together... ~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'
You can specify multiple destination tokens:
s02e02 > season 2 episode 2 s3 e3 > season 3 episode 3
You can specify several files and not only just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order.
In RT mode only absolute paths are allowed.
If multi-byte codepages are used, and file names can include foreign characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in several files, the latter one is used, and it overrides previous definitions.
create table products(title text, price float) wordforms = '/usr/local/sphinx/data/wordforms.txt' wordforms = '/usr/local/sphinx/data/alternateforms.txt /usr/local/sphinx/private/dict*.txt'
Exceptions (also known as synonyms) allow to map one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping, but have a number of important differences.
Short summary of the differences from wordforms is as follows:
|Case sensitive||Case insensitive|
|Can use special characters that are not in charset_table||Fully obey charset_table|
|Underperform on huge dictionaries||Designed to handle millions of entries|
exceptions = path/to/exceptions.txt
Tokenizing exceptions file. Optional, default is empty. In RT mode only absolute paths are allowed.
The expected file format is also plain text, with one line per exception, and the line format is as follows:
map-from-tokens => map-to-token
at & t => at&t AT&T => AT&T Standarten Fuehrer => standartenfuhrer Standarten Fuhrer => standartenfuhrer MS Windows => ms windows Microsoft Windows => ms windows C++ => cplusplus c++ => cplusplus C plus plus => cplusplus
All tokens here are case sensitive: they will not be processed by charset_table rules. Thus, with the example exceptions file above, "at&t" text will be tokenized as two keywords "at" and "t", because of lowercase letters. On the other hand, "AT&T" will match exactly and produce single "AT&T" keyword.
Note that this map-to keyword is:
- always interpreted as a single word
- and is both case and space sensitive
In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". And what "MS Windows" gets mapped to is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, eg. "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity, and gets indexed as two separate keywords.)
Whitespace in the map-from tokens list matters, but its amount does not. Any amount of the whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, "AT & T" map-from token will match "AT & T" text, whatever the amount of space in both map-from part and the indexed text. Such text will therefore be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.
Exceptions also allow to capture special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.
Exceptions are applied to raw incoming document and query data during indexing and searching respectively. Therefore, when it comes to a plain index to pick up changes in the file it's required to reindex and restart
create table products(title text, price float) exceptions = '/usr/local/sphinx/data/exceptions.txt'