Ignoring stop words

Stop words are words that are skipped during indexing and searching. Typically you'd put the most frequent words in the stop word list, because they add little value to search results but consume a lot of resources to process.

By default, stemming is applied when parsing a stop words file. That can, however, lead to undesired results. You can turn it off with stopwords_unstemmed.

Sufficiently small files are stored in the index header; see embedded_limit for details.

Although stop words are not indexed, they still affect keyword positions. For instance, assume that "the" is a stop word, that document 1 contains the line "in office", and that document 2 contains "in the office". Searching for "in office" as an exact phrase will return only the first document, as expected, even though "the" in the second one is skipped as a stop word. That behavior can be tweaked through the stopword_step directive.
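The position behavior described above can be illustrated with a small standalone sketch. It is not Manticore's implementation: the function names, the whitespace tokenization, and the `step` parameter (standing in for stopword_step) are assumptions made for illustration only.

```python
STOP_WORDS = {"the"}  # assumption: "the" is the only stop word, as in the example above


def index_positions(text, step=1):
    """Return (keyword, position) pairs, skipping stop words.

    `step` mimics stopword_step: how far a skipped stop word advances the position.
    """
    positions = []
    pos = 0
    for token in text.lower().split():
        if token in STOP_WORDS:
            pos += step  # the stop word is not stored, but may still take up a slot
            continue
        pos += 1
        positions.append((token, pos))
    return positions


def matches_phrase(text, phrase, step=1):
    """True if the phrase keywords occur in the text with the same relative positions."""
    doc = index_positions(text, step)
    query = index_positions(phrase, step)
    if not query:
        return False
    for i in range(len(doc) - len(query) + 1):
        window = doc[i:i + len(query)]
        base_doc, base_query = window[0][1], query[0][1]
        if all(dw == qw and dp - base_doc == qp - base_query
               for (dw, dp), (qw, qp) in zip(window, query)):
            return True
    return False


print(index_positions("in the office"))                      # [('in', 1), ('office', 3)]
print(matches_phrase("in office", "in office"))              # True
print(matches_phrase("in the office", "in office"))          # False
print(matches_phrase("in the office", "in office", step=0))  # True
```

With a step of 1 the skipped "the" leaves a gap between "in" and "office", so the phrase does not match document 2; with a step of 0 the gap disappears and both documents match.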

stopwords

stopwords=path/to/stopwords/file[ path/to/another/file ...]

List of stop word files. Optional, default is empty. You can specify several file names, separated by spaces; all of the files will be loaded. In RT mode only absolute paths are allowed.

The stop words file format is plain text, and the encoding must be UTF-8. The file contents are tokenized according to the charset_table settings, so you can use the same separators as in the indexed data.

Stop word files can be created either manually or semi-automatically. indexer provides a mode that creates a frequency dictionary of the index, sorted by keyword frequency; see the --buildstops and --buildfreqs switches for details. The top keywords from that dictionary can usually be used as stop words.
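As a rough illustration of the semi-automatic approach, the sketch below builds a frequency dictionary from a few sample strings and writes the top keywords to a UTF-8 plain text file. It does not use or reproduce indexer's --buildstops mode; the function names, the sample documents, and the simple `\w+` tokenization (standing in for charset_table rules) are all assumptions.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def top_keywords(documents, limit):
    """Return the `limit` most frequent keywords across the documents."""
    counts = Counter()
    for text in documents:
        # Assumption: \w+ tokenization stands in for the real charset_table rules.
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(limit)]


def write_stopwords(words, path):
    """Write one stop word per line as UTF-8 plain text."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


docs = [
    "the price of the phone",
    "the size of the screen",
    "the colour of a case",
]
candidates = top_keywords(docs, 2)
print(candidates)  # ['the', 'of']

# Write the candidates to a throwaway file; a real path would be passed to stopwords.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "stopwords.txt"
    write_stopwords(candidates, target)
    print(target.read_text(encoding="utf-8").split())  # ['the', 'of']
```

The resulting list is only a set of candidates: frequent keywords that carry meaning in your data should be reviewed and removed by hand before the file is used.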

CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'

Alternatively, you can use one of the default stop word files that come with Manticore. Stop words for 50 languages are currently available. Here is the full list of their aliases:

  • af - Afrikaans
  • ar - Arabic
  • bg - Bulgarian
  • bn - Bengali
  • ca - Catalan
  • ckb - Kurdish
  • cz - Czech
  • da - Danish
  • de - German
  • el - Greek
  • en - English
  • eo - Esperanto
  • es - Spanish
  • et - Estonian
  • eu - Basque
  • fa - Persian
  • fi - Finnish
  • fr - French
  • ga - Irish
  • gl - Galician
  • hi - Hindi
  • he - Hebrew
  • hr - Croatian
  • hu - Hungarian
  • hy - Armenian
  • id - Indonesian
  • it - Italian
  • ja - Japanese
  • ko - Korean
  • la - Latin
  • lt - Lithuanian
  • lv - Latvian
  • mr - Marathi
  • nl - Dutch
  • no - Norwegian
  • pl - Polish
  • pt - Portuguese
  • ro - Romanian
  • ru - Russian
  • sk - Slovak
  • sl - Slovenian
  • so - Somali
  • st - Sotho
  • sv - Swedish
  • sw - Swahili
  • th - Thai
  • tr - Turkish
  • yo - Yoruba
  • zh - Chinese
  • zu - Zulu

For example, to use the stop words for Italian, specify its alias in place of a file path:

CREATE TABLE products(title text, price float) stopwords = 'it'

To use stop words for multiple languages, list all their aliases, separated by commas (in RT mode) or spaces (in plain mode):

CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'

stopword_step

stopword_step={0|1}

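Position increment on stop words. Optional, allowed values are 0 and 1, default is 1. With 1, a skipped stop word still advances the keyword position; with 0, it does not, so a phrase query can match across a removed stop word.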

CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'

stopwords_unstemmed

stopwords_unstemmed={0|1}

Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).

By default, stop words themselves are stemmed and then applied to tokens after stemming (or any other morphology processing). In other words, by default, a token is stopped when stem(token) is equal to stem(stopword). That can lead to unexpected results when a token gets (erroneously) stemmed to a stopped root. For example, 'Andes' might get stemmed to 'and', so when 'and' is a stop word, 'Andes' is also skipped.

The stopwords_unstemmed directive changes this behavior. When it is enabled, stop words are applied before stemming (and therefore to the original word forms), and a token is skipped when token is equal to stopword.

CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'
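The difference between the two modes can be sketched in a few lines. This is only an illustration of the stem(token) == stem(stopword) rule described above: `toy_stem` is a hypothetical suffix stripper, not the stemmer Manticore uses, and `is_stopped` and its `unstemmed` flag are names invented for this example.

```python
def toy_stem(word):
    """Hypothetical stemmer: strips a trailing "es" or "s". Not Manticore's stemmer."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def is_stopped(token, stop_words, unstemmed=False):
    """Decide whether a token is filtered out as a stop word.

    unstemmed=False: compare stem(token) with stem(stopword), the default mode.
    unstemmed=True:  compare the original token with the stop word as written.
    """
    token = token.lower()
    if unstemmed:
        return token in stop_words
    return toy_stem(token) in {toy_stem(w) for w in stop_words}


stop_words = {"and"}
print(is_stopped("Andes", stop_words))                  # True
print(is_stopped("Andes", stop_words, unstemmed=True))  # False
print(is_stopped("and", stop_words, unstemmed=True))    # True
```

In the default mode the toy stemmer reduces "andes" to "and", so the token is dropped along with the real stop word; comparing the original forms keeps "Andes" searchable.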