Ignoring stop words

Stop words are words that are ignored during indexing and searching, typically due to their high frequency and low value to search results.

Manticore Search applies stemming to stop words by default, which can lead to undesired results. This can be turned off using the stopwords_unstemmed directive.

Small stop word files are stored in the table header, and there is a limit to the size of files that can be embedded, as defined by the embedded_limit option.

Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.
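
The position behavior described above can be illustrated with a small model. This is a minimal sketch, not Manticore's implementation: the functions index_positions and phrase_match are hypothetical names, tokenization is a plain whitespace split, and the stopword_step parameter only mimics the effect of the directive of the same name.

```python
def index_positions(text, stop_words, stopword_step=1):
    """Return (keyword, position) pairs for the non-stop words in text."""
    postings = []
    position = 0
    for token in text.lower().split():
        if token in stop_words:
            # A stop word is not stored, but it may still advance the position.
            position += stopword_step
            continue
        position += 1
        postings.append((token, position))
    return postings


def phrase_match(query, document, stop_words, stopword_step=1):
    """True if the query keywords occur in the document with identical spacing."""
    q = index_positions(query, stop_words, stopword_step)
    d = index_positions(document, stop_words, stopword_step)
    if not q:
        return False
    for start in range(len(d) - len(q) + 1):
        window = d[start:start + len(q)]
        offset = window[0][1] - q[0][1]
        same_words = all(dw == qw for (dw, _), (qw, _) in zip(window, q))
        same_gaps = all(dp - qp == offset for (_, dp), (_, qp) in zip(window, q))
        if same_words and same_gaps:
            return True
    return False


stop_words = {"the"}
print(index_positions("in the office", stop_words))
print(phrase_match("in office", "in office", stop_words))
print(phrase_match("in office", "in the office", stop_words))
print(phrase_match("in office", "in the office", stop_words, stopword_step=0))
```

In this model, the skipped "the" leaves a gap between "in" and "office" when the step is 1, so the exact phrase does not match the second document. With a step of 0 the gap disappears and both documents match.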

stopwords

stopwords=path/to/stopwords/file[ path/to/another/file ...]

The stopwords setting is optional and by default empty. It allows you to specify the path to one or more stop word files, separated by spaces. All the files will be loaded. In the real-time mode, only absolute paths are allowed.

The stop word file format is simple plain text with UTF-8 encoding. The file data will be tokenized with respect to the charset_table settings, so you can use the same separators as in the indexed data.

Stop word files can be created manually or semi-automatically. The indexer provides a mode that creates a frequency dictionary of the table, sorted by keyword frequency. Top keywords from that dictionary can usually be used as stop words. See the --buildstops and --buildfreqs switches for details.
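
The idea behind the frequency dictionary can be sketched outside of the indexer. The following is an illustration only and does not reproduce the indexer's output: top_keywords is a hypothetical helper, the sample documents are made up, and the \w+ regular expression is a rough stand-in for tokenization by charset_table.

```python
import re
from collections import Counter


def top_keywords(documents, limit):
    """Return the `limit` most frequent keywords across all documents."""
    counts = Counter()
    for doc in documents:
        # Lowercase and split on non-word characters (a simplification).
        counts.update(re.findall(r"\w+", doc.lower()))
    return [word for word, _ in counts.most_common(limit)]


docs = [
    "the cat sat on the mat",
    "the dog and the cat",
    "the end",
]

candidates = top_keywords(docs, 2)
# One keyword per line gives a plain-text body suitable for a stop word file.
stopwords_file_body = "\n".join(candidates)
print(stopwords_file_body)
```

The resulting list is only a set of candidates. It is worth reviewing it by hand before using it, since a frequent keyword is not always a useless one.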

CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'

Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:

  • af - Afrikaans
  • ar - Arabic
  • bg - Bulgarian
  • bn - Bengali
  • ca - Catalan
  • ckb - Kurdish
  • cz - Czech
  • da - Danish
  • de - German
  • el - Greek
  • en - English
  • eo - Esperanto
  • es - Spanish
  • et - Estonian
  • eu - Basque
  • fa - Persian
  • fi - Finnish
  • fr - French
  • ga - Irish
  • gl - Galician
  • hi - Hindi
  • he - Hebrew
  • hr - Croatian
  • hu - Hungarian
  • hy - Armenian
  • id - Indonesian
  • it - Italian
  • ja - Japanese
  • ko - Korean
  • la - Latin
  • lt - Lithuanian
  • lv - Latvian
  • mr - Marathi
  • nl - Dutch
  • no - Norwegian
  • pl - Polish
  • pt - Portuguese
  • ro - Romanian
  • ru - Russian
  • sk - Slovak
  • sl - Slovenian
  • so - Somali
  • st - Sotho
  • sv - Swedish
  • sw - Swahili
  • th - Thai
  • tr - Turkish
  • yo - Yoruba
  • zh - Chinese
  • zu - Zulu

For example, to use stop words for the Italian language, set stopwords to the it alias:

CREATE TABLE products(title text, price float) stopwords = 'it'

If you need to use stop words for multiple languages, list all their aliases, separated by commas (RT mode) or spaces (plain mode):

CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'

stopword_step

stopword_step={0|1}

The stopword_step setting controls the position increment on stop words. It is optional; the allowed values are 0 and 1, and the default is 1.

CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'

stopwords_unstemmed

stopwords_unstemmed={0|1}

Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).

By default, stop words are stemmed themselves, and then applied to tokens after stemming (or any other morphology processing). This means that a token is stopped when stem(token) is equal to stem(stopword). This default behavior can lead to unexpected results when a token is erroneously stemmed to a stopped root. For example, "Andes" might get stemmed to "and", so when "and" is a stopword, "Andes" is also skipped.

However, you can change this behavior by enabling the stopwords_unstemmed directive. When this is enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when the token is equal to the stopword.

CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'
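
The difference between the two modes can be shown with a toy model. This is a sketch under stated assumptions, not Manticore's morphology pipeline: toy_stem is a hypothetical stemmer that only strips a trailing "es" or "s" (chosen so that "Andes" becomes "and", as in the example above), and filter_tokens is a hypothetical helper whose stopwords_unstemmed parameter mimics the directive.

```python
def toy_stem(word):
    """Hypothetical stemmer: strip a trailing 'es' or 's' from longer words."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def filter_tokens(tokens, stop_words, stopwords_unstemmed=0):
    """Return the lowercased tokens that survive stop word filtering."""
    stemmed_stops = {toy_stem(s) for s in stop_words}
    kept = []
    for token in tokens:
        word = token.lower()
        if stopwords_unstemmed:
            # Compare the original word form with the stop word list.
            stopped = word in stop_words
        else:
            # Compare stem(token) with stem(stopword).
            stopped = toy_stem(word) in stemmed_stops
        if not stopped:
            kept.append(word)
    return kept


tokens = ["The", "Andes", "and", "mountains"]
stop_words = {"and", "the"}
print(filter_tokens(tokens, stop_words))
print(filter_tokens(tokens, stop_words, stopwords_unstemmed=1))
```

With the default setting, "andes" is dropped because its stem equals the stem of the stop word "and". With stopwords_unstemmed set to 1, only the literal stop words are dropped and "andes" is kept.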