Searching > Collations | Manticore Search Manual

Spell correction also known as:

Auto correction
Text correction
Fixing a spelling error
Typo tolerance
"Did you mean?"

and so on is a software functionality that suggests you alternatives to or makes automatic corrections of the text you have typed in. The concept of correcting typed text dates back to the 1960s, when a computer scientist named Warren Teitelman who also invented the "undo" command came up with a philosophy of computing called D.W.I.M., or "Do What I Mean." Rather than programming computers to accept only perfectly formatted instructions, Teitelman said we should program them to recognize obvious mistakes.

The first well known product which provided spell correction functionality was Microsoft Word 6.0 released in 1993.

There are few ways how spell correction can be done, but the important thing is that there is no purely programmatic way which will convert your mistyped "ipone" into "iphone" (at least with decent quality). Mostly there has to be a data set the system is based on. The data set can be:

A dictionary of properly spelled words. In its turn it can be:
- Based on your real data. The idea here is that mostly in the dictionary made up from your data the spelling is correct and then for each typed word the system just tries to find a word which is most similar to that (we'll speak about how exactly it can be done with Manticore shortly)
- Or can be based on an external dictionary which has nothing to do with your data. The problem which may arise here is that your data and the external dictionary can be too different: some words may be missing in the dictionary, some words may be missing in your data.
Not just dictionary-based, but context-aware, e.g. "white ber" would be corrected to "white bear" while "dark ber" - to "dark beer". The context may be not just a neighbour word in your query, but your location, date of time, current sentence's grammar (to let's say change "there" to "their" or not), your search history and virtually any other things that can affect your intent.
Another classical approach is to use previous search queries as the data set for spell correction. It's even more utilized in autocomplete functionality, but makes sense for autocorrect too. The idea here is that mostly users are right with spelling, therefore we can use words from their search history as a source of truth even if we don't have the words in our documents nor we use an external dictionary. Context-awareness is also possible here.

Manticore provides commands CALL QSUGGEST and CALL SUGGEST that can be used for the purpose of automatic spell correction.

They are both available via SQL only and the general syntax is:

CALL QSUGGEST(word, index [,options])
CALL SUGGEST(word, index [,options])
options: N as option_name[, M as another_option, ...]

These commands provide for a given word all suggestions from the dictionary. They work only on indexes with infixing enabled and dict=keywords. They return the suggested keywords, Levenshtein distance between the suggested and original keywords and the docs statistics of the suggested keyword.

If the first parameter is not a single word, but multiple, then:

CALL QSUGGEST will return suggestions only for the last word, ignoring the rest
CALL SUGGEST will return suggestions only for the first word

That's the only difference between them. Several options are supported for customization:

Option	Description	Default
limit	Returns N top matches	5
max_edits	Keeps only dictionary words which Levenshtein distance is less than or equal to N	4
result_stats	Provides Levenshtein distance and document count of the found words	1 (enabled)
delta_len	Keeps only dictionary words whose length difference is less than N	3
max_matches	Number of matches to keep	25
reject	Rejected words are matches that are not better than those already in the match queue. They are put in a rejected queue that gets reset in case one actually can go in the match queue. This parameter defines the size of the rejected queue (as reject*max(max_matched,limit)). If the rejected queue is filled, the engine stops looking for potential matches	4
result_line	alternate mode to display the data by returning all suggests, distances and docs each per one row	0
non_char	do not skip dictionary words with non alphabet symbols	0 (skip such words)

To show how it works let's create an index and add few documents into it.

create table products(title text) min_infix_len='2';
insert into products values (0,'Crossbody Bag with Tassel'), (0,'microfiber sheet set'), (0,'Pet Hair Remover Glove');

As you can see we have a mistype in "crossbUdy" which gets corrected to the "crossbody". In addition to that by default CALL SUGGEST/QSUGGEST return:

distance - the Levenshtein distance which means how many edits they had to make to convert the given word to the suggestion
docs - and the number of docs that have this word

To disable these stats display you can use option 0 as result_stats.

‹›

Example

Example

📋

call suggest('crossbudy', 'products');

‹›

Response

+-----------+----------+------+
| suggest   | distance | docs |
+-----------+----------+------+
| crossbody | 1        | 1    |
+-----------+----------+------+

If the first parameter is not a single word, but multiple, then CALL SUGGEST will return suggestions only for the first word.

‹›

Example

Example

📋

call suggest('bagg with tasel', 'products');

‹›

Response

+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| bag     | 1        | 1    |
+---------+----------+------+

If the first parameter is not a single word, but multiple, then CALL SUGGEST will return suggestions only for the last word.

‹›

Example

Example

📋

CALL QSUGGEST('bagg with tasel', 'products');

‹›

Response

+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| tassel  | 1        | 1    |
+---------+----------+------+

Using 1 as result_line in the options turns on alternate mode to display the data by returning all suggests, distances and docs each per one row.

‹›

Example

Example

📋

CALL QSUGGEST('bagg with tasel', 'products', 1 as result_line);

‹›

Response

+----------+--------+
| name     | value  |
+----------+--------+
| suggests | tassel |
| distance | 1      |
| docs     | 1      |
+----------+--------+

This interactive course demonstrates online how it works on a web page and provides different examples.

Typical flow with Manticore and a database

Query cache

Last modified: November 09, 2021

Query cache stores a compressed result set in memory, and then reuses it for subsequent queries where possible. You can configure it using the following directives:

qcache_max_bytes, a limit on the RAM use for cached queries storage. Defaults to 16 MB. Setting qcache_max_bytes to 0 completely disables the query cache.
qcache_thresh_msec, the minimum wall query time to cache. Queries that completed faster than this will not be cached. Defaults to 3000 msec, or 3 seconds.
qcache_ttl_sec, cached entry TTL, or time to live. Queries will stay cached for this much. Defaults to 60 seconds, or 1 minute.

These settings can be changed on the fly using the SET GLOBAL statement:

mysql> SET GLOBAL qcache_max_bytes=128000000;

These changes are applied immediately, and the cached result sets that no longer satisfy the constraints are immediately discarded. When reducing the cache size on the fly, MRU (most recently used) result sets win.

Query cache works as follows. When it's enabled, every full-text search result gets completely stored in memory. That happens after full-text matching, filtering, and ranking, so basically we store total_found {docid,weight} pairs. Compressed matches can consume anywhere from 2 bytes to 12 bytes per match on average, mostly depending on the deltas between the subsequent docids. Once the query completes, we check the wall time and size thresholds, and either save that compressed result set for reuse, or discard it.

Note how the query cache impact on RAM is thus not limited by qcache_max_bytes! If you run, say, 10 concurrent queries, each of them matching upto 1M matches (after filters), then the peak temporary RAM use will be in the 40 MB to 240 MB range, even if in the end the queries are quick enough and do not get cached.

Queries can then use cache when the index, the full-text query (ie.MATCH() contents), and the ranker are all a match, and filters are compatible. Meaning:

The full-text part within MATCH() must be a bytewise match. Add a single extra space, and that is now a different query where the query cache is concerned.
The ranker (and its parameters if any, for user-defined rankers) must be a bytewise match.
The filters must be a superset of the original filters. That is, you can add extra filters and still hit the cache. (In this case, the extra filters will be applied to the cached result.) But if you remove one, that will be a new query again.

Cache entries expire with TTL, and also get invalidated on index rotation, or on TRUNCATE, or on ATTACH. Note that at the moment entries are not invalidated on arbitrary RT index writes! So a cached query might be returning older results for the duration of its TTL.

Current cache status can be inspected with in SHOW STATUS through the qcache_XXX variables:

mysql> SHOW STATUS LIKE 'qcache%';
+-----------------------+----------+
| Counter               | Value    |
+-----------------------+----------+
| qcache_max_bytes      | 16777216 |
| qcache_thresh_msec    | 3000     |
| qcache_ttl_sec        | 60       |
| qcache_cached_queries | 0        |
| qcache_used_bytes     | 0        |
| qcache_hits           | 0        |
+-----------------------+----------+
6 rows in set (0.00 sec)

Spell correction Collations

Last modified: February 10, 2021

Collations essentially affect the string attribute comparisons. They specify both the character set encoding and the strategy that Manticore uses to compare strings when doing ORDER BY or GROUP BY with a string attribute involved.

String attributes are stored as is when indexing, and no character set or language information is attached to them. That's okay as long as Manticore only needs to store and return the strings to the calling application verbatim. But when you ask Manticore to sort by a string value, that request immediately becomes quite ambiguous.

First, single-byte (ASCII, or ISO-8859-1, or Windows-1251) strings need to be processed differently that the UTF-8 ones that may encode every character with a variable number of bytes. So we need to know what is the character set type to interpret the raw bytes as meaningful characters properly.

Second, we additionally need to know the language-specific string sorting rules. For instance, when sorting according to US rules in en_US locale, the accented character ï (small letter i with diaeresis) should be placed somewhere after z. However, when sorting with French rules and fr_FR locale in mind, it should be placed between i and j. And some other set of rules might choose to ignore accents at all, allowing ï and i to be mixed arbitrarily.

Third, but not least, we might need case-sensitive sorting in some scenarios and case-insensitive sorting in some others.

Collations combine all of the above: the character set, the language rules, and the case sensitivity. Manticore currently provides the following four collations.

libc_ci
libc_cs
utf8_general_ci
binary

The first two collations rely on several standard C library (libc) calls and can thus support any locale that is installed on your system. They provide case-insensitive (_ci) and case-sensitive (_cs) comparisons respectively. By default they will use C locale, effectively resorting to bytewise comparisons. To change that, you need to specify a different available locale using collation_libc_locale directive. The list of locales available on your system can usually be obtained with the locale command:

$ locale -a
C
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
es_ES
fr_FR
POSIX
ru_RU.utf8
ru_UA.utf8

The specific list of the system locales may vary. Consult your OS documentation to install additional needed locales.

utf8_general_ci and binary locales are built-in into Manticore. The first one is a generic collation for UTF-8 data (without any so-called language tailoring); it should behave similar to utf8_general_ci collation in MySQL. The second one is a simple bytewise comparison.

Collation can be overridden via SQL on a per-session basis using SET collation_connection statement. All subsequent SQL queries will use this collation. Otherwise all queries will use the server default collation or as specified in collation_server configuration directive. Manticore currently defaults to libc_ci collation.

Collations affect all string attribute comparisons, including those within ORDER BY and GROUP BY, so differently ordered or grouped results can be returned depending on the collation chosen. Note that collations don't affect full-text searching, for that use charset_table

Query cache ️ Updating index schema

Last modified: February 10, 2021

ALTER TABLE index ADD COLUMN column_name [{INTEGER|INT|BIGINT|FLOAT|BOOL|MULTI|MULTI64|JSON|STRING|TIMESTAMP|TEXT [INDEXED [ATTRIBUTE]]}] [engine='columnar']
ALTER TABLE index DROP COLUMN column_name

It supports adding one field at a time for RT indexes. Supported data types are:

int - integer attribute
timestamp - timestamp attribute
bigint - big integer attribute
float - float attribute
bool - boolean attribute
multi - multi-valued integer attribute
multi64 - multi-valued bigint attribute
json - json attribute
string / text attribute / string attribute - string attribute
text / text indexed stored / string indexed stored - full-text indexed field with original value stored in docstore
text indexed / string indexed - full-text indexed field, indexed only (the original value is not stored in docstore)
text indexed attribute / string indexed attribute - full text indexed field + string attribute (not storing the original value in docstore)
text stored / string stored - the value will be only stored in docstore, not full-text indexed, not a string attribute
adding engine='columnar' to any attribute (except for json) will make it stored in the columnar storage

Querying an index is impossible while a column is being added.
Newly created attribute's values are set to 0.
ALTER will not work for distributed indexes and indexes without any attributes.
DROP COLUMN will fail if an index has only one field.
When dropping a field which is both a full-text field and a string attribute the first ALTER DROP drops the attribute, the second one drops the full-text field.
Adding/dropping full-text field is only supported in RT mode.

‹›

Example

Example

📋

mysql> desc rt;
+------------+-----------+
| Field      | Type      |
+------------+-----------+
| id         | bigint    |
| text       | field     |
| group_id   | uint      |
| date_added | timestamp |
+------------+-----------+
mysql> alter table rt add column test integer;
mysql> desc rt;
+------------+-----------+
| Field      | Type      |
+------------+-----------+
| id         | bigint    |
| text       | field     |
| group_id   | uint      |
| date_added | timestamp |
| test       | uint      |
+------------+-----------+
mysql> alter table rt drop column group_id;
mysql> desc rt;
+------------+-----------+
| Field      | Type      |
+------------+-----------+
| id         | bigint    |
| text       | field     |
| date_added | timestamp |
| test       | uint      |
+------------+-----------+
mysql> alter table rt add column title text indexed;
mysql> desc rt;
+------------+-----------+------------+
| Field      | Type      | Properties |
+------------+-----------+------------+
| id         | bigint    |            |
| text       | text      | indexed    |
| title      | text      | indexed    |
| date_added | timestamp |            |
| test       | uint      |            |
+------------+-----------+------------+
mysql> alter table rt add column title text attribute;
mysql> desc rt;
+------------+-----------+------------+
| Field      | Type      | Properties |
+------------+-----------+------------+
| id         | bigint    |            |
| text       | text      | indexed    |
| title      | text      | indexed    |
| date_added | timestamp |            |
| test       | uint      |            |
| title      | string    |            |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field      | Type      | Properties |
+------------+-----------+------------+
| id         | bigint    |            |
| text       | text      | indexed    |
| title      | text      | indexed    |
| date_added | timestamp |            |
| test       | uint      |            |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field      | Type      | Properties |
+------------+-----------+------------+
| id         | bigint    |            |
| text       | text      | indexed    |
| date_added | timestamp |            |
| test       | uint      |            |
+------------+-----------+------------+

ALTER RTINDEX index ft_setting='value'[, ft_setting2='value']

You can also use ALTER to modify full-text settings of your index in RT mode. Just remember that it doesn't affect existing documents, it only affects new ones. Take a look at the example where we:

create an index with a full-text field and charset_table that allows only 3 searchable characters: a, b and c.
then we insert document 'abcd' and find it by query abcd, the d just gets ignored since it's not in the charset_table array
then we understand, that we want d to be searchable too, so we add it with help of ALTER
but the same query where match('abcd') still says it searched by abc, because the existing document remembers previous contents of charset_table
then we add another document abcd and search by abcd again
now it finds the both documents and show meta says it used two keywords: abc (to find the old document) and abcd (for the new one).

‹›

Example

Example

📋

mysql> create table rt(title text) charset_table='a,b,c';
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id                  | title |
+---------------------+-------+
| 1514630637682688054 | abcd  |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 1     |
| total_found   | 1     |
| time          | 0.000 |
| keyword[0]    | abc   |
| docs[0]       | 1     |
| hits[0]       | 1     |
+---------------+-------+
mysql> alter table rt charset_table='a,b,c,d';
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id                  | title |
+---------------------+-------+
| 1514630637682688054 | abcd  |
+---------------------+-------+
mysql> show meta
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 1     |
| total_found   | 1     |
| time          | 0.000 |
| keyword[0]    | abc   |
| docs[0]       | 1     |
| hits[0]       | 1     |
+---------------+-------+
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id                  | title |
+---------------------+-------+
| 1514630637682688055 | abcd  |
| 1514630637682688054 | abcd  |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 2     |
| total_found   | 2     |
| time          | 0.000 |
| keyword[0]    | abc   |
| docs[0]       | 1     |
| hits[0]       | 1     |
| keyword[1]    | abcd  |
| docs[1]       | 1     |
| hits[1]       | 1     |
+---------------+-------+

ALTER RTINDEX index RECONFIGURE

ALTER can also reconfigure an RT index in plain mode, so that new tokenization, morphology and other text processing settings from the configuration file take effect for new documents. Note, that the existing document will be left intact. Internally, it forcibly saves the current RAM chunk as a new disk chunk and adjusts the index header, so that new documents are tokenized using the updated full-text settings.

‹›

Example

Example

📋

mysql> show index rt settings;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| settings      |       |
+---------------+-------+
1 row in set (0.00 sec)
mysql> alter rtindex rt reconfigure;
Query OK, 0 rows affected (0.00 sec)
mysql> show index rt settings;
+---------------+----------------------+
| Variable_name | Value                |
+---------------+----------------------+
| settings      | morphology = stem_en |
+---------------+----------------------+
1 row in set (0.00 sec)

Collations ️ Functions

Last modified: January 05, 2022

Spell correction

How it works

CALL QSUGGEST, CALL SUGGEST

Single word example

CALL SUGGEST takes only the first word

CALL QSUGGEST takes only the last word

Different display mode

Interactive course

Query cache

Collations

Updating index schema

Updating index schema in RT mode

Important notes:

Updating index FT settings in RT mode

Updating index FT settings in plain mode