Autocomplete (or word completion) is a feature in which an application predicts the rest of a word a user is typing. On websites it's used in search boxes, where a user starts to type a word and a dropdown with suggestions pops up so the user can select the ending from the list.
There are few ways how you can do autocomplete in Manticore:
To autocomplete a sentence you can use infixed search. You can find endings of a document's field by providing its beginning and:
- using full-text operators
*
to match anything it substitutes - and optionally
^
to start from the beginning of the field - and perhaps
""
for phrase matching - and optionally highlight the results so you don't have to fetch them in full to your application
There is an article about it in our blog and an interactive course. A quick example is:
- Let's assume you have a document:
My cat loves my dog. The cat (Felis catus) is a domestic species of small carnivorous mammal.
- Then you can use
^
,""
and*
so as the user is typing you make queries like:^"m*"
,^"my *"
,^"my c*"
,^"my ca*"
and so on - It will find the document and if you also do highlighting you will get something like:
<strong>My cat</strong> loves my dog. The cat ( ...
In some cases all you need is just autocomplete a single word or a couple of words. In this case you can use CALL KEYWORDS
.
CALL KEYWORDS
is available via the SQL interface and provides a way to check how keywords are tokenized or to retrieve the tokenized forms of particular keywords. If the table enables infixes it allows to quickly find possible endings for given keywords which makes it possible for use for autocomplete.
This is a good alternative to just general infixed search as it provides higher performance since all it needs for work is table's dictionary, not the documents themselves.
CALL KEYWORDS(text, table [, options])
CALL KEYWORDS
statement splits text into keywords. It returns tokenized and normalized forms of the keywords, and, optionally, keyword statistics. It also returns the position of each keyword in the query and all forms of tokenized keywords in case the table enables lemmatizers.
Parameter | Description |
---|---|
text | Text to break down to keywords |
table | Name of the table from which to take the text processing settings |
0/1 as stats | Show statistics of keywords, default is 0 |
0/1 as fold_wildcards | Fold wildcards, default is 0 |
0/1 as fold_lemmas | Fold morphological lemmas, default is 0 |
0/1 as fold_blended | Fold blended words, default is 0 |
N as expansion_limit | Override expansion_limit defined in the server configuration, default is 0 (use value from the configuration) |
docs/hits as sort_mode | Sort output results by either 'docs' or 'hits'. Default no sorting |
The examples show how it works if assuming the user is trying to get an autocomplete for "my cat ...". So on the application side all you need to do is to suggest the user the endings from the column "normalized" for each new word. It often makes sense to sort by hits or docs using 'hits' as sort_mode
or 'docs' as sort_mode
.
- Examples
MySQL [(none)]> CALL KEYWORDS('m*', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | m* | my | 1 | 2 |
| 1 | m* | mammal | 1 | 1 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('my*', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | my* | my | 1 | 2 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('c*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+-------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+-------------+------+------+
| 1 | c* | cat | 1 | 2 |
| 1 | c* | carnivorous | 1 | 1 |
| 1 | c* | catus | 1 | 1 |
+------+-----------+-------------+------+------+
MySQL [(none)]> CALL KEYWORDS('ca*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+-------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+-------------+------+------+
| 1 | ca* | cat | 1 | 2 |
| 1 | ca* | carnivorous | 1 | 1 |
| 1 | ca* | catus | 1 | 1 |
+------+-----------+-------------+------+------+
MySQL [(none)]> CALL KEYWORDS('cat*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | cat* | cat | 1 | 2 |
| 1 | cat* | catus | 1 | 1 |
+------+-----------+------------+------+------+
There is a nice trick how you can improve the above algorithm - use bigram_index. When you have it enabled for the table what you get in it is not just a single word, but each pair of words standing one after another indexed as a separate token.
This allows to predict not just the current word's ending, but the next word too which is especially beneficial for the purpose of autocomplete.
- Examples
MySQL [(none)]> CALL KEYWORDS('m*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | m* | my | 1 | 2 |
| 1 | m* | mammal | 1 | 1 |
| 1 | m* | my cat | 1 | 1 |
| 1 | m* | my dog | 1 | 1 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('my*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | my* | my | 1 | 2 |
| 1 | my* | my cat | 1 | 1 |
| 1 | my* | my dog | 1 | 1 |
+------+-----------+------------+------+------+
MySQL [(none)]> CALL KEYWORDS('c*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+--------------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+--------------------+------+------+
| 1 | c* | cat | 1 | 2 |
| 1 | c* | carnivorous | 1 | 1 |
| 1 | c* | carnivorous mammal | 1 | 1 |
| 1 | c* | cat felis | 1 | 1 |
| 1 | c* | cat loves | 1 | 1 |
| 1 | c* | catus | 1 | 1 |
| 1 | c* | catus is | 1 | 1 |
+------+-----------+--------------------+------+------+
MySQL [(none)]> CALL KEYWORDS('ca*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+--------------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+--------------------+------+------+
| 1 | ca* | cat | 1 | 2 |
| 1 | ca* | carnivorous | 1 | 1 |
| 1 | ca* | carnivorous mammal | 1 | 1 |
| 1 | ca* | cat felis | 1 | 1 |
| 1 | ca* | cat loves | 1 | 1 |
| 1 | ca* | catus | 1 | 1 |
| 1 | ca* | catus is | 1 | 1 |
+------+-----------+--------------------+------+------+
MySQL [(none)]> CALL KEYWORDS('cat*', 't', 1 as stats, 'hits' as sort_mode);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1 | cat* | cat | 1 | 2 |
| 1 | cat* | cat felis | 1 | 1 |
| 1 | cat* | cat loves | 1 | 1 |
| 1 | cat* | catus | 1 | 1 |
| 1 | cat* | catus is | 1 | 1 |
+------+-----------+------------+------+------+
CALL KEYWORDS
supports distributed tables so no matter how big your data set you can benefit from using it.
Spell correction also known as:
- Auto correction
- Text correction
- Fixing a spelling error
- Typo tolerance
- "Did you mean?"
and so on is a software functionality that suggests you alternatives to or makes automatic corrections of the text you have typed in. The concept of correcting typed text dates back to the 1960s, when a computer scientist named Warren Teitelman who also invented the "undo" command came up with a philosophy of computing called D.W.I.M., or "Do What I Mean." Rather than programming computers to accept only perfectly formatted instructions, Teitelman said we should program them to recognize obvious mistakes.
The first well known product which provided spell correction functionality was Microsoft Word 6.0 released in 1993.
There are few ways how spell correction can be done, but the important thing is that there is no purely programmatic way which will convert your mistyped "ipone" into "iphone" (at least with decent quality). Mostly there has to be a data set the system is based on. The data set can be:
- A dictionary of properly spelled words. In its turn it can be:
- Based on your real data. The idea here is that mostly in the dictionary made up from your data the spelling is correct and then for each typed word the system just tries to find a word which is most similar to that (we'll speak about how exactly it can be done with Manticore shortly)
- Or can be based on an external dictionary which has nothing to do with your data. The problem which may arise here is that your data and the external dictionary can be too different: some words may be missing in the dictionary, some words may be missing in your data.
- Not just dictionary-based, but context-aware, e.g. "white ber" would be corrected to "white bear" while "dark ber" - to "dark beer". The context may be not just a neighbour word in your query, but your location, date of time, current sentence's grammar (to let's say change "there" to "their" or not), your search history and virtually any other things that can affect your intent.
- Another classical approach is to use previous search queries as the data set for spell correction. It's even more utilized in autocomplete functionality, but makes sense for autocorrect too. The idea here is that mostly users are right with spelling, therefore we can use words from their search history as a source of truth even if we don't have the words in our documents nor we use an external dictionary. Context-awareness is also possible here.
Manticore provides commands CALL QSUGGEST
and CALL SUGGEST
that can be used for the purpose of automatic spell correction.
They are both available via SQL only and the general syntax is:
CALL QSUGGEST(word, table [,options])
CALL SUGGEST(word, table [,options])
options: N as option_name[, M as another_option, ...]
These commands provide for a given word all suggestions from the dictionary. They work only on tables with infixing enabled and dict=keywords. They return the suggested keywords, Levenshtein distance between the suggested and original keywords and the docs statistics of the suggested keyword.
If the first parameter is not a single word, but multiple, then:
CALL QSUGGEST
will return suggestions only for the last word, ignoring the restCALL SUGGEST
will return suggestions only for the first word
That's the only difference between them. Several options are supported for customization:
Option | Description | Default |
---|---|---|
limit | Returns N top matches | 5 |
max_edits | Keeps only dictionary words which Levenshtein distance is less than or equal to N | 4 |
result_stats | Provides Levenshtein distance and document count of the found words | 1 (enabled) |
delta_len | Keeps only dictionary words whose length difference is less than N | 3 |
max_matches | Number of matches to keep | 25 |
reject | Rejected words are matches that are not better than those already in the match queue. They are put in a rejected queue that gets reset in case one actually can go in the match queue. This parameter defines the size of the rejected queue (as reject*max(max_matched,limit)). If the rejected queue is filled, the engine stops looking for potential matches | 4 |
result_line | alternate mode to display the data by returning all suggests, distances and docs each per one row | 0 |
non_char | do not skip dictionary words with non alphabet symbols | 0 (skip such words) |
To show how it works let's create a table and add few documents into it.
create table products(title text) min_infix_len='2';
insert into products values (0,'Crossbody Bag with Tassel'), (0,'microfiber sheet set'), (0,'Pet Hair Remover Glove');
As you can see we have a mistype in "crossbUdy" which gets corrected to the "crossbody". In addition to that by default CALL SUGGEST/QSUGGEST
return:
distance
- the Levenshtein distance which means how many edits they had to make to convert the given word to the suggestiondocs
- and the number of docs that have this word
To disable these stats display you can use option 0 as result_stats
.
- Example
call suggest('crossbudy', 'products');
+-----------+----------+------+
| suggest | distance | docs |
+-----------+----------+------+
| crossbody | 1 | 1 |
+-----------+----------+------+
If the first parameter is not a single word, but multiple, then CALL SUGGEST
will return suggestions only for the first word.
- Example
call suggest('bagg with tasel', 'products');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| bag | 1 | 1 |
+---------+----------+------+
If the first parameter is not a single word, but multiple, then CALL SUGGEST
will return suggestions only for the last word.
- Example
CALL QSUGGEST('bagg with tasel', 'products');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| tassel | 1 | 1 |
+---------+----------+------+
Using 1 as result_line
in the options turns on alternate mode to display the data by returning all suggests, distances and docs each per one row.
- Example
CALL QSUGGEST('bagg with tasel', 'products', 1 as result_line);
+----------+--------+
| name | value |
+----------+--------+
| suggests | tassel |
| distance | 1 |
| docs | 1 |
+----------+--------+
This interactive course demonstrates online how it works on a web page and provides different examples.
Query cache stores a compressed result set in memory, and then reuses it for subsequent queries where possible. You can configure it using the following directives:
- qcache_max_bytes, a limit on the RAM use for cached queries storage. Defaults to 16 MB. Setting
qcache_max_bytes
to 0 completely disables the query cache. - qcache_thresh_msec, the minimum wall query time to cache. Queries that completed faster than this will not be cached. Defaults to 3000 msec, or 3 seconds.
- qcache_ttl_sec, cached entry TTL, or time to live. Queries will stay cached for this much. Defaults to 60 seconds, or 1 minute.
These settings can be changed on the fly using the SET GLOBAL
statement:
mysql> SET GLOBAL qcache_max_bytes=128000000;
These changes are applied immediately, and the cached result sets that no longer satisfy the constraints are immediately discarded. When reducing the cache size on the fly, MRU (most recently used) result sets win.
Query cache works as follows. When it's enabled, every full-text search result gets completely stored in memory. That happens after full-text matching, filtering, and ranking, so basically we store total_found
{docid,weight} pairs. Compressed matches can consume anywhere from 2 bytes to 12 bytes per match on average, mostly depending on the deltas between the subsequent docids. Once the query completes, we check the wall time and size thresholds, and either save that compressed result set for reuse, or discard it.
Note how the query cache impact on RAM is thus not limited by qcache_max_bytes
! If you run, say, 10 concurrent queries, each of them matching upto 1M matches (after filters), then the peak temporary RAM use will be in the 40 MB to 240 MB range, even if in the end the queries are quick enough and do not get cached.
Queries can then use cache when the table, the full-text query (ie.MATCH()
contents), and the ranker are all a match, and filters are compatible. Meaning:
- The full-text part within
MATCH()
must be a bytewise match. Add a single extra space, and that is now a different query where the query cache is concerned. - The ranker (and its parameters if any, for user-defined rankers) must be a bytewise match.
- The filters must be a superset of the original filters. That is, you can add extra filters and still hit the cache. (In this case, the extra filters will be applied to the cached result.) But if you remove one, that will be a new query again.
Cache entries expire with TTL, and also get invalidated on table rotation, or on TRUNCATE
, or on ATTACH
. Note that at the moment entries are not invalidated on arbitrary RT table writes! So a cached query might be returning older results for the duration of its TTL.
Current cache status can be inspected with in SHOW STATUS through the qcache_XXX
variables:
mysql> SHOW STATUS LIKE 'qcache%';
+-----------------------+----------+
| Counter | Value |
+-----------------------+----------+
| qcache_max_bytes | 16777216 |
| qcache_thresh_msec | 3000 |
| qcache_ttl_sec | 60 |
| qcache_cached_queries | 0 |
| qcache_used_bytes | 0 |
| qcache_hits | 0 |
+-----------------------+----------+
6 rows in set (0.00 sec)