When Manticore executes a fullscan query, it can either use a plain scan to check every document against the filters or employ additional data and/or algorithms to speed up query execution. Manticore uses a cost-based optimizer (CBO), also known as a "query optimizer" to determine which approach to take.
The CBO can also enhance the performance of full-text queries. See below for more details.
The CBO may decide to replace one or more query filters with one of the following entities if it determines that doing so will improve performance:
- A docid index utilizes a special docid-only secondary index stored in files with the
.spt
extension. Besides improving filters on document IDs, the docid index is also used to accelerate document ID to row ID lookups and to speed up the application of large killlists during daemon startup. - A columnar scan relies on columnar storage and can only be used on a columnar attribute. It scans every value and tests it against the filter, but it is heavily optimized and is typically faster than the default approach.
- Secondary indexes are generated for all attributes by default. They use the PGM index along with Manticore's built-in inverted index to retrieve the list of row IDs corresponding to a value or range of values. Secondary indexes are stored in files with the
.spidx
extension.
The optimizer estimates the cost of each execution path using various attribute statistics, including:
- Information on the data distribution within an attribute (histograms, stored in
.sphi
files). Histograms are generated automatically when data is indexed and serve as the primary source of information for the CBO. - Information from PGM (secondary indexes), which helps estimate the number of document lists to read. This assists in gauging doclist merge performance and in selecting the appropriate merge algorithm (priority queue merge or bitmap merge).
- Columnar encoding statistics, employed to estimate columnar data decompression performance.
- A columnar min-max tree. While the CBO uses histograms to estimate the number of documents left after applying the filter, it also needs to determine how many documents the filter had to process. For columnar attributes, partial evaluation of the min-max tree serves this purpose.
- Full-text dictionary. The CBO utilizes term stats to estimate the cost of evaluating the full-text tree.
The optimizer computes the execution cost for every filter used in a query. Since certain filters can be replaced with several different entities (e.g., for a document id, Manticore can use a plain scan, a docid index lookup, a columnar scan (if the document id is columnar), and a secondary index), the optimizer evaluates all available combinations. However, there is a maximum limit of 1024 combinations.
To estimate query execution costs, the optimizer calculates the estimated costs of the most significant operations performed when executing the query. It uses preset constants to represent the cost of each operation.
The optimizer compares the costs of each execution path and chooses the path with the lowest cost to execute the query.
When working with full-text queries that have filters by attributes, the query optimizer decides between two possible execution paths. One is to execute the full-text query, retrieve the matches, and use filters. The other is to replace filters with one or more entities described above, fetch rowids from them, and inject them into the full-text matching tree. This way, full-text search results will intersect with full-scan results. The query optimizer estimates the cost of full-text tree evaluation and the best possible path for computing filter results. Using this information, the optimizer chooses the execution path.
Another factor to consider is multithreaded query execution (when pseudo_sharding
is enabled). The CBO is aware that some queries can be executed in multiple threads and takes this into account. The CBO prioritizes shorter query execution times (i.e., latency) over throughput. For instance, if a query using a columnar scan can be executed in multiple threads (and occupy multiple CPU cores) and is faster than a query executed in a single thread using secondary indexes, multithreaded execution will be preferred.
Queries using secondary indexes and docid indexes always run in a single thread, as benchmarks indicate that there is little to no benefit in making them multithreaded.
At present, the optimizer only uses CPU costs and does not take memory or disk usage into account.
Manticore Search supports the ability to add embeddings generated by your Machine Learning models to each document, and then doing a nearest-neighbor search on them. This lets you build features like similarity search, recommendations, semantic search, and relevance ranking based on NLP algorithms, among others, including image, video, and sound searches.
An embedding is a method of representing data—such as text, images, or sound—as vectors in a high-dimensional space. These vectors are crafted to ensure that the distance between them reflects the similarity of the data they represent. This process typically employs algorithms like word embeddings (e.g., Word2Vec, BERT) for text or neural networks for images. The high-dimensional nature of the vector space, with many components per vector, allows for the representation of complex and nuanced relationships between items. Their similarity is gauged by the distance between these vectors, often measured using methods like Euclidean distance or cosine similarity.
Manticore Search enables k-nearest neighbor (KNN) vector searches using the HNSW library. This functionality is part of the Manticore Columnar Library.
To run KNN searches, you must first configure your table. It needs to have at least one float_vector attribute, which serves as a data vector. You need to specify the following properties:
knn_type
: A mandatory setting; currently, onlyhnsw
is supported.knn_dims
: A mandatory setting that specifies the dimensions of the vectors being indexed.hnsw_similarity
: A mandatory setting that specifies the distance function used by the HNSW index. Acceptable values are:L2
- Squared L2IP
- Inner productCOSINE
- Cosine similarity
hnsw_m
: An optional setting that defines the maximum number of outgoing connections in the graph. The default is 16.hnsw_ef_construction
: An optional setting that defines a construction time/accuracy trade-off.
- SQL
create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' );
Query OK, 0 rows affected (0.01 sec)
After creating the table, you need to insert your vector data, ensuring it matches the dimensions you specified when creating the table.
- SQL
- JSON
insert into test values ( 1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821) ), ( 2, 'white bag', (-0.148894,0.748278,0.091892,-0.095406) );
Query OK, 2 rows affected (0.00 sec)
Now, you can perform a KNN search using the knn
clause in either SQL or JSON format. Both interfaces support the same essential parameters, ensuring a consistent experience regardless of the format you choose:
- SQL:
select ... from <table name> where knn ( <field>, <k>, <query vector> [,<ef>] )
- JSON:
POST /search { "index": "<table name>", "knn": { "field": "<field>", "query_vector": [<query vector>], "k": <k>, "ef": <ef> } }
The parameters are:
field
: This is the name of the float vector attribute containing vector data.k
: This represents the number of documents to return and is a key parameter for Hierarchical Navigable Small World (HNSW) indexes. It specifies the quantity of documents that a single HNSW index should return. However, the actual number of documents included in the final results may vary. For instance, if the system is dealing with real-time tables divided into disk chunks, each chunk could returnk
documents, leading to a total that exceeds the specifiedk
(as the cumulative count would benum_chunks * k
). On the other hand, the final document count might be less thank
if, after requestingk
documents, some are filtered out based on specific attributes. It's important to note that the parameterk
does not apply to ramchunks. In the context of ramchunks, the retrieval process operates differently, and thus, thek
parameter's effect on the number of documents returned is not applicable.query_vector
: This is the search vector.ef
: optional size of the dynamic list used during the search. A higheref
leads to more accurate but slower search.
Documents are always sorted by their distance to the search vector. Any additional sorting criteria you specify will be applied after this primary sort condition. For retrieving the distance, there is a built-in function called knn_dist().
- SQL
- JSON
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926), 2000 );
+------+------------+
| id | knn_dist() |
+------+------------+
| 1 | 0.28146550 |
| 2 | 0.81527930 |
+------+------------+
2 rows in set (0.00 sec)
NOTE: Finding similar documents by id requires Manticore Buddy. If it doesn't work, make sure Buddy is installed.
Finding documents similar to a specific one based on its unique ID is a common task. For instance, when a user views a particular item, Manticore Search can efficiently identify and display a list of items that are most similar to it in the vector space. Here's how you can do it:
- SQL:
select ... from <table name> where knn ( <field>, <k>, <document id> )
- JSON:
POST /search { "index": "<table name>", "knn": { "field": "<field>", "doc_id": <document id>, "k": <k> } }
The parameters are:
field
: This is the name of the float vector attribute containing vector data.k
: This represents the number of documents to return and is a key parameter for Hierarchical Navigable Small World (HNSW) indexes. It specifies the quantity of documents that a single HNSW index should return. However, the actual number of documents included in the final results may vary. For instance, if the system is dealing with real-time tables divided into disk chunks, each chunk could returnk
documents, leading to a total that exceeds the specifiedk
(as the cumulative count would benum_chunks * k
). On the other hand, the final document count might be less thank
if, after requestingk
documents, some are filtered out based on specific attributes. It's important to note that the parameterk
does not apply to ramchunks. In the context of ramchunks, the retrieval process operates differently, and thus, thek
parameter's effect on the number of documents returned is not applicable.document id
: Document ID for KNN similarity search.
- SQL
- JSON
select id, knn_dist() from test where knn ( image_vector, 5, 1 );
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
Manticore also supports additional filtering of documents returned by the KNN search, either by full-text matching, attribute filters, or both.
- SQL
- JSON
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) ) and match('white') and id < 10;
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
ALTER TABLE table ADD COLUMN column_name [{INTEGER|INT|BIGINT|FLOAT|BOOL|MULTI|MULTI64|JSON|STRING|TIMESTAMP|TEXT [INDEXED [ATTRIBUTE]]}] [engine='columnar']
ALTER TABLE table DROP COLUMN column_name
ALTER TABLE table MODIFY COLUMN column_name bigint
This feature only supports adding one field at a time for RT tables or the expansion of an int
column to bigint
. The supported data types are:
int
- integer attributetimestamp
- timestamp attributebigint
- big integer attributefloat
- float attributebool
- boolean attributemulti
- multi-valued integer attributemulti64
- multi-valued bigint attributejson
- json attributestring
/text attribute
/string attribute
- string attributetext
/text indexed stored
/string indexed stored
- full-text indexed field with original value stored in docstoretext indexed
/string indexed
- full-text indexed field, indexed only (the original value is not stored in docstore)text indexed attribute
/string indexed attribute
- full text indexed field + string attribute (not storing the original value in docstore)text stored
/string stored
- the value will be only stored in docstore, not full-text indexed, not a string attribute- adding
engine='columnar'
to any attribute (except for json) will make it stored in the columnar storage
- ❗It's recommended to backup table files before
ALTER
ing it to avoid data corruption in case of a sudden power interruption or other similar issues. - Querying a table is impossible while a column is being added.
- Newly created attribute's values are set to 0.
ALTER
will not work for distributed tables and tables without any attributes.- You can't delete the
id
column. - When dropping a field which is both a full-text field and a string attribute the first
ALTER DROP
drops the attribute, the second one drops the full-text field. - Adding/dropping full-text field is only supported in the RT mode.
- Example
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| group_id | uint |
| date_added | timestamp |
+------------+-----------+
mysql> alter table rt add column test integer;
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| group_id | uint |
| date_added | timestamp |
| test | uint |
+------------+-----------+
mysql> alter table rt drop column group_id;
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| date_added | timestamp |
| test | uint |
+------------+-----------+
mysql> alter table rt add column title text indexed;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
mysql> alter table rt add column title text attribute;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
| title | string | |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
ALTER TABLE table ft_setting='value'[, ft_setting2='value']
You can use ALTER
to modify the full-text settings of your table in RT mode. However, it only affects new documents and not existing ones.
Example:
- create a table with a full-text field and
charset_table
that allows only 3 searchable characters:a
,b
andc
. - then we insert document 'abcd' and find it by query
abcd
, thed
just gets ignored since it's not in thecharset_table
array - then we understand, that we want
d
to be searchable too, so we add it with help ofALTER
- but the same query
where match('abcd')
still says it searched byabc
, because the existing document remembers previous contents ofcharset_table
- then we add another document
abcd
and search byabcd
again - now it finds the both documents and
show meta
says it used two keywords:abc
(to find the old document) andabcd
(for the new one).
- Example
mysql> create table rt(title text) charset_table='a,b,c';
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
+---------------+-------+
mysql> alter table rt charset_table='a,b,c,d';
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
+---------------+-------+
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688055 | abcd |
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
| keyword[1] | abcd |
| docs[1] | 1 |
| hits[1] | 1 |
+---------------+-------+
You can change the name of a real-time table in RT mode.
ALTER TABLE table_name RENAME new_table_name;
NOTE: Renaming a real-time table requires Manticore Buddy. If it doesn't work, make sure Buddy is installed.
- Example
ALTER TABLE table_name RENAME new_table_name;
Query OK, 0 rows affected (0.00 sec)
ALTER TABLE table RECONFIGURE
ALTER
can also reconfigure an RT table in the plain mode, so that new tokenization, morphology and other text processing settings from the configuration file take effect for new documents. Note, that the existing document will be left intact. Internally, it forcibly saves the current RAM chunk as a new disk chunk and adjusts the table header, so that new documents are tokenized using the updated full-text settings.
- Example
mysql> show table rt settings;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| settings | |
+---------------+-------+
1 row in set (0.00 sec)
mysql> alter table rt reconfigure;
Query OK, 0 rows affected (0.00 sec)
mysql> show table rt settings;
+---------------+----------------------+
| Variable_name | Value |
+---------------+----------------------+
| settings | morphology = stem_en |
+---------------+----------------------+
1 row in set (0.00 sec)
ALTER TABLE table REBUILD SECONDARY
You can also use ALTER
to rebuild secondary indexes in a given table. Sometimes, a secondary index can be disabled for the entire table or for one or multiple attributes within the table:
- When an attribute is updated, its secondary index gets disabled.
- If Manticore loads a table with an old version of secondary indexes that is no longer supported, the secondary indexes will be disabled for the entire table.
ALTER TABLE table REBUILD SECONDARY
rebuilds secondary indexes from attribute data and enables them again.
Additionally, an old version of secondary indexes may be supported but will lack certain features. REBUILD SECONDARY
can be used to update secondary indexes.
- Example
ALTER TABLE rt REBUILD SECONDARY;
Query OK, 0 rows affected (0.00 sec)
To change the list of local or remote nodes in a distributed table, follow the same syntax you used to create the table. Just replace CREATE
with ALTER
in the command and remove type='distributed'
:
ALTER TABLE `distr_table_name` [[local='local_index_name'], [agent='host:port:remote_index'] ... ]
- Example
ALTER TABLE local_dist local='index1' local='index2' agent='127.0.0.1:9312:remote_index';