Manticore Search lets you add embeddings generated by your Machine Learning models to each document and then perform a nearest-neighbor search on them. This enables features such as similarity search, recommendations, semantic search, and relevance ranking based on NLP algorithms, as well as image, video, and sound search.
An embedding is a method of representing data—such as text, images, or sound—as vectors in a high-dimensional space. These vectors are crafted to ensure that the distance between them reflects the similarity of the data they represent. This process typically employs algorithms like word embeddings (e.g., Word2Vec, BERT) for text or neural networks for images. The high-dimensional nature of the vector space, with many components per vector, allows for the representation of complex and nuanced relationships between items. Their similarity is gauged by the distance between these vectors, often measured using methods like Euclidean distance or cosine similarity.
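To illustrate how distance between vectors reflects similarity, here is a minimal Python sketch of the two measures mentioned above. The vectors are made-up toy values, not the output of any real embedding model:

```python
import math

def squared_l2(a, b):
    # Squared Euclidean distance: smaller means more similar
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: closer to 1 means more similar
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"
cat    = [1.0, 0.2, 0.0, 0.5]
kitten = [0.9, 0.3, 0.1, 0.4]
car    = [0.0, 1.0, 0.9, 0.0]

print(squared_l2(cat, kitten) < squared_l2(cat, car))                      # True
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))        # True
```

Under either measure, the "kitten" vector comes out closer to "cat" than "car" does, which is exactly the property a KNN search exploits.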
Manticore Search enables k-nearest neighbor (KNN) vector searches using the HNSW library. This functionality is part of the Manticore Columnar Library.
To run KNN searches, you must first configure your table: it needs at least one float_vector attribute, which stores the vector data. You need to specify the following properties:
knn_type
: A mandatory setting; currently, only `hnsw` is supported.

knn_dims
: A mandatory setting that specifies the number of dimensions of the vectors being indexed.

hnsw_similarity
: A mandatory setting that specifies the distance function used by the HNSW index. Acceptable values are:
  - `L2` - Squared L2
  - `IP` - Inner product
  - `COSINE` - Cosine similarity

hnsw_m
: An optional setting that defines the maximum number of outgoing connections in the graph. The default is 16.

hnsw_ef_construction
: An optional setting that defines a construction time/accuracy trade-off.
- SQL
create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' );
Query OK, 0 rows affected (0.01 sec)
After creating the table, you need to insert your vector data, ensuring it matches the dimensions you specified when creating the table.
- SQL
- JSON
insert into test values ( 1, 'yellow bag', (0.653448,0.192478,0.017971,0.339821) ), ( 2, 'white bag', (-0.148894,0.748278,0.091892,-0.095406) );
Query OK, 2 rows affected (0.00 sec)
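If your embeddings live in Python lists, you can build the INSERT statement shown above programmatically. The helper below is a hypothetical sketch, not part of any Manticore client library; in a real application, prefer a client library or parameterized queries to avoid SQL injection:

```python
def vector_literal(vec):
    # Format a Python list of floats as a Manticore float_vector literal: (v1,v2,...)
    return "(" + ",".join(f"{v:.6f}" for v in vec) + ")"

def insert_statement(table, rows):
    # rows: list of (id, title, vector) tuples; vector length must match knn_dims
    values = ", ".join(
        f"( {doc_id}, '{title}', {vector_literal(vec)} )" for doc_id, title, vec in rows
    )
    return f"insert into {table} values {values};"

stmt = insert_statement("test", [
    (1, "yellow bag", [0.653448, 0.192478, 0.017971, 0.339821]),
    (2, "white bag", [-0.148894, 0.748278, 0.091892, -0.095406]),
])
print(stmt)
```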
Now, you can perform a KNN search using the `knn` clause in either SQL or JSON format. Both interfaces support the same essential parameters, ensuring a consistent experience regardless of the format you choose:
- SQL:
select ... from <table name> where knn ( <field>, <k>, <query vector> [,<ef>] )
- JSON:
POST /search { "table": "<table name>", "knn": { "field": "<field>", "query_vector": [<query vector>], "k": <k>, "ef": <ef> } }
The parameters are:
field
: The name of the float vector attribute containing the vector data.

k
: The number of documents to return; a key parameter for Hierarchical Navigable Small World (HNSW) indexes. It specifies the number of documents that a single HNSW index should return. However, the actual number of documents included in the final results may vary. For instance, if the system is dealing with real-time tables divided into disk chunks, each chunk could return `k` documents, leading to a total that exceeds the specified `k` (as the cumulative count would be `num_chunks * k`). On the other hand, the final document count might be less than `k` if, after requesting `k` documents, some are filtered out based on specific attributes. Note that the parameter `k` does not apply to RAM chunks; retrieval from a RAM chunk operates differently, so `k` does not limit the number of documents it returns.

query_vector
: The search vector.

ef
: An optional setting that specifies the size of the dynamic list used during the search. A higher `ef` leads to a more accurate but slower search.
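The per-chunk behavior of `k` described above can be simulated with a brute-force sketch. The chunk layout here is hypothetical (real disk chunks are managed by Manticore internally); the point is only that merging per-chunk results can yield more than `k` documents:

```python
def knn(chunk, query, k):
    # Brute-force k-nearest-neighbor within one chunk by squared L2 distance
    dist = lambda v: sum((x - y) ** 2 for x, y in zip(v, query))
    return sorted(chunk, key=lambda doc: dist(doc["vec"]))[:k]

# A real-time table whose data is split across two disk chunks
chunks = [
    [{"id": 1, "vec": [0.1, 0.2]}, {"id": 2, "vec": [0.9, 0.8]}],
    [{"id": 3, "vec": [0.2, 0.1]}, {"id": 4, "vec": [0.7, 0.9]}],
]
k = 2
merged = [doc for chunk in chunks for doc in knn(chunk, [0.0, 0.0], k)]
print(len(merged))  # 4: num_chunks * k, more than k itself
```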
Documents are always sorted by their distance to the search vector. Any additional sorting criteria you specify will be applied after this primary sort condition. For retrieving the distance, there is a built-in function called knn_dist().
- SQL
- JSON
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926), 2000 );
+------+------------+
| id | knn_dist() |
+------+------------+
| 1 | 0.28146550 |
| 2 | 0.81527930 |
+------+------------+
2 rows in set (0.00 sec)
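With `hnsw_similarity='l2'`, the value reported by `knn_dist()` is the squared L2 distance, which can be checked in Python against the example output above:

```python
query = [0.286569, -0.031816, 0.066684, 0.032926]
doc1  = [0.653448, 0.192478, 0.017971, 0.339821]  # 'yellow bag', id 1

# Squared L2 distance, as used when hnsw_similarity='l2'
dist = sum((q - d) ** 2 for q, d in zip(query, doc1))
print(round(dist, 8))  # ≈ 0.2814655, matching knn_dist() for id 1 above
```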
NOTE: Finding similar documents by id requires Manticore Buddy. If it doesn't work, make sure Buddy is installed.
Finding documents similar to a specific one based on its unique ID is a common task. For instance, when a user views a particular item, Manticore Search can efficiently identify and display a list of items that are most similar to it in the vector space. Here's how you can do it:
- SQL:
select ... from <table name> where knn ( <field>, <k>, <document id> )
- JSON:
POST /search { "table": "<table name>", "knn": { "field": "<field>", "doc_id": <document id>, "k": <k> } }
The parameters are:
field
: The name of the float vector attribute containing the vector data.

k
: The number of documents to return; a key parameter for Hierarchical Navigable Small World (HNSW) indexes. It specifies the number of documents that a single HNSW index should return. However, the actual number of documents included in the final results may vary. For instance, if the system is dealing with real-time tables divided into disk chunks, each chunk could return `k` documents, leading to a total that exceeds the specified `k` (as the cumulative count would be `num_chunks * k`). On the other hand, the final document count might be less than `k` if, after requesting `k` documents, some are filtered out based on specific attributes. Note that the parameter `k` does not apply to RAM chunks; retrieval from a RAM chunk operates differently, so `k` does not limit the number of documents it returns.

document id
: The ID of the document whose vector is used for the KNN similarity search.
- SQL
- JSON
select id, knn_dist() from test where knn ( image_vector, 5, 1 );
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
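Conceptually, a search by document ID looks up the stored vector for that ID, runs the KNN search with it as the query vector, and excludes the source document itself from the results. The sketch below illustrates that idea with the two documents inserted earlier (Manticore's actual internals may differ):

```python
# Stored vectors from the inserts above
docs = {
    1: [0.653448, 0.192478, 0.017971, 0.339821],    # 'yellow bag'
    2: [-0.148894, 0.748278, 0.091892, -0.095406],  # 'white bag'
}

def knn_by_doc_id(doc_id, k):
    # Use the stored vector of doc_id as the query vector
    # and exclude the source document from the results.
    query = docs[doc_id]
    dist = lambda v: sum((x - y) ** 2 for x, y in zip(v, query))
    candidates = [i for i in docs if i != doc_id]
    return sorted(candidates, key=lambda i: dist(docs[i]))[:k]

print(knn_by_doc_id(1, 5))  # [2]: the only other document
```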
Manticore also supports additional filtering of documents returned by the KNN search, either by full-text matching, attribute filters, or both.
- SQL
- JSON
select id, knn_dist() from test where knn ( image_vector, 5, (0.286569,-0.031816,0.066684,0.032926) ) and match('white') and id < 10;
+------+------------+
| id | knn_dist() |
+------+------------+
| 2 | 0.81527930 |
+------+------------+
1 row in set (0.00 sec)
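A brute-force sketch of the query above shows the effect of such filtering: the KNN candidates are retrieved first, and the full-text and attribute conditions then narrow them down (this illustrates the observable behavior, not Manticore's internal execution plan):

```python
docs = [
    {"id": 1, "title": "yellow bag", "vec": [0.653448, 0.192478, 0.017971, 0.339821]},
    {"id": 2, "title": "white bag",  "vec": [-0.148894, 0.748278, 0.091892, -0.095406]},
]
query = [0.286569, -0.031816, 0.066684, 0.032926]

dist = lambda v: sum((x - y) ** 2 for x, y in zip(v, query))
# KNN retrieval first (k=5), then full-text and attribute filters narrow the result
knn = sorted(docs, key=lambda d: dist(d["vec"]))[:5]
result = [d["id"] for d in knn if "white" in d["title"] and d["id"] < 10]
print(result)  # [2]
```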