Percolate queries are also known as persistent queries, prospective search, document routing, search in reverse, and inverse search.
The normal way of doing searches is to store documents and run search queries against them. However, there are cases where we want to apply a query to a newly incoming document and signal a match. For example, a monitoring system doesn't just collect data; it is also expected to notify the user about different events, such as a metric reaching some threshold or a certain value appearing in the monitored data. Another similar case is news aggregation. You can notify the user about any fresh news, but the user might want to be notified only about certain categories or topics. Going further, they might be interested only in certain "keywords".
This is where a traditional search is not a good fit, since it assumes running the desired search over the entire collection. Multiplied by the number of users, this means lots of queries running over the entire collection, which can add a lot of extra load. The idea explained in this section is to store the queries instead and test them against an incoming new document or a batch of documents.
Google Alerts, AlertHN, Bloomberg Terminal and other systems that let their users subscribe to something use a similar technology.
- See percolate to learn how to create a PQ index.
- See Adding rules to a percolate index to learn how to add percolate rules (also known as PQ rules). Here let's just give a quick example.
The key thing you need to remember about percolate queries is that you already have your search queries in the index. What you need to provide is documents to check if any of them match any of the stored rules.
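To make the "stored queries, incoming documents" idea concrete, here is a minimal toy model in Python. It is only a sketch of the concept, not how Manticore matches rules internally: the `Rule` class, the `percolate` function, and the single-word matching are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]


@dataclass
class Rule:
    """A stored query: a required word in `title` plus an optional filter."""
    rule_id: int
    title_word: str
    predicate: Optional[Callable[[Document], bool]] = None

    def matches(self, doc: Document) -> bool:
        # Toy full-text check: the word must appear in the title.
        words = doc.get("title", "").lower().split()
        if self.title_word not in words:
            return False
        # The optional filter must also accept the document.
        return self.predicate is None or self.predicate(doc)


def percolate(rules: List[Rule], doc: Document) -> List[int]:
    """Return ids of all stored rules that the incoming document satisfies."""
    return [rule.rule_id for rule in rules if rule.matches(doc)]


# The rules are stored up front; documents arrive later and are only tested.
rules = [
    Rule(1, "bag"),
    Rule(2, "shoes", lambda d: d.get("color") == "red"),
    Rule(3, "shoes", lambda d: d.get("color") in ("blue", "green")),
]

print(percolate(rules, {"title": "What a nice bag"}))
print(percolate(rules, {"title": "Beautiful shoes"}))
print(percolate(rules, {"title": "nice pair of shoes", "color": "red"}))
```

Note that the incoming documents are never stored: each one is checked against the rule set and then discarded, which is the reverse of a regular search.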
You can perform a percolate query via the SQL or JSON interfaces, as well as through programming language clients. The SQL way gives more flexibility, while the HTTP way is simpler and covers most of what you need. The table below can help you understand the differences.
Desired behaviour | SQL | HTTP | PHP |
---|---|---|---|
Provide a single document | CALL PQ('idx', '{doc1}') | query.percolate.document{doc1} | $client->pq()->search([$percolate]) |
Provide a single document (alternative) | CALL PQ('idx', 'doc1', 0 as docs_json) | - | - |
Provide multiple documents | CALL PQ('idx', ('doc1', 'doc2'), 0 as docs_json) | query.percolate.documents[{doc1}, {doc2}] | $client->pq()->search([$percolate]) |
Provide multiple documents (alternative) | CALL PQ('idx', ('{doc1}', '{doc2}')) | - | - |
Provide multiple documents (alternative) | CALL PQ('idx', '[{doc1}, {doc2}]') | - | - |
Return matching document ids | 0/1 as docs (disabled by default) | Enabled by default | Enabled by default |
Use document's own id to show in the result | 'id field' as docs_id (disabled by default) | Not available | Not available |
Consider input documents are JSON | 1 as docs_json (1 by default) | Enabled by default | Enabled by default |
Consider input documents are plain text | 0 as docs_json (1 by default) | Not available | Not available |
Sparsed distribution mode | default | default | default |
Sharded distribution mode | sharded as mode | Not available | Not available |
Return all info about matching query | 1 as query (0 by default) | Enabled by default | Enabled by default |
Skip invalid JSON | 1 as skip_bad_json (0 by default) | Not available | Not available |
Extended info in SHOW META | 1 as verbose (0 by default) | Not available | Not available |
Define the number which will be added to document ids if no docs_id fields provided (makes sense mostly in distributed PQ modes) | 1 as shift (0 by default) | Not available | Not available |
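The SQL options in the table all follow the same `value as name` pattern, so a statement can be assembled programmatically. The following is a small sketch of such a builder in Python; `sql_quote` and `build_call_pq` are hypothetical helper names, and the backslash escaping of quotes and backslashes is an assumption about how string literals should be protected.

```python
import json
from typing import Any, Dict


def sql_quote(text: str) -> str:
    """Wrap text in single quotes, backslash-escaping backslashes and quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_call_pq(index: str, doc: Dict[str, Any], **options: Any) -> str:
    """Build a CALL PQ statement for one JSON document.

    Each keyword argument becomes a `value as name` option; string values
    are quoted, everything else is rendered as an integer.
    """
    parts = [sql_quote(index), sql_quote(json.dumps(doc))]
    for name, value in options.items():
        rendered = sql_quote(value) if isinstance(value, str) else str(int(value))
        parts.append(f"{rendered} as {name}")
    return f"CALL PQ({', '.join(parts)});"


print(build_call_pq("products", {"title": "What a nice bag"}, query=1))
print(build_call_pq("products", {"id": 123, "title": "bag"}, docs_id="id", docs=1))
```

The helper only produces text; sending the statement to the server is left to whatever MySQL-protocol client you already use.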
To demonstrate how it works, here are a few examples. Let's create a PQ index with 2 fields:
- title (text)
- color (string)
and 3 rules in it:
- Just full-text. Query: @title bag
- Full-text and filtering. Query: @title shoes. Filters: color='red'
- Full-text and more complex filtering. Query: @title shoes. Filters: color IN('blue', 'green')
CREATE TABLE products(title text, color string) type='pq';
INSERT INTO products(query) values('@title bag');
INSERT INTO products(query,filters) values('@title shoes', 'color=\'red\'');
INSERT INTO products(query,filters) values('@title shoes', 'color in (\'blue\', \'green\')');
select * from products;
+---------------------+--------------+------+---------------------------+
| id | query | tags | filters |
+---------------------+--------------+------+---------------------------+
| 1657852401006149635 | @title shoes | | color IN ('blue, 'green') |
| 1657852401006149636 | @title shoes | | color='red' |
| 1657852401006149637 | @title bag | | |
+---------------------+--------------+------+---------------------------+
The first document doesn't match any rules. It could match the first 2, but they require additional filters.
The second document matches one rule. Note that CALL PQ by default expects a document to be JSON, but if you pass 0 as docs_json you can provide a plain string instead.
CALL PQ('products', 'Beautiful shoes', 0 as docs_json);
CALL PQ('products', 'What a nice bag', 0 as docs_json);
CALL PQ('products', '{"title": "What a nice bag"}');
+---------------------+
| id |
+---------------------+
| 1657852401006149637 |
+---------------------+
+---------------------+
| id |
+---------------------+
| 1657852401006149637 |
+---------------------+
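Over the JSON interface the documents travel inside the request body instead of a SQL string. According to the comparison table above, a single document goes under `query.percolate.document` and several documents under `query.percolate.documents`. The sketch below only builds that body in Python; `percolate_body` is a hypothetical helper, and the HTTP endpoint to post it to is not covered here.

```python
import json
from typing import Any, Dict, List


def percolate_body(docs: List[Dict[str, Any]]) -> str:
    """Build the JSON body of a percolate request for one or more documents."""
    if not docs:
        raise ValueError("at least one document is required")
    if len(docs) == 1:
        percolate = {"document": docs[0]}      # query.percolate.document
    else:
        percolate = {"documents": docs}        # query.percolate.documents
    return json.dumps({"query": {"percolate": percolate}})


print(percolate_body([{"title": "What a nice bag"}]))
print(percolate_body([{"title": "nice pair of shoes", "color": "blue"},
                      {"title": "beautiful bag"}]))
```

Because the whole body is serialized with `json.dumps`, it is always valid JSON, which is also why there is no counterpart of the plain-text mode on this interface.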
CALL PQ('products', '{"title": "What a nice bag"}', 1 as query);
+---------------------+------------+------+---------+
| id | query | tags | filters |
+---------------------+------------+------+---------+
| 1657852401006149637 | @title bag | | |
+---------------------+------------+------+---------+
Note that via CALL PQ you can provide multiple documents in different ways:
- as an array of plain documents in round brackets: ('doc1', 'doc2'). This requires 0 as docs_json
- as an array of JSONs in round brackets: ('{doc1}', '{doc2}')
- or as a standard JSON array: '[{doc1}, {doc2}]'
CALL PQ('products', ('nice pair of shoes', 'beautiful bag'), 1 as query, 0 as docs_json);
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "red"}', '{"title": "beautiful bag"}'), 1 as query);
CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as query);
+---------------------+------------+------+---------+
| id | query | tags | filters |
+---------------------+------------+------+---------+
| 1657852401006149637 | @title bag | | |
+---------------------+------------+------+---------+
+---------------------+--------------+------+-------------+
| id | query | tags | filters |
+---------------------+--------------+------+-------------+
| 1657852401006149636 | @title shoes | | color='red' |
| 1657852401006149637 | @title bag | | |
+---------------------+--------------+------+-------------+
+---------------------+--------------+------+---------------------------+
| id | query | tags | filters |
+---------------------+--------------+------+---------------------------+
| 1657852401006149635 | @title shoes | | color IN ('blue, 'green') |
| 1657852401006149637 | @title bag | | |
+---------------------+--------------+------+---------------------------+
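The three ways of passing multiple documents differ only in how the second argument of CALL PQ is written. As an illustration, the Python sketch below renders the same batch in each form; the function names are hypothetical, and the backslash escaping in `sql_quote` is an assumption.

```python
import json
from typing import Any, Dict, List


def sql_quote(text: str) -> str:
    """Wrap text in single quotes, backslash-escaping backslashes and quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def plain_list(index: str, texts: List[str]) -> str:
    """Plain strings in round brackets; needs 0 as docs_json."""
    items = ", ".join(sql_quote(t) for t in texts)
    return f"CALL PQ({sql_quote(index)}, ({items}), 0 as docs_json);"


def json_list(index: str, docs: List[Dict[str, Any]]) -> str:
    """One JSON string per document, in round brackets."""
    items = ", ".join(sql_quote(json.dumps(d)) for d in docs)
    return f"CALL PQ({sql_quote(index)}, ({items}));"


def json_array(index: str, docs: List[Dict[str, Any]]) -> str:
    """A single string holding a standard JSON array."""
    return f"CALL PQ({sql_quote(index)}, {sql_quote(json.dumps(docs))});"


docs = [{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]
print(plain_list("products", ["nice pair of shoes", "beautiful bag"]))
print(json_list("products", docs))
print(json_array("products", docs))
```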
CALL PQ('products', '[{"title": "nice pair of shoes", "color": "blue"}, {"title": "beautiful bag"}]', 1 as query, 1 as docs);
+---------------------+-----------+--------------+------+---------------------------+
| id | documents | query | tags | filters |
+---------------------+-----------+--------------+------+---------------------------+
| 1657852401006149635 | 1 | @title shoes | | color IN ('blue, 'green') |
| 1657852401006149637 | 2 | @title bag | | |
+---------------------+-----------+--------------+------+---------------------------+
By default, matching document ids correspond to their relative numbers in the list you provide. But in some cases each document already has its own id. For this case there's the option 'id field name' as docs_id for CALL PQ.
Note that if the id cannot be found by the provided field name, the PQ rule will not be shown in the results.
This option is only available for CALL PQ via SQL.
CALL PQ('products', '[{"id": 123, "title": "nice pair of shoes", "color": "blue"}, {"id": 456, "title": "beautiful bag"}]', 1 as query, 'id' as docs_id, 1 as docs);
+---------------------+-----------+--------------+------+---------------------------+
| id | documents | query | tags | filters |
+---------------------+-----------+--------------+------+---------------------------+
| 1657852401006149664 | 456 | @title bag | | |
| 1657852401006149666 | 123 | @title shoes | | color IN ('blue, 'green') |
+---------------------+-----------+--------------+------+---------------------------+
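The way the `documents` column is filled can be summarized with a small model. This Python sketch is an illustration of the behaviour described in this section and in the options table (relative numbers by default, an optional shift, or the document's own id field); it is not Manticore's implementation, and `reported_ids` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def reported_ids(docs: List[Dict[str, Any]],
                 docs_id: Optional[str] = None,
                 shift: int = 0) -> List[Optional[Any]]:
    """Model which id is reported for each document of a batch."""
    ids: List[Optional[Any]] = []
    for position, doc in enumerate(docs, start=1):
        if docs_id is None:
            # Default: the 1-based position in the batch, plus the shift.
            ids.append(position + shift)
        else:
            # Own id: taken from the named field; None means "not found".
            ids.append(doc.get(docs_id))
    return ids


batch = [{"id": 123, "title": "nice pair of shoes", "color": "blue"},
         {"id": 456, "title": "beautiful bag"}]
print(reported_ids(batch))                # relative numbers
print(reported_ids(batch, docs_id="id"))  # the documents' own ids
print(reported_ids(batch, shift=10))      # relative numbers with a shift
```

A `None` in this model corresponds to the case where the id cannot be found by the provided field name.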
If you provide documents as separate JSONs, there is an option for CALL PQ to skip invalid JSONs. In the example, note that in the 2nd and 3rd queries the 2nd JSON is invalid. Without 1 as skip_bad_json the 2nd query fails; adding it in the 3rd query avoids that. This option is not available via JSON over HTTP, since the whole JSON query must always be valid when sent via the HTTP protocol.
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'));
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag}'));
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag}'), 1 as skip_bad_json);
+---------------------+
| id |
+---------------------+
| 1657852401006149635 |
| 1657852401006149637 |
+---------------------+
ERROR 1064 (42000): Bad JSON objects in strings: 2
+---------------------+
| id |
+---------------------+
| 1657852401006149635 |
+---------------------+
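If you would rather detect broken documents before sending them, a client-side check is straightforward. The sketch below uses Python's standard `json` module to separate valid documents from invalid ones and reports the 1-based positions of the bad ones; `split_valid` is a hypothetical helper and only approximates the server-side check.

```python
import json
from typing import List, Tuple


def split_valid(docs: List[str]) -> Tuple[List[str], List[int]]:
    """Return (valid JSON strings, 1-based positions of invalid ones)."""
    valid: List[str] = []
    bad_positions: List[int] = []
    for position, raw in enumerate(docs, start=1):
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            bad_positions.append(position)
            continue
        valid.append(raw)
    return valid, bad_positions


docs = ['{"title": "nice pair of shoes", "color": "blue"}',
        '{"title": "beautiful bag}']  # the closing quote is missing
valid, bad = split_valid(docs)
print(valid)
print(bad)
```

For the batch above the second document is reported as bad, which corresponds to the position named in the error message of the failing query.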
Percolate queries were designed with high throughput and big data volumes in mind, so there are a few things you can do to optimize performance if you are looking for lower latency and higher throughput.
There are two modes of distribution of a percolate index and how a percolate query can work against it:
- Sparsed. Default. When it is good: too many documents, PQ indexes are mirrored. The batch of documents you pass is split into parts according to the number of agents, so each node receives and processes only a part of the documents from your request. This is beneficial when your set of documents is quite big, but the set of queries stored in the PQ index is quite small. Assuming that all the hosts are mirrors, Manticore splits your set of documents and distributes the chunks among the mirrors. Once the agents are done with the queries, it collects and merges all the results and returns the final query set as if it came from one solid index. You can use replication to help the process.
- Sharded. When it is good: too many PQ rules, the rules are split among PQ indexes. The whole document set is broadcast to all indexes of the distributed PQ index without any initial split of the documents. This is beneficial when you push a relatively small set of documents, but the number of stored queries is huge. In this case it is more appropriate to store just a part of the PQ rules on each node and then merge the results returned from the nodes, which process one and the same set of documents against different sets of PQ rules. This mode has to be set explicitly, since first of all it implies multiplication of the network payload, and secondly it expects indexes with different PQ rules, which replication cannot do out of the box.
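The difference between the two modes comes down to which documents each agent receives. The Python sketch below models that routing decision only; the contiguous, equally sized chunks used for the sparsed mode are an assumption made for illustration, and the function and agent names are hypothetical.

```python
from typing import Dict, List


def sparsed_plan(docs: List[str], agents: List[str]) -> Dict[str, List[str]]:
    """Sparsed: each agent gets only its own chunk of the batch."""
    if not agents:
        raise ValueError("at least one agent is required")
    chunk = -(-len(docs) // len(agents))  # ceiling division
    return {agent: docs[i * chunk:(i + 1) * chunk]
            for i, agent in enumerate(agents)}


def sharded_plan(docs: List[str], agents: List[str]) -> Dict[str, List[str]]:
    """Sharded: every agent gets the whole batch."""
    if not agents:
        raise ValueError("at least one agent is required")
    return {agent: list(docs) for agent in agents}


def payload(plan: Dict[str, List[str]]) -> int:
    """Total number of documents sent over the network."""
    return sum(len(chunk) for chunk in plan.values())


docs = ["d1", "d2", "d3", "d4", "d5"]
agents = ["node1", "node2"]
print(sparsed_plan(docs, agents), payload(sparsed_plan(docs, agents)))
print(sharded_plan(docs, agents), payload(sharded_plan(docs, agents)))
```

In this model the sparsed plan sends each document once, while the sharded plan sends it once per agent, which illustrates the multiplication of network payload mentioned above.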
Let's assume you have the index pq_d2, which is defined as:
index pq_d2
{
    type = distributed
    agent = 127.0.0.1:6712:pq
    agent = 127.0.0.1:6712:ptitle
}
Each of 'pq' and 'ptitle' contains:
SELECT * FROM pq;
+------+-------------+------+-------------------+
| id | query | tags | filters |
+------+-------------+------+-------------------+
| 1 | filter test | | gid>=10 |
| 2 | angry | | gid>=10 OR gid<=3 |
+------+-------------+------+-------------------+
2 rows in set (0.01 sec)
And you fire CALL PQ at the distributed index with a couple of docs.
CALL PQ ('pq_d2', ('{"title":"angry test", "gid":3 }', '{"title":"filter test doc2", "gid":13}'), 1 AS docs);
+------+-----------+
| id | documents |
+------+-----------+
| 1 | 2 |
| 2 | 1 |
+------+-----------+
That was an example of the default sparsed mode. To demonstrate the sharded mode, let's create a distributed PQ index consisting of 2 local PQ indexes and add 2 rules to "products1" and 1 rule to "products2":
create table products1(title text, color string) type='pq';
create table products2(title text, color string) type='pq';
create table products_distributed type='distributed' local='products1' local='products2';
INSERT INTO products1(query) values('@title bag');
INSERT INTO products1(query,filters) values('@title shoes', 'color=\'red\'');
INSERT INTO products2(query,filters) values('@title shoes', 'color in (\'blue\', \'green\')');
Now if you add 'sharded' as mode to CALL PQ, it will send the documents to all of the agents' indexes (in this case just local indexes, but they can be remote to utilize external hardware). This mode is not available via the JSON interface.
CALL PQ('products_distributed', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 'sharded' as mode, 1 as query);
+---------------------+--------------+------+---------------------------+
| id | query | tags | filters |
+---------------------+--------------+------+---------------------------+
| 1657852401006149639 | @title bag | | |
| 1657852401006149643 | @title shoes | | color IN ('blue, 'green') |
+---------------------+--------------+------+---------------------------+
Note that the syntax of agent mirrors in the configuration (when several hosts are assigned to one agent line, separated with |) has nothing to do with the CALL PQ query mode, so each agent always represents one node regardless of the number of HA mirrors specified for this agent.
In some cases you might want to get more details about the performance of a percolate query. For that purpose there is the option 1 as verbose, which is available only via SQL and saves more performance metrics. You can see them via the SHOW META query, which you can run after CALL PQ. See SHOW META for more info.
CALL PQ('products', ('{"title": "nice pair of shoes", "color": "blue"}', '{"title": "beautiful bag"}'), 1 as verbose); show meta;
+---------------------+
| id |
+---------------------+
| 1657852401006149644 |
| 1657852401006149646 |
+---------------------+
+-------------------------+-----------+
| Name | Value |
+-------------------------+-----------+
| Total | 0.000 sec |
| Setup | 0.000 sec |
| Queries matched | 2 |
| Queries failed | 0 |
| Document matched | 2 |
| Total queries stored | 3 |
| Term only queries | 3 |
| Fast rejected queries | 0 |
| Time per query | 27, 10 |
| Time of matched queries | 37 |
+-------------------------+-----------+