Search profiling

How a query is interpreted

Consider this complex query example:

"hello world" @title "example program"~5 @body python -(php|perl) @* code

The full meaning of this search is:

  • Locate the words 'hello' and 'world' adjacently in any field within a document;
  • Additionally, the same document must also contain the words 'example' and 'program' in the title field, with up to, but not including, 5 words between them; (For instance, "example PHP program" would match, but "example script to introduce outside data into the correct context for your program" would not, as there are 5 or more words between the two terms)
  • Furthermore, the same document must have the word 'python' in the body field, while excluding 'php' or 'perl';
  • Finally, the same document must include the word 'code' in any field.

The OR operator takes precedence over AND, so "looking for cat | dog | mouse" means "looking for (cat | dog | mouse)" rather than "(looking for cat) | dog | mouse".

To comprehend how a query will be executed, Manticore Search provides query profiling tools to examine the query tree generated by a query expression.

Profiling the query tree in SQL

To enable full-text query profiling with an SQL statement, you must activate it before executing the desired query:

SET profiling =1;
SELECT * FROM test WHERE MATCH('@title abc* @body hey');

To view the query tree, execute the SHOW PLAN command immediately after running the query:

SHOW PLAN;

This command will return the structure of the executed query. Keep in mind that the 3 statements - SET profiling, the query, and SHOW - must be executed within the same session.

Profiling the query in HTTP JSON

When using the HTTP JSON protocol we can just enable "profile":true to get in response the full-text query tree structure.

{
  "index":"test",
  "profile":true,
  "query":
  {
    "match_phrase": { "_all" : "had grown quite" }
  }
}

The response will include a profile object containing a query member.

The query property holds the transformed full-text query tree. Each node consists of:

  • type: node type, which can be AND, OR, PHRASE, KEYWORD, etc.
  • description: query subtree for this node represented as a string (in SHOW PLAN format)
  • children: any child nodes, if present
  • max_field_pos: maximum position within a field

A keyword node will additionally include:

  • word: the transformed keyword.
  • querypos: position of this keyword in the query.
  • excluded: keyword excluded from the query.
  • expanded: keyword added by prefix expansion.
  • field_start: keyword must appear at the beginning of the field.
  • field_end: keyword must appear at the end of the field.
  • boost: the keyword's IDF will be multiplied by this value.
‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
📋
SET profiling=1;
SELECT * FROM test WHERE MATCH('@title abc* @body hey');
SHOW PLAN \G
‹›
Response
*************************** 1\. row ***************************
Variable: transformed_tree
   Value: AND(
  OR(fields=(title), KEYWORD(abcx, querypos=1, expanded), KEYWORD(abcm, querypos=1, expanded)),
  AND(fields=(body), KEYWORD(hey, querypos=2)))
1 row in set (0.00 sec)

In some instances, the evaluated query tree may significantly differ from the original one due to expansions and other transformations.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • Java
  • C#
📋
SET profiling=1;

SELECT id FROM forum WHERE MATCH('@title way* @content hey') LIMIT 1;

SHOW PLAN;
‹›
Response
Query OK, 0 rows affected (0.00 sec)

+--------+
| id     |
+--------+
| 711651 |
+--------+
1 row in set (0.04 sec)

+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable         | Value                                                                                                                                                                                                                                                                                                                                                                                                                   |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| transformed_tree | AND(
  OR(
    OR(
      AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)),
      OR(
        AND(fields=(title), KEYWORD(ways, querypos=1, expanded)),
        AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))),
    AND(fields=(title), KEYWORD(way, querypos=1, expanded)),
    OR(fields=(title), KEYWORD(way*, querypos=1, expanded))),
  AND(fields=(content), KEYWORD(hey, querypos=2))) |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Profiling without running a query

The SQL statement EXPLAIN QUERY enables the display of the execution tree for a given full-text query without performing an actual search query on the table.

‹›
  • SQL
SQL
📋
EXPLAIN QUERY index_base '@title running @body dog'\G
‹›
Response
 EXPLAIN QUERY index_base '@title running @body dog'\G
*************************** 1\. row ***************************
Variable: transformed_tree
   Value: AND(
      OR(
            AND(fields=(title), KEYWORD(run, querypos=1, morphed)),
            AND(fields=(title), KEYWORD(running, querypos=1, morphed))))
  AND(fields=(body), KEYWORD(dog, querypos=2, morphed)))

EXPLAIN QUERY ... option format=dot allows displaying the execution tree of a provided full-text query in a hierarchical format suitable for visualization by existing tools, such as https://dreampuf.github.io/GraphvizOnline:

EXPLAIN QUERY graphviz example

‹›
  • SQL
SQL
📋
EXPLAIN QUERY tbl 'i me' option format=dot\G
‹›
Response
EXPLAIN QUERY tbl 'i me' option format=dot\G
*************************** 1. row ***************************
Variable: transformed_tree
   Value: digraph "transformed_tree"
{

0 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
0 -> 1
1 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
1 -> 2
2 [shape=record label="i | { querypos=1 }"]
0 -> 3
3 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
3 -> 4
4 [shape=record label="me | { querypos=2 }"]
}

Viewing the match factors values

When using an expression ranker, it's possible to reveal the values of the calculated factors with the PACKEDFACTORS() function.

The function returns:

  • The values of document-level factors (such as bm25, field_mask, doc_word_count)
  • A list of each field that generated a hit (including lcs, hit_count, word_count, sum_idf, min_hit_pos, etc.)
  • A list of each keyword from the query along with their tf and idf values

These values can be utilized to understand why certain documents receive lower or higher scores in a search or to refine the existing ranking expression.

‹›
  • SQL
SQL
📋
SELECT id, PACKEDFACTORS() FROM test1 WHERE MATCH('test one') OPTION ranker=expr('1')\G
‹›
Response
             id: 1
packedfactors(): bm25=569, bm25a=0.617197, field_mask=2, doc_word_count=2,
    field1=(lcs=1, hit_count=2, word_count=2, tf_idf=0.152356,
        min_idf=-0.062982, max_idf=0.215338, sum_idf=0.152356, min_hit_pos=4,
        min_best_span_pos=4, exact_hit=0, max_window_hits=1, min_gaps=2,
        exact_order=1, lccs=1, wlccs=0.215338, atc=-0.003974),
    word0=(tf=1, idf=-0.062982),
    word1=(tf=1, idf=0.215338)
1 row in set (0.00 sec)

Boolean optimization

Queries can be automatically optimized if OPTION boolean_simplify=1 is specified. Some transformations performed by this optimization include:

  • Excess brackets: ((A | B) | C) becomes (A | B | C); ((A B) C) becomes (A B C)
  • Excess AND NOT: ((A !N1) !N2) becomes (A !(N1 | N2))
  • Common NOT: ((A !N) | (B !N)) becomes ((A | B) !N)
  • Common Compound NOT: ((A !(N AA)) | (B !(N BB))) becomes (((A | B) !N) | (A !AA) | (B !BB)) if the cost of evaluating N is greater than the sum of evaluating A and B
  • Common subterm: ((A (N | AA)) | (B (N | BB))) becomes (((A | B) N) | (A AA) | (B BB)) if the cost of evaluating N is greater than the sum of evaluating A and B
  • Common keywords: (A | "A B"~N) becomes A; ("A B" | "A B C") becomes "A B"; ("A B"~N | "A B C"~N) becomes ("A B"~N)
  • Common phrase: ("X A B" | "Y A B") becomes ("("X"|"Y") A B")
  • Common AND NOT: ((A !X) | (A !Y) | (A !Z)) becomes (A !(X Y Z))
  • Common OR NOT: ((A !(N | N1)) | (B !(N | N2))) becomes (( (A !N1) | (B !N2) ) !N) Note that optimizing queries consumes CPU time, so for simple queries or hand-optimized queries, you'll achieve better results with the default boolean_simplify=0 value. Simplifications often benefit complex queries or algorithmically generated queries.

Queries like -dog, which could potentially include all documents from the collection are not allowed by default. To allow them, you must specify not_terms_only_allowed=1 either as a global setting or as a search option.

Search results

SQL

When you run a query via SQL over the MySQL protocol, you receive the requested columns as a result or an empty result set if nothing is found.

‹›
  • SQL
SQL
📋
SELECT * FROM tbl;
‹›
Response
+------+------+--------+
| id   | age  | name   |
+------+------+--------+
|    1 |   25 | joe    |
|    2 |   25 | mary   |
|    3 |   33 | albert |
+------+------+--------+
3 rows in set (0.00 sec)

Additionally, you can use the SHOW META call to see extra meta-information about the latest query.

‹›
  • SQL
SQL
📋
SELECT id,story_author,comment_author FROM hn_small WHERE story_author='joe' LIMIT 3; SHOW META;
‹›
Response
++--------+--------------+----------------+
| id     | story_author | comment_author |
+--------+--------------+----------------+
| 152841 | joe          | SwellJoe       |
| 161323 | joe          | samb           |
| 163735 | joe          | jsjenkins168   |
+--------+--------------+----------------+
3 rows in set (0.01 sec)

+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 3     |
| total_found    | 20    |
| total_relation | gte   |
| time           | 0.010 |
+----------------+-------+
4 rows in set (0.00 sec)

In some cases, such as when performing a faceted search, you may receive multiple result sets as a response to your SQL query.

‹›
  • SQL
SQL
📋
SELECT * FROM tbl WHERE MATCH('joe') FACET age;
‹›
Response
+------+------+
| id   | age  |
+------+------+
|    1 |   25 |
+------+------+
1 row in set (0.00 sec)

+------+----------+
| age  | count(*) |
+------+----------+
|   25 |        1 |
+------+----------+
1 row in set (0.00 sec)

In case of a warning, the result set will include a warning flag, and you can see the warning using SHOW WARNINGS.

‹›
  • SQL
SQL
📋
SELECT * from tbl where match('"joe"/3'); show warnings;
‹›
Response
+------+------+------+
| id   | age  | name |
+------+------+------+
|    1 |   25 | joe  |
+------+------+------+
1 row in set, 1 warning (0.00 sec)

+---------+------+--------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                    |
+---------+------+--------------------------------------------------------------------------------------------+
| warning | 1000 | quorum threshold too high (words=1, thresh=3); replacing quorum operator with AND operator |
+---------+------+--------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

If your query fails, you will receive an error:

‹›
  • SQL
SQL
📋
SELECT * from tbl where match('@surname joe');
‹›
Response
ERROR 1064 (42000): index idx: query error: no field 'surname' found in schema

HTTP

Via the HTTP JSON interface, the query result is sent as a JSON document. Example:

{
  "took":10,
  "timed_out": false,
  "hits":
  {
    "total": 2,
    "hits":
    [
      {
        "_id": "1",
        "_score": 1,
        "_source": { "gid": 11 }
      },
      {
        "_id": "2",
        "_score": 1,
        "_source": { "gid": 12 }
      }
    ]
  }
}
  • took: time in milliseconds it took to execute the search
  • timed_out: whether the query timed out or not
  • hits: search results, with the following properties:
    • total: total number of matching documents
    • hits: an array containing matches

The query result can also include query profile information. See Query profile.

Each match in the hits array has the following properties:

  • _id: match id
  • _score: match weight, calculated by the ranker
  • _source: an array containing the attributes of this match

Source selection

By default, all attributes are returned in the _source array. You can use the _source property in the request payload to select the fields you want to include in the result set. Example:

{
  "index":"test",
  "_source":"attr*",
  "query": { "match_all": {} }
}

You can specify the attributes you want to include in the query result as a string ("_source": "attr*") or as an array of strings ("_source": [ "attr1", "attri*" ]"). Each entry can be an attribute name or a wildcard (*, % and ? symbols are supported).

You can also explicitly specify which attributes you want to include and which to exclude from the result set using the includes and excludes properties:

"_source":
{
  "includes": [ "attr1", "attri*" ],
  "excludes": [ "*desc*" ]
}

An empty list of includes is interpreted as "include all attributes," while an empty list of excludes does not match anything. If an attribute matches both the includes and excludes, then the excludes win.