Escaping characters in query string

As some characters are used as operators in the query string, they should be escaped to avoid query errors or unwanted matching conditions.

The following characters should be escaped using backslash (\):

!    "    $    '    (    )    -    /    <    @    \    ^    |    ~

In MySQL command line client

Single quote (') can be escaped using one backslash:

SELECT * FROM your_index WHERE MATCH('l\'italiano');

The rest of the characters in the above list are operators or query constructs. In order to treat them as simple characters, the engine has to see them with a preceding escape character. The backslash must also be escaped itself, resulting in 2 backslahes:

SELECT * FROM your_index WHERE MATCH('r\\&b | \\(official video\\)');

If we want to use backslash as character, we need to escape both the backslash as character and backslash as the escape operator, resulting in 4 backslashes used:

SELECT * FROM your_index WHERE MATCH('\\\\ABC');

Using MySQL drivers

MySQL drivers offer escaping functions (e.g., mysqli_real_escape_string in PHP or conn.escape_string in Python), but they only escape certain characters. You will still need to add escaping for the characters from the above list that are not escaped by the respective functions. Since these functions will escape backslash for you, you only need to add one backslash.

This also applies for the drivers that support (client-side) prepared statements. For example for PHP PDO prepared statements, you need to add a backslash for $ character:

$statement = $ln_sph->prepare( "SELECT * FROM index WHERE MATCH(:match)");
$match = '\$manticore';
$statement->bindParam(':match',$match,PDO::PARAM_STR);
$results = $statement->execute();

It will make the final query SELECT * FROM index WHERE MATCH('\\$manticore');

In HTTP JSON API

The same rules for SQL protocol apply with the exception that for JSON the double quote must be escaped with a single backslash, while the rest of the characters must be double escaped.

If using JSON libraries/functions that convert data structures to JSON strings the double quote and single backslash are escaped automatically by these functions and must not be explicitly escaped.

In clients

The new official clients (which use HTTP as protocol) are using under the hood common JSON libraries/functions available on the respective programming language. Same rules of escaping as above are applied.

Escaping asterisk

Asterisk (*) is a special character that can have two functionalities:

  • as wildcarding prefix/suffix expander
  • and as any term modifier inside a phrase search.

Unlike other special characters that are operators, the asterisk cannot be escaped when it's in a position to offer one of it's functionalities.

In non-wildcard queries, the asterisk doesn't require escaping, regardless if it's in the charset_table or not.

In wildcard queries, asterisk in the middle of a word doesn't require escaping. As a wildcard operator (either at start or end of the word), the asterisk will always be interpreted as the wildcard operator even if escaping is applied.

Escaping json node names in SQL

To escape special characters in json nodes use a backtick. For example:

MySQL [(none)]> select * from t where json.`a=b`=234;
+---------------------+-------------+------+
| id                  | json        | text |
+---------------------+-------------+------+
| 8215557549554925578 | {"a=b":234} |      |
+---------------------+-------------+------+

MySQL [(none)]> select * from t where json.`a:b`=123;
+---------------------+-------------+------+
| id                  | json        | text |
+---------------------+-------------+------+
| 8215557549554925577 | {"a:b":123} |      |
+---------------------+-------------+------+

Search profiling

How a query is interpreted

Example of a complex query:

"hello world" @title "example program"~5 @body python -(php|perl) @* code

The full meaning of this search is:

  • Find the words 'hello' and 'world' adjacently in any field in a document;
  • Additionally, the same document must also contain the words 'example' and 'program' in the title field, with up to, but not including, 5 words between the words in question; (E.g. "example PHP program" would be matched however "example script to introduce outside data into the correct context for your program" would not because two terms have 5 or more words between them)
  • Additionally, the same document must contain the word 'python' in the body field, but not contain either 'php' or 'perl';
  • Additionally, the same document must contain the word 'code' in any field.

OR operator precedence is higher than AND, so "looking for cat | dog | mouse" means "looking for ( cat | dog | mouse )" and not "(looking for cat) | dog | mouse".

To understand how a query will be executed, Manticore Search offer query profile tooling for viewing the query tree created by a query expression.

Profiling the query tree in SQL

When using SQL statement the full-text query profiling needs to be enabled before running the desired query:

SET profiling =1;
SELECT * FROM test WHERE MATCH('@title abc* @body hey');

To view the query tree, we must run SHOW PLAN right after the execution of the query:

SHOW PLAN;

The command will return the structure of the executed query. Please note that the 3 statements - SET profiling, the query and SHOW - must run on the same session.

Profiling the query in HTTP

In HTTP, we can just enable "profile":true to get in response the full-text query tree structure.

{
  "index":"test",
  "profile":true,
  "query":
  {
    "match_phrase": { "_all" : "had grown quite" }
  }
}

The response will contain a profile object in which we can find a member query.

query property contains the transformed full-text query tree. Each node contains:

  • type: node type. Can be AND, OR, PHRASE, KEYWORD etc.
  • description: query subtree for this node shown as a string (in SHOW PLAN format)
  • children: child nodes, if any
  • max_field_pos: maximum position within a field

A keyword node will also provide:

  • word: transformed keyword.
  • querypos: position of this keyword in a query.
  • excluded: keyword excluded from query.
  • expanded: keyword added by prefix expansion.
  • field_start: keyword must occur at the very start of the field.
  • field_end: keyword must occur at the very end of the field.
  • boost: keyword IDF will be multiplied by this.
‹›
  • SQL
  • HTTP
  • PHP
  • Python
  • javascript
  • Java
📋
SET profiling=1;
SELECT * FROM test WHERE MATCH('@title abc* @body hey');
SHOW PLAN \G
‹›
Response
*************************** 1\. row ***************************
Variable: transformed_tree
   Value: AND(
  OR(fields=(title), KEYWORD(abcx, querypos=1, expanded), KEYWORD(abcm, querypos=1, expanded)),
  AND(fields=(body), KEYWORD(hey, querypos=2)))
1 row in set (0.00 sec)

In some cases the evaluated query tree can be rather different from the original one because of expansions and other transformations.

‹›
  • SQL
  • HTTP
  • PHP
  • Python
  • javascript
  • Java
📋
SET profiling=1;

SELECT id FROM forum WHERE MATCH('@title way* @content hey') LIMIT 1;

SHOW PLAN;
‹›
Response
Query OK, 0 rows affected (0.00 sec)

+--------+
| id     |
+--------+
| 711651 |
+--------+
1 row in set (0.04 sec)

+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable         | Value                                                                                                                                                                                                                                                                                                                                                                                                                   |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| transformed_tree | AND(
  OR(
    OR(
      AND(fields=(title), KEYWORD(wayne, querypos=1, expanded)),
      OR(
        AND(fields=(title), KEYWORD(ways, querypos=1, expanded)),
        AND(fields=(title), KEYWORD(wayyy, querypos=1, expanded)))),
    AND(fields=(title), KEYWORD(way, querypos=1, expanded)),
    OR(fields=(title), KEYWORD(way*, querypos=1, expanded))),
  AND(fields=(content), KEYWORD(hey, querypos=2))) |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Profiling without running a query

The SQL statement EXPLAIN QUERY allows displaying the execution tree of a provided full-text query without running an actual search query on the index.

‹›
  • SQL
SQL
📋
EXPLAIN QUERY index_base '@title running @body dog'\G
‹›
Response
 EXPLAIN QUERY index_base '@title running @body dog'\G
*************************** 1\. row ***************************
Variable: transformed_tree
   Value: AND(
      OR(
            AND(fields=(title), KEYWORD(run, querypos=1, morphed)),
            AND(fields=(title), KEYWORD(running, querypos=1, morphed))))
  AND(fields=(body), KEYWORD(dog, querypos=2, morphed)))

EXPLAIN QUERY ... option format=dot allows displaying the execution tree of a provided full-text query in hierarchical format suitable for visualization by existing tools, for example https://dreampuf.github.io/GraphvizOnline :

EXPLAIN QUERY graphviz example

‹›
  • SQL
SQL
📋
EXPLAIN QUERY idx 'i me' option format=dot\G
‹›
Response
EXPLAIN QUERY idx 'i me' option format=dot\G
*************************** 1. row ***************************
Variable: transformed_tree
   Value: digraph "transformed_tree"
{

0 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
0 -> 1
1 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
1 -> 2
2 [shape=record label="i | { querypos=1 }"]
0 -> 3
3 [shape=record,style=filled,bgcolor="lightgrey" label="AND"]
3 -> 4
4 [shape=record label="me | { querypos=2 }"]
}

Viewing the match factors values

When expression ranker is used, it is possible to expose the values of the calculated factors using PACKEDFACTORS().

The function returns:

  • the values of document level factors (like bm25, field_mask, doc_word_count)
  • list with each field that returned a hit (like lcs, hit_count, word_count, sum_idf, min_hit_pos etc.)
  • list with each keyword from the query and their tf and idf values

The values can be used to understand why certain documents get scored lower or higher in a search or to improve the existing ranking expression.

‹›
  • SQL
SQL
📋
SELECT id, PACKEDFACTORS() FROM test1 WHERE MATCH('test one') OPTION ranker=expr('1')\G
‹›
Response
             id: 1
packedfactors(): bm25=569, bm25a=0.617197, field_mask=2, doc_word_count=2,
    field1=(lcs=1, hit_count=2, word_count=2, tf_idf=0.152356,
        min_idf=-0.062982, max_idf=0.215338, sum_idf=0.152356, min_hit_pos=4,
        min_best_span_pos=4, exact_hit=0, max_window_hits=1, min_gaps=2,
        exact_order=1, lccs=1, wlccs=0.215338, atc=-0.003974),
    word0=(tf=1, idf=-0.062982),
    word1=(tf=1, idf=0.215338)
1 row in set (0.00 sec)

Boolean optimization

Queries may be automatically optimized if OPTION boolean_simplify=1 is specified. Some transformations performed by this optimization include:

  • Excess brackets: ((A | B) | C) becomes ( A | B | C ); ((A B) C) becomes ( A B C )
  • Excess AND NOT: ((A !N1) !N2) becomes (A !(N1 | N2))
  • Common NOT: ((A !N) | (B !N)) becomes ((A|B) !N)
  • Common Compound NOT: ((A !(N AA)) | (B !(N BB))) becomes (((A|B) !N) | (A !AA) | (B !BB)) if the cost of evaluating N is greater than the added together costs of evaluating A and B
  • Common subterm: ((A (N | AA)) | (B (N | BB))) becomes (((A|B) N) | (A AA) | (B BB)) if the cost of evaluating N is greater than the added together costs of evaluating A and B
  • Common keywords: (A | "A B"~N) becomes A; ("A B" | "A B C") becomes "A B"; ("A B"~N | "A B C"~N) becomes ("A B"~N)
  • Common phrase: ("X A B" | "Y A B") becomes ("("X"|"Y") A B")
  • Common AND NOT: ((A !X) | (A !Y) | (A !Z)) becomes (A !(X Y Z))
  • Common OR NOT: ((A !(N | N1)) | (B !(N | N2))) becomes (( (A !N1) | (B !N2) ) !N) Note that optimizing the queries consumes CPU time, so for simple queries -or for hand-optimized queries- you'll do better with the default boolean_simplify=0 value. Simplifications are often better for complex queries, or algorithmically generated queries.

Queries like "-dog", which implicitly include all documents from the collection, can not be evaluated. This is both for technical and performance reasons. Technically, Manticore does not always keep a list of all IDs. Performance-wise, when the collection is huge (ie. 10-100M documents), evaluating such queries could take very long.