SphinxSE

SphinxSE is a MySQL storage engine which can be compiled into MySQL/MariaDB server using its pluggable architecture.

Despite the name, SphinxSE does not actually store any data itself. It is actually a built-in client which allows MySQL server to talk to searchd, run search queries, and obtain search results. All indexing and searching happen outside MySQL.

Obvious SphinxSE applications include:

  • easier porting of MySQL FTS applications to Manticore;
  • allowing Manticore use with programming languages for which native APIs are not available yet;
  • optimizations when additional Manticore result set processing on MySQL side is required (eg. JOINs with original document tables, additional MySQL-side filtering, etc).

Installing SphinxSE

You will need to obtain a copy of MySQL sources, prepare those, and then recompile MySQL binary. MySQL sources (mysql-5.x.yy.tar.gz) could be obtained from http://dev.mysql.com Web site.

Compiling MySQL 5.0.x with SphinxSE

  1. copy sphinx.5.0.yy.diff patch file into MySQL sources directory and run
    $ patch -p1 < sphinx.5.0.yy.diff

    If there's no .diff file exactly for the specific version you need to: build, try applying .diff with closest version numbers. It is important that the patch should apply with no rejects.

  2. in MySQL sources directory, run
    $ sh BUILD/autorun.sh
  3. in MySQL sources directory, create sql/sphinx directory in and copy all files in mysqlse directory from Manticore sources there. Example:
    $ cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.0.24/sql/sphinx
  4. configure MySQL and enable the new engine:
    $ ./configure --with-sphinx-storage-engine
  5. build and install MySQL:
    $ make
    $ make install

Compiling MySQL 5.1.x with SphinxSE

  1. in MySQL sources directory, create storage/sphinx directory in and copy all files in mysqlse directory from Manticore sources there. Example:
    $ cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.1.14/storage/sphinx
  2. in MySQL sources directory, run
    $ sh BUILD/autorun.sh
  3. configure MySQL and enable Manticore engine:
    $ ./configure --with-plugins=sphinx
  4. build and install MySQL:
    $ make
    $ make install

Checking SphinxSE installation

To check whether SphinxSE has been successfully compiled into MySQL, launch newly built servers, run mysql client and issue SHOW ENGINES query. You should see a list of all available engines. Manticore should be present and "Support" column should contain "YES":

‹›
  • sql
sql
📋
mysql> show engines;
‹›
Response
+------------+----------+-------------------------------------------------------------+
| Engine     | Support  | Comment                                                     |
+------------+----------+-------------------------------------------------------------+
| MyISAM     | DEFAULT  | Default engine as of MySQL 3.23 with great performance      |
  ...
| SPHINX     | YES      | Manticore storage engine                                       |
  ...
+------------+----------+-------------------------------------------------------------+
13 rows in set (0.00 sec)

Using SphinxSE

To search via SphinxSE, you would need to create special ENGINE=SPHINX "search table", and then SELECT from it with full text query put into WHERE clause for query column.

Let's begin with an example create statement and search query:

CREATE TABLE t1
(
    id          INTEGER UNSIGNED NOT NULL,
    weight      INTEGER NOT NULL,
    query       VARCHAR(3072) NOT NULL,
    group_id    INTEGER,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/test";

SELECT * FROM t1 WHERE query='test it;mode=any';

First 3 columns of search table must have a types of INTEGER UNSINGED or BIGINT for the 1st column (document id), INTEGER or BIGINT for the 2nd column (match weight), and VARCHAR or TEXT for the 3rd column (your query), respectively. This mapping is fixed; you can not omit any of these three required columns, or move them around, or change types. Also, query column must be indexed; all the others must be kept unindexed. Column names are ignored so you can use arbitrary ones.

Additional columns must be either INTEGER, TIMESTAMP, BIGINT, VARCHAR, or FLOAT. They will be bound to attributes provided in Manticore result set by name, so their names must match attribute names specified in sphinx.conf. If there's no such attribute name in Manticore search results, column will have NULL values.

Special "virtual" attributes names can also be bound to SphinxSE columns. _sph_ needs to be used instead of @ for that. For instance, to obtain the values of @groupby, @count, or @distinct virtualattributes, use _sph_groupby, _sph_count or _sph_distinct column names, respectively.

CONNECTION string parameter is to be used to specify Manticore host, port and table. If no connection string is specified in CREATE TABLE, table name * (i.e. search all tables) and localhost:9312 are assumed. The connection string syntax is as follows:

CONNECTION="sphinx://HOST:PORT/TABLENAME"

You can change the default connection string later:

mysql> ALTER TABLE t1 CONNECTION="sphinx://NEWHOST:NEWPORT/NEWTABLENAME";

You can also override all these parameters per-query.

As seen in example, both query text and search options should be put into WHERE clause on search query column (ie. 3rd column); the options are separated by semicolons; and their names from values by equality sign. Any number of options can be specified. Available options are:

  • query - query text;
  • mode - matching mode. Must be one of "all", "any", "phrase", "boolean", or "extended". Default is "all";
  • sort - match sorting mode. Must be one of "relevance", "attr_desc", "attr_asc", "time_segments", or "extended". In all modes besides "relevance" attribute name (or sorting clause for "extended") is also required after a colon:
    ... WHERE query='test;sort=attr_asc:group_id';
    ... WHERE query='test;sort=extended:@weight desc, group_id asc';
  • offset - offset into result set, default is 0;
  • limit - amount of matches to retrieve from result set, default is 20;
  • index - names of the tables to search:
    ... WHERE query='test;index=test1;';
    ... WHERE query='test;index=test1,test2,test3;';
  • minid, maxid - min and max document ID to match;
  • weights - comma-separated list of weights to be assigned to Manticore full-text fields:
    ... WHERE query='test;weights=1,2,3;';
  • filter, !filter - comma-separated attribute name and a set of values to match:
    # only include groups 1, 5 and 19
    ... WHERE query='test;filter=group_id,1,5,19;';
    # exclude groups 3 and 11
    ... WHERE query='test;!filter=group_id,3,11;';
  • range, !range - comma-separated (integer or bigint) Manticore attribute name, and min and max values to match:
    # include groups from 3 to 7, inclusive
    ... WHERE query='test;range=group_id,3,7;';
    # exclude groups from 5 to 25
    ... WHERE query='test;!range=group_id,5,25;';
  • floatrange, !floatrange - comma-separated (floating point) Manticore attribute name, and min and max values to match:
    # filter by a float size
    ... WHERE query='test;floatrange=size,2,3;';
    # pick all results within 1000 meter from geoanchor
    ... WHERE query='test;floatrange=@geodist,0,1000;';
  • maxmatches - per-query max matches value, as in max_matches search option:
    ... WHERE query='test;maxmatches=2000;';
  • cutoff - maximum allowed matches, as in cutoff search option:
    ... WHERE query='test;cutoff=10000;';
  • maxquerytime - maximum allowed query time (in milliseconds), as in max_query_time search option:
    ... WHERE query='test;maxquerytime=1000;';
  • groupby - group-by function and attribute. Read this about grouping search results:
    ... WHERE query='test;groupby=day:published_ts;';
    ... WHERE query='test;groupby=attr:group_id;';
  • groupsort - group-by sorting clause:
    ... WHERE query='test;groupsort=@count desc;';
  • distinct - an attribute to compute COUNT(DISTINCT) for when doing group-by:
    ... WHERE query='test;groupby=attr:country_id;distinct=site_id';
  • indexweights - comma-separated list of table names and weights to use when searching through several tables:
    ... WHERE query='test;indexweights=tbl_exact,2,tbl_stemmed,1;';
  • fieldweights - comma-separated list of per-field weights that can be used by the ranker:
    ... WHERE query='test;fieldweights=title,10,abstract,3,content,1;';
  • comment - a string to mark this query in query log, as in comment search option:
    ... WHERE query='test;comment=marker001;';
  • select - a string with expressions to compute:
    ... WHERE query='test;select=2*a+3*** as myexpr;';
  • host, port - remote searchd host name and TCP port, respectively:
    ... WHERE query='test;host=sphinx-test.loc;port=7312;';
  • ranker - a ranking function to use with "extended" matching mode, as in ranker. Known values are "proximity_bm25", "bm25", "none", "wordcount", "proximity", "matchany", "fieldmask", "sph04", "expr:EXPRESSION" syntax to support expression-based ranker (where EXPRESSION should be replaced with your specific ranking formula), and "export:EXPRESSION":
    ... WHERE query='test;mode=extended;ranker=bm25;';
    ... WHERE query='test;mode=extended;ranker=expr:sum(lcs);';

    The "export" ranker works exactly like ranker=expr, but it stores the per-document factor values, while ranker=expr discards them after computing the final WEIGHT() value. Note that ranker=export is meant to be used but rarely, only to train a ML (machine learning) function or to define your own ranking function by hand, and never in actual production. When using this ranker, you'll probably want to examine the output of the RANKFACTORS() function that produces a string with all the field level factors for each document.

‹›
  • sql
sql
📋
SELECT *, WEIGHT(), RANKFACTORS()
    FROM myindex
    WHERE MATCH('dog')
    OPTION ranker=export('100*bm25');
‹›
Response
*************************** 1\. row ***************************
           id: 555617
    published: 1110067331
   channel_id: 1059819
        title: 7
      content: 428
     weight(): 69900
rankfactors(): bm25=699, bm25a=0.666478, field_mask=2,
doc_word_count=1, field1=(lcs=1, hit_count=4, word_count=1,
tf_idf=1.038127, min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532,
min_hit_pos=120, min_best_span_pos=120, exact_hit=0,
max_window_hits=1), word1=(tf=4, idf=0.259532)
*************************** 2\. row ***************************
           id: 555313
    published: 1108438365
   channel_id: 1058561
        title: 8
      content: 249
     weight(): 68500
rankfactors(): bm25=685, bm25a=0.675213, field_mask=3,
doc_word_count=1, field0=(lcs=1, hit_count=1, word_count=1,
tf_idf=0.259532, min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532,
min_hit_pos=8, min_best_span_pos=8, exact_hit=0, max_window_hits=1),
field1=(lcs=1, hit_count=2, word_count=1, tf_idf=0.519063,
min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532, min_hit_pos=36,
min_best_span_pos=36, exact_hit=0, max_window_hits=1), word1=(tf=3,
idf=0.259532)
  • geoanchor - geodistance anchor. Read more about Geo-search in this section. Takes 4 parameters which are latitude and longitude attribute names, and anchor point coordinates respectively:
    ... WHERE query='test;geoanchor=latattr,lonattr,0.123,0.456';

One very important note that it is much more efficient to allow Manticore to perform sorting, filtering and slicing the result set than to raise max matches count and use WHERE, ORDER BY and LIMIT clauses on MySQL side. This is for two reasons. First, Manticore does a number of optimizations and performs better than MySQL on these tasks. Second, less data would need to be packed by searchd, transferred and unpacked by SphinxSE.

Additional query info besides result set could be retrieved with SHOW ENGINE SPHINX STATUS statement:

‹›
  • sql
sql
📋
mysql> SHOW ENGINE SPHINX STATUS;
‹›
Response
+--------+-------+-------------------------------------------------+
| Type   | Name  | Status                                          |
+--------+-------+-------------------------------------------------+
| SPHINX | stats | total: 25, total found: 25, time: 126, words: 2 |
| SPHINX | words | sphinx:591:1256 soft:11076:15945                |
+--------+-------+-------------------------------------------------+
2 rows in set (0.00 sec)

This information can also be accessed through status variables. Note that this method does not require super-user privileges.

‹›
  • sql
sql
📋
mysql> SHOW STATUS LIKE 'sphinx_%';
‹›
Response
+--------------------+----------------------------------+
| Variable_name      | Value                            |
+--------------------+----------------------------------+
| sphinx_total       | 25                               |
| sphinx_total_found | 25                               |
| sphinx_time        | 126                              |
| sphinx_word_count  | 2                                |
| sphinx_words       | sphinx:591:1256 soft:11076:15945 |
+--------------------+----------------------------------+
5 rows in set (0.00 sec)

You could perform JOINs on SphinxSE search table and tables using other engines. Here's an example with "documents" from example.sql:

‹›
  • sql
sql
📋
mysql> SELECT content, date_added FROM test.documents docs
-> JOIN t1 ON (docs.id=t1.id)
-> WHERE query="one document;mode=any";

mysql> SHOW ENGINE SPHINX STATUS;
‹›
Response
+-------------------------------------+---------------------+
| content                             | docdate             |
+-------------------------------------+---------------------+
| this is my test document number two | 2006-06-17 14:04:28 |
| this is my test document number one | 2006-06-17 14:04:28 |
+-------------------------------------+---------------------+
2 rows in set (0.00 sec)

+--------+-------+---------------------------------------------+
| Type   | Name  | Status                                      |
+--------+-------+---------------------------------------------+
| SPHINX | stats | total: 2, total found: 2, time: 0, words: 2 |
| SPHINX | words | one:1:2 document:2:2                        |
+--------+-------+---------------------------------------------+
2 rows in set (0.00 sec)

Building snippets via MySQL

SphinxSE also includes a UDF function that lets you create snippets through MySQL. The functionality is similar to HIGHLIGHT(), but accessible through MySQL+SphinxSE.

The binary that provides the UDF is named sphinx.so and should be automatically built and installed to proper location along with SphinxSE itself. If it does not get installed automatically for some reason, look for sphinx.so in the build directory and copy it to the plugins directory of your MySQL instance. After that, register the UDF using the following statement:

CREATE FUNCTION sphinx_snippets RETURNS STRING SONAME 'sphinx.so';

Function name must be sphinx_snippets, you can not use an arbitrary name. Function arguments are as follows:

Prototype: function sphinx_snippets ( document, table, words [, options] );

Document and words arguments can be either strings or table columns. Options must be specified like this: 'value' AS option_name. For a list of supported options, refer to Highlighting section. The only UDF-specific additional option is named sphinx and lets you specify searchd location (host and port).

Usage examples:

SELECT sphinx_snippets('hello world doc', 'main', 'world',
    'sphinx://192.168.1.1/' AS sphinx, true AS exact_phrase,
    '[**]' AS before_match, '[/**]' AS after_match)
FROM documents;

SELECT title, sphinx_snippets(text, 'index', 'mysql php') AS text
    FROM sphinx, documents
    WHERE query='mysql php' AND sphinx.id=documents.id;

FEDERATED

Using MySQL FEDERATED engine you can connect to a local or remote Manticore instance from MySQL/MariaDB and perform search queries.

Using FEDERATED

An actual Manticore query cannot be used directly with FEDERATED engine and must be "proxied" (sent as a string in a column) due to limitations of the FEDERATED engine and the fact that Manticore implements custom syntax like the MATCH clause.

To search via FEDERATED, you would need to create first a FEDERATED engine table. The Manticore query will be included in a query column in the SELECT performed over the FEDERATED table.

Creating FEDERATED compatible MySQL table:

‹›
  • SQL
SQL
📋
CREATE TABLE t1
(
    id          INTEGER UNSIGNED NOT NULL,
    year        INTEGER NOT NULL,
    rating      FLOAT,
    query       VARCHAR(1024) NOT NULL,
    INDEX(query)
) ENGINE=FEDERATED
DEFAULT CHARSET=utf8
CONNECTION='mysql://[email protected]:9306/DB/movies';
‹›
Response
Query OK, 0 rows affected (0.00 sec)

Query FEDERATED compatible table:

‹›
  • SQL
SQL
📋
SELECT * FROM t1 WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\')';
‹›
Response
+----+------+--------+------------------------------------------+
| id | year | rating | query                                    |
+----+------+--------+------------------------------------------+
|  1 | 2019 |      5 | SELECT * FROM movies WHERE MATCH ('pie') |
+----+------+--------+------------------------------------------+
1 row in set (0.04 sec)

The only fixed mapping is query column. It is mandatory and must be the only column with a table attached.

The Manticore table that is linked via FEDERATED must be a physical table (plain or real-time).

The FEDERATED table should have columns with the same names as a remote Manticore table attributes as it will be bound to the attributes provided in Manticore result set by name, however it might map not all attributes, but only some of them.

Manticore server identifies query from a FEDERATED client by user name "FEDERATED". CONNECTION string parameter is to be used to specify Manticore host, SQL port and tables for queries coming through the connection. The connection string syntax is as follows:

CONNECTION="mysql://FEDERATED@HOST:PORT/DB/TABLENAME"

Since Manticore doesn't have the concept of database, the DB string can be random as it will be ignored by Manticore, but MySQL requires a value in the CONNECTION string definition. As seen in the example, full SELECT SQL query should be put into a WHERE clause against column query.

Only SELECT statement is supported, not INSERT, REPLACE, UPDATE, DELETE.

FEDERATED tips

One very important note that it is much more efficient to allow Manticore to perform sorting, filtering and slicing the result set than to raise max matches count and use WHERE, ORDER BY and LIMIT clauses on MySQL side. This is for two reasons. First, Manticore does a number of optimizations and performs better than MySQL on these tasks. Second, less data would need to be packed by searchd, transferred and unpacked between Manticore and MySQL.

JOINs can be performed between FEDERATED table and other MySQL tables. This can be used to retrieve information that is not stored in a Manticore table.

‹›
  • SQL
SQL
📋
Query to JOIN MySQL based table with FEDERATED table served by Manticore:
SELECT t1.id, t1.year, comments.comment FROM t1 JOIN comments ON t1.id=comments.post_id WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\')';
‹›
Response
+----+------+--------------+
| id | year | comment      |
+----+------+--------------+
|  1 | 2019 | was not good |
+----+------+--------------+
1 row in set (0.00 sec)

UDFs and Plugins

Manticore can be extended with user defined functions, or UDFs for short, like this:

SELECT id, attr1, myudf (attr2, attr3+attr4) ...

You can load and unload UDFs dynamically into searchd without having to restart the server, and use them in expressions when searching, ranking, etc. Quick summary of the UDF features is as follows.

  • UDFs can take integer (both 32-bit and 64-bit), float, string, MVA, or PACKEDFACTORS() arguments.
  • UDFs can return integer, float, or string values.
  • UDFs can check the argument number, types, and names during the query setup phase, and raise errors.

We do not yet support aggregation functions. In other words, your UDFs will be called for just a single document at a time and are expected to return some value for that document. Writing a function that can compute an aggregate value like AVG() over the entire group of documents that share the same GROUP BY key is not yet possible. However, you can use UDFs within the builtin aggregate functions: that is, even though MYCUSTOMAVG() is not supported yet, AVG(MYCUSTOMFUNC()) should work alright!

UDFs have a wide variety of uses, for instance:

  • adding custom mathematical or string functions;
  • accessing the database or files from within Manticore;
  • implementing complex ranking functions.

Plugins

Plugins provides more possibilities to extend searching functionality. Currently they may be used to calculate custom ranking, and also to tokenize documents and queries.

Here's the complete plugin type list.

  • UDF plugins (which are just UDFs, but since they're plugged, they're also called 'UDF plugin')
  • ranker plugins
  • indexing-time token filter plugins
  • query-time token filter plugins

This section discusses writing and managing plugins in general; things specific to writing this or that type of a plugin are then discussed in their respective subsections.

So, how do you write and use a plugin? Four-line crash course goes as follows:

  • create a dynamic library (either .so or.dll), most likely in C or C++;
  • load that plugin into searchd using CREATE PLUGIN;
  • invoke it using the plugin specific calls (typically using this or that OPTION).
  • to unload or reload a plugin use DROP PLUGIN and RELOAD PLUGINS respectively.

Note that while UDFs are first-class plugins they are nevertheless installed using a separate CREATE FUNCTION statement. It lets you specify the return type neatly so there was especially little reason to ruin backwards compatibility and change the syntax.

Dynamic plugins are supported in threads and thread_pool workers. Multiple plugins (and/or UDFs) may reside in a single library file. So you might choose to either put all your project-specific plugins in a single common big library; or you might choose to have a separate library for every UDF and plugin; that is up to you.

Just as with UDFs, you want to include src/sphinxudf.h header file. At the very least, you will need the SPH_UDF_VERSION constant to implement a proper version function. Depending on the specific plugin type, you might or might not need to link your plugin with src/sphinxudf.c. However, all the functions implemented in sphinxudf.c are about unpacking the PACKEDFACTORS() blob, and no plugin types are exposed to that kind of data. So currently, you would never need to link with the C-file, just the header would be sufficient. (In fact, if you copy over the UDF version number, then for some of the plugin types you would not even need the header file.)

Formally, plugins are just sets of C functions that follow a certain naming pattern. You are typically required to define just one key function that does the most important work, but you may define a bunch of other functions, too. For example, to implement a ranker called "myrank", you must define myrank_finalize() function that actually returns the rank value, however, you might also define myrank_init(), myrank_update(), and myrank_deinit() functions. Specific sets of well-known suffixes and the call arguments do differ based on the plugin type, but _init() and _deinit() are generic, every plugin has those. Hint: for a quick reference on the known suffixes and their argument types, refer to sphinxplugin.h, we define the call prototypes in the very beginning of that file.

Despite having the public interface defined in ye good old good pure C, our plugins essentially follow the object-oriented model. Indeed, every _init() function receives a void ** userdata out-parameter. And the pointer value that you store at (*userdata) location is then be passed as a 1st argument to all the other plugin functions. So you can think of a plugin as class that gets instantiated every time an object of that class is needed to handle a request: the userdata pointer would be its this pointer; the functions would be its methods, and the _init() and _deinit() functions would be the constructor and destructor respectively.

Why this (minor) OOP-in-C complication? Well, plugins run in a multi-threaded environment, and some of them have to be stateful. You can't keep that state in a global variable in your plugin. So we have to pass around a userdata parameter anyway to let you keep that state. And that naturally brings us to the OOP model. And if you've got a simple, stateless plugin, the interface lets you omit the _init() and _deinit() and whatever other functions just as well.

To summarize, here goes the simplest complete ranker plugin, in just 3 lines of C code.

// gcc -fPIC -shared -o myrank.so myrank.c
#include "sphinxudf.h"
int myrank_ver() { return SPH_UDF_VERSION; }
int myrank_finalize(void *u, int w) { return 123; }

And this is how you use it:

mysql> CREATE PLUGIN myrank TYPE 'ranker' SONAME 'myrank.dll';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id, weight() FROM test1 WHERE MATCH('test') OPTION ranker=myrank('');
+------+----------+
| id   | weight() |
+------+----------+
|    1 |      123 |
|    2 |      123 |
+------+----------+
2 rows in set (0.01 sec)