Introduction

Manticore Search allows fetching data from databases using specialized drivers or ODBC. Current drivers include:

  • mysql - for MySQL/MariaDB/Percona MySQL databases
  • pgsql - for PostgreSQL database
  • mssql - for Microsoft SQL database
  • odbc - for any database that accepts connections using ODBC

To fetch data from the database, a source must be configured with type as one of the above. The source requires information about how to connect to the database and the query that will be used to fetch the data. Additional pre- and post-queries can also be set - either to configure session settings or to perform pre/post fetch tasks. The source also must contain definitions of data types for the columns that are fetched.

Database connection

The source definition must contain the settings of the connection, this includes the host, port, user credentials, or specific settings of a driver.

sql_host

The database server host to connect to. Note that the MySQL client library chooses whether to connect over TCP/IP or over UNIX socket based on the host name. Specifically, "localhost" will force it to use UNIX socket (this is the default and generally recommended mode) and "127.0.0.1" will force TCP/IP usage.

sql_port

The server IP port to connect to. For mysql the default is 3306 and for pgsql, it is 5432.

sql_db

The SQL database to use after the connection is established and perform further queries within.

sql_user

The username used for connecting.

sql_pass

The user password to use when connecting. If the password includes # (which can be used to add comments in the configuration file), you can escape it with \.

sql_sock

UNIX socket name to connect to for local database servers. Note that it depends on the sql_host setting whether this value will actually be used.

sql_sock = /var/lib/mysql/mysql.sock

Specific settings for drivers

MySQL

mysql_connect_flags

MySQL client connection flags. Optional, the default value is 0 (do not set any flags).

This option must contain an integer value with the sum of the flags. The value will be passed to mysql_real_connect() verbatim. The flags are enumerated in mysql_com.h include file. Flags that are especially interesting in regard to indexing, with their respective values, are as follows:

  • CLIENT_COMPRESS = 32; can use compression protocol
  • CLIENT_SSL = 2048; switch to SSL after handshake
  • CLIENT_SECURE_CONNECTION = 32768; new 4.1 authentication For instance, you can specify 2080 (2048+32) to use both compression and SSL, or 32768 to use new authentication only. Initially, this option was introduced to be able to use compression when the indexer and mysqld are on different hosts. Compression on 1 Gbps links is most likely to hurt indexing time though it reduces network traffic, both in theory and in practice. However, enabling compression on 100 Mbps links may improve indexing time significantly (up to 20-30% of the total indexing time improvement was reported). Your mileage may vary.
mysql_connect_flags = 32 # enable compression

SSL certificate settings

  • mysql_ssl_cert - path to SSL certificate
  • mysql_ssl_key - path to SSL key file
  • mysql_ssl_ca - path to CA certificate

unpack_mysqlcompress

unpack_mysqlcompress_maxsize = 1M

Columns to unpack using MySQL UNCOMPRESS() algorithm. Multi-value, optional, default value is an empty list of columns.

Columns specified using this directive will be unpacked by the indexer using the modified zlib algorithm used by MySQL COMPRESS() and UNCOMPRESS() functions. When indexing on a different box than the database, this lets you offload the database and save on network traffic. This feature is only available if zlib and zlib-devel were both available during build time.

unpack_mysqlcompress = body_compressed
unpack_mysqlcompress = description_compressed

By default, a buffer of 16M is used for uncompressing the data. This can be changed by setting unpack_mysqlcompress_maxsize.

When using unpack_mysqlcompress, due to implementation intricacies, it is not possible to deduce the required buffer size from the compressed data. So, the buffer must be preallocated in advance, and the unpacked data can not go over the buffer size.

unpack_zlib

unpack_zlib = col1
unpack_zlib = col2

Columns to unpack using zlib (aka deflate, aka gunzip). Multi-value, optional, default value is an empty list of columns. Applies to source types mysql and pgsql only.

Columns specified using this directive will be unpacked by the indexer using the standard zlib algorithm (called deflate and also implemented by gunzip). When indexing on a different box than the database, this lets you offload the database and save on network traffic. This feature is only available if zlib and zlib-devel were both available during build time.

MSSQL

MS SQL Windows authentication flag. Whether to use currently logged-in Windows account credentials for authentication when connecting to MS SQL Server.

mssql_winauth = 1

ODBC

Sources using ODBC require the presence of a DSN (Data Source Name) string which can be set with odbc_dsn.

odbc_dsn = Driver={Oracle ODBC Driver};Dbq=myDBName;Uid=myUsername;Pwd=myPassword

Please note that the format depends on the specific ODBC driver used.

Execution of fetch queries

With all the SQL drivers, building a plain table generally works as follows.

  • A connection to the database is established.
  • The sql_query_pre_all queries are executed to perform any necessary initial setup, such as setting per-connection encoding with MySQL. These queries run before the entire indexing process, and also after a reconnect for indexing MVA attributes and joined fields.
  • The sql_query_pre pre-query is executed to perform any necessary initial setup, such as setting up temporary tables or maintaining counter tables. These queries run once for the entire indexing process.
  • Pre-queries as sql_query_pre is executed to perform any necessary initial setup, such as setting up temporary tables, or maintaining counter table. These queries run once per whole indexing.
  • Main query as sql_query is executed and the rows it returns are processed.
  • Post-query as sql_query_post is executed to perform some necessary cleanup.
  • The connection to the database is closed.
  • Indexer does the sorting phase (to be pedantic, table-type specific post-processing).
  • A connection to the database is established again.
  • Post-processing query as sql_query_post_index is executed to perform some necessary final cleanup.
  • The connection to the database is closed again.

Example of a source fetching data from MYSQL:

source mysource {
  type             = mysql
  path             = /path/to/realtime
  sql_host         = localhost
  sql_user         = myuser
  sql_pass         = mypass
  sql_db           = mydb
  sql_query_pre    = SET CHARACTER_SET_RESULTS=utf8
  sql_query_pre    = SET NAMES utf8
  sql_query        =  SELECT id, title, description, category_id  FROM mytable
  sql_query_post   = DROP TABLE view_table
  sql_query_post_index = REPLACE INTO counters ( id, val ) \
    VALUES ( 'max_indexed_id', $maxid )
  sql_attr_uint    = category_id
  sql_field_string = title
 }

table mytable {
  type   = plain
  source = mysource
  path   = /path/to/mytable
  ...
 }

sql_query

This is the query used to retrieve documents from a SQL server. There can be only one sql_query declared, and it's mandatory to have one. See also Processing fetched data

sql_query_pre

Pre-fetch query or pre-query. This is a multi-value, optional setting, with the default being an empty list of queries. The pre-queries are executed before the sql_query in the order they appear in the configuration file. The results of the pre-queries are ignored.

Pre-queries are useful in many ways. They can be used to set up encoding, mark records that are going to be indexed, update internal counters, set various per-connection SQL server options and variables, and so on.

Perhaps the most frequent use of pre-query is to specify the encoding that the server will use for the rows it returns. Note that Manticore accepts only UTF-8 text. Two MySQL specific examples of setting the encoding are:

sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8

Also, specific to MySQL sources, it is useful to disable query cache (for indexer connection only) in pre-query, because indexing queries are not going to be re-run frequently anyway, and there's no sense in caching their results. That could be achieved with:

sql_query_pre = SET SESSION query_cache_type=OFF

sql_query_post

Post-fetch query. This is an optional setting, with the default value being empty.

This query is executed immediately after sql_query completes successfully. When the post-fetch query produces errors, they are reported as warnings, but indexing is not terminated. Its result set is ignored. Note that indexing is not yet completed at the point when this query gets executed, and further indexing may still fail. Therefore, any permanent updates should not be done from here. For instance, updates on a helper table that permanently change the last successfully indexed ID should not be run from the sql_query_post query; they should be run from the sql_query_post_index query instead.

sql_query_post_index

Post-processing query. This is an optional setting, with the default value being empty.

This query is executed when indexing is fully and successfully completed. If this query produces errors, they are reported as warnings, but indexing is not terminated. Its result set is ignored. The $maxid macro can be used in its text; it will be expanded to the maximum document ID that was actually fetched from the database during indexing. If no documents were indexed, $maxid will be expanded to 0.

Example:

sql_query_post_index = REPLACE INTO counters ( id, val ) \
    VALUES ( 'max_indexed_id', $maxid )

The difference between sql_query_post and sql_query_post_index is that sql_query_post is run immediately when Manticore receives all the documents, but further indexing may still fail for some other reason. On the contrary, by the time the sql_query_post_index query gets executed, it is guaranteed that the table was created successfully. Database connection is dropped and re-established because sorting phase can be very lengthy and would just time out otherwise.