The source definition must contain the settings of the connection, this includes the host, port, user credentials, or specific settings of a driver.
The database server host to connect to. Note that the MySQL client library chooses whether to connect over TCP/IP or over UNIX socket based on the host name. Specifically, "localhost" will force it to use UNIX socket (this is the default and generally recommended mode) and "127.0.0.1" will force TCP/IP usage.
The server IP port to connect to.
For mysql
the default is 3306 and for pgsql
, it is 5432.
The SQL database to use after the connection is established and perform further queries within.
The username used for connecting.
The user password to use when connecting. If the password includes #
(which can be used to add comments in the configuration file), you can escape it with \
.
UNIX socket name to connect to for local database servers. Note that it depends on the sql_host
setting whether this value will actually be used.
sql_sock = /var/lib/mysql/mysql.sock
MySQL client connection flags. Optional, the default value is 0 (do not set any flags).
This option must contain an integer value with the sum of the flags. The value will be passed to mysql_real_connect() verbatim. The flags are enumerated in mysql_com.h include file. Flags that are especially interesting in regard to indexing, with their respective values, are as follows:
- CLIENT_COMPRESS = 32; can use compression protocol
- CLIENT_SSL = 2048; switch to SSL after handshake
- CLIENT_SECURE_CONNECTION = 32768; new 4.1 authentication For instance, you can specify 2080 (2048+32) to use both compression and SSL, or 32768 to use new authentication only. Initially, this option was introduced to be able to use compression when the indexer and mysqld are on different hosts. Compression on 1 Gbps links is most likely to hurt indexing time though it reduces network traffic, both in theory and in practice. However, enabling compression on 100 Mbps links may improve indexing time significantly (up to 20-30% of the total indexing time improvement was reported). Your mileage may vary.
mysql_connect_flags = 32 # enable compression
mysql_ssl_cert
- path to SSL certificatemysql_ssl_key
- path to SSL key filemysql_ssl_ca
- path to CA certificate
unpack_mysqlcompress_maxsize = 1M
Columns to unpack using MySQL UNCOMPRESS()
algorithm. Multi-value, optional, default value is an empty list of columns.
Columns specified using this directive will be unpacked by the indexer using the modified zlib algorithm used by MySQL COMPRESS()
and UNCOMPRESS()
functions. When indexing on a different box than the database, this lets you offload the database and save on network traffic. This feature is only available if zlib and zlib-devel were both available during build time.
unpack_mysqlcompress = body_compressed
unpack_mysqlcompress = description_compressed
By default, a buffer of 16M is used for uncompressing the data. This can be changed by setting unpack_mysqlcompress_maxsize
.
When using unpack_mysqlcompress, due to implementation intricacies, it is not possible to deduce the required buffer size from the compressed data. So, the buffer must be preallocated in advance, and the unpacked data can not go over the buffer size.
unpack_zlib = col1
unpack_zlib = col2
Columns to unpack using zlib (aka deflate, aka gunzip). Multi-value, optional, default value is an empty list of columns. Applies to source types mysql
and pgsql
only.
Columns specified using this directive will be unpacked by the indexer
using the standard zlib algorithm (called deflate and also implemented by gunzip
). When indexing on a different box than the database, this lets you offload the database and save on network traffic. This feature is only available if zlib and zlib-devel were both available during build time.
MS SQL Windows authentication flag. Whether to use currently logged-in Windows account credentials for authentication when connecting to MS SQL Server.
mssql_winauth = 1
Sources using ODBC require the presence of a DSN (Data Source Name) string which can be set with odbc_dsn
.
odbc_dsn = Driver={Oracle ODBC Driver};Dbq=myDBName;Uid=myUsername;Pwd=myPassword
Please note that the format depends on the specific ODBC driver used.
With all the SQL drivers, building a plain table generally works as follows.
- A connection to the database is established.
- The
sql_query_pre_all
queries are executed to perform any necessary initial setup, such as setting per-connection encoding with MySQL. These queries run before the entire indexing process, and also after a reconnect for indexing MVA attributes and joined fields. - The
sql_query_pre
pre-query is executed to perform any necessary initial setup, such as setting up temporary tables or maintaining counter tables. These queries run once for the entire indexing process. - Pre-queries as
sql_query_pre
is executed to perform any necessary initial setup, such as setting up temporary tables, or maintaining counter table. These queries run once per whole indexing. - Main query as
sql_query
is executed and the rows it returns are processed. - Post-query as
sql_query_post
is executed to perform some necessary cleanup. - The connection to the database is closed.
- Indexer does the sorting phase (to be pedantic, table-type specific post-processing).
- A connection to the database is established again.
- Post-processing query as
sql_query_post_index
is executed to perform some necessary final cleanup. - The connection to the database is closed again.
Example of a source fetching data from MYSQL:
source mysource {
type = mysql
path = /path/to/realtime
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = mydb
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, description, category_id FROM mytable
sql_query_post = DROP TABLE view_table
sql_query_post_index = REPLACE INTO counters ( id, val ) \
VALUES ( 'max_indexed_id', $maxid )
sql_attr_uint = category_id
sql_field_string = title
}
table mytable {
type = plain
source = mysource
path = /path/to/mytable
...
}
This is the query used to retrieve documents from a SQL server. There can be only one sql_query declared, and it's mandatory to have one. See also Processing fetched data
Pre-fetch query or pre-query. This is a multi-value, optional setting, with the default being an empty list of queries. The pre-queries are executed before the sql_query in the order they appear in the configuration file. The results of the pre-queries are ignored.
Pre-queries are useful in many ways. They can be used to set up encoding, mark records that are going to be indexed, update internal counters, set various per-connection SQL server options and variables, and so on.
Perhaps the most frequent use of pre-query is to specify the encoding that the server will use for the rows it returns. Note that Manticore accepts only UTF-8 text. Two MySQL specific examples of setting the encoding are:
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
Also, specific to MySQL sources, it is useful to disable query cache (for indexer connection only) in pre-query, because indexing queries are not going to be re-run frequently anyway, and there's no sense in caching their results. That could be achieved with:
sql_query_pre = SET SESSION query_cache_type=OFF
Post-fetch query. This is an optional setting, with the default value being empty.
This query is executed immediately after sql_query completes successfully. When the post-fetch query produces errors, they are reported as warnings, but indexing is not terminated. Its result set is ignored. Note that indexing is not yet completed at the point when this query gets executed, and further indexing may still fail. Therefore, any permanent updates should not be done from here. For instance, updates on a helper table that permanently change the last successfully indexed ID should not be run from the sql_query_post
query; they should be run from the sql_query_post_index
query instead.
Post-processing query. This is an optional setting, with the default value being empty.
This query is executed when indexing is fully and successfully completed. If this query produces errors, they are reported as warnings, but indexing is not terminated. Its result set is ignored. The $maxid
macro can be used in its text; it will be expanded to the maximum document ID that was actually fetched from the database during indexing. If no documents were indexed, $maxid
will be expanded to 0.
Example:
sql_query_post_index = REPLACE INTO counters ( id, val ) \
VALUES ( 'max_indexed_id', $maxid )
The difference between sql_query_post
and sql_query_post_index
is that sql_query_post
is run immediately when Manticore receives all the documents, but further indexing may still fail for some other reason. On the contrary, by the time the sql_query_post_index
query gets executed, it is guaranteed that the table was created successfully. Database connection is dropped and re-established because sorting phase can be very lengthy and would just time out otherwise.
By default, the first column from the result set of sql_query
is indexed as the document id.
Document ID MUST be the very first field, and it MUST BE UNIQUE SIGNED (NON-ZERO) INTEGER NUMBER from -9223372036854775808 to 9223372036854775807.
You can specify up to 256 full-text fields and an arbitrary amount of attributes. All the columns that are neither document ID (the first one) nor attributes will be indexed as full-text fields.
Declares a 64-bit signed integer.
Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.
Declares a floating point attribute.
The values will be stored in single precision, 32-bit IEEE 754 format. Represented range is approximately from 1e-38 to 1e+38. The amount of decimal digits that can be stored precisely is approximately 7.
One important usage of float attributes is storing latitude and longitude values (in radians), for further usage in query-time geosphere distance calculations.
Declares a JSON attribute.
When indexing JSON attributes, Manticore expects a text field with JSON formatted data. JSON attributes support arbitrary JSON data with no limitation in nested levels or types.
Declares a multi-value attribute.
Plain attributes only allow attaching 1 value per each document. However, there are cases (such as tags or categories) when it is desired to attach multiple values of the same attribute and be able to apply filtering or grouping to value lists.
The MVA can take the values from a column (like the rest of the data types) - in this case, the column in the result set must provide a string with multiple integer values separated by commas - or by running a separate query to get the values.
When executing a query, the engine runs the query, groups the results by IDs, and assigns the values to their corresponding documents in the table. Values with an ID not found in the table are discarded. Before executing the query, any defined sql_query_pre_all
will be run.
The declaration format for sql_attr_multi is as follows:
sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE \
[;QUERY] \
[;RANGED-QUERY]
where
- ATTR-TYPE is
uint
,bigint
ortimestamp
. - SOURCE-TYPE is
field
,query
,ranged-query
, orranged-main-query
. - QUERY is an optional SQL query used to fetch all (docid, attrvalue) pairs.
- RANGED-QUERY is an optional SQL query used to fetch min and max ID values, similar to
sql_query_range
. - The backslashes are included for clarity only; everything can be declared in a single line as well.
It's used with ranged-query SOURCE-TYPE. If using ranged-main-query
SOURCE-TYPE, then omit the RANGED-QUERY, and it will automatically use the same query from sql_query_range
(useful option in complex inheritance setups to save having to manually duplicate the same query many times).
sql_attr_multi = uint tag from field
sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
sql_attr_multi = bigint tag from ranged-query; \
SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
SELECT MIN(id), MAX(id) FROM tags
Declares a string attribute. The maximum size of each value is fixed at 4GB.
Declares a UNIX timestamp.
Timestamps can store dates and times in the range of January 01, 1970, to January 19, 2038, with a precision of one second. The expected column value should be a timestamp in UNIX format, which is a 32-bit unsigned integer number of seconds elapsed since midnight on January 01, 1970, GMT. Timestamps are internally stored and handled as integers everywhere. In addition to working with timestamps as integers, you can also use them with different date-based functions, such as time segments sorting mode or day/week/month/year extraction for GROUP BY.
Note that DATE or DATETIME column types in MySQL cannot be directly used as timestamp attributes in Manticore; you need to explicitly convert such columns using UNIX_TIMESTAMP function (if the data is in range).
Note timestamps can not represent dates before January 01, 1970, and UNIX_TIMESTAMP() in MySQL will not return anything expected. If you only need to work with dates, not times, consider TO_DAYS()
function in MySQL instead.
Declares an unsigned integer attribute.
You can specify the bit count for integer attributes by appending ':BITCOUNT' to attribute name (see example below). Attributes with less than default 32-bit size, or bitfields, perform slower.
sql_attr_uint = group_id
sql_attr_uint = forum_id:9 # 9 bits for forum_id
Declares a combo string attribute/text field. The values will be indexed as a full-text field, but also stored in a string attribute with the same name. Note, it should only be used when you are sure you want the field to be searchable both in a full-text manner and as an attribute (with the ability to sort and group by it). If you just want to be able to fetch the original value of the field, you don't need to do anything for it unless you implicitly removed the field from the stored fields list via stored_fields.
sql_field_string = name
Declares a file based field.
This directive makes indexer interpret field contents as a file name, and load and process the referred file. Files larger than max_file_field_buffer in size are skipped. Any errors during the file loading (IO errors, missed limits, etc.) will be reported as indexing warnings and will not early terminate the indexing. No content will be indexed for such files.
sql_file_field = field_name
Joined/payload field fetch query. Multi-value, optional, the default is an empty list of queries.
sql_joined_field
lets you use two different features: joined fields and payloads (payload fields). Its syntax is as follows:
sql_joined_field = FIELD-NAME 'from' ( 'query' | 'payload-query' | 'ranged-query' | 'ranged-main-query' ); \
QUERY [ ; RANGE-QUERY ]
where
- FIELD-NAME is a joined/payload field name
- QUERY is an SQL query that must fetch values for further processing
- RANGE-QUERY is an optional SQL query that fetches a range of values to process
Joined fields let you avoid JOIN and/or GROUP_CONCAT statements in the main document fetch query (sql_query). This can be useful when the SQL-side JOIN is slow, or needs to be offloaded on the Manticore side, or simply to emulate MySQL-specific GROUP_CONCAT
functionality in case your database server does not support it.
The query must return exactly 2 columns: document ID, and text to append to a joined field. Document IDs can be duplicate, but they must be in ascending order. All the text rows fetched for a given ID will be concatenated together, and the concatenation result will be indexed as the entire contents of a joined field. Rows will be concatenated in the order returned from the query, and separating whitespace will be inserted between them. For instance, if the joined field query returns the following rows:
( 1, 'red' )
( 1, 'right' )
( 1, 'hand' )
( 2, 'mysql' )
( 2, 'manticore' )
then the indexing results would be equivalent to adding a new text field with a value of 'red right hand' to document 1 and 'mysql sphinx' to document 2, including the keyword positions inside the field in the order they come from the query. If the rows need to be in a specific order, that needs to be explicitly defined in the query.
Joined fields are only indexed differently. There are no other differences between joined fields and regular text fields.
Before executing the joined fields query, any set of sql_query_pre_all
will be run, if any exist. This allows you to set the desired encoding, etc., within the joined fields' context.
When a single query is not efficient enough or does not work because of the database driver limitations, ranged queries can be used. It works similarly to the ranged queries in the main indexing loop. The range will be queried for and fetched upfront once, then multiple queries with different $start
and $end
substitutions will be run to fetch the actual data.
When using ranged-main-query
query, omit the ranged-query
, and it will automatically use the same query from sql_query_range
(a useful option in complex inheritance setups to save having to manually duplicate the same query many times).
Payloads let you create a special field in which, instead of keyword positions, so-called user payloads are stored. Payloads are custom integer values attached to every keyword. They can then be used at search time to affect the ranking.
The payload query must return exactly 3 columns:
- document ID
- keyword
- and integer payload value.
Document IDs can be duplicate, but they must be in ascending order. Payloads must be unsigned integers within the 24-bit range, i.e., from 0 to 16777215.
The only ranker that accounts for payloads is proximity_bm25
(the default ranker). On tables with payload fields, it will automatically switch to a variant that matches keywords in those fields, computes a sum of matched payloads multiplied by field weights, and adds that sum to the final rank.
Please note that the payload field is ignored for full-text queries containing complex operators. It only works for simple bag-of-words queries.
- Configuration file
- Just SELECT
- Full-text search
- Complex full-text search
source min {
type = mysql
sql_host = localhost
sql_user = test
sql_pass =
sql_db = test
sql_query = select 1, 'Nike bag' f \
UNION select 2, 'Adidas bag' f \
UNION select 3, 'Reebok bag' f \
UNION select 4, 'Nike belt' f
sql_joined_field = tag from payload-query; select 1 id, 'nike' tag, 10 weight \
UNION select 4 id, 'nike' tag, 10 weight;
}
index idx {
path = idx
source = min
}
sql_column_buffers = <colname>=<size>[K|M] [, ...]
Per-column buffer sizes. Optional, default is empty (deduce the sizes automatically). Applies to odbc
, mssql
source types only.
ODBC and MS SQL drivers sometimes cannot return the maximum actual column size to be expected. For instance,NVARCHAR(MAX)
columns always report their length as 2147483647 bytes to indexer
even though the actually used length is likely considerably less. However, the receiving buffers still need to be allocated upfront, and their sizes have to be determined. When the driver does not report the column length at all, Manticore allocates default 1 KB buffers for each non-char column, and 1 MB buffers for each char column. Driver-reported column length also gets clamped by an upper limit of 8 MB, so in case the driver reports (almost) a 2 GB column length, it will be clamped and an 8 MB buffer will be allocated instead for that column. These hard-coded limits can be overridden using the sql_column_buffers
directive, either in order to save memory on actually shorter columns or to overcome the 8 MB limit on actually longer columns. The directive values must be a comma-separated list of selected column names and sizes:
Example:
sql_query = SELECT id, mytitle, mycontent FROM documents
sql_column_buffers = mytitle=64K, mycontent=10M