
The main query, which needs to fetch all the documents, can impose a read lock on the whole table and stall concurrent queries (e.g. INSERTs to a MyISAM table), waste a lot of memory on the result set, and so on. To avoid this, Manticore supports so-called ranged queries. With ranged queries, Manticore first fetches the min and max document IDs from the table, then substitutes successive ID intervals into the main query text and runs the modified query to fetch the next chunk of documents.

Ranged query usage example:

sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step = 1000
sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end

If the table contains document IDs from 1 to, say, 2345, then sql_query would be run three times:

with $start replaced with 1 and $end replaced with 1000;
with $start replaced with 1001 and $end replaced with 2000;
with $start replaced with 2001 and $end replaced with 2345.

Obviously, that makes little difference for a 2000-row table, but when it comes to indexing a 10-million-row table, ranged queries can help.
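The interval arithmetic from the example above can be sketched in Python. This is only an illustration, not the indexer's actual code; the function name `ranged_intervals` is hypothetical, and the sketch assumes `min_id <= max_id` and `step >= 1`.

```python
def ranged_intervals(min_id, max_id, step):
    """Yield inclusive, non-overlapping (start, end) ID intervals.

    Hypothetical helper: mirrors how $start and $end would be substituted
    for a given sql_range_step, assuming min_id <= max_id and step >= 1.
    """
    start = min_id
    while start <= max_id:
        # The last interval is clipped to the maximum document ID.
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


if __name__ == "__main__":
    # Print the query text that each step would produce.
    for start, end in ranged_intervals(1, 2345, 1000):
        print(f"SELECT * FROM documents WHERE id>={start} AND id<={end}")
```

Because both bounds are inclusive and each interval starts right after the previous one ends, every ID is fetched exactly once.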

The sql_query_range directive defines the range query. The query specified in this option must fetch the min and max document IDs that will be used as range boundaries. It must return exactly two integer fields, min ID first and max ID second; the field names are ignored. When this directive is set, sql_query must contain the $start and $end macros. Note that the intervals specified by $start..$end do not overlap, so your query should not exclude document IDs that are exactly equal to $start or $end.

The sql_range_step directive defines the range query step. The default value is 1024.

The sql_ranged_throttle directive can be used to throttle ranged queries. By default, there is no throttling. Values for sql_ranged_throttle are specified in milliseconds.

Throttling can be useful when the indexer imposes too much load on the database server. It causes the indexer to sleep for the given amount of time once per ranged query step. This sleep is unconditional and is performed before the fetch query.

sql_ranged_throttle = 1000 # sleep for 1 sec before each query step
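To make the interplay of the three directives concrete, here is a minimal Python sketch that imitates a ranged, throttled fetch against an in-memory SQLite table. It is an approximation for illustration, not how the indexer is implemented; `fetch_ranged`, `make_demo_db`, the `title` column, and the injectable `sleep` parameter are all assumptions introduced for this example.

```python
import sqlite3
import time


def fetch_ranged(conn, step=1024, throttle_ms=0, sleep=time.sleep):
    """Yield all rows of `documents`, one ID range at a time."""
    # Counterpart of sql_query_range: fetch the range boundaries first.
    min_id, max_id = conn.execute(
        "SELECT MIN(id), MAX(id) FROM documents").fetchone()
    if min_id is None:  # empty table: nothing to fetch
        return
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        if throttle_ms:
            # Counterpart of sql_ranged_throttle: pause before each fetch.
            sleep(throttle_ms / 1000.0)
        # Counterpart of sql_query with $start and $end substituted.
        yield from conn.execute(
            "SELECT id, title FROM documents WHERE id>=? AND id<=?",
            (start, end))
        start = end + 1


def make_demo_db(rows=25):
    """Build an in-memory table with IDs 1..rows (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?)",
                     [(i, f"doc {i}") for i in range(1, rows + 1)])
    return conn
```

Passing a custom `sleep` callable makes it easy to observe how many pauses occur without actually waiting; with 25 rows and a step of 10 there is one pause per step, three in total.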

Fetching from XML stream

The xmlpipe2 source type allows for passing custom full-text and attribute data to Manticore in a custom XML format, with the schema (i.e., set of fields and attributes) specified in either the XML stream itself or in the source settings.

To declare the XML stream, the xmlpipe_command directive is mandatory. It contains the shell command that produces the XML stream to be indexed. The command can simply output an existing file, or it can be a program that generates XML content on the fly.

When indexing an xmlpipe2 source, the indexer runs the specified command, opens a pipe to its stdout, and expects a well-formed XML stream.

Here's an example of what the XML stream data might look like:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>

<sphinx:document id="1234">
<content>this is the main content <![CDATA[and this <cdata> entry
must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be
in <strong class="red">randomized</strong> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>

<sphinx:document id="1235">
<subject>another subject</subject>
<content>here comes another document, and i am given to understand,
that in-document field order must not matter, sir</content>
<published>1012325467</published>
</sphinx:document>

<!-- ... even more sphinx:document entries here ... -->

<sphinx:killlist>
<id>1234</id>
<id>4567</id>
</sphinx:killlist>

</sphinx:docset>

Arbitrary fields and attributes are allowed. They can also occur in the stream in arbitrary order within each document; the order is ignored. There is a restriction on the maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).
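A program used as xmlpipe_command only has to print a stream in the format shown above to its stdout. The following Python script is a minimal sketch of such a generator; the function name `render_docset`, the sample documents, and the choice of fields are illustrative assumptions rather than part of Manticore.

```python
import sys
from xml.sax.saxutils import escape, quoteattr

# In-stream schema: two full-text fields and one timestamp attribute.
SCHEMA = (
    "<sphinx:schema>\n"
    '<sphinx:field name="subject"/>\n'
    '<sphinx:field name="content"/>\n'
    '<sphinx:attr name="published" type="timestamp"/>\n'
    "</sphinx:schema>\n"
)


def render_docset(docs):
    """Return an xmlpipe2 document set as a string.

    docs: iterable of dicts with keys id, subject, content, published.
    """
    parts = ['<?xml version="1.0" encoding="utf-8"?>\n',
             "<sphinx:docset>\n", SCHEMA]
    for doc in docs:
        parts.append("<sphinx:document id=%s>\n" % quoteattr(str(doc["id"])))
        for name in ("subject", "content", "published"):
            # escape() protects &, < and > so the stream stays well-formed.
            parts.append("<%s>%s</%s>\n" % (name, escape(str(doc[name])), name))
        parts.append("</sphinx:document>\n")
    parts.append("</sphinx:docset>\n")
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        {"id": 1234, "subject": "first subject",
         "content": "text with <markup> & symbols", "published": 1012325463},
        {"id": 1235, "subject": "another subject",
         "content": "here comes another document", "published": 1012325467},
    ]
    # The sample is ASCII-only; for other text make sure stdout is UTF-8.
    sys.stdout.write(render_docset(sample))
```

Such a script could then be referenced from the source as, for example, `xmlpipe_command = python3 /path/to/make_docset.py`, where the path is a placeholder.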

The schema, i.e., the complete list of fields and attributes, must be declared before any document can be parsed. This can be done either in the configuration file using the xmlpipe_field and xmlpipe_attr_XXX settings, or right in the stream using the <sphinx:schema> element. <sphinx:schema> is optional. It is only allowed to occur as the very first sub-element of <sphinx:docset>. If there is no in-stream schema definition, settings from the configuration file will be used. Otherwise, stream settings take precedence. Note that the document ID must be specified in the id attribute of the <sphinx:document> tag (e.g. <sphinx:document id="1235">) and must be a unique, signed, positive, non-zero 64-bit integer.
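When the stream is produced by your own program, the document ID constraints can be checked before anything is written. The sketch below reads "signed, positive, non-zero 64-bit" as the range 1 to 2^63 - 1; `check_doc_ids` and `MAX_DOC_ID` are hypothetical names used only for this illustration.

```python
MAX_DOC_ID = 2**63 - 1  # largest value of a signed 64-bit integer


def check_doc_ids(ids):
    """Raise ValueError on the first ID that is invalid or repeated."""
    seen = set()
    for doc_id in ids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            raise ValueError(f"document ID is not an integer: {doc_id!r}")
        if not 1 <= doc_id <= MAX_DOC_ID:
            raise ValueError(f"document ID out of range: {doc_id}")
        if doc_id in seen:
            raise ValueError(f"duplicate document ID: {doc_id}")
        seen.add(doc_id)
```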

Unknown tags (those declared neither as fields nor as attributes) will be ignored with a warning. In the example above, <misc> will be ignored. All embedded tags and their attributes (such as <strong> in <subject> in the example above) will be silently ignored.

Support for incoming stream encodings depends on whether iconv is installed on the system. xmlpipe2 is parsed using the libexpat parser, which understands US-ASCII, ISO-8859-1, UTF-8, and a few UTF-16 variants natively. Manticore's configure script will also check for libiconv presence and utilize it to handle other encodings. libexpat also enforces the requirement to use the UTF-8 charset on the Manticore side because the parsed data it returns is always in UTF-8.

XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:

sphinx:docset - Mandatory top-level element, denotes and contains the xmlpipe2 document set.
sphinx:schema - Optional element, must either occur as the very first child of sphinx:docset or never occur at all. Declares the document schema and contains field and attribute declarations. If present, it overrides per-source settings from the configuration file.
sphinx:field - Optional element, child of sphinx:schema. Declares a full-text field. Known attributes are:
- "name", specifies the XML element name that will be treated as a full-text field in the subsequent documents.
- "attr", specifies whether to also index this field as a string. Possible value is "string".
sphinx:attr - Optional element, child of sphinx:schema. Declares an attribute. Known attributes are:
- "name", specifies the element name that should be treated as an attribute in the subsequent documents.
- "type", specifies the attribute type. Possible values are "int", "bigint", "timestamp", "bool", "float", "multi" and "json".
- "bits", specifies the bit size for the "int" attribute type. Valid values are 1 to 32.
- "default", specifies the default value for this attribute that should be used if the attribute's element is not present in the document.
sphinx:document - Mandatory element, must be a child of sphinx:docset. Contains arbitrary other elements with field and attribute values to be indexed, as declared either using sphinx:field and sphinx:attr elements or in the configuration file. The only known attribute is "id" that must contain the unique integer document ID.
sphinx:killlist - Optional element, child of sphinx:docset. Contains a number of "id" elements whose contents are document IDs to be put into a kill-list of the table. The kill-list is used in multi-table searches to suppress documents found in other tables of the search.

If the XML doesn't define a schema, the data types of the table's elements must be defined in the source configuration.

xmlpipe_field - declares a text field.
xmlpipe_field_string - declares a text field/string attribute. The column will be both indexed as a text field and stored as a string attribute.
xmlpipe_attr_uint - declares an integer attribute
xmlpipe_attr_timestamp - declares a timestamp attribute
xmlpipe_attr_bool - declares a boolean attribute
xmlpipe_attr_float - declares a float attribute
xmlpipe_attr_bigint - declares a big integer attribute
xmlpipe_attr_multi - declares a multi-value attribute with integers
xmlpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
xmlpipe_attr_string - declares a string attribute
xmlpipe_attr_json - declares a JSON attribute

Setting xmlpipe_fixup_utf8 enables Manticore-side UTF-8 validation and filtering, which prevents the XML parser from choking on non-UTF-8 documents. By default, this option is disabled.

On certain occasions it might be hard or even impossible to guarantee that the incoming xmlpipe2 document bodies are in perfectly valid and conforming UTF-8 encoding. For instance, documents with national single-byte encodings could sneak into the stream. The libexpat XML parser is fragile, meaning that it will stop processing in such cases. The UTF-8 fixup feature lets you avoid that. When fixup is enabled, Manticore will preprocess the incoming stream before passing it to the XML parser and replace invalid UTF-8 sequences with spaces.

xmlpipe_fixup_utf8 = 1
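The effect of the fixup can be approximated in Python with a custom codec error handler that substitutes a space for each undecodable span. This is a rough sketch of the idea only: how Manticore delimits an "invalid sequence" internally is not stated here, so the granularity of the replacement is an assumption, and `fixup_utf8` and the `"space"` handler name are hypothetical.

```python
import codecs


def _space_on_error(err):
    """Codec error handler: one space for each undecodable span."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # Return the replacement text and the position to resume decoding at.
    return " ", err.end


codecs.register_error("space", _space_on_error)


def fixup_utf8(raw: bytes) -> bytes:
    """Return `raw` with invalid UTF-8 spans replaced by spaces."""
    return raw.decode("utf-8", errors="space").encode("utf-8")


if __name__ == "__main__":
    # 0xFF can never appear in valid UTF-8, so it is replaced.
    print(fixup_utf8(b"abc\xffdef"))
```

Valid input passes through unchanged, so a filter like this can be applied to the whole stream before it reaches a strict XML parser.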

Example of XML source without schema in configuration:

source xml_test_1
{
    type = xmlpipe2
    xmlpipe_command = cat /tmp/products_today.xml
}

Example of XML source with schema in configuration:

source xml_test_2
{
    type = xmlpipe2
    xmlpipe_command = cat /tmp/products_today.xml
    xmlpipe_field = subject
    xmlpipe_field = content
    xmlpipe_attr_timestamp = published
    xmlpipe_attr_uint = author_id:16
}
