Fetching from XML streams

The xmlpipe2 source type allows for passing custom full-text and attribute data to Manticore in a custom XML format, with the schema (i.e., set of fields and attributes) specified in either the XML stream itself or in the source settings.

Declaration of XML stream

To declare the XML stream, the xmlpipe_command directive is mandatory; it contains the shell command that produces the XML stream to be indexed. The command can simply read an XML file, but it can also be a program that generates XML content on the fly.
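
For example, the command can invoke any executable that writes well-formed XML to stdout; the script path below is hypothetical:

source xml_from_script
{
    type = xmlpipe2
    # any program that prints the XML document set to stdout will do
    xmlpipe_command = /usr/local/bin/export_docs.sh
}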

XML file format

When indexing an xmlpipe2 source, the indexer runs the specified command, opens a pipe to its stdout, and expects a well-formed XML stream.

Here's an example of what the XML stream data might look like:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>

<sphinx:document id="1234">
<content>this is the main content <![CDATA[and this <cdata> entry
must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be
in <strong class="red">randomized</strong> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>

<sphinx:document id="1235">
<subject>another subject</subject>
<content>here comes another document, and i am given to understand,
that in-document field order must not matter, sir</content>
<published>1012325467</published>
</sphinx:document>

<!-- ... even more sphinx:document entries here ... -->

<sphinx:killlist>
<id>1234</id>
<id>4567</id>
</sphinx:killlist>

</sphinx:docset>

Arbitrary fields and attributes are allowed. They can also occur in the stream in arbitrary order within each document; the order is ignored. There is a restriction on the maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).

The schema, i.e., the complete list of fields and attributes, must be declared before any document can be parsed. This can be done either in the configuration file, using the xmlpipe_field and xmlpipe_attr_XXX settings, or right in the stream, using the <sphinx:schema> element. <sphinx:schema> is optional and is only allowed to occur as the very first sub-element of <sphinx:docset>. If there is no in-stream schema definition, the settings from the configuration file will be used; otherwise, the in-stream settings take precedence. Note that the document ID must be specified as the id attribute of the <sphinx:document> tag (e.g. <sphinx:document id="1235">) and is expected to be a unique signed positive non-zero 64-bit integer.

Unknown tags (those not declared either as fields or as attributes) will be ignored with a warning. In the example above, <misc> will be ignored. All embedded tags and their attributes (such as <strong> inside <subject> in the example above) are silently ignored.

Support for incoming stream encodings depends on whether iconv is installed on the system. xmlpipe2 is parsed using the libexpat parser, which understands US-ASCII, ISO-8859-1, UTF-8, and a few UTF-16 variants natively. Manticore's configure script will also check for libiconv presence and utilize it to handle other encodings. libexpat also enforces the requirement to use the UTF-8 charset on the Manticore side because the parsed data it returns is always in UTF-8.

XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:

  • sphinx:docset - Mandatory top-level element, denotes and contains the xmlpipe2 document set.
  • sphinx:schema - Optional element, must either occur as the very first child of sphinx:docset or never occur at all. Declares the document schema and contains field and attribute declarations. If present, it overrides per-source settings from the configuration file.
  • sphinx:field - Optional element, child of sphinx:schema. Declares a full-text field. Known attributes are:
    • "name", specifies the XML element name that will be treated as a full-text field in the subsequent documents.
    • attr", specifies whether to also index this field as a string. Possible value is "string".
  • sphinx:attr - Optional element, child of sphinx:schema. Declares an attribute. Known attributes are:
    • "name", specifies the element name that should be treated as an attribute in the subsequent documents.
    • "type", specifies the attribute type. Possible values are "int", "bigint", "timestamp", "bool", "float", "multi" and "json".
    • "bits", specifies the bit size for the "int" attribute type. Valid values are 1 to 32.
    • "default", specifies the default value for this attribute that should be used if the attribute's element is not present in the document.
  • sphinx:document - Mandatory element, must be a child of sphinx:docset. Contains arbitrary other elements with field and attribute values to be indexed, as declared either using sphinx:field and sphinx:attr elements or in the configuration file. The only known attribute is "id", which must contain the unique integer document ID.
  • sphinx:killlist - Optional element, child of sphinx:docset. Contains a number of "id" elements whose contents are document IDs to be put into a kill-list of the table. The kill-list is used in multi-table searches to suppress documents found in other tables of the search.
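
As a sketch of the declarations above, the "attr" field attribute and the "multi" and "json" attribute types could be declared in-stream like this (the element names are illustrative):

<sphinx:schema>
<sphinx:field name="title" attr="string"/>
<sphinx:field name="body"/>
<sphinx:attr name="tags" type="multi"/>
<sphinx:attr name="meta" type="json"/>
</sphinx:schema>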

Data definition in source configuration

If the XML doesn't define a schema, the data types of the table elements must be defined in the source configuration using the following directives (see the example after this list):

  • xmlpipe_field - declares a text field.
  • xmlpipe_field_string - declares a text field/string attribute. The column will be indexed as a text field and also stored as a string attribute.
  • xmlpipe_attr_uint - declares an integer attribute.
  • xmlpipe_attr_timestamp - declares a timestamp attribute.
  • xmlpipe_attr_bool - declares a boolean attribute.
  • xmlpipe_attr_float - declares a float attribute.
  • xmlpipe_attr_bigint - declares a big integer attribute.
  • xmlpipe_attr_multi - declares a multi-value attribute with integers.
  • xmlpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers.
  • xmlpipe_attr_string - declares a string attribute.
  • xmlpipe_attr_json - declares a JSON attribute.
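
For instance, the string, multi-value, and JSON directives could be combined like this (a sketch; the column names are illustrative):

source xml_rich
{
    type = xmlpipe2
    xmlpipe_command = cat /tmp/products_today.xml
    # indexed as a full-text field and also stored as a string attribute
    xmlpipe_field_string = title
    xmlpipe_attr_multi = category_ids
    xmlpipe_attr_json = properties
}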

Specific XML source settings

If xmlpipe_fixup_utf8 is set, it enables Manticore-side UTF-8 validation and filtering, which prevents the XML parser from choking on non-UTF-8 documents. By default, this option is disabled.

In some cases it may be hard, or even impossible, to guarantee that the incoming xmlpipe2 document bodies are in perfectly valid, conforming UTF-8. For instance, documents with national single-byte encodings could sneak into the stream. The libexpat XML parser is fragile and will stop processing in such cases. The UTF-8 fixup feature lets you avoid that: when fixup is enabled, Manticore preprocesses the incoming stream before passing it to the XML parser and replaces invalid UTF-8 sequences with spaces.

xmlpipe_fixup_utf8 = 1

Example of XML source without schema in configuration:

source xml_test_1
{
    type = xmlpipe2
    xmlpipe_command = cat /tmp/products_today.xml
}

Example of XML source with schema in configuration:

source xml_test_2
{
    type = xmlpipe2
    xmlpipe_command = cat /tmp/products_today.xml
    xmlpipe_field = subject
    xmlpipe_field = content
    xmlpipe_attr_timestamp = published
    xmlpipe_attr_uint = author_id:16
}

Fetching from TSV/CSV

TSV/CSV is the simplest way to pass data to the Manticore indexer. This method was created because of the limitations of xmlpipe2, where the indexer must map each attribute and field tag in the XML file to a corresponding schema element; this mapping takes time, and the cost grows with the number of fields and attributes in the schema. TSV/CSV has no such issue, as each field and attribute corresponds to a particular column in the TSV/CSV file. In some cases, TSV/CSV can work slightly faster than xmlpipe2.

File format

The first column in a TSV/CSV file must be the document ID. The remaining columns must mirror the declaration of fields and attributes in the schema definition. Note that you don't need to declare the document ID in the schema: it is always assumed to be present, must be in the 1st column, and needs to be a unique signed positive non-zero 64-bit integer.

The difference between tsvpipe and csvpipe is the delimiter and the quoting rules. tsvpipe uses the tab character as a hardcoded delimiter and has no quoting rules. csvpipe has the csvpipe_delimiter option for the delimiter, with a default value of , (comma), and also has quoting rules, such as the following (see the example after this list):

  • Any field may be quoted
  • Fields containing a line break, a double quote, or commas should be quoted
  • A double quote character in a field must be represented by two double quote characters
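
For example, a row following these rules might look like this (the values are illustrative):

42,"a plain quoted value","contains a comma, so it is quoted","contains a ""quoted"" word"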

Declaration of TSV stream

The tsvpipe_command directive is mandatory and contains the shell command invoked to produce the TSV stream that gets indexed. The command can simply read a TSV file, but it can also be a program that generates tab-delimited content on the fly.
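
For example, the TSV stream can be piped straight out of MySQL in batch mode, which prints tab-separated rows (a sketch; the table and columns are illustrative, and connection options are omitted):

tsvpipe_command = mysql --skip-column-names -B -e "SELECT id, name, genre_tags FROM rock_bands"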

TSV indexed columns

The following directives can be used to declare the types of the indexed columns:

  • tsvpipe_field - declares a text field.
  • tsvpipe_field_string - declares a text field/string attribute. The column will be indexed as a text field and also stored as a string attribute.
  • tsvpipe_attr_uint - declares an integer attribute.
  • tsvpipe_attr_timestamp - declares a timestamp attribute.
  • tsvpipe_attr_bool - declares a boolean attribute.
  • tsvpipe_attr_float - declares a float attribute.
  • tsvpipe_attr_bigint - declares a big integer attribute.
  • tsvpipe_attr_multi - declares a multi-value attribute with integers.
  • tsvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers.
  • tsvpipe_attr_string - declares a string attribute.
  • tsvpipe_attr_json - declares a JSON attribute.

Example of a source using a TSV file:

source tsv_test
{
    type = tsvpipe
    tsvpipe_command = cat /tmp/rock_bands.tsv
    tsvpipe_field = name
    tsvpipe_attr_multi = genre_tags
}

with /tmp/rock_bands.tsv containing tab-separated values:

1   Led Zeppelin    35,23,16
2   Deep Purple 35,92
3   Frank Zappa 35,23,16,92,33,24

Declaration of CSV stream

The csvpipe_command directive is mandatory and contains the shell command invoked to produce the CSV stream that gets indexed. The command can simply read a CSV file, but it can also be a program that generates comma-delimited content on the fly.

CSV indexed columns

The following directives can be used to declare the types of the indexed columns:

  • csvpipe_field - declares a text field.
  • csvpipe_field_string - declares a text field/string attribute. The column will be indexed as a text field and also stored as a string attribute.
  • csvpipe_attr_uint - declares an integer attribute.
  • csvpipe_attr_timestamp - declares a timestamp attribute.
  • csvpipe_attr_bool - declares a boolean attribute.
  • csvpipe_attr_float - declares a float attribute.
  • csvpipe_attr_bigint - declares a big integer attribute.
  • csvpipe_attr_multi - declares a multi-value attribute with integers.
  • csvpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers.
  • csvpipe_attr_string - declares a string attribute.
  • csvpipe_attr_json - declares a JSON attribute.

Example of a source using a CSV file:

source csv_test
{
    type = csvpipe
    csvpipe_command = cat /tmp/rock_bands.csv
    csvpipe_field = name
    csvpipe_attr_multi = genre_tags
}

with /tmp/rock_bands.csv containing:

1,"Led Zeppelin","35,23,16"
2,"Deep Purple","35,92"
3,"Frank Zappa","35,23,16,92,33,24"

Main+delta schema

In many situations, the total dataset is too large to be frequently rebuilt from scratch, while the number of new records remains relatively small. For example, a forum may have 1,000,000 archived posts but only receive 1,000 new posts per day.

In such cases, implementing "live" (nearly real-time) table updates can be achieved using a "main+delta" scheme.

The concept involves setting up two sources and two tables, with one "main" table for data that rarely changes (if ever) and one "delta" table for new documents. In this example, the 1,000,000 archived posts would be stored in the main table, while the 1,000 new daily posts would be placed in the delta table. The delta table can then be rebuilt frequently, making the documents available for searching within seconds or minutes. Determining which documents belong to which table and rebuilding the main table can be fully automated. One approach is to create a counter table that tracks the ID used to split the documents and to update it whenever the main table is rebuilt, as sketched below.
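
A minimal sketch of this ID-based split, assuming a documents table with id, title, and body columns (all names are illustrative):

# in MySQL
CREATE TABLE counter (id INT PRIMARY KEY, max_doc_id INT NOT NULL);

# in manticore.conf
source main {
  ...
  # remember the highest ID right before the main table is rebuilt
  sql_query_pre = REPLACE INTO counter SELECT 1, MAX(id) FROM documents
  sql_query = SELECT id, title, body FROM documents \
      WHERE id <= (SELECT max_doc_id FROM counter WHERE id = 1)
}

source delta : main {
  # reset the inherited pre-query so the counter is not overwritten
  sql_query_pre =
  sql_query = SELECT id, title, body FROM documents \
      WHERE id > (SELECT max_doc_id FROM counter WHERE id = 1)
}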

Using a timestamp column as the split variable is more effective than using the ID since timestamps can track not only new documents but also modified ones.

For datasets that may contain modified or deleted documents, the delta table should provide a list of affected documents, ensuring they are suppressed and excluded from search queries. This is accomplished using a feature called Kill Lists. The document IDs to be killed can be specified in an auxiliary query defined by sql_query_killlist. The delta table must indicate the target tables to which the kill list will be applied using the killlist_target directive. The effect of kill lists on the target table is permanent, meaning that even if a search is performed without the delta table, the suppressed documents will not appear in the search results.

Notice how we override sql_query_pre in the delta source. This override must be included explicitly: without it, the REPLACE query would also be executed during the delta source's build, effectively rendering the counter useless.

Example:
# in MySQL
CREATE TABLE deltabreaker (
  index_name VARCHAR(50) NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (index_name)
);

# in manticore.conf
source main {
  ...
  sql_query_pre = REPLACE INTO deltabreaker SET index_name = 'main', created_at = NOW()
  sql_query = SELECT id, title, UNIX_TIMESTAMP(updated_at) AS updated FROM documents WHERE deleted=0 AND updated_at >= FROM_UNIXTIME($start) AND updated_at <= FROM_UNIXTIME($end)
  sql_query_range = SELECT (SELECT UNIX_TIMESTAMP(MIN(updated_at)) FROM documents) min, (SELECT UNIX_TIMESTAMP(created_at)-1 FROM deltabreaker WHERE index_name='main') max
  sql_query_post_index = REPLACE INTO deltabreaker SET index_name = 'delta', created_at = (SELECT created_at FROM deltabreaker t WHERE index_name='main')
  ...
  sql_attr_timestamp = updated
}

source delta : main {
  sql_query_pre =
  sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(created_at) FROM deltabreaker WHERE index_name='delta') min, UNIX_TIMESTAMP() max
  sql_query_killlist = SELECT id FROM documents WHERE updated_at >= (SELECT created_at FROM deltabreaker WHERE index_name='delta')
}

table main {
  path = /var/lib/manticore/main
  source = main
}

table delta {
  path = /var/lib/manticore/delta
  source = delta
  killlist_target = main:kl
}