Data types

Full-text fields and attributes

Manticore's data types can be split into two categories: full-text fields and attributes.

Full-text fields

Full-text fields:

  • can be indexed with natural language processing algorithms, therefore can be searched for keywords
  • cannot be used for sorting or grouping
  • original document's content can be retrieved
  • original document's content can be used for highlighting

Full-text fields are represented by the data type text. All other data types are called "attributes".

Attributes

Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.

It is often desired to process full-text search results based not only on matching document ID and its rank, but also on a number of other per-document values. For example, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit a blog search to posts made by selected users, or group results by month. To do this efficiently, Manticore enables not only full-text fields, but also additional attributes to be added to each document. These attributes can be used to filter, sort, or group full-text matches, or to search only by attributes.

The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.

A good example for attributes would be a forum posts table. Assume that only the title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e., search only those rows that have some specific values of author_id or forum_id); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • Javascript
  • Java
  • C#
  • config
📋
CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);

This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc

Row-wise and columnar attribute storages

Manticore supports two types of attribute storages:

As can be understood from their names, they store data differently. The traditional row-wise storage:

  • stores attributes uncompressed
  • all attributes of the same document are stored in one row close to each other
  • rows are stored one by one
  • accessing attributes is basically done by just multiplying the row ID by the stride (length of a single vector) and getting the requested attribute from the calculated memory location. It gives very low random access latency.
  • attributes have to be in memory to get acceptable performance, otherwise due to the row-wise nature of the storage Manticore may have to read from disk too much unneeded data which is in many cases suboptimal.

With the columnar storage:

  • each attribute is stored independently of all other attributes in its separate "column"
  • storage is split into blocks of 65536 entries
  • the blocks are stored compressed. This often allows storing just a few distinct values instead of storing all of them like in the row-wise storage. High compression ratio allows reading from disk faster and makes the memory requirement much lower
  • when data is indexed, storage scheme is selected for each block independently. For example, if all values in a block are the same, it gets "const" storage and only one value is stored for the whole block. If there are less than 256 unique values per block, it gets "table" storage and stores indexes to a table of values instead of the values themselves
  • search in a block can be early rejected if it's clear the requested value is not present in the block.

The columnar storage was designed to handle large data volume that does not fit into RAM, so the recommendations are:

  • if you have enough memory for all your attributes you will benefit from the row-wise storage
  • otherwise, the columnar storage can still give you decent performance with a much lower memory footprint which will allow you to store much more documents in your table

How to switch between the storages

The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.

To enable the columnar storage you need to:

  • specify engine='columnar' in CREATE TABLE to make all attributes of the table columnar. Then, if you want to keep a specific attribute row-wise, you need to add engine='rowwise' when you declare it. For example:
    create table tbl(title text, type int, price float engine='rowwise') engine='columnar'
  • specify engine='columnar' for a specific attribute in CREATE TABLE to make it columnar. For example:
    create table tbl(title text, type int, price float engine='columnar');

    or

    create table tbl(title text, type int, price float engine='columnar') engine='rowwise';
  • in the plain mode you need to list attributes you want to be columnar in columnar_attrs.

Below is the list of data types supported by Manticore Search:

Document ID

The document identifier is a mandatory attribute, and document IDs must be unique 64-bit unsigned integers. Document IDs can be explicitly specified, but if not, they are still enabled. Document IDs cannot be updated. Note that when retrieving document IDs, they are treated as signed 64-bit integers, which means they may be negative. Use the UINT64() function to cast them to unsigned 64-bit integers if necessary.

‹›
  • Explicit ID
  • Implicit ID
📋

When you create a table, you can specify ID explicitly, but no matter what data type you use, it will be always as said previously - a signed 64-bit integer.

CREATE TABLE tbl(id bigint, content text);
DESC tbl;
‹›
Response
+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Character data types

General syntax:

string|text [stored|attribute] [indexed]

Properties:

  1. indexed - full-text indexed (can be used in full-text queries)
  2. stored - stored in a docstore (stored on disk, not in RAM, lazy read)
  3. attribute - makes it a string attribute (can sort/group by it)

Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.

No properties specified:

string and text are aliases, but if you don’t specify any properties, they by default mean different things:

  • just string by default means attribute (see details below).
  • just text by default means stored + indexed (see details below).

Text

The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.

Text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations, etc. Eventually, a full-text table (a special data structure that enables quick searches for a keyword) gets built from that text.

Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation. Words are stored in an inverted index along with references to the fields they belong to and positions in the field. This allows searching a word inside each field and using advanced operators like proximity. By default, the original text of the fields is both indexed and stored in document storage. It means that the original text can be returned with the query results and used in search result highlighting.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text);

This behavior can be overridden by explicitly specifying that the text is only indexed.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text indexed);

Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). You can have up to 256 full-text fields.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select * from products where match('@title first');

String

Unlike full-text fields, string attributes (just string or string/text attribute) are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used in the WHERE clause for comparison filtering or REGEX, and can be used for sorting and aggregation. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys.

If you want to also index the string attribute, you can specify both as string attribute indexed. It will allow full-text searching and works as an attribute.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, keys string);
MORE

Integer

Integer type allows storing 32 bit unsigned integer values.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, price int);

Integers can be stored in shorter sizes than 32-bit by specifying a bit count. For example, if we want to store a numeric value which we know is not going to be bigger than 8, the type can be defined as bit(3). Bitcount integers perform slower than the full-size ones, but they require less RAM. They are saved in 32-bit chunks, so in order to save space, they should be grouped at the end of attribute definitions (otherwise a bitcount integer between 2 full-size integers will occupy 32 bits as well).

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, flags bit(3), tags bit(2) );

Big Integer

Big integers (bigint) are 64-bit wide signed integers.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, price bigint );

Boolean

Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, sold bool );

Timestamps

Timestamp type represents unix timestamps which is stored as a 32-bit integer. The difference is that time and date functions are available for the timestamp type.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, date timestamp);

Float

Real numbers are stored as 32-bit IEEE 754 single precision floats.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, coeff float);

Unlike integer types, comparing two floating-point numbers for equality is not recommended due to potential rounding errors. A more reliable approach is to use a near-equal comparison, by checking the absolute error margin.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select abs(a-b)<=0.00001 from products

Another alternative, which can also be used to perform IN(attr,val1,val2,val3) is to compare floats as integers by choosing a multiplier factor and convert the floats to integers in operations. The following example illustrates modifying IN(attr,2.0,2.5,3.5) to work with integer values.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select in(ceil(attr*100),200,250,350) from products

JSON

This data type allows storing JSON objects, which is useful for storing schema-less data. However, it is not supported by columnar storage. However, it can be stored in traditional storage, as it's possible to combine both storage types in the same table.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, data json);

JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select indexof(x>2 for x in data.intarray) from products

Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select regex(data.name, 'est') as c from products where c>0

In the case of JSON properties, enforcing data type may be required for proper functionality in certain situations. For example, when working with float values, DOUBLE() must be used for proper sorting.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select * from products order by double(data.myfloat) desc

Multi-value integer (MVA)

Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, product_codes multi);

It supports filtering and aggregation, but not sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements (ALL()) to pass.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select * from products where any(product_codes)=3

Information like least or greatest element and length of the list can be extracted. An example shows ordering by the least element of a multi-value attribute.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
select least(product_codes) l from products order by l asc

When grouping by a multi-value attribute, a document will contribute to as many groups as there are different values associated with that document. For instance, if a collection contains exactly one document having a 'product_codes' multi-value attribute with values 5, 7, and 11, grouping on 'product_codes' will produce 3 groups with COUNT(*)equal to 1 and GROUPBY() key values of 5, 7, and 11, respectively. Also, note that grouping by multi-value attributes may lead to duplicate documents in the result set because each document can participate in many groups.

‹›
  • SQL
SQL
📋
insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;
‹›
Response
Query OK, 1 row affected (0.00 sec)

+------+----------+-----------+
| id   | count(*) | groupby() |
+------+----------+-----------+
|    1 |        1 |        11 |
|    1 |        1 |         7 |
|    1 |        1 |         5 |
+------+----------+-----------+
3 rows in set (0.00 sec)

The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
📋
insert into product values (1,'first',(4,2,1,3));
select * from products;
‹›
Response
Query OK, 1 row affected (0.00 sec)

+------+---------------+-------+
| id   | product_codes | title |
+------+---------------+-------+
|    1 | 1,2,3,4       | first |
+------+---------------+-------+
1 row in set (0.01 sec)

Multi-value big integer

A data type that allows storing variable-length lists of 64-bit signed integers. It has the same functionality as multi-value integer.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • javascript
  • java
  • C#
  • config
📋
CREATE TABLE products(title text, values multi64);

Columnar attribute properties

When you use the columnar storage you can specify the following properties for the attributes.

fast_fetch

By default, Manticore Columnar storage stores all attributes in a columnar fashion, as well as in a special docstore row by row. This enables fast execution of queries like SELECT * FROM ..., especially when fetching a large number of records at once. However, if you are sure that you do not need it or wish to save disk space, you can disable it by specifying fast_fetch='0' when creating a table or (if you are defining a table in a config) by using columnar_no_fast_fetch as shown in the following example.

‹›
  • RT mode
  • Plain mode
📋
create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;
‹›
Response
+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
3 rows in set (0.00 sec)

Creating a local table

In Manticore Search, there are two ways to manage tables:

Online schema management (RT mode)

Real-time mode requires no table definition in the configuration file. However, the data_dir directive in the searchd section is mandatory. Index files are stored inside the data_dir.

Replication is only available in this mode.

You can use SQL commands such as CREATE TABLE, ALTER TABLE and DROP TABLE to create and modify table schema, and to drop it. This mode is particularly useful for real-time and percolate tables.

Table names are converted to lowercase when created.

Defining table schema in config (Plain mode)

In this mode, you can specify the table schema in the configuration file. Manticore reads this schema on startup and creates the table if it doesn't exist yet. This mode is particularly useful for plain tables that use data from an external storage.

To drop a table, remove it from the configuration file or remove the path setting and send a HUP signal to the server or restart it.

Table names are case-sensitive in this mode.

All table types are supported in this mode.

Table types and modes

Table type RT mode Plain mode
Real-time supported supported
Plain not supported supported
Percolate supported supported
Distributed supported supported
Template not supported supported

Real-time table

A Real-time table is a main type of table in Manticore, allowing you to add, update, and delete documents with immediate availability of the changes. The settings for a Real-time Table can be defined in a configuration file or online using CREATE/UPDATE/DELETE/ALTER commands.

A Real-time Table is comprised of one or multiple plain tables called chunks. There are two types of chunks:

  • multiple disk chunks - These are stored on disk and have the same structure as a Plain Table.
  • single ram chunk - This is stored in memory and is used as an accumulator of changes.

The size of the RAM chunk is controlled by the rt_mem_limit setting. Once this limit is reached, the RAM chunk is flushed to disk in the form of a disk chunk. If there are too many disk chunks, they can be merged into one for better performance using the OPTIMIZE command or automatically.

‹›
  • SQL
  • JSON
  • PHP
  • Python
  • Javascript
  • Java
  • C#
  • CONFIG
📋
CREATE TABLE products(title text, price float) morphology='stem_en';
‹›
Response
Query OK, 0 rows affected (0.00 sec)

👍 What you can do with a real-time table:

  • Add documents using the Add feature.
  • Update attributes and full-text fields through the Update process.
  • Delete documents using the Delete feature.
  • Emptying the table using the Truncate process.
  • Change the schema online using the ALTER command as described in Change schema online.
  • Define the table in a configuration file as described in Define table.
  • Use the UUID feature for automatic ID provisioning.

⛔ What you cannot do with a real-time table:

  • Index data using the indexer feature.
  • Connect it to sources for easy indexing from external storage.
  • Update the killlist_target, as it is automatically managed by the real-time table.

Real-time table files structure

The following table outlines the different file extensions and their respective descriptions in a real-time table:

Extension Description
.lock A lock file that ensures that only one process can access the table at a time.
.ram The RAM chunk of the table, stored in memory and used as an accumulator of changes.
.meta The headers of the real-time table that define its structure and settings.
.*.sp* Disk chunks that are stored on disk with the same format as plain tables. They are created when the RAM chunk size exceeds the rt_mem_limit.

For more information on the structure of disk chunks, refer to the plain table format)