Plain and real-time index settings

Defining index schema in a configuration file

index <index_name>[:<parent index name>] {
...
}

Example of a plain index defined in a configuration file:
index <index_name> {
  type = plain
  path = /path/to/index
  source = <source_name>
  source = <another source_name>
  [stored_fields = <comma separated list of full-text fields that should be stored>]
}

Common plain and real-time index settings

type

type = plain

type = rt

Index type: "plain" or "rt" (real-time)

Value: plain (default), rt

path

path = path/to/index

Absolute or relative path, without the extension, where the index is stored or where to look for it

Value: path to the index, mandatory

stored_fields

stored_fields = title, content

By default, when an index is defined in a configuration file, the original content of full-text fields is indexed but not stored. If this option is set, the values of the listed fields will be both indexed and stored.

Value: comma-separated list of full-text fields that should be stored. The default is empty (i.e. the original field text is not stored) in Plain mode; in RT mode storage is enabled for every field declared simply as text.

Note that in case of a real-time index the fields listed in stored_fields should also be declared as rt_field.

Note also that you don't need to list attributes in stored_fields, since their original values are stored anyway; stored_fields is only for full-text fields.

See also docstore_block_size, docstore_compression for document storage compression options.

Example (SQL):
CREATE TABLE products(title text stored indexed, content text stored indexed, name text indexed, price float)

stored_only_fields

stored_only_fields = title,content

A list of fields that will be stored in the index but will not be indexed. Similar to stored_fields, except that a field listed in stored_only_fields is only stored, not indexed, so it cannot be searched with full-text queries; it can only be returned in search results.

Value: comma-separated list of fields that should be stored only, not indexed. Default is empty. Note that in case of a real-time index the fields listed in stored_only_fields should also be declared as rt_field.

Note also that you don't need to list attributes in stored_only_fields, since their original values are stored anyway. Compared to a string attribute, a stored-only field:

  • is stored on disk and doesn't require memory
  • is stored compressed
  • can only be fetched; you can't sort, filter or group by its value

A string attribute, on the other hand:

  • is stored both on disk and in memory
  • is stored uncompressed
  • can be used for sorting, grouping, filtering and anything else you want to do with attributes
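
For illustration, here is a minimal real-time index sketch (index and field names are hypothetical) where "subject" is both indexed and stored via stored_fields, while "content" is kept for fetching only via stored_only_fields:

index rt_mail {
  type = rt
  path = /path/to/rt_mail

  rt_field = subject
  rt_field = content

  stored_fields = subject
  stored_only_fields = content

  rt_attr_uint = gid
}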

Real-time index settings:

rt_field

rt_field = subject

Full-text fields to be indexed. The field names must be unique. The order is preserved, so field values in INSERT statements without an explicit column list must be given in the same order as configured.

Value: at least one full-text field should be specified in an index, multiple records allowed.

rt_attr_uint

rt_attr_uint = gid

Unsigned integer attribute declaration

Value: field_name or field_name:N, can be multiple records. N is the max number of bits to keep.

rt_attr_bigint

rt_attr_bigint = gid

BIGINT attribute declaration

Value: field name, multiple records allowed

rt_attr_multi

rt_attr_multi = tags

Multi-valued attribute (MVA) declaration. Declares an UNSIGNED INTEGER (unsigned 32-bit) MVA attribute. Multi-value (i.e. there may be more than one such attribute declared), optional.

Value: field name, multiple records allowed.

rt_attr_multi_64

rt_attr_multi_64 = wide_tags

Multi-valued attribute (MVA) declaration. Declares a BIGINT (signed 64-bit) MVA attribute. Multi-value (i.e. there may be more than one such attribute declared), optional.

Value: field name, multiple records allowed.

rt_attr_float

rt_attr_float = lat
rt_attr_float = lon

Floating point attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional. Declares a single precision, 32-bit IEEE 754 format float attribute.

Value: field name, multiple records allowed.

rt_attr_bool

rt_attr_bool = available

Boolean attribute declaration. Multi-value (there might be multiple attributes declared), optional. Declares a 1-bit unsigned integer attribute.

Value: field name, multiple records allowed.

rt_attr_string

rt_attr_string = title

String attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional.

Value: field name, multiple records allowed.

rt_attr_json

rt_attr_json = properties

JSON attribute declaration. Multi-value (i.e. there may be more than one such attribute declared), optional.

Value: field name, multiple records allowed.

rt_attr_timestamp

rt_attr_timestamp = date_added

Timestamp attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional.

Value: field name, multiple records allowed.
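
Putting the directives above together, a complete real-time index definition might look like this sketch (all index, field and attribute names are illustrative):

index rt_products {
  type = rt
  path = /path/to/rt_products

  rt_field = title
  rt_field = description

  rt_attr_uint = vendor_id
  rt_attr_bigint = user_id
  rt_attr_float = price
  rt_attr_bool = available
  rt_attr_multi = tags
  rt_attr_string = sku
  rt_attr_json = properties
  rt_attr_timestamp = date_added
}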

rt_mem_limit

rt_mem_limit = 512M

RAM chunk size limit. Optional, default is 128M.

RT index keeps some data in memory ("RAM chunk") and also maintains a number of on-disk indexes ("disk chunks"). This directive lets you control the RAM chunk size. Once there’s too much data to keep in RAM, RT index will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.

The limit is pretty strict; the RT index should never allocate more memory than the limit. The memory is not preallocated either, hence specifying a 512 MB limit and inserting only 3 MB of data should result in allocating 3 MB, not 512 MB.

The RAM chunk should be sized depending on the size of the data, rate of inserts/updates and hardware. A small rt_mem_limit and frequent insert/updates can lead to creation of many disk chunks, requiring more frequent optimizations of the index.

In RT mode the RAM chunk size limit can be changed using ALTER TABLE. To set rt_mem_limit to 1 GB for index 't', run the query ALTER TABLE t rt_mem_limit='1G'.

In plain mode rt_mem_limit can be changed using the following steps:

  • edit rt_mem_limit value in configuration
  • run ALTER TABLE <index_name> RECONFIGURE

Important notes about RAM chunks
  • RT index is quite similar to a distributed index consisting of multiple local indexes. The local indexes are called "disk chunks".
  • RAM chunk internally consists of multiple "segments".
  • While disk chunks are stored on disk, the segments of RAM chunk are special RAM-only "indexes".
  • Any transaction you make to a real-time index generates a new segment. RAM chunk segments are merged after each transaction commit. Therefore it is beneficial to do bulk INSERTs of hundreds or thousands of documents rather than hundreds or thousands of separate single-document inserts, to avoid the overhead from merging RAM chunk segments.
  • When the number of segments exceeds 32, they are merged so that the count stays at or below 32.
  • RT index always has a single RAM-chunk (may be empty) and one or multiple disk chunks.
  • Merging larger segments takes longer, that's why it may be suboptimal to have a very large RAM chunk (and therefore a large rt_mem_limit).
  • Number of disk chunks depends on the amount of data in the index and rt_mem_limit setting.
  • Searchd flushes RAM chunk to disk on shutdown and periodically according to rt_flush_period. Flushing several gigabytes to disk may take some time.
  • Large RAM chunk will put more pressure on the storage:
    • when flushing the RAM chunk to disk into the .ram file
    • when the RAM chunk is full and is dumped to disk as a disk chunk.
  • Until flushed, the RAM chunk is not persisted on disk; a binary log serves as its backup in case of a sudden daemon shutdown. The larger rt_mem_limit is, the longer it will take to replay the binlog on start to recover the RAM chunk.

Plain index settings:

source

source = srcpart1
source = srcpart2
source = srcpart3

Specifies the document source to get documents from when the current index is indexed. There must be at least one source. The sources can be of different types (e.g. one can be mysql, another postgresql). Read more about indexing from external storages here.

Value: name of the source to build the index from, mandatory. Can be multiple records.

killlist_target

killlist_target = main:kl

Sets the index(es) that the kill-list will be applied to. Suppresses matches in the targeted index that are updated or deleted in the current index. In :kl mode the documents to suppress are taken from the kill-list. In :id mode all document ids from the current index are suppressed in the targeted one. If neither is specified, both modes take effect. Read more about kill-lists here.

Value: not specified (default), target_index_name:kl, target_index_name:id, target_index_name. Multiple values are allowed
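
As an illustration, a typical main+delta scheme could use killlist_target like this sketch (index and source names are hypothetical):

index main {
  path = /path/to/main
  source = src_main
}

index delta {
  path = /path/to/delta
  source = src_delta
  killlist_target = main:kl
}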

columnar_attrs

columnar_attrs = id, attr1, attr2, attr3

Specifies what attributes should be stored in the columnar storage instead of the default row-wise storage.

id is also supported.
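
A sketch of a plain index that keeps selected attributes in columnar storage (index, source and attribute names are illustrative):

index products {
  type = plain
  path = /path/to/products
  source = src_products

  columnar_attrs = id, price, vendor_id
}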

Creating a real-time index online via CREATE TABLE

General syntax of CREATE TABLE
CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]
CREATE TABLE via /sql HTTP endpoint

Besides using CREATE TABLE via the MySQL protocol with any MySQL client, you can also create a table over HTTP using the /sql endpoint:

http[s]://manticore_host:port/sql
POST: mode=raw&query=CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]
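
For example, with curl this could look like the following sketch (it assumes Manticore is listening for HTTP on localhost:9308; the table definition is illustrative):

curl -s "http://localhost:9308/sql" \
     -d mode=raw \
     --data-urlencode "query=CREATE TABLE IF NOT EXISTS products (title text, price float)"
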
Data types:

Read more about data types here.

| Type | Equivalent in a configuration file | Notes | Aliases |
|------|------------------------------------|-------|---------|
| text | rt_field | Options: indexed, stored. Default - both. To keep text stored, but not indexed, specify "stored" only. To keep text indexed only, specify "indexed" only. At least one "text" field should be specified in an index | string |
| integer | rt_attr_uint | integer | int, uint |
| bigint | rt_attr_bigint | big integer | |
| float | rt_attr_float | float | |
| multi | rt_attr_multi | multi-integer | |
| multi64 | rt_attr_multi_64 | multi-bigint | |
| bool | rt_attr_bool | boolean | |
| json | rt_attr_json | JSON | |
| string | rt_attr_string | string. Option: indexed - also index the strings in a full-text field with the same name | |
| timestamp | rt_attr_timestamp | timestamp | |
| bit(n) | rt_attr_uint field_name:N | N is the max number of bits to keep | |

Examples (SQL):
CREATE TABLE products (title text, price float) morphology='stem_en'

creates table "products" with two fields: "title" (full-text) and "price" (float) and setting "morphology" with value "stem_en"

CREATE TABLE products (title text indexed, description text stored, author text, price float)

creates table "products" with three fields:

  • field "title" - indexed, but not stored
  • field "description" - stored, but not indexed
  • field "author" - both stored and indexed

Engine

create table ... engine='columnar';
create table ... engine='rowwise';

Changes the default attribute storage for all attributes in the index. It can be overridden by specifying engine separately for each attribute (see the sketch below the values list).

See columnar_attrs on how to enable columnar storage for a plain index.

Values:

  • columnar - enables columnar storage for all index attributes except for json
  • rowwise (default) - doesn't change anything, i.e. makes Manticore use the traditional row-wise storage for the index
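
To illustrate overriding the storage engine for a single attribute, a sketch (the column list is illustrative) might look like:

create table products (title text, category_id int, price float engine='columnar') engine='rowwise';

Here all attributes use the default row-wise storage except "price", which is stored in columnar format.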

Other settings

The following settings apply to both real-time and plain indexes in either mode, whether specified in a configuration file or set online via the CREATE or ALTER commands.

Performance related

Accessing index files

Manticore uses two access modes to read index data - seek+read and mmap.

In seek+read mode the server performs the pread system call to read document lists and keyword positions, i.e. the *.spd and *.spp files. Internal read buffers are used to optimize reading. The size of these buffers can be tuned with the read_buffer_docs and read_buffer_hits options. There is also the preopen option, which controls how Manticore opens files at start.

In the mmap access mode the search server simply maps the index file into memory with the mmap system call and the OS caches the file contents by itself. The read_buffer_docs and read_buffer_hits options have no effect on the corresponding files in this mode. The mmap reader can also lock the index data in memory via the privileged mlock call, which prevents the OS from swapping the cached data out to disk.

To control which access mode is used, the access_plain_attrs, access_blob_attrs, access_doclists and access_hitlists options are available with the following values:

| Value | Description |
|-------|-------------|
| file | server reads index file from disk with seek+read using internal buffers on file access |
| mmap | server maps index file into memory and the OS caches its contents on file access |
| mmap_preread | server maps index file into memory and a background thread reads it once to warm up the cache |
| mlock | server maps index file into memory and then issues the mlock system call to cache the file contents and lock them in memory, preventing them from being swapped out |

| Setting | Values | Description |
|---------|--------|-------------|
| access_plain_attrs | mmap, mmap_preread (default), mlock | controls how *.spa (plain attributes), *.spe (skip lists), *.spi (word lists), *.spt (lookups), *.spm (killed docs) will be read |
| access_blob_attrs | mmap, mmap_preread (default), mlock | controls how *.spb (blob attributes: string, MVA and JSON attributes) will be read |
| access_doclists | file (default), mmap, mlock | controls how *.spd (doc lists) data will be read |
| access_hitlists | file (default), mmap, mlock | controls how *.spp (hit lists) data will be read |

Here is a table which can help you select your desired mode:

| index part | keep it on disk | keep it in memory | cached in memory on server start | lock it in memory |
|------------|-----------------|-------------------|----------------------------------|-------------------|
| plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs | mmap | mmap | mmap_preread (default) | mlock |
| row-wise string, multi-value attributes (MVA) and json attributes | mmap | mmap | mmap_preread (default) | mlock |
| columnar numeric, string and multi-value attributes | always | only by means of OS | no | not supported |
| doc lists | file (default) | mmap | no | mlock |
| hit lists | file (default) | mmap | no | mlock |

The recommendations are:
  • If you want the best search response time and have enough memory - use row-wise attributes and mlock for attributes and for doclists/hitlists
  • If you can't afford lower performance right after start and are ready to wait longer on start until the caches are warmed up - use --force-preread. If you want searchd to be able to restart faster - stay with mmap_preread
  • If you want to save RAM, but still have enough RAM for all the attributes - do not use mlock; your OS will then decide what should be in memory at any given moment depending on what is read from disk more frequently
  • If row-wise attributes don't fit into RAM - use columnar attributes
  • If full-text search performance is not a priority and you want to save RAM - use access_doclists/access_hitlists=file

The default mode is to:

  • mmap
  • preread non-columnar attributes
  • seek+read columnar attributes with no preread
  • seek+read doclists/hitlists with no preread

which provides decent search performance, optimal memory usage and faster searchd restart in most cases.
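
For instance, following the first recommendation above (everything locked in memory), a plain index sketch could look like this (index and source names are hypothetical):

index products {
  access_plain_attrs = mlock
  access_blob_attrs = mlock
  access_doclists = mlock
  access_hitlists = mlock

  path = /path/to/products
  source = src_products
}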

Other performance related settings

attr_update_reserve

attr_update_reserve = 256k

Sets the space to be reserved for blob attribute updates. Optional, default value is 128k. When blob attributes (multi-value attributes (MVA), strings, JSON) are updated, their length may change. If the updated string (or MVA or JSON) is shorter than the old one, it overwrites the old one in the *.spb file. But if the updated string is longer, the updates are written to the end of the *.spb file. This file is memory mapped, that's why resizing it may be a rather slow process, depending on the OS implementation of memory mapped files. To avoid frequent resizes, you can specify the extra space to be reserved at the end of the .spb file by using this setting.

Value: size, default 128k.

docstore_block_size

docstore_block_size = 32k

Size of the block of documents used by document storage. Optional, default is 16kb. When stored_fields or stored_only_fields are specified, original document text is stored inside the index. To use less disk space, documents are compressed. To get more efficient disk access and better compression ratios on small documents, documents are concatenated into blocks. When indexing, documents are collected until their total size reaches the threshold. After that, this block of documents is compressed. This option can be used to get better compression ratio (by increasing block size) or to get faster access to document text (by decreasing block size).

Value: size, default 16k.

docstore_compression

docstore_compression = lz4hc

Type of compression used to compress blocks of documents used by document storage. When stored_fields or stored_only_fields are specified, document storage stores compressed document blocks. 'lz4' has fast compression and decompression speeds, 'lz4hc' (high compression) has the same fast decompression but compression speed is traded for better compression ratio. 'none' disables compression.

Value: lz4 (default), lz4hc, none.

docstore_compression_level

docstore_compression_level = 12

Compression level in document storage when 'lz4hc' compression is used. When 'lz4hc' compression is used, compression level can be fine-tuned to get better performance or better compression ratio. Does not work with 'lz4' compression.

Value: 1-12 (default 9).
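
A sketch combining the document storage options above in one plain index (index, source and field names are illustrative):

index products {
  stored_fields = title, content
  docstore_block_size = 32k
  docstore_compression = lz4hc
  docstore_compression_level = 10

  path = /path/to/products
  source = src_products
}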

preopen

preopen = 1

This option tells searchd that it should pre-open all index files on startup (or rotation) and keep them open while it runs. Currently, the default mode is not to pre-open the files. Pre-opened indexes take a few (currently 2) file descriptors per index. However, they save on per-query open() calls, and they are also invulnerable to subtle race conditions that may happen during index rotation under high load. On the other hand, when serving many indexes (100s to 1000s), it still might be desired to open them on a per-query basis in order to save file descriptors.

Value: 0 (default), 1.

read_buffer_docs

read_buffer_docs = 1M

Per-keyword read buffer size for document lists. The higher the value, the higher the per-query RAM use, but possibly the lower the I/O time

Value: size, default 256k, min 8k.

read_buffer_hits

read_buffer_hits = 1M

Per-keyword read buffer size for hit lists. The higher the value, the higher the per-query RAM use, but possibly the lower the I/O time

Value: size, default 256k, min 8k.

Plain index disk footprint settings

inplace_enable

inplace_enable = {0|1}

Whether to enable in-place index inversion. Optional, default is 0 (use separate temporary files).

inplace_enable greatly reduces the indexing disk footprint for a plain index, at the cost of slightly slower indexing (it uses around 2x less disk, but yields around 90-95% of the original indexing performance).

Indexing involves two major phases. The first phase collects, processes, and partially sorts documents by keyword, and writes the intermediate result to temporary files (.tmp*). The second phase fully sorts the documents and creates the final index files. Thus, rebuilding a production index on the fly requires around 3x peak disk footprint: the 1st copy for the intermediate temporary files, the 2nd copy for the newly constructed index, and the 3rd copy for the old index that keeps serving production queries in the meantime. (Intermediate data is comparable in size to the final index.) That might be too much disk footprint for big data collections, and inplace_enable allows you to reduce it. When enabled, indexer reuses the temporary files, outputs the final data back to them, and renames them on completion. However, this might require additional temporary data chunk relocation, which is where the performance impact comes from.

This directive does not affect searchd in any way, it only affects indexer.

index products {
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_hit_gap

inplace_hit_gap = size

In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.

This directive does not affect searchd in any way, it only affects indexer.

index products {
  inplace_hit_gap = 1M
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_reloc_factor

inplace_reloc_factor = 0.1

Controls relocation buffer size within indexing memory arena. Optional, default is 0.1.

This directive does not affect searchd in any way, it only affects indexer.

index products {
  inplace_reloc_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_write_factor

inplace_write_factor = 0.1

Controls in-place write buffer size within indexing memory arena. Optional, default is 0.1.

This directive does not affect searchd in any way, it only affects indexer.

index products {
  inplace_write_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

Natural language processing specific settings

The following settings are supported. They are all described in section NLP and tokenization.

Percolate index

A percolate index is a special index which stores queries instead of documents. It is used for prospective searches, or "search in reverse".

The schema of a percolate index is fixed and contains the following fields:

| Field | Description |
|-------|-------------|
| ID | Unsigned 64-bit integer with autoincrement functionality, therefore it can be omitted when you add a PQ rule |
| Query | Full-text query of the rule. You can think of it as the value of a MATCH clause or JSON /search. If per-field operators are used inside the query, the full-text fields need to be declared in the percolate index configuration. If the stored query is supposed to do only attribute filtering (no full-text querying), the query value can be empty or simply omitted. The value of this field should correspond to the expected document schema, which you specify when you create a percolate index |
| Filters | Optional string containing attribute filters and/or expressions, defined the same way as in the WHERE clause or JSON filtering. The value of this field should correspond to the expected document schema, which you specify when you create a percolate index |
| Tags | Optional. A comma-separated list of string labels that can be used for filtering/deleting PQ rules. The tags can be returned along with matching documents when you run a percolate query |

You don't need to worry about adding the above fields when you create a percolate index.

What you need to take care of when adding a new index is specifying the expected schema of the documents that will be checked against the rules you add later. This is done the same way as for any other local index.

Creating a percolate index via MySQL protocol:
CREATE TABLE products(title text, meta json) type='pq';
Response
Query OK, 0 rows affected (0.00 sec)
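
In Plain mode, roughly the same percolate index could be declared in the configuration file; this is a sketch (the path is illustrative):

index products {
  type = percolate
  path = /path/to/products
  rt_field = title
  rt_attr_json = meta
}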

Template index

A template index is a pseudo-index: it does not store any data and does not create any files on your disk. At the same time, it can have the same NLP settings as a plain or real-time index. Template indexes can be used for a few purposes:

  • as templates from which plain/real-time indexes inherit settings in Plain mode, just to minimize the Manticore configuration file
  • keyword generation with the help of CALL KEYWORDS
  • highlighting of an arbitrary string using CALL SNIPPETS
‹›
  • CONFIG
CONFIG
📋
Creating a template index via a configuration file:
index template {
  type = template
  morphology = stem_en
  wordforms = wordforms.txt
  exceptions = exceptions.txt
  stopwords = stopwords.txt
}
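
Once such a template index is defined, it can be used, for example, to check how a query string is tokenized with the configured morphology (a sketch; the query text is arbitrary):

CALL KEYWORDS('running shoes', 'template');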