Plain table is a basic element for non-percolate searching. It can be defined only in a configuration file using the Plain mode, and is not supported in the RT mode. It is typically used in conjunction with a source to process data from the external storage and can later be attached to a real-time table.
To create a plain table, you'll need to define it in a configuration file. It's not supported by the CREATE TABLE
command.
Here's an example of a plain table configuration and a source for fetching data from a MySQL database:
- Plain table example
source source {
type = mysql
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = mydb
sql_query = SELECT id, title, description, category_id from mytable
sql_attr_uint = category_id
sql_field_string = title
}
table tbl {
type = plain
source = source
path = /path/to/table
}
- Build it from an external storage using a source and indexer
- Perform an in-place update of integer, float, string and MVA attribute
- Update it's
killlist_target
- Insert additional data into the table once it has been built
- Delete data from the table
- Create, delete, or alter the table schema online
- Use UUID for automatic ID generation (data from external storage must include a unique identifier)
Numeric attributes, including MVAs, are the only elements that can be updated in a plain table. All other data in the table is immutable. If updates or new records are required, the table must be rebuilt. During the rebuilding process, the existing table remains available to serve requests, and a process called rotation is performed when the new version is ready, bringing it online and discarding the old version.
The speed at which a plain table is indexed depends on several factors, including:
- Data source retrieval speed
- Tokenization settings
- The hardware specifications (such as CPU, RAM, and disk performance)
For small data sets, the simplest option is to have a single plain table that is fully rebuilt as needed. This approach is acceptable when:
- The data in the table is not as fresh as the data in the source
- The time it takes to build the table increases as the data set grows
For larger data sets, a plain table can be used instead of a Real-Time. The main+delta scenario involves:
- Creating a smaller table for incremental indexing
- Combining the two tables using a distributed table
This approach allows for infrequent rebuilding of the larger table and more frequent processing of updates from the source. The smaller table can be rebuilt more often (e.g. every minute or even every few seconds).
However, as time goes on, the indexing duration for the smaller table will become too long, requiring a rebuild of the larger table and the emptying of the smaller one.
The main+delta schema is explained in detail in this interactive course.
The mechanism of kill list and killlist_target directive is used to ensure that documents from the current table take precedence over those from the other table.
For more information on this topic, see here.
The following table outlines the various file extensions used in a plain table and their respective descriptions:
Extension | Description |
---|---|
.spa |
stores document attributes in row-wise mode |
.spb |
stores blob attributes in row-wise mode: strings, MVA, json |
.spc |
stores document attributes in columnar mode |
.spd |
stores matching document ID lists for each word ID |
.sph |
stores table header information |
.sphi |
stores histograms of attribute values |
.spi |
stores word lists (word IDs and pointers to .spd file) |
.spidx |
stores secondary indexes data |
.spjidx |
stores data of secondary indexes generated for JSON attributes |
.spk |
stores kill-lists |
.spl |
lock file |
.spm |
stores a bitmap of killed documents |
.spp |
stores hit (aka posting, aka word occurrence) lists for each word ID |
.spt |
stores additional data structures to speed up lookups by document ids |
.spe |
stores skip-lists to speed up doc-list filtering |
.spds |
stores document texts |
.tmp* |
temporary files during index_settings_and_status |
.new.sp* |
new version of a plain table before rotation |
.old.sp* |
old version of a plain table after rotation |
- Plain
- Real-time
table <table name> {
type = plain
path = /path/to/table
source = <source_name>
source = <another source_name>
[stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
}
type = plain
type = rt
Table type: "plain" or "rt" (real-time)
Value: plain (default), rt
path = path/to/table
The path to where the table will be stored or located, either absolute or relative, without the extension.
Value: The path to the table, mandatory
stored_fields = title, content
By default, the original content of full-text fields is indexed and stored when a table is defined in a configuration file. This setting allows you to specify the fields that should have their original values stored.
Value: A comma-separated list of full-text fields that should be stored. An empty value (i.e. stored_fields =
) disables the storage of original values for all fields.
Note: In the case of a real-time table, the fields listed in stored_fields
should also be declared as rt_field.
Also, note that you don't need to list attributes in stored_fields
, since their original values are stored anyway. stored_fields
can only be used for full-text fields.
See also docstore_block_size, docstore_compression for document storage compression options.
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- CONFIG
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)
stored_only_fields = title,content
List of fields that will be stored in the table, but not indexed. This setting works similarly to stored_fields except that when a field is specified in stored_only_fields
t will only be stored, not indexed, and cannot be searched using full-text queries. It can only be retrieved in search results.
The value is a comma-separated list of fields that should be stored only, not indexed. By default, this value is empty. If a real-time table is being defined, the fields listed in stored_only_fields
must also be declared as rt_field.
Note also, that you don't need to list attributes instored_only_fields
,since their original values are stored anyway. If to compare stored_only_fields
to string attributes the former (stored field):
- is stored on disk and does not take up memory
- is stored in a compressed format
- can only be fetched, it cannot be used for sorting, filtering or grouping
In contrast, the latter (string attribute):
- is stored on disk and in memory
- is stored in an uncompressed format
- can be used for sorting, grouping, filtering, and any other actions you want to take with attributes.
json_secondary_indexes = json_attr
By default, secondary indexes are generated for all attributes except JSON attributes. However, secondary indexes for JSON attributes can be explicitly generated using the json_secondary_indexes
setting. When a JSON attribute is included in this option, its contents are flattened into multiple secondary indexes. These indexes can be used by the query optimizer to speed up queries.
You can view the available secondary indexes using the SHOW TABLE
Value: A comma-separated list of JSON attributes for which secondary indexes should be generated.
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- CONFIG
CREATE TABLE products(title text, j json secondary_index='1')
The maximum number of disk chunks for the RT table. Learn more here.
rt_field = subject
This field declaration determines the full-text fields that will be indexed. The field names must be unique, and the order is preserved. When inserting data, the field values must be in the same order as specified in the configuration.
This is a multi-value, optional field.
rt_attr_uint = gid
This declaration defines an unsigned integer attribute.
Value: the field name or field_name:N (where N is the maximum number of bits to keep).
rt_attr_bigint = gid
This declaration defines a BIGINT attribute.
Value: field name, multiple records allowed.
rt_attr_multi = tags
Declares a multi-valued attribute (MVA) with unsigned 32-bit integer values.
Value: field name. Multiple records allowed.
rt_attr_multi_64 = wide_tags
Declares a multi-valued attribute (MVA) with signed 64-bit BIGINT values.
Value: field name. Multiple records allowed.
rt_attr_float = lat
rt_attr_float = lon
Declares floating point attributes with single precision, 32-bit IEEE 754 format.
Value: field name. Multiple records allowed.
rt_attr_float_vector = image_vector
Declares a vector of floating-point values.
Value: field name. Multiple records allowed.
rt_attr_bool = available
Declares a boolean attribute with 1-bit unsigned integer values.
Value: field name.
rt_attr_string = title
String attribute declaration.
Value: field name.
rt_attr_json = properties
Declares a JSON attribute.
Value: field name.
rt_attr_timestamp = date_added
Declares a timestamp attribute.
Value: field name.
rt_mem_limit = 512M
Memory limit for a RAM chunk of the table. Optional, default is 128M.
RT tables store some data in memory, known as the "RAM chunk," and also maintain a number of on-disk tables, referred to as "disk chunks." This directive allows you to control the size of the RAM chunk. When there is too much data to keep in memory, RT tables will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.
Please note that the limit is strict, and RT tables will never allocate more memory than what is specified in the rt_mem_limit. Additionally, memory is not preallocated, so specifying a 512MB limit and only inserting 3MB of data will result in allocating only 3MB, not 512MB.
The rt_mem_limit
is never exceeded, but the actual RAM chunk size can be significantly lower than the limit. RT tables adapt to the data insertion pace and adjust the actual limit dynamically to minimize memory usage and maximize data write speed. This is how it works:
- By default, the RAM chunk size is 50% of the
rt_mem_limit
, referred to as the "rt_mem_limit
". - As soon as the RAM chunk accumulates data equivalent to
rt_mem_limit * rate
data (50% ofrt_mem_limit
by default), Manticore starts saving the RAM chunk as a new disk chunk. - While a new disk chunk is being saved, Manticore assesses the number of new/updated documents.
- After saving a new disk chunk, the
rt_mem_limit
rate is updated. - The rate is reset to 50% each time you restart the searchd.
For instance, if 90MB of data is saved to a disk chunk and an additional 10MB of data arrives while the save is in progress, the rate would be 90%. Next time, the RT table will collect up to 90% of rt_mem_limit
before flushing the data. The faster the insertion pace, the lower the rt_mem_limit
rate. The rate varies between 33.3% to 95%. You can view the current rate of a table using the SHOW TABLE
In real-time mode, you can adjust the size limit of RAM chunks and the maximum number of disk chunks using the ALTER TABLE
statement. To set rt_mem_limit
to 1 gigabyte for the table "t," run the following query: ALTER TABLE t rt_mem_limit='1G'
. To change the maximum number of disk chunks, run the query: ALTER TABLE t optimize_cutoff='5'
.
In the plain mode, you can change the values of rt_mem_limit
and optimize_cutoff
by updating the table configuration or running the command ALTER TABLE <table_name> RECONFIGURE
- Real-time tables are similar to distributed consisting of multiple local tables, also known as disk chunks.
- Each RAM chunk is made up of multiple segments, which are special RAM-only tables.
- While disk chunks are stored on disk, RAM chunks are stored in memory.
- Each transaction made to a real-time table generates a new segment, and RAM chunk segments are merged after each transaction commit. It is more efficient to perform bulk INSERTs of hundreds or thousands of documents rather than multiple separate INSERTs with one document to reduce the overhead from merging RAM chunk segments.
- When the number of segments exceeds 32, they will be merged to keep the count below 32.
- Real-time tables always have one RAM chunk (which may be empty) and one or more disk chunks.
- Merging larger segments takes longer, so it's best to avoid having a very large RAM chunk (and therefore
rt_mem_limit
). - The number of disk chunks depends on the data in the table and the
rt_mem_limit
setting. - Searchd flushes the RAM chunk to disk (as a persisted file, not as a disk chunk) on shutdown and periodically according to the rt_flush_period setting. Flushing several gigabytes to disk may take some time.
- A large RAM chunk puts more pressure on storage, both when flushing to disk into the
.ram
file and when the RAM chunk is full and dumped to disk as a disk chunk. - The RAM chunk is backed up by a binary log until it is flushed to disk, and a larger
rt_mem_limit
, setting will increase the time it takes to replay the binary log and recover the RAM chunk. - The RAM chunk may be slightly slower than a disk chunk.
- Although the RAM chunk itself doesn't take up more memory than
rt_mem_limit
Manticore may take up more memory in some cases, such as when you start a transaction to insert data and don't commit it for a while. In this case, the data you have already transmitted within the transaction will remain in memory.
source = srcpart1
source = srcpart2
source = srcpart3
The source field specifies the source from which documents will be obtained during indexing of the current table. There must be at least one source. The sources can be of different types (e.g. one could be MySQL, another PostgreSQL). For more information on indexing from external storages, indexing from external storages here
Value: The name of the source is mandatory. Multiple values are allowed.
killlist_target = main:kl
This setting determines the table(s) to which the kill-list will be applied. Matches in the targeted table that are updated or deleted in the current table will be suppressed. In :kl
mode, the documents to suppress are taken from the kill-list. In :id
mode, all document IDs from the current table are suppressed in the targeted one. If neither is specified, both modes will take effect. Learn more about kill-lists here
Value: not specified (default), target_table_name:kl, target_table_name:id, target_table_name. Multiple values are allowed
columnar_attrs = *
columnar_attrs = id, attr1, attr2, attr3
This configuration setting determines which attributes should be stored in the columnar storage instead of the row-wise storage.
You can set columnar_attrs = *
to store all supported data types in the columnar storage.
Additionally, id
is a supported attribute to store in the columnar storage.
columnar_strings_no_hash = attr1, attr2, attr3
By default, all string attributes stored in columnar storage store pre-calculated hashes. These hashes are used for grouping and filtering. However, they occupy extra space, and if you don't need to group by that attribute, you can save space by disabling hash generation.
CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]
For more information on data types, see more about data types here.
Type | Equivalent in a configuration file | Notes | Aliases |
---|---|---|---|
text | rt_field | Options: indexed, stored. Default: both. To keep text stored, but indexed, specify "stored" only. To keep text indexed only, specify "indexed" only. | string |
integer | rt_attr_uint | integer | int, uint |
bigint | rt_attr_bigint | big integer | |
float | rt_attr_float | float | |
float_vector | rt_attr_float_vector | a vector of float values | |
multi | rt_attr_multi | multi-integer | |
multi64 | rt_attr_multi_64 | multi-bigint | |
bool | rt_attr_bool | boolean | |
json | rt_attr_json | JSON | |
string | rt_attr_string | string. Option indexed, attribute will make the value full-text indexed and filterable, sortable and groupable at the same time |
|
timestamp | rt_attr_timestamp | timestamp | |
bit(n) | rt_attr_uint field_name:N | N is the max number of bits to keep |
- SQL
CREATE TABLE products (title text, price float) morphology='stem_en'
This creates the "products" table with two fields: "title" (full-text) and "price" (float), and sets the "morphology" to "stem_en".
CREATE TABLE products (title text indexed, description text stored, author text, price float)
This creates the "products" table with three fields:
- "title" is indexed, but not stored.
- "description" is stored, but not indexed.
- "author" is both stored and indexed.
create table ... engine='columnar';
create table ... engine='rowwise';
The engine setting changes the default attribute storage for all attributes in the table. You can also specify engine
separately for each attribute.
For information on how to enable columnar storage for a plain table, see columnar_attrs .
Values:
- columnar - Enables columnar storage for all table attributes, except for json
- rowwise (default) - Doesn't change anything and uses the traditional row-wise storage for the table.
The following settings are applicable for both real-time and plain tables, regardless of whether they are specified in a configuration file or set online using the CREATE
or ALTER
command.
Manticore supports two access modes for reading table data: seek+read and mmap.
In seek+read mode, the server uses the pread
system call to read document lists and keyword positions, represented by the*.spd
and *.spp
files. The server uses internal read buffers to optimize the reading process, and the size of these buffers can be adjusted using the options read_buffer_docs and read_buffer_hits.There is also the option preopen that controls how Manticore opens files at start.
In mmap access mode, the search server maps the table's file into memory using the mmap
system call, and the OS caches the file contents. The options read_buffer_docs and read_buffer_hits have no effect for corresponding files in this mode. The mmap reader can also lock the table's data in memory using themlock
privileged call, which prevents the OS from swapping the cached data out to disk.
To control which access mode to use, the options access_plain_attrs, access_blob_attrs, access_doclists, access_hitlists and access_dict are available, with the following values:
Value | Description |
---|---|
file | server reads the table files from disk with seek+read using internal buffers on file access |
mmap | server maps the table files into memory and OS caches up its contents on file access |
mmap_preread | server maps the table files into memory and a background thread reads it once to warm up the cache |
mlock | server maps the table files into memory and then executes the mlock() system call to cache up the file contents and lock it into memory to prevent it being swapped out |
Setting | Values | Description |
---|---|---|
access_plain_attrs | mmap, mmap_preread (default), mlock | controls how *.spa (plain attributes) *.spe (skip lists) *.spt (lookups) *.spm (killed docs) will be read |
access_blob_attrs | mmap, mmap_preread (default), mlock | controls how *.spb (blob attributes) (string, mva and json attributes) will be read |
access_doclists | file (default), mmap, mlock | controls how *.spd (doc lists) data will be read |
access_hitlists | file (default), mmap, mlock | controls how *.spp (hit lists) data will be read |
access_dict | mmap, mmap_preread (default), mlock | controls how *.spi (dictionary) will be read |
Here is a table which can help you select your desired mode:
table part | keep it on disk | keep it in memory | cached in memory on server start | lock it in memory |
---|---|---|---|---|
plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs | mmap | mmap | mmap_preread (default) | mlock |
row-wise string, multi-value attributes (MVA) and json attributes | mmap | mmap | mmap_preread (default) | mlock |
columnar numeric, string and multi-value attributes | always | only by means of OS | no | not supported |
doc lists | file (default) | mmap | no | mlock |
hit lists | file (default) | mmap | no | mlock |
dictionary | mmap | mmap | mmap_preread (default) | mlock |
- For the fastest search response time and ample memory availability, use row-wise attributes and lock them in memory using
mlock
. Additionally, use mlock for doclists/hitlists. - If you prioritize can't afford lower performance after start and are willing to sacrifice longer startup time, use the --force-preread. option. If you desire faster searchd restart, stick to the default
mmap_preread
option. - If you are looking to conserve memory, while still having enough memory for all attributes, skip the use of
mlock
. The operating system will determine what should be kept in memory based on frequent disk reads. - If row-wise attributes do not fit into memory, opt for columnar attributes
- If full-text search performance is not a concern, and you wish to save memory, use
access_doclists/access_hitlists=file
The default mode offers a balance of:
- mmap,
- Prereading non-columnar attributes,
- Seeking and reading columnar attributes with no preread,
- Seeking and reading doclists/hitlists with no preread.
This provides a decent search performance, optimal memory utilization, and faster searchd restart in most scenarios.
attr_update_reserve = 256k
This setting reserves extra space for updates to blob attributes such as multi-value attributes (MVA), strings, and JSON. The default value is 128k. When updating these attributes, their length may change. If the updated string is shorter than the previous one, it will overwrite the old data in the *.spb
file. If the updated string is longer, it will be written to the end of the *.spb
file. This file is memory-mapped, making resizing it a potentially slow process, depending on the operating system's memory-mapped file implementation. To avoid frequent resizing, you can use this setting to reserve extra space at the end of the .spb file.
Value: size, default 128k.
docstore_block_size = 32k
This setting controls the size of blocks used by the document storage. The default value is 16kb. When original document text is stored using stored_fields or stored_only_fields, it is stored within the table and compressed for efficiency. To optimize disk access and compression ratios for small documents, these documents are concatenated into blocks. The indexing process collects documents until their total size reaches the threshold specified by this option. At that point, the block of documents is compressed. This option can be adjusted to achieve better compression ratios (by increasing the block size) or faster access to document text (by decreasing the block size).
Value: size, default 16k.
docstore_compression = lz4hc
This setting determines the type of compression used for compressing blocks of documents stored in document storage. If stored_fields or stored_only_fields are specified, the document storage stores compressed document blocks. 'lz4' offers fast compression and decompression speeds, while 'lz4hc' (high compression) sacrifices some compression speed for a better compression ratio. 'none' disables compression completely.
Values: lz4 (default), lz4hc, none.
docstore_compression_level = 12
The compression level used when 'lz4hc' compression is applied in document storage. By adjusting the compression level, you can find the right balance between performance and compression ratio when using 'lz4hc' compression. Note that this option is not applicable when using 'lz4' compression.
Value: An integer between 1 and 12, with a default of 9.
preopen = 1
This setting indicates that searchd should open all table files on startup or rotation, and keep them open while running. By default, the files are not pre-opened. Pre-opened tables require a few file descriptors per table, but they eliminate the need for per-query open() calls and are immune to race conditions that might occur during table rotation under high load. However, if you are serving many tables, it may still be more efficient to open them on a per-query basis in order to conserve file descriptors.
Value: 0 (default), or 1.
read_buffer_docs = 1M
Buffer size for storing the list of documents per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.
Value: size, default 256k, minimum value is 8k.
read_buffer_hits = 1M
Buffer size for storing the list of hits per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.
Value: size, default 256k, minimum value is 8k.
inplace_enable = {0|1}
Enables in-place table inversion. Optional, default is 0 (uses separate temporary files).
The inplace_enable
option reduces the disk footprint during indexing of plain tables, while slightly slowing down indexing (it uses approximately 2 times less disk, but yields around 90-95% of the original performance).
Indexing is comprised of two primary phases. During the first phase, documents are collected, processed, and partially sorted by keyword, and the intermediate results are written to temporary files (.tmp*). During the second phase, the documents are fully sorted and the final table files are created. Rebuilding a production table on-the-fly requires approximately 3 times the peak disk footprint: first for the intermediate temporary files, second for the newly constructed copy, and third for the old table that will be serving production queries in the meantime. (Intermediate data is comparable in size to the final table.) This may be too much disk footprint for large data collections, and the inplace_enable
option can be used to reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them upon completion. However, this may require additional temporary data chunk relocation, which is where the performance impact comes from.
This directive has no effect on searchd, it only affects the indexer.
- CONFIG
table products {
inplace_enable = 1
path = products
source = src_base
}
inplace_hit_gap = size
The option In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.
This directive only affects the searchd tool, and does not have any impact on the indexer.
- CONFIG
table products {
inplace_hit_gap = 1M
inplace_enable = 1
path = products
source = src_base
}
inplace_reloc_factor = 0.1
The inplace_reloc_factor setting determines the size of the relocation buffer within the memory arena used during indexing. The default value is 0.1.
This option is optional and only affects the indexer tool, not the searchd server.
- CONFIG
table products {
inplace_reloc_factor = 0.1
inplace_enable = 1
path = products
source = src_base
}
inplace_write_factor = 0.1
Controls the size of the buffer used for in-place writing during indexing. Optional, with a default value of 0.1.
It's important to note that this directive only impacts the indexer tool and not the searchd server.
- CONFIG
table products {
inplace_write_factor = 0.1
inplace_enable = 1
path = products
source = src_base
}
The following settings are supported. They are all described in section NLP and tokenization.
- bigram_freq_words
- bigram_index
- blend_chars
- blend_mode
- charset_table
- dict
- embedded_limit
- exceptions
- expand_keywords
- global_idf
- hitless_words
- html_index_attrs
- html_remove_elements
- html_strip
- ignore_chars
- index_exact_words
- index_field_lengths
- index_sp
- index_token_filter
- index_zones
- infix_fields
- killlist_target
- max_substring_len
- min_infix_len
- min_prefix_len
- min_stemming_len
- min_word_len
- morphology
- morphology_skip_fields
- ngram_chars
- ngram_len
- overshort_step
- phrase_boundary
- phrase_boundary_step
- prefix_fields
- regexp_filter
- stopwords
- stopword_step
- stopwords_unstemmed
- stored_fields
- stored_only_fields
- wordforms
A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."
- To learn more about performing a search query against a percolate table, see the section Percolate query.
- To learn how to prepare a table for searching, see the section Adding rules to a percolate table.
The schema of a percolate table is fixed and contains the following fields:
Field | Description |
---|---|
ID | An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule |
Query | Full-text query of the rule, which can be thought of as the value of MATCH clause or JSON /search. If per field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
Filters | Optional. Filters are an optional string containing attribute filters and/or expressions, defined the same way as in the WHERE clause or JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table. |
Tags | Optional. Tags represent a list of string labels separated by commas that can be used for filtering/deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query |
Note that you do not need to add the above fields when creating a percolate table.
What you need to keep in mind when creating a new percolate table is to specify the expected schema of a document, which will be checked against the rules you will add later. This is done in the same way as for any other local table.
- SQL
- JSON
- PHP
- Python
- javascript
- java
- C#
- typescript
- go
- CONFIG
CREATE TABLE products(title text, meta json) type='pq';
Query OK, 0 rows affected (0.00 sec)