A plain table can be created from an external source using a special tool called indexer, which reads a "recipe" from the configuration, connects to the data sources, pulls documents, and builds table files. This is a lengthy process. If your data changes, the table becomes outdated, and you need to rebuild it from the refreshed sources. If your data grows incrementally, as in a blog or newsfeed where old documents never change and only new ones are added, each rebuild takes longer and longer, because every pass has to process the archive sources again.
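For example, a full rebuild of all plain tables is typically a one-liner (the config path here is illustrative):
- Shell
# rebuild all plain tables and rotate the new files into the running server
indexer --config /etc/manticoresearch/manticore.conf --all --rotate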
One way to deal with this problem is by using several tables instead of one solid table. For example, you can process sources produced in previous years and save the table. Then, take only sources from the current year and put them into a separate table, rebuilding it as often as necessary. You can then place both tables as parts of a distributed table and use it for querying. The point here is that each time you rebuild, you only process data from the last 12 months at most, and the table with older data remains untouched without needing to be rebuilt. You can go further and divide the last 12 months table into monthly, weekly, or daily tables, and so on.
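A minimal sketch of such a layout in the plain config (all table and source names here are hypothetical):
- CONFIG
# rarely rebuilt archive part
table blog_archive {
    type = plain
    source = src_archive
    path = /var/lib/manticore/blog_archive
}

# frequently rebuilt current part
table blog_current {
    type = plain
    source = src_current
    path = /var/lib/manticore/blog_current
}

# single entry point for searching both parts
table blog {
    type = distributed
    local = blog_archive
    local = blog_current
}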
This approach works, but you need to maintain your distributed table manually: add new chunks, delete old ones, and keep the overall number of partial tables reasonably small (with too many tables, searching can become slower, and the OS usually limits the number of simultaneously opened files). To deal with this, you can manually merge several tables together by running indexer --merge (see the example after this paragraph). However, that only solves the problem of having too many tables; the maintenance itself remains manual. And even with 'per-hour' reindexing, you will most likely have a noticeable time gap between new data arriving in the sources and the table rebuild that makes this data searchable.
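Such a merge might look like this, folding the current chunk into the archive (names as in the sketch above):
- Shell
# merge blog_current into blog_archive and rotate the result in
indexer --merge blog_archive blog_current --rotate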
A real-time table is designed to solve this problem. It consists of two parts:
- A special RAM-based table (called RAM chunk) that contains portions of data arriving right now.
- A collection of plain tables called disk chunks that were built in the past.
This is very similar to a standard distributed table, made from several local tables.
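For illustration, in RT mode such a table can be created directly via SQL (the table and field names here are hypothetical):
- SQL
CREATE TABLE rt (title TEXT, content TEXT, gid INT);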
You don't need to build such a table by running indexer, which reads a "recipe" from the config and processes the data sources. Instead, the real-time table provides the ability to 'insert' new documents and 'replace' existing ones. When executing the 'insert' command, you push new documents to the server. It then builds a small table from the added documents and immediately brings it online. So, right after the 'insert' command completes, you can perform searches in all table parts, including the just-added documents.
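For example, against the hypothetical rt table above:
- SQL
INSERT INTO rt (id, title, content, gid) VALUES (1, 'first post', 'hello world', 10);
REPLACE INTO rt (id, title, content, gid) VALUES (1, 'first post', 'edited text', 10);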
The search server automatically maintains the table, so you don't have to worry about it. However, you might be interested in learning a few details about 'how it is maintained'.
First, since indexed data is stored in RAM - what about an emergency power-off? Will I lose my table then? Well, before a transaction completes, the server saves the new data into a special 'binlog'. This consists of one or several files, living on your persistent storage, which grow incrementally as you add more and more changes. You can adjust how often new queries (or transactions) are flushed to the binlog, and how often the 'sync' command is executed over the binlog file to force the OS to actually put the data on safe storage. The most paranoid approach is to flush and sync after every transaction; this is the slowest but also the safest method. The cheapest way is to switch off the binlog entirely; this is the fastest, but you risk losing your indexed data. Intermediate variants, like flushing and syncing every second, are also provided.
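These modes map to the binlog_flush directive in the searchd section of the config; a minimal sketch (the path is illustrative; mode 2, flush per transaction and sync once per second, is the default):
- CONFIG
searchd {
    binlog_path = /var/lib/manticore/binlog
    # 0 = flush+sync every second, 1 = flush+sync every transaction,
    # 2 = flush every transaction, sync every second
    binlog_flush = 2
}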
The binlog is designed specifically for sequential saving of newly arriving transactions; it is not a table and cannot be searched over. It is merely an insurance policy to ensure that the server will not lose your data. If everything suddenly crashes due to a software or hardware problem, the server will load the freshest available dump of the RAM chunk and then replay the binlog, repeating the stored transactions. Ultimately, it will reach the same state it was in at the moment of the last change.
Second, what about limits? What if I want to index, say, 10TB of data, but it just doesn't fit into RAM? RAM for a real-time table is limited and can be configured. When a certain amount of data has been indexed, the server manages the RAM part of the table by merging small transactions together, keeping their number and overall size down (this process can sometimes cause delays during insertion). When merging no longer helps and new insertions hit the RAM limit, the server converts the RAM-based table into a plain table stored on disk (called a disk chunk). This table is added to the collection of tables that forms the second part of the RT table and becomes accessible online. The RAM is then flushed, and the space is deallocated.
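The RAM budget is controlled per table by the rt_mem_limit setting (128M by default), which can be given at creation time, for example (names hypothetical):
- SQL
CREATE TABLE rt_big (title TEXT) rt_mem_limit = '512M';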
When the data from RAM is securely saved to disk, which occurs:
- when the server saves the collected data as a disk table
- or when it dumps the RAM part during a clean shutdown or by manual flushing
the binlog for that table is no longer necessary and gets discarded. Once all tables have saved their data, the binlog files are deleted.
Third, what about the disk chunk collection? If having many disk parts makes searching slower, what's the difference between building them manually in the distributed-table manner and having them produced as disk parts (or 'chunks') by an RT table? Well, in both cases, you can merge several tables into one. For example, you can merge hourly tables from yesterday and keep one 'daily' table for yesterday instead. With manual maintenance, you have to think about the schema and commands yourself. With an RT table, the server provides the OPTIMIZE command, which does the same but spares you the unnecessary internal details.
Fourth, if my "document" lives in its own 'mini-table' and I don't need it anymore, I can just throw that away. But if it has been 'optimized', i.e. mixed together with tons of other documents, how can I undo or delete it? Yes, indexed documents are 'mixed' together, and there is no easy way to delete one without rebuilding the whole table. And while for plain tables rebuilding or merging is just the normal way of maintenance, for a real-time table it would keep the simplicity of manipulation but sacrifice the 'real-timeness'. To address the problem, Manticore uses a trick: when you delete a document, identified by its document ID, the server just records that ID. Together with the IDs of other deleted documents, it is saved in a so-called kill-list. When you search over the table, the server first retrieves all matching documents and then throws out the documents that are found in the kill-list (that is the most basic description; internally it's more complex). The point is: for the sake of 'immediate' deletion, documents are not actually deleted but just marked as 'deleted'. They still occupy space in different table structures, being essentially garbage. Word statistics, which affect ranking, aren't adjusted either, meaning it works exactly as declared: we search among all documents, and then just hide the ones marked as deleted from the final result. When a document is replaced, it is killed in the old parts of the table and inserted again into the freshest part. All the consequences of 'hiding by kill-list' apply in this case too.
When a rebuild of some part of a table happens, e.g., when some transactions (segments) of a RAM chunk are merged, or when a RAM chunk is converted into a disk chunk, or when two disk chunks are merged together, the server performs a comprehensive iteration over the affected parts and physically excludes deleted documents from all of them. That is, if they appeared in the document lists of some words, they are wiped from those lists; if a word occurred only in deleted documents, it gets removed from the dictionary completely.
As a summary, deletion works in two phases (see the example after this list):
- First, we mark documents as 'deleted' in real time and suppress them in search results.
- Later, during some operation on an RT table chunk (such as a merge or optimization), we physically wipe the deleted documents for good.
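For instance, with the hypothetical rt table from earlier, phase one is triggered by a DELETE, and a later OPTIMIZE performs the physical purge:
- SQL
DELETE FROM rt WHERE id = 1;
OPTIMIZE TABLE rt;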
Fifth, if an RT table contains plain disk tables in its collection, can I just add my ready old disk table to it? Generally, no: to avoid unneeded complexity and prevent accidental corruption, this is not allowed. However, if your RT table has just been created and contains no data, then you can ATTACH TABLE your disk table to it. Your old table will be moved inside the RT table and will become its part.
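For example (table names hypothetical):
- SQL
ATTACH TABLE plain_old TO TABLE rt;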
As a summary of the RT table structure: it is a cleverly organized collection of plain disk tables plus a fast in-memory table, intended for real-time insertions and semi-real-time deletions of documents. The RT table has a common schema and common settings, and can be easily maintained without digging deep into the details.
FLUSH RAMCHUNK rt_table
The FLUSH RAMCHUNK command creates a new disk chunk in an RT table.
Normally, an RT table would automatically flush and convert the contents of the RAM chunk into a new disk chunk once the RAM chunk reaches the maximum allowed rt_mem_limit size. However, for debugging and testing purposes, it might be useful to forcibly create a new disk chunk, and the FLUSH RAMCHUNK statement does exactly that.
- SQL
FLUSH RAMCHUNK rt;
Query OK, 0 rows affected (0.05 sec)
FLUSH TABLE rt_table
The FLUSH TABLE statement forcefully flushes the RT table RAM chunk contents to disk.
The real-time table RAM chunk is automatically flushed to disk during a clean shutdown, or periodically every rt_flush_period seconds.
Issuing a FLUSH TABLE command not only forces the RAM chunk contents to be written to disk but also triggers the cleanup of binary log files.
- SQL
FLUSH TABLE rt;
Query OK, 0 rows affected (0.05 sec)
Over time, RT tables may become fragmented into numerous disk chunks and/or contaminated with deleted, yet unpurged data, affecting search performance. In these cases, optimization is necessary. Essentially, the optimization process combines pairs of disk chunks, removing documents that were previously deleted using DELETE statements.
Beginning with Manticore 4, this process occurs automatically by default. However, you can also use the following commands to manually initiate table compaction.
OPTIMIZE TABLE table_name [OPTION opt_name = opt_value [,...]]
The OPTIMIZE statement adds an RT table to the optimization queue, which is processed in a background thread.
- SQL
OPTIMIZE TABLE rt;
By default, OPTIMIZE merges the RT table's disk chunks down to a number less than or equal to the number of logical CPU cores multiplied by 2.
However, if the table has attributes with KNN indexes, this threshold is different. In this case, it is set to the number of physical CPU cores divided by 2 to improve KNN search performance.
You can also control the number of optimized disk chunks manually using the cutoff option.
Additional options include:
- Server setting optimize_cutoff for overriding the default threshold
- Per-table setting optimize_cutoff
- SQL
OPTIMIZE TABLE rt OPTION cutoff=4;
When using OPTION sync=1 (0 by default), the command will wait for the optimization process to complete before returning. If the connection is interrupted, the optimization will continue running on the server.
- SQL
OPTIMIZE TABLE rt OPTION sync=1;
Optimization can be a lengthy and I/O-intensive process. To minimize the impact, all actual merge work is executed serially in a special background thread, and the OPTIMIZE statement simply adds a job to its queue. The optimization thread can be I/O-throttled, and you can control the maximum number of I/Os per second and the maximum I/O size with the rt_merge_iops and rt_merge_maxiosize directives, respectively.
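A throttling sketch for the searchd section of the config (the values are illustrative, not recommendations):
- CONFIG
searchd {
    rt_merge_iops = 40       # max I/O operations per second for merges
    rt_merge_maxiosize = 1M  # max size of a single merge I/O
}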
During optimization, the RT table being optimized remains online and available for both searching and updates nearly all the time. It is locked for a very brief period when a pair of disk chunks is successfully merged, allowing for the renaming of old and new files and updating the table header.
As long as auto_optimize is not disabled, tables are optimized automatically.
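auto_optimize is a searchd config directive; disabling it looks like this:
- CONFIG
searchd {
    auto_optimize = 0  # 0 disables automatic OPTIMIZE
}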
If you are experiencing unexpected SSTs (state snapshot transfers) or want tables across all nodes of the cluster to be binary identical, you need to:
- Disable auto_optimize.
- Manually optimize tables:
On one of the nodes, drop the table from the cluster:
- SQL
ALTER CLUSTER mycluster DROP myindex;
Optimize the table:
- SQL
OPTIMIZE TABLE myindex;
Add back the table to the cluster:
- SQL
ALTER CLUSTER mycluster ADD myindex;
When the table is added back, the new files created by the optimization process will be replicated to the other nodes in the cluster. Any local changes made to the table on other nodes will be lost.
Table data modifications (inserts, replaces, deletes, updates) should either:
- Be postponed, or
- Be directed to the node where the optimization process is running.
Note that while the table is out of the cluster, insert/replace/delete/update commands should refer to it without the cluster name prefix (for SQL statements, or without the cluster property in the case of an HTTP JSON request); otherwise they will fail. Once the table is added back to the cluster, write operations must include the cluster name prefix again, or they will fail.
Search operations are available as usual during the process on any of the nodes.