Backup and restore

It's crucial to regularly back up your tables to recover them in case of system crashes, hardware failure, or data corruption/loss. Backups are also necessary before upgrading to a new version of Manticore Search that changes the table format, and for transferring data to another system when migrating to a new server.

The manticore-backup tool, included in the official Manticore Search packages, automates the process of backing up tables for an instance running in RT mode.

Installation

If you followed the official installation instructions, you should already have everything installed and don't need to worry. Otherwise, manticore-backup requires PHP 8.1.10 with specific modules, or manticore-executor, which is part of the manticore-extra package; you need to ensure that one of these is available.

Note that manticore-backup is not available for Windows yet.

How to use

First, make sure you're running manticore-backup on the same server where the Manticore instance you are about to back up is running.

Second, we recommend running the tool under the root user so it can transfer ownership of the files you are backing up. Otherwise, a backup will still be made, but without the ownership transfer. In either case, you should make sure that manticore-backup has access to the data dir of the Manticore instance.

The only required argument for manticore-backup is --backup-dir, which specifies the destination for the backup. If you don't provide any additional arguments, manticore-backup will:

  • locate a Manticore instance running with the default configuration
  • create a subdirectory in the --backup-dir directory with a timestamped name
  • backup all tables found in the instance
Example
manticore-backup --config=path/to/manticore.conf --backup-dir=backupdir
Response
Copyright (c) 2023, Manticore Software LTD (https://manticoresearch.com)

Manticore config file: /etc/manticoresearch/manticore.conf
Tables to backup: all tables
Target dir: /mnt/backup/

Manticore config
  endpoint =  127.0.0.1:9308

Manticore versions:
  manticore: 5.0.2
  columnar: 1.15.4
  secondary: 1.15.4
2022-10-04 17:18:39 [Info] Starting the backup...
2022-10-04 17:18:39 [Info] Backing up config files...
2022-10-04 17:18:39 [Info]   config files - OK
2022-10-04 17:18:39 [Info] Backing up tables...
2022-10-04 17:18:39 [Info]   pq (percolate) [425B]...
2022-10-04 17:18:39 [Info]    OK
2022-10-04 17:18:39 [Info]   products (rt) [512B]...
2022-10-04 17:18:39 [Info]    OK
2022-10-04 17:18:39 [Info] Running sync
2022-10-04 17:18:42 [Info]  OK
2022-10-04 17:18:42 [Info] You can find backup here: /mnt/backup/backup-20221004171839
2022-10-04 17:18:42 [Info] Elapsed time: 2.76s
2022-10-04 17:18:42 [Info] Done

To back up specific tables only, use the --tables flag followed by a comma-separated list of tables, for example --tables=tbl1,tbl2. This will back up only the specified tables and ignore the rest.

Example
manticore-backup --backup-dir=/mnt/backup/ --tables=products
Response
Copyright (c) 2023, Manticore Software LTD (https://manticoresearch.com)

Manticore config file: /etc/manticoresearch/manticore.conf
Tables to backup: products
Target dir: /mnt/backup/

Manticore config
  endpoint =  127.0.0.1:9308

Manticore versions:
  manticore: 5.0.3
  columnar: 1.16.1
  secondary: 0.0.0
2022-10-04 17:25:02 [Info] Starting the backup...
2022-10-04 17:25:02 [Info] Backing up config files...
2022-10-04 17:25:02 [Info]   config files - OK
2022-10-04 17:25:02 [Info] Backing up tables...
2022-10-04 17:25:02 [Info]   products (rt) [512B]...
2022-10-04 17:25:02 [Info]    OK
2022-10-04 17:25:02 [Info] Running sync
2022-10-04 17:25:06 [Info]  OK
2022-10-04 17:25:06 [Info] You can find backup here: /mnt/backup/backup-20221004172502
2022-10-04 17:25:06 [Info] Elapsed time: 4.82s
2022-10-04 17:25:06 [Info] Done

Arguments

--backup-dir=path
  Path to the directory where the backup will be stored. The directory must already exist. This argument is required and has no default value. On each run, manticore-backup creates a subdirectory inside the provided directory with a timestamp in the name (backup-[datetime]) and copies all required tables to it. So --backup-dir is a container for all your backups, and it's safe to run the script multiple times.

--restore[=backup]
  Restore from --backup-dir. Just --restore lists the available backups. --restore=backup restores from <--backup-dir>/backup.

--config=/path/to/manticore.conf
  Path to the Manticore config. Optional. If it's not passed, a default one for your operating system is used. It's used to get the host and port to communicate with the Manticore daemon.

--tables=tbl1,tbl2,...
  Comma-separated list of the tables you want to back up. To back up all tables, omit this argument. All the provided tables must exist in the Manticore instance you are backing up from, or the backup will fail.

--compress
  Whether the backed-up files should be compressed. Not enabled by default.

--unlock
  In rare cases when something goes wrong, tables can be left in a locked state. Use this argument to unlock them.

--version
  Show the current version.

--help
  Show this help.
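
Several of the arguments above can be combined. For instance, a compressed backup of a single table might look like this (the backup path and table name are just placeholders):

manticore-backup --backup-dir=/mnt/backup/ --tables=products --compress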

BACKUP SQL command reference

You can also back up your data through SQL by running the simple command BACKUP TO /path/to/backup.

Note that this command is not supported on Windows yet.

General syntax of BACKUP

BACKUP
  [{TABLE | TABLES} a[, b]]
  [{OPTION | OPTIONS}
    async = {on | off | 1 | 0 | true | false | yes | no}
    [, compress = {on | off | 1 | 0 | true | false | yes | no}]
  ]
  TO path_to_backup

For instance, to back up tables a and b to the /backup directory, run the following command:

BACKUP TABLES a, b TO /backup

There are options available to control and adjust the backup process, such as:

  • async: makes the backup non-blocking, allowing you to receive a response with the query ID immediately and run other queries while the backup is ongoing. The default value is 0.
  • compress: enables file compression using zstd. The default value is 0.

For example, to run a backup of all tables in async mode with compression enabled to the /tmp directory:

BACKUP OPTION async = yes, compress = yes TO /tmp

Important considerations

  1. The path should not contain special symbols or spaces, as they are not supported.
  2. Ensure that Manticore Buddy is launched (it is by default).

Restore

To restore a Manticore instance from a backup, use the manticore-backup command with the --backup-dir and --restore arguments. For example: manticore-backup --backup-dir=/path/to/backups --restore. If you don't provide any argument for --restore, it will simply list all the backups in the --backup-dir.

Example
manticore-backup --backup-dir=/mnt/backup/ --restore
Response
Copyright (c) 2023, Manticore Software LTD (https://manticoresearch.com)

Manticore config file:
Backup dir: /mnt/backup/

Available backups: 3
  backup-20221006144635 (Oct 06 2022 14:46:35)
  backup-20221006145233 (Oct 06 2022 14:52:33)
  backup-20221007104044 (Oct 07 2022 10:40:44)

To start a restore job, run manticore-backup with the flag --restore=backup name, where backup name is the name of the backup directory within the --backup-dir. Note that:

  1. There can't be any Manticore instance running on the same host and port as the one being restored.
  2. The old manticore.json file must not exist.
  3. The old configuration file must not exist.
  4. The old data directory must exist and be empty.

If all the conditions are met, the restore will proceed. The tool provides hints, so you don't have to memorize them. It's crucial to avoid overwriting existing files, so make sure to remove them before the restore if they still exist; hence all the conditions above.
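
For instance, a pre-restore cleanup on a typical Linux setup might look like the sketch below. The paths are assumptions based on the default packages; adjust them to your actual config file and data dir locations.

# stop the instance: nothing may run on the host/port being restored
systemctl stop manticore
# the old config file and manticore.json must not exist; move them away rather than deleting
mv /etc/manticoresearch/manticore.conf /root/manticore.conf.bak
mv /var/lib/manticore/manticore.json /root/manticore.json.bak
# the data dir must exist and be empty
find /var/lib/manticore -mindepth 1 -delete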

Example
manticore-backup --backup-dir=/mnt/backup/ --restore=backup-20221007104044
Response
Copyright (c) 2023, Manticore Software LTD (https://manticoresearch.com)

Manticore config file:
Backup dir: /mnt/backup/
2022-10-07 11:17:25 [Info] Starting to restore...

Manticore config
  endpoint =  127.0.0.1:9308
2022-10-07 11:17:25 [Info] Restoring config files...
2022-10-07 11:17:25 [Info]   config files - OK
2022-10-07 11:17:25 [Info] Restoring state files...
2022-10-07 11:17:25 [Info]   config files - OK
2022-10-07 11:17:25 [Info] Restoring data files...
2022-10-07 11:17:25 [Info]   config files - OK
2022-10-07 11:17:25 [Info] The backup '/mnt/backup/backup-20221007104044' was successfully restored.
2022-10-07 11:17:25 [Info] Elapsed time: 0.02s
2022-10-07 11:17:25 [Info] Done

Real-time table structure

A plain table can be created from an external source by a special tool, indexer, which reads a "recipe" from the configuration, then connects to the data sources, pulls documents, and builds the table files. That is quite a long process. If your data then changes, the table becomes outdated, and you need to rebuild it from the refreshed sources. If your data changes incrementally - for example, you index a blog or a newsfeed where old documents never change and only new ones are added - such a rebuild will take more and more time, as on each pass you will need to process the archived sources again and again.

One way to deal with this problem is to use several tables instead of one solid table. For example, you can process the sources produced in previous years and save the result as one table. Then take only the sources from the current year, put them into a separate table, and rebuild it as often as necessary. You can then place both tables as parts of a distributed table and use it for querying. The point here is that each time you rebuild only the data for at most the last 12 months, and the table with the older data remains untouched, with no need to rebuild it. You can go further and divide the last-12-months table into monthly, weekly, or daily tables, and so on.
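
As an illustration, in plain mode such a setup might be described in manticore.conf roughly as follows. The table and source names here are made up, and the source definitions are omitted:

table archive {
    type = plain
    path = /var/lib/manticore/archive
    source = src_archive    # previous years; rebuilt rarely, if ever
}

table current {
    type = plain
    path = /var/lib/manticore/current
    source = src_current    # current year; rebuilt as often as necessary
}

table blog {
    type = distributed
    local = archive         # both parts are queried through the single table 'blog'
    local = current
}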

This approach works, but you need to maintain your distributed table manually: add new chunks, delete old ones, and keep the overall number of partial tables reasonably small (with too many tables, searching can become slower, and the OS usually limits the number of simultaneously opened files). To deal with that, you can manually merge several tables together by running indexer --merge, as shown below. However, that only solves the problem of having too many tables, at the cost of making maintenance harder. And even with per-hour reindexing, you will most probably have a noticeable time gap between new data arriving in the sources and the rebuilt table making that data searchable.
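
The merge itself is a single command. Assuming the hypothetical tables above, this merges the contents of current into archive; --rotate tells the running server to pick up the result:

indexer --merge archive current --rotate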

A real-time table is intended to solve this problem. It consists of two parts:

  1. A special RAM-based table (called the RAM chunk), which contains portions of the data arriving right now.
  2. A collection of plain tables called disk chunks, which were built in the past.

That is very similar to a usual distributed table made from several local tables.

You don't need to build such a table the traditional way - by running indexer, which reads a "recipe" from the config and indexes the data sources. Instead, a real-time table provides the ability to 'insert' new documents and to 'replace' existing ones. When executing the 'insert' command, you push new documents to the server. It then builds a small table from the added documents and immediately brings it online. So, right after the 'insert' command completes, you can perform searches over all the table parts, including the just-added documents.
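
A minimal SQL sketch (the table name and schema are made up for illustration):

CREATE TABLE rt (title TEXT, price FLOAT);
INSERT INTO rt (id, title, price) VALUES (1, 'first document', 9.99);
-- the freshly inserted document is searchable right away:
SELECT * FROM rt WHERE MATCH('first');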

Maintenance is performed automatically by the search server, so you don't have to take care of it. But you may be interested in a few details of how it is done.

First, since the indexed data is stored in RAM - what about an emergency power-off? Will I lose my table then? Well, before completing a transaction, the server saves the new data into a special 'binlog'. That is one or several files living on your persistent storage, which grow incrementally as you add more and more changes. You can tune how often new queries (or transactions) are stored there, and how often a 'sync' command is executed over the binlog file to force the OS to actually save the data to safe storage. The most paranoid approach is to flush and sync on every transaction. That is the slowest, but also the safest. The cheapest way is to switch the binlog off entirely. That is the fastest, but you can lose your indexed data. Intermediate variants, such as flushing/syncing every second, are also provided.
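
These trade-offs are controlled by the binlog settings in the searchd section of the config; a sketch (the path is a placeholder):

searchd {
    ...
    # where the binlog files live; an empty value switches binary logging off entirely
    binlog_path = /var/lib/manticore/binlog
    # 0 = flush and sync every second (cheap, may lose up to a second of changes)
    # 1 = flush and sync every transaction (safest, slowest)
    # 2 = flush every transaction, sync every second (intermediate)
    binlog_flush = 2
}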

The binlog is designed specifically for the sequential saving of newly arriving transactions; it is not a table and can't be searched over. It is just insurance that the server will not lose your data. If a sudden disruption happens and everything crashes because of a software or hardware problem, the server will load the freshest available dump of the RAM chunk and then replay the binlog, repeating the stored transactions. In the end, it reaches the same state as at the moment of the last change.

Second, what about limits? What if I want to process, say, 10TB of data - it just doesn't fit into RAM! The RAM for a real-time table is limited and can be configured. When some amount of data has been indexed, the server maintains the RAM part of the table by merging small transactions together, keeping their number and overall size low. That sometimes causes delays on insertion, however. When merging no longer helps and new insertions hit the RAM limit, the server converts the RAM-based table into a plain table stored on disk (called a disk chunk). That table is added to the collection of tables forming the second part of the RT table and comes online. The RAM is then flushed and the space is deallocated.
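
That limit is the per-table rt_mem_limit setting; for example (the table name and value are illustrative):

CREATE TABLE products (title TEXT, price FLOAT) rt_mem_limit = '256M';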

When the data from RAM has been safely saved to disk, which happens:

  • when the server saves the collected data as a disk chunk
  • or when it dumps the RAM part during a clean shutdown or on manual flushing

the binlog for that table is no longer necessary, so it gets discarded. Once all the tables are saved, the binlog file is deleted.
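
The manual flushing mentioned above is available via SQL, for example:

FLUSH TABLE rt;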

Third, what about the disk chunk collection? If having many disk parts makes searching slower, what's the difference between making them manually in the distributed-table manner and having them produced as disk parts (or 'chunks') by an RT table? Well, in both cases you can merge several tables into one. Say, you can merge the hourly tables from yesterday and keep a single 'daily' table for yesterday instead. With manual maintenance, you have to think about the schema and commands yourself. With an RT table, the server provides the OPTIMIZE command, which does the same but keeps the unnecessary internal details away from you.
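
For example (a sketch; the optional cutoff option caps the number of disk chunks left after the merge):

OPTIMIZE TABLE rt;
-- or keep at most 4 disk chunks after the optimization:
OPTIMIZE TABLE rt OPTION cutoff=4;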

Fourth, if my "document" constitutes a 'mini-table' and I don't need it anymore, I can just throw it away. But if it has been 'optimized', i.e., mixed together with tons of other documents, how can I undo or delete it? Indeed, indexed documents are 'mixed' together, and there is no easy way to delete one without rebuilding the whole table. And while for plain tables rebuilding or merging is just the normal way of maintenance, for a real-time table it would keep only the simplicity of manipulation, not the 'real-timeness'.

To address the problem, Manticore uses a trick: when you delete a document, identified by its document ID, the server just records that ID. Together with the IDs of other deleted documents, it is saved in a so-called kill-list. When you search over the table, the server first retrieves all matching documents and then throws out the documents found in the kill-list (that is the most basic description; internally it's more complex). The point is that, for the sake of 'immediate' deletion, documents are not actually deleted but just marked as 'deleted'. They still occupy space in various table structures, effectively being garbage. The word statistics that affect ranking are also left untouched, meaning the search works exactly as declared: we search among all documents and then just hide the ones marked as deleted from the final result. Replacing a document means that it is killed in the old parts of the table and inserted again into the freshest part. All the consequences of 'hiding by kill-list' also apply in this case.
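
In SQL terms, both operations look trivial from the outside (the table rt and the document id are made up):

-- the document disappears from results immediately, but internally it is only
-- marked as deleted until the affected chunks are rebuilt
DELETE FROM rt WHERE id = 123;
-- replace = kill the old version and insert the new one into the freshest part
REPLACE INTO rt (id, title, price) VALUES (123, 'updated title', 10.99);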

When a rebuild of some part of a table happens - e.g., when some transactions (segments) of a RAM chunk are merged, when a RAM chunk is converted into a disk chunk, or when two disk chunks are merged together - the server performs a comprehensive iteration over the affected parts and physically excludes the deleted documents from all of them. I.e., if they were present in the document lists of some words, they are wiped away; and if a word occurred only in deleted documents, it gets removed completely.

As a summary: the deletion works in two phases:

  1. First, we mark documents as 'deleted' in real time and suppress them in search results.
  2. Then, during some operation on an RT table chunk, we physically wipe the deleted documents for good.

Fifth, if an RT table contains plain disk tables in its collection, can I just add my ready-made old disk table to it? No. Allowing that would add unneeded complexity and invite accidental corruption. But if your RT table has just been created and contains no data, then you can ATTACH TABLE your disk table to it. Your old table will be moved inside the RT table and will become a part of it.
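
For example (a sketch; plain is a pre-built disk table and rt a freshly created, empty RT table):

ATTACH TABLE plain TO TABLE rt;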

To summarize the RT table structure: it is a cleverly organized collection of plain disk tables combined with a fast in-memory table, intended for real-time insertions and semi-real-time deletions of documents. It has a common schema and common settings, and it can be easily maintained without digging deep into the details.

Flushing RAM chunk to a new disk chunk


FLUSH RAMCHUNK rtindex

FLUSH RAMCHUNK forcibly creates a new disk chunk in an RT table.

Normally, an RT table flushes and converts the contents of the RAM chunk into a new disk chunk automatically once the RAM chunk reaches the maximum allowed rt_mem_limit size. However, for debugging and testing it might be useful to forcibly create a new disk chunk, and the FLUSH RAMCHUNK statement does exactly that.

Note that using FLUSH RAMCHUNK increases RT table fragmentation. Most likely, you want to use FLUSH TABLE instead. We suggest that you abstain from using this statement on its own unless you're absolutely sure of what you're doing, as the right way is to issue FLUSH RAMCHUNK followed by an OPTIMIZE command. That combination keeps RT table fragmentation to a minimum.

SQL
FLUSH RAMCHUNK rt;
Response
Query OK, 0 rows affected (0.05 sec)