Killlist in plain tables

When using plain tables, there's a challenge arising from the need to have the data in the table as fresh as possible.

In this case, one or more secondary (also known as delta) tables are used to capture the modified data between the time the main table was created and the current time. The modified data can include new, updated, or deleted documents. The search becomes a search over the main table and the delta table. This works seamlessly when you just add new documents to the delta table, but when it comes to updated or deleted documents, there remains the following issue.

If a document is present in both the main and delta tables, it can cause issues during searching, as the engine will see two versions of a document and won't know how to pick the right one. So, the delta needs to somehow inform the search that there are deleted documents in the main table that should be disregarded. This is where kill lists come in.

Table kill-list

A table can maintain a list of document IDs that can be used to suppress records in other tables. This feature is available for plain tables using database sources or plain tables using XML sources. In the case of database sources, the source needs to provide an additional query defined by sql_query_killlist. It will store in the table a list of documents that can be used by the server to remove documents from other plain tables.

This query is expected to return a number of 1-column rows, each containing just the document ID.

In many cases, the query is a union between a query that retrieves a list of updated documents and a list of deleted documents, e.g.:

sql_query_killlist = \
    SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
    SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex

Removing documents in a plain table

A plain table can contain a directive called killlist_target that will tell the server it can provide a list of document IDs that should be removed from certain existing tables. The table can use either its document IDs as the source for this list or provide a separate list.

killlist_target

Sets the table(s) that the kill-list will be applied to. Optional, default value is empty.

When you use plain tables you often need to maintain not just a single table, but a set of them to be able to add/update/delete new documents sooner (read about delta table updates). n order to suppress matches in the previous (main) table that were updated or deleted in the next (delta) table, you need to:

  1. Create a kill-list in the delta table using sql_query_killlist
  2. Specify main table as killlist_target in delta table settings:
‹›
  • CONFIG
CONFIG
📋
table products {
  killlist_target = main:kl

  path = products
  source = src_base
}

When killlist_target is specified, the kill-list is applied to all the tables listed in it on searchd startup. If any of the tables from killlist_target are rotated, the kill-list is reapplied to these tables. When the kill-list is applied, tables that were affected save these changes to disk.

killlist_target has 3 modes of operation:

  1. killlist_target = main:kl. Document IDs from the kill-list of the delta table are suppressed in the main table (see sql_query_killlist).
  2. killlist_target = main:id. All document IDs from the delta table are suppressed in the main table. The kill-list is ignored.
  3. killlist_target = main. Both document IDs from the delta table and its kill-list are suppressed in the main table.

Multiple targets can be specified, separated by commas like:

killlist_target = table_one:kl,table_two:kl

You can change the killlist_target settings for a table without rebuilding it by using ALTER.

However, since the 'old' main table has already written the changes to disk, the documents that were deleted in it will remain deleted even if it is no longer in the killlist_target of the delta table.

‹›
  • SQL
  • HTTP
📋
ALTER TABLE delta KILLLIST_TARGET='new_main_table:kl'

Attaching a plain table to a real-time table

A plain table can be converted into a real-time table or added to an existing real-time table.

The first case is useful when you need to regenerate a real-time table completely, which may be needed, for example, if tokenization settings need an update. In this situation, preparing a plain table and converting it into a real-time table may be easier than preparing a batch job to perform INSERTs for adding all the data into a real-time table.

In the second case, you normally want to add a large bulk of new data to a real-time table, and again, creating a plain table with that data is easier than populating the existing real-time table.

Attaching table - general syntax

The ATTACH statement allows you to convert a plain table to be attached to an existing real-time table.

ATTACH TABLE plain_table TO TABLE rt_table [WITH TRUNCATE]

ATTACH TABLE statement lets you move data from a plain table to an RT table.

After a successful ATTACH the data originally stored in the source plain table becomes a part of the target RT table, and the source plain table becomes unavailable (until the next rebuild). ATTACH does not result in any table data changes. Essentially, it just renames the files (making the source table a new disk chunk of the target RT table) and updates the metadata. So it is generally a quick operation that might (frequently) complete as fast as under a second.

Note that when a table is attached to an empty RT table, the fields, attributes, and text processing settings (tokenizer, wordforms, etc.) from the source table are copied over and take effect. The respective parts of the RT table definition from the configuration file will be ignored.

When the TRUNCATE option is used, the RT table gets truncated prior to attaching the source plain table. This allows the operation to be atomic or ensures that the attached source plain table will be the only data in the target RT table.

ATTACH TABLE comes with a number of restrictions. Most notably, the target RT table is currently required to be either empty or have the same settings as the source plain table. In case the source plain table gets attached to a non-empty RT table, the RT table data collected so far gets stored as a regular disk chunk, and the table being attached becomes the newest disk chunk, with documents having the same IDs getting killed. The complete list of restrictions is as follows:

  • The target RT table needs to be either empty or have the same settings as the source plain table.
  • The source plain table needs to have phrase_boundary_stepset to 0 and stopword_stepset to 1.
‹›
  • Example
Example
📋

Before the ATTACH, the RT table is empty and has 3 fields:

mysql> DESC rt;
Empty set (0.00 sec)

mysql> SELECT * FROM rt;
+-----------+---------+
| Field     | Type    |
+-----------+---------+
| id        | integer |
| testfield | field   |
| testattr  | uint    |
+-----------+---------+
3 rows in set (0.00 sec)

The plain table is not empty:

mysql> SELECT * FROM plain WHERE MATCH('test');
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |   1304 |        1 | 1313643256 |
|    2 |   1304 |        1 | 1313643256 |
|    3 |   1304 |        1 | 1313643256 |
|    4 |   1304 |        1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)

Attaching the plain table to the RT table:

mysql> ATTACH TABLE plain TO TABLE rt;
Query OK, 0 rows affected (0.00 sec)

The RT table now has 5 fields:

mysql> DESC rt;
+------------+-----------+
| Field      | Type      |
+------------+-----------+
| id         | integer   |
| title      | field     |
| content    | field     |
| group_id   | uint      |
| date_added | timestamp |
+------------+-----------+
5 rows in set (0.00 sec)

And it's not empty:

mysql> SELECT * FROM rt WHERE MATCH('test');
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |   1304 |        1 | 1313643256 |
|    2 |   1304 |        1 | 1313643256 |
|    3 |   1304 |        1 | 1313643256 |
|    4 |   1304 |        1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)

After the ATTACH, the plain table is removed and no longer available for searching:

mysql> SELECT * FROM plain WHERE MATCH('test');
ERROR 1064 (42000): no enabled local indexes to search

Importing table

If you decide to migrate from Plain mode to RT mode or in some other cases, real-time and percolate tables built in Plain mode can be imported to Manticore running in RT mode using the IMPORT TABLE statement. The general syntax is as follows:

IMPORT TABLE table_name FROM 'path'

where the 'path' parameter must be set as: /your_backup_folder/your_backup_name/data/your_table_name/your_table_name

‹›
  • bash
bash
📋
mysql -P9306 -h0 -e 'create table t(f text)'

mysql -P9306 -h0 -e "backup table t to /tmp/"

mysql -P9306 -h0 -e "drop table t"

BACKUP_NAME=$(ls /tmp | grep 'backup-' | tail -n 1)

mysql -P9306 -h0 -e "import table t from '/tmp/$BACKUP_NAME/data/t/t'

mysql -P9306 -h0 -e "show tables"

Executing this command makes all the table files of the specified table copied to data_dir. All the external table files such as wordforms, exceptions and stopwords are also copied to the same data_dir. IMPORT TABLE has the following limitations:

  • paths to the external files that were originally specified in the config file must be absolute
  • only real-time and percolate tables are supported
  • plain tables need to be preliminarily (in the plain mode) converted to real-time tables via ATTACH TABLE

indexer --print-rt

If the above method for migrating a plain table to an RT table is not possible, you may use indexer --print-rt to dump data from a plain table directly without the need to convert it to an RT type table and then import the dump into an RT table right from the command line.

This method has a few limitations though:

  • Only SQL-based sources are supported
  • MVAs are not supported
‹›
  • bash
bash
📋
/usr/bin/indexer --rotate --config /etc/manticoresearch/manticore.conf --print-rt my_rt_index my_plain_index > /tmp/dump_regular.sql

mysql -P $9306 -h0 -e "truncate table my_rt_index"

mysql -P 9306 -h0 < /tmp/dump_regular.sql

rm /tmp/dump_regular.sql