Killlist in plain indexes

When using plain indexes there is a problem generated by the need of having the data in the index as fresh as possible.

In this case one or more secondary (also know as delta) indexes are used to capture the modified data between the time the main index was created and and current time. The modified data can mean new, updated or deleted documents. The search becomes a search over the main index and the delta index. This works with no obstacle when you just add new documents to the delta index, but when it comes to updated or deleted documents there remains the following issue.

If a document is present in both main and delta indexes it can cause issues at searching as the engine will see two versions of a document and won't know how to pick the right one. So the delta needs to tell somehow to the search that there are deleted documents in the main index that should be forgotten. Here comes kill lists.

Index kill-list

Index can maintain a list of document ids that can be used to suppress records in other indexes. This feature is available for plain indexes using database sources or plain indexes using XML sources. In case of database sources, the source needs to provide an additional query defined by sql_query_killlist. It will store in the index a list of documents that can be used by the server to remove documents from other plain indexes.

This query is expected to return a number of 1-column rows, each containing just the document ID.

In many cases the query is a union between a query that gets a list of updated documents and a list of deleted documents, e.g.:

    sql_query_killlist = \
        SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
        SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex

Removing documents in a plain index

A plain index can contain a directive called killlist_target that will tell the server it can provide a list of document ids that should be removed from certain existing indexes. The index can use either it's document ids as the source for this list or provide a separate list.

killlist_target

Sets the index(es) that the kill-list will be applied to. Optional, default value is empty.

When you use plain_indexes you often need to maintain not a single index, but a set of them to be able to add/update/delete new documents sooner (read delta_index_updates). In order to suppress matches in the previous (main) index that were updated or deleted in the next (delta) index you need to:

  1. Create a kill-list in the delta index using sql_query_killlist
  2. Specify main index as killlist_target in delta index settings:
CONFIG
📋
index products {
  killlist_target = main:kl

  path = products
  source = src_base
}

When killlist_target is specified, kill-list is applied to all the indexes listed in it on searchd startup. If any of the indexes from killlist_target are rotated, kill-list is reapplied to these indexes. When kill-list is applied, indexes that were affected save these changes to disk.

killlist_target has 3 modes of operation:

  1. killlist_target = main:kl. Document ids from the kill-list of the delta index are suppressed in the main index (see sql_query_killlist).
  2. killlist_target = main:id. All document ids from delta index are suppressed in the main index. Kill-list is ignored.
  3. killlist_target = main. Both document ids from delta index and its kill-list are suppressed in the main index.

Multiple targets can be specified separated by comma like

killlist_target = index_one:kl,index_two:kl

You can change killlist_target settings for an index without reindexing it by using ALTER.

But since the 'old' main index has already written the changes to disk, the documents that were deleted in it will remain deleted even if it is no longer in the killlist_target of the delta index.

📋
ALTER TABLE delta KILLLIST_TARGET='new_main_index:kl'

Attaching a plain index to a real-time index

A plain index can be converted into a real-time index or added to an existing real-time index.

The first case is useful when you need to regenerated a real-time index completely which may be needed for example if tokenization settings need an update. Then preparing a plain index and converting it into a real-time index may be easier than preparing a batch job to perform INSERTs for adding all the data into a real-time index.

In the second you normally want to add a large bulk of new data to a real-time index and again creating a plain index with that data is easier than populating the existing real-time index.

General syntax

The ATTACH statement allows to convert a plain index to be attached to an existing real-time index.

ATTACH INDEX diskindex TO RTINDEX rtindex [WITH TRUNCATE]

ATTACH INDEX statement lets you move data from a plain index to an RT index.

After a successful ATTACH the data originally stored in the source plain index becomes a part of the target RT index, and the source plain index becomes unavailable (until the next rebuild). ATTACH does not result in any index data changes. Basically, it just renames the files (making the source index a new disk chunk of the target RT index) and updates the metadata. So it is a generally quick operation which might (frequently) complete as fast as under a second.

Note that when an index is attached to an empty RT index the fields, attributes and text processing settings (tokenizer, wordforms, etc) from the source index are copied over and take effect. The respective parts of the RT index definition from the configuration file will be ignored.

When TRUNCATE option is used RT index gets truncated prior to attaching source plain index. This allows to make operation atomic or make sure that the attached source plain index will be the only data in the target RT index.

ATTACH INDEX comes with a number of restrictions. Most notably, the target RT index is currently required to be either empty or have the same settings as the source plain index. In case the source plain index gets attached to a non-empty RT index the RT index data collected so far gets stored as a regular disk chunk and index being attached becomes the newest disk chunk and documents with same IDs get killed. The complete list is as follows:

Example
📋

Before ATTACH the RT index is empty and has 3 fields:

mysql> DESC rt;
Empty set (0.00 sec)

mysql> SELECT * FROM rt;
+-----------+---------+
| Field     | Type    |
+-----------+---------+
| id        | integer |
| testfield | field   |
| testattr  | uint    |
+-----------+---------+
3 rows in set (0.00 sec)

The plain index is not empty:

mysql> SELECT * FROM disk WHERE MATCH('test');
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |   1304 |        1 | 1313643256 |
|    2 |   1304 |        1 | 1313643256 |
|    3 |   1304 |        1 | 1313643256 |
|    4 |   1304 |        1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)

Attaching:

mysql> ATTACH INDEX disk TO RTINDEX rt;
Query OK, 0 rows affected (0.00 sec)

The RT index now has 5 fields:

mysql> DESC rt;
+------------+-----------+
| Field      | Type      |
+------------+-----------+
| id         | integer   |
| title      | field     |
| content    | field     |
| group_id   | uint      |
| date_added | timestamp |
+------------+-----------+
5 rows in set (0.00 sec)

And it's not empty:

mysql> SELECT * FROM rt WHERE MATCH('test');
+------+--------+----------+------------+
| id   | weight | group_id | date_added |
+------+--------+----------+------------+
|    1 |   1304 |        1 | 1313643256 |
|    2 |   1304 |        1 | 1313643256 |
|    3 |   1304 |        1 | 1313643256 |
|    4 |   1304 |        1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)

The plain index was removed:

mysql> SELECT * FROM disk WHERE MATCH('test');
ERROR 1064 (42000): no enabled local indexes to search

Importing real-time index

When searchd is switched from Plain mode to RT mode, real-time indexes can be imported using the IMPORT TABLE statement. The general syntax is as follows:

IMPORT TABLE table_name FROM 'path'

All index files are copied to data_dir. All external index files such as wordforms, exceptions and stopwords are also copied to data_dir. IMPORT TABLE has the following limitations:

  • paths to external files that were originally specified in the config file must be absolute
  • only real-time indexes are supported
  • plain indexes need to be converted to real-time indexes via ATTACH INDEX