Mirroring

Agent mirrors can be used interchangeably when processing a search query. The Manticore instance(s) hosting the distributed table where the mirrored agents are defined keeps track of mirror status (alive or dead) and response times, and performs automatic failover and load balancing based on this information.

Agent mirrors

agent = node1|node2|node3:9312:shard2

The above example declares that node1:9312, node2:9312, and node3:9312 all have a table called shard2 and can be used as interchangeable mirrors. If any one of these servers goes down, queries will be distributed between the remaining two. When the server comes back online, the master will detect it and begin routing queries to all three nodes again.

A mirror may also include an individual table list, as follows:

agent = node1:9312:node1shard2|node2:9312:node2shard2

This works similarly to the previous example, but different table names will be used when querying different servers. For example, node1shard2 will be used when querying node1:9312, and node2shard2 will be used when querying node2:9312.
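
Both declaration forms above can be expanded into explicit host, port, and table triples. The following Python sketch is for illustration only and is not Manticore's actual parser. The helper name expand_mirrors is hypothetical, and it assumes that a mirror which omits its port or table inherits them from the last mirror in the list, which is consistent with the examples above.

```python
def expand_mirrors(agent_value):
    """Expand 'a|b|c:9312:tbl' into a list of (host, port, table) tuples."""
    parts = [mirror.strip().split(":") for mirror in agent_value.split("|")]
    last = parts[-1]
    if len(last) != 3:
        raise ValueError("the last mirror must be given as host:port:table")
    # Assumption: missing fields are inherited from the last mirror.
    default_port, default_table = int(last[1]), last[2]
    mirrors = []
    for fields in parts:
        host = fields[0]
        port = int(fields[1]) if len(fields) > 1 else default_port
        table = fields[2] if len(fields) > 2 else default_table
        mirrors.append((host, port, table))
    return mirrors

print(expand_mirrors("node1|node2|node3:9312:shard2"))
# [('node1', 9312, 'shard2'), ('node2', 9312, 'shard2'), ('node3', 9312, 'shard2')]
print(expand_mirrors("node1:9312:node1shard2|node2:9312:node2shard2"))
# [('node1', 9312, 'node1shard2'), ('node2', 9312, 'node2shard2')]
```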

By default, all queries are routed to the best of the mirrors. The best mirror is selected based on recent statistics, as controlled by the ha_period_karma config directive. The master stores metrics (total query count, error count, response time, etc.) for each agent and groups these by time spans. The karma is the length of the time span. The best agent mirror is then determined dynamically based on the last two such time spans. The specific algorithm used to pick a mirror can be configured with the ha_strategy directive.

The karma period is in seconds and defaults to 60 seconds. The master stores up to 15 karma spans with per-agent statistics for instrumentation purposes (see SHOW AGENT STATUS statement). However, only the last two spans out of these are used for HA/LB logic.

When there are no queries, the master sends a regular ping command every ha_ping_interval milliseconds in order to collect statistics and check if the remote host is still alive. The ha_ping_interval defaults to 1000 msec. Setting it to 0 disables pings, and statistics will only be accumulated based on actual queries.

Example:

# sharding table over 4 servers total
# in just 2 shards but with 2 failover mirrors for each shard
# node1, node2 carry shard1 as local
# node3, node4 carry shard2 as local

# config on node1, node2
agent = node3:9312|node4:9312:shard2

# config on node3, node4
agent = node1:9312|node2:9312:shard1

Load balancing

Load balancing is turned on by default for any distributed table that uses mirroring. By default, queries are distributed randomly among the mirrors. You can change this behavior with the ha_strategy directive.

ha_strategy

ha_strategy = {random|nodeads|noerrors|roundrobin}

The mirror selection strategy for load balancing is optional and is set to random by default.

This directive controls which agent mirror the master chooses for each query to a distributed table, that is, how the master balances load across the configured mirror agent nodes. The following strategies are implemented:

Simple random balancing

The default balancing mode is simple linear random distribution among the mirrors. This means that equal selection probabilities are assigned to each mirror. This is similar to round-robin (RR), but does not impose a strict selection order.

ha_strategy = random

Adaptive randomized balancing

The default simple random strategy does not take into account the status of mirrors, error rates, or, most importantly, actual response latencies. To address heterogeneous clusters and temporary spikes in agent node load, there is a group of balancing strategies that dynamically adjust the probabilities based on the actual query latencies observed by the master.

The adaptive strategies based on latency-weighted probabilities work as follows:

  1. Latency stats are accumulated in blocks of ha_period_karma seconds.
  2. Latency-weighted probabilities are recomputed once per karma period.
  3. The "dead or alive" flag is adjusted once per request, including ping requests.

Initially, the probabilities are equal. On every step, they are scaled by the inverse of the latencies observed during the last karma period, and then renormalized. For example, if during the first 60 seconds after the master startup, 4 mirrors had latencies of 10 ms, 5 ms, 30 ms, and 3 ms respectively, the first adjustment step would go as follows:

  1. Initial probabilities: 0.25, 0.25, 0.25, 0.25.
  2. Observed latencies: 10 ms, 5 ms, 30 ms, 3 ms.
  3. Inverse latencies: 0.1, 0.2, 0.0333, 0.333.
  4. Scaled probabilities: 0.025, 0.05, 0.008333, 0.0833.
  5. Renormalized probabilities: 0.15, 0.30, 0.05, 0.50.

This means that the first mirror would have a 15% chance of being chosen during the next karma period, the second one a 30% chance, the third one (slowest at 30 ms) only a 5% chance, and the fourth and fastest one (at 3 ms) a 50% chance. After that period, the second adjustment step would update those chances again, and so on.
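
The adjustment step above can be reproduced in a few lines. This is a minimal Python sketch of the arithmetic only, not the server's implementation. The function name adjust is hypothetical.

```python
def adjust(probabilities, latencies_ms):
    """One adjustment step: scale by inverse latency, then renormalize."""
    scaled = [p / latency for p, latency in zip(probabilities, latencies_ms)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Four mirrors with equal starting probabilities and the latencies from the text.
probabilities = adjust([0.25, 0.25, 0.25, 0.25], [10, 5, 30, 3])
print([round(p, 2) for p in probabilities])  # [0.15, 0.3, 0.05, 0.5]
```

Feeding the result back into adjust with the latencies observed during the next karma period gives the second step, and so on.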

The idea is that once the observed latencies stabilize, the latency weighted probabilities will stabilize as well. All these adjustment iterations are meant to converge at a point where the average latencies are roughly equal across all mirrors.

nodeads

Latency-weighted probabilities, but dead mirrors are excluded from the selection. A "dead" mirror is one that has produced multiple hard errors in a row (e.g., network failure or no answer).

ha_strategy = nodeads
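
As a rough illustration of excluding dead mirrors, the Python sketch below zeroes out the weight of any mirror with too many consecutive hard errors and renormalizes the rest. The threshold DEAD_AFTER, the function name exclude_dead, and the fallback when every mirror looks dead are all assumptions. The text only says "multiple hard errors in a row".

```python
DEAD_AFTER = 3  # hypothetical threshold; the text only says "multiple"

def exclude_dead(probabilities, hard_errors_in_a_row):
    """Drop dead mirrors from the selection and renormalize the remainder."""
    kept = [0.0 if errors >= DEAD_AFTER else p
            for p, errors in zip(probabilities, hard_errors_in_a_row)]
    total = sum(kept)
    if total == 0:
        # Assumption: if every mirror looks dead, keep the original weights.
        return list(probabilities)
    return [k / total for k in kept]

# The third mirror has failed five times in a row and is excluded.
weights = exclude_dead([0.15, 0.30, 0.05, 0.50], [0, 0, 5, 0])
print(weights[2])  # 0.0
```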

noerrors

Latency-weighted probabilities, but mirrors with a worse error/success ratio are excluded from selection.

ha_strategy = noerrors

Round-robin balancing

Simple round-robin selection, that is, selecting the first mirror in the list, then the second one, then the third one, etc, and then repeating the process once the last mirror in the list is reached. Unlike with the randomized strategies, RR imposes a strict querying order (1, 2, 3, ..., N-1, N, 1, 2, 3, ..., and so on) and guarantees that no two consecutive queries will be sent to the same mirror.

ha_strategy = roundrobin
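
The strict ordering of round-robin selection can be sketched in Python as follows. The mirror names are placeholders, and the sketch ignores mirror health entirely.

```python
from itertools import cycle

mirrors = ["node1", "node2", "node3"]
next_mirror = cycle(mirrors)  # repeats the list in order, forever

picks = [next(next_mirror) for _ in range(7)]
print(picks)
# ['node1', 'node2', 'node3', 'node1', 'node2', 'node3', 'node1']
```

With at least two mirrors in the list, no two consecutive picks are the same.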

Instance-wide options

ha_period_karma

ha_period_karma = 2m

ha_period_karma defines the size of the agent mirror statistics window, in seconds (or with a time suffix). This directive is optional; the default is 60.

For a distributed table with agent mirrors, the server tracks several different per-mirror counters. These counters are then used for failover and balancing. (The server picks the best mirror to use based on the counters.) Counters are accumulated in blocks of ha_period_karma seconds.

After beginning a new block, the master may still use the accumulated values from the previous one until the new one is half full. Thus, any previous history stops affecting the mirror choice after at most 1.5 times ha_period_karma seconds.

Although at most 2 blocks are used for mirror selection, up to 15 last blocks are actually stored for instrumentation purposes. They can be inspected using the SHOW AGENT STATUS statement.
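
The block bookkeeping described above can be modeled with a bounded queue. This Python sketch is illustrative only. The function name blocks_in_use is hypothetical, and it assumes that "half full" is measured by the time elapsed within the current block.

```python
from collections import deque

MAX_BLOCKS = 15  # blocks kept for instrumentation, per the text

def blocks_in_use(block_starts, now, karma):
    """Return start times of the blocks that affect mirror choice at `now`."""
    current = block_starts[-1]
    # Assumption: "half full" means less than karma / 2 seconds have elapsed.
    if len(block_starts) > 1 and now - current < karma / 2:
        return [block_starts[-2], current]
    return [current]

karma = 60
block_starts = deque(maxlen=MAX_BLOCKS)
for i in range(20):               # 20 blocks started, only the last 15 are kept
    block_starts.append(i * karma)

print(len(block_starts))                          # 15
print(blocks_in_use(block_starts, 1150, karma))   # [1080, 1140]
print(blocks_in_use(block_starts, 1170, karma))   # [1140]
```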

ha_ping_interval

ha_ping_interval = 3s

The ha_ping_interval directive defines the interval between pings sent to the agent mirrors, in milliseconds (or with a time suffix). This directive is optional; the default is 1000.

For a distributed table with agent mirrors, the server sends all mirrors a ping command during idle periods to track their current status (whether they are alive or dead, network roundtrip time, etc.). The interval between pings is determined by the ha_ping_interval setting.

If you want to disable pings, set ha_ping_interval to 0.

Setting up replication

With Manticore, write transactions (such as INSERT, REPLACE, DELETE, TRUNCATE, UPDATE, COMMIT) can be replicated to other cluster nodes before the transaction is fully applied on the current node. Currently, replication is supported for percolate and rt tables on Linux and macOS. Manticore Search packages for Windows do not provide replication support.

Manticore's replication is powered by the Galera library and boasts several impressive features:

  • True Multi-Master: Read and write to any node at any time.
  • Virtually Synchronous Replication: No slave lag and no data loss after a node crash.
  • Hot Standby: No downtime during failover (since there is no failover).
  • Tightly Coupled: All nodes hold the same state and no diverged data between nodes is allowed.
  • Automatic Node Provisioning: No need to manually backup the database and restore it on a new node.
  • Easy to Use and Deploy.
  • Detection and Automatic Eviction of Unreliable Nodes.
  • Certification-based Replication.

To set up replication in Manticore Search:

  • The data_dir option must be set in the "searchd" section of the configuration file. Replication is not supported in plain mode.
  • A listen directive must be specified, containing an IP address accessible by other nodes, or a node_address with an accessible IP address.
  • Optionally, you can set unique values for server_id on each cluster node. If no value is set, the node will attempt to use the MAC address or a random number to generate the server_id.

If there is no replication listen directive set, Manticore will use the first two free ports in the range of 200 ports after the default protocol listening port for each created cluster. To set replication ports manually, the listen directive (of replication type) port range must be defined and the address/port range pairs must not intersect between different nodes on the same server. As a rule of thumb, the port range should specify at least two ports per cluster.
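
The two constraints on manually configured replication port ranges can be checked mechanically. The Python sketch below uses hypothetical helper names and inclusive (low, high) port pairs. Comparing ranges only matters for listen directives that share the same address on the same server.

```python
def ranges_intersect(a, b):
    """True if inclusive port ranges a=(low, high) and b=(low, high) share a port."""
    return a[0] <= b[1] and b[0] <= a[1]

def enough_ports(port_range, clusters):
    """Rule of thumb from the text: at least two ports per cluster."""
    low, high = port_range
    return high - low + 1 >= 2 * clusters

print(ranges_intersect((9360, 9370), (9370, 9380)))  # True
print(ranges_intersect((9360, 9370), (9371, 9380)))  # False
print(enough_ports((9360, 9370), 5))                 # True
print(enough_ports((9360, 9370), 6))                 # False
```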

Replication cluster

A replication cluster is a group of nodes in which a write transaction is replicated. Replication is set up on a per-table basis, meaning that one table can only belong to one cluster. There is no limit on the number of tables that a cluster can have. All transactions such as INSERT, REPLACE, DELETE, TRUNCATE on any percolate or real-time table that belongs to a cluster are replicated to all the other nodes in that cluster. Replication is multi-master, so writes to any node or multiple nodes simultaneously will work just as well.

To create a cluster, you can typically use the statement CREATE CLUSTER <cluster name>, and to join a cluster, the statement JOIN CLUSTER <cluster name> AT 'host:port'. However, in some rare cases, you may want to fine-tune the behavior of CREATE/JOIN CLUSTER. The available options are:

name

This option specifies the name of the cluster. It should be unique among all the clusters in the system.

Note: The maximum allowable hostname length for the JOIN command is 253 characters. If you exceed this limit, searchd will generate an error.

path

The path option specifies the data directory for write-set cache replication and incoming tables from other nodes. This value should be unique among all the clusters in the system and should be specified as a path relative to the data_dir directory. By default, it is set to the value of data_dir.

nodes

The nodes option is a list of address:port pairs for all the nodes in the cluster, separated by commas. This list should be obtained using the node's API interface and can include the address of the current node as well. It is used to join the node to the cluster and to rejoin it after a restart.

options

The options option allows you to pass additional options directly to the Galera replication plugin, as described in the Galera Documentation Parameters.

Write statements

When working with a replication cluster, all write statements such as INSERT, REPLACE, DELETE, TRUNCATE, UPDATE that modify the content of a cluster's table must use the cluster_name:index_name expression instead of the table name. This ensures that the changes are propagated to all replicas in the cluster. If the correct expression is not used, an error will be triggered.

In the HTTP interface, the cluster property must be set along with the table name for all write statements to a cluster's table. Failure to set the cluster property will result in an error.
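
For the HTTP interface, a write request body might be assembled as in the Python sketch below. Only the cluster property is taken from the text above. The table and doc keys, the title field, and the helper name insert_request are assumptions and should be checked against the HTTP API reference for your version.

```python
import json

def insert_request(cluster, table, doc):
    """Build a JSON body that names both the cluster and the table."""
    # Assumption: "table" and "doc" are the key names expected by the server.
    return json.dumps({"cluster": cluster, "table": table, "doc": doc})

body = insert_request("posts", "weekly_index", {"title": "iphone case"})
print(body)
# {"cluster": "posts", "table": "weekly_index", "doc": {"title": "iphone case"}}
```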

The Auto ID for a table in a cluster should be valid as long as the server_id is correctly configured.

INSERT INTO posts:weekly_index VALUES ( 'iphone case' )
TRUNCATE RTINDEX click_query:weekly_index
UPDATE posts:rt_tags SET tags=(101, 302, 304) WHERE MATCH ('use') AND id IN (1,101,201)
DELETE FROM clicks:rt WHERE MATCH ('dumy') AND gid>206

Read statements

Read statements such as SELECT, CALL PQ, DESCRIBE can either use regular table names that are not prepended with a cluster name, or they can use the cluster_name:index_name format. If the latter is used, the cluster_name component is ignored.

When using the HTTP endpoint json/search, the cluster property can be specified if desired, but it can also be omitted.

SELECT * FROM weekly_index
CALL PQ('posts:weekly_index', 'document is here')

Cluster parameters

Replication plugin options can be adjusted using the SET statement.

A list of available options can be found in the Galera Documentation Parameters.

SET CLUSTER click_query GLOBAL 'pc.bootstrap' = 1

Cluster with diverged nodes

It's possible for replicated nodes to diverge from one another, leading to a state where all nodes are labeled as non-primary. This can occur as a result of a network split between nodes, a cluster crash, or if the replication plugin experiences an exception when determining the primary component. In such a scenario, it's necessary to select a node and promote it to the role of primary component.

To identify the node that needs to be promoted, you should compare the last_committed cluster status variable value on all nodes. If all the servers are currently running, there's no need to restart the cluster. Instead, you can simply promote the node with the highest last_committed value to the primary component using the SET statement (as demonstrated in the example).

The other nodes will then reconnect to the primary component and resynchronize their data based on this node.

SET CLUSTER posts GLOBAL 'pc.bootstrap' = 1
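
Choosing which node to promote is a comparison of last_committed values. In the Python sketch below, the node names and values are made up, and the helper name node_to_promote is hypothetical. The values would be collected from the cluster status of each node.

```python
def node_to_promote(last_committed):
    """Return the node whose last_committed value is highest."""
    return max(last_committed, key=last_committed.get)

# Hypothetical values read from each node's cluster status.
status = {"node1": 1042, "node2": 1057, "node3": 1049}
print(node_to_promote(status))  # node2
```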

Replication and cluster

To use replication, define in the configuration file one listen directive for the SphinxAPI protocol and one listen directive with the replication address and port range. Also set the data_dir folder, which receives incoming tables.

searchd {
  listen   = 9312
  listen   = 192.168.1.101:9360-9370:replication
  data_dir = /var/lib/manticore/
  ...
}

To replicate tables, you must create a cluster on the server that has the local tables to be replicated.

CREATE CLUSTER posts

Then add these local tables to the cluster:

ALTER CLUSTER posts ADD pq_title
ALTER CLUSTER posts ADD pq_clicks

All other nodes that wish to receive a replica of the cluster's tables should join the cluster as follows:

JOIN CLUSTER posts AT '192.168.1.101:9312'

When running queries, prepend the table name with the cluster name posts: or set the cluster property in the HTTP request object.

INSERT INTO posts:pq_title VALUES ( 3, 'test me' )

All queries that modify tables in the cluster are now replicated to all nodes in the cluster.