Adding a distributed table with remote agents

To understand how to add a distributed table with remote agents, it is important to first have a basic understanding of distributed tables. In this article, we will focus on how to use a distributed table as the basis for creating a cluster of Manticore instances.

Here is an example of how to split data over 4 servers, each serving one of the shards:

table mydist {
          type  = distributed
          agent = box1:9312:shard1
          agent = box2:9312:shard2
          agent = box3:9312:shard3
          agent = box4:9312:shard4
}

In the event of a server failure, the distributed table will still work, but the results from the failed shard will be missing.

Now let's add mirrors, so that each shard is found on 2 servers, as in the configuration below. By default, the master (the searchd instance with the distributed table) will randomly pick one of the mirrors.

The mode used for picking mirrors can be set using the ha_strategy setting. In addition to the default random mode, there's also ha_strategy = roundrobin.

More advanced strategies based on latency-weighted probabilities include noerrors and nodeads. These not only take out mirrors with issues but also monitor response times and do balancing. If a mirror responds slower (for example, due to some operations running on it), it will receive fewer requests. When the mirror recovers and provides better times, it will receive more requests.

table mydist {
          type  = distributed
          agent = box1:9312|box5:9312:shard1
          agent = box2:9312|box6:9312:shard2
          agent = box3:9312|box7:9312:shard3
          agent = box4:9312|box8:9312:shard4
}

Mirroring

Agent mirrors can be used interchangeably when processing a search query. The Manticore instance(s) hosting the distributed table where the mirrored agents are defined keeps track of mirror status (alive or dead) and response times, and performs automatic failover and load balancing based on this information.

Agent mirrors

agent = node1|node2|node3:9312:shard2

The above example declares that node1:9312, node2:9312, and node3:9312 all have a table called shard2 and can be used as interchangeable mirrors. If one of these servers goes down, queries will be distributed between the remaining two. When the server comes back online, the master will detect it and begin routing queries to all three nodes again.

A mirror may also include an individual table list, as follows:

agent = node1:9312:node1shard2|node2:9312:node2shard2

This works similarly to the previous example, but different table names will be used when querying different servers. For example, node1shard2 will be used when querying node1:9312, and node2shard2 will be used when querying node2:9312.

By default, all queries are routed to the best of the mirrors. The best mirror is selected based on recent statistics, as controlled by the ha_period_karma config directive. The master stores metrics (total query count, error count, response time, etc.) for each agent and groups these by time spans. The karma is the length of the time span. The best agent mirror is then determined dynamically based on the last two such time spans. The specific algorithm used to pick a mirror can be configured with the ha_strategy directive.

The karma period is in seconds and defaults to 60 seconds. The master stores up to 15 karma spans with per-agent statistics for instrumentation purposes (see SHOW AGENT STATUS statement). However, only the last two spans out of these are used for HA/LB logic.

When there are no queries, the master sends a regular ping command every ha_ping_interval milliseconds in order to collect statistics and check if the remote host is still alive. The ha_ping_interval defaults to 1000 msec. Setting it to 0 disables pings, and statistics will only be accumulated based on actual queries.

Example:

# sharding table over 4 servers total
# in just 2 shards but with 2 failover mirrors for each shard
# node1, node2 carry shard1 as local
# node3, node4 carry shard2 as local

# config on node1, node2
agent = node3:9312|node4:9312:shard2

# config on node3, node4
agent = node1:9312|node2:9312:shard1
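
For orientation, here is a minimal sketch of how the agent line above might sit inside a complete distributed table definition on node1 and node2. The table name dist_shards is illustrative, and the sketch assumes shard1 is already defined as a local table on those nodes and referenced via the local directive:

# hypothetical config on node1, node2
table dist_shards {
          type  = distributed
          local = shard1                        # shard carried locally by node1/node2
          agent = node3:9312|node4:9312:shard2  # remote mirrors for the other shard
}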

Load balancing

Load balancing is turned on by default for any distributed table that uses mirroring. By default, queries are distributed randomly among the mirrors. You can change this behavior by using the ha_strategy directive.

ha_strategy

ha_strategy = {random|nodeads|noerrors|roundrobin}

The mirror selection strategy for load balancing is optional and is set to random by default.

The strategy used for mirror selection, or in other words, choosing a specific agent mirror in a distributed table, is controlled by this directive. Essentially, this directive controls how the master performs load balancing between the configured mirror agent nodes. The following strategies are implemented:

Simple random balancing

The default balancing mode is simple linear random distribution among the mirrors. This means that equal selection probabilities are assigned to each mirror. This is similar to round-robin (RR), but does not impose a strict selection order.

Example:

ha_strategy = random

Adaptive randomized balancing

The default simple random strategy does not take into account the status of mirrors, error rates, and most importantly, actual response latencies. To address heterogeneous clusters and temporary spikes in agent node load, there is a group of balancing strategies that dynamically adjust the probabilities based on the actual query latencies observed by the master.

The adaptive strategies based on latency-weighted probabilities work as follows:

  1. Latency stats are accumulated in blocks of ha_period_karma seconds.
  2. Latency-weighted probabilities are recomputed once per karma period.
  3. The "dead or alive" flag is adjusted once per request, including ping requests.

Initially, the probabilities are equal. On every step, they are scaled by the inverse of the latencies observed during the last karma period, and then renormalized. For example, if during the first 60 seconds after the master startup, 4 mirrors had latencies of 10 ms, 5 ms, 30 ms, and 3 ms respectively, the first adjustment step would go as follows:

  1. Initial percentages: 0.25, 0.25, 0.25, 0.25.
  2. Observed latencies: 10 ms, 5 ms, 30 ms, 3 ms.
  3. Inverse latencies: 0.1, 0.2, 0.0333, 0.333.
  4. Scaled percentages: 0.025, 0.05, 0.008333, 0.0833.
  5. Renormalized percentages: 0.15, 0.30, 0.05, 0.50.

This means that the first mirror would have a 15% chance of being chosen during the next karma period, the second one a 30% chance, the third one (slowest at 30 ms) only a 5% chance, and the fourth and fastest one (at 3 ms) a 50% chance. After that period, the second adjustment step would update those chances again, and so on.

The idea is that once the observed latencies stabilize, the latency weighted probabilities will stabilize as well. All these adjustment iterations are meant to converge at a point where the average latencies are roughly equal across all mirrors.
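
Expressed as a formula (a simplified restatement of the steps above rather than the exact internal implementation), if $p_i$ is the current probability of mirror $i$ and $\ell_i$ is its observed latency over the last karma period, each per-period update amounts to:

$$p_i' = \frac{p_i / \ell_i}{\sum_j p_j / \ell_j}$$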

nodeads

Latency-weighted probabilities, but dead mirrors are excluded from the selection. A "dead" mirror is defined as a mirror that has produced multiple hard errors in a row (e.g., a network failure or no answer).

Example:

ha_strategy = nodeads

noerrors

Latency-weighted probabilities, but mirrors with a worse error/success ratio are excluded from selection.

Example:

ha_strategy = noerrors

Round-robin balancing

Simple round-robin selection, that is, selecting the first mirror in the list, then the second one, then the third one, etc, and then repeating the process once the last mirror in the list is reached. Unlike with the randomized strategies, RR imposes a strict querying order (1, 2, 3, ..., N-1, N, 1, 2, 3, ..., and so on) and guarantees that no two consecutive queries will be sent to the same mirror.

Example:

ha_strategy = roundrobin

Instance-wide options

ha_period_karma

ha_period_karma = 2m

ha_period_karma defines the size of the agent mirror statistics window, in seconds (or with a time suffix). Optional, the default is 60.

For a distributed table with agent mirrors, the server tracks several different per-mirror counters. These counters are then used for failover and balancing. (The server picks the best mirror to use based on the counters.) Counters are accumulated in blocks of ha_period_karma seconds.

After beginning a new block, the master may still use the accumulated values from the previous one until the new one is half full. Thus, any previous history stops affecting the mirror choice after at most 1.5 times ha_period_karma seconds.

Although at most 2 blocks are used for mirror selection, up to 15 last blocks are actually stored for instrumentation purposes. They can be inspected using the SHOW AGENT STATUS statement.

ha_ping_interval

ha_ping_interval = 3s

The ha_ping_interval directive defines the interval between pings sent to the agent mirrors, in milliseconds (or with a time suffix). It is optional, with a default value of 1000.

For a distributed table with agent mirrors, the server sends all mirrors a ping command during idle periods to track their current status (whether they are alive or dead, network roundtrip time, etc.). The interval between pings is determined by the ha_ping_interval setting.

If you want to disable pings, set ha_ping_interval to 0.
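
As a rough sketch of where these two instance-wide options live in a plain config (the listen port and the specific values are illustrative, taken from the examples above), they sit alongside the rest of the searchd settings:

searchd {
          listen = 9312
          ha_period_karma  = 2m   # 2-minute stats window for per-mirror counters
          ha_ping_interval = 3s   # ping idle mirrors every 3 seconds; 0 disables pings
}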