Manticore Search is a highly distributed system that provides all the necessary components to create a highly available and scalable database for search. This includes:
- distributed table for sharding
- Mirroring for high availability
- Load balancing for scalability
- Replication for data safety
Manticore Search offers great flexibility in terms of how you set up your cluster. There are no limitations, so it's up to you to design your cluster according to your needs. Simply learn about the tools mentioned above and use them to achieve your desired goal.
To add a new node to a cluster, simply start another instance of Manticore and ensure that it is accessible by the other nodes in the cluster. Connect the new node to the rest of the cluster using a distributed table and ensure data safety with replication.
To understand how to add a distributed table with remote agents, it is important to first have a basic understanding of distributed tables In this article, we will focus on how to use a distributed table as the basis for creating a cluster of Manticore instances.
Here is an example of how to split data over 4 servers, each serving one of the shards:
- ini
table mydist {
type = distributed
agent = box1:9312:shard1
agent = box2:9312:shard2
agent = box3:9312:shard3
agent = box4:9312:shard4
}In the event of a server failure, the distributed table will still work, but the results from the failed shard will be missing.
Now that we've added mirrors, each shard is found on 2 servers. By default, the master (the searchd instance with the distributed table) will randomly pick one of the mirrors.
The mode used for picking mirrors can be set using the ha_strategy setting. In addition to the default random mode there's also ha_strategy = roundrobin.
More advanced strategies based on latency-weighted probabilities include noerrors and nodeads. These not only take out mirrors with issues but also monitor response times and do balancing. If a mirror responds slower (for example, due to some operations running on it), it will receive fewer requests. When the mirror recovers and provides better times, it will receive more requests.
- ini
table mydist {
type = distributed
agent = box1:9312|box5:9312:shard1
agent = box2:9312:|box6:9312:shard2
agent = box3:9312:|box7:9312:shard3
agent = box4:9312:|box8:9312:shard4
}Load balancing is turned on by default for any distributed table that uses mirroring. By default, queries are distributed randomly among mirrors. You can change this behavior by using ha_strategy.
ha_strategy = {random|nodeads|noerrors|roundrobin}
The mirror selection strategy for load balancing is optional and is set to random by default.
ha_strategy can be set globally in the searchd configuration or per distributed table. The strategy used for mirror selection, or in other words, choosing a specific agent mirror in a distributed table, is controlled by this directive. In practice, this directive controls how the master balances requests between mirrored agent nodes. The following strategies are implemented:
The default balancing mode is simple linear random distribution among mirrors. Equal selection probabilities are assigned to each mirror. This is similar to round-robin (RR), but does not impose a strict selection order.
- Config
- SQL
ha_strategy = randomCREATE TABLE products_dist type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
ha_strategy='random';The default simple random strategy does not take into account the status of mirrors, error rates, and, most importantly, actual response latencies. To address heterogeneous clusters and temporary spikes in agent node load, there is a group of balancing strategies that dynamically adjust the probabilities based on the query latencies observed by the master.
The adaptive strategies based on latency-weighted probabilities work as follows. The size of that statistics window is controlled by ha_period_karma, while idle mirror health and round-trip tracking are driven by ha_ping_interval. Both are described later on this page and in Searchd settings.
- Latency stats are accumulated in blocks of
ha_period_karmaseconds. - Latency-weighted probabilities are recomputed once per karma period.
- The "dead or alive" flag is adjusted once per request, including ping requests.
Initially, the probabilities are equal. On every step, they are scaled by the inverse of the latencies observed during the last karma period, and then renormalized. For example, if during the first 60 seconds after the master startup, 4 mirrors had latencies of 10 ms, 5 ms, 30 ms, and 3 ms respectively, the first adjustment step would go as follows:
- Initial percentages: 0.25, 0.25, 0.25, 0.25.
- Observed latencies: 10 ms, 5 ms, 30 ms, 3 ms.
- Inverse latencies: 0.1, 0.2, 0.0333, 0.333.
- Scaled percentages: 0.025, 0.05, 0.008333, 0.0833.
- Renormalized percentages: 0.15, 0.30, 0.05, 0.50.
This means that the first mirror would have a 15% chance of being chosen during the next karma period, the second one a 30% chance, the third one (slowest at 30 ms) only a 5% chance, and the fourth and fastest one (at 3 ms) a 50% chance. After that period, the second adjustment step would update those chances again, and so on.
The idea is that once the observed latencies stabilize, the latency-weighted probabilities will stabilize as well. All these adjustment iterations are meant to converge at a point where the average latencies are roughly equal across all mirrors.
Latency-weighted probabilities, but mirrors with too many consecutive failed attempts are excluded from selection. The consecutive-error counter is reset only by a fully successful query. Warnings do not reset it. A mirror starts being treated as dead after more than 3 consecutive non-success outcomes, so the first 3 failures in a row are tolerated but the 4th makes that mirror ineligible for nodeads selection until it answers successfully again.
- Config
- SQL
ha_strategy = nodeadsCREATE TABLE products_dist type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
ha_strategy='nodeads';Latency-weighted probabilities, but mirrors with a worse recent error/success ratio are deprioritized and may be excluded from selection.
noerrors looks at recent per-mirror counters collected over the last ha_period_karma windows. It first compares the ratio of critical transport failures (connect_timeouts, connect_failures, network_errors, wrong_replies, unexpected_closings) to total recent activity. Ratios at or below 3% are treated as effectively zero. If several mirrors tie there, it then compares the broader error ratio that also includes warning-bearing replies. Mirrors with no successful queries at all in the recent window are skipped entirely.
noerrors is therefore most useful for intermittent transport-level failures and degraded mirrors, not as a guarantee that every logical remote-table problem disappears immediately. For example, if one mirror points to a missing table, you may still see query failures before that mirror accumulates enough bad history to stop being preferred. Use SHOW AGENT STATUS to inspect recent successes, warnings, and network errors for each mirror.
- Config
- SQL
ha_strategy = noerrorsCREATE TABLE products_dist type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
ha_strategy='noerrors';Simple round-robin selection, that is, selecting the first mirror in the list, then the second one, then the third one, etc, and then repeating the process once the last mirror in the list is reached. Unlike with the randomized strategies, RR imposes a strict querying order (1, 2, 3, ..., N-1, N, 1, 2, 3, ..., and so on) and guarantees that no two consecutive queries will be sent to the same mirror while all mirrors are healthy.
- Config
- SQL
ha_strategy = roundrobinCREATE TABLE products_dist type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
ha_strategy='roundrobin';Assume the mirrored distributed tables already exist, for example:
CREATE TABLE products_random type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products';
CREATE TABLE products_rr type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
ha_strategy='roundrobin';
The difference between random and roundrobin becomes visible when you query mirrored distributed tables repeatedly.
- SQL
SELECT node FROM products_random;
SELECT node FROM products_random;
SELECT node FROM products_rr;
SELECT node FROM products_rr;+-------+
| node |
+-------+
| node2 |
+-------+
+-------+
| node |
+-------+
| node1 |
+-------+
+-------+
| node |
+-------+
| node1 |
+-------+
+-------+
| node |
+-------+
| node2 |
+-------+SHOW AGENT STATUS is the main debugging tool for mirrored distributed tables. For the full statement syntax and all output formats, see SHOW AGENT STATUS. For load balancing specifically, the most useful fields are:
ag_N_pingtripmsec- current ping round-trip time for mirrorNag_N_errorsarow- consecutive non-success outcomes; this is whatnodeadseffectively watchesag_N_1periods_connect_timeouts,connect_failures,network_errors,wrong_replies,unexpected_closings- recent transport failuresag_N_1periods_warnings- replies that succeeded but returned warningsag_N_1periods_succeeded_queries- recent successful queriesag_N_1periods_msecsperquery- recent average query time
These per-period counters are reported for 1, 5, and 15 rolling windows, where each window is ha_period_karma seconds long.
Use these commands on the master to inspect what was stored and how the mirrors are behaving.
- SQL
SHOW CREATE TABLE products_rr;
SHOW AGENT STATUS;
SHOW AGENT STATUS LIKE '%errorsarow%';
SHOW AGENT STATUS LIKE '%1periods_%';+---------------------------------+-------+
| Key | Value |
+---------------------------------+-------+
| ag_0_pingtripmsec | 0.233 |
| ag_0_errorsarow | 0 |
| ag_0_1periods_network_errors | 0 |
| ag_0_1periods_warnings | 0 |
| ag_0_1periods_succeeded_queries | 11 |
| ag_0_1periods_msecsperquery | n/a |
+---------------------------------+-------+Practical interpretation:
- If
errorsarowclimbs above 3,nodeadsstarts treating that mirror as dead. - If a mirror has no recent
succeeded_queries,noerrorswill skip it entirely. - If several mirrors have low error ratios,
noerrorsstill prefers the ones with lower latency because the normal weight rebalancing stays in play. pingtripmsechelps you distinguish "reachable but slow" from "failing" mirrors.
The options related to mirroring, timeouts, and retries are not all supported at the same scope. This page focuses on how they affect load balancing. For full syntax and daemon-level behavior, use the linked reference pages.
| Option | Instance-wide | Per table | Per query | Per agent | Full details |
|---|---|---|---|---|---|
ha_strategy |
yes | yes | no | yes | Load balancing, Remote tables |
ha_period_karma |
yes | no | no | no | Searchd: ha_period_karma |
ha_ping_interval |
yes | no | no | no | Searchd: ha_ping_interval |
agent_connect_timeout |
yes | yes | no | no | Searchd: agent_connect_timeout, Remote tables: agent_connect_timeout |
agent_query_timeout |
yes | yes | yes | no | Searchd: agent_query_timeout, Remote tables: agent_query_timeout |
agent_retry_count / mirror_retry_count / retry_count |
yes (agent_retry_count) |
yes (agent_retry_count or mirror_retry_count) |
yes (OPTION retry_count=...) |
yes (agent=...[retry_count=...]) |
Searchd: agent_retry_count, Remote tables: agent_retry_count, Remote tables: mirror_retry_count, Remote tables: agent |
agent_retry_delay |
yes | no | yes | no | Searchd: agent_retry_delay |
ha_strategy controls how mirrored agents are selected. It can be set globally in searchd, per distributed table, or per agent inside agent = ... [ha_strategy=...]. A narrower scope overrides a broader one.
The strategy examples above already show both the global and per-table forms. For the per-agent override syntax, see Remote tables.
If you want one distributed table to use a different balancing strategy than the global default, set it on that table.
- Config
- SQL
ha_strategy = random
table products_rr {
type = distributed
agent = 127.0.0.1:9312|127.0.0.1:9313:products
ha_strategy = roundrobin
}CREATE TABLE products_rr type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
ha_strategy='roundrobin';ha_period_karma defines the agent mirror statistics window used by the adaptive balancing strategies above and is supported only as an instance-wide searchd setting. Full daemon-level details are in Searchd: ha_period_karma.
- Config
ha_period_karma = 2mha_ping_interval defines how often idle mirrors are pinged and is supported only as an instance-wide searchd setting. Full daemon-level details are in Searchd: ha_ping_interval.
- Config
ha_ping_interval = 3sagent_connect_timeout defines how long Manticore waits to establish a connection to a remote agent. It is supported as an instance-wide default and per distributed table. Full details are in Searchd: agent_connect_timeout and Remote tables: agent_connect_timeout.
- Config
- SQL
agent_connect_timeout = 300msCREATE TABLE products_dist type='distributed'
agent='127.0.0.1:9312:products|127.0.0.1:9313:products'
agent_connect_timeout='300ms';agent_query_timeout defines how long Manticore waits for a connected remote agent to finish the query. It is supported as an instance-wide default, per distributed table, and per query as OPTION agent_query_timeout=.... Full details are in Searchd: agent_query_timeout and Remote tables: agent_query_timeout.
If agent_query_timeout is reached, the query is not retried automatically; a warning is produced instead.
- Config
- SQL
agent_query_timeout = 500msSELECT * FROM products_dist OPTION agent_query_timeout=750;agent_retry_count defines how many times Manticore retries remote-agent work before reporting a fatal query error. The name varies by scope: use agent_retry_count as the instance-wide setting, agent_retry_count or its alias mirror_retry_count on a distributed table, OPTION retry_count=... per query, and [retry_count=...] inside an individual agent=... declaration. Full details are in Searchd: agent_retry_count, Remote tables: agent_retry_count, and Remote tables: mirror_retry_count.
If you use agent mirrors, the retry count is aggregated across mirrors. The per-agent [retry_count=...] option acts as an absolute cap for that specific agent declaration.
- Config
- SQL
table products_dist {
type = distributed
agent = 127.0.0.1:9312|127.0.0.1:9313:products[retry_count=2]
}SELECT * FROM products_dist OPTION retry_count=1;agent_retry_delay defines the delay between retry attempts. It is supported as an instance-wide default and per query as OPTION retry_delay=..., but not per distributed table. Full details are in Searchd: agent_retry_delay.
This option is only relevant when retries are enabled through agent_retry_count or OPTION retry_count=....
- Config
- SQL
agent_retry_delay = 500msSELECT * FROM products_dist OPTION retry_count=2, retry_delay=300;