Manticore 通过其分布式搜索能力设计得以有效扩展。分布式搜索有助于提升查询延迟(即搜索时间)和吞吐量(即最大查询次数/秒),适用于多服务器、多CPU或多核环境。这对于需要搜索海量数据(即数十亿记录和数TB文本)的应用至关重要。
其主要理念是将被搜索的数据水平划分到不同的搜索节点中,并行处理。
分区需要手动完成。配置步骤如下:
- 在不同服务器上部署多个 Manticore 实例
- 将数据集的不同部分分发给不同实例
- 在一些
searchd实例上配置特殊的分布式表 - 将查询路由到该分布式表
这种类型的表仅包含对其他本地和远程表的引用,因此不能直接重新索引。相反,应当重新索引它所引用的表。
当 Manticore 接收到针对分布式表的查询时,会执行以下步骤:
- 连接到配置的远程代理
- 将查询发送给远程代理
- 同时搜索配置的本地表(在远程代理搜索的同时)
- 从远程代理检索搜索结果
- 合并所有结果,去除重复项
- 将合并后的结果发送给客户端
从应用程序角度看,搜索常规表和分布式表没有区别。换言之,分布式表对应用程序是完全透明的,无法判断查询的是分布式表还是本地表。
了解更多关于远程节点的信息。
多查询,或查询批处理,允许您在一个网络请求中向 Manticore 发送多个搜索查询。
👍 为什么使用多查询?
主要原因在于性能。通过将请求以批处理的形式发送到 Manticore 而不是逐个发送,可以节省时间并减少网络往返次数。此外,以批处理的形式发送查询可以让 Manticore 执行某些内部优化。如果无法应用批处理优化,查询将逐个处理。
⛔ 何时不使用多查询?
多查询要求批处理中的所有搜索查询都是独立的,但这并不总是成立。有时查询 B 依赖于查询 A 的结果,这意味着查询 B 只能在执行查询 A 之后才能设置。例如,您可能只想在主表中未找到结果时显示次级索引的结果,或者您可能需要根据第一个结果集中的匹配数量指定第二个结果集中的偏移量。在这种情况下,您将需要使用单独的查询(或单独的批处理)。
当使用连接器库时,例如 PHP 中的 mysqli,您可以添加多个查询,然后将它们作为一个批处理运行。这将作为一个单个多查询批处理工作。
注意:如果您使用控制台 MySQL 客户端,默认情况下它会将分号(;)解释为分隔符本身,并逐个将每个查询发送到服务器;这不是一个多查询批处理。要覆盖此行为,可以在客户端侧使用内部命令 delimiter 重新定义分隔符。在做出此更改后,客户端将发送整个包含分号的字符串,允许“多查询魔法”生效。
控制台客户端的这种行为有时会令人困惑,因为您可能会注意到相同的命令序列在 MySQL 客户端控制台中与 SQL-over-HTTP 等其他协议的行为有所不同。这是因为 MySQL 控制台客户端本身使用分号来划分查询,但其他协议可能会将整个序列作为一个批处理发送。
您可以使用 SQL 通过分号分隔多个搜索查询。当 Manticore 从客户端接收到格式为这样的查询时,所有语句间的优化都将被应用。
多查询不支持带有 FACET 的查询。一个批处理中的多查询数量不应超过 max_batch_queries。
- SQL
SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price DESC; SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price ASC从控制台 MySQL/MariaDB 客户端:
DELIMITER _
SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price DESC; SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price ASC_
有两种主要的优化需要了解:公共查询优化和公共子树优化。
公共查询优化意味着 searchd 将识别批处理中所有仅排序和分组设置不同的查询,并且 只执行一次搜索。例如,如果批处理包含 3 个查询,它们都是针对“ipod nano”的,但第一个查询请求按价格排序的前 10 个结果,第二个查询按供应商 ID 分组并请求按评分排序的前 5 个供应商,第三个查询请求最高价格,全文搜索“ipod nano”将只执行一次,并且其结果将被重用以构建 3 个不同的结果集。
分面搜索 是特别重要的一种情况,可以从这种优化中受益。确实,分面搜索可以通过运行几个查询来实现,一个查询用于检索搜索结果本身,几个其他查询带有相同的全文查询但不同的分组设置,以检索所有所需的分组结果(前 3 位作者,前 5 位供应商等)。只要全文查询和过滤设置保持不变,公共查询优化就会触发,并大大提高性能。
公共子树优化更加有趣。它允许 searchd 利用批处理中的全文查询之间的相似性。它会识别所有查询中的公共全文查询部分(子树),并在查询之间缓存它们。例如,考虑以下查询批处理:
donald trump president
donald trump barack obama john mccain
donald trump speech
有一个公共的两词部分 donald trump,它只需要计算一次,然后缓存并共享给所有查询。公共子树优化正是这样做的。每个查询的缓存大小由 subtree_docs_cache 和 subtree_hits_cache 指令严格控制(以确保缓存所有 160 亿个匹配“i am”的文档不会耗尽内存并立即杀死您的服务器)。
如何知道批处理中的查询是否实际进行了优化?如果进行了优化,相应的查询日志将有一个“倍数”字段,指定了一起处理了多少个查询:
注意“x3”字段。这意味着此查询被优化并作为一个子批处理中的 3 个查询一起处理。
- log
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the作为参考,如果查询未批处理,常规日志将如下所示:
- log
[Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext/0/ext 747541 (0,20)] [lj] the注意,在多查询情况下,每个查询的时间提高了 1.5 到 2.3 倍,具体取决于特定的排序模式。
多查询主要用于批处理查询并接收此类批处理的元数据。由于这一限制,批处理中只允许一小部分语句。在一个批处理中,您可以组合 SELECT、SHOW 和 SET 语句。
您可以像平常一样使用 SELECT;然而,请注意,所有查询将在一次通过中运行。如果查询之间没有关联,多查询就没有好处。守护进程会检测到这一点,并逐个运行查询。
你可以使用 SHOW 来处理 警告, 状态, 代理状态, 元数据, 配置文件 和 计划。所有其他在批次中的 SHOW 语句将被静默忽略,没有任何输出。例如,你不能执行 SHOW TABLES, SHOW THREADS, 或 SHOW VARIABLES,或任何其他未提及的语句进行批次处理。
你可以仅使用 SET 来设置 SET PROFILING。所有其他 SET ... 命令将被静默忽略。
执行的顺序也不同。守护进程在两轮中处理批次。
首先,它收集所有 SELECT 语句,并同时运行它看到的所有 SET PROFILING 语句。作为副作用,只有最后一个 SET PROFILING 语句有效。如果你执行一个类似的多查询语句,如 SET PROFILING=1; SELECT...; SHOW META; SHOW PROFILE; SET PROFILING=0,你将看不到任何配置文件,因为在第一轮中,守护进程执行了 SET PROFILING=1,然后立即执行了 SET PROFILING=0。
第二轮,守护进程尝试使用收集的所有 SELECT 语句执行单个批次查询。如果语句不相关,它将依次执行它们。
最后,它遍历初始批次序列,并返回结果集中的每个 SELECT 和 SHOW 的子结果数据和元数据。由于所有 SET PROFILING 语句在第一轮中已执行,因此在第二轮中将被跳过。
each SELECT and SHOW. Since all SET PROFILING statements were executed in the first pass, they are skipped on this second pass.
Manticore 通过以下格式支持 SQL 中的 SELECT 子查询:
SELECT * FROM (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y
外层 select 只允许包含 ORDER BY 和 LIMIT 子句。子查询目前有两个使用场景:
-
当你有一个查询包含两个排序 UDF,一个非常快,另一个较慢,并且执行全文搜索时匹配结果集很大。没有子查询,查询看起来像这样:
SELECT id,slow_rank() as slow,fast_rank() as fast FROM index WHERE MATCH(‘some common query terms’) ORDER BY fast DESC, slow DESC LIMIT 20 OPTION max_matches=1000;使用子查询,查询可以重写为:
SELECT * FROM (SELECT id,slow_rank() as slow,fast_rank() as fast FROM index WHERE MATCH(‘some common query terms’) ORDER BY fast DESC LIMIT 100 OPTION max_matches=1000) ORDER BY slow DESC LIMIT 20;在初始查询中,
slow_rank()UDF 会对整个匹配结果集计算。使用 SELECT 子查询,仅对整个匹配结果集计算fast_rank(),而只对有限集合计算slow_rank()。 -
第二种情况适用于来自分布式表的大型结果集。
对于此查询:
SELECT * FROM my_dist_index WHERE some_conditions LIMIT 50000;如果你有 20 个节点,每个节点最多可以返回 50K 条记录给主节点,结果是 20 x 50K = 100万条记录。然而,由于主节点只返回 50K 条(从 100万条中),节点只发送前 10K 条记录可能就足够了。使用子查询,你可以将查询重写为:
SELECT * FROM (SELECT * FROM my_dist_index WHERE some_conditions LIMIT 10000) ORDER by some_attr LIMIT 50000;在这种情况下,节点仅接收内部查询并执行它。这意味着主节点将仅收到 20x10K=20万 条记录。主节点会对接收到的所有记录按外层子句重新排序,并返回最佳的 50K 条记录。子查询有助于减少主节点和节点之间的流量,同时减少主节点的计算时间(因为它只处理 20万 条记录而非 100万 条)。
分组搜索结果通常有助于获取每个组的匹配计数或其他聚合。例如,它对于创建一个图表来说明每个月匹配的博客文章数量或按站点分组网页搜索结果或按作者分组论坛帖子等非常有用。
Manticore 支持按单个或多个列和计算表达式对搜索结果进行分组。结果可以:
- 在组内排序
- 每个组返回多于一行
- 过滤组
- 对组进行排序
- 使用聚合函数进行聚合
- SQL
- JSON
通用语法
SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
[HAVING where_condition]
[WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
...
SELECT_expr: { field_name | function_name(...) }
where_condition: {aggregation expression alias | COUNT(*)}JSON 查询格式目前支持基本的分组,可以检索聚合值及其 count(*)。
{
"table": "<table_name>",
"limit": 0,
"aggs": {
"<aggr_name>": {
"terms": {
"field": "<attribute>",
"size": <int value>
}
}
}
}标准查询输出返回未分组的结果集,可以使用 limit(或 size)隐藏。
聚合需要为组的结果集大小设置 size。
分组非常简单 - 只需在 SELECT 查询的末尾添加 "GROUP BY smth"。smth 可以是:
- 表中的任何非全文字段:整数、浮点数、字符串、MVA(多值属性)
- 或者如果你在
SELECT列表中使用了别名,你也可以按它分组
在 SELECT 列表中可以省略任何 聚合函数,它仍然会工作:
- SQL
SELECT release_year FROM films GROUP BY release_year LIMIT 5;+--------------+
| release_year |
+--------------+
| 2004 |
| 2002 |
| 2001 |
| 2005 |
| 2000 |
+--------------+然而,在大多数情况下,你可能希望为每个组获取一些聚合数据,例如:
COUNT(*)以简单地获取每个组中的元素数量- 或
AVG(field)计算组内字段的平均值
对于 HTTP JSON 请求,使用主查询级别的单个 aggs 桶并设置 limit=0 与 SQL 查询中的 GROUP BY 和 COUNT(*) 具有类似的行为和性能。
- SQL1
- SQL2
- JSON
- PHP
- Python
- Python-asyncio
- Javascript
- Java
- C#
- Rust
- TypeScript
- Go
SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5;POST /search -d '
{
"table" : "films",
"limit": 0,
"aggs" :
{
"release_year" :
{
"terms" :
{
"field":"release_year",
"size":100
}
}
}
}
'$index->setName('films');
$search = $index->search('');
$search->limit(0);
$search->facet('release_year','release_year',100);
$results = $search->get();
print_r($results->getFacets());res =searchApi.search({"table":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}})res = await searchApi.search({"table":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}})res = await searchApi.search({"table":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}});HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("films");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);var agg = new Aggregation("release_year", "release_year");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("films", query);
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);let query = SearchQuery::new();
let aggTerms1 = AggTerms::new {
fields: "release_year".to_string(),
size: Some(100),
};
let agg1 = Aggregation {
terms: Some(Box::new(aggTerms1)),
..Default::default(),
};
let mut aggs = HashMap::new();
aggs.insert("release_year".to_string(), agg1);
let search_req = SearchRequest {
table: "films".to_string(),
query: Some(Box::new(query)),
aggs: serde_json::json!(aggs),
..Default::default(),
};
let search_res = search_api.search(search_req).await;res = await searchApi.search({
index: 'test',
limit: 0,
aggs: {
cat_id: {
terms: { field: "cat", size: 1 }
}
}
});query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("cat")
aggTerms.SetSize(1)
aggregation := manticoreclient.NewAggregation()
aggregation.setTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+--------------+----------++--------------+------------------+
| release_year | avg(rental_rate) |
+--------------+------------------+
| 2004 | 2.78629661 |
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2005 | 2.90397978 |
| 2000 | 3.17556739 |
+--------------+------------------+{
"took": 2,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
]
},
"release_year": {
"group_brand_id": {
"buckets": [
{
"key": 2004,
"doc_count": 108
},
{
"key": 2002,
"doc_count": 108
},
{
"key": 2000,
"doc_count": 97
},
{
"key": 2005,
"doc_count": 93
},
{
"key": 2001,
"doc_count": 91
}
]
}
}
}Array
(
[release_year] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 2009
[doc_count] => 99
)
[1] => Array
(
[key] => 2008
[doc_count] => 102
)
[2] => Array
(
[key] => 2007
[doc_count] => 93
)
[3] => Array
(
[key] => 2006
[doc_count] => 103
)
[4] => Array
(
[key] => 2005
[doc_count] => 93
)
[5] => Array
(
[key] => 2004
[doc_count] => 108
)
[6] => Array
(
[key] => 2003
[doc_count] => 106
)
[7] => Array
(
[key] => 2002
[doc_count] => 108
)
[8] => Array
(
[key] => 2001
[doc_count] => 91
)
[9] => Array
(
[key] => 2000
[doc_count] => 97
)
)
)
){'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
u'key': 2009},
{u'doc_count': 102,
u'key': 2008},
{u'doc_count': 93,
u'key': 2007},
{u'doc_count': 103,
u'key': 2006},
{u'doc_count': 93,
u'key': 2005},
{u'doc_count': 108,
u'key': 2004},
{u'doc_count': 106,
u'key': 2003},
{u'doc_count': 108,
u'key': 2002},
{u'doc_count': 91,
u'key': 2001},
{u'doc_count': 97,
u'key': 2000}]}},
'hits': {'hits': [], 'max_score': None, 'total': 1000},
'profile': None,
'timed_out': False,
'took': 0}{'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
u'key': 2009},
{u'doc_count': 102,
u'key': 2008},
{u'doc_count': 93,
u'key': 2007},
{u'doc_count': 103,
u'key': 2006},
{u'doc_count': 93,
u'key': 2005},
{u'doc_count': 108,
u'key': 2004},
{u'doc_count': 106,
u'key': 2003},
{u'doc_count': 108,
u'key': 2002},
{u'doc_count': 91,
u'key': 2001},
{u'doc_count': 97,
u'key': 2000}]}},
'hits': {'hits': [], 'max_score': None, 'total': 1000},
'profile': None,
'timed_out': False,
'took': 0}{"took":0,"timed_out":false,"aggregations":{"release_year":{"buckets":[{"key":2009,"doc_count":99},{"key":2008,"doc_count":102},{"key":2007,"doc_count":93},{"key":2006,"doc_count":103},{"key":2005,"doc_count":93},{"key":2004,"doc_count":108},{"key":2003,"doc_count":106},{"key":2002,"doc_count":108},{"key":2001,"doc_count":91},{"key":2000,"doc_count":97}]}},"hits":{"total":1000,"hits":[]}}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}{
"took":0,
"timed_out":false,
"aggregations":
{
"cat_id":
{
"buckets":
[{
"key":1,
"doc_count":1
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}{
"took":0,
"timed_out":false,
"aggregations":
{
"cat_id":
{
"buckets":
[{
"key":1,
"doc_count":1
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}- SQL
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2000 | 97 |
| 2001 | 91 |
| 2002 | 108 |
| 2003 | 106 |
| 2004 | 108 |
+--------------+----------+或者,你可以按聚合进行排序:
- 按
count(*)显示具有最多元素的组首先 - 按
avg(rental_rate)展示评分最高的电影首先。注意,在示例中,它是通过别名完成的:avg(rental_rate)首先映射到avg在SELECT列表中,然后我们只需做ORDER BY avg
- SQL1
- SQL2
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;SELECT release_year, AVG(rental_rate) avg FROM films GROUP BY release_year ORDER BY avg desc LIMIT 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2003 | 106 |
| 2006 | 103 |
| 2008 | 102 |
+--------------+----------++--------------+------------+
| release_year | avg |
+--------------+------------+
| 2006 | 3.26184368 |
| 2000 | 3.17556739 |
| 2001 | 3.09989142 |
| 2002 | 3.08259249 |
| 2008 | 2.99000049 |
+--------------+------------+- SQL
- JSON
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;POST /search -d '
{
"size": 0,
"table": "films",
"aggs": {
"cat_release": {
"composite": {
"size":5,
"sources": [
{ "category": { "terms": { "field": "category_id" } } },
{ "release year": { "terms": { "field": "release_year" } } }
]
}
}
}
}
'+-------------+--------------+----------+
| category_id | release_year | count(*) |
+-------------+--------------+----------+
| 1 | 2000 | 5 |
| 1 | 2001 | 2 |
| 1 | 2002 | 6 |
| 1 | 2003 | 6 |
| 1 | 2004 | 5 |
| 1 | 2005 | 10 |
| 1 | 2006 | 4 |
| 1 | 2007 | 5 |
| 1 | 2008 | 7 |
| 1 | 2009 | 14 |
| 2 | 2000 | 10 |
| 2 | 2001 | 5 |
| 2 | 2002 | 6 |
| 2 | 2003 | 6 |
| 2 | 2004 | 10 |
| 2 | 2005 | 4 |
| 2 | 2006 | 5 |
| 2 | 2007 | 8 |
| 2 | 2008 | 8 |
| 2 | 2009 | 4 |
+-------------+--------------+----------+{
"took": 0,
"timed_out": false,
"hits": {
"total": 1000,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"cat_release": {
"after_key": {
"category": 1,
"release year": 2007
},
"buckets": [
{
"key": {
"category": 1,
"release year": 2008
},
"doc_count": 7
},
{
"key": {
"category": 1,
"release year": 2009
},
"doc_count": 14
},
{
"key": {
"category": 1,
"release year": 2005
},
"doc_count": 10
},
{
"key": {
"category": 1,
"release year": 2004
},
"doc_count": 5
},
{
"key": {
"category": 1,
"release year": 2007
},
"doc_count": 5
}
]
}
}
}有时查看每个组的不止一行是有用的。这可以通过 GROUP N BY 辅助轻松实现。例如,在以下情况下,我们将为每一年获取两部电影而不是只有一部,而简单的 GROUP BY release_year 将返回。
- SQL
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;+--------------+-----------------------------+
| release_year | title |
+--------------+-----------------------------+
| 2009 | ALICE FANTASIA |
| 2009 | ALIEN CENTER |
| 2008 | AMADEUS HOLY |
| 2008 | ANACONDA CONFESSIONS |
| 2007 | ANGELS LIFE |
| 2007 | ARACHNOPHOBIA ROLLERCOASTER |
+--------------+-----------------------------+另一个关键的分析需求是在组内对元素进行排序。要实现这一点,请使用 WITHIN GROUP ORDER BY ... {ASC|DESC} 子句。例如,让我们获取每一年评分最高的电影。请注意,它与仅 ORDER BY 并行工作:
WITHIN GROUP ORDER BY对组内的结果进行排序- 而仅
GROUP BY对组本身进行排序
这两个完全独立地工作。
- SQL
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;+--------------+------------------+-------------+
| release_year | title | rental_rate |
+--------------+------------------+-------------+
| 2009 | AMERICAN CIRCUS | 4.990000 |
| 2008 | ANTHEM LUKE | 4.990000 |
| 2007 | ATTACKS HATE | 4.990000 |
| 2006 | ALADDIN CALENDAR | 4.990000 |
| 2005 | AIRPLANE SIERRA | 4.990000 |
+--------------+------------------+-------------+HAVING expression 是一个有用的子句,用于过滤组。虽然 WHERE 在分组之前应用,但 HAVING 作用于组。例如,让我们保留那些当年电影平均租赁价格高于 3 的年份。我们只得到四个年份:
- SQL
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;+--------------+------------+
| release_year | avg |
+--------------+------------+
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2000 | 3.17556739 |
| 2006 | 3.26184368 |
+--------------+------------+注意: 在 搜索查询元信息 中,total_found 值反映了满足 HAVING 条件的组的数量。当使用 HAVING 子句与 GROUP BY 结合时,这使得分页更加准确。
有一个函数 GROUPBY() 返回当前分组的键。它在许多情况下都很有用,特别是当你对 MVA 进行分组 或者对 JSON 值进行分组。
它也可以用在 HAVING 中,例如,仅保留2000年和2002年的数据。
注意,当你一次对多个字段进行 GROUP BY 时,不推荐使用 GROUPBY()。它仍然可以工作,但此时分组键是字段值的复合体,可能不会以你期望的方式出现。
- SQL
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2002 | 108 |
| 2000 | 97 |
+--------------+----------+Manticore 支持按MVA进行分组。为了演示其工作原理,让我们创建一个名为 "shoes"、包含 MVA 字段 "sizes" 的表,并插入一些文档:
create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));
所以我们有:
SELECT * FROM shoes;
+---------------------+----------+---------+
| id | sizes | title |
+---------------------+----------+---------+
| 1657851069130080265 | 40,41,42 | nike |
| 1657851069130080266 | 41,43 | adidas |
| 1657851069130080267 | 42,43 | reebook |
+---------------------+----------+---------+
如果现在对 "sizes" 进行 GROUP BY,它将处理我们所有的多值属性,并为每个大小返回一次聚合,这里仅有计数:
- SQL
- JSON
- PHP
- Python
- Javascript
- Python-asyncio
- Java
- C#
- Rust
- Go
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;POST /search -d '
{
"table" : "shoes",
"limit": 0,
"aggs" :
{
"sizes" :
{
"terms" :
{
"field":"sizes",
"size":100
}
}
}
}
'$index->setName('shoes');
$search = $index->search('');
$search->limit(0);
$search->facet('sizes','sizes',100);
$results = $search->get();
print_r($results->getFacets());res =searchApi.search({"table":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}})res = await searchApi.search({"table":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}});res = await searchApi.search({"table":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}})HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("films");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);var agg = new Aggregation("release_year", "release_year");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("films", query);
searchRequest.Limit = 0;
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);let query = SearchQuery::new();
let aggTerms1 = AggTerms::new {
fields: "release_year".to_string(),
size: Some(100),
};
let agg1 = Aggregation {
terms: Some(Box::new(aggTerms1)),
..Default::default(),
};
let mut aggs = HashMap::new();
aggs.insert("release_year".to_string(), agg1);
let search_req = SearchRequest {
table: "films".to_string(),
query: Some(Box::new(query)),
aggs: serde_json::json!(aggs),
limit: serde_json::json!(0),
..Default::default(),
};
let search_res = search_api.search(search_req).await;query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("mva_field")
aggTerms.SetSize(2)
aggregation := manticoreclient.NewAggregation()
aggregation.setTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()+------+----------+
| gb | count(*) |
+------+----------+
| 40 | 1 |
| 41 | 2 |
| 42 | 2 |
| 43 | 2 |
+------+----------+{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"sizes": {
"buckets": [
{
"key": 43,
"doc_count": 2
},
{
"key": 42,
"doc_count": 2
},
{
"key": 41,
"doc_count": 2
},
{
"key": 40,
"doc_count": 1
}
]
}
}
}Array
(
[sizes] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 43
[doc_count] => 2
)
[1] => Array
(
[key] => 42
[doc_count] => 2
)
[2] => Array
(
[key] => 41
[doc_count] => 2
)
[3] => Array
(
[key] => 40
[doc_count] => 1
)
)
)
){'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
{u'doc_count': 2, u'key': 42},
{u'doc_count': 2, u'key': 41},
{u'doc_count': 1, u'key': 40}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}{'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
{u'doc_count': 2, u'key': 42},
{u'doc_count': 2, u'key': 41},
{u'doc_count': 1, u'key': 40}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}{"took":0,"timed_out":false,"aggregations":{"sizes":{"buckets":[{"key":43,"doc_count":2},{"key":42,"doc_count":2},{"key":41,"doc_count":2},{"key":40,"doc_count":1}]}},"hits":{"total":3,"hits":[]}}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
<!-- request TypeScript -->
``` typescript
res = await searchApi.search({
index: 'test',
aggs: {
mva_agg: {
terms: { field: "mva_field", size: 2 }
}
}
});{
"took":0,
"timed_out":false,
"aggregations":
{
"mva_agg":
{
"buckets":
[{
"key":1,
"doc_count":4
},
{
"key":2,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}{
"took":0,
"timed_out":false,
"aggregations":
{
"mva_agg":
{
"buckets":
[{
"key":1,
"doc_count":4
},
{
"key":2,
"doc_count":2
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}If you have a field of type JSON, you can GROUP BY any node from it. To demonstrate this, let's create a table "products" with a few documents, each having a color in the "meta" JSON field:
create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');
This gives us:
SELECT * FROM products;
+---------------------+-------------------+--------+
| id | meta | title |
+---------------------+-------------------+--------+
| 1657851069130080268 | {"color":"red"} | nike |
| 1657851069130080269 | {"color":"red"} | adidas |
| 1657851069130080270 | {"color":"green"} | puma |
+---------------------+-------------------+--------+
To group the products by color, we can simply use GROUP BY meta.color, and to display the corresponding group key in the SELECT list, we can use GROUPBY():
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- Rust
- TypeScript
- Go
SELECT groupby() color, count(*) from products GROUP BY meta.color;POST /search -d '
{
"table" : "products",
"limit": 0,
"aggs" :
{
"color" :
{
"terms" :
{
"field":"meta.color",
"size":100
}
}
}
}
'$index->setName('products');
$search = $index->search('');
$search->limit(0);
$search->facet('meta.color','color',100);
$results = $search->get();
print_r($results->getFacets());res =searchApi.search({"table":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}})res = await searchApi.search({"table":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}});HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("color", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","meta.color");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);var agg = new Aggregation("color", "meta.color");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("products", query);
searchRequest.Limit = 0;
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);let query = SearchQuery::new();
let aggTerms1 = AggTerms::new {
fields: "meta.color".to_string(),
size: Some(100),
};
let agg1 = Aggregation {
terms: Some(Box::new(aggTerms1)),
..Default::default(),
};
let mut aggs = HashMap::new();
aggs.insert("color".to_string(), agg1);
let search_req = SearchRequest {
table: "products".to_string(),
query: Some(Box::new(query)),
aggs: serde_json::json!(aggs),
limit: serde_json::json!(0),
..Default::default(),
};
let search_res = search_api.search(search_req).await;res = await searchApi.search({
index: 'test',
aggs: {
json_agg: {
terms: { field: "json_field.year", size: 1 }
}
}
});query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("json_field.year")
aggTerms.SetSize(2)
aggregation := manticoreclient.NewAggregation()
aggregation.setTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()+-------+----------+
| color | count(*) |
+-------+----------+
| red | 2 |
| green | 1 |
+-------+----------+{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"color": {
"buckets": [
{
"key": "green",
"doc_count": 1
},
{
"key": "red",
"doc_count": 2
}
]
}
}
}Array
(
[color] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => green
[doc_count] => 1
)
[1] => Array
(
[key] => red
[doc_count] => 2
)
)
)
){'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
u'key': u'green'},
{u'doc_count': 2, u'key': u'red'}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}<!-- request Python-asyncio -->
``` python
res = await searchApi.search({"table":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}}){'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
u'key': u'green'},
{u'doc_count': 2, u'key': u'red'}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}{"took":0,"timed_out":false,"aggregations":{"color":{"buckets":[{"key":"green","doc_count":1},{"key":"red","doc_count":2}]}},"hits":{"total":3,"hits":[]}}class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
{
"took":0,
"timed_out":false,
"aggregations":
{
"json_agg":
{
"buckets":
[{
"key":2000,
"doc_count":2
},
{
"key":2001,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}{
"took":0,
"timed_out":false,
"aggregations":
{
"json_agg":
{
"buckets":
[{
"key":2000,
"doc_count":2
},
{
"key":2001,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}除了返回每组元素数量的 COUNT(*),你还可以使用各种其他聚合函数:
虽然 COUNT(*) 返回组中所有元素的数量,COUNT(DISTINCT field) 返回组中特定字段的唯一值数量,这可能与总数完全不同。例如,你可以在组中有100个元素,但某字段的值全都相同。COUNT(DISTINCT field) 能帮助确定这一点。为演示此功能,我们创建一个名为 "students" 的表,包含学生姓名、年龄和专业:
CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');
所以我们有:
MySQL [(none)]> SELECT * from students;
+---------------------+------+----------+---------+
| id | age | major | name |
+---------------------+------+----------+---------+
| 1657851069130080271 | 21 | arts | John |
| 1657851069130080272 | 22 | business | William |
| 1657851069130080273 | 21 | cs | Richard |
| 1657851069130080274 | 22 | cs | Rebecca |
| 1657851069130080275 | 21 | arts | Monica |
+---------------------+------+----------+---------+
在示例中,你可以看到如果按 major 分组并同时显示 COUNT(*) 和 COUNT(DISTINCT age),就会清楚地知道选择专业为 "cs" 的有两名学生且年龄各不相同,但对于专业为 "arts" 的也有两名学生,却只有一个唯一年龄。
每个查询中最多只能有一个 COUNT(DISTINCT)。
默认情况下,计数是近似的
实际上,有些计数是精确的,有些则是近似的。下面会详细说明。
Manticore 支持两种计算唯一值计数的算法。一种是使用大量内存通常较慢的传统算法,它收集 {group; value} 对,排序后周期性去重。该方法的优点是保证在普通表中的计数精确。你可以通过将 distinct_precision_threshold 选项设置为 0 来启用它。
另一种算法(默认启用)将计数加载到哈希表中并返回其大小。如果哈希表过大,其内容会被转移到 HyperLogLog。此时计数变为近似,因为 HyperLogLog 是一种概率算法。其优点是每组最大内存使用固定,依赖于 HyperLogLog 的精度。总体内存使用还受 max_matches 设置影响,该值反映组数。
distinct_precision_threshold 选项设定计数保证准确的阈值。HyperLogLog 的精度设置以及“从哈希表到 HyperLogLog”的转换阈值均基于该参数。使用此选项需谨慎,因为将其翻倍会使计数计算所需最大内存翻倍。最大内存使用大致可用公式估算:64 * max_matches * distinct_precision_threshold。注意,这是最坏情况,实际计数计算通常占用明显更少的内存。
对于包含多个磁盘块的分布式表或实时表,COUNT(DISTINCT) 可能返回不准确结果,但对于由相同模式(相同字段集/顺序但分词设置可不同)的本地普通表或实时表组成的分布式表,结果应准确。
- SQL
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;+----------+----------+---------------------+
| major | count(*) | count(distinct age) |
+----------+----------+---------------------+
| arts | 2 | 1 |
| business | 1 | 1 |
| cs | 2 | 2 |
+----------+----------+---------------------+你通常想更好地了解每个分组的内容。你可以使用 GROUP N BY 实现,但它会返回额外的行,可能不想出现在结果中。GROUP_CONCAT() 通过连接组中特定字段的数值来丰富分组内容。继续使用前面的例子,这里改进为显示每组中的所有年龄。
GROUP_CONCAT(field) 返回逗号分隔的值列表。
- SQL
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major+----------+----------+---------------------+-------------------+
| major | count(*) | count(distinct age) | group_concat(age) |
+----------+----------+---------------------+-------------------+
| arts | 2 | 1 | 21,21 |
| business | 1 | 1 | 22 |
| cs | 2 | 2 | 21,22 |
+----------+----------+---------------------+-------------------+- SQL
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;+------+------------+----------+----------+------------+
| year | sum | min | max | avg |
+------+------------+----------+----------+------------+
| 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
+------+------------+----------+----------+------------+分组使用固定内存,该内存大小依赖于 max_matches 设置。如果 max_matches 足以存储所有找到的组,结果将是 100% 精确的。但若 max_matches 较小,结果的准确性会降低。
涉及并行处理时情况会更复杂。启用 pseudo_sharding 和/或使用包含多个磁盘块的 RT 表时,每个块或伪分片得到的结果集大小都不超过 max_matches。当不同线程结果集合并时,聚合和分组计数可能不准确。为解决这个问题,可以使用更大的 max_matches 值或禁用并行处理。
如果检测到 groupby 可能返回不准确结果,Manticore 会尝试将 max_matches 增加到 max_matches_increase_threshold。检测基于从次级索引(如有)检索到的 groupby 属性的唯一值数量。
为确保在使用 RT 表或 pseudo_sharding 时聚合和分组计数准确,可以启用 accurate_aggregation。这将尝试提升 max_matches 到阈值,如果阈值不足,Manticore 会禁用查询的并行处理。
- SQL
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
| 2002 | 74 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
+------+----------+