Manticore is designed to scale effectively through its distributed searching capabilities. Distributed searching helps improve query latency (i.e., search time) and throughput (i.e., maximum queries per second) in multi-server, multi-CPU, or multi-core environments. This is essential for applications that need to search through vast amounts of data (i.e., billions of records and terabytes of text).
The key idea is to horizontally partition the searched data across search nodes and then process it in parallel.
Partitioning is done manually. To set it up:
- Set up several instances of Manticore on different servers
- Distribute different parts of your dataset to different instances
- Configure a special distributed table on some of the searchd instances
- Route your queries to the distributed table
This type of table only contains references to other local and remote tables, so it cannot be reindexed directly. Instead, you should reindex the tables that it references.
When Manticore receives a query against a distributed table, it performs the following steps:
- connects to the configured remote agents
- issues the query to them
- simultaneously searches the configured local tables (while the remote agents are searching)
- retrieves the remote agents' search results
- merges all the results together, removing duplicates
- sends the merged results to the client
From the application's point of view, there is no difference between searching through a regular table or a distributed table. In other words, distributed tables are fully transparent to the application; there's no way to tell whether the table you queried was distributed or local.
Learn more about remote nodes.
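As an illustration, a distributed table tying together one local table and two remote agents might be declared in the configuration like this (the table names, hosts, and ports are hypothetical, not taken from a real deployment):

```ini
table dist_products
{
    type  = distributed
    # this part of the data is searched locally by this instance
    local = products_shard0
    # these parts are searched remotely, in parallel with the local one
    agent = box1.example.com:9312:products_shard1
    agent = box2.example.com:9312:products_shard2
}
```

Queries routed to dist_products are then fanned out to all referenced tables as described in the steps below.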
Multi-queries, or query batches, let you send multiple search queries to Manticore in one network request.
👍 Why use multi-queries?
The main reason is performance. By sending requests to Manticore in a batch instead of one by one, you save time by reducing network round-trips. Also, sending queries in a batch allows Manticore to perform certain internal optimizations. If no batch optimizations can be applied, the queries are simply processed individually.
⛔ When not to use multi-queries?
Multi-queries require all search queries in a batch to be independent, which isn't always the case. Sometimes query B depends on query A's results, meaning query B can only be set up after executing query A. For instance, you might want to display results from a secondary index only if no results were found in the main table, or you may want to specify an offset into the second result set based on the number of matches in the first result set. In these cases, you'll need to use separate queries (or separate batches).
You can run multiple search queries by separating them with semicolons. When Manticore receives a query formatted like this from a client, all inter-statement optimizations will be applied.
Multi-queries don't support queries with FACET. The number of multi-queries in one batch shouldn't exceed max_batch_queries.
- SQL
SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price DESC; SELECT id, price FROM products WHERE MATCH('remove hair') ORDER BY price ASC
There are two major optimizations to be aware of: common query optimization and common subtree optimization.
Common query optimization means that searchd will identify all those queries in a batch where only the sorting and group-by settings differ, and only perform searching once. For instance, if a batch consists of 3 queries, all of them are for "ipod nano", but the 1st query requests the top-10 results sorted by price, the 2nd query groups by vendor ID and requests the top-5 vendors sorted by rating, and the 3rd query requests the max price, full-text search for "ipod nano" will only be performed once, and its results will be reused to build 3 different result sets.
Faceted search is a particularly important case that benefits from this optimization. Indeed, faceted searching can be implemented by running a number of queries: one to retrieve the search results themselves, and a few others with the same full-text query but different group-by settings to retrieve all the required groups of results (top-3 authors, top-5 vendors, etc.). As long as the full-text query and filtering settings stay the same, common query optimization will trigger, greatly improving performance.
Common subtree optimization is even more interesting. It allows searchd to exploit similarities between batched full-text queries. It identifies common full-text query parts (subtrees) in all queries and caches them between queries. For example, consider the following query batch:
donald trump president
donald trump barack obama john mccain
donald trump speech
There's a common two-word part, donald trump, that can be computed only once, then cached and shared across the queries. And that's exactly what common subtree optimization does. The per-query cache size is strictly controlled by the subtree_docs_cache and subtree_hits_cache directives (so that caching all sixteen gazillion documents that match "i am" doesn't exhaust the RAM and instantly kill your server).
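As a hedged sketch, these two cache limits are set in the searchd section of the configuration; the sizes below are purely illustrative, not recommendations:

```ini
searchd
{
    listen             = 9312
    # caps for the per-query subtree caches used by common subtree optimization
    subtree_docs_cache = 8M
    subtree_hits_cache = 16M
}
```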
So how do you tell whether the queries in a given batch were actually optimized? If they were, the respective query log will have a "multiplier" field specifying how many queries were processed together:
Note the "x3" field. It means this query was optimized and processed in a sub-batch of 3 queries.
- log
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.000 2009] 0.040 sec x3 [ext/0/ext 747541 (0,20)] [lj] the
For reference, this is how the regular log would look if the queries were not batched:
- log
[Sun Jul 12 15:18:17.062 2009] 0.059 sec [ext/0/rel 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.156 2009] 0.091 sec [ext/0/ext 747541 (0,20)] [lj] the
[Sun Jul 12 15:18:17.250 2009] 0.092 sec [ext/0/ext 747541 (0,20)] [lj] the
Note how the per-query time in the multi-query case improved by a factor of 1.5x to 2.3x, depending on the specific sorting mode.
Manticore supports SELECT subqueries via SQL in the following format:
SELECT * FROM (SELECT ... ORDER BY cond1 LIMIT X) ORDER BY cond2 LIMIT Y
The outer select allows only ORDER BY and LIMIT clauses. Subqueries currently have two use cases:
- When you have a query with two ranking UDFs, one very fast and one slow, and perform a full-text search with a large match result set. Without a subquery, the query would look like:
SELECT id,slow_rank() as slow,fast_rank() as fast FROM index WHERE MATCH('some common query terms') ORDER BY fast DESC, slow DESC LIMIT 20 OPTION max_matches=1000;
With a subquery, the query can be rewritten as:
SELECT * FROM (SELECT id,slow_rank() as slow,fast_rank() as fast FROM index WHERE MATCH('some common query terms') ORDER BY fast DESC LIMIT 100 OPTION max_matches=1000) ORDER BY slow DESC LIMIT 20;
In the initial query, the slow_rank() UDF is computed for the entire match result set. With SELECT subqueries, only fast_rank() is computed for the entire match result set, while slow_rank() is only computed for a limited set.
- The second case is useful for large result sets coming from a distributed table. For this query:
SELECT * FROM my_dist_index WHERE some_conditions LIMIT 50000;
If you have 20 nodes, each node can send back to the master a maximum of 50K records, resulting in 20 x 50K = 1M records. However, since the master only sends back 50K (out of 1M), it might be good enough for the nodes to send only their top 10K records. With a subquery, you can rewrite the query as:
SELECT * FROM (SELECT * FROM my_dist_index WHERE some_conditions LIMIT 10000) ORDER BY some_attr LIMIT 50000;
In this case, the nodes receive only the inner query and execute it. This means the master will receive only 20x10K=200K records. The master takes all the records received, reorders them by the outer clause, and returns the best 50K records. The subquery helps reduce traffic between the master and the nodes, as well as the master's computation time (since it processes only 200K records instead of 1M).
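The traffic savings above are simple arithmetic; here is a quick sketch (the node count and limits mirror the example):

```python
nodes = 20
outer_limit = 50_000   # LIMIT of the outer query
inner_limit = 10_000   # LIMIT of the inner subquery

# records the master receives without vs. with the subquery
without_subquery = nodes * outer_limit
with_subquery = nodes * inner_limit
print(without_subquery, with_subquery)  # 1000000 200000
```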
Grouping search results is often helpful for obtaining per-group match counts or other aggregations. For instance, it's useful for creating a graph illustrating the monthly count of matching blog posts, or for grouping web search results by site, forum posts by author, etc.
Manticore supports grouping search results by single or multiple columns and by computed expressions. The results can:
- be sorted within a group
- have more than one row returned per group
- have groups filtered
- have groups sorted
- be aggregated with the help of aggregation functions
- SQL
- JSON
General syntax
SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
[HAVING where_condition]
[WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
...
SELECT_expr: { field_name | function_name(...) }
where_condition: {aggregation expression alias | COUNT(*)}
The JSON query format currently supports simple grouping that can retrieve the aggregate values and their count(*).
{
"table": "<table_name>",
"limit": 0,
"aggs": {
"<aggr_name>": {
"terms": {
"field": "<attribute>",
"size": <int value>
}
}
}
}
The standard query output returns the result set without grouping, which can be hidden with limit (or size).
The aggregation requires setting a size for the resulting group set.
Grouping is very simple: just add "GROUP BY smth" to the end of your search query. The "something" can be:
- any non-full-text field from the table: integer, float, string, or MVA (multi-value attribute)
- or, if you used an alias in the SELECT list, you can group by it too
You can omit any aggregation functions in the SELECT list, and it will still work:
- SQL
SELECT release_year FROM films GROUP BY release_year LIMIT 5;+--------------+
| release_year |
+--------------+
| 2004 |
| 2002 |
| 2001 |
| 2005 |
| 2000 |
+--------------+
However, in most cases, you'll want to get some aggregated data for each group, such as:
- COUNT(*) to simply get the number of elements in each group
- or AVG(field) to calculate the average value of the field within the group
For HTTP JSON requests, using a single aggs bucket with limit=0 at the main query level functions similarly to an SQL query with GROUP BY and COUNT(*), delivering equivalent behavior and performance.
- SQL1
- SQL2
- JSON
- PHP
- Python
- Python-asyncio
- Javascript
- Java
- C#
- Rust
- TypeScript
- Go
SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;SELECT release_year, AVG(rental_rate) FROM films GROUP BY release_year LIMIT 5;POST /search -d '
{
"table" : "films",
"limit": 0,
"aggs" :
{
"release_year" :
{
"terms" :
{
"field":"release_year",
"size":100
}
}
}
}
'$index->setName('films');
$search = $index->search('');
$search->limit(0);
$search->facet('release_year','release_year',100);
$results = $search->get();
print_r($results->getFacets());res =searchApi.search({"table":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}})res = await searchApi.search({"table":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}})res = await searchApi.search({"table":"films","limit":0,"aggs":{"release_year":{"terms":{"field":"release_year","size":100}}}});HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("films");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);var agg = new Aggregation("release_year", "release_year");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("films", query);
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);let query = SearchQuery::new();
let aggTerms1 = AggTerms::new {
fields: "release_year".to_string(),
size: Some(100),
};
let agg1 = Aggregation {
terms: Some(Box::new(aggTerms1)),
..Default::default(),
};
let mut aggs = HashMap::new();
aggs.insert("release_year".to_string(), agg1);
let search_req = SearchRequest {
table: "films".to_string(),
query: Some(Box::new(query)),
aggs: serde_json::json!(aggs),
..Default::default(),
};
let search_res = search_api.search(search_req).await;res = await searchApi.search({
index: 'test',
limit: 0,
aggs: {
cat_id: {
terms: { field: "cat", size: 1 }
}
}
});query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("cat")
aggTerms.SetSize(1)
aggregation := manticoreclient.NewAggregation()
aggregation.setTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+--------------+----------++--------------+------------------+
| release_year | avg(rental_rate) |
+--------------+------------------+
| 2004 | 2.78629661 |
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2005 | 2.90397978 |
| 2000 | 3.17556739 |
+--------------+------------------+{
"took": 2,
"timed_out": false,
"hits": {
"total": 10000,
"hits": [
]
},
"release_year": {
"group_brand_id": {
"buckets": [
{
"key": 2004,
"doc_count": 108
},
{
"key": 2002,
"doc_count": 108
},
{
"key": 2000,
"doc_count": 97
},
{
"key": 2005,
"doc_count": 93
},
{
"key": 2001,
"doc_count": 91
}
]
}
}
}Array
(
[release_year] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 2009
[doc_count] => 99
)
[1] => Array
(
[key] => 2008
[doc_count] => 102
)
[2] => Array
(
[key] => 2007
[doc_count] => 93
)
[3] => Array
(
[key] => 2006
[doc_count] => 103
)
[4] => Array
(
[key] => 2005
[doc_count] => 93
)
[5] => Array
(
[key] => 2004
[doc_count] => 108
)
[6] => Array
(
[key] => 2003
[doc_count] => 106
)
[7] => Array
(
[key] => 2002
[doc_count] => 108
)
[8] => Array
(
[key] => 2001
[doc_count] => 91
)
[9] => Array
(
[key] => 2000
[doc_count] => 97
)
)
)
){'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
u'key': 2009},
{u'doc_count': 102,
u'key': 2008},
{u'doc_count': 93,
u'key': 2007},
{u'doc_count': 103,
u'key': 2006},
{u'doc_count': 93,
u'key': 2005},
{u'doc_count': 108,
u'key': 2004},
{u'doc_count': 106,
u'key': 2003},
{u'doc_count': 108,
u'key': 2002},
{u'doc_count': 91,
u'key': 2001},
{u'doc_count': 97,
u'key': 2000}]}},
'hits': {'hits': [], 'max_score': None, 'total': 1000},
'profile': None,
'timed_out': False,
'took': 0}{'aggregations': {u'release_year': {u'buckets': [{u'doc_count': 99,
u'key': 2009},
{u'doc_count': 102,
u'key': 2008},
{u'doc_count': 93,
u'key': 2007},
{u'doc_count': 103,
u'key': 2006},
{u'doc_count': 93,
u'key': 2005},
{u'doc_count': 108,
u'key': 2004},
{u'doc_count': 106,
u'key': 2003},
{u'doc_count': 108,
u'key': 2002},
{u'doc_count': 91,
u'key': 2001},
{u'doc_count': 97,
u'key': 2000}]}},
'hits': {'hits': [], 'max_score': None, 'total': 1000},
'profile': None,
'timed_out': False,
'took': 0}{"took":0,"timed_out":false,"aggregations":{"release_year":{"buckets":[{"key":2009,"doc_count":99},{"key":2008,"doc_count":102},{"key":2007,"doc_count":93},{"key":2006,"doc_count":103},{"key":2005,"doc_count":93},{"key":2004,"doc_count":108},{"key":2003,"doc_count":106},{"key":2002,"doc_count":108},{"key":2001,"doc_count":91},{"key":2000,"doc_count":97}]}},"hits":{"total":1000,"hits":[]}}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=2009, doc_count=99}, {key=2008, doc_count=102}, {key=2007, doc_count=93}, {key=2006, doc_count=103}, {key=2005, doc_count=93}, {key=2004, doc_count=108}, {key=2003, doc_count=106}, {key=2002, doc_count=108}, {key=2001, doc_count=91}, {key=2000, doc_count=97}]}}
hits: class SearchResponseHits {
maxScore: null
total: 1000
hits: []
}
profile: null
}{
"took":0,
"timed_out":false,
"aggregations":
{
"cat_id":
{
"buckets":
[{
"key":1,
"doc_count":1
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}{
"took":0,
"timed_out":false,
"aggregations":
{
"cat_id":
{
"buckets":
[{
"key":1,
"doc_count":1
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}- SQL
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2000 | 97 |
| 2001 | 91 |
| 2002 | 108 |
| 2003 | 106 |
| 2004 | 108 |
+--------------+----------+
Another way is to sort by the aggregation:
- by count(*) to display groups with the most elements first
- by avg(rental_rate) to display the highest-rated movies first. In the example, this is done via an alias: avg(rental_rate) is first mapped to avg in the SELECT list, and then we simply do ORDER BY avg
- SQL1
- SQL2
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;SELECT release_year, AVG(rental_rate) avg FROM films GROUP BY release_year ORDER BY avg desc LIMIT 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2003 | 106 |
| 2006 | 103 |
| 2008 | 102 |
+--------------+----------++--------------+------------+
| release_year | avg |
+--------------+------------+
| 2006 | 3.26184368 |
| 2000 | 3.17556739 |
| 2001 | 3.09989142 |
| 2002 | 3.08259249 |
| 2008 | 2.99000049 |
+--------------+------------+- SQL
- JSON
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;POST /search -d '
{
"size": 0,
"table": "films",
"aggs": {
"cat_release": {
"composite": {
"size":5,
"sources": [
{ "category": { "terms": { "field": "category_id" } } },
{ "release year": { "terms": { "field": "release_year" } } }
]
}
}
}
}
'+-------------+--------------+----------+
| category_id | release_year | count(*) |
+-------------+--------------+----------+
| 1 | 2000 | 5 |
| 1 | 2001 | 2 |
| 1 | 2002 | 6 |
| 1 | 2003 | 6 |
| 1 | 2004 | 5 |
| 1 | 2005 | 10 |
| 1 | 2006 | 4 |
| 1 | 2007 | 5 |
| 1 | 2008 | 7 |
| 1 | 2009 | 14 |
| 2 | 2000 | 10 |
| 2 | 2001 | 5 |
| 2 | 2002 | 6 |
| 2 | 2003 | 6 |
| 2 | 2004 | 10 |
| 2 | 2005 | 4 |
| 2 | 2006 | 5 |
| 2 | 2007 | 8 |
| 2 | 2008 | 8 |
| 2 | 2009 | 4 |
+-------------+--------------+----------+{
"took": 0,
"timed_out": false,
"hits": {
"total": 1000,
"total_relation": "eq",
"hits": []
},
"aggregations": {
"cat_release": {
"after_key": {
"category": 1,
"release year": 2007
},
"buckets": [
{
"key": {
"category": 1,
"release year": 2008
},
"doc_count": 7
},
{
"key": {
"category": 1,
"release year": 2009
},
"doc_count": 14
},
{
"key": {
"category": 1,
"release year": 2005
},
"doc_count": 10
},
{
"key": {
"category": 1,
"release year": 2004
},
"doc_count": 5
},
{
"key": {
"category": 1,
"release year": 2007
},
"doc_count": 5
}
]
}
}
}- SQL
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;+--------------+-----------------------------+
| release_year | title |
+--------------+-----------------------------+
| 2009 | ALICE FANTASIA |
| 2009 | ALIEN CENTER |
| 2008 | AMADEUS HOLY |
| 2008 | ANACONDA CONFESSIONS |
| 2007 | ANGELS LIFE |
| 2007 | ARACHNOPHOBIA ROLLERCOASTER |
+--------------+-----------------------------+
Another essential analytics requirement is sorting the elements within each group. To achieve that, use the WITHIN GROUP ORDER BY ... {ASC|DESC} clause. For example, let's get the highest-rated film of each year. Note that it works in parallel with the regular ORDER BY:
- WITHIN GROUP ORDER BY sorts results within a group
- while plain ORDER BY sorts the groups themselves
The two work entirely independently.
- SQL
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;+--------------+------------------+-------------+
| release_year | title | rental_rate |
+--------------+------------------+-------------+
| 2009 | AMERICAN CIRCUS | 4.990000 |
| 2008 | ANTHEM LUKE | 4.990000 |
| 2007 | ATTACKS HATE | 4.990000 |
| 2006 | ALADDIN CALENDAR | 4.990000 |
| 2005 | AIRPLANE SIERRA | 4.990000 |
+--------------+------------------+-------------+- SQL
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;+--------------+------------+
| release_year | avg |
+--------------+------------+
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2000 | 3.17556739 |
| 2006 | 3.26184368 |
+--------------+------------+
Note: The total_found value in the search query meta info reflects the number of groups that satisfy the HAVING condition. This enables proper pagination when using a HAVING clause with GROUP BY.
There is a function, GROUPBY(), which returns the key of the current group. It's useful in many cases, especially when you group by an MVA or a JSON value.
It can also be used in HAVING, for example, to keep only the years 2000 and 2002.
Note that GROUPBY() is not recommended when you GROUP BY multiple fields at once. It will still work, but since the group key in this case is a compound of the field values, it may not look the way you expect.
- SQL
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2002 | 108 |
| 2000 | 97 |
+--------------+----------+
Manticore supports grouping by MVA. To demonstrate how it works, let's create a table "shoes" with the MVA "sizes" and insert a few documents into it:
create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));
So we have:
SELECT * FROM shoes;
+---------------------+----------+---------+
| id | sizes | title |
+---------------------+----------+---------+
| 1657851069130080265 | 40,41,42 | nike |
| 1657851069130080266 | 41,43 | adidas |
| 1657851069130080267 | 42,43 | reebook |
+---------------------+----------+---------+
If we now GROUP BY "sizes", it will process all our multi-value attributes and return an aggregation for each; in this case, just the count:
- SQL
- JSON
- PHP
- Python
- Javascript
- Python-asyncio
- Java
- C#
- Rust
- Go
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;POST /search -d '
{
"table" : "shoes",
"limit": 0,
"aggs" :
{
"sizes" :
{
"terms" :
{
"field":"sizes",
"size":100
}
}
}
}
'$index->setName('shoes');
$search = $index->search('');
$search->limit(0);
$search->facet('sizes','sizes',100);
$results = $search->get();
print_r($results->getFacets());res =searchApi.search({"table":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}})res = await searchApi.search({"table":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}});res = await searchApi.search({"table":"shoes","limit":0,"aggs":{"sizes":{"terms":{"field":"sizes","size":100}}}})HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("release_year", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","release_year");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("films");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);var agg = new Aggregation("release_year", "release_year");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("films", query);
searchRequest.Limit = 0;
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);let query = SearchQuery::new();
let aggTerms1 = AggTerms::new {
fields: "release_year".to_string(),
size: Some(100),
};
let agg1 = Aggregation {
terms: Some(Box::new(aggTerms1)),
..Default::default(),
};
let mut aggs = HashMap::new();
aggs.insert("release_year".to_string(), agg1);
let search_req = SearchRequest {
table: "films".to_string(),
query: Some(Box::new(query)),
aggs: serde_json::json!(aggs),
limit: serde_json::json!(0),
..Default::default(),
};
let search_res = search_api.search(search_req).await;query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("mva_field")
aggTerms.SetSize(2)
aggregation := manticoreclient.NewAggregation()
aggregation.setTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()+------+----------+
| gb | count(*) |
+------+----------+
| 40 | 1 |
| 41 | 2 |
| 42 | 2 |
| 43 | 2 |
+------+----------+{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"sizes": {
"buckets": [
{
"key": 43,
"doc_count": 2
},
{
"key": 42,
"doc_count": 2
},
{
"key": 41,
"doc_count": 2
},
{
"key": 40,
"doc_count": 1
}
]
}
}
}Array
(
[sizes] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => 43
[doc_count] => 2
)
[1] => Array
(
[key] => 42
[doc_count] => 2
)
[2] => Array
(
[key] => 41
[doc_count] => 2
)
[3] => Array
(
[key] => 40
[doc_count] => 1
)
)
)
){'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
{u'doc_count': 2, u'key': 42},
{u'doc_count': 2, u'key': 41},
{u'doc_count': 1, u'key': 40}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}{'aggregations': {u'sizes': {u'buckets': [{u'doc_count': 2, u'key': 43},
{u'doc_count': 2, u'key': 42},
{u'doc_count': 2, u'key': 41},
{u'doc_count': 1, u'key': 40}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}{"took":0,"timed_out":false,"aggregations":{"sizes":{"buckets":[{"key":43,"doc_count":2},{"key":42,"doc_count":2},{"key":41,"doc_count":2},{"key":40,"doc_count":1}]}},"hits":{"total":3,"hits":[]}}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}class SearchResponse {
took: 0
timedOut: false
aggregations: {release_year={buckets=[{key=43, doc_count=2}, {key=42, doc_count=2}, {key=41, doc_count=2}, {key=40, doc_count=1}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
res = await searchApi.search({
index: 'test',
aggs: {
mva_agg: {
terms: { field: "mva_field", size: 2 }
}
}
});{
"took":0,
"timed_out":false,
"aggregations":
{
"mva_agg":
{
"buckets":
[{
"key":1,
"doc_count":4
},
{
"key":2,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}{
"took":0,
"timed_out":false,
"aggregations":
{
"mva_agg":
{
"buckets":
[{
"key":1,
"doc_count":4
},
{
"key":2,
"doc_count":2
}]
}
},
"hits":
{
"total":5,
"hits":[]
}
}If you have a field of type JSON, you can GROUP BY any node from it. To demonstrate this, let's create a table "products" with a few documents, each having a color in the "meta" JSON field:
create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');
This gives us:
SELECT * FROM products;
+---------------------+-------------------+--------+
| id | meta | title |
+---------------------+-------------------+--------+
| 1657851069130080268 | {"color":"red"} | nike |
| 1657851069130080269 | {"color":"red"} | adidas |
| 1657851069130080270 | {"color":"green"} | puma |
+---------------------+-------------------+--------+
To group the products by color, we can simply use GROUP BY meta.color, and to display the corresponding group key in the SELECT list, we can use GROUPBY():
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- Rust
- TypeScript
- Go
SELECT groupby() color, count(*) from products GROUP BY meta.color;POST /search -d '
{
"table" : "products",
"limit": 0,
"aggs" :
{
"color" :
{
"terms" :
{
"field":"meta.color",
"size":100
}
}
}
}
'$index->setName('products');
$search = $index->search('');
$search->limit(0);
$search->facet('meta.color','color',100);
$results = $search->get();
print_r($results->getFacets());res =searchApi.search({"table":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}})res = await searchApi.search({"table":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}});HashMap<String,Object> aggs = new HashMap<String,Object>(){{
put("color", new HashMap<String,Object>(){{
put("terms", new HashMap<String,Object>(){{
put("field","meta.color");
put("size",100);
}});
}});
}};
searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setLimit(0);
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setAggs(aggs);
searchResponse = searchApi.search(searchRequest);var agg = new Aggregation("color", "meta.color");
agg.Size = 100;
object query = new { match_all=null };
var searchRequest = new SearchRequest("products", query);
searchRequest.Limit = 0;
searchRequest.Aggs = new List<Aggregation> {agg};
var searchResponse = searchApi.Search(searchRequest);let query = SearchQuery::new();
let aggTerms1 = AggTerms::new {
fields: "meta.color".to_string(),
size: Some(100),
};
let agg1 = Aggregation {
terms: Some(Box::new(aggTerms1)),
..Default::default(),
};
let mut aggs = HashMap::new();
aggs.insert("color".to_string(), agg1);
let search_req = SearchRequest {
table: "products".to_string(),
query: Some(Box::new(query)),
aggs: serde_json::json!(aggs),
limit: serde_json::json!(0),
..Default::default(),
};
let search_res = search_api.search(search_req).await;res = await searchApi.search({
index: 'test',
aggs: {
json_agg: {
terms: { field: "json_field.year", size: 1 }
}
}
});query := map[string]interface{} {};
searchRequest.SetQuery(query);
aggTerms := manticoreclient.NewAggregationTerms()
aggTerms.SetField("json_field.year")
aggTerms.SetSize(2)
aggregation := manticoreclient.NewAggregation()
aggregation.setTerms(aggTerms)
searchRequest.SetAggregation(aggregation)
res, _, _ := apiClient.SearchAPI.Search(context.Background()).SearchRequest(*searchRequest).Execute()+-------+----------+
| color | count(*) |
+-------+----------+
| red | 2 |
| green | 1 |
+-------+----------+{
"took": 0,
"timed_out": false,
"hits": {
"total": 3,
"hits": [
]
},
"aggregations": {
"color": {
"buckets": [
{
"key": "green",
"doc_count": 1
},
{
"key": "red",
"doc_count": 2
}
]
}
}
}Array
(
[color] => Array
(
[buckets] => Array
(
[0] => Array
(
[key] => green
[doc_count] => 1
)
[1] => Array
(
[key] => red
[doc_count] => 2
)
)
)
){'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
u'key': u'green'},
{u'doc_count': 2, u'key': u'red'}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}
res = await searchApi.search({"table":"products","limit":0,"aggs":{"color":{"terms":{"field":"meta.color","size":100}}}}){'aggregations': {u'color': {u'buckets': [{u'doc_count': 1,
u'key': u'green'},
{u'doc_count': 2, u'key': u'red'}]}},
'hits': {'hits': [], 'max_score': None, 'total': 3},
'profile': None,
'timed_out': False,
'took': 0}{"took":0,"timed_out":false,"aggregations":{"color":{"buckets":[{"key":"green","doc_count":1},{"key":"red","doc_count":2}]}},"hits":{"total":3,"hits":[]}}class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
class SearchResponse {
took: 0
timedOut: false
aggregations: {color={buckets=[{key=green, doc_count=1}, {key=red, doc_count=2}]}}
hits: class SearchResponseHits {
maxScore: null
total: 3
hits: []
}
profile: null
}
{
"took":0,
"timed_out":false,
"aggregations":
{
"json_agg":
{
"buckets":
[{
"key":2000,
"doc_count":2
},
{
"key":2001,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}{
"took":0,
"timed_out":false,
"aggregations":
{
"json_agg":
{
"buckets":
[{
"key":2000,
"doc_count":2
},
{
"key":2001,
"doc_count":2
}]
}
},
"hits":
{
"total":4,
"hits":[]
}
}
In addition to COUNT(*), which returns the number of elements in each group, you can use various other aggregation functions:
While COUNT(*) returns the number of all elements in the group, COUNT(DISTINCT field) returns the number of unique values of the field in the group, which may be completely different from the total count. For instance, you can have 100 elements in a group, all with the same value for a certain field. COUNT(DISTINCT field) helps determine that. To demonstrate it, let's create a table "students" with the student's name, age, and major:
CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');
So we have:
MySQL [(none)]> SELECT * from students;
+---------------------+------+----------+---------+
| id | age | major | name |
+---------------------+------+----------+---------+
| 1657851069130080271 | 21 | arts | John |
| 1657851069130080272 | 22 | business | William |
| 1657851069130080273 | 21 | cs | Richard |
| 1657851069130080274 | 22 | cs | Rebecca |
| 1657851069130080275 | 21 | arts | Monica |
+---------------------+------+----------+---------+
In this example, you can see that if we group by major and display both COUNT(*) and COUNT(DISTINCT age), it's clear that there are two students who chose the major "cs" with two different ages, while for the major "arts" there are also two students, yet only one unique age.
There can be at most one COUNT(DISTINCT) per query.
By default, counts are approximate
Actually, some of them are exact, while others are approximate. More on that below.
Manticore supports two algorithms for computing counts of distinct values. One is a legacy algorithm that uses a lot of memory and is usually slow. It collects {group; value} pairs, sorts them, and periodically discards duplicates. The benefit of this approach is that it guarantees exact counts in a plain table. You can enable it by setting the distinct_precision_threshold option to 0.
The other algorithm (enabled by default) loads counts into a hash table and returns its size. If the hash table grows too large, its contents are moved into a HyperLogLog. This is where the counts become approximate, since HyperLogLog is a probabilistic algorithm. The advantage is that the maximum memory usage per group is fixed and depends on the accuracy of the HyperLogLog. The overall memory consumption also depends on the max_matches setting, which reflects the number of groups.
The distinct_precision_threshold option sets the threshold below which counts are guaranteed to be exact. The HyperLogLog accuracy setting and the threshold for converting from the hash table to HyperLogLog are derived from this option. Use it with caution, because doubling its value will double the maximum memory required for count calculations. The maximum memory usage can be roughly estimated with this formula: 64 * max_matches * distinct_precision_threshold. Note that this is the worst case, and in most situations count calculations will use significantly less RAM.
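A quick way to get a feel for that worst-case formula in code (the max_matches and distinct_precision_threshold values below are purely illustrative, not defaults taken from the documentation):

```python
def count_distinct_max_ram(max_matches: int, distinct_precision_threshold: int) -> int:
    """Rough worst-case RAM estimate, in bytes, for COUNT(DISTINCT),
    per the formula 64 * max_matches * distinct_precision_threshold."""
    return 64 * max_matches * distinct_precision_threshold

# e.g. room for 1000 groups, threshold of 3500 distinct values per group:
print(count_distinct_max_ram(1000, 3500))  # 224000000 bytes, i.e. ~224 MB
```

This also makes the doubling behavior easy to see: doubling either parameter doubles the estimate.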
COUNT(DISTINCT) against a distributed table, or against a real-time table consisting of multiple disk chunks, may return inaccurate results. However, the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema (an identical set/order of fields, though possibly with different tokenization settings).
- SQL
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;+----------+----------+---------------------+
| major | count(*) | count(distinct age) |
+----------+----------+---------------------+
| arts | 2 | 1 |
| business | 1 | 1 |
| cs | 2 | 2 |
+----------+----------+---------------------+
Often, you want to better understand the contents of each group. You could use GROUP N BY for that, but it would return additional rows you might not want in the output. GROUP_CONCAT() enriches your grouping by concatenating the values of a specific field within the group. Let's take the previous example and improve it by displaying all the ages in each group.
GROUP_CONCAT(field) returns the list as comma-separated values.
- SQL
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major+----------+----------+---------------------+-------------------+
| major | count(*) | count(distinct age) | group_concat(age) |
+----------+----------+---------------------+-------------------+
| arts | 2 | 1 | 21,21 |
| business | 1 | 1 | 22 |
| cs | 2 | 2 | 21,22 |
+----------+----------+---------------------+-------------------+- SQL
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;+------+------------+----------+----------+------------+
| year | sum | min | max | avg |
+------+------------+----------+----------+------------+
| 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
+------+------------+----------+----------+------------+
Grouping is done in fixed memory, which depends on the max_matches setting. If max_matches allows for storage of all found groups, the results will be 100% accurate. However, if the value of max_matches is lower, the results will be less accurate.
When parallel processing is involved, it can become more complicated. When pseudo_sharding is enabled and/or an RT table consisting of multiple disk chunks is used, each chunk or pseudo shard returns a result set of no more than max_matches size. This can lead to inaccuracies in aggregates and group counts when result sets from different threads are merged. To fix that, either use a larger max_matches value or disable parallel processing.
Manticore will try to increase max_matches up to max_matches_increase_threshold if it detects that grouping may return inaccurate results. Detection is based on the number of unique values of the group-by attribute, which is retrieved from secondary indexes (if present).
To ensure accurate aggregates and/or group counts when RT tables or pseudo_sharding are used, accurate_aggregation can be enabled. This will try to increase max_matches up to the threshold, and if the threshold isn't high enough to guarantee accuracy, it will disable parallel processing for the query.
- SQL
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
| 2002 | 74 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
+------+----------+