结果分组搜索通常有助于获取每个分组的匹配计数或其他聚合信息。例如,它对生成按月份显示匹配博客文章数量的图表有用,或者将网页搜索结果按网站分组,论坛帖子按作者分组等。
Manticore 支持按单列或多列以及计算表达式进行搜索结果分组。结果可以:
- 在分组内排序
- 每个分组返回多行
- 对分组进行筛选
- 对分组进行排序
- 使用聚合函数进行聚合
- SQL
- JSON
通用语法
SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
[HAVING where_condition]
[WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
...
SELECT_expr: { field_name | function_name(...) }
where_condition: {aggregation expression alias | COUNT(*)}分组非常简单,只需在 SELECT 查询结尾添加 "GROUP BY smth" 。这里的 "something" 可以是:
- 表中的任何非全文字段:整数、浮点数、字符串、多值属性 (MVA)
- 或者,如果您在
SELECT列表中使用了别名,也可以按别名进行分组
您可以省略 SELECT 列表中的任何聚合函数,依然有效:
- SQL
SELECT release_year FROM films GROUP BY release_year LIMIT 5;+--------------+
| release_year |
+--------------+
| 2004 |
| 2002 |
| 2001 |
| 2005 |
| 2000 |
+--------------+然而,大多数情况下,您可能想要获取每个分组的一些聚合数据,例如:
COUNT(*)用于简单获取每个分组中的元素数量- 或者
AVG(field)用于计算组内某字段的平均值
对于 HTTP JSON 请求,在主查询级别使用带有 limit=0 的单个 aggs 桶,效果类似于带有 GROUP BY 和 COUNT(*) 的 SQL 查询,具有等效的行为和性能。
- SQL1
- SQL2
- JSON
- PHP
- Python
- Python-asyncio
- Javascript
- Java
- C#
- Rust
- TypeScript
- Go
SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+--------------+----------+- SQL
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2000 | 97 |
| 2001 | 91 |
| 2002 | 108 |
| 2003 | 106 |
| 2004 | 108 |
+--------------+----------+另一种方式是按聚合排序:
- 按
count(*)显示元素最多的分组优先 - 按
avg(rental_rate)显示评分最高的电影优先。示例中通过别名实现:avg(rental_rate)先映射为avg在SELECT列表中,然后直接使用ORDER BY avg
- SQL1
- SQL2
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2003 | 106 |
| 2006 | 103 |
| 2008 | 102 |
+--------------+----------+- SQL
- JSON
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;+-------------+--------------+----------+
| category_id | release_year | count(*) |
+-------------+--------------+----------+
| 1 | 2000 | 5 |
| 1 | 2001 | 2 |
| 1 | 2002 | 6 |
| 1 | 2003 | 6 |
| 1 | 2004 | 5 |
| 1 | 2005 | 10 |
| 1 | 2006 | 4 |
| 1 | 2007 | 5 |
| 1 | 2008 | 7 |
| 1 | 2009 | 14 |
| 2 | 2000 | 10 |
| 2 | 2001 | 5 |
| 2 | 2002 | 6 |
| 2 | 2003 | 6 |
| 2 | 2004 | 10 |
| 2 | 2005 | 4 |
| 2 | 2006 | 5 |
| 2 | 2007 | 8 |
| 2 | 2008 | 8 |
| 2 | 2009 | 4 |
+-------------+--------------+----------+- SQL
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;+--------------+-----------------------------+
| release_year | title |
+--------------+-----------------------------+
| 2009 | ALICE FANTASIA |
| 2009 | ALIEN CENTER |
| 2008 | AMADEUS HOLY |
| 2008 | ANACONDA CONFESSIONS |
| 2007 | ANGELS LIFE |
| 2007 | ARACHNOPHOBIA ROLLERCOASTER |
+--------------+-----------------------------+另一个重要的分析需求是对组内元素排序。实现方式是使用 WITHIN GROUP ORDER BY ... {ASC|DESC} 子句。例如,获取每年评分最高的电影。注意它和普通的 ORDER BY 是并行工作的:
WITHIN GROUP ORDER BY排序组内的结果- 而普通的
GROUP BY排序分组本身
这两者相互独立地工作。
- SQL
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;+--------------+------------------+-------------+
| release_year | title | rental_rate |
+--------------+------------------+-------------+
| 2009 | AMERICAN CIRCUS | 4.990000 |
| 2008 | ANTHEM LUKE | 4.990000 |
| 2007 | ATTACKS HATE | 4.990000 |
| 2006 | ALADDIN CALENDAR | 4.990000 |
| 2005 | AIRPLANE SIERRA | 4.990000 |
+--------------+------------------+-------------+- SQL
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;+--------------+------------+
| release_year | avg |
+--------------+------------+
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2000 | 3.17556739 |
| 2006 | 3.26184368 |
+--------------+------------+注意: 搜索查询元信息中的total_found值反映了符合HAVING条件的分组数量。这使得在使用HAVING子句和GROUP BY时能够实现正确的分页。
有一个函数GROUPBY(),返回当前分组的键。在许多情况下都很有用,尤其是在你按MVA分组或按JSON值分组时。
它也可以用于HAVING,例如,只保留年份2000和2002。
注意,当你同时按多个字段GROUP BY时,不建议使用GROUPBY()。它仍然可以工作,但由于此时分组键是字段值的复合,可能不会以你期望的方式展示。
- SQL
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2002 | 108 |
| 2000 | 97 |
+--------------+----------+Manticore支持按MVA分组。为了演示其工作方式,我们创建一个名为"shoes"且带有MVA "sizes"的表,并插入几个文档:
create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));
所以我们有:
SELECT * FROM shoes;
+---------------------+----------+---------+
| id | sizes | title |
+---------------------+----------+---------+
| 1657851069130080265 | 40,41,42 | nike |
| 1657851069130080266 | 41,43 | adidas |
| 1657851069130080267 | 42,43 | reebook |
+---------------------+----------+---------+
如果现在我们按"sizes"分组,它会处理所有我们的多值属性,并为每个返回聚合结果,在本例中仅是计数:
- SQL
- JSON
- PHP
- Python
- Javascript
- Python-asyncio
- Java
- C#
- Rust
- Go
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;+------+----------+
| gb | count(*) |
+------+----------+
| 40 | 1 |
| 41 | 2 |
| 42 | 2 |
| 43 | 2 |
+------+----------+If you have a field of type JSON, you can GROUP BY any node from it. To demonstrate this, let's create a table "products" with a few documents, each having a color in the "meta" JSON field:
create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');
This gives us:
SELECT * FROM products;
+---------------------+-------------------+--------+
| id | meta | title |
+---------------------+-------------------+--------+
| 1657851069130080268 | {"color":"red"} | nike |
| 1657851069130080269 | {"color":"red"} | adidas |
| 1657851069130080270 | {"color":"green"} | puma |
+---------------------+-------------------+--------+
To group the products by color, we can simply use GROUP BY meta.color, and to display the corresponding group key in the SELECT list, we can use GROUPBY():
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
- C#
- Rust
- TypeScript
- Go
SELECT groupby() color, count(*) from products GROUP BY meta.color;+-------+----------+
| color | count(*) |
+-------+----------+
| red | 2 |
| green | 1 |
+-------+----------+除了返回每个分组元素个数的 COUNT(*) 外,你还可以使用各种其他聚合函数:
虽然 COUNT(*) 返回组内所有元素的数量,COUNT(DISTINCT field) 返回该组内字段的唯一值数量,这可能与总数完全不同。例如,你可能有100个元素,但某个字段的值全部相同。COUNT(DISTINCT field) 有助于判断这一点。为了说明这一点,我们创建一个包含学生姓名、年龄和专业的表 "students":
CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');
所以我们有:
MySQL [(none)]> SELECT * from students;
+---------------------+------+----------+---------+
| id | age | major | name |
+---------------------+------+----------+---------+
| 1657851069130080271 | 21 | arts | John |
| 1657851069130080272 | 22 | business | William |
| 1657851069130080273 | 21 | cs | Richard |
| 1657851069130080274 | 22 | cs | Rebecca |
| 1657851069130080275 | 21 | arts | Monica |
+---------------------+------+----------+---------+
在这个例子中,你可以看到,如果我们按照专业分组并显示 COUNT(*) 和 COUNT(DISTINCT age),很明显,有两个选择了专业 "cs" 的学生,且年龄各不相同,但对专业 "arts" 来说,同样是两个学生,却只有一个唯一的年龄。
每个查询最多只能有一个 COUNT(DISTINCT)。
默认情况下,计数是近似的
实际上,有些计数是精确的,有些是近似的。下面会详细介绍。
Manticore 支持两种计算不同值计数的算法。一种是传统算法,使用大量内存且通常较慢。它收集 {group; value} 对,进行排序,并定期去除重复项。这种方法的优点是在普通表中保证计数的精确。你可以通过将 distinct_precision_threshold 选项设置为 0 来启用它。
另一种算法(默认启用)将计数加载到哈希表中,并返回其大小。如果哈希表过大,其内容会被转移到 HyperLogLog 中。此时计数变为近似,因为 HyperLogLog 是一种概率算法。优势是每个组的最大内存使用量是固定的,取决于 HyperLogLog 的精度。总体内存使用也取决于反映组数的 max_matches 设置。
distinct_precision_threshold 选项设置了保证计数精确的阈值。HyperLogLog 的精度设置和“从哈希表转为 HyperLogLog”的阈值均由该设置决定。此选项需谨慎使用,因为其值加倍会使计算计数的最大内存需求加倍。内存最大使用量可以通过公式大致估算:64 * max_matches * distinct_precision_threshold。注意,这是最坏情况,通常计数计算所用内存远低于此值。
对于分布式表或多个磁盘块组成的实时表,COUNT(DISTINCT) 可能返回不准确结果 ,但对于由本地普通表或实时表(具有相同架构,即字段集/顺序相同,但可能分词设置不同)组成的分布式表,应返回准确结果。
- SQL
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;+----------+----------+---------------------+
| major | count(*) | count(distinct age) |
+----------+----------+---------------------+
| arts | 2 | 1 |
| business | 1 | 1 |
| cs | 2 | 2 |
+----------+----------+---------------------+通常,你希望更好地理解每个分组的内容。你可以使用 GROUP N BY 实现,但会返回你可能不想要的额外行。GROUP_CONCAT() 通过将组内指定字段的值连接,丰富了分组信息。我们以上一个例子为例,改进为显示每个组内所有年龄。
GROUP_CONCAT(field) 返回以逗号分隔的值列表。
- SQL
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major+----------+----------+---------------------+-------------------+
| major | count(*) | count(distinct age) | group_concat(age) |
+----------+----------+---------------------+-------------------+
| arts | 2 | 1 | 21,21 |
| business | 1 | 1 | 22 |
| cs | 2 | 2 | 21,22 |
+----------+----------+---------------------+-------------------+- SQL
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;+------+------------+----------+----------+------------+
| year | sum | min | max | avg |
+------+------------+----------+----------+------------+
| 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
+------+------------+----------+----------+------------+分组操作使用固定内存,大小由 max_matches 设置决定。如果 max_matches 允许存储所有找到的组,结果将是100%准确的。但如果 max_matches 过小,结果将不够准确。
当涉及并行处理时,情况会更复杂。当启用 pseudo_sharding 和/或使用多个磁盘块的 RT 表时,每个块或伪分片返回的结果集大小均不超过 max_matches,这可能导致多个线程的结果集合并时聚合和分组计数不准确。解决方法是增大 max_matches 或禁用并行处理。
如果 Manticore 侦测到分组可能返回不准确结果,会尝试将 max_matches 增加到 max_matches_increase_threshold。侦测基于从辅助索引(如存在)获取的分组属性唯一值数量。
当使用 RT 表或 pseudo_sharding 时,为确保聚合和分组计数准确,可启用 accurate_aggregation。该选项将尝试将 max_matches 提升到阈值,若阈值不足以保证准确,则禁用查询的并行处理。
- SQL
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
| 2002 | 74 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
+------+----------+