It's often useful to group search results to get per-group match counts or other aggregations. For instance, to draw a nice graph of how much matching blog posts were there per month or to group web search results by site or forum posts by author etc.
Manticore supports grouping of search results by one or multiple columns and computed expressions. The results can:
- be sorted inside a group
- have more than one row returned per group
- have groups filtered
- have groups sorted
- be aggregated with help of the aggregation functions
- SQL
- JSON
General syntax
SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
[HAVING where_condition]
[WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
...
SELECT_expr: { field_name | function_name(...) }
where_condition: {aggregation expression alias | COUNT(*)}
Grouping is very simple - just add "GROUP BY smth" to the end of your SELECT
query. The something can be:
- any non-full-text field from the table: integer, float, string, MVA (multi-value attribute)
- or if you used an alias in the
SELECT
list - you can GROUP BY it too
You can omit any aggregation functions in the SELECT
list and it will work too:
- SQL
SELECT release_year FROM films GROUP BY release_year LIMIT 5;
+--------------+
| release_year |
+--------------+
| 2004 |
| 2002 |
| 2001 |
| 2005 |
| 2000 |
+--------------+
But in most cases you want to get something aggregated for each group, for example:
COUNT(*)
to just get number of elements in each groups- or
AVG(field)
to get an average value of the field in the group
- SQL1
- SQL2
- JSON
- PHP
- Python
- Javascript
- Java
SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+--------------+----------+
By default the groups are not sorted and the next thing you normally want to do is to order them by something. For example the field you are grouping by:
- SQL
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2000 | 97 |
| 2001 | 91 |
| 2002 | 108 |
| 2003 | 106 |
| 2004 | 108 |
+--------------+----------+
Or vice-versa - by the aggregation:
- by
count(*)
to see those groups that have most elements first - by
avg(rental_rate)
to see most rated movies first. Note that in the example it's done via an alias:avg(rental_rate)
is first mapped toavg
in theSELECT
list and then we just doORDER BY avg
- SQL1
- SQL2
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2003 | 106 |
| 2006 | 103 |
| 2008 | 102 |
+--------------+----------+
In some cases you might want to group not by a single, but by multiple fields at once, for example movie's category and year:
- SQL
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;
+-------------+--------------+----------+
| category_id | release_year | count(*) |
+-------------+--------------+----------+
| 1 | 2000 | 5 |
| 1 | 2001 | 2 |
| 1 | 2002 | 6 |
| 1 | 2003 | 6 |
| 1 | 2004 | 5 |
| 1 | 2005 | 10 |
| 1 | 2006 | 4 |
| 1 | 2007 | 5 |
| 1 | 2008 | 7 |
| 1 | 2009 | 14 |
| 2 | 2000 | 10 |
| 2 | 2001 | 5 |
| 2 | 2002 | 6 |
| 2 | 2003 | 6 |
| 2 | 2004 | 10 |
| 2 | 2005 | 4 |
| 2 | 2006 | 5 |
| 2 | 2007 | 8 |
| 2 | 2008 | 8 |
| 2 | 2009 | 4 |
+-------------+--------------+----------+
Sometimes it's useful to see not a single element per group, but multiple. This can be easily done with help of GROUP N BY
. For example here you can see that we get 2 movies for each year rather than one which would simple GROUP BY release_year
returned.
- SQL
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;
+--------------+-----------------------------+
| release_year | title |
+--------------+-----------------------------+
| 2009 | ALICE FANTASIA |
| 2009 | ALIEN CENTER |
| 2008 | AMADEUS HOLY |
| 2008 | ANACONDA CONFESSIONS |
| 2007 | ANGELS LIFE |
| 2007 | ARACHNOPHOBIA ROLLERCOASTER |
+--------------+-----------------------------+
Another essential analytics demand is to sort elements within a group. For that there's WITHIN GROUP ORDER BY ... {ASC|DESC}
clause. For example let's get the most rated film for each year. Note it works in parallel with just ORDER BY
:
WITHIN GROUP ORDER BY
sorts results inside a group- while just
GROUP BY
sorts the groups themselves
The work absolutely independently.
- SQL
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;
+--------------+------------------+-------------+
| release_year | title | rental_rate |
+--------------+------------------+-------------+
| 2009 | AMERICAN CIRCUS | 4.990000 |
| 2008 | ANTHEM LUKE | 4.990000 |
| 2007 | ATTACKS HATE | 4.990000 |
| 2006 | ALADDIN CALENDAR | 4.990000 |
| 2005 | AIRPLANE SIERRA | 4.990000 |
+--------------+------------------+-------------+
HAVING expression
is a useful clause to filter groups. If WHERE
is applied before grouping HAVING
works with the groups. For example let's leave only those years when an average rental rate of the films of that year was higher than 3. We get only 4 years:
- SQL
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;
+--------------+------------+
| release_year | avg |
+--------------+------------+
| 2002 | 3.08259249 |
| 2001 | 3.09989142 |
| 2000 | 3.17556739 |
| 2006 | 3.26184368 |
+--------------+------------+
There is a function GROUPBY()
which returns the key of the current group. It's useful in many cases especially when you GROUP BY an MVA or a JSON value.
It can be also used in HAVING
to for example leave only years 2000 and 2002.
Note that GROUPBY()
is not recommended for use when you GROUP BY multiple fields at once. It will still work, but since the group key in this case is compound of field values it may look not the way you expect.
- SQL
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
| 2002 | 108 |
| 2000 | 97 |
+--------------+----------+
Manticore supports grouping by MVA. To show how it works let's create a table "shoes" with MVA "sizes" and insert few documents into it:
create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));
so we have:
SELECT * FROM shoes;
+---------------------+----------+---------+
| id | sizes | title |
+---------------------+----------+---------+
| 1657851069130080265 | 40,41,42 | nike |
| 1657851069130080266 | 41,43 | adidas |
| 1657851069130080267 | 42,43 | reebook |
+---------------------+----------+---------+
If we now GROUP BY "sizes" it will process all our multi-value attributes and will return aggregation for each, in this case just count:
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;
+------+----------+
| gb | count(*) |
+------+----------+
| 40 | 1 |
| 41 | 2 |
| 42 | 2 |
| 43 | 2 |
+------+----------+
If you have a field of type JSON you can GROUP BY any node from it. To demonstrate it let's create a table "products" with few documents each having color in the "meta" JSON field:
create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');
This gives us:
SELECT * FROM products;
+---------------------+-------------------+--------+
| id | meta | title |
+---------------------+-------------------+--------+
| 1657851069130080268 | {"color":"red"} | nike |
| 1657851069130080269 | {"color":"red"} | adidas |
| 1657851069130080270 | {"color":"green"} | puma |
+---------------------+-------------------+--------+
To group the products by color we can just GROUP BY meta.color
and to show the corresponding group key in the SELECT
list we can use GROUPBY()
:
- SQL
- JSON
- PHP
- Python
- Javascript
- Java
SELECT groupby() color, count(*) from products GROUP BY meta.color;
+-------+----------+
| color | count(*) |
+-------+----------+
| red | 2 |
| green | 1 |
+-------+----------+
Besides COUNT(*)
which returns number of elements in each group you can use different other aggregation functions:
While COUNT(*)
returns number of all elements in the group COUNT( DISTINCT field)
returns number of different values of the field in the group which may be absolutely different from the total count: you can have 100 elements in the group, but all with the same value of some field. COUNT(DISTINCT field)
helps to figure that out. To demonstrate it let's create table "students" with student's name, age and major:
CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');
so we have:
MySQL [(none)]> SELECT * from students;
+---------------------+------+----------+---------+
| id | age | major | name |
+---------------------+------+----------+---------+
| 1657851069130080271 | 21 | arts | John |
| 1657851069130080272 | 22 | business | William |
| 1657851069130080273 | 21 | cs | Richard |
| 1657851069130080274 | 22 | cs | Rebecca |
| 1657851069130080275 | 21 | arts | Monica |
+---------------------+------+----------+---------+
In the example you can see that if we GROUP BY major and show both COUNT(*)
and COUNT(DISTINCT age)
it gets clear that there are 2 students that chose major "cs" and 2 unique ages, but for the major "arts" there are also 2 students, but only one unique age.
There can be at most one COUNT(DISTINCT)
per query.
COUNT(DISTINCT)
against a distributed table or a real-time table consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed table consisting of local plain or real-time tables with the same schema (identical set/order of fields, but may be different tokenization settings).
- SQL
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;
+----------+----------+---------------------+
| major | count(*) | count(distinct age) |
+----------+----------+---------------------+
| arts | 2 | 1 |
| business | 1 | 1 |
| cs | 2 | 2 |
+----------+----------+---------------------+
Often you want to understand better the contents of each group. You can use GROUP N BY for that, but it would return additional rows you might not want in the output. GROUP_CONCAT()
enriches your grouping by concatenating values of some field in the group. Let's take the previous example and improve it by getting all the ages in each group.
GROUP_CONCAT(field)
returns the list comma-separated.
- SQL
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major
+----------+----------+---------------------+-------------------+
| major | count(*) | count(distinct age) | group_concat(age) |
+----------+----------+---------------------+-------------------+
| arts | 2 | 1 | 21,21 |
| business | 1 | 1 | 22 |
| cs | 2 | 2 | 21,22 |
+----------+----------+---------------------+-------------------+
- SQL
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;
+------+------------+----------+----------+------------+
| year | sum | min | max | avg |
+------+------------+----------+----------+------------+
| 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
+------+------------+----------+----------+------------+
Grouping is done in fixed memory, which depends on the max_matches setting. If max_matches
allows for storage of all found groups, the results will be 100% accurate. However, if the value of max_matches
is lower, the results will be less accurate.
When parallel processing is involved, it can become more complicated. When pseudo_sharding
is enabled and/or when using an RT index with several disk chunks, each chunk or pseudo shard gets a result set that is no larger than max_matches
. This can lead to inaccuracies in aggregates and group counts when the results sets from different threads are merged. To fix this, either a larger max_matches
value or disabling parallel processing can be used.
Manticore will try to increase max_matches
up to max_matches_increase_threshold if it detects that groupby may return inaccurate results. Detection is based on the number of unique values of the groupby attribute, which is retrieved from secondary indexes (if present).
To ensure accurate aggregates and/or group counts when using RT indexes or pseudo_sharding
, accurate_aggregation
can be enabled. This will try to increase max_matches
up to the threshold, and if the threshold is not high enough, Manticore will disable parallel processing for the query.
- SQL
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
| 2005 | 93 |
| 2000 | 97 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 76 |
| 2002 | 74 |
+------+----------+
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 | 108 |
| 2002 | 108 |
| 2001 | 91 |
+------+----------+