Grouping search results

It's often useful to group search results to get per-group match counts or other aggregations. For instance, to draw a nice graph of how much matching blog posts were there per month or to group web search results by site or forum posts by author etc.

Manticore supports grouping of search results by one or multiple columns and computed expressions. The results can:

  • be sorted inside a group
  • have more than one row returned per group
  • have groups filtered
  • have groups sorted
  • be aggregated with help of
‹›
  • SQL
  • HTTP
📋

General syntax

SELECT {* | SELECT_expr [, SELECT_expr ...]}
...
GROUP BY {field_name | alias } [, ...]
[HAVING where_condition]
[WITHIN GROUP ORDER BY field_name {ASC | DESC} [, ...]]
...

SELECT_expr: { field_name | function_name(...) }
where_condition: {aggregation expression alias | COUNT(*)}

Just Grouping

Grouping is very simple - just add "GROUP BY smth" to the end of your SELECT query. The something can be:

  • any non-full-text field from the index: integer, float, string, MVA (multi-value attribute)
  • or if you used an alias in the SELECT list - you can GROUP BY it too

You can omit any aggregation functions in the SELECT list and it will work too:

‹›
  • SQL
SQL
📋
SELECT release_year FROM films GROUP BY release_year LIMIT 5;
‹›
Response
+--------------+
| release_year |
+--------------+
|         2004 |
|         2002 |
|         2001 |
|         2005 |
|         2000 |
+--------------+

But in most cases you want to get something aggregated for each group, for example:

  • COUNT(*) to just get number of elements in each groups
  • or AVG(field) to get an average value of the field in the group
‹›
  • SQL1
  • SQL2
  • HTTP
  • PHP
  • Python
  • Javascript
  • Java
📋
SELECT release_year, count(*) FROM films GROUP BY release_year LIMIT 5;
‹›
Response
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
|         2004 |      108 |
|         2002 |      108 |
|         2001 |       91 |
|         2005 |       93 |
|         2000 |       97 |
+--------------+----------+
Sorting groups

By default the groups are not sorted and the next thing you normally want to do is to order them by something. For example the field you are grouping by:

‹›
  • SQL
SQL
📋
SELECT release_year, count(*) from films GROUP BY release_year ORDER BY release_year asc limit 5;
‹›
Response
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
|         2000 |       97 |
|         2001 |       91 |
|         2002 |      108 |
|         2003 |      106 |
|         2004 |      108 |
+--------------+----------+

Or vice-versa - by the aggregation:

  • by count(*) to see those groups that have most elements first
  • by avg(rental_rate) to see most rated movies first. Note that in the example it's done via an alias: avg(rental_rate) is first mapped to avg in the SELECT list and then we just do ORDER BY avg
‹›
  • SQL1
  • SQL2
📋
SELECT release_year, count(*) FROM films GROUP BY release_year ORDER BY count(*) desc LIMIT 5;
‹›
Response
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
|         2004 |      108 |
|         2002 |      108 |
|         2003 |      106 |
|         2006 |      103 |
|         2008 |      102 |
+--------------+----------+
GROUP BY multiple fields at once

In some cases you might want to group not by a single, but by multiple fields at once, for example movie's category and year:

‹›
  • SQL
SQL
📋
SELECT category_id, release_year, count(*) FROM films GROUP BY category_id, release_year ORDER BY category_id ASC, release_year ASC;
‹›
Response
+-------------+--------------+----------+
| category_id | release_year | count(*) |
+-------------+--------------+----------+
|           1 |         2000 |        5 |
|           1 |         2001 |        2 |
|           1 |         2002 |        6 |
|           1 |         2003 |        6 |
|           1 |         2004 |        5 |
|           1 |         2005 |       10 |
|           1 |         2006 |        4 |
|           1 |         2007 |        5 |
|           1 |         2008 |        7 |
|           1 |         2009 |       14 |
|           2 |         2000 |       10 |
|           2 |         2001 |        5 |
|           2 |         2002 |        6 |
|           2 |         2003 |        6 |
|           2 |         2004 |       10 |
|           2 |         2005 |        4 |
|           2 |         2006 |        5 |
|           2 |         2007 |        8 |
|           2 |         2008 |        8 |
|           2 |         2009 |        4 |
+-------------+--------------+----------+
Give me N rows

Sometimes it's useful to see not a single element per group, but multiple. This can be easily done with help of GROUP N BY. For example here you can see that we get 2 movies for each year rather than one which would simple GROUP BY release_year returned.

‹›
  • SQL
SQL
📋
SELECT release_year, title FROM films GROUP 2 BY release_year ORDER BY release_year DESC LIMIT 6;
‹›
Response
+--------------+-----------------------------+
| release_year | title                       |
+--------------+-----------------------------+
|         2009 | ALICE FANTASIA              |
|         2009 | ALIEN CENTER                |
|         2008 | AMADEUS HOLY                |
|         2008 | ANACONDA CONFESSIONS        |
|         2007 | ANGELS LIFE                 |
|         2007 | ARACHNOPHOBIA ROLLERCOASTER |
+--------------+-----------------------------+
Sorting inside a group

Another essential analytics demand is to sort elements within a group. For that there's WITHIN GROUP ORDER BY ... {ASC|DESC} clause. For example let's get the most rated film for each year. Note it works in parallel with just ORDER BY:

  • WITHIN GROUP ORDER BY sorts results inside a group
  • while just GROUP BY sorts the groups themselves

The work absolutely independently.

‹›
  • SQL
SQL
📋
SELECT release_year, title, rental_rate FROM films GROUP BY release_year WITHIN GROUP ORDER BY rental_rate DESC ORDER BY release_year DESC LIMIT 5;
‹›
Response
+--------------+------------------+-------------+
| release_year | title            | rental_rate |
+--------------+------------------+-------------+
|         2009 | AMERICAN CIRCUS  |    4.990000 |
|         2008 | ANTHEM LUKE      |    4.990000 |
|         2007 | ATTACKS HATE     |    4.990000 |
|         2006 | ALADDIN CALENDAR |    4.990000 |
|         2005 | AIRPLANE SIERRA  |    4.990000 |
+--------------+------------------+-------------+
Filter groups

HAVING expression is a useful clause to filter groups. If WHERE is applied before grouping HAVING works with the groups. For example let's leave only those years when an average rental rate of the films of that year was higher than 3. We get only 4 years:

‹›
  • SQL
SQL
📋
SELECT release_year, avg(rental_rate) avg FROM films GROUP BY release_year HAVING avg > 3;
‹›
Response
+--------------+------------+
| release_year | avg        |
+--------------+------------+
|         2002 | 3.08259249 |
|         2001 | 3.09989142 |
|         2000 | 3.17556739 |
|         2006 | 3.26184368 |
+--------------+------------+
GROUPBY()

There is a function GROUPBY() which returns the key of the current group. It's useful in many cases especially when you GROUP BY an MVA or a JSON value.

It can be also used in HAVING to for example leave only years 2000 and 2002.

Note that GROUPBY()is not recommended for use when you GROUP BY multiple fields at once. It will still work, but since the group key in this case is compound of field values it may look not the way you expect.

‹›
  • SQL
SQL
📋
SELECT release_year, count(*) FROM films GROUP BY release_year HAVING GROUPBY() IN (2000, 2002);
‹›
Response
+--------------+----------+
| release_year | count(*) |
+--------------+----------+
|         2002 |      108 |
|         2000 |       97 |
+--------------+----------+
Grouping by MVA (multi-value attributes)

Manticore supports grouping by MVA. To show how it works let's create a table "shoes" with MVA "sizes" and insert few documents into it:

create table shoes(title text, sizes multi);
insert into shoes values(0,'nike',(40,41,42)),(0,'adidas',(41,43)),(0,'reebook',(42,43));

so we have:

SELECT * FROM shoes;
+---------------------+----------+---------+
| id                  | sizes    | title   |
+---------------------+----------+---------+
| 1657851069130080265 | 40,41,42 | nike    |
| 1657851069130080266 | 41,43    | adidas  |
| 1657851069130080267 | 42,43    | reebook |
+---------------------+----------+---------+

If we now GROUP BY "sizes" it will process all our multi-value attributes and will return aggregation for each, in this case just count:

‹›
  • SQL
  • HTTP
  • PHP
  • Python
  • Javascript
  • Java
📋
SELECT groupby() gb, count(*) FROM shoes GROUP BY sizes ORDER BY gb asc;
‹›
Response
+------+----------+
| gb   | count(*) |
+------+----------+
|   40 |        1 |
|   41 |        2 |
|   42 |        2 |
|   43 |        2 |
+------+----------+
Grouping by a JSON node

If you have a field of type JSON you can GROUP BY any node from it. To demonstrate it let's create a table "products" with few documents each having color in the "meta" JSON field:

create table products(title text, meta json);
insert into products values(0,'nike','{"color":"red"}'),(0,'adidas','{"color":"red"}'),(0,'puma','{"color":"green"}');

This gives us:

SELECT * FROM products;
+---------------------+-------------------+--------+
| id                  | meta              | title  |
+---------------------+-------------------+--------+
| 1657851069130080268 | {"color":"red"}   | nike   |
| 1657851069130080269 | {"color":"red"}   | adidas |
| 1657851069130080270 | {"color":"green"} | puma   |
+---------------------+-------------------+--------+

To group the products by color we can just GROUP BY meta.color and to show the corresponding group key in the SELECT list we can use GROUPBY():

‹›
  • SQL
  • HTTP
  • PHP
  • Python
  • Javascript
  • Java
📋
SELECT groupby() color, count(*) from products GROUP BY meta.color;
‹›
Response
+-------+----------+
| color | count(*) |
+-------+----------+
| red   |        2 |
| green |        1 |
+-------+----------+

Aggregation functions

Besides COUNT(*) which returns number of elements in each group you can use different other aggregation functions:

COUNT(DISTINCT field)

While COUNT(*) returns number of all elements in the group COUNT( DISTINCT field) returns number of different values of the field in the group which may be absolutely different from the total count: you can have 100 elements in the group, but all with the same value of some field. COUNT(DISTINCT field) helps to figure that out. To demonstrate it let's create table "students" with student's name, age and major:

CREATE TABLE students(name text, age int, major string);
INSERT INTO students values(0,'John',21,'arts'),(0,'William',22,'business'),(0,'Richard',21,'cs'),(0,'Rebecca',22,'cs'),(0,'Monica',21,'arts');

so we have:

MySQL [(none)]> SELECT * from students;
+---------------------+------+----------+---------+
| id                  | age  | major    | name    |
+---------------------+------+----------+---------+
| 1657851069130080271 |   21 | arts     | John    |
| 1657851069130080272 |   22 | business | William |
| 1657851069130080273 |   21 | cs       | Richard |
| 1657851069130080274 |   22 | cs       | Rebecca |
| 1657851069130080275 |   21 | arts     | Monica  |
+---------------------+------+----------+---------+

In the example you can see that if we GROUP BY major and show both COUNT(*) and COUNT(DISTINCT age) it gets clear that there are 2 students that chose major "cs" and 2 unique ages, but for the major "arts" there are also 2 students, but only one unique age.

There can be at most one COUNT(DISTINCT) per query.

COUNT(DISTINCT) against a distributed index or a real-time index consisting of multiple disk chunks may return inaccurate results, but the result should be accurate for a distributed index consisting of local plain or real-time indexes with the same schema (identical set/order of fields, but may be different tokenization settings).

‹›
  • SQL
SQL
📋
SELECT major, count(*), count(distinct age) FROM students GROUP BY major;
‹›
Response
+----------+----------+---------------------+
| major    | count(*) | count(distinct age) |
+----------+----------+---------------------+
| arts     |        2 |                   1 |
| business |        1 |                   1 |
| cs       |        2 |                   2 |
+----------+----------+---------------------+
GROUP_CONCAT(field)

Often you want to understand better the contents of each group. You can use GROUP N BY for that, but it would return additional rows you might not want in the output. GROUP_CONCAT() enriches your grouping by concatenating values of some field in the group. Let's take the previous example and improve it by getting all the ages in each group.

GROUP_CONCAT(field) returns the list comma-separated.

‹›
  • SQL
SQL
📋
SELECT major, count(*), count(distinct age), group_concat(age) FROM students GROUP BY major
‹›
Response
+----------+----------+---------------------+-------------------+
| major    | count(*) | count(distinct age) | group_concat(age) |
+----------+----------+---------------------+-------------------+
| arts     |        2 |                   1 | 21,21             |
| business |        1 |                   1 | 22                |
| cs       |        2 |                   2 | 21,22             |
+----------+----------+---------------------+-------------------+
SUM(), MIN(), MAX(), AVG()

And of course you can get sum, average, minimal and maximum values in the group.

‹›
  • SQL
SQL
📋
SELECT release_year year, sum(rental_rate) sum, min(rental_rate) min, max(rental_rate) max, avg(rental_rate) avg FROM films GROUP BY release_year ORDER BY year asc LIMIT 5;
‹›
Response
+------+------------+----------+----------+------------+
| year | sum        | min      | max      | avg        |
+------+------------+----------+----------+------------+
| 2000 | 308.030029 | 0.990000 | 4.990000 | 3.17556739 |
| 2001 | 282.090118 | 0.990000 | 4.990000 | 3.09989142 |
| 2002 | 332.919983 | 0.990000 | 4.990000 | 3.08259249 |
| 2003 | 310.940063 | 0.990000 | 4.990000 | 2.93339682 |
| 2004 | 300.920044 | 0.990000 | 4.990000 | 2.78629661 |
+------+------------+----------+----------+------------+

Grouping accuracy

Grouping is done in fixed memory which depends on the max_matches setting. If the max_matches allows to store all found groups, the results will be 100% correct. The less the value the less accurate will be the results.

‹›
  • SQL
SQL
📋
MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 |      108 |
| 2002 |      108 |
| 2001 |       91 |
| 2005 |       93 |
| 2000 |       97 |
+------+----------+

MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=1;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 |       76 |
+------+----------+

MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=2;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 |       76 |
| 2002 |       74 |
+------+----------+

MySQL [(none)]> SELECT release_year year, count(*) FROM films GROUP BY year limit 5 option max_matches=3;
+------+----------+
| year | count(*) |
+------+----------+
| 2004 |      108 |
| 2002 |      108 |
| 2001 |       91 |
+------+----------+