Aggregation

There are two ways of doing aggregation, one on the result of a whole query and one on the separate groups of a GROUP BY query. Aggregation Function will always return one result value for a single group. You can imagine the matching rows of a SELECT statement without GROUP BY as one group. One row will always be returned.

For a short summary of aggregation functions see Data Aggregation.

count

count(*)

This aggregation function simply returns the number of rows that match the query.

count(columName) is also possible, but currently only works on a primary key column. The semantics are the same. The return value is always of type long.

cr> select count(*) from locations;
+----------+
| count(*) |
+----------+
| 13       |
+----------+
SELECT 1 row in set (... sec)

count(*) can also be used on group by queries:

cr> select count(*), kind from locations group by kind order by kind asc;
+----------+-------------+
| count(*) | kind        |
+----------+-------------+
| 4        | Galaxy      |
| 5        | Planet      |
| 4        | Star System |
+----------+-------------+
SELECT 3 rows in set (... sec)

count(columnName)

In contrast to the count(*) function the count function used with a column name as parameter will return the number of rows with a non-NULL value in that column.

Example:

cr> select count(name), count(*), date from locations group by date
... order by count(name) desc, count(*) desc;
+-------------+----------+---------------+
| count(name) | count(*) | date          |
+-------------+----------+---------------+
| 7           | 8        | 1373932800000 |
| 4           | 4        | 308534400000  |
| 1           | 1        | 1367366400000 |
+-------------+----------+---------------+
SELECT 3 rows in set (... sec)

count(distinct columnName)

The count aggregation function also supports the distinct keyword. This keyword changes the behaviour of the function so that it will only count the number of distinct values in this column that are not NULL:

cr> select count(distinct kind), count(*), date
... from locations group by date
... order by count(distinct kind) desc, count(*) desc;
+----------------------+----------+---------------+
| count(DISTINCT kind) | count(*) | date          |
+----------------------+----------+---------------+
| 3                    | 8        | 1373932800000 |
| 3                    | 4        | 308534400000  |
| 1                    | 1        | 1367366400000 |
+----------------------+----------+---------------+
 SELECT 3 rows in set (... sec)
cr> select count(distinct kind) from locations;
+----------------------+
| count(DISTINCT kind) |
+----------------------+
| 3                    |
+----------------------+
SELECT 1 row in set (... sec)

min

The min aggregation function returns the smallest value in a column that is not NULL. Its single argument is a column name and its return value is always of the type of that column. Example:

cr> select min(position), kind
... from locations
... where name not like 'North %'
... group by kind order by min(position) asc, kind asc;
+---------------+-------------+
| min(position) | kind        |
+---------------+-------------+
| 1             | Planet      |
| 1             | Star System |
| 2             | Galaxy      |
+---------------+-------------+
SELECT 3 rows in set (... sec)
cr> select min(date) from locations;
+--------------+
| min(date)    |
+--------------+
| 308534400000 |
+--------------+
SELECT 1 row in set (... sec)

min returns NULL if the column does not contain any value but NULL. It is allowed on numeric columns (byte, short, integer, long, float, double), on timestamp and string columns. On string columns it will return the lexicographically smallest.:

cr> select min(name), kind from locations
... group by kind order by kind asc;
+------------------------------------+-------------+
| min(name)                          | kind        |
+------------------------------------+-------------+
| Galactic Sector QQ7 Active J Gamma | Galaxy      |
|                                    | Planet      |
| Aldebaran                          | Star System |
+------------------------------------+-------------+
SELECT 3 rows in set (... sec)

max

It behaves exactly like min but returns the biggest value in a column that is not NULL. Some Examples:

cr> select max(position), kind from locations
... group by kind order by kind desc;
+---------------+-------------+
| max(position) | kind        |
+---------------+-------------+
| 4             | Star System |
| 5             | Planet      |
| 6             | Galaxy      |
+---------------+-------------+
SELECT 3 rows in set (... sec)
cr> select max(position) from locations;
+---------------+
| max(position) |
+---------------+
| 6             |
+---------------+
SELECT 1 row in set (... sec)
cr> select max(name), kind from locations
... group by kind order by max(name) desc;
+-------------------+-------------+
| max(name)         | kind        |
+-------------------+-------------+
| Outer Eastern Rim | Galaxy      |
| Bartledan         | Planet      |
| Altair            | Star System |
+-------------------+-------------+
SELECT 3 rows in set (... sec)

sum

The sum aggregation function returns the sum of all the values in a column that are not NULL as a double value. Its single argument is the column name of a numeric column or timestamp column. sum can not be used on other column types.

cr> select sum(position), kind from locations
... group by kind order by sum(position) asc;
+---------------+-------------+
| sum(position) | kind        |
+---------------+-------------+
| 10.0          | Star System |
| 13.0          | Galaxy      |
| 15.0          | Planet      |
+---------------+-------------+
SELECT 3 rows in set (... sec)
cr> select sum(position) as position_sum from locations;
+--------------+
| position_sum |
+--------------+
| 38.0         |
+--------------+
SELECT 1 row in set (... sec)
cr> select sum(name), kind from locations group by kind order by sum(name) desc;
SQLActionException[UnsupportedFeatureException: unknown function: sum(string)]

avg / mean

The avg or mean aggregation function returns the arithmetic mean, the average, of all values in a column that are not NULL as a double value. It accepts all numeric columns and timestamp columns as single argument. Using avg on other column types is not allowed.

Example:

cr> select avg(position), kind from locations
... group by kind order by kind;
+---------------+-------------+
| avg(position) | kind        |
+---------------+-------------+
| 3.25          | Galaxy      |
| 3.0           | Planet      |
| 2.5           | Star System |
+---------------+-------------+
SELECT 3 rows in set (... sec)

geometric_mean

The geometric_mean aggregation function computes the geometric mean, a mean for positive numbers. For details see: Geometric Mean.

geometric mean is defined on all numeric types and on timestamp. It always returns double values. If a value is negative, all values were null or we got no value at all NULL is returned. If any of the aggregated values is 0 the result will be 0.0 as well.

Note

Due to java double precision arithmetic it is possible that any two executions of the aggregation function on the same data produce slightly differing results.

Example:

cr> select geometric_mean(position), kind from locations
... group by kind order by kind;
+--------------------------+-------------+
| geometric_mean(position) | kind        |
+--------------------------+-------------+
|       2.6321480259049848 | Galaxy      |
|       2.6051710846973517 | Planet      |
|       2.213363839400643  | Star System |
+--------------------------+-------------+
SELECT 3 rows in set (... sec)

variance

The variance aggregation function computes the Variance of the set of non-null values in a column. It is a measure about how far a set of numbers is spread. A variance of 0.0 indicates that all values are the same.

variance is defined on all numeric types and on timestamp. It returns a double value. If all values were null or we got no value at all NULL is returned.

Example:

cr> select variance(position), kind from locations
... group by kind order by kind desc;
+--------------------+-------------+
| variance(position) | kind        |
+--------------------+-------------+
|             1.25   | Star System |
|             2.0    | Planet      |
|             3.6875 | Galaxy      |
+--------------------+-------------+
SELECT 3 rows in set (... sec)

Note

Due to java double precision arithmetic it is possible that any two executions of the aggregation function on the same data produce slightly differing results.

stddev

The stddev aggregation function computes the Standard Deviation of the set of non-null values in a column. It is a measure of the variation of data values. A low standard deviation indicates that the values tend to be near the mean.

stddev is defined on all numeric types and on timestamp. It always returns double values. If all values were null or we got no value at all NULL is returned.

Example:

cr> select stddev(position), kind from locations
... group by kind order by kind;
+--------------------+-------------+
|   stddev(position) | kind        |
+--------------------+-------------+
| 1.920286436967152  | Galaxy      |
| 1.4142135623730951 | Planet      |
| 1.118033988749895  | Star System |
+--------------------+-------------+
SELECT 3 rows in set (... sec)

Note

Due to java double precision arithmetic it is possible that any two executions of the aggregation function on the same data produce slightly differing results.

arbitrary

The arbitrary aggregation function returns a single value of a column. Which value it returns is not defined. It accepts references to columns of all primitive types. Using arbitrary on Object columns is not supported. Its return type is the type of its parameter column and can be NULL if the column contains NULL values.

Example:

cr> select arbitrary(position) from locations;
+---------------------+
| arbitrary(position) |
+---------------------+
| ...                 |
+---------------------+
SELECT 1 row in set (... sec)
cr> select arbitrary(name), kind from locations
... where name != ''
... group by kind order by kind desc;
+-...-------------+-------------+
| arbitrary(name) | kind        |
+-...-------------+-------------+
| ...             | Star System |
| ...             | Planet      |
| ...             | Galaxy      |
+-...-------------+-------------+
SELECT 3 rows in set (... sec)

An example use case is to group a table with many rows per user by user_id and get the username for every group, that means every user. This works as rows with same user_id have the same username. This method performs better than grouping on username as grouping on number types is generally faster than on strings. The advantage is that the arbitrary function does very little to no computation as for example max aggregation function would do.