Generated Columns in Crate.IO 0.54.0

Author
Ruslan Kovalov
Date
December 12, 2015

Crate are pleased to announce the support of generated columns. The new functionality will be available from the fresh 0.54.0 testing release!

The idea behind generated columns is to have its values computed from generation expressions provided at the time of definition, unlike regular columns, where values are set explicitly. For those of you interested, generated columns in Crate implements Feature T175 from the ANSI SQL:2006 standard.

Defining expressions is flexible. It's possible to create complex expressions, reference values of other columns and providing a data type for columns is optional and can be inferred by the return type of the supplied expression, but the values must be equal.

Use cases

Generated columns provide an effortless way to solve a variety of problems, let's take a closer look:

Partition by expression

Imagine that you have to analyze time-based data, e.g. a log with a few hundred millions of records, where only ones in particular time slices have a value. During analyzing you might have group by queries which are too slow on your current setup if you query the whole data set.

You need a smart solution to improve the query speed. One way to achieve this is to look up data in the sliced samples which contain the records of interest, instead of iterating over all existing records. That is where partitioned tables come to the rescue. This functionality lets you have separate partitions but still query them like a single table.

For our log data and analyzing queries use case, the most suitable method would be to partition the table by a column containing a timestamp. But what if the column in the partitioned by clause has low granularity or does not suit our queries well enough. In both cases we would end up either with too many partitions in the table or find ourselves performing additional work to adopt the queries.

With generated columns the challenge can be simply addressed. Let's create an example of analyzing data logs on a monthly basis, while having day granularity for the timestamped column.

First, create a log table with the new generated column month and put it into partitioned by clause

cr> create table logs(
      log string,
      created timestamp,
      month as date_trunc('month', created)
    ) partitioned by (month);
CREATE OK (... sec)

In the next step, we insert log records into the table.

cr> insert into logs (log, created) values
      ('log_2015_01_01', '2015-01-01'),
      ('log_2015-01_02', '2015-01-02'),
      ('log_2015_02_01', '2015-02-01'),
      ('log_2015_02_02', '2015-02-02'),
      ('log_2015_03_01', '2015-03-01');
INSERT OK, 5 row affected (... sec)

We can see the table has been partitioned by the generated month column.

cr> select table_name, partition_ident, 'values'
      from information_schema.table_partitions
      WHERE table_name = 'logs';
+------------+--------------------------+--------------------------+
| table_name | partition_ident          | values                   |
+------------+--------------------------+--------------------------+
| log        | 04732d1i6kojce1g60o30c1g | {'month': 1425168000000} |
| log        | 04732d1i60o3ec1k60o30c1g | {'month': 1420070400000} |
| log        | 04732d1i68rj8e1o60o30c1g | {'month': 1422748800000} |
+------------+--------------------------+--------------------------+
SELECT 3 rows in set (... sec)

Looks good! Now we can analyze the log data on a monthly basis and utilize all the benefits of having a partitioned table. As you can see the combination of partitioned tables and generated columns provides a great tool in the Crate toolkit.

Query optimization on generated column partitions

The implementation of generated columns for the partitioned by clause has been optimized for special cases.

Selection on generated columns uses current optimization

Generated columns utilize the same optimization implemented for regular columns in a partitioned by clause. Thus filtering log records on the generated column month, which is part of the partitioned by clause, would result in effecting only the specific partition. This behavior is expected from regular columns under the same circumstances.

cr> select log,
      date_format('%Y-%m-%d', created) as created,
      date_format('%Y-%m-%d', month) as month
      from logs
      where month = '2015-01';
+----------------+------------+---------+
| log            | created    | month   |
+----------------+------------+---------+
| log_2015_01_01 | 2015-01-01 | 2015-01 |
| log_2015-01_02 | 2015-01-02 | 2015-01 |
+----------------+------------+---------+
SELECT 2 rows in set (... sec)

The execution of the given query only affects the 2015-01 partition.

Selection on a column which is part of a generated column/Query rewrite

The created column is used for generating the month column on which the table is partitioned. The optimization added allows querying the column used in the generation expression of the generated column and still hit only the partition where this value belongs. For example:

cr> select log,
      date_format('%Y-%m-%d', created) as created,
      date_format('%Y-%m', month) as month
      from logs
      where created = '2015-02-01';
+----------------+------------+---------+
| log            | created    | month   |
+----------------+------------+---------+
| log_2015_02_01 | 2015-02-01 | 2015-02 |
+----------------+------------+---------+
SELECT 1 row in set (... sec)

The only partition affected by this query is 2015-02.

Calculated columns

Generated columns are computed internally upon INSERT and UPDATE. A great application of generated columns is the computation of the columns values on the fly.

Continuing our demonstration, we will create a table with cities and their current temperature measured in celsius. It would be great to get the temperature also in fahrenheit and kelvin, without having any computation on the application layer or once records are inserted or updated.

This can be accomplished by defining two generated columns, kelvin and fahrenheit, and providing them with the generation expressions needed for the conversions.

cr> create table weather (
      city string,
      celsius double,
      kelvin as (celsius + 273.15),
      fahrenheit as (celsius * 9/5 + 32)
    );
CREATE OK (... sec)

On insert calculation

First, let's populate the weather table with a few records.

cr> insert into weather (city, celsius) values
      ('Dornbirn', 9),
      ('Berlin', 7),
      ('London', 14),
      ('San Francisco', 8);
INSERT OK, 2 row affected (... sec)

After executing the insert query we select all columns of the new weather table.

cr> select city, celsius, kelvin, fahrenheit from weather;
+---------------+---------+--------+------------+
| city          | celsius | kelvin | fahrenheit |
+---------------+---------+--------+------------+
| Berlin        |     7.0 | 280.15 |       44.6 |
| San Francisco |     8.0 | 281.15 |       46.4 |
| London        |    14.0 | 287.15 |       57.2 |
| Dornbirn      |     9.0 | 282.15 |       48.2 |
+---------------+---------+--------+------------+
SELECT 4 rows in set (... sec)

Looking at the kelvin and fahrenheit columns we can see that the values of the generated columns have been computed on the fly by the generation expressions.

On update calculation

The weather has changed and so the weather table needs to be updated.

cr> update weather set celsius = -1
      where city = 'Dornbirn';
UPDATE OK, 1 row affected (... sec)

After applying the update, query the weather table for all columns again.

cr> select city, celsius, kelvin, fahrenheit
      from weather;
+---------------+---------+--------+------------+
| city          | celsius | kelvin | fahrenheit |
+---------------+---------+--------+------------+
| Berlin        |     7.0 | 280.15 |       44.6 |
| San Francisco |     8.0 | 281.15 |       46.4 |
| London        |    14.0 | 287.15 |       57.2 |
| Dornbirn      |    -1.0 | 252.15 |       30.2 |
+---------------+---------+--------+------------+
SELECT 4 rows in set (... sec)

As expected, the newly calculated values of kelvin and fahrenheit are shown in the query output where the name of the city is Dornbirn.

That blog post was a brief overview of this new Crate feature. If you have any further questions feel free to contact us or for more detailed information, read the generated columns documentation.

Back to topAll posts