Improving on JOINs

Author
Christian Haudum
Filed under
Date
December 16, 2015

We've doubled the JOINs performance for almost all queries, and removed many of the previously existing restrictions!

Since the first public release of JOINs in Crate we've been working hard to remove limitations and improve query speed. These improvements will be available in the upcoming testing release of Crate 0.54.0.

Removing Limitations

ORDER BY Limitation

The limitation that every item in the select list of the query needs to be included in the order clause is now gone. What previously had to be written as:

SELECT
  employee.id, employee.name, department.name,
FROM
  employee, department
WHERE
  employee.department_id = department.id
ORDER BY
  employee.id, employee.name, department.name;

Can now be written as:

SELECT
  employee.id, employee.name, department.name,
FROM
  employee, department
WHERE
  employee.department_id = department.id
ORDER BY
  employee.id;

Explicit INNER JOIN Syntax

We've added full syntax support for INNER JOIN, which means you can now join tables explicitly:

SELECT
  employee.id, employee.name, department.name,
FROM
  employee
INNER JOIN department
  ON employee.department_id = department.id
ORDER BY
  employee.id;

Optimizing for Speed

By removing the requirement for ordering by all expressions in the select list, query time has accelerated greatly!

Example: Weather Data

To show this in an example I am using data from the German Weather Service which Matthias wrote a blog post about Analyzing Germany's weather with Crate.

Weather Satellite Image (c) DWD

The data consists of two tables, one for the actual data points and one for the stations that recorded the data:

Table Name Records
german_climate.data ~ 235M
german_climate.stations 1455

We want to query the top 10 hottest weather records and their station names.

With the old "ORDER BY limitation" the query looked like this:

SELECT stations.name, data.temp, data.air_pressure, date_format(data.date)
FROM german_climate.data, german_climate.stations
WHERE data.station_id = stations.id
  AND data.temp IS NOT NULL
  AND data.air_pressure IS NOT NULL
ORDER BY data.temp DESC, stations.name, data.air_pressure, date_format(data.date)
LIMIT 10;

And the result:

+------------------------+-----------+-------------------+-----------------------------+
| stations.name          | data.temp | data.air_pressure | date_format(data.date)      |
+------------------------+-----------+-------------------+-----------------------------+
| Oldenburg              |      49.1 |            1027.2 | 1993-03-25T21:00:00.000000Z |
| Oldenburg              |      42.1 |            1006.5 | 1992-07-21T13:00:00.000000Z |
| Zwiesel                |      42.0 |            1013.5 | 2002-10-04T02:00:00.000000Z |
| Bremervörde            |      41.3 |            1018.2 | 1991-08-02T15:00:00.000000Z |
| Oldenburg              |      41.0 |             959.2 | 1992-10-24T00:00:00.000000Z |
| Weiden                 |      40.9 |            1012.3 | 1994-07-25T14:00:00.000000Z |
| Karlsruhe              |      40.1 |            1016.3 | 2003-08-13T13:00:00.000000Z |
| Karlsruhe              |      39.9 |            1015.4 | 2003-08-13T15:00:00.000000Z |
| Karlsruhe              |      39.8 |            1014.9 | 2003-08-09T16:00:00.000000Z |
| Schönhagen (Ostseebad) |      39.8 |            1007.0 | 1993-03-22T20:00:00.000000Z |
+------------------------+-----------+-------------------+-----------------------------+
SELECT 10 rows in set (6.908 sec)

Due to the multiple ordering, the query took about 6.9 seconds.

However, now it's possible to sort only by the relevant fields, in this case by temperature:

SELECT stations.name, data.temp, data.air_pressure, data.date
FROM german_climate.data
  INNER JOIN german_climate.stations
    ON data.station_id = stations.id
WHERE data.temp IS NOT NULL
  AND data.air_pressure IS NOT NULL
ORDER BY data.temp DESC
LIMIT 10;

And the result:

+------------------------+-----------+-------------------+-----------------------------+
| stations.name          | data.temp | data.air_pressure | date_format(data.date)      |
+------------------------+-----------+-------------------+-----------------------------+
| Oldenburg              |      49.1 |            1027.2 | 1993-03-25T21:00:00.000000Z |
| Oldenburg              |      42.1 |            1006.5 | 1992-07-21T13:00:00.000000Z |
| Zwiesel                |      42.0 |            1013.5 | 2002-10-04T02:00:00.000000Z |
| Bremervörde            |      41.3 |            1018.2 | 1991-08-02T15:00:00.000000Z |
| Oldenburg              |      41.0 |             959.2 | 1992-10-24T00:00:00.000000Z |
| Weiden                 |      40.9 |            1012.3 | 1994-07-25T14:00:00.000000Z |
| Karlsruhe              |      40.1 |            1016.3 | 2003-08-13T13:00:00.000000Z |
| Karlsruhe              |      39.9 |            1015.4 | 2003-08-13T15:00:00.000000Z |
| Karlsruhe              |      39.8 |            1014.9 | 2003-08-09T16:00:00.000000Z |
| Schönhagen (Ostseebad) |      39.8 |            1007.0 | 1993-03-22T20:00:00.000000Z |
+------------------------+-----------+-------------------+-----------------------------+
SELECT 10 rows in set (2.578 sec)

As you can see, by removing the additional unused order by fields the query time shrank from ~6.9 seconds to ~2.6 seconds, which is an increase in speed of over 165%!

This is a simple example of a minor optimization. We're working on bigger optimizations for implementating JOINs, and look forward to sharing them with you soon!

Back to topAll posts