Benchmarking Complex Query Performance in CrateDB and PostgreSQL

2017-05-04, by Andy Ellicott

CrateDB is an open source, distributed SQL database. It combines the ease of use of SQL with the horizontal scaling and data model flexibility people associate with NoSQL databases like MongoDB or Apache Cassandra.

CrateDB can perform very fast queries, over incredibly large datasets, in real-time as new data is being ingested. This makes CrateDB particularly well suited for storing and processing machine data (e.g. data collected by IoT devices).

To demonstrate this, we put together a benchmark that compares CrateDB v.1.14 versus PostgreSQL v9.6 performance in terms of query throughput and query time.

This blog post gives you the overview of what we did and the results we got. For more details, download the whitepaper: CrateDB vs. PostgreSQL Query Performance.

Benchmark Setup

We benchmarked a data set containing 314,496,000 generated records that simulated sensor readings gathered over the period of one year, from multiple customers (also known as tenants). Each record contains information about the sensor and the sensor readings.

With this data, we benchmarked a scenario that is common when dealing with large amounts of numerical sensor data: aggregations. Aggregations, like averaging and totalling, are everyday queries that are useful in real-time analysis of sensor data.

To do this, we simulated multiple users querying the 314 million row database concurrently. Three different SQL queries were used. Parameters in the queries were randomly generated to ensure the queries produced different responses, as they would in real life.

The three queries were:

  • Data from a single tenant, aggregated over a single week
  • Data aggregated over multiple weeks, grouped and ordered by tenant
  • Group by with full text

We attempted to keep the hardware assigned to both CrateDB and PostgreSQL as similar as possible. This was a little tricky because CrateDB is a distributed system, and PostgreSQL is not. What we found was that doing this meant that the PostgreSQL hardware cost was 30% higher than than the CrateDB hardware cost.

For all the gory details of everything mentioned above, check out the whitepaper.

Query Details

Because CrateDB and PostgreSQL implement a common subset of standard SQL, the SQL for the first two queries are identical for both databases. However, because CrateDB and PostgreSQL both handle full text search a little differently, there are two version of the SQL for the third query.

Query #1

Data from a single tenant, aggregated across multiple weeks:

SELECT min(v1) as v1_min, max(v1) as v1_max, avg(v1) as v1_avg, sum(v1) as v1_sum
FROM t1
WHERE tenant_id = ? AND ts BETWEEN ? AND ?;

Query #2

Data aggregated over a single week, grouped and ordered by tenant:

SELECT count(*) AS num_docs, tenant_id,
  min(v1) AS v1_min, max(v1) AS v1_max, avg(v1) AS v1_avg, sum(v1) AS v1_sum,
  min(v2) AS v2_min, max(v2) AS v2_max, avg(v2) AS v2_avg, sum(v2) AS v2_sum,
  min(v3) AS v3_min, max(v3) AS v3_max, avg(v3) AS v3_avg, sum(v3) AS v3_sum,
  min(v4) AS v4_min, max(v4) AS v4_max, avg(v4) AS v4_avg, sum(v4) AS v4_sum
FROM t1
WHERE ts BETWEEN ? AND ?
GROUP BY tenant_id
ORDER BY tenant_id;

Query #3

Group by with full text.

CrateDB:

SELECT sensor_type, COUNT(*) as sensor_count
FROM t1
WHERE taxonomy = ? AND tenant_id = ?
GROUP BY sensor_type;

PostgreSQL:

SELECT sensor_type, COUNT(*) as sensor_count
FROM t1
WHERE taxonomy <@ ?::ltree AND tenant_id = ?
GROUP BY sensor_type;

Results

A JMeter test harness simulated 30 concurrent users, each connecting to the database and executing the three test queries in random sequence and with random query parameters. Each database executed over 900 queries during their respective benchmark runs.

These results show the average duration (in milliseconds) for each query to process 314 million rows:

Queries CrateDB 1.1.4 PostgreSQL 2.6 Difference

Data from a single tenant,
aggregated over 1 week

4,811

27,575

6x faster

Data aggregated over
multiple weeks, grouped
andordered by tenant

8,871

34,660

4x faster

Group by with full text

4,863

108,966

22x faster

If we graph this, we get:

Diagram Average Query Response Time CrateDB vs PostgreSQL

Wrap Up

Against a database of 314 million sensor readings, SQL queries ran up to 22 times faster on CrateDB than they did on PostgreSQL.

CrateDB can achieve this because it parallelizes and distributes queries across the whole cluster, meaning that processing performance can be effortlessly scaled horizontally by adding new nodes, with a linear performance curve. This beats the old way of scaling SQL databases vertically, which often incurs costs (be they database downtime, higher financial outlay, or operational effort).

If you would like to discuss your use-case or need help with CrateDB, please contact us. We can provide guidance on database setup, hardware recommendations, data modelling, query design advice, performance tuning, and so on.

  • Follow us