Querying time series data with CrateDB Cloud: SQL examples

2021-03-01, by Matthijs Krul

Recently Crate.io announced the CrateDB Cloud free trial, allowing you to use CrateDB Cloud for free for 14 days. With our free trial, we offer the following:

  • You get a 3-node cluster, with 96 GB of storage and enough power to ingest thousands of metrics per second;
  • You will be able to populate CrateDB Cloud with tens of millions of records;
  • No credit card needed.

In this tutorial you will discover what you can do with your free trial cluster. Using the sample dataset from our free trial announcement - public records from taxi services in New York City - you can learn how CrateDB Cloud makes it easy to explore data in different ways while obtaining fast results. No prior SQL knowledge is needed. This article does assume you have your free trial cluster up and running and that you have imported the dataset, so make sure you sign up if you have not already.

With your free trial set up and the sample dataset imported, you may be wondering what comes next. How can you explore large quantities of data using CrateDB Cloud? Wonder no more, because in this simple guide we will show you some interesting queries - some more basic, some more advanced - that illustrate how you can interact with and learn from data using your CrateDB Cloud free trial.

The only knowledge you need is being able to access the console for the free trial cluster you created. (If you have forgotten how to do so, take another look at the end of our free trial guide.) Everything in this quick guide will be done in that console: the CrateDB Admin UI, an easy-to-use interface that makes running queries fast and straightforward.

Now that you have some data ready, let’s look at some of the things you can do with CrateDB Cloud. First off, you do not need to wait for the data import to finish. It runs as a background process and will take some time, but you can already start playing around with the data as it is being imported (your queries will be slower though).
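If you are curious how far the import has gotten, a simple count will tell you (this sketch assumes the sample table is called nyc_taxi, as it is throughout this guide):

SELECT COUNT(*) FROM nyc_taxi;

Run it a few times while the import is in progress and you can watch the number of records grow.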

When the data is fully imported, you will have about 40 million data records in your cluster and about 4.8 GB of data. A decent-sized dataset to start with. In fact, CrateDB Cloud is built to handle queries on terabytes of data at great speed - but waiting to import that much would take too long for a quick showcase. With our NYC taxi sample dataset you don't have to wait long when you want to get started right now!

The dataset you'll be working with contains a collection of various bits of information about taxi trips taken by people in New York City. Beginning with the basics, you'll see how to load that data and how to manipulate it in various ways to retrieve particularly interesting subsets of the total information available.

You'll want to look at the dataset as a whole first. Running:

SELECT * FROM nyc_taxi LIMIT 100;

will print what the table looks like:


You have quite a bit of information about each journey: fare amounts, passenger numbers, tips and tolls paid... so what can you do with all that?

Let’s say you want to get the stats of how many rides there were per day during July of 2019. You can easily do that with a query like:

SELECT
  date_trunc('day', pickup_datetime) AS day,
  COUNT(*) AS rides
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY day
ORDER BY day;

This will look as follows:


Not a bad start. But it can get more interesting. Now let’s say you want to know the total distance travelled on any of those days, as well as the average distance per trip. You can do that too:

SELECT
  date_trunc('day', pickup_datetime) AS day,
  COUNT(*) AS rides,
  SUM(trip_distance) AS total_distance,
  SUM(trip_distance) / COUNT(*) AS average_per_ride
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY day
ORDER BY day;

This gives you something a bit more interesting:


As you can now see from the dataset, when they take a cab, New Yorkers travel about 3 miles per trip on average. Seems legit for NYC!
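As an aside, the SUM(trip_distance) / COUNT(*) expression can also be written more directly with the AVG() aggregate, which computes the same per-trip average - a sketch of the same July query:

SELECT
  date_trunc('day', pickup_datetime) AS day,
  AVG(trip_distance) AS average_per_ride
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY day
ORDER BY day;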

Next, let’s say you want to figure out what the busiest hours of the day are. This, too, is easy enough to do:

SELECT
  EXTRACT(HOUR FROM pickup_datetime) AS hour,
  SUM(trip_distance) AS total_distance
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY hour
ORDER BY hour;


Unsurprisingly, New Yorkers tend to take the most taxi rides in the afternoon and after work, and hardly any happen at 4 in the morning. Turns out the city that never sleeps still must sleep sometime!
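"Busiest" could also be measured by number of rides rather than distance travelled. A variation of the same query, sorted so the busiest hour comes first:

SELECT
  EXTRACT(HOUR FROM pickup_datetime) AS hour,
  COUNT(*) AS rides
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY hour
ORDER BY rides DESC;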

So far it’s all been pretty standard SQL that you may be familiar with anyway. So, what can CrateDB Cloud really do? Let’s take a look at some more advanced possibilities and put it to the test.

You can load pretty large amounts of data into CrateDB Cloud: tens of terabytes are not a problem. The data is also compressed by default and all the fields are indexed – so you don’t need to worry about having to define indices in advance.

If you’ve ever worked with such large amounts of data, you will know that some queries can suddenly become quite tricky to handle. Let's say you want to get the number of unique pickup locations from the entire dataset – you would normally do something like this:

SELECT COUNT(DISTINCT pulocationid) FROM nyc_taxi;

Go ahead and try. You’ll see CrateDB Cloud immediately trip a circuit breaker – the amount of data is quite large, and CrateDB Cloud knows that it can’t reasonably perform this query without having to do a ridiculous full scan of the table.

Fortunately, there is a solution for that! It comes in the form of HYPERLOGLOG_DISTINCT():

SELECT HYPERLOGLOG_DISTINCT(pulocationid) FROM nyc_taxi;

This uses the HyperLogLog algorithm to estimate the number of distinct values, and it is blazing fast, even on very large datasets:


You can use HYPERLOGLOG_DISTINCT() just as you would COUNT(DISTINCT) in any of your queries.
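Because it is an aggregate function, it also combines naturally with grouping. For example, a sketch estimating the number of distinct drop-off locations for each pickup location (column names as in the dataset above):

SELECT
  pulocationid,
  HYPERLOGLOG_DISTINCT(dolocationid) AS est_dropoffs
FROM nyc_taxi
GROUP BY pulocationid
ORDER BY est_dropoffs DESC
LIMIT 10;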

What about something really advanced? Imagine you want to get the list of the most high-grossing location pairs in NYC for the entire year. Not only that, but for each pair, you want to know how much more it grosses than the next best one. Here, window functions come to the rescue, which are well supported by CrateDB Cloud:

SELECT
  pulocationid, dolocationid, cnt, total, next_highest_grossing,
  total - next_highest_grossing AS leads_by
FROM (
  SELECT
    pulocationid,
    dolocationid,
    COUNT(*) AS cnt,
    SUM(total_amount) AS total,
    LAG(SUM(total_amount), 1) OVER (ORDER BY SUM(total_amount)) AS next_highest_grossing
  FROM nyc_taxi
  WHERE pickup_datetime BETWEEN '2019-01-01T00:00:00' AND '2019-12-31T23:59:59'
  GROUP BY pulocationid, dolocationid
  ORDER BY total DESC
  LIMIT 100
) AS sub
ORDER BY total DESC;

In this case the query involves using the LAG() window function, which selects the next highest grossing amount from your list. The results look like this:


If you explore the results further, you may notice that 264 is a location called "Unknown". This just goes to show that no dataset is perfect... The second on the list is 132 to 230, which is JFK Airport to Times Square – we're not experts in NYC taxi patterns, but we're sure you'll agree that sounds plausible.
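The same LAG() technique works on the daily ride counts from earlier in this guide. For instance, a sketch comparing each July day's ride count with the previous day's:

SELECT
  date_trunc('day', pickup_datetime) AS day,
  COUNT(*) AS rides,
  COUNT(*) - LAG(COUNT(*), 1) OVER (ORDER BY date_trunc('day', pickup_datetime)) AS change_from_previous_day
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY day
ORDER BY day;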

Of course, these are just some examples. Window functions and the hyperloglog algorithm are pretty advanced concepts, but hopefully with this guide you have gotten an idea of some of the things you can do with your data in CrateDB Cloud. You can read up more about what queries are supported in CrateDB, our underlying database, in our official documentation. Finally, should you want to know about how to get from the free trial to a full subscription, take a look at our CrateDB Cloud signup tutorial.

In the meantime, have fun exploring the data with your CrateDB free trial cluster. If you have any questions, feel free to reach out to us.
