Visualize Crate Data with Metabase

Author
admin
Filed under
Date
June 22, 2016

Click here to download CrateDB

Download Now »

The Crate Driver

We have discovered that many people use Crate not only directly integrated in their applications, but also for analyzing their existing data. To help our users do this, we created a driver for Metabase, an open source database visualization tool focused on providing you with answers created from your data.

As Metabase is written in Clojure, we created a wrapper for our JDBC driver that ships with Metabase 0.18.1. Since the JDBC driver is included in the release, Metabase currently requires Crate 0.54.X as a backend.

Install Metabase

Metabase is easy to install and offers options including a Mac application, a Docker image, cloud images and a jar file. Find them all here. Once you’ve started Metabase you will be asked for some user information and to select your database, select Crate from the list.

Metabase Concepts

To analyze your data with Metabase, you create questions, a sequence of parameters, filters and sorting options. Metabase then returns answers that you can format and visualize in a variety of ways.

You can then combine questions and answers into dashboards and layer them on top of each other for comparison.

Setup and Configuration

First add a database and table source.

Add a Data Source

Once configured, you can find settings for the data source in the Admin panel under the Databases tab. Here you can also force sync the database if new tables are missing in question source lists.

Database Configuration

You can also view the assumptions Metabase has made about your data schema and manually override them in the Admin panel under the Data Model tab

Schema Configuration

Schema Configuration

With all that set up, let’s get started creating some questions.

Stock Data

Metabase’s analysis and visualization tools are aimed primarily at business intelligence professionals, and thus offer configuration options designed for particular use cases.

For this example, we are using our Stocks data set imported from Yahoo. We will look at the stock price of Exxon petroleum during the 70s and 80s and compare the volume of sales on the average stock price.

We created three questions, one for the average stock price per day in the 1970s, one for the 1980s and then a question to show the volume of sales across both decades. We split the questions into three parts, firstly to show how to layer answers onto one dashboard, and secondly to restrict the number of results each query returns and not overload Metabase.

Metabase Questions List

Average Stock Price per decade

This example uses stock data taken from Yahoo.

We will create two questions that are almost duplicates of each other to see the change in stock price of Exxon mobil across two decades, the 70s and 80s, i.e. from 01/01/1970 – 12/31/1979 and 01/01/1980 – 12/31/1989.

First, select the stocks table from the Crate datasource.

Set Datasource

Add a filter to only show the Exxon stock values by selecting the field you want to filter and how to filter it.

Filter Stock Value

Filtering the date is a similar process, but in this case it should be a range between the start and end of the decade.

Filter Date Range

To plot data in Metabase on a chart we need to reduce the amount of columns and the only way to do this is by using the View values. For this use case we found the best method was to set the ‘view’ of the data to an ‘average of High’ and group by ‘Date:Day’ to get the results desired, a day and a closing stock price.

Change Date View

Group Data

Finally make sure the data is sorted correctly by clicking on the icon and setting the sort order to ‘Date: Day ascending’

Sort Data

Now you can change the visualization to a line graph.

Change Visualization

Volume of Sales

Creating this question is a similar process but with some different filters and values, here’s what we used:

Query Values

Combine Answers

To compare and contrast data from answers, open the volume question on your dashboard and then add the other questions to it:

Query Values

Github Data and SQL

For times when Metabase’s visual filters and queries are not quite enough, you can jump into SQL edit mode by clicking the View the SQL icon and then Convert this question to SQL. In this example we’ll try raw SQL queries with our Github data set.

Earlier this year we created a demo application for showing off graphs generated from queries in Crate. It features queries for Github’s API stored in Crate to find out some interesting facts about Github users. With Metabase, this application has become obsolete because by running the queries in Metabase’s SQL interface we could create the same results:

Languages in Pull Requests and Repositories

This query returns the number of pull requests grouped by languages:

SELECT payload_pull_request_event['pull_request']['head']['repo']['language'] AS language,
       COUNT(*) AS num_pull_requests
FROM github
WHERE   type = 'PullRequestEvent'
    AND payload_pull_request_event['pull_request']['head']['repo']['language'] IS NOT NULL
    AND repo['id'] IS NOT NULL
GROUP BY 1
ORDER BY 2
DESC LIMIT 5

PR Languages

With this query we retrieve the distinct number of repositories grouped by languages:

SELECT payload_pull_request_event['pull_request']['head']['repo']['language'] AS language,
       COUNT(DISTINCT(repo['id'])) AS num_repos
FROM github
WHERE   type = 'PullRequestEvent'
    AND payload_pull_request_event['pull_request']['head']['repo']['language'] IS NOT NULL
    AND repo['id'] IS NOT NULL
GROUP BY 1
ORDER BY 2
DESC LIMIT 5

PR Repositories

Coffee or Tea?

This query utilizes Crate’s full text search capabilities to find out whether Github users prefer coffee or tea:

SELECT date_format('%Y-%m', date_trunc('month', created_at)) AS date,
  count(*) AS "Tea" FROM github
WHERE match(record_ft, 'tea')
  AND TYPE = 'PushEvent'
GROUP BY date
ORDER BY date ASC

SELECT date_format('%Y-%m', date_trunc('month', created_at)) AS date,
  count(*) AS "Coffee" FROM github
WHERE match(record_ft, 'coffee')
  AND TYPE = 'PushEvent'
GROUP BY date
ORDER BY date ASC

Tea

Coffee

Pull Request Delay

To answer the question how long pull requests usually stay open until they are merged, we have the following query:

SELECT
  cast((payload_pull_request_event['pull_request']['merged_at'] - payload_pull_request_event['pull_request']['created_at']) / 1000.0 / 60.0 / 60.0  AS integer) AS hours_to_merge,
  COUNT(*) / 1000 AS cnt
FROM github
WHERE payload_pull_request_event['pull_request']['created_at'] >= '2012-01-01'
  AND payload_pull_request_event['pull_request']['created_at'] <= '2015-08-31'
  AND payload_pull_request_event['pull_request']['created_at'] IS NOT NULL
  AND payload_pull_request_event['pull_request']['merged_at'] IS NOT NULL
GROUP BY 1
ORDER BY 1 ASC
LIMIT 24

From this query, Metabase can construct this nice graphic, and show that most of the PRs are merged within the hour:

Pull Request Latency

Get the Answers you’re looking for

Metabase is a powerful visualization tool and with access to Crate it can provide great answers about your (big) data.

Back to topAll Blog