The Crate Driver
We have discovered that many people use Crate not only directly integrated in their applications, but also for analyzing their existing data. To help our users do this, we created a driver for Metabase, an open source database visualization tool focused on providing you with answers created from your data.
As Metabase is written in Clojure, we created a wrapper for our JDBC driver that ships with Metabase 0.18.1. Since the JDBC driver is included in the release, Metabase currently requires Crate 0.54.X as a backend.
UPDATE 26/07/2018: the Metabase wrapper is not currently maintained and does not work with CrateDB versions 2.3 and higher.
Metabase is easy to install and offers options including a Mac application, a Docker image, cloud images and a jar file. Find them all here. Once you've started Metabase you will be asked for some user information and to select your database, select Crate from the list.
To analyze your data with Metabase, you create questions, a sequence of parameters, filters and sorting options. Metabase then returns answers that you can format and visualize in a variety of ways.
You can then combine questions and answers into dashboards and layer them on top of each other for comparison.
Setup and Configuration
First add a database and table source.
Once configured, you can find settings for the data source in the Admin panel under the Databases tab. Here you can also force sync the database if new tables are missing in question source lists.
You can also view the assumptions Metabase has made about your data schema and manually override them in the Admin panel under the Data Model tab
With all that set up, let's get started creating some questions.
Metabase's analysis and visualization tools are aimed primarily at business intelligence professionals, and thus offer configuration options designed for particular use cases.
For this example, we are using our Stocks data set imported from Yahoo. We will look at the stock price of Exxon petroleum during the 70s and 80s and compare the volume of sales on the average stock price.
We created three questions, one for the average stock price per day in the 1970s, one for the 1980s and then a question to show the volume of sales across both decades. We split the questions into three parts, firstly to show how to layer answers onto one dashboard, and secondly to restrict the number of results each query returns and not overload Metabase.
Average Stock Price per decade
This example uses stock data taken from Yahoo.
We will create two questions that are almost duplicates of each other to see the change in stock price of Exxon mobil across two decades, the 70s and 80s, i.e. from 01/01/1970 - 12/31/1979 and 01/01/1980 - 12/31/1989.
First, select the stocks table from the Crate datasource.
Add a filter to only show the Exxon stock values by selecting the field you want to filter and how to filter it.
Filtering the date is a similar process, but in this case it should be a range between the start and end of the decade.
To plot data in Metabase on a chart we need to reduce the amount of columns and the only way to do this is by using the View values. For this use case we found the best method was to set the 'view' of the data to an 'average of High' and group by 'Date:Day' to get the results desired, a day and a closing stock price.
Finally make sure the data is sorted correctly by clicking on the ... icon and setting the sort order to 'Date: Day ascending'
Now you can change the visualization to a line graph.
Volume of Sales
Creating this question is a similar process but with some different filters and values.
To compare and contrast data from answers, open the volume question on your dashboard and then add the other questions to it.
Github Data and SQL
For times when Metabase's visual filters and queries are not quite enough, you can jump into SQL edit mode by clicking the View the SQL icon and then Convert this question to SQL. In this example we'll try raw SQL queries with our Github data set.
Earlier this year we created a demo application for showing off graphs generated from queries in Crate. It features queries for Github's API stored in Crate to find out some interesting facts about Github users. With Metabase, this application has become obsolete because by running the queries in Metabase's SQL interface we could create the same results:
Languages in Pull Requests and Repositories
This query returns the number of pull requests grouped by languages:
SELECT payload_pull_request_event['pull_request']['head']['repo']['language'] AS language, COUNT(*) AS num_pull_requests FROM github WHERE type = 'PullRequestEvent' AND payload_pull_request_event['pull_request']['head']['repo']['language'] IS NOT NULL AND repo['id'] IS NOT NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 5
With this query we retrieve the distinct number of repositories grouped by languages:
SELECT payload_pull_request_event['pull_request']['head']['repo']['language'] AS language, COUNT(DISTINCT(repo['id'])) AS num_repos FROM github WHERE type = 'PullRequestEvent' AND payload_pull_request_event['pull_request']['head']['repo']['language'] IS NOT NULL AND repo['id'] IS NOT NULL GROUP BY 1 ORDER BY 2 DESC LIMIT 5
Coffee or Tea?
This query utilizes Crate's full text search capabilities to find out whether Github users prefer coffee or tea:
SELECT date_format('%Y-%m', date_trunc('month', created_at)) AS date, count(*) AS "Tea" FROM github WHERE match(record_ft, 'tea') AND TYPE = 'PushEvent' GROUP BY date ORDER BY date ASC SELECT date_format('%Y-%m', date_trunc('month', created_at)) AS date, count(*) AS "Coffee" FROM github WHERE match(record_ft, 'coffee') AND TYPE = 'PushEvent' GROUP BY date ORDER BY date ASC
Pull Request Delay
To answer the question how long pull requests usually stay open until they are merged, we have the following query:
SELECT cast((payload_pull_request_event['pull_request']['merged_at'] - payload_pull_request_event['pull_request']['created_at']) / 1000.0 / 60.0 / 60.0 AS integer) AS hours_to_merge, COUNT(*) / 1000 AS cnt FROM github WHERE payload_pull_request_event['pull_request']['created_at'] >= '2012-01-01' AND payload_pull_request_event['pull_request']['created_at'] <= '2015-08-31' AND payload_pull_request_event['pull_request']['created_at'] IS NOT NULL AND payload_pull_request_event['pull_request']['merged_at'] IS NOT NULL GROUP BY 1 ORDER BY 1 ASC LIMIT 24
From this query, Metabase can construct this nice graphic, and show that most of the PRs are merged within the hour:
Get the Answers you're looking for
Metabase is a powerful visualization tool and with access to Crate it can provide great answers about your (big) data.