The Guide for Time Series Data Projects is out.

Download now
Skip to content
Blog

Use CrateDB and Apache Superset for Open Source Data Warehousing and Visualization

Apache Superset is an easy-to-use open-source business intelligence (BI) application that offers a rich set of customizable data visualizations. With Superset, you can explore, filter, and organize your data. You can also create and share custom monitoring and analytics dashboards.

CrateDB is a distributed SQL database that makes it simple to capture, store, and analyze massive amounts of operational data in real-time.

In this post, I'll show you how to get started with CrateDB and Apache Superset. And by the end of this post, you should have your very first Superset visualizations running on top of CrateDB data.

Note: The following instructions are done on macOS, and should be easily adaptable for Linux. It uses CrateDB 4.6.6 and Apache Superset 1.3.2. 

Setup CrateDB

Get CrateDB set up with a single command: 

bash -c "$(curl -L https://try.crate.io/)"
 

This command downloads CrateDB and runs it from the tarball. If you'd like to install CrateDB permanently, or you are using Windows, check out our collection of one-step install guides.

If you're using the command above, it should pop open the CrateDB admin UI for you automatically once it has finished. Otherwise, head over to http://localhost:4200/ in your browser.

You should see CrateDB Admin UI as illustrated below:

Screen-Shot-2019-07-12-at-12.49.01

Get Data

As the fresh CrateDB installation does not have any data, we are going to import some of the existing datasets. We choose to import the Twitter sentiment corpus that consists of more than 5,000 tweets classified concerning one of four different topics. The full corpus of data can be found here.

Before data import, we need to create a new table in CrateDB with the following schema:

Screenshot 2022-01-03 at 10.00.02

Finally, we import Twitter sentiment corpus with COPY FROM statement that copies data from a file into a table:

COPY doc.tweets FROM 'file:///path_to_file/full-corpus.csv' RETURN SUMMARY;

Once the data is imported, select the Tables icon from the left-hand navigation in Admin UI, and then select the tweets table. You should end up here:

http://localhost:4200/#/tables/doc/tweets

Screenshot 2021-12-27 at 10.01.53

To inspect some of the data samples, click on the Query Table button. The output of the SELECT statement should look as follows:

Screenshot 2021-12-27 at 10.12.23

Great! At this point, the CrateDB is up and running with data that can be further analyzed.

In this article, we will show you how to graph a pie chart to illustrate the numerical proportion of tweets per different topic and how to visualize sentiment data for each topic in a bar graph.

Install Apache Superset

The fastest way to set up Apache Superset is to use Docker and Docker Compose on a Linux or macOS computer. In this article, we will show how to install and use Apache Superset on macOS, and to find out how to use it with other operating systems please follow the official documentation. Before continuing, make sure that you have installed Docker for Mac.

The first step is to clone Superset’s repository with the following command:

git clone https://github.com/apache/superset.git

After the superset directory has been downloaded, we need to extend the list of requirements with a Python database driver for CrateDB. To do this, create the requirements-local.txt file and add a new line to it:

echo "crate[sqlalchemy]==0.26.0" >> ./docker/requirements-local.txt

Now, run the docker-compose command from the superset directory:

docker-compose -f docker-compose-non-dev.yml up

The command will spin several docker containers. After the logging output from the containers slows, you can start exploring Apache Superset on your local machine.

Run Superset

To access Superset, visit http://localhost:8088 in your browser:

Screenshot 2021-12-27 at 10.41.13

Log in with the default username and password:

  • username: admin
  • password: admin

Afterward, you should see the starting page of Apache Superset:

Screenshot 2021-12-27 at 10.42.30

The Superset app Docker container also loads several datasets and dashboards that can be used as a starting point for learning different visualization options.

Connect to CrateDB

To connect Apache Superset with CrateDB select DataDatabases from the top navigation menu:

 Screenshot 2021-12-27 at 10.58.49

Click on the Database button in the top right-hand corner of the screen. This will open a new window for choosing a database to connect to. Among the list of databases choose CrateDB.

Screenshot 2021-12-28 at 13.33.41

In the same form, fill out the following information:

  • Database: CrateDB

  • SQLAlchemy URI: crate://crate@host.docker.internal:4200

Like so:

Screenshot 2021-12-27 at 11.03.11

Then, click the Test Connection button. You should be presented with a modal popup window saying "Connection looks good!". To finally connect, click Connect button.

CrateDB database connection will appear in the list of databases:

Screenshot 2021-12-27 at 11.05.02

The next step is to import a new dataset: go to DataDataset from the top navigation menu and click on the Dataset button in the top right corner.

Screenshot 2021-12-27 at 11.06.46

To add data from the tweets table, select the following:

  • Database: CrateDB

  • Schema: doc

  • Table Schema: tweets

Click Add button and the data will appear in the list of data sources.

Create Dashboard

Before creating individual charts, let’s first create a new dashboard that will be used to display charts together. This can be done by choosing Dashboards from the navigation bar and clicking on the Dashboard button in the top right corner. This will open the following window:

Screenshot 2021-12-27 at 12.29.25

Name the dashboard Twitter Sentiment as illustrated and click on the Save button.

Create a Pie Chart

The first chart we are going to create is a pie chart. Choose Charts from the navigation menu and click on the Chart button in the top right corner. This opens a new window for choosing Chart type as illustrated below.

Screenshot 2021-12-27 at 12.47.55

As a dataset choose tweets, as a chart type choose pie chart, and then click on Create New Chart. The window opens a form for specifying the query that extracts data to be visualized. The query can be specified either graphically or in the SQL format. For example, if we want to show proportions of tweets per topic, we will specify the query like the following:

Screenshot 2021-12-27 at 13.03.56

  • First, we need to group tweets by the topic

  • Second, we specify the metric we want to show: in our case, it is the number of tweets. The corresponding SQL metric is defined as COUNT(tweetid)

  • Finally, we leave the filter empty as we want to take into account all tweets

The resulting chart shows the proportion of each topic to the total number of tweets. Click on the Save button above the query form, choose the chart name and add to the Twitter Sentiment dashboard as follows:

Screenshot 2021-12-27 at 13.11.40

Now if you open the Twitter Sentiment dashboard, the created pie chart will show up.

Create a Barchart

Now, let’s illustrate the sentiment per topic. This time, we choose a bar chart with the query as illustrated below:

Screenshot 2021-12-27 at 13.25.13

  • We choose the number of tweets as a metric, e.g., COUNT(tweetid)

  • Then, we group data by topic (series = topic).

  • Finally, we break each group by sentiment, to show for each topic how many tweets are classified as positive, negative, neutral, and irrelevant (breakdowns = sentiment)

Next, save the graph to the Twitter Sentiment dashboard. The dashboard now contains two charts as shown below:

Screenshot 2021-12-27 at 13.31.32

For our dataset, we can observe that Google has the highest number of positive tweets and the lowest number of negative tweets. On the other hand, about Apple, there are more negative tweets than for other topics. Interestingly, most of the tweets for all topics are classified as neutral.

Wrap Up

Apache Superset is an open-source cloud-native application that allows easy visualization of massive amounts of data. CrateDB is a distributed SQL database that makes data storage and analysis simple and at scale. Together, they provide a powerful open-source platform for data warehousing and business intelligence.

This post illustrated how to make the first start with CrateDB and Apache Superset, specifically:

  1. How to get CrateDB up-and-running

  2. Importing interesting data from a filesystem

  3. How to install Apache Superset via docker-compose

  4. Setting up a connection to CrateDB

  5. Creation of a dashboard and two different types of charts

Questions? Looking for help? Get in touch!