How to import from custom data sources with a plugin

Claus Matzinger
Filed under
May 31, 2016

Common Crawl: An  Open Web Spider

Image Credit

Ingesting and handling large amounts of data with Crate is simple. We have a best practice guide and a blog post which explain some of the crucial aspects.

Crate supports two ways to ingest large amounts of data, (Bulk) Inserts and COPY FROM. The former is useful for pushing live production data into the cluster quickly (and with a small number of requests), the latter is often used for migrating existing data to Crate where each node fetches and ingests the data directly from the source. COPY_FROM will be the focus of this article.


While looking for example data we came across the great Common Crawl project which provides free and open web crawling data. With its data size exceeding 541 TB it was a perfect challenge for Crate.

There was one major obstacle. Common Crawl generates roughly 30 000 files per month using its own format which you cannot feed into Crate directly. Preprocessing this quantity of data would take days to complete or require building a very large distributed system. This gave us the idea to use Crate's plugin infrastructure to transform the data while copying.

Plugging It In

Similar to the scalar function demonstrated in our example plugin, we want to add custom behavior, but instead of a simple function, adding support for a new file format.


Common Crawl data comes in two different formats, WARC and WET. While WARC is the raw HTML output of the crawler with headers and cookies, WET is a reduced version of the data, only containing the textual content of each web page (without HTML/CSS/JS) and some header information:

WARC-Type: conversion
WARC-Date: 2015-02-26T23:09:25Z
WARC-Record-ID: <urn:uuid:eeb56255-e75d-4d6b-9b55-8a0147862583>
WARC-Refers-To: <urn:uuid:690eff7f-404c-47b0-9ecd-bce5b45e0064>
Content-Type: text/plain
Content-Length: 1262

Has attendee Abe - Whitespace (Hackerspace Gent)

Our goal was to support WET files, which means that the plug in has to convert this gzipped, multi-line text block into a single, JSON-encoded line.

Just another InputFactory?

Starting with version 0.55.0, Crate can load data from URLs, the handling of which is determined by an InputFactory class that declares the protocol it can handle. Instances of that class then implement a method called getStream() to provide a stream of line-based JSON to the FileReadingCollector. This is exactly how we created our plugin.

The Common Crawl Plugin

To test our plugin we used it to import data to a 200+ node cluster under realistic conditions. The Common Crawl plugin consists of two essential parts. The parser for the WET format and the FileInput class to use the file type with Crate.

WET Parsing

We wanted to store the following information from the WET file:

  • URI
  • Indexing date
  • Content-Type header
  • Content-Length header
  • Content

The placement of these values is always in this exact order and terminated by two new lines, followed by another header (starting with WARC/1.0). With that knowledge we could construct a simple parser that looks for those headers, sanitizes and transforms them and serializes the result into JSON. The FileInput adds glue to this, providing an unzipped stream of text to the parser, and returning a stream for Crate to read.

Using It

Plugins integrate seamlessly with Crate. After adding the plugin to the plugins folder, it's automatically loaded when Crate starts. Then a simple statement will use the newly added implementation:

COPY commoncrawl FROM 'ccrawl://';

Similar to Crate's S3 support the protocol part of the URL identifies the data source. The plugn replaces ccrawl with http (literally) to fetch the content via a simple GET request.

The Internet In Your Hands

With Common Crawl data in a SQL database, slicing and dicing it can happen in seconds. We have imported a tiny chunk of the data to show you some examples.

SELECT count(*) FROM commoncrawl;
| count(*) |
|   870595 |

Want to know which pages contain the word 'cat'?

cr> SELECT count(distinct authority) FROM commoncrawl WHERE match(content,
| count(DISTINCT authority) |
|                      7536 |

Or how many of those are hosted on

SELECT count(*) FROM commoncrawl WHERE match(content, 'cat') and
| count(*) |
|      402 |

To some users, the average content length of the TSA's blog might be interesting...

SELECT avg(clen) FROM commoncrawl WHERE authority = '';
|          avg(clen) |
| 115768.30508474576 |

... or maybe just a simple Google-like query:

SELECT authority, path FROM commoncrawl WHERE match(content, 'doge') ORDER BY _score DESC LIMIT 5;
| authority               | path                                                                               |
| com.cryptocointalk      | /topic/3164-new-doge-faucet-1h-up-to-20-doge/                                      |
| | /explore/partners/UNTCVA/browse/?sort=date_d&fq=untl_decade:1470-1479&display=grid |
| org.bitcointalk         | /?topic=183798                                                                     |
| com.2damnfunny          | /                                                                                  |
| com.clker.www           | /clipart-z-is-for-zeppelin.html                                                    |

Whatever your questions might be, this dataset is an unlimited source of knowledge!

What now?

Since we achieved our primary goal of importing a large amount of data, we decided to Open Source this project to show how easy it is to adapt Crate to your needs. With our implementation we plan to provide a reference to bootstrap your own ideas. A wealth of plugins would benefit all Crate users, so if you create one, please let us know.

Back to topAll posts