How to Import from Custom Data Sources with a Plugin

Claus Matzinger
Filed under
May 31, 2016

Common Crawl: An  Open Web Spider


Ingesting and handling large amounts of data with Crate is simple. We have a
best practice
and a
blog post which explain some of the
crucial aspects.

Crate supports two ways to ingest large amounts of data, (Bulk)
and COPY
. The former
is useful for pushing live production data into the cluster quickly (and with a
small number of requests), the latter is often used for migrating existing data
to Crate where each node fetches and ingests the data directly from the
source. COPY_FROM will be the focus of this article.


While looking for example data we came across the great Common
project which provides free and open web
crawling data. With its data size exceeding 541
it was a perfect
challenge for Crate.

There was one major obstacle. Common Crawl generates roughly 30 000 files per
using its own format which you cannot feed into Crate directly.
Preprocessing this quantity of data would take days to complete or require
building a very large distributed system. This gave us the idea to use Crate’s
plugin infrastructure

to transform the data while copying.

Plugging It In

Similar to the scalar function demonstrated in our example
, we want to add custom
behavior, but instead of a simple function, adding support for a new file


Common Crawl data comes in two different formats, WARC and WET. While WARC is
the raw HTML output of the crawler with headers and cookies, WET is a reduced
version of the data, only containing the textual content of each web page
(without HTML/CSS/JS) and some header information:

WARC-Type: conversion
WARC-Date: 2015-02-26T23:09:25Z
WARC-Record-ID: <urn:uuid:eeb56255-e75d-4d6b-9b55-8a0147862583>
WARC-Refers-To: <urn:uuid:690eff7f-404c-47b0-9ecd-bce5b45e0064>
Content-Type: text/plain
Content-Length: 1262

Has attendee Abe - Whitespace (Hackerspace Gent)

Our goal was to support WET files, which means that the plug in has to convert
this gzipped, multi-line text block into a single, JSON-encoded line.

Just another InputFactory?

Starting with version 0.55.0, Crate can load data from URLs, the handling of
which is determined by an InputFactory class that declares the protocol it can
handle. Instances of that class then implement a method called getStream() to
provide a stream of line-based

to the FileReadingCollector. This is exactly how we created our plugin.

The Common Crawl Plugin

To test our plugin we used it to import data to a 200+ node cluster under
realistic conditions. The Common Crawl
consists of two essential
parts. The parser for the WET

and the FileInput
to use the file type with Crate.

WET Parsing

We wanted to store the following information from the WET file:

  • URI
  • Indexing date
  • Content-Type header
  • Content-Length header
  • Content

The placement of these values is always in this exact order and terminated by
two new lines, followed by another header (starting with WARC/1.0). With that
knowledge we could construct a simple parser that looks for those headers,
sanitizes and transforms them and serializes the result into JSON. The
FileInput adds glue to this, providing an unzipped stream of text to the
parser, and returning a stream for Crate to read.

Using It

Plugins integrate seamlessly with Crate. After adding the plugin to the plugins
folder, it’s automatically loaded when Crate starts. Then a simple
statement will use the newly added implementation:

COPY commoncrawl FROM 'ccrawl://';

Similar to Crate’s S3
protocol part of the URL identifies the data source. The plugn replaces ccrawl
with http
to fetch the content via a simple GET request.

The Internet In Your Hands

With Common Crawl data in a SQL database, slicing and dicing it can happen in
seconds. We have imported a tiny chunk of the data to show you some examples.

SELECT count(*) FROM commoncrawl;
| count(*) |
|   870595 |

Want to know which pages contain the word ‘cat’?

cr> SELECT count(distinct authority) FROM commoncrawl WHERE match(content,
| count(DISTINCT authority) |
|                      7536 |

Or how many of those are hosted on

SELECT count(*) FROM commoncrawl WHERE match(content, 'cat') and
| count(*) |
|      402 |

To some users, the average content length of the TSA’s blog
might be interesting…

SELECT avg(clen) FROM commoncrawl WHERE authority = '';
|          avg(clen) |
| 115768.30508474576 |

… or maybe just a simple Google-like query:

SELECT authority, path FROM commoncrawl WHERE match(content, 'doge') ORDER BY _score DESC LIMIT 5;
| authority               | path                                                                               |
| com.cryptocointalk      | /topic/3164-new-doge-faucet-1h-up-to-20-doge/                                      |
| | /explore/partners/UNTCVA/browse/?sort=date_d&fq=untl_decade:1470-1479&display=grid |
| org.bitcointalk         | /?topic=183798                                                                     |
| com.2damnfunny          | /                                                                                  |
| com.clker.www           | /clipart-z-is-for-zeppelin.html                                                    |

Whatever your questions might be, this dataset is an unlimited source of

What now?

Since we achieved our primary goal of importing a large amount of data, we
decided to Open Source this project to show how easy it is to adapt Crate to
your needs. With our implementation we plan to provide a reference to bootstrap
your own ideas. A wealth of plugins would benefit all Crate users, so if you
create one, please let us know.

Your CrateDB download should begin automatically.

If your download did not start, click here to retry. CrateDB is also available for other platforms.

Next Steps

Stay up to date is constantly adding fantastic new features and making CrateDB an even better solution for you. Sign up for our newsletter to stay up to date.