Building Data Stream pipelines with CrateDB and StreamSets Data Collector

CrateDB can be integrated with StreamSets via its JDBC driver (version 1.12.1 or later).

The CrateDB JDBC driver can be used as an origin and destination stage in the StreamSets Data Collector pipeline.

Install the StreamSets Data Collector on localhost or any other desired host.

To build your first data stream pipeline, you need to obtain the standalone version of the CrateDB JDBC driver (with dependencies included) and install it in StreamSets.

You can download the latest standalone version directly from the Bintray Repository.

Once the JDBC driver has been downloaded to a destination of your choice, you can proceed with the installation. We recommend following the StreamSets instructions for installing additional drivers. However, for a quick start, you can simply place the CrateDB JDBC driver jar on the StreamSets classpath.

The next two sections provide a brief introduction to building simple data stream pipelines using CrateDB with StreamSets Data Collector. We are going to demonstrate how to build a pipeline with a directory origin stage that reads sample CSV data (please refer to the StreamSets tutorial to set up the origin correctly) and streams the data to a JDBC destination stage using the CrateDB JDBC driver.

CrateDB JDBC Destination

The CrateDB JDBC Producer destination uses the JDBC connection to write data to the database. In this section, we show how to build a StreamSets project that ingests CSV records from the local filesystem into CrateDB, with some preliminary processing of certain record fields.

  1. Create a new pipeline and follow the StreamSets tutorial mentioned above, which explains how to create the directory origin and read CSV files from the local filesystem. Use the New York taxi data sample provided in the tutorial.

  2. For the sake of simplicity, we use only four fields from the CSV file. The directory origin does not perform any preprocessing of records; all source fields are read from the CSV as strings. However, some of the selected fields have to be converted into float values, so we add a Field Converter stage to the pipeline.

    Convert string input values
  3. Create the taxi table in CrateDB.

    CREATE TABLE taxi (
      hack_license STRING,
      medallion STRING,
      total_amount FLOAT,
      tip_amount FLOAT
    );
  4. The next step is to configure the CrateDB JDBC destination. First, the CrateDB JDBC driver must be loaded.

    Load the CrateDB JDBC driver

    Then, we configure the JDBC connection (1) and define the destination table (2). After these configuration steps, warnings may still appear in the Credentials tab. They are resolved by unchecking the user authorization box (3), since CrateDB does not support authorization.

    Configure the CrateDB JDBC driver
  5. Start the pipeline.

    Pipelines run report
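
The data flow of steps 2–4 can be made explicit with a small sketch outside StreamSets: selected string fields are converted to floats (what the Field Converter stage does), and a parameterized INSERT for the taxi table is prepared (what the JDBC Producer does). The field names follow the taxi sample above; the record values here are placeholders:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TaxiRecordPrep {

    // Fields that the Field Converter stage turns from strings into floats.
    static final List<String> FLOAT_FIELDS = List.of("total_amount", "tip_amount");

    // Mimics the Field Converter stage: selected string fields become doubles,
    // everything else is passed through unchanged.
    static Map<String, Object> convert(Map<String, String> raw) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            out.put(e.getKey(), FLOAT_FIELDS.contains(e.getKey())
                    ? (Object) Double.valueOf(e.getValue())
                    : e.getValue());
        }
        return out;
    }

    // The parameterized statement a JDBC producer would prepare for the taxi table.
    static String insertSql(Collection<String> columns) {
        return "INSERT INTO taxi (" + String.join(", ", columns) + ") VALUES ("
                + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("hack_license", "license-placeholder");
        raw.put("medallion", "medallion-placeholder");
        raw.put("total_amount", "9.4");
        raw.put("tip_amount", "1.0");

        Map<String, Object> record = convert(raw);
        System.out.println(record);
        System.out.println(insertSql(record.keySet()));
    }
}
```

Inside StreamSets, the Field Converter and JDBC Producer stages handle this for you; the sketch is only meant to illustrate what happens to each record between the directory origin and CrateDB.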

CrateDB JDBC Origin

The JDBC Consumer origin uses the JDBC connection to read data from the CrateDB database. The CrateDB origin stage returns data as a map with column names and field values.

Currently, using the CrateDB JDBC driver in combination with StreamSets Data Collector comes with a few limitations:

  • Only incremental mode for the data fetch is supported.
  • The offset column should be the primary key column to prevent the insertion of duplicate rows.
  • The LIMIT clause should be explicitly set in the SELECT query.
  • Batch size in the driver configuration is ignored.

It is important to mention that you have to adjust the LIMIT size in the query and the Query Interval property in the JDBC configuration tab to achieve suitable read performance for your particular use case and the nature of your data.
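
Putting these constraints together, an incremental-mode query follows a fixed pattern: filter on the offset column, order by it, and set an explicit LIMIT. A minimal sketch of that pattern (the table and column names `rankings` and `pageurl` and the LIMIT of 1000 are illustrative assumptions; `${OFFSET}` is the placeholder StreamSets substitutes with the last stored offset):

```java
public class IncrementalQuery {

    // Builds the kind of SELECT the JDBC Consumer needs for CrateDB:
    // incremental mode, offset column as primary key, explicit LIMIT.
    // ${OFFSET} is quoted because the assumed offset column is a string.
    static String build(String table, String offsetColumn, int limit) {
        return "SELECT * FROM " + table
                + " WHERE " + offsetColumn + " > '${OFFSET}'"
                + " ORDER BY " + offsetColumn
                + " LIMIT " + limit;
    }

    public static void main(String[] args) {
        // Illustrative values; tune the LIMIT together with the Query Interval
        // property to get the desired read performance.
        System.out.println(build("rankings", "pageurl", 1000));
        // -> SELECT * FROM rankings WHERE pageurl > '${OFFSET}' ORDER BY pageurl LIMIT 1000
    }
}
```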

The following steps demonstrate how CrateDB can be used as the origin stage in a data stream pipeline. As sample data, we use the AMPLab rankings dataset. The data can be imported from AWS S3 into CrateDB using the prepared data import queries hosted in the CrateDB demo repository. Create the rankings table and import the rankings data. In this demo, we use a dataset containing 18 million records. Assuming that the CrateDB cluster is set up and the rankings sample data is imported, we can start building the data stream pipeline that streams data from CrateDB to CSV files.

  1. Create a new pipeline and configure the CrateDB JDBC driver loading as was done for the JDBC destination configuration. In the JDBC tab of the CrateDB JDBC origin, we set the connection string (1), the SQL query for fetching data from the database (2), and the offset value and column (3). For more detailed information on how to set the offset value and column, see the StreamSets documentation.

    Configure the CrateDB JDBC driver


    To avoid unsupported transaction-setting queries being invoked against CrateDB, it is necessary to uncheck Enforce Read-only Connection on the Advanced tab of the JDBC Consumer.

  2. We stream the records from CrateDB to CSV files. To accomplish this, we provide the path where the files are to be created and set the output file format of the destination stage to Delimited.

    Configure CSV destination
  3. Now we can start the pipeline and watch the statistics of the rankings data streaming from CrateDB to CSV files.

    CrateDB to CSV report