The Crate JDBC driver can be used as an origin and destination stage in the StreamSets Data Collector pipeline.
Install the StreamSets Data Collector on your localhost or any other desired location.
In order to build your first data stream pipeline the standalone version of the Crate JDBC driver with included dependencies has to be obtained and installed in the StreamSets.
You can download the latest standalone version directly from the Bintray Repository.
After the JDBC driver is downloaded to the arbitrary destination, you can proceed with the installation of the driver. We recommend following the StreamSets instruction for installing additional drivers. Although to have a quick start you can simply place the Crate JDBC driver jar in the StreamSets classpath.
The next two sections provide a brief introduction to building simple data stream pipelines using Crate with StreamSets Data Collector. We are going to demonstrate how to build the pipeline with the directory origin stage which contains CSV data sample, please refer to the StreamSets tutorial to setup the origin correctly, and stream data to JDBC destination stage using the Crate JDBC driver.
The Crate JDBC Producer destination uses the JDBC connection to write data to the database. In this section, we show how to build the StreamSets project for injecting CSV records from the local filesystem into the Crate database with the preliminary preprocessing of certain records fields.
Create the new pipeline and follow the above mentioned StreamSets tutorial which explains how to create the directory origin and read CSV files from the local filesystem. Use the New York taxi data sample provided in the tutorial.
For the sake of simplicity, we use only 4 fields from the CSV file. The directory origin does not perform any preliminary preprocessing of records. All source fields are represented as strings in the CSV. However, some of the selected fields should be processed and converted into float values. Therefore, we add an additional processing Field Converter stage into the pipeline.
Create the taxi table in Crate.
CREATE TABLE taxi ( hack_license STRING, medallion STRING, total_amount FLOAT, tip_amount FLOAT );
The next step is to configure the Crate JDBC destination. First, load the Crate JDBC driver must be loaded. Then, we configure the JDBC connection (1) and define the destination table (2). After the above-mentioned configurations steps, we still can see warnings in the Credential tab. It is solved by unchecking the user authorization box because Crate does not support an authorization. (3)
Start the pipeline.
The JDBC Consumer origin uses the JDBC connection to read data from the Crate database. The Crate origin stage returns data as a map with column names and field values.
Currently, the usage of the Crate JDBC driver in combination with StreamSets Data Collector introduces few limitations:
LIMITclause should be explicitly set in the
It is important to mention that you have to adjust the
LIMIT size in the
query and the Query Interval property in the JDBC configuration tab to get the
suitable read performance, taking into consideration your particular use case
and the nature of the data.
The followings steps demonstrate how Crate can be used as the origin stage in the data stream pipeline. As sample data, we use AMPLab rankings dataset. The data can be imported from AWS S3 to the Crate database using prepared data import queries hosted in the Crate demo repository. Create the rankings table and import the rankings data. In the demo, we use the dataset which contains 18 millions of records. Assuming that the Crate cluster is set up and the rankings sample data is imported we can start building the data stream pipeline for streaming the data from the Crate database to CSV files.
To avoid unsupported transaction setting queries that may be invoked against CrateDB, it is necessary to uncheck Enforce Read-only Connection on the advanced tab of the JDBC consumer.
We stream the records from Crate to CSV files. In order to accomplish that we provide the path where the files are going to be created and set the output file format in the directory origin to Delimited.
Now we can start the pipeline and see rankings data streaming statistics form the Crate database to CSV files.