Crate vs SQL on Hadoop

(e.g. Impala, Drill, Spark, Presto)

Crate works well side by side with Hadoop. In fact, many Crate use cases involve collecting raw data in Hadoop for long-term storage and ingesting a compressed data set into Crate for real-time processing.

Although Crate may seem similar to SQL-on-Hadoop, in reality it is in an altogether different category. Its core design features are entirely different:

  • It is an always-on database, not a “SQL translation” layer
  • It supports reads and writes in real time, at the same time, on all data
  • It allows real-time ad hoc queries on the full data set (all data in Crate is “hot”), in contrast to solutions that read only a subset of the data for real-time processing or run batch-style queries.

A general comparison is difficult because it depends heavily on the requirements of the use case. However, here are some general comments:

  • SQL-on-Hadoop engines are characterized by pulling (usually) a subset of the data from the underlying storage layer into the computing layer. While SparkSQL, for example, also offers powerful real-time queries, pulling the data introduces a delay, and of course only the pulled data can be processed.
  • Crate provides a REST API for querying with standard SQL, including distributed JOINs. This offers convenience similar to SQL-on-Hadoop solutions, but on the complete data set and in real time, whereas Hadoop-based solutions are batch-oriented and do not offer real-time queries on the whole data set. Crate does not yet fully support the ANSI SQL-92 feature set, but you can ask us if a special need arises.
  • Ingestion power. The underlying HDFS of Hadoop is a serious bottleneck when writing data and making it available for queries. Crate, by contrast, is designed for massively parallel real-time ingestion of data and offers read-after-write consistency: after a successful write, the record is immediately and consistently available via GET. SQL-on-Hadoop solutions are often read-oriented, whereas Crate offers both read and write.
  • Despite being a distributed, scalable, always-on database, Crate “feels and operates” like a simple SQL database. From a SQL-on-Hadoop point of view, it offers:
      • writes through the same SQL interface as reads
      • no need for an underlying Hadoop cluster
      • much simpler setup and operation
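
To make the "same SQL interface for reads and writes" point concrete, here is a minimal sketch of how a client might talk to the REST API mentioned above. It assumes a Crate node reachable at localhost on port 4200 with the `_sql` HTTP endpoint accepting a JSON body with `stmt` and `args`; the table `sensor_readings`, its columns, and the helper names are hypothetical and only serve as illustration.

```python
import json
import urllib.request

# Assumption: a Crate node listens here; adjust host and port for your setup.
CRATE_SQL_URL = "http://localhost:4200/_sql"


def build_sql_request(stmt, args=None, url=CRATE_SQL_URL):
    """Build an HTTP POST request that carries one SQL statement as JSON."""
    payload = {"stmt": stmt}
    if args is not None:
        payload["args"] = list(args)  # positional values for the ? placeholders
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute(stmt, args=None, url=CRATE_SQL_URL):
    """Send the statement to the node and return the decoded JSON response."""
    with urllib.request.urlopen(build_sql_request(stmt, args, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# A write and a read are built the same way (hypothetical table and columns).
insert_req = build_sql_request(
    "INSERT INTO sensor_readings (sensor_id, value) VALUES (?, ?)",
    ["s-1", 21.5],
)
select_req = build_sql_request(
    "SELECT sensor_id, value FROM sensor_readings WHERE sensor_id = ?",
    ["s-1"],
)
```

Calling `execute(...)` with either statement would send it to a running node; building the requests separately, as done here, keeps the sketch usable without a live cluster.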

A little disclaimer: comparisons depend on generalizations by their very nature. If you think we didn’t get something right, get in contact and let us know.

