Updated 10 March 2016
What better way to give developers access to the power of Elasticsearch than to use something that’s familiar, battle-tested, and straightforward? Something like SQL.
Crate has decoupled the query language (SQL) from the traditional relational data store and applied it to a Lucene index-based data store.
Elasticsearch provides an excellent interface to access the underlying Lucene indices, and a fabulous distributed model with replication, sharding, and healing. But to use these features, you need to use their Lucene-like query syntax. While there’s nothing wrong with this, it’s geared towards search, not a storage mentality. While adding the SQL layer was the easy part, we also wanted to solve some other issues that arose. Let’s take a look at them.
Due to the way data is sharded and distributed by Elasticsearch into Lucene indices, it’s difficult to determine cardinality (the number of distinct values in a set) of a value. For example, determining the number of unique values in a field called “username” in a table of blog posts. In a traditional database, this would be a
GROUP BY query, and you would iterate over all returned results and count the unique values (or it might already be cached by the system).
For example, with a MySQL database, you might write:
SELECT username, COUNT(*) FROM posts GROUP BY username ORDER BY 2 DESC;
Elasticsearch can do this, but it uses the HyperLogLog Algorithm (HLL) to statistically calculate an answer. This result may not be 100% accurate, but saves a lot of compute time. Redis uses the same algorithm.
GROUP BY is a staple of SQL, Crate has worked to solve this problem, and here’s a breakdown of how we did it.
Normally in a query, the request is sent to one node in the cluster which then instructs the other nodes to perform the query and report back. In a
GROUP BY scenario this might overwhelm the requesting node with a large number of responses, thus the need for HLL. Crate built a distribution layer on top of this process to redistribute the responses for processing.
The request is sent to all nodes, but the responses fed through this distribution layer. Each node hashes the values returned from the query and distributes each row (with all the values) to the other nodes based on the hash value. This means every node reduces its distinct part of the data down to a single value, since it knows that no other node has data with the same values. Once all the hash values have been counted across the cluster, the results are sent back to the requesting node for final processing.
This results in the final returned payload being easier to process than if all the nodes had returned all the raw values (possibly billions!) to the one node. With Crate, all the nodes participate in distributing the workload for calculating the result, spreading the load across the cluster and reducing the impact on any one node. Think of it as a distributed map/reduce job.
By approaching the problem this way, Crate delivers 100% accurate results for
GROUP BY type queries, as expected with a SQL query to a traditional database.
Elasticsearch and the NoSQL family it belongs to generally treat all data as flat and relating items of data is typically handled at an application level. In a traditional database, relating data typically happens at a database level through a variety of
For example, to return all comments associated with all posts, use a
SELECT posts.title, comments.comment FROM posts CROSS JOIN comments;
To return all posts and comments that match each post, use an
SELECT posts.title, comments.comment FROM posts INNER JOIN comments ON posts.pid = comments.pid
With Crate, both cross and inner joins are possible, using the same standard SQL syntax as above.
Whith SQL, you probably immediately think of
SELECT * FROM foo WHERE, etc. So it’s reasonable to think that with a database you’d have access to at least the most common types of actions, such as
In an Elasticsearch/Lucene based storage system, which is not structured like a common RDBMS, these operations are no longer available. For example, with Elasticsearch you can only update a single document at a time using the
With Crate, you can use any standard
WHERE query and all the matching documents will be updated as expected. Now you can use common SQL queries to update all matching records in one command.
The same applies for
INSERT queries, which are not supported by Elasticsearch out-of-the-box. Elasticsearch can index documents, which is equivalent to a simple SQL
INSERT, but with Crate you can
INSERT the results of a
SELECT query. You can use this to restructure table data, rename a field, change a field’s data type or convert a normal table into a partitioned one.
With Crate, each table translates to a Lucene index, and by the nature of Elasticsearch, this means each table is automatically sharded and replicated across the cluster. Crate offers the ability to define circumstances when to create new tables based on values in your data. We call this a ‘partitioned table’, a concept borrowed from Apache Hive.
For example, you can tell Crate that every time there is a new (distinct) value in the column ‘City’, make that into its own partition. Since this partition is just an index, it’s sharded and replicated as usual. To do this, use a
CREATE TABLE statement:
CREATE TABLE ... PARTITIONED BY (column, ...);
These features help administration and allow you to optimize how data is stored and retrieved. Once you’ve defined these settings, you can rest assured knowing that your partitions are being automatically created and replicated.
In most of this article we talk about data in terms of tables, and the concept of indices managed by Elasticsearch. Here, we are using both terms to explain how the process works for Crate’s partitioned tables.
When backing up an Elasticsearch instance, you are basically copying the underlying Lucene indexes in their native form. While it’s possible to introspect the data, this structure is cumbersome and unfamiliar to most. Crate allows you to export your data in JSON format so you can access the values and process them how you see fit. This is implemented using the
COPY FROM or
COPY TO SQL statements. Each node will output JSON files for all the master shards it knows about. It’s recommended that you export these all to a shared location to make collecting them easier.
You can import these files into another instance. An interesting feature is that if a
COPY FROM query is initiated from a shared resource, say S3, the cluster understands this and distributes the incoming JSON files across all the nodes. This avoids accidental duplication of data and allows all the nodes to start importing data simultaneously.
Please don’t hesitate to contact us if you have any questions.