This document provides an overview on how Crate stores and distributes state across the cluster and what consistency and durability guarantees are provided.
Since Crate heavily relies on Elasticsearch and Lucene for storage and cluster consensus, concepts shown here might look familiar to Elasticsearch users, since the implementation is actually reused from the Elasticsearch code.
Every table in Crate is sharded, which means that tables are divided and distributed across the nodes of a cluster. Each shard in Crate is a Lucene index broken down into segments getting stored on the filesystem. Physically the files reside under one of the configured data directories of a node.
Lucene only appends data to segment files, which means that data written to the disc will never be mutated. This makes it easy for replication and recovery, since syncing a shard is simply a matter of fetching data from a specific marker.
An arbitrary number of replica shards can be configured per table. Every operational replica holds a full synchronized copy of the primary shard.
With read operations, there is no difference between executing the operation on the primary shard or on any of the replicas. Crate randomly assigns a shard when routing an operation. It is possible to configure this behavior if required, see our best practice guide on multi zone setups for more details.
Write operations are handled differently than reads. Such operations are synchronous over all active replicas with the following flow:
Should any replica shard fail to write the data or times out in step 5, it’s immediately considered as unavailable.
Each row of a table in Crate is a semi structured document which can be nested arbitrarily deep through the use of object and array types.
Operations on documents are atomic. Meaning that a write operation on a document either succeeds as a whole or has no effect at all. This is always the case, regardless of the nesting depth or size of the document.
Crate does not provide transactions. Since every document in Crate has a version number assigned, which gets increased every time a change occurs, patterns like optimistic concurrency control (see Optimistic Concurrency Control with Crate) can help to work around that limitation.
Each shard has a WAL also known as translog. It guarantees that operations on documents are persisted to disk without having to issue a Lucene-Commit for every write operation. When the translog gets flushed all data is written to the persistent index storage of Lucene and the translog gets cleared.
In case of an unclean shutdown of a shard, the transactions in the translog are getting replayed upon startup to ensure that all executed operations are permanent.
The translog is also directly transferred when a newly allocated replica initializes itself from the primary shard. There is no need to flush segments to disc just for replica-recovery purposes.
Every document has an internal identifier (see _id) . By default this identifier is derived from the primary key. Documents living in tables without a primary key are assigned a unique auto-generated id automatically when created.
Each document is routed by its routing key to one specific shard. By
default this key is the value of the
_id column. However this can
be configured in the table schema (see Routing).
While transparent to the user, internally there are two ways how Crate accesses documents:
Direct access by identifier. Only applicable if the routing key and the identifier can be computed from the given query specification. (e.g: the full primary key is defined in the where clause).
This is the most efficient way to access a document, since only
a single shard gets accessed and only a simple index lookup on
Query by matching against fields of documents across all candidate shards of the table.
Crate is eventual consistent for search operations. Search operations
are performed on shared
IndexReaders which besides other
functionality, provide caching and reverse lookup capabilities for
IndexReader is always bound to the Lucene segment it
was started from, which means it has to be refreshed in order to see
new changes, this is done on a time based manner, but can also be
done manually (see Refresh). Therefore a search only sees
a change if the according IndexReader was refreshed after that
If a query specification results in a
get operation, changes are
visible immediately. This is achieved by looking up the document in the
translog first, which will always have the most recent version of the
document. The common update and fetch use-case is therefore
possible. If a client updates a row and that row is looked up by its
primary key after that update the changes will always be visible,
since the information will be retrieved directly from the translog.
Every replica shard is updated synchronously with its primary and
always carries the same information. Therefore it does not matter if
the primary or a replica shard is accessed in terms of
consistency. Only the refresh of the
Some outage conditions can affect these consistency claims. See the resiliency documentation for details.
Cluster meta data is held in the so called “Cluster State”, which contains the following information:
Every node has its own copy of the cluster state. However there is only one node allowed to change the cluster state at runtime. This node is called the “master” node and gets auto-elected. The “master” node has no special configuration at all, any node in the cluster can be elected as a master. There is also an automatic re-election if the current master node goes down for some reason.
To avoid a scenario where two masters are elected due to network partitioning it’s required to define a quorum of nodes with which it’s possible to elect a master. For details in how to do this and further information see Primary Election.
To explain the flow of events for any cluster state change - here an example flow for an “ALTER TABLE” statement which changes the schema of a table: