CrateDB Resiliency

Distributed systems are tricky. All sorts of things can go wrong that are beyond your control. The network can go away, disks can fail, hosts can be terminated unexpectedly. CrateDB tries very hard to cope with these sorts of issues while maintaining availability, consistency, and durability.

However, as with any distributed system, things can occasionally go wrong.

Thankfully, for most use-cases, if you follow best practices, you are extremely unlikely to experience resiliency issues with CrateDB.

Best Practices

Monitoring Cluster Status

The Admin UI in CrateDB has a status indicator which can be used to determine the stability and health of a cluster.

A green status indicates that all shards have been replicated, are available, and are not being relocated. This is the lowest risk status for a cluster. The status will turn yellow when there is an elevated risk of encountering issues, due to a network failure or the failure of a node in the cluster.

The status is updated every few seconds (depending on your cluster ping configuration).
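
The same information can be approximated from SQL, for example from an external monitoring tool, by querying the sys.shards system table for shards that are not yet started. This is a minimal sketch; the exact check you need depends on your setup.

    -- count shards per table that are not in the STARTED state (a rough
    -- proxy for a yellow or red status in the Admin UI)
    SELECT table_name, state, count(*) AS shards
    FROM sys.shards
    WHERE state != 'STARTED'
    GROUP BY table_name, state;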

Storage and Consistency

Code that expects the behavior of an ACID compliant database like MySQL may not always work as expected with CrateDB.

CrateDB does not support ACID transactions, but instead has atomic operations and eventual consistency at the row level. Eventual consistency is the trade-off that CrateDB makes in exchange for high availability that can tolerate most hardware and network failures. So you may observe data on different cluster nodes briefly falling out of sync with each other, although over time they will become consistent.

For example, you know a row has been written as soon as you get the INSERT OK message. But that row might not be read back by a subsequent SELECT on a different node until after a table refresh (which typically occurs within one second).
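
To experiment with this behaviour, you can request a refresh explicitly between the write and the read. Here is a minimal sketch using a hypothetical sensor_readings table; forcing refreshes like this is fine for testing, but doing it for every write in production adds unnecessary load.

    -- a throwaway table for illustration (no primary key, so the read below
    -- goes through the search index and is subject to the refresh interval)
    CREATE TABLE sensor_readings (ts TIMESTAMP, value DOUBLE);

    -- the write is durable as soon as CrateDB confirms it ...
    INSERT INTO sensor_readings (ts, value) VALUES ('2017-01-01T00:00:00', 42.0);

    -- ... but a search may not return it until the next periodic refresh;
    -- an explicit refresh makes it visible right away
    REFRESH TABLE sensor_readings;

    SELECT count(*) FROM sensor_readings WHERE value >= 42.0;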

Your applications should be designed to work with this storage and consistency model.

Deployment Strategies

When deploying CrateDB you should carefully weigh your need for high-availability and disaster recovery against operational complexity and expense.

Which strategy you pick is going to depend on the specifics of your situation.

Here are some considerations:

  • CrateDB is designed to scale horizontally. Make sure that your machines are fit for purpose, i.e. use SSDs, increase RAM up to 64 GB, and use multiple CPU cores when you can. But if you want to dynamically increase (or decrease) the capacity of your cluster, add (or remove) nodes.
  • If availability is a concern, you can add nodes across multiple zones (e.g. different data centers or geographical regions). The more available your CrateDB cluster is, the more likely it is to withstand external failures like a zone going down.
  • If data durability or read performance is a concern, you can increase the number of table replicas. More table replicas mean a smaller chance of permanent data loss due to hardware failures, in exchange for more disk space and more intra-cluster network traffic (see the sketch after this list).
  • If disaster recovery is important, you can take regular snapshots and store those snapshots in cold storage. This safeguards data that has already been successfully written and replicated across the cluster.
  • CrateDB works well as part of a data pipeline, especially if you’re working with high-volume data. If you have a message queue in front of CrateDB, you can configure it with backups and replay the data flow for a specific timeframe. This can be used to recover from issues that affect your data before it has been successfully written and replicated across the cluster.

    Indeed, this is the generally recommended way to recover from any of the rare consistency or data-loss issues you might encounter when CrateDB experiences network or hardware failures (see next section).
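
For example, the replica and snapshot suggestions above translate into statements along these lines (the table, repository, and path names are placeholders):

    -- keep one replica of every shard on a different node
    ALTER TABLE sensor_readings SET (number_of_replicas = 1);

    -- register a snapshot repository backed by a shared filesystem path
    CREATE REPOSITORY backups TYPE fs WITH (location = '/mnt/backups');

    -- snapshot all tables and wait until the snapshot has completed
    CREATE SNAPSHOT backups.nightly_2017_01_01 ALL WITH (wait_for_completion = true);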

Known Issues

CrateDB started off as a fork of Elasticsearch, but over time we have gradually rewritten large parts of it. And while the execution layer is now completely unique to CrateDB, data distribution and replication are still managed by Elasticsearch.

Most of the known issues that relate to resiliency exist in the Elasticsearch layer.

Fortunately, most of these issues are fixed in the Elasticsearch 5.x release line. We are working on updating CrateDB to use Elasticsearch 5.x and hope to have it ready in the first quarter of 2017.

Repeated Cluster Partitions Can Cause Lost Cluster Updates

Status: Work ongoing (#20384)
Severity: Moderate
Likelihood: Very rare
Cause: Network issues, unresponsive nodes
Workloads: All

Scenario

A cluster is partitioned and a new master is elected on the side that has quorum. The cluster is repaired and simultaneously a change is made to the cluster state. The cluster is partitioned again before the new master node has a chance to publish the new cluster state and the partition the master lands on does not have quorum.

Consequence

The node steps down as master and the unpublished cluster state changes are lost.

Cluster state is very important and contains information like shard location, schemas, and so on. Lost cluster state updates can cause data loss, reset settings, and problems with table structures.

Retry of Updates Causes Double Execution

Status: Work ongoing (#9967)
Severity: Moderate
Likelihood: Very rare
Cause: Network issues, unresponsive nodes
Workloads: Non-idempotent writes

Scenario

A node with a primary shard receives an update, writes it to disk, but goes offline before having sent a confirmation back to the executing node. When the node comes back online, it receives an update retry and executes the update again.

Consequence

Incorrect data for non-idempotent writes.

For example:

  • A double insert on a table without an explicit primary key would be executed twice and would result in duplicate data (see the sketch after this list).

  • A double update would incorrectly increment the row version number twice.
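
For the first example, giving the table an explicit primary key limits the damage: a retried insert of the same row is rejected as a duplicate instead of silently creating a second copy, and the client can treat that error from a retry as success. A minimal sketch, using a hypothetical metrics table:

    CREATE TABLE metrics (
        id STRING PRIMARY KEY,
        value DOUBLE
    );

    INSERT INTO metrics (id, value) VALUES ('sensor-1', 20.5);

    -- a retried execution of the same insert fails with a duplicate key
    -- error instead of producing a second row
    INSERT INTO metrics (id, value) VALUES ('sensor-1', 20.5);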

Version Number Representing Ambiguous Row Versions

Status: Work ongoing (#3711)
Severity: Significant
Likelihood: Very rare
Cause: Network issues, unresponsive nodes
Workloads: Versioned reads with replicated tables while writing

Scenario

A client is writing to a primary shard. The node holding the primary shard is partitioned from the cluster. It usually takes between 30 and 60 seconds (depending on ping configuration) before the master node notices the partition. During this time, the same row is updated on both the primary shard (partitioned) and a replica shard (not partitioned).

Consequence

There are now two different versions of the same row with the same version number. When the primary shard rejoins the cluster and its data is replicated, the update that was made on the replica shard is lost, yet the version number of the surviving row matches that of the lost update. This breaks Optimistic Concurrency Control.
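
Optimistic Concurrency Control in CrateDB is based on the _version system column: you read a row together with its version and only apply an update if the version is unchanged. A minimal sketch, reusing the hypothetical metrics table from above:

    -- read the row together with its current version
    SELECT _version, value FROM metrics WHERE id = 'sensor-1';

    -- apply the update only if the version is still the one that was read;
    -- if another write bumped the version in the meantime, this statement
    -- matches zero rows and the application can retry
    UPDATE metrics
    SET value = 21.0
    WHERE id = 'sensor-1' AND _version = 3;

In the failure scenario described above, two different row contents can briefly share the same version number, so this version check alone cannot detect the conflict.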

Loss of Rows Due to Network Partition

Status: Fix planned for early 2017 (#7572, #14252)
Severity: Significant
Likelihood: Very rare
Cause: Single node isolation
Workloads: Writes on replicated tables

Scenario

A node with a primary shard is partitioned from the cluster. The node continues to accept writes until it notices the network partition. In the meantime, another shard has been elected as the primary. Eventually, the partitioned node rejoins the cluster.

Consequence

Data that was written to the original primary shard on the partitioned node is lost, because it is replaced by data from the newly elected primary shard when the node rejoins the cluster.

The risk window depends on your ping configuration. The default configuration of a 30 second ping timeout with three retries corresponds to a 90 second risk window. However, it is very rare for a node to lose connectivity within the cluster but maintain connectivity with clients.

Dirty Reads Caused by Bad Primary Handover

Status: Fix planned for early 2017 (#15900, #12573)
Severity: Moderate
Likelihood: Rare
Cause: Race condition
Workloads: Reads

Scenario

During a primary handover, there is a small risk window when a shard can find out it has been elected as the new primary before the old primary shard notices that it is no longer the primary.

A primary handover can happen in the following scenarios:

  • A shard is relocated and then elected as the new primary, as two separate but sequential actions. Relocating a shard means creating a new shard and then deleting the old shard.

  • An existing replica shard gets promoted to primary because the primary shard was partitioned from the cluster.

Consequence

Writes that occur on the new primary during the risk window will not be replicated to the old shard (which still believes it is the primary) so any subsequent reads on the old shard may return incorrect data.

Changes Overwritten by Old Data, Risking Data Loss

Status: Fix planned for early 2017 (#14671)
Severity: Significant
Likelihood: Very rare
Cause: Network problems
Workloads: Writes

Scenario

A node holding a primary shard that contains new data is partitioned from the cluster.

Consequence

CrateDB prefers old data over no data, and so promotes a shard with stale data to be the new primary. The data on the original primary shard is lost. Even if the node with the original primary shard rejoins the cluster, CrateDB has no way of distinguishing correct from incorrect data, so that data is replaced with data from the new primary shard.

Unaware Master Accepts Cluster Updates

Status: Fix planned for early 2017 (#13062)
Severity: Moderate
Likelihood: Very rare
Cause: Network problems
Workloads: DDL statements

Scenario

If a master has lost quorum (i.e. the number of nodes it is in communication with has fallen below the configured minimum) it should step down as master and stop answering requests to perform cluster updates. There is a small risk window between losing quorum and noticing that quorum has been lost, depending on your ping configuration.

Consequence

If a cluster update request is made to the node between losing quorum and noticing the loss of quorum, the request will be confirmed. However, the changes will be lost, because the node is no longer able to perform a successful cluster update.

Cluster state is very important and contains information like shard location, schemas, and so on. Lost cluster state updates can cause data loss, reset settings, and problems with table structures.