Resiliency Issues

CrateDB uses Elasticsearch for data distribution and replication. Most of the resiliency issues exist in the Elasticsearch layer and can be tested by Jepsen.

Known issues

Retry of updates causes double execution

Status: Work ongoing (More info)
Severity: Moderate
Likelihood: Very rare
Cause: Network issues, unresponsive nodes
Workloads: Non-idempotent writes

Scenario

A node with a primary shard receives an update, writes it to disk, but goes offline before having sent a confirmation back to the executing node. When the node comes back online, it receives an update retry and executes the update again.

Consequence

Incorrect data for non-idempotent writes.

For example:

  • A retried insert on a table without an explicit primary key would be executed twice and result in duplicate rows (see the sketch below).

  • A retried update would incorrectly increment the row version number twice.
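
One way to guard against this at the application level is to make writes idempotent, for example by giving every row an explicit, client-generated primary key so a retried insert cannot create a second row. The following is a minimal sketch, assuming the CrateDB Python client (DB-API 2.0) and a hypothetical sensor_readings table; the connection URL, table, and column names are placeholders, and ON CONFLICT DO NOTHING requires a reasonably recent CrateDB version.

    # Minimal sketch: an explicit, client-generated primary key plus
    # ON CONFLICT DO NOTHING makes a retried INSERT idempotent -- the second
    # execution is ignored instead of creating a duplicate row.
    # The connection URL and table/column names are placeholders.
    from crate import client

    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS sensor_readings (
            reading_id TEXT PRIMARY KEY,  -- stable across client retries
            value DOUBLE PRECISION
        )
    """)

    def write_reading(reading_id, value):
        # Safe to call again after a timeout: a duplicate insert is a no-op.
        cursor.execute(
            "INSERT INTO sensor_readings (reading_id, value) VALUES (?, ?) "
            "ON CONFLICT DO NOTHING",
            (reading_id, value),
        )

    write_reading("node1-2021-01-01T00:00:00Z", 21.5)
    write_reading("node1-2021-01-01T00:00:00Z", 21.5)  # retry, no duplicate row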

Fixed issues

Repeated cluster partitions can cause lost cluster updates

Status: Fixed in CrateDB v4.0 (#32006, #32171)
Severity: Moderate
Likelihood: Very rare
Cause: Network issues, unresponsive nodes
Workloads: All

Scenario

A cluster is partitioned and a new master is elected on the side that has quorum. The cluster is repaired and simultaneously a change is made to the cluster state. The cluster is partitioned again before the new master node has a chance to publish the new cluster state and the partition the master lands on does not have quorum.

Consequence

The node steps down as master and the uncommunicated state changes are lost.

Cluster state is very important and contains information like shard location, schemas, and so on. Lost cluster state updates can cause data loss, reset settings, and problems with table structures.

Partially fixed

This problem is mostly fixed by #20384 (CrateDB v2.0.x), which uses committed cluster state updates during the master election process. This does not fully solve the problem, but it considerably reduces the chance of occurrence: if the second partition happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, the in-flight update may be lost. If the now-isolated master can still acknowledge the cluster state update to the client, this results in the loss of an acknowledged change.
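
There is no client-side fix for this class of problem, but the symptom (a node that no longer sees an elected master, or that sees fewer peers than expected) can be watched for. The following is a minimal monitoring sketch, assuming the CrateDB Python client and the sys.cluster and sys.nodes system tables; EXPECTED_NODES and the connection URL are placeholders.

    # Minimal sketch: warn when this node appears to be on the minority side
    # of a partition. EXPECTED_NODES and the connection URL are placeholders.
    from crate import client

    EXPECTED_NODES = 3  # total number of nodes in the cluster

    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    # The elected master as seen from this node (NULL if none is visible).
    cursor.execute("SELECT master_node FROM sys.cluster")
    master_node = cursor.fetchone()[0]

    # The nodes this node can currently reach.
    cursor.execute("SELECT count(*) FROM sys.nodes")
    visible_nodes = cursor.fetchone()[0]

    if master_node is None or visible_nodes <= EXPECTED_NODES // 2:
        print("WARNING: possible partition; postpone schema changes and "
              "other cluster state updates until the cluster is whole again")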

Version numbers can represent ambiguous row versions

Status: Fixed in CrateDB v4.0 (#19269, #10708)
Severity: Significant
Likelihood: Very rare
Cause: Network issues, unresponsive nodes
Workloads: Versioned reads with replicated tables while writing

Scenario

A client is writing to a primary shard. The node holding the primary shard is partitioned from the cluster. It usually takes between 30 and 60 seconds (depending on ping configuration) before the master node notices the partition. During this time, the same row is updated on both the primary shard (partitioned) and a replica shard (not partitioned).

Consequence

There are two different versions of the same row with the same version number. When the primary shard rejoins the cluster and its data is replicated, the update that was made on the replica shard is lost, but the new version number matches that lost update. This breaks optimistic concurrency control.
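
Ambiguous version numbers matter because clients rely on them for optimistic concurrency control. The following is a minimal sketch of that pattern, assuming the CrateDB Python client and a hypothetical accounts table; it uses the _version system column (CrateDB 4.0 and later use _seq_no and _primary_term for this instead).

    # Minimal sketch of optimistic concurrency control, which depends on the
    # row version being unambiguous. Table/column names and the connection
    # URL are placeholders.
    from crate import client

    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    # Read the row together with the version it had at read time.
    cursor.execute("SELECT _version, balance FROM accounts WHERE id = ?", ("a1",))
    version, balance = cursor.fetchone()

    # Only apply the update if the row has not changed since it was read.
    cursor.execute(
        "UPDATE accounts SET balance = ? WHERE id = ? AND _version = ?",
        (balance + 100, "a1", version),
    )
    if cursor.rowcount == 0:
        # Another writer got there first (or the version was ambiguous):
        # re-read the row and retry.
        print("conflict detected, retrying")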

Replicas can fall out of sync when a primary shard fails

Status: Fixed in CrateDB v4.0 (#10708)
Severity: Modest
Likelihood: Rare
Cause: Primary fails and in-flight writes are only written to a subset of its replicas
Workloads: Writes on replicated table

Scenario

When a primary shard fails, a replica shard is promoted to primary. If there is more than one replica shard, the remaining replicas can fall out of sync with the new primary, because operations that were in flight when the primary shard failed may not have been processed on all replica shards. The discrepancies are not repaired on primary promotion; they are only repaired when replica shards are relocated (e.g., from hot to cold nodes), which means the length of time for which replicas can be out of sync with the primary shard is unbounded.

Consequence

Stale data may be read from replicas.
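
While replicas are out of sync, the divergence is typically visible as differing document counts between a primary shard and its copies. The following is a hedged sketch that compares them via the sys.shards system table; 'my_table' and the connection URL are placeholders, and small transient differences are normal while writes are in flight.

    # Minimal sketch: flag replica shards whose document count differs from
    # their primary. 'my_table' and the connection URL are placeholders;
    # transient differences can occur while writes are still in flight.
    from collections import defaultdict
    from crate import client

    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    cursor.execute("""
        SELECT id, "primary", num_docs
        FROM sys.shards
        WHERE table_name = 'my_table' AND state = 'STARTED'
    """)

    primaries, replicas = {}, defaultdict(list)
    for shard_id, is_primary, num_docs in cursor.fetchall():
        if is_primary:
            primaries[shard_id] = num_docs
        else:
            replicas[shard_id].append(num_docs)

    for shard_id, primary_docs in primaries.items():
        for replica_docs in replicas[shard_id]:
            if replica_docs != primary_docs:
                print(f"shard {shard_id}: replica has {replica_docs} docs, "
                      f"primary has {primary_docs}")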

Loss of rows due to network partition

Status: Fixed in CrateDB v2.0.x (#7572, #14252)
Severity: Significant
Likelihood: Very rare
Cause: Single node isolation
Workloads: Writes on replicated table

Scenario

A node with a primary shard is partitioned from the cluster. The node continues to accept writes until it notices the network partition. In the meantime, another shard has been elected as the primary. Eventually, the partitioned node rejoins the cluster.

Consequence

Data that was written to the original primary shard on the partitioned node is lost as data from the newly elected primary shard replaces it when it rejoins the cluster.

The risk window depends on your ping configuration. The default configuration of a 30-second ping timeout with three retries corresponds to a 90-second risk window. However, it is very rare for a node to lose connectivity within the cluster but maintain connectivity with clients.
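
On recent CrateDB versions the window can be narrowed (though not eliminated) by requiring more shard copies to be active before a write is accepted, at the cost of write availability during failures. The following is a hedged sketch using the write.wait_for_active_shards table setting; the table name and connection URL are placeholders, and the setting name and accepted values should be verified against the documentation for your version.

    # Minimal sketch: require all copies of a shard (primary plus replicas)
    # to be active before writes are accepted. This reduces, but does not
    # eliminate, the chance that a write is acknowledged only by an isolated
    # primary. 'my_table' and the connection URL are placeholders.
    from crate import client

    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    cursor.execute("""
        ALTER TABLE my_table SET ("write.wait_for_active_shards" = 'all')
    """)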

Dirty reads caused by bad primary handover

Status: Fixed in CrateDB v2.0.x (#15900, #12573)
Severity: Moderate
Likelihood: Rare
Cause: Race condition
Workloads: Reads

Scenario

During a primary handover, there is a small risk window when a shard can find out it has been elected as the new primary before the old primary shard notices that it is no longer the primary.

A primary handover can happen in the following scenarios:

  • A shard is relocated and then elected as the new primary, as two separate but sequential actions. Relocating a shard means creating a new shard and then deleting the old shard.

  • An existing replica shard gets promoted to primary because the primary shard was partitioned from the cluster.

Consequence

Writes that occur on the new primary during the risk window will not be replicated to the old shard (which still believes it is the primary) so any subsequent reads on the old shard may return incorrect data.

Changes are overwritten by old data, resulting in data loss

Status: Fixed in CrateDB v2.0.x (#14671)
Severity: Significant
Likelihood: Very rare
Cause: Network problems
Workloads: Writes

Scenario

A node holding a primary shard that contains new data is partitioned from the cluster.

Consequence

CrateDB prefers old data over no data, and so promotes a shard with stale data as the new primary. The data on the original primary shard is lost: even if the node holding the original primary shard rejoins the cluster, CrateDB has no way of distinguishing correct from incorrect data, so its data is replaced with data from the new primary shard.

Make table creation resilient to closing and full cluster crashes

Status: Fixed (table recovery: #9126, reopening tables: #14739, allocation IDs: #15281)
Severity: Modest
Likelihood: Very rare
Cause: Either the cluster fails while recovering a table or the table is closed during shard creation
Workloads: Table creation

Scenario

Recovering a table requires a quorum of shard copies to be available to allocate a primary. This means that a primary cannot be assigned if the cluster dies before enough shards have been allocated. The same happens if a table is closed before enough shard copies were started, making it impossible to reopen the table.

Allocation IDs solve this issue by tracking allocated shard copies in the cluster, which makes it possible to safely recover a table in the presence of a single shard copy. Allocation IDs can also distinguish the situation where a table has been created but none of its shards have been started: if such a table was inadvertently closed before at least one shard could be started, a fresh shard will be allocated upon reopening the table.

Consequence

The primary shard of the table cannot be assigned or a closed table cannot be re-opened.
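
On current CrateDB versions, the reason a primary cannot be assigned can be inspected through the sys.allocations system table (available since CrateDB 3.0). The following is a minimal, hedged sketch; 'my_table' and the connection URL are placeholders.

    # Minimal sketch: list unassigned shards of a table together with the
    # allocator's explanation of why they cannot be assigned.
    # 'my_table' and the connection URL are placeholders.
    from crate import client

    conn = client.connect("http://localhost:4200")
    cursor = conn.cursor()

    cursor.execute("""
        SELECT shard_id, "primary", explanation
        FROM sys.allocations
        WHERE table_name = 'my_table' AND current_state = 'UNASSIGNED'
    """)
    for shard_id, is_primary, explanation in cursor.fetchall():
        kind = "primary" if is_primary else "replica"
        print(f"shard {shard_id} ({kind}): {explanation}")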

Unaware master accepts cluster updates

Status: Fixed in CrateDB v2.0.x (#13062)
Severity: Moderate
Likelihood: Very rare
Cause: Network problems
Workloads: DDL statements

Scenario

If a master has lost quorum (i.e. the number of nodes it is in communication with has fallen below the configured minimum) it should step down as master and stop answering requests to perform cluster updates. There is a small risk window between losing quorum and noticing that quorum has been lost, depending on your ping configuration.

Consequence

If a cluster update request is made to the node between losing quorum and noticing the loss of quorum, that request will be confirmed. However, those updates will be lost because the node will not be able to perform a successful cluster update.

Cluster state is very important and contains information like shard location, schemas, and so on. Lost cluster state updates can cause data loss, reset settings, and problems with table structures.
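
For the pre-4.0 zen discovery this section describes, the standard guard is to require a strict majority of master-eligible nodes before a node may act as master, so an isolated master loses quorum and steps down instead of accepting updates. The following is a hedged sketch of the quorum arithmetic; discovery.zen.minimum_master_nodes is the setting CrateDB inherited from Elasticsearch and does not apply to CrateDB 4.0 and later, which handle this automatically.

    # Minimal sketch: quorum sizing for pre-4.0 clusters. Requiring a strict
    # majority of master-eligible nodes means a master that loses quorum must
    # step down rather than accept cluster updates. Not needed on CrateDB 4.0+.

    def minimum_master_nodes(master_eligible_nodes: int) -> int:
        """Smallest strict majority of master-eligible nodes."""
        return master_eligible_nodes // 2 + 1

    for n in (3, 5, 7):
        print(f"{n} master-eligible nodes -> minimum_master_nodes = "
              f"{minimum_master_nodes(n)}")

    # e.g. in crate.yml, for three master-eligible nodes:
    #   discovery.zen.minimum_master_nodes: 2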