In this article, our goal is to give you a throughout understanding of how CrateDB writes new records.
In our previous article, we gave a general overview of the storage layer in CrateDB. In general, every shard in CrateDB represents a Lucene index that is broken down into segments and stored in the filesystem. A Lucene segment can be seen as a sub-index and can be searched independently. When new records are written in CrateDB, Lucene first creates in-memory segments before flushing them to the disk.
We will go through the basic concepts of Lucene, such as Lucene segments, refresh and flush operation and introduce the concept of translog that guarantees that write operations are persistent to disk.
The Lucene segment is a part of the logical Lucene index and the Lucene index maps 1-1 to a CrateDB shard. It is an independent index that can contain inverted indexes, k-d trees, and doc values. A Lucene segment can be searched independently of other segments and documents in the segments are immutable. Every time a field of a document is updated, the document is flagged as deleted in the segment it belonged, and the updated document is added to a new segment. The same behavior applies when a document is deleted in CrateDB. All subsequent queries will skip all the documents that were previously marked as deleted.
To keep the number of segments manageable, Lucene periodically merges them according to some merge policy. When segments are merged, documents that are marked as deleted are discarded and newly created segments will contain only valid, non-deleted documents from the original segments. Merge is triggered when new documents are added and maybe surprisingly, this can result in a smaller index size. To merge segments of a table or a partition on demand one can use the OPTIMIZE TABLE command in CrateDB. When the max_num_segments parameter is set to 1, CrateDB will fully merge the table or partition for optimal query performance. The more details on table optimization in CrateDB, check out our documentation.
During the recovery of nodes, to speed up the process, CrateDB needs to replay operations that were executed on one shard on other shards (replicas), residing on different nodes. To preserve the recent deletions within the Lucene index, CrateDB supports the concept of soft delete. Because of that, deleted documents still take up disk space, which is why CrateDB keeps only recently deleted documents. It is possible to customize the retention lease period that says for how long the deleted documents have to be preserved. CrateDB will discard documents only after the expiration of this period. This means that if the merge operation takes place before the expiration, deleted documents will still remain physically available. The default retention lease period is 12 hours.
Writing data to CrateDB
The following diagram illustrates how a new record is stored in CrateDB. When a new document arrives it is first committed to a memory buffer and translog (step 1). Documents in the memory buffer will become in-memory Lucene segments after the refresh operation takes place (step 2). CrateDB executes a refresh operation every second if a search request is received in the last 30 seconds. Eventually, Lucene will commit new segments to the disk (step 3). Once the data are stored on the disk, the merge operation will get triggered and some of the segments will be merged.
Translog stores write operations for documents that are in memory and that have not been committed. A translog exists for each shard and it is stored in the physical disk. In case of a node failure, CrateDB can retrieve the potentially lost data by replaying the operations from the translog (step 4).
The data in translog are persisted to disk when the translog is
fsynced and committed. The default behavior of CrateDB implies that the translog is flushed to the disk after every operation. Additionally, it is possible to flush the translog every
translog.sync_interval which can be controlled by
The following parameters control the behavior of the translog:
translog.flush_threshold_sizesets the size of the transaction log containing the operations that are not yet safely persisted. This is done to prevent recoveries from taking too long.
translog.intervalsets the frequency of flush necessity check.
translog.durabilitycan be set as
REQUEST. When set to
ASYNCthe translog is flushed every
translog.sync_intervaland when set to
REQUESTthe flush happens after every operation.
translog.sync_intervalsets how often the translog is
fsyncedto disk. The default value is 5s. The setting of this parameter takes effect only if the
translog.durabilityis set to
Between CrateDB and the disk, there is RAM memory. New records are first buffered in memory before being written to a new segment. Refresh operation makes the in-memory segment available for search. Furthermore, a table can be refreshed explicitly to ensure that the latest state of the table is fetched. In CrateDB this is done with the
REFRESH TABLE command. However, the refresh operation doesn’t guarantee durability: to address the persistence issue, CrateDB relies on translog.
If not done explicitly, the table is refreshed with a specified refresh interval. The default value is one second, but the interval can be changed with the table parameter
refresh_interval. If no query accesses the table during the refresh interval, the table becomes idle. When the table is in an idle state it will not be refreshed until the next query arrives which will first refresh the table and then execute. This will also enable the periodic refresh again.
A flush operation triggers a Lucene commit: it writes the segments permanently on the disk and clears the transaction log. Writing segments to disk is expensive and takes place at less frequent intervals than the refresh operation. However, a flush happens automatically based on the translog configurations, as illustrated in the previous section.
To summarize, this article provides an overview of how data is written to CrateDB. The following properties of CrateDB are important to be aware of:
A Lucene index is organized in segments which are first built and kept in memory and later flushed to disk. This can influence the time when data becomes available on disk.
In-memory segments become searchable after the refresh operation.
Documents are immutable and deleted documents are not discarded until a merge takes place.
Segments are occasionally merged and deleted documents are discarded.
translog.durabilityis set to
REQUEST, the translog is flushed after every operation. Otherwise, if set to
ASYNC, the translog is flushed when certain configurable conditions are met.
CrateDB maintains the transaction log of operations on each shard for recovery in case of a node crash.