Going into production

Running CrateDB in different environments requires different approaches. This document outlines the basics you need to consider when going into production.

Configure bootstrapping

The process of forming a cluster is known as bootstrapping. Consult the how-to guide on CrateDB multi-node setups for an overview of the two different ways to bootstrap a cluster.

If you have been using CrateDB for development on your local machine, there is a good chance you have been using single host auto-bootstrapping.

For improved performance and resiliency, you should run production CrateDB clusters with three or more nodes and one node per host machine. To do this, you must manually configure the bootstrapping process by telling nodes how to discover other nodes and how to elect a master node the first time.

This process is known as manual bootstrapping. See the how-to guide for more information about how to bootstrap a cluster manually.

Switching to a manual bootstrapping configuration is the first step towards going into production.
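
As a rough illustration of manual bootstrapping, the sketch below prints a configuration fragment that lists the master-eligible nodes. The setting names discovery.seed_hosts and cluster.initial_master_nodes do not appear in the text above and are assumed from recent CrateDB versions. The helper print_bootstrap_config and the example hostnames are hypothetical. Check the how-to guide for the settings that apply to your version.

```shell
# Print a manual bootstrapping fragment for the CrateDB configuration file.
# Arguments: the hostnames of the master-eligible nodes.
# Assumption: nodes are reachable on the first transport port (4300).
print_bootstrap_config() {
    echo "discovery.seed_hosts:"
    for host in "$@"; do
        echo "  - ${host}:4300"
    done
    echo "cluster.initial_master_nodes:"
    for host in "$@"; do
        echo "  - ${host}"
    done
}

# Hypothetical hostnames for three master-eligible nodes.
print_bootstrap_config \
    node-01-md.acme-prod.internal.example.com \
    node-03-m.acme-prod.internal.example.com \
    node-05-m.acme-prod.internal.example.com
```

Because the fragment names the same hosts for every node, it should be possible to reuse the output unchanged across the cluster.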

Naming

Configure a logical cluster name

The cluster.name setting allows you to override the default cluster name of crate. You should use this setting to give a logical name to your cluster.

For example, add this to your configuration file:

cluster.name: acme-prod

The acme-prod name suggests that this cluster is the production cluster for the Acme organization. If Acme has a cluster running in a staging environment, you might want to name it acme-staging. This way, you can differentiate your clusters by name (visible in the Admin UI).

Tip

A node will refuse to join a cluster if the respective cluster names do not match.

Bind nodes to logical hostnames

By default, CrateDB binds to the loopback address (i.e., localhost). It listens on a port in the range 4200-4299 for HTTP traffic and on a port in the range 4300-4399 for node-to-node communication. Because CrateDB uses a port range, if one port is busy, it automatically tries the next one.

When using multiple hosts, nodes must bind to a non-loopback address.

Caution

Never expose an unprotected CrateDB node to the public internet.

You can bind to a non-loopback address with the network.host setting in your configuration file, like so:

network.host: node-01-md.acme-prod.internal.example.com

You must configure the node-01-md.acme-prod.internal.example.com hostname using DNS. You must then set network.host to match the DNS name.

You should use the hostname to describe each node logically. To this end, the example hostname (above) has four components:

  • example.com – The root domain name
  • internal – The internal private network
  • acme-prod – The cluster name
  • node-01-md – The node label
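
The four components can be assembled mechanically. The following sketch builds the example hostname and prints the two matching configuration lines. The helper name node_hostname is hypothetical and only illustrates the naming scheme.

```shell
# Join the four hostname components described above.
# $1: node label, $2: cluster name, $3: network label, $4: root domain
node_hostname() {
    printf '%s.%s.%s.%s\n' "$1" "$2" "$3" "$4"
}

cluster="acme-prod"
host="$(node_hostname node-01-md "$cluster" internal example.com)"

# Print the settings that would go into the configuration file.
printf 'cluster.name: %s\n' "$cluster"
printf 'network.host: %s\n' "$host"
```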

When CrateDB is bound to a non-loopback address, CrateDB will enforce the bootstrap checks. These checks may require changes to your operating system configuration.

See also

Host settings
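
Before binding to a non-loopback address, it can help to compare the current operating system limits with the values the bootstrap checks expect. The sketch below is only a starting point: the helper check_min is hypothetical, and both thresholds (65536 open files, 262144 memory map areas) are assumptions that you should replace with the values from the bootstrap checks documentation for your version.

```shell
# Compare a current OS limit against a required minimum.
# $1: label, $2: current value, $3: required minimum
check_min() {
    if [ "$2" = "unlimited" ] || [ "$2" -ge "$3" ]; then
        echo "ok: $1 = $2"
    else
        echo "too low: $1 = $2 (need >= $3)"
        return 1
    fi
}

# Assumed threshold for the open file descriptor limit of this shell.
check_min "open files" "$(ulimit -n)" 65536 || true

# Assumed threshold for memory map areas (Linux only).
if [ -r /proc/sys/vm/max_map_count ]; then
    check_min "vm.max_map_count" "$(cat /proc/sys/vm/max_map_count)" 262144 || true
fi
```

Run the check as the user that starts CrateDB, because limits such as the number of open files are set per user and per session.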

Logical node labels

CrateDB supports multiple types of node, determined by the node.master and node.data settings. You can use this information to give a logical DNS label to each of your nodes.

Tip

CrateDB sets node names automatically. If you are happy with automatic node names, there is no need to set node.name and hence you can use the same configuration on every node.

When configuring cluster bootstrapping, you can specify the list of master-eligible nodes using hostnames. This allows you to configure logical hostnames with DNS node labels that differ from the node name set by CrateDB.

If you would prefer your node names to match your DNS node labels, you will have to configure node.name manually on each host.

Multi-purpose nodes

You can configure a master-eligible node that also handles query execution loads like this:

node.master: true
node.data: true

A good DNS label for this node might be node-01-md.

Here, node is used as the base label with a sequence number of 01. Every node in the cluster should have a unique sequence number, independent of the node type. The letters md indicate that this node has node.master and node.data set to true.

If you want your node name to match the DNS label (see above), configure the node.name setting in your configuration file, like so:

node.name: node-01-md

Alternatively, you can configure this setting at startup with a command-line option:

sh$ bin/crate \
        -Cnode.name=node-01-md

Request handling and query execution nodes

You can configure a node that only handles client requests and query execution (i.e., is not master-eligible) like this:

node.master: false
node.data: true

A good DNS label for this node might be node-02-d.

Here, node is used as the base label with a sequence number of 02. Every node in the cluster should have a unique sequence number, independent of the node type. The letter d indicates that this node has node.data set to true.

If you want your node name to match the DNS label (see above), configure the node.name setting in your configuration file, like so:

node.name: node-02-d

Alternatively, you can configure this setting at startup with a command-line option:

sh$ bin/crate \
        -Cnode.name=node-02-d

Cluster management nodes

You can configure a node that handles cluster management (i.e., is master-eligible) but does not handle query execution loads like this:

node.master: true
node.data: false

A good DNS label for this node might be node-03-m.

Here, node is used as the base label with a sequence number of 03. Every node in the cluster should have a unique sequence number, independent of the node type. The letter m indicates that this node has node.master set to true.

If you want your node name to match the DNS label (see above), configure the node.name setting in your configuration file, like so:

node.name: node-03-m

Alternatively, you can configure this setting at startup with a command-line option:

sh$ bin/crate \
        -Cnode.name=node-03-m

Request handling nodes

You can configure a node that handles client requests but does not handle query execution loads or cluster management (i.e., is not master-eligible) like this:

node.master: false
node.data: false

A good DNS label for this node might be node-04.

Here, node is used as the base label with a sequence number of 04. Every node in the cluster should have a unique sequence number, independent of the node type. The absence of any additional letters indicates that node.master and node.data are false.

If you want your node name to match the DNS label (see above), configure the node.name setting in your configuration file, like so:

node.name: node-04

Alternatively, you can configure this setting at startup with a command-line option:

sh$ bin/crate \
        -Cnode.name=node-04
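
The labelling scheme used in the four examples above can be summarized in a small helper. This is a sketch of the convention only: the function node_label is hypothetical and is not part of CrateDB.

```shell
# Derive a DNS node label from the node's role settings.
# $1: sequence number as a plain integer (no leading zero)
# $2: value of node.master (true or false)
# $3: value of node.data (true or false)
node_label() {
    suffix=""
    if [ "$2" = "true" ]; then suffix="${suffix}m"; fi
    if [ "$3" = "true" ]; then suffix="${suffix}d"; fi
    if [ -n "$suffix" ]; then
        printf 'node-%02d-%s\n' "$1" "$suffix"
    else
        printf 'node-%02d\n' "$1"
    fi
}

node_label 1 true true     # multi-purpose node
node_label 2 false true    # request handling and query execution node
node_label 3 true false    # cluster management node
node_label 4 false false   # request handling node
```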

Configure persistent data paths

By default, CrateDB keeps data under the CRATE_HOME directory (which defaults to the installation directory). When you upgrade CrateDB, you will have to switch to a new installation directory.

Instead of migrating data by hand each time, you should move the data directories off to a persistent location. You can do this using the CRATE_HOME environment variable and the path settings in your configuration file.

See also

Path settings

For example, if you are running CrateDB on a Unix-like operating system, the Filesystem Hierarchy Standard (FHS) recommends the /srv directory as the root for site-specific data.

With this in mind, if you are installing CrateDB by hand, a good value for CRATE_HOME on a Unix-like system might be /srv/crate. Make sure to set CRATE_HOME before running bin/crate.

Then, you could configure your data paths like this:

path.conf: /srv/crate/config
path.data: /srv/crate/data
path.logs: /srv/crate/logs
path.repo: /srv/crate/snapshots
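
The directories named in these settings have to exist before CrateDB starts. A minimal sketch for creating them follows. The helper prepare_crate_home is hypothetical, and the sketch does not set the ownership or permissions that your installation may need.

```shell
# Create the directory layout used by the path settings above.
# $1: the directory to use as CRATE_HOME (for example, /srv/crate)
prepare_crate_home() {
    for dir in config data logs snapshots; do
        mkdir -p "$1/$dir" || return 1
    done
}

# Example (commented out because /srv usually needs elevated privileges):
#   prepare_crate_home /srv/crate
#   export CRATE_HOME=/srv/crate
```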

Note

If you have installed CrateDB using a system package for Debian, Ubuntu, or Red Hat, the CRATE_HOME variable (as well as some other data paths) is configured for you by the systemd service file. You can view the crate service file like so:

sh$ systemctl cat crate

System packages use system-level directories instead of the /srv directory, which the FHS reserves for use by the local system administrator.

This setup is fine for production clusters. However, because the data directory holds table data and cluster metadata, you may want to configure path.data to point to a mounted volume, giving you the option to optimize the underlying storage mechanism for performance. For example:

path.data: /srv/crate/data

In this example, you can configure /srv/crate as a mount point.

Tip

You should take care to size your data storage volumes according to your needs. You should also use storage with high IOPS when possible to improve CrateDB performance.

Warning

Docker containers are stateless by design. You should configure all data paths to point to a mounted volume to avoid data loss.

Tune the JVM

Heap

CrateDB is a Java application running on top of a Java Virtual Machine (JVM). The JVM uses a heap for memory allocations. For optimal performance, you must pay special attention to your heap configuration.

By default, CrateDB configures the JVM to write a heap dump to the file or directory specified by CRATE_HEAP_DUMP_PATH when an out-of-memory error occurs. You must make sure there is enough disk space available for heap dumps at this location.
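
A heap dump can be roughly as large as the heap itself, so one way to check the available space is to compare the free space at the dump location with the configured heap size. The sketch below assumes that CRATE_HEAP_DUMP_PATH points to a directory and uses a hypothetical heap size of 4 GiB. The helper names and the fallback to the current directory are illustrative only.

```shell
# Free space, in KiB, on the filesystem that holds directory $1.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if directory $1 has at least $2 GiB of free space.
has_room_for_dump() {
    [ "$(free_kib "$1")" -ge $(( $2 * 1024 * 1024 )) ]
}

heap_gib=4                                # hypothetical heap size in GiB
dump_dir="${CRATE_HEAP_DUMP_PATH:-.}"     # fallback is for demonstration only

if has_room_for_dump "$dump_dir" "$heap_gib"; then
    echo "enough space for a ${heap_gib} GiB heap dump in ${dump_dir}"
else
    echo "not enough space for a ${heap_gib} GiB heap dump in ${dump_dir}"
fi
```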

Garbage collection

CrateDB logs JVM garbage collection times using the built-in garbage collection (GC) logging provided by the JVM. You can configure this process with the GC logging environment variables.

You must ensure that the log directory is on a fast-enough disk and has enough space. When using Docker, use a path on a mounted volume.
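
To estimate how much space the GC logs need, multiply the maximum log file size by the number of rotated files. The sketch below does this and exports the corresponding variables. The variable names CRATE_GC_LOG_DIR, CRATE_GC_LOG_SIZE, and CRATE_GC_LOG_FILES are not named in the text above and are assumed from the CrateDB environment variable reference. The values are examples, not recommendations.

```shell
gc_log_size_mib=64    # example maximum size of one GC log file, in MiB
gc_log_files=16       # example number of rotated GC log files to keep

# Assumed variable names; verify them for your CrateDB version.
export CRATE_GC_LOG_DIR=/srv/crate/logs
export CRATE_GC_LOG_SIZE="${gc_log_size_mib}m"
export CRATE_GC_LOG_FILES="${gc_log_files}"

# Worst-case disk usage of the GC logs with these values.
echo "GC logs may use up to $(( gc_log_size_mib * gc_log_files )) MiB"
```

With these example values, the log directory needs at least 1024 MiB of free space for GC logs alone.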

If garbage collection takes too long, CrateDB will log this. You can adjust the timeout settings to suit your needs. However, the default settings should work in most instances.

If you are running CrateDB on Docker, you should configure the container to send debug logs to STDERR so that the container orchestrator handles the output.