Configuration

Since CrateDB has sensible defaults, there is no configuration needed at all for basic operation.

CrateDB is mainly configured via a configuration file, which is located at config/crate.yml. The vanilla configuration file distributed with the package lists all available settings as comments, along with their default values.

The location of the config file can be specified upon startup like this:

sh$ ./bin/crate -Cpath.conf=/path/to/config/directory

Any setting can be configured either by the config file or via the -C option upon startup.

For example, configuring the cluster name by using command line properties will work this way:

sh$ ./bin/crate -Ccluster.name=cluster

This is exactly the same as setting the cluster name in the config file:

cluster.name = cluster

Settings are applied in the following order, where each later source overrides the earlier one:

  1. internal defaults
  2. options from config file
  3. command-line properties
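As an illustration of this precedence (the cluster names below are made up), suppose crate.yml sets the cluster name and the same key is also passed on the command line; the command-line value wins:

```sh
# crate.yml contains:
#     cluster.name: from-file
# The -C option overrides the config file, so the node joins "from-cli":
sh$ ./bin/crate -Ccluster.name=from-cli
```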

Table Settings

For more information about the table creation syntax, please refer to CREATE TABLE.

number_of_replicas
Default: 1
Runtime: yes

Specifies the number or range of replicas each shard of a table should have for normal operation.
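For example, the setting can be supplied at creation time or changed at runtime with ALTER TABLE; the table name below is hypothetical:

```sql
-- Create a table with a replica range: between 0 and 2 replicas,
-- depending on how many nodes are available in the cluster.
CREATE TABLE my_table (
  id INTEGER PRIMARY KEY
) WITH (number_of_replicas = '0-2');

-- Since the setting is changeable at runtime, it can be adjusted later:
ALTER TABLE my_table SET (number_of_replicas = 2);
```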

refresh_interval
Default: 1000
Runtime: yes

Specifies the refresh interval of a shard in milliseconds.

Blocks

blocks.read_only
Default: false
Runtime: yes

If set to true, the table will be read-only: no write, alter, or drop operations are allowed. It has the same effect as setting blocks.write and blocks.metadata to true.

blocks.read
Default: false
Runtime: yes

If set to true, no read (including export and snapshot) operations are allowed on that table.

blocks.write
Default: false
Runtime: yes

If set to true, no write operations are allowed.

blocks.metadata
Default: false
Runtime: yes

If set to true, no alter or drop operations are allowed. There is one exception: changing a single blocks.* setting is still allowed via ALTER, so that the block itself can be lifted again.
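A short sketch of how these blocks can be toggled at runtime, assuming a hypothetical table my_table:

```sql
-- Make the table read-only (equivalent to blocking writes and metadata):
ALTER TABLE my_table SET ("blocks.read_only" = true);

-- The exception described above: changing a blocks.* setting itself is
-- still allowed via ALTER, so the block can be lifted again:
ALTER TABLE my_table RESET ("blocks.read_only");
```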

Translog

Note

The translog provides a persistent log of all operations that have not yet been transferred (flushed) to disk. Whenever a record is inserted into a table (or updated), that change is appended both to an in-memory buffer and to the translog. When the translog reaches a certain size (see translog.flush_threshold_size), it is fsynced, flushed to disk, and cleared. In case of a hardware failure, any data written since the last translog flush will be discarded.

translog.flush_threshold_size
Default: 512mb
Runtime: yes

Sets the size the transaction log may reach before it is flushed.

translog.sync_interval
Default: 5s
Runtime: yes

The translog.sync_interval setting controls how often the translog is fsynced to disk. The value cannot be set below 100ms. When setting this interval, keep in mind that changes logged during this interval and not yet synced to disk may be lost in case of a failure. This setting only takes effect if translog.durability is set to ASYNC.

translog.durability
Default: REQUEST
Runtime: yes

If set to ASYNC the translog gets flushed to disk in the background every translog.sync_interval. If set to REQUEST the flush happens after every operation.
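For example, a write-heavy table could trade some durability for throughput by switching to asynchronous fsyncs (the table name and interval are illustrative):

```sql
-- Fsync the translog in the background every 10s instead of after
-- every operation. Changes within the interval may be lost on failure.
ALTER TABLE my_table SET (
  "translog.durability" = 'ASYNC',
  "translog.sync_interval" = '10s'
);
```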

Allocation

routing.allocation.enable
Default: all
Runtime: yes
Allowed Values: all | primaries | new_primaries | none

Controls shard allocation for a specific table.

routing.allocation.total_shards_per_node
Default: -1 (unbounded)
Runtime: yes

Controls the total number of shards (replicas and primaries) allowed to be allocated on a single node.

Recovery

recovery.initial_shards
Default: quorum
Runtime: yes

When using the local gateway, a particular shard is recovered only if a quorum of its copies can be allocated in the cluster. This setting controls that behaviour.

Warmer

warmer.enabled
Default: true
Runtime: yes

Enables or disables table warming. Table warming allows registered queries to be run in order to warm up the table before it becomes available.

Unassigned

unassigned.node_left.delayed_timeout
Default: 1m
Runtime: yes

Delays the allocation of replica shards which have become unassigned because a node has left. The default of 1m gives a node time to restart completely (which can take a while when the node holds many shards). Setting the timeout to 0 starts allocation immediately. This setting can be changed at runtime in order to increase or decrease the delay as needed.

Column Policy

column_policy
Default: dynamic
Runtime: yes

Specifies the column policy of the table.

Mapping

mapping.total_fields.limit
Default: 1000
Runtime: yes

Sets the maximum number of fields in the Lucene index mapping. This includes both the user-facing mapping (columns) and internal fields.

Node Specific Settings

cluster.name
Default: crate
Runtime: no

The name of the CrateDB cluster the node should join.

node.name
Runtime: no

The name of the node. If no name is configured a random one will be generated.

Note

Node names must be unique in a CrateDB cluster.

node.max_local_storage_nodes
Default: 1
Runtime: no

Defines how many nodes are allowed to be started on the same machine using the same data path configured via path.data.

Node Types

CrateDB supports different kinds of nodes. The following settings can be used to differentiate nodes upon startup:

node.master
Default: true
Runtime: no

Whether or not this node is able to get elected as master node in the cluster.

node.data
Default: true
Runtime: no

Whether or not this node will store data.

node.client
Default: false
Runtime: no

Shorthand for: node.data=false and node.master=false.

node.local
Default: false
Runtime: no

If set to true, the node will use a JVM-local transport and discovery. Used primarily for testing purposes.

Examples

A node by default is eligible as master and contains data.

Nodes that only contain data but cannot become master will mainly execute and respond to queries:

node:
  data: true
  master: false

Master-only nodes that do not contain data but are eligible as master can be used to separate cluster-management load from query-execution load:

node:
  data: false
  master: true

Nodes that do not contain data and are not eligible as master are called client-nodes. They can be used to separate request handling loads:

node:
  client: true

Read-only node

node.sql.read_only
Default: false
Runtime: no

If set to true, the node will only allow SQL statements that result in read operations.

Hosts

network.host
Default: _local_
Runtime: no

The IP address CrateDB will bind itself to. This setting sets both the network.bind_host and network.publish_host values.

network.bind_host
Default: _local_
Runtime: no

This setting determines the address to which CrateDB should bind itself.

network.publish_host
Default: _local_
Runtime: no

This setting is used by a CrateDB node to publish its own address to the rest of the cluster.

Note

Apart from IPv4 and IPv6 addresses there are some special values that can be used for all above settings:

_local_ Any loopback addresses on the system, for example 127.0.0.1.
_site_ Any site-local addresses on the system, for example 192.168.0.1.
_global_ Any globally-scoped addresses on the system, for example 8.8.8.8.
_[networkInterface]_ Addresses of a network interface, for example _en0_.
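For example, a crate.yml sketch that binds to any site-local address but publishes the address of a specific interface (the interface name is an example):

```yaml
# Bind to all site-local addresses, e.g. 192.168.0.1
network.bind_host: _site_
# Publish the address of the en0 interface to the rest of the cluster
network.publish_host: _en0_
```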

Ports

http.port
Runtime: no

This defines the TCP port range to which the CrateDB HTTP service will be bound. It defaults to 4200-4300. The first free port in this range is always used. If this is set to an integer value, it is treated as a single explicit port.

The HTTP protocol is used for the REST endpoint which is used by all clients except the Java client.

http.publish_port
Runtime: no

The port HTTP clients should use to communicate with the node. Defining this setting is necessary if the bound HTTP port (http.port) of the node is not directly reachable from the outside, e.g. when running behind a firewall or inside a Docker container.
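For example, a container whose internal HTTP port is mapped to a different port on the host could publish the mapped port (the port numbers below are illustrative):

```yaml
# Inside the container, the HTTP service binds to 4200...
http.port: 4200
# ...but the Docker host maps it to 44200, so that is the port
# clients and other machines should use to reach this node.
http.publish_port: 44200
```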

transport.tcp.port
Runtime: no

This defines the TCP port range to which the CrateDB transport service will be bound. It defaults to 4300-4400. The first free port in this range is always used. If this is set to an integer value, it is treated as a single explicit port.

The transport protocol is used for internal node-to-node communication.

transport.publish_port
Runtime: no

The port that the node publishes to the cluster for its own discovery. Defining this setting is necessary when the bound transport port (transport.tcp.port) of the node is not directly reachable from the outside, e.g. when running behind a firewall or inside a Docker container.

psql.port
Runtime: no

This defines the TCP port range to which the CrateDB PostgreSQL service will be bound. It defaults to 5432-5532. The first free port in this range is always used. If this is set to an integer value, it is treated as a single explicit port.

Node Attributes

It is possible to apply generic attributes to a node, with configuration settings like node.key: value. These attributes can be used for customized shard allocation.
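For example, a generic attribute could be set in crate.yml like this (the attribute name and value are purely illustrative):

```yaml
# A custom node attribute, usable later for shard allocation awareness
node.rack_id: rack_one
```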

See also Awareness Settings.

Paths

path.conf
Runtime: no

Filesystem path to the directory containing the configuration files crate.yml and log4j2.properties.

path.data
Runtime: no

Filesystem path to the directory where this CrateDB node stores its data (table data and cluster metadata).

Multiple paths can be set using a comma-separated list, and each of these paths will hold full shards (instead of striping data across them). In case CrateDB finds striped shards at the provided locations (from CrateDB < 0.55.0), these shards will be migrated automatically on startup.

path.logs
Runtime: no

Filesystem path to a directory where log files should be stored. Can be used as a variable inside log4j2.properties.

For example:

appender:
  file:
    file: ${path.logs}/${cluster.name}.log
path.repo
Runtime: no

A list of filesystem or UNC paths where repositories of type fs may be stored.

Without this setting a CrateDB user could write snapshot files to any directory that is writable by the CrateDB process. To safeguard against this security issue, the possible paths have to be whitelisted here.

See also location setting of repository type fs.

Plugins

plugin.mandatory
Runtime: no

A list of plugins that are required for a node to start up. If any plugin listed here is missing, the CrateDB node will fail to start.

Memory

bootstrap.memory_lock
Runtime: no
Default: false

CrateDB performs poorly when the JVM starts swapping: you should ensure that it never swaps. If set to true, CrateDB will use the mlockall system call on startup to ensure that the memory pages of the CrateDB process are locked into RAM.

Garbage Collection

CrateDB logs a message if JVM garbage collection of the different memory pools takes too long. The following settings can be used to adjust these timeouts:

monitor.jvm.gc.collector.young.warn
Default: 1000ms
Runtime: no

CrateDB will log a warning message if it takes more than the configured timespan to collect the Eden Space (heap).

monitor.jvm.gc.collector.young.info
Default: 700ms
Runtime: no

CrateDB will log an info message if it takes more than the configured timespan to collect the Eden Space (heap).

monitor.jvm.gc.collector.young.debug
Default: 400ms
Runtime: no

CrateDB will log a debug message if it takes more than the configured timespan to collect the Eden Space (heap).

monitor.jvm.gc.collector.old.warn
Default: 10000ms
Runtime: no

CrateDB will log a warning message if it takes more than the configured timespan to collect the Old Gen / Tenured Gen (heap).

monitor.jvm.gc.collector.old.info
Default: 5000ms
Runtime: no

CrateDB will log an info message if it takes more than the configured timespan to collect the Old Gen / Tenured Gen (heap).

monitor.jvm.gc.collector.old.debug
Default: 2000ms
Runtime: no

CrateDB will log a debug message if it takes more than the configured timespan to collect the Old Gen / Tenured Gen (heap).

Elasticsearch HTTP REST API

es.api.enabled
Default: false
Runtime: no

Enables or disables the Elasticsearch HTTP REST API.

Warning

Manipulating your data via the Elasticsearch API instead of SQL might result in inconsistent data. You have been warned!

Cross-Origin Resource Sharing (CORS)

Many browsers enforce the same-origin policy, which requires web applications to explicitly allow requests across origins. The cross-origin resource sharing (CORS) settings in CrateDB allow these to be configured.

http.cors.enabled
Default: false
Runtime: no

Enable or disable cross-origin resource sharing.

http.cors.allow-origin
Default: <empty>
Runtime: no

Defines the allowed origins of a request. * allows any origin (which can be a substantial security risk). By prepending a /, the string will be treated as a regular expression; for example, /https?:\/\/crate.io/ will allow requests from http://crate.io and https://crate.io. By default this setting disallows any origin.
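A minimal crate.yml sketch of these settings, using the regular expression from the example above:

```yaml
http.cors.enabled: true
# Single quotes keep the backslashes literal in YAML
http.cors.allow-origin: '/https?:\/\/crate.io/'
```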

http.cors.max-age
Default: 1728000 (20 days)
Runtime: no

Max cache age of a preflight request in seconds.

http.cors.allow-methods
Default: OPTIONS, HEAD, GET, POST, PUT, DELETE
Runtime: no

Allowed HTTP methods.

http.cors.allow-headers
Default: X-Requested-With, Content-Type, Content-Length
Runtime: no

Allowed HTTP headers.

http.cors.allow-credentials
Default: false
Runtime: no

Add the Access-Control-Allow-Credentials header to responses.

Blobs

blobs.path
Runtime: no

Path to a filesystem directory where to store blob data allocated for this node.

By default blobs will be stored under the same path as normal data. A relative path value is interpreted as relative to CRATE_HOME.

Repositories

Repositories are used to backup a CrateDB cluster.

repositories.url.allowed_urls
Runtime: no

This setting only applies to repositories of type url.

With this setting, a list of URLs can be specified which are allowed to be used when a repository of type url is created.

Wildcards are supported in the host, path, query and fragment parts.

This setting is a security measure to prevent access to arbitrary resources.

In addition, the supported protocols can be restricted using the repositories.url.supported_protocols setting.
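A crate.yml sketch of both settings (the hosts and paths are made up; note the wildcard usage):

```yaml
# Only these URL patterns may be used for repositories of type url
repositories.url.allowed_urls:
  - "http://backups.example.com/crate/*"
  - "file:///mnt/backups/*"
# Restrict the allowed protocols as well
repositories.url.supported_protocols: ["http", "file"]
```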

repositories.url.supported_protocols
Default: http, https, ftp, file and jar
Runtime: no

A list of protocols that are supported by repositories of type url.

The jar protocol is used to access the contents of jar files. For more info, see the java JarURLConnection documentation.

See also the path.repo Setting.

Cluster Wide Settings

All currently applied cluster settings can be read by querying the sys.cluster.settings column. Most cluster settings can be changed at runtime using the SET/RESET statement. This is documented for each setting.
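For example, a runtime cluster setting can be changed and reverted with SET and RESET (the settings and values below are taken from this document):

```sql
-- Persist a value so it survives a full cluster restart:
SET GLOBAL PERSISTENT stats.enabled = true;

-- Or change it only until the next full cluster restart:
SET GLOBAL TRANSIENT stats.jobs_log_size = 20000;

-- Revert to the default:
RESET GLOBAL stats.jobs_log_size;
```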

Non-Runtime Cluster Wide Settings

Cluster wide settings which cannot be changed at runtime need to be specified in the configuration of each node in the cluster.

Note

Cluster settings specified via node configurations are required to be exactly the same on every node in the cluster for proper operation of the cluster.

Collecting Stats

stats.enabled
Default: false
Runtime: yes

A boolean indicating whether or not to collect statistical information about the cluster.

Note

Enabling the collection of statistical information incurs a performance penalty, as every operation across the cluster will cause data to be inserted into the job logs table.

stats.jobs_log_size
Default: 10000
Runtime: yes

The maximum number of job records to be kept in the sys.jobs_log table on each node.

A job record corresponds to a single SQL statement to be executed on the cluster. These records are used for performance analytics. A larger job log produces more comprehensive stats, but uses more RAM.

Older job records are deleted as newer records are added, once the limit is reached.

Setting this value to 0 disables collecting job information.
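With stats collection enabled, the recorded jobs can be inspected with a normal query; the column selection below is a sketch and may vary by CrateDB version:

```sql
-- Show the ten most recently started statements on the cluster
SELECT stmt, started, ended
FROM sys.jobs_log
ORDER BY started DESC
LIMIT 10;
```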

stats.jobs_log_expiration
Default: 0s (disabled)
Runtime: yes

The job record expiry time in seconds.

Job records in the sys.jobs_log table are periodically cleared if they are older than the expiry time. This setting overrides stats.jobs_log_size.

If the value is set to 0, time based log entry eviction is disabled.

Note

If both the stats.jobs_log_size and stats.jobs_log_expiration settings are disabled, jobs will not be recorded.

stats.operations_log_size
Default: 10000
Runtime: yes

The maximum number of operations records to be kept in the sys.operations_log table on each node.

A job consists of one or more individual operations. Operations records are used for performance analytics. A larger operations log produces more comprehensive stats, but uses more RAM.

Older operations records are deleted as newer records are added, once the limit is reached.

Setting this value to 0 disables collecting operations information.

stats.operations_log_expiration
Default: 0s (disabled)
Runtime: yes

Entries in sys.operations_log are cleared by a periodic job when they are older than the specified expiry time. This setting overrides stats.operations_log_size. If the value is set to 0, time-based log entry eviction is disabled.

Note

If both the stats.operations_log_size and stats.operations_log_expiration settings are disabled, no operations information will be collected.

stats.service.interval
Default: 1h
Runtime: yes

Defines the interval at which the table statistics used to produce optimal query execution plans are refreshed.

This field expects a time value either as a long or double or alternatively as a string literal with a time suffix (ms, s, m, h, d, w).

If the value provided is 0 then the refresh is disabled.

Note

Using a very small value can cause a high load on the cluster.

The following settings control the behaviour of the stats circuit breakers. There are two breakers in place, one for the jobs log and one for the operations log; the limit can be set for each of them.

stats.breaker.log.jobs.limit
Default: 5%
Runtime: yes

The maximum memory that can be used from CRATE_HEAP_SIZE for the sys.jobs_log table on each node.

When this memory limit is reached the job log circuit breaker logs an error message and clears the sys.jobs_log table completely.

stats.breaker.log.operations.limit
Default: 5%
Runtime: yes

The maximum memory that can be used from CRATE_HEAP_SIZE for the sys.operations_log table on each node.

When this memory limit is reached the operations log circuit breaker logs an error message and clears the sys.operations_log table completely.

Usage Data Collector

The settings of the Usage Data Collector are read-only and cannot be set at runtime. Please refer to Usage Data Collector for further information about its usage.

udc.enabled
Default: true
Runtime: no

true: Enables the Usage Data Collector.

false: Disables the Usage Data Collector.

udc.initial_delay
Default: 10m
Runtime: no

The delay for first ping after start-up.

This field expects a time value either as a long or double or alternatively as a string literal with a time suffix (ms, s, m, h, d, w).

udc.interval
Default: 24h
Runtime: no

The interval at which a UDC ping is sent.

This field expects a time value either as a long or double or alternatively as a string literal with a time suffix (ms, s, m, h, d, w).

udc.url
Default: https://udc.crate.io
Runtime: no

The URL the ping is sent to.

Graceful Stop

By default, when the CrateDB process stops, it simply shuts down, possibly making some shards unavailable. This leads to a red cluster state and causes queries that require the now-unavailable shards to fail. To shut down a CrateDB node safely, the graceful stop procedure can be used.

The following cluster settings can be used to change the shutdown behaviour of nodes of the cluster:

cluster.graceful_stop.min_availability
Default: primaries
Runtime: yes
Allowed Values: none | primaries | full

none: No minimum data availability is required. The node may shut down even if records are missing after shutdown.

primaries: At least all primary shards need to be available after the node has shut down. Replicas may be missing.

full: All records and all replicas need to be available after the node has shut down. Data availability is full.

Note

This option is ignored if there is only 1 node in a cluster!

cluster.graceful_stop.reallocate
Default: true
Runtime: yes

true: The graceful stop command allows shards to be reallocated before shutting down the node in order to ensure minimum data availability set with min_availability.

false: The graceful stop command will fail if the cluster would need to reallocate shards in order to ensure the minimum data availability set with min_availability.

Note

Make sure you have enough nodes and enough disk space for the reallocation.

cluster.graceful_stop.timeout
Default: 2h
Runtime: yes

Defines the maximum waiting time for the reallocation process to finish. The force setting defines the behaviour when the shutdown process runs into this timeout.

The timeout expects a time value either as a long or double or alternatively as a string literal with a time suffix (ms, s, m, h, d, w).

cluster.graceful_stop.force
Default: false
Runtime: yes

Defines whether graceful stop should force stopping of the node if it runs into the timeout which is specified with the cluster.graceful_stop.timeout setting.

Bulk Operations

SQL DML statements involving a large number of rows, like COPY FROM, INSERT or UPDATE, can consume a significant amount of time and resources. The following settings change the behaviour of those queries.

bulk.request_timeout
Default: 1m
Runtime: yes

Defines the timeout of internal shard-based requests involved in the execution of SQL DML statements over a large number of rows.

Discovery

discovery.zen.minimum_master_nodes
Default: 1
Runtime: yes

Set this to ensure that a node sees N other master-eligible nodes in order to be considered operational within the cluster. It is recommended to set this to a value higher than 1 when running more than 2 nodes in the cluster.

discovery.zen.ping_timeout
Default: 3s
Runtime: yes

Set the time to wait for ping responses from other nodes when discovering. Set this option to a higher value on a slow or congested network to minimize discovery failures.

discovery.zen.publish_timeout
Default: 30s
Runtime: yes

The time a node waits for responses from other nodes to a published cluster state.

Note

Multicast used to be an option for node discovery, but was deprecated in CrateDB 1.0.3 and removed in CrateDB 1.1.

Unicast Host Discovery

CrateDB has built-in support for several different mechanisms of node discovery. The simplest mechanism is to specify a list of hosts in the configuration file.

discovery.zen.ping.unicast.hosts
Default: not set
Runtime: no

A list of hosts (host or host:port entries) to use for unicast discovery.

Besides a static host list, there are further discovery types: via DNS, via the EC2 API, and via the Azure API.

When a node starts up with one of these discovery types enabled, it performs a lookup using the settings for the specified mechanism listed below. The hosts and ports retrieved from the mechanism will be used to generate a list of unicast hosts for node discovery.

The same lookup is also performed by all nodes in a cluster whenever the master is re-elected (see Cluster Meta Data).

discovery.type
Default: not set
Runtime: no
Allowed Values: srv, ec2, azure

See also: Discovery.

Discovery via DNS

CrateDB has built-in support for discovery via DNS. To enable DNS discovery, the discovery.type setting needs to be set to srv.

The order of the unicast hosts is defined by the priority, weight and name of each host defined in the SRV record. For example:

_crate._srv.example.com. 3600 IN SRV 2 20 4300 crate1.example.com.
_crate._srv.example.com. 3600 IN SRV 1 10 4300 crate2.example.com.
_crate._srv.example.com. 3600 IN SRV 2 10 4300 crate3.example.com.

would result in a list of discovery nodes ordered like:

crate2.example.com:4300, crate3.example.com:4300, crate1.example.com:4300
discovery.srv.query
Runtime: no

The DNS query that is used to look up SRV records, usually in the format _service._protocol.fqdn. If not set, the service discovery will not be able to look up any SRV records.

discovery.srv.resolver
Runtime: no

The hostname or IP of the DNS server used to resolve DNS records. If this is not set, or the specified hostname/IP is not resolvable, the default (system) resolver is used. Optionally a custom port can be specified using the format hostname:port.

Discovery on Amazon EC2

CrateDB has built-in support for discovery via the EC2 API. To enable EC2 discovery the discovery.type setting needs to be set to ec2.

cloud.aws.access_key
Runtime: no

The access key id to identify the API calls.

cloud.aws.secret_key
Runtime: no

The secret key to identify the API calls.

Note that the AWS credentials can also be provided via the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY, or via the system properties aws.accessKeyId and aws.secretKey.

Following settings control the discovery:

discovery.ec2.groups
Runtime: no

A list of security groups; either by id or name. Only instances with the given group will be used for unicast host discovery.

discovery.ec2.any_group
Runtime: no
Default: true

Defines whether all (false) or just any (true) security group must be present for the instance to be used for discovery.

discovery.ec2.host_type
Runtime: no
Default: private_ip
Allowed Values: private_ip, public_ip, private_dns, public_dns

Defines which host type is used to communicate with other instances.

discovery.ec2.availability_zones
Runtime: no

A list of availability zones. Only instances within the given availability zone will be used for unicast host discovery.

discovery.ec2.tag.<name>
Runtime: no

EC2 instances for discovery can also be filtered by tags using the discovery.ec2.tag. prefix plus the tag name. E.g. to filter instances that have the environment tag with the value dev, the setting would look like: discovery.ec2.tag.environment: dev.

cloud.aws.ec2.endpoint
Runtime: no

If you have your own compatible implementation of the EC2 API service, you can set the endpoint that should be used.

Discovery on Microsoft Azure

CrateDB has built-in support for discovery via the Azure Virtual Machine API. To enable Azure discovery set the discovery.type setting to azure.

cloud.azure.management.resourcegroup.name
Runtime: no

The name of the resource group the CrateDB cluster is running on. All nodes need to be started within the same resource group.

cloud.azure.management.subscription.id
Runtime: no

The subscription ID of your Azure account. You can find the ID on the Azure Portal.

cloud.azure.management.tenant.id
Runtime: no

The tenant ID of the Active Directory application.

cloud.azure.management.app.id
Runtime: no

The application ID of the Active Directory application.

cloud.azure.management.app.secret
Runtime: no

The password of the Active Directory application.

discovery.azure.method
Runtime: no
Default: vnet
Allowed Values: vnet | subnet

Defines the scope of the discovery. vnet will discover all VMs within the same virtual network (default); subnet will discover all VMs within the same subnet as the CrateDB instance.

Routing Allocation

cluster.routing.allocation.enable
Default: all
Runtime: yes
Allowed Values: all | none | primaries | new_primaries

all allows all shard allocations, the cluster can allocate all kinds of shards.

none allows no shard allocations at all. No shard will be moved or created.

primaries only primaries can be moved or created. This includes existing primary shards.

new_primaries allows allocation of new primary shards only. This means that, for example, a newly added node will not allocate any replicas. However, it is still possible to allocate new primary shards for new indices. Whenever you want to perform a zero-downtime upgrade of your cluster, set this value before gracefully stopping the first node and reset it to all after starting the last updated node.
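The zero-downtime upgrade step described above can be sketched as:

```sql
-- Before stopping the first node: only allow new primaries.
-- TRANSIENT, so the value does not survive a full cluster restart.
SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';

-- ... gracefully stop, upgrade, and restart each node in turn ...

-- After the last upgraded node is back: allow all allocations again.
SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'all';
```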

Note

This allocation setting has no effect on the recovery of primary shards! Even when cluster.routing.allocation.enable is set to none, nodes will recover their unassigned local primary shards immediately after a restart, provided the recovery.initial_shards setting is satisfied.

cluster.routing.allocation.allow_rebalance
Default: indices_all_active
Runtime: yes
Allowed Values: always | indices_primary_active | indices_all_active

Controls when rebalancing happens, based on the total state of all the index shards in the cluster. Defaults to indices_all_active to reduce chatter during initial recovery.

cluster.routing.allocation.cluster_concurrent_rebalance
Default: 2
Runtime: yes

Define how many concurrent rebalancing tasks are allowed cluster wide.

cluster.routing.allocation.node_initial_primaries_recoveries
Default: 4
Runtime: yes

Defines the number of initial primary shard recoveries that are allowed per node. Since the local gateway is used most of the time, these recoveries are fast, and more of them can be handled per node without creating much load.

cluster.routing.allocation.node_concurrent_recoveries
Default: 2
Runtime: yes

Defines how many concurrent recoveries are allowed to happen on a node.

Awareness

Cluster allocation awareness allows you to configure shard and replica allocation across generic attributes associated with nodes.

cluster.routing.allocation.awareness.attributes
Runtime: no

Defines node attributes which will be used for awareness-based allocation of a shard and its replicas. For example, let's say we have defined an attribute rack_id and we start 2 nodes with node.rack_id set to rack_one, then deploy a single table with 5 shards and 1 replica. The table will be fully deployed on the current nodes (5 shards and 1 replica each, 10 shards in total).

Now, if we start two more nodes with node.rack_id set to rack_two, shards will relocate to even out the number of shards across the nodes, but a shard and its replica will not be allocated to nodes with the same rack_id value.

The awareness attributes can hold several values.

cluster.routing.allocation.awareness.force.*.values
Runtime: no

Attributes for which shard allocation will be forced. * is a placeholder for an awareness attribute, which can be defined using the cluster.routing.allocation.awareness.attributes setting. Let's say we configured an awareness attribute zone with the values zone1 and zone2 here, started 2 nodes with node.zone set to zone1, and created a table with 5 shards and 1 replica. The table will be created, but only the 5 primary shards will be allocated (with no replicas). Only when we start more nodes with node.zone set to zone2 will the replicas be allocated.
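A crate.yml sketch of the forced zone awareness described above (the attribute and values follow the example in the text):

```yaml
# On every node: make allocation aware of the "zone" attribute
# and force allocation across both zones
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2

# On the nodes in the first zone:
node.zone: zone1
```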

Balanced Shards

All these values are relative to one another. The first three are used to compose three separate weighting functions into one. The cluster is balanced when no allowed action can bring the weights of the nodes closer together by more than the fourth setting. Actions might not be allowed, for instance, due to forced awareness or allocation filtering.

cluster.routing.allocation.balance.shard
Default: 0.45f
Runtime: yes

Defines the weight factor for shards allocated on a node (float). Raising this raises the tendency to equalize the number of shards across all nodes in the cluster.

cluster.routing.allocation.balance.index
Default: 0.55f
Runtime: yes

Defines a weight factor for the number of shards per index allocated on a specific node (float). Increasing this value raises the tendency to equalize the number of shards per index across all nodes in the cluster.

cluster.routing.allocation.balance.threshold
Default: 1.0f
Runtime: yes

The minimal optimization value of operations that should be performed (non-negative float). Increasing this value will cause the cluster to be less aggressive about optimising the shard balance.

Cluster-Wide Allocation Filtering

Allows control over the allocation of all shards based on include/exclude filters. For example, this could be used to allocate all new shards on nodes with specific IP addresses or custom attributes.

cluster.routing.allocation.include.*
Runtime: no

Place new shards only on nodes where one of the specified values matches the attribute. E.g.: cluster.routing.allocation.include.zone: "zone1,zone2"

cluster.routing.allocation.exclude.*
Runtime: no

Place new shards only on nodes where none of the specified values matches the attribute. E.g.: cluster.routing.allocation.exclude.zone: "zone1"

cluster.routing.allocation.require.*
Runtime: no

Used to specify a number of rules, which all MUST match for a node in order to allocate a shard on it. This is in contrast to include which will include a node if ANY rule matches.
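As a sketch, the three filter types could be combined in crate.yml like this (the attribute names zone and storage and all values are illustrative):

```yaml
# Allocate shards only on nodes whose zone attribute is zone1 or zone2 ...
cluster.routing.allocation.include.zone: zone1,zone2
# ... but never on nodes whose zone attribute is zone3 ...
cluster.routing.allocation.exclude.zone: zone3
# ... and only on nodes that carry the custom attribute storage: ssd
cluster.routing.allocation.require.storage: ssd
```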

Disk-based Shard Allocation

cluster.routing.allocation.disk.threshold_enabled
Default: true
Runtime: yes

Prevents shard allocation on nodes depending on their disk usage.

cluster.routing.allocation.disk.watermark.low
Default: 85%
Runtime: yes

Defines the lower disk threshold limit for shard allocations. New shards will not be allocated on nodes with disk usage greater than this value. It can also be set to an absolute byte value (e.g. 500mb) to prevent the cluster from allocating new shards on nodes with less free disk space than this value.

cluster.routing.allocation.disk.watermark.high
Default: 90%
Runtime: yes

Defines the higher disk threshold limit for shard allocations. The cluster will attempt to relocate existing shards to another node if the disk usage on a node rises above this value. It can also be set to an absolute byte value (e.g. 500mb) to relocate shards away from nodes with less free disk space than this value.

By default, the cluster will retrieve information about the disk usage of the nodes every 30 seconds. This can also be changed by setting the cluster.info.update.interval setting.
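Because both watermark settings are changeable at runtime, they can be adjusted cluster-wide with a SET statement; the values below are illustrative:

```sql
SET GLOBAL TRANSIENT "cluster.routing.allocation.disk.watermark.low" = '80%';
SET GLOBAL TRANSIENT "cluster.routing.allocation.disk.watermark.high" = '85%';
```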

Recovery

indices.recovery.max_bytes_per_sec
Default: 40mb
Runtime: yes

Specifies the maximum number of bytes that can be transferred per second during shard recovery. Limiting can be disabled by setting it to 0. This setting allows you to control the network usage of the recovery process. Higher values may result in higher network utilization, but also a faster recovery process.

indices.recovery.retry_delay_state_sync
Default: 500ms
Runtime: yes

Defines the time to wait after an issue caused by cluster state syncing before retrying to recover.

indices.recovery.retry_delay_network
Default: 5s
Runtime: yes

Defines the time to wait after an issue caused by the network before retrying to recover.

indices.recovery.internal_action_timeout
Default: 15m
Runtime: yes

Defines the timeout for internal requests made as part of the recovery.

indices.recovery.internal_action_long_timeout
Default: 30m
Runtime: yes

Defines the timeout for internal requests made as part of the recovery that are expected to take a long time. Defaults to twice internal_action_timeout.

indices.recovery.recovery_activity_timeout
Default: 30m
Runtime: yes

Recoveries that don't show any activity for more than this interval will fail. Defaults to internal_action_long_timeout.
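Since these recovery settings are changeable at runtime, the bandwidth limit can, for example, be raised temporarily on a cluster with spare network capacity (the value is illustrative):

```sql
SET GLOBAL TRANSIENT "indices.recovery.max_bytes_per_sec" = '80mb';
```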

Store Level Throttling

indices.store.throttle.type
Default: none
Runtime: yes
Allowed Values: all | merge | none

Allows throttling of merge (or all) processes of the store module.

indices.store.throttle.max_bytes_per_sec
Default: 0b
Runtime: yes

If throttling is enabled by indices.store.throttle.type, this setting specifies the maximum bytes per second a store module process can operate with.
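A minimal crate.yml sketch that throttles merge operations to an illustrative 50 MB per second:

```yaml
# Throttle only merge processes of the store module
indices.store.throttle.type: merge
indices.store.throttle.max_bytes_per_sec: 50mb
```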

Query Circuit Breaker

The Query circuit breaker will keep track of the used memory during the execution of a query. If a query consumes too much memory or if the cluster is already near its memory limit it will terminate the query to ensure the cluster keeps working.

indices.breaker.query.limit
Default: 60%
Runtime: yes

Specifies the limit for the query breaker. Provided values can either be absolute values (interpreted as a number of bytes), byte sizes (e.g. 1mb) or a percentage of the heap size (e.g. 12%). A value of -1 disables breaking the circuit while still accounting for memory usage.
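As the limit is changeable at runtime, it can for instance be lowered cluster-wide so that the breaker trips earlier (the value is illustrative):

```sql
SET GLOBAL TRANSIENT "indices.breaker.query.limit" = '50%';
```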

indices.breaker.query.overhead
Default: 1.09
Runtime: no

A constant that all data estimations are multiplied with to determine a final estimation.

Field Data Circuit Breaker

The field data circuit breaker allows estimation of needed heap memory required for loading field data into memory. If a certain limit is reached an exception is raised.

indices.breaker.fielddata.limit
Default: 60%
Runtime: yes

Specifies the JVM heap limit for the fielddata breaker.

indices.breaker.fielddata.overhead
Default: 1.03
Runtime: yes

A constant that all field data estimations are multiplied with to determine a final estimation.

Request Circuit Breaker

The request circuit breaker allows an estimation of required heap memory per request. If a single request exceeds the specified amount of memory, an exception is raised.

indices.breaker.request.limit
Default: 60%
Runtime: yes

Specifies the JVM heap limit for the request circuit breaker.

indices.breaker.request.overhead
Default: 1.0
Runtime: yes

A constant that all request estimations are multiplied with to determine a final estimation.

Threadpools

Every node holds several thread pools to improve how threads are managed within a node. There are several pools, but the important ones include:

  • index: For index/delete operations, defaults to fixed
  • search: For count/search operations, defaults to fixed
  • get: For queries that are optimized to do a direct lookup by primary key, defaults to fixed
  • bulk: For bulk operations, defaults to fixed
  • refresh: For refresh operations, defaults to cache

threadpool.<threadpool>.type
Runtime: no
Allowed Values: fixed | cache

fixed holds a fixed size of threads to handle the requests. It also has a queue for pending requests if no threads are available.

cache will spawn a thread if there are pending requests (unbounded).

Fixed Threadpool Settings

If the type of a threadpool is set to fixed there are a few optional settings.

threadpool.<threadpool>.size
Runtime: no

Number of threads. The default size of the different thread pools depends on the number of available CPU cores.

threadpool.<threadpool>.queue_size
Default index: 200
Default search: 1000
Default get: 1000
Default bulk: 50
Runtime: no

Size of the queue for pending requests. A value of -1 sets it to unbounded.
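Put together, a crate.yml sketch for a fixed search pool (the size and queue values are illustrative):

```yaml
# Fixed pool with 32 threads; up to 2000 pending search requests are queued
threadpool.search.type: fixed
threadpool.search.size: 32
threadpool.search.queue_size: 2000
```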

Metadata

cluster.info.update.interval
Default: 30s
Runtime: yes

Defines how often the cluster collects metadata information (e.g. disk usage) if no concrete event is triggered.

Metadata Gateway

The gateway persists cluster meta data on disk every time the meta data changes. This data is stored persistently across full cluster restarts and recovered after nodes are started again.

gateway.expected_nodes
Default: -1
Runtime: no

The setting gateway.expected_nodes defines the number of nodes that need to be started for the cluster state to be recovered immediately. The value of the setting should be equal to the number of nodes in the cluster, because you only want the cluster state to be recovered after all nodes are started.

gateway.recover_after_time
Default: 0ms
Runtime: no

The gateway.recover_after_time setting defines the time to wait before starting the recovery once the number of nodes defined in gateway.recover_after_nodes are started. The setting is relevant if gateway.recover_after_nodes is less than gateway.expected_nodes.

gateway.recover_after_nodes
Default: -1
Runtime: no

The gateway.recover_after_nodes setting defines the number of nodes that need to be started before the cluster state recovery will start. Ideally the value of the setting should be equal to the number of nodes in the cluster, because you only want the cluster state to be recovered once all nodes are started. However, the value must be greater than half of the expected number of nodes in the cluster.
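As a sketch for a hypothetical three-node cluster, the gateway settings could be combined in crate.yml like this:

```yaml
# Recover the cluster state immediately once all 3 nodes have started ...
gateway.expected_nodes: 3
# ... or 5 minutes after at least 2 nodes (more than half) have started
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
```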

Session Setting Parameters

This section lists the session setting parameters supported by CrateDB. Parameters can be set with the SET/SET SESSION statement, see SET / RESET.

search_path
Default: doc
This parameter holds the default schema for a session, used when no schema is provided in queries. The value of search_path can be either a string or a comma-separated list of strings; however, CrateDB only considers the first element when a list is provided.
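For example, a sketch of switching the default schema for the current session (the schema name myschema is illustrative):

```sql
SET SESSION search_path TO myschema;
```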

Logging

CrateDB uses Log4j 2 for logging. The logging configuration file is located at config/log4j2.properties and uses the standard Log4j 2 properties format. Here is a small example of a working logging configuration:

rootLogger.level = info
rootLogger.appenderRef.console.ref = console

# log action execution errors for easier debugging
logger.action.name = org.crate.action.sql
logger.action.level = debug


appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %marker%m%n


Logger Settings

It’s possible to set the log level of loggers at runtime. This is particularly useful when debugging problems and there is a need to increase the log level without wanting to restart nodes. Logging settings are cluster wide and override the logging configuration of nodes defined in their log4j2.properties.

The RESET statement is also supported, with the limitation that the reset of the logging override only takes effect after a cluster restart.

To set the log level you can use the regular SET statement, for example:

SET GLOBAL TRANSIENT "logger.action" = 'INFO';

The logging setting consists of the prefix logger and a variable suffix which defines the name of the logger that the log level should be applied to.

In addition to hierarchical named loggers you can also change the log level of the root logger using the _root suffix.

In the example above the log level INFO is applied to the logger action.

Possible log levels are the same as for Log4j: TRACE, DEBUG, INFO, WARN, and ERROR. They must be provided as string literals in the SET statement.
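For example, to change the log level of the root logger via the _root suffix mentioned above:

```sql
SET GLOBAL TRANSIENT "logger._root" = 'DEBUG';
```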

Note

Be careful using the TRACE log level because it’s extremely verbose, can obscure other important log messages and even fill up entire data disks in some cases.

It is also possible to inspect the current “logging overrides” in a cluster by querying the sys.cluster table (see Cluster Settings).

Environment Variables

CRATE_HOME

Specifies the home directory of the installation. It is used to find default file paths, such as config/crate.yml, and the default locations for data and log files; all configured relative paths use this directory as a parent. This variable is usually set by the start-up script shipped with the distribution. In most cases it is the parent directory of the directory containing the bin/crate executable.

CRATE_JAVA_OPTS

This variable allows you to set Java options for CrateDB, such as the thread stack size.

For example, to change the stack size in order to avoid stack overflow exceptions:

CRATE_JAVA_OPTS=-Xss500k

CRATE_HEAP_SIZE

This variable specifies the amount of memory that can be used by the JVM.

The value of the environment variable can be suffixed with g or m. For example:

CRATE_HEAP_SIZE=4g

Certain operations in CrateDB require a lot of records to be held in memory at a time. If the amount of heap that can be allocated by the JVM is too low, these operations will fail with an OutOfMemory exception.

So it’s important to choose a value high enough for the intended use-case. But there are two limitations:

Use max. 50% of available RAM

Be aware that there is another user of memory besides CrateDB's heap: the underlying storage engine, Lucene. It leverages the underlying OS for caching in-memory data structures by design. Lucene indexes are split into several segment files; every file is immutable and will never change. This makes them very cache-friendly, and the underlying OS will keep hot segments resident in memory for faster access. So if all system memory is assigned to CrateDB's heap, there won't be any left over for Lucene, which can cause serious performance impacts.

Note

A good recommendation is to assign 50% of the available memory to CrateDB's heap while leaving the other 50% free. It will not go unused; Lucene will use whatever is left over.

Never use more than 30.5 Gigabyte

In order to save precious memory on x64 systems, the HotSpot Java Virtual Machine uses a technique called compressed ordinary object pointers (oops).

These are pointers to Java objects in the heap that only consume 32 bits, which saves a lot of space. The actual native 64 bit pointers are computed by scaling the 32 bit value by a factor of 8 and adding it to a base heap address. This allows the JVM to address about 32 GB of heap.

If you configure your heap to more than 32 GB, compressed oops can no longer be used. In effect, there will be much less space available in the heap, as object pointers then consume twice as much memory.

This boundary should be considered an upper bound for the heap size of any JVM application.

Note

In order to ensure that compressed oops are used no matter which JVM CrateDB runs on, configuring the heap to a value less than or equal to 30.5 GB (30500m) is suggested, as some JVMs only support compressed oops up to that value.

Running CrateDB on machines with huge RAM

If hardware with much more RAM is available, it is suggested to run more than one CrateDB instance on that machine with each one having a heap size of around 30.5 GB (30500m). But still leave half of the available RAM to Lucene.

In this case consider adding cluster.routing.allocation.same_shard.host: true to your config. This will prevent allocating the primary and a replica of the same shard on the same machine, even if more than one instance is running on it.

Other System Configuration

Virtual Memory

CrateDB uses a hybrid MMap/NIO file system storage directory to store its indices. Out-of-memory exceptions and failing bootstrap checks may be caused by the operating system’s default limit on max_map_count (the maximum number of memory map areas a process may have) being too low.

On Linux, you can increase the limit by running the following command:

sh$ sudo sysctl -w vm.max_map_count=262144

To set this value permanently, update the vm.max_map_count setting in /etc/sysctl.conf. You can then verify this after rebooting by running:
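The persistent entry in /etc/sysctl.conf is a single line:

```
vm.max_map_count = 262144
```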

sh$ sysctl vm.max_map_count

Enterprise Features

Note

To enable or use any of the enterprise features, Crate.io must have given you permission to enable and use the Enterprise Edition of CrateDB and you must have a valid Enterprise or Subscription Agreement with Crate.io. If you enable or use features that are part of the Enterprise Edition, you represent and warrant that you have a valid Enterprise or Subscription Agreement with Crate.io. Your use of features of the Enterprise Edition is governed by the terms and conditions of your Enterprise or Subscription Agreement with Crate.io.

Enterprise License

license.enterprise
Default: true
Runtime: no

Setting this to false disables the Enterprise Edition of CrateDB.

Javascript Language

lang.js.enabled
Default: false
Runtime: no

Setting to enable the Javascript language. Because the Javascript language is an experimental feature and is not securely sandboxed, it is disabled by default.

Authentication

Trust Authentication

auth.trust.http_default_user
Runtime: no
Default: crate

The default user that is used for authentication when clients connect to CrateDB via the HTTP protocol and do not specify a user via the X-USER request header.

Host Based Authentication

Authentication settings (auth.host_based.*) are node settings, which means that their values apply only to the node where they are applied and different nodes may have different authentication settings.

auth.host_based.enabled
Runtime: no
Default: false

Setting to enable or disable Host Based Authentication (HBA). It is disabled by default.

HBA Entries

The auth.host_based.config setting is a group setting that can have zero, one or multiple groups that are defined by their group key (${order}) and their fields (user, address, method, protocol).

${order}:
An identifier that is used as a natural order key when looking up the
host based configuration entries. For example, an order key of a will
be looked up before an order key of b. This key guarantees that the entry
lookup order will remain independent from the insertion order of the entries.

The Host Based Authentication (HBA) setting is a list of predicates that users can specify to restrict or allow access to CrateDB.

The meaning of the fields is as follows:

auth.host_based.config.${order}.user
Runtime: no
Specifies an existing CrateDB username; currently only the crate user
(superuser) is available. If no user is specified in the entry, then
all existing users can have access.
auth.host_based.config.${order}.address
Runtime: no
The client machine addresses that this entry matches and that are
allowed to authenticate. This field can contain an IP address or a CIDR
mask. For example: 127.0.0.1 or 127.0.0.1/32. If no address is specified
in the entry, then access to CrateDB is open for all hosts.
auth.host_based.config.${order}.method
Runtime: no
The authentication method to use when a connection matches this entry.
At the moment only the trust method is available. If no method is
specified, the trust method is used by default (see Trust Method).
auth.host_based.config.${order}.protocol
Runtime: no
Specifies the protocol for which the authentication entry should be used.
If no protocol is specified, then this entry will be valid for all protocols
that rely on host based authentication (see Trust Method).

Example of config groups:

auth.host_based.config:
  entry_a:
    user: crate
    address: 127.16.0.0/16
  entry_b:
    method: trust
  entry_3:
    user: crate
    address: 172.16.0.0/16
    method: trust
    protocol: pg