Troubleshooting with sys-tables

CrateDB maintains a set of diagnostic tables in the sys schema. It currently consists of ten tables that provide an overview of the cluster state. If something goes wrong and you initially don’t know why, these tables help you analyze the situation, identify the problem, and start mitigating it. While the reference documentation describes all system tables in detail, this guide walks you through the most common situations.

Step 1: Health check

A good place to start is the sys.checks table, which maintains a number of health checks; you may know it from the admin UI. To list the failing checks, ordered by severity:

cr> SELECT description FROM sys.checks WHERE NOT passed ORDER BY severity DESC;
+---------...-+
| description |
+---------...-+...
+---------...-+
SELECT ... in set (... sec)

If a check fails, its description offers some explanation of how to proceed. The table reports checks that verify your cluster layout, give recommendations for configuration options, and warn you about incompatible software versions. More checks may be added in future versions.

Step 2: Activity in the cluster

Statements that are currently being executed are tracked in the sys.jobs and sys.operations tables. They let you view the ongoing activity in the cluster.

If you’re using a version earlier than CrateDB 3.0.0, you first have to enable statistics tracking:

cr> SET GLOBAL stats.enabled = true;
SET OK, 1 row affected (... sec)

When enabled, each syntactically correct request that has been parsed and planned is listed in the sys.jobs table while it executes:

cr> SELECT id as job_uuid, date_format(started) AS start, stmt FROM sys.jobs;
+--------------------------------------+-----------------------------+-------------------------------------...----------------------------------+
| job_uuid                             | start                       | stmt                                                                     |
+--------------------------------------+-----------------------------+-------------------------------------...----------------------------------+
...
+--------------------------------------+-----------------------------+-------------------------------------...----------------------------------+
SELECT ... in set (... sec)

Once you have identified the job UUID in question, you can kill that job with the KILL command (see the sketch after the next listing). A single job is split into several operations which, depending on the query, run distributed across the nodes of your cluster. These operations are tracked in sys.operations, which also has a system column _node indicating on which node CrateDB actually executes each operation:

cr> SELECT _node['name'], _node['hostname'], * FROM sys.operations;
+---------------+------------------...+----+---------------...+---------+---------------+------------+
| _node['name'] | _node['hostname']   | id | job_id           | name    |       started | used_bytes |
+---------------+------------------...+----+---------------...+---------+---------------+------------+
...
+---------------+------------------...+----+---------------...+---------+---------------+------------+
SELECT ... in set (... sec)
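
As mentioned above, a runaway job can be cancelled with KILL. A minimal sketch; the UUID is a made-up placeholder, substitute the id value you found in sys.jobs:

cr> KILL 'a12ec138-a318-4c34-8024-4b5d97c62a5f';
KILL OK, ... rows affected (... sec)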

You will find out more about the _node system column in the next sections. If no currently running jobs or operations are causing problems, check the recorded history of finished jobs and operations in the sys.jobs_log and sys.operations_log tables, respectively.
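
For example, to spot the recent statements that took the longest, you can compare the started and ended timestamps of sys.jobs_log. A minimal sketch; depending on your CrateDB version, the difference is reported in milliseconds or as an interval:

cr> SELECT stmt, ended - started AS duration FROM sys.jobs_log
... ORDER BY duration DESC LIMIT 5;
+------...+----------+
| stmt    | duration |
+------...+----------+
...
+------...+----------+
SELECT ... in set (... sec)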

Step 3: Analyzing cluster resources

Sometimes it’s not a single query that causes problems, but a component of your distributed cluster. To find out more, check the sys.cluster table, which holds a single row containing the cluster’s name and ID, the currently elected master node, and the cluster settings. To list all available columns, run:

cr> SHOW COLUMNS IN cluster FROM sys;
+--------------------------------------------------------------------------------...+-----------...+
| column_name                                                                       | data_type    |
+--------------------------------------------------------------------------------...+-----------...+
...
+--------------------------------------------------------------------------------...+-----------...+
SHOW 104 rows in set (... sec)
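
For instance, to check which node is currently the elected master (master_node contains the node’s ID, which you can match against the id column of sys.nodes):

cr> SELECT name, master_node FROM sys.cluster;
+------...+---------------...+
| name    | master_node      |
+------...+---------------...+
...
+------...+---------------...+
SELECT 1 row in set (... sec)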

While sys.cluster contains information about the cluster as a whole, sys.nodes maintains more detailed information about each CrateDB instance. This can be useful to track down data nodes that misbehave because their CPU is overloaded or because they have an outdated Java version:

cr> SELECT name, load['1'], os_info['jvm']['version'] FROM sys.nodes;
+-------+--------...+------------------------...+
| name  | load['1'] | os_info['jvm']['version'] |
+-------+--------...+------------------------...+
...
+-------+--------...+------------------------...+
SELECT ... in set (... sec)

To list all nodes using more than 98 per cent of the memory, type:

cr> SELECT * FROM sys.nodes WHERE mem['used_percent'] > 98;
+--...+---...+------...-+-...+---...+--...+---...+------...+-...+------...+---...+-----...-+-------...+----------...-+------...+
| fs  | heap | hostname | id | load | mem | name | network | os | os_info | port | process | rest_url | thread_pools | version |
+--...+---...+------...-+-...+---...+--...+---...+------...+-...+------...+---...+------...+-------...+----------...-+------...+
...
SELECT ... in set (... sec)

The table also contains performance metrics such as the load average and disk, memory, heap, and network usage. Running:

cr> SHOW columns IN nodes FROM sys;
+-------------------------------------------------...+-----------...+
| column_name                                        | data_type    |
+-------------------------------------------------...+-----------...+
...
+-------------------------------------------------...+-----------...+
SHOW ... rows in set (... sec)

lists all available attributes. The node object has the same structure as the _node system column of sys.operations from the previous section.
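
As a practical example, the fs object helps you spot nodes that are about to run out of disk space. A sketch that reports the available and total bytes of each node’s file systems:

cr> SELECT name, fs['total']['available'], fs['total']['size']
... FROM sys.nodes ORDER BY fs['total']['available'];
+------...+--------------------------+---------------------+
| name    | fs['total']['available'] | fs['total']['size'] |
+------...+--------------------------+---------------------+
...
+------...+--------------------------+---------------------+
SELECT ... in set (... sec)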

Step 4: Insights about partitions, shards, and replication

CrateDB divides the rows of each table into shards that are distributed across the nodes of your cluster. Replication uses the same mechanism to add redundancy, and thus resilience, to your data. While CrateDB transparently takes care of distributing and replicating the shards most of the time, during troubleshooting it’s useful to find out more about these data structures. The sys.shards table provides access to the status and size of all shards, along with their names and IDs:

cr> SHOW COLUMNS IN shards FROM sys;
+--------------------------------+-----------+
| column_name                    | data_type |
+--------------------------------+-----------+
| blob_path                      | string    |
| id                             | integer   |
| min_lucene_version             | string    |
| num_docs                       | long      |
| orphan_partition               | boolean   |
| partition_ident                | string    |
| path                           | string    |
| primary                        | boolean   |
| recovery                       | object    |
| recovery['files']              | object    |
| recovery['files']['percent']   | float     |
| recovery['files']['recovered'] | integer   |
| recovery['files']['reused']    | integer   |
| recovery['files']['used']      | integer   |
| recovery['size']               | object    |
| recovery['size']['percent']    | float     |
| recovery['size']['recovered']  | long      |
| recovery['size']['reused']     | long      |
| recovery['size']['used']       | long      |
| recovery['stage']              | string    |
| recovery['total_time']         | long      |
| recovery['type']               | string    |
| relocating_node                | string    |
| routing_state                  | string    |
| schema_name                    | string    |
| size                           | long      |
| state                          | string    |
| table_name                     | string    |
+--------------------------------+-----------+
SHOW 28 rows in set (... sec)

The cluster state is somewhat delicate when nodes join or leave, since in those situations shards have to be rearranged to ensure that each of them is replicated to different nodes. As long as the state attribute is STARTED for all shards, the cluster is in a stable state; otherwise, CrateDB is occupied with some background activity. The cluster state indicators on the admin UI evaluate these values as well.
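
A quick way to verify this is to count the shards per state; on a stable cluster, this returns a single row with the state STARTED:

cr> SELECT state, count(*) FROM sys.shards GROUP BY state;
+---------+----------+
| state   | count(*) |
+---------+----------+
...
+---------+----------+
SELECT ... in set (... sec)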

The sys.shards table also contains information about rebalancing activity. Sometimes CrateDB needs to transfer a shard to another node, for example to ensure that enough replicas are distributed across the cluster. You can track the progress of such a transfer with the recovery object. To monitor an ongoing shard transfer, run this query:

cr> SELECT _node['hostname'], id, recovery['stage'], recovery['size']['percent'], routing_state, state FROM sys.shards
... WHERE routing_state IN ('RELOCATING', 'INITIALIZING') ORDER BY id;
+-------------------+----+-------------------+-----------------------------+---------------+-------+
| _node['hostname'] | id | recovery['stage'] | recovery['size']['percent'] | routing_state | state |
+-------------------+----+-------------------+-----------------------------+---------------+-------+
+-------------------+----+-------------------+-----------------------------+---------------+-------+
SELECT ... in set (... sec)

The query lists pairs of rows, in which the first row denotes the destination shard and the second row the source shard. Each row contains the shard’s hostname, its ID, and the recovery percentage of the transfer. When a shard starts relocating, a new entry appears in the sys.shards table with a routing_state of INITIALIZING and a state of RECOVERING. Meanwhile, the routing_state of the source row switches from STARTED to RELOCATING until the transfer is done. After that, the source row is automatically deleted from sys.shards.

To find out on which node a specific shard is stored, use the _node system column, which is available for this table as well. As an example:

cr> SELECT _node['hostname'], table_name, num_docs FROM sys.shards ORDER BY num_docs DESC LIMIT 3;
+-------------------...+-----------...-+----------+
| _node['hostname']    | table_name    | num_docs |
+-------------------...+------------...+----------+
...
+-------------------...+------------...+----------+
SELECT ... in set (... sec)

This query lists the hosts and tables with the highest number of rows inside a single shard.
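
Similarly, because the size column reports each shard’s size in bytes, you can aggregate it to see how much disk space every table occupies in total. A sketch:

cr> SELECT schema_name, table_name, sum(size) AS size_in_bytes
... FROM sys.shards GROUP BY schema_name, table_name ORDER BY size_in_bytes DESC;
+-------------+------------+---------------+
| schema_name | table_name | size_in_bytes |
+-------------+------------+---------------+
...
+-------------+------------+---------------+
SELECT ... in set (... sec)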

Step 5: Analyze allocation problems

Closely related to the previous step about shards and replication is the analysis of cluster-wide shard allocation.

In some circumstances, shard allocation might behave differently than you expect. A typical example is a table that remains under-replicated for no apparent reason. In that case, you want to find out what is preventing the cluster from allocating the shards. For that, there is the sys.allocations table, which lists every shard in the cluster.

If a shard is unassigned, the row will also include a reason why it cannot be allocated on any node.

If a shard is assigned but cannot be moved or rebalanced, the row includes a reason why it remains on the current node.

For a full list of available columns, see the reference documentation about the sys.allocations table.
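
For example, to list only the shards that could not be allocated anywhere, along with the reasons the allocation deciders give:

cr> SELECT table_name, shard_id, explanations FROM sys.allocations
... WHERE current_state = 'UNASSIGNED';
+------------+----------+--------------+
| table_name | shard_id | explanations |
+------------+----------+--------------+
...
+------------+----------+--------------+
SELECT ... in set (... sec)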

To find out about the different states of shards of a specific table, you can simply filter by table_schema and table_name, e.g.:

cr> SELECT table_name, shard_id, node_id, explanations
... FROM sys.allocations
... WHERE table_schema = 'doc' AND table_name = 'my_table'
... ORDER BY current_state, shard_id;
+------------+----------+---------+--------------+
| table_name | shard_id | node_id | explanations |
+------------+----------+---------+--------------+
| my_table   | ...      | ...     | ...          |
+------------+----------+---------+--------------+
SELECT ... in set (... sec)
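
If the explanations indicate that an allocation failed repeatedly, for example after a transient disk-space problem that has since been resolved, you can ask CrateDB to retry the failed allocations:

cr> ALTER CLUSTER REROUTE RETRY FAILED;
ALTER OK, ... rows affected (... sec)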

Step 6: Managing snapshots

Finally: if your repair efforts did not succeed, or your application or users accidentally deleted some data, you can recover one of the previously taken snapshots of your cluster. The sys.snapshots and sys.repositories tables assist you in managing your backups. Remember, backups are stored in repositories outside the CrateDB cluster; a repository is initialized with the CREATE REPOSITORY statement, and an actual copy of the current database state is made with the CREATE SNAPSHOT command (see the sketch below). If you forgot where you store your snapshots, the following query might come in handy:

cr> SELECT * FROM sys.repositories;
+------+----------+------+
| name | settings | type |
+------+----------+------+
+------+----------+------+
SELECT ... in set (... sec)
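
If you have not set up backups yet, a repository and a snapshot are created along these lines. A hedged sketch: the repository name backups, the fs type, and the location are example values, and other repository types such as s3 are available too:

cr> CREATE REPOSITORY backups TYPE fs WITH (location = '/path/to/backup');
CREATE OK, 1 row affected (... sec)

cr> CREATE SNAPSHOT backups.snapshot1 ALL WITH (wait_for_completion = true);
CREATE OK, 1 row affected (... sec)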

To actually recover data, first determine which snapshot to restore. Supposing you make nightly backups, the command:

cr> SELECT * FROM sys.snapshots ORDER BY started DESC LIMIT 7;
+------------------+----------+------+------------+---------+-------+---------+
| concrete_indices | finished | name | repository | started | state | version |
+------------------+----------+------+------------+---------+-------+---------+
+------------------+----------+------+------------+---------+-------+---------+
SELECT ... in set (... sec)

shows you last week’s snapshots along with their names, the stored indices, and how long they took.
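
Once you have picked a snapshot, you can restore tables from it. A sketch, assuming the repository is called backups, the snapshot snapshot1, and the lost table my_table; note that a table must not exist in the cluster while it is being restored:

cr> RESTORE SNAPSHOT backups.snapshot1 TABLE my_table WITH (wait_for_completion = true);
RESTORE OK, 1 row affected (... sec)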