Monitoring is an essential part of every production system. It helps us to understand performance characteristics and spot current or potential future problems. This is especially valuable when it comes to distributed systems, which are often a good deal trickier to keep an eye on.
Fortunately, once you know how to monitor one distributed Java application, you will have a good idea of how to monitor others.
So, in this post, I am going to look at one application in particular, one I am particularly familiar with: CrateDB, a distributed SQL database. The lessons learned here should be broadly applicable to any other distributed system written in Java, such as Spark, Elasticsearch, or HDFS.
In this post, I will use CrateDB to demonstrate the most important metrics when monitoring a distributed Java application, and explain why those metrics are important. I will also do a quick round-up of some of the tools you might want to consider using.
There are four main areas that should be monitored:
- Java Virtual Machine (JVM)
- CPU utilization
- Disk utilization
- Network utilization
For bonus points you can also monitor CrateDB cluster integrity.
Let's take a look at each area.
Java Virtual Machine (JVM)
The three most critical metrics for any Java application are:
- Heap usage
- Garbage collection time
- Thread count
Heap Usage
When running CrateDB in production, the heap size should be fixed in order to prevent the JVM from paging out to disk. Paging to disk (aka "swapping") is very slow and will have a significant impact on performance.
The JVM maintains a memory heap (i.e. system memory that is reserved by the JVM) and dynamically allocates this to the application as needed.
Later, when that memory is no longer needed by the application, the JVM garbage collector will free it up (i.e. put it back on the heap for future use).
The utilization of heap memory over time is an important indicator of the health of CrateDB. CrateDB exposes that information directly in the sys.nodes table, which we can query like so:
SELECT heap['probe_timestamp'] AS ts, heap['max'] AS heap_max, heap['used'] AS heap_used, 100.0 * heap['used'] / heap['max'] AS heap_percent FROM sys.nodes
When drawing a graph from this data, heap memory usage in a healthy CrateDB cluster should look like a sawtooth pattern. Heap usage should gradually increase as CrateDB requires more memory, and should drop suddenly when garbage collection happens.
Ideally, the normal operation of CrateDB should vary between 33% heap usage and 66% heap usage.
An ever-increasing line would indicate a memory leak and would eventually lead to an OutOfMemoryError. To avoid this scenario, it makes sense to set a trigger threshold that will alert you when heap usage gets too high, e.g. above 90% of available heap.
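As a sketch of how such an alert check might work (the sample values and the 90% threshold are assumptions for illustration), the following mirrors the heap percentage calculated in the query above:

```python
# Sketch of a heap-usage alert, assuming a 90% threshold.
# The sample values are hypothetical; in practice you would read
# heap['used'] and heap['max'] from the sys.nodes table.

ALERT_THRESHOLD = 90.0  # percent of available heap

def heap_percent(used_bytes, max_bytes):
    """Mirror the SQL expression: 100.0 * heap['used'] / heap['max']."""
    return 100.0 * used_bytes / max_bytes

def should_alert(used_bytes, max_bytes, threshold=ALERT_THRESHOLD):
    return heap_percent(used_bytes, max_bytes) > threshold

# Hypothetical samples: a healthy node and one near heap exhaustion.
print(should_alert(2_000_000_000, 4_000_000_000))  # False: 50% used
print(should_alert(3_800_000_000, 4_000_000_000))  # True: 95% used
```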
Garbage Collection Time
In a healthy application, garbage collection should run regularly, but not so often that it impacts performance. And garbage collection times should be quick and should not vary too much.
An increasing garbage collection time, due to constant high load, may be the first sign of a node failure, where the node becomes unresponsive and drops out of the cluster. Sometimes an oversized heap is the reason for long garbage collection times, because allocated memory can pile up without being released in time.
At the moment, garbage collection times are not exposed directly by CrateDB, but slow garbage collection times are logged. A garbage collection log line in CrateDB looks like this:
[2018-02-19T14:52:30,798][INFO ][o.e.m.j.JvmGcMonitorService] [crate1] [gc] overhead, spent [...s] collecting in the last [...s]
However, if you want to monitor garbage collection times, you will need to use a third-party tool to do so.
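If you want to extract those timings yourself, one lightweight option is to parse the log lines. This is only a sketch: the bracketed durations in the example are invented (the original log line elides them), and the exact format should be verified against your own logs.

```python
import re

# Sketch of a parser for the GC overhead log lines shown above.
# Assumes durations appear as bracketed values like [1.2s] or [329ms];
# verify the exact format against your own CrateDB logs.
GC_LINE = re.compile(
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>m?s)\] collecting "
    r"in the last \[(?P<window>[\d.]+)(?P<window_unit>m?s)\]"
)

def gc_overhead(line):
    """Return GC time as a fraction of the window, or None if no match."""
    m = GC_LINE.search(line)
    if not m:
        return None
    to_seconds = lambda value, unit: float(value) / (1000.0 if unit == "ms" else 1.0)
    spent = to_seconds(m["spent"], m["spent_unit"])
    window = to_seconds(m["window"], m["window_unit"])
    return spent / window

# Hypothetical log line (the durations are invented for illustration):
line = ("[2018-02-19T14:52:30,798][INFO ][o.e.m.j.JvmGcMonitorService] "
        "[crate1] [gc] overhead, spent [1.2s] collecting in the last [10s]")
print(gc_overhead(line))  # 0.12
```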
Thread Count
CrateDB uses multiple, differently sized thread pools for specific tasks, such as indexing or search. CrateDB sets up fixed-size thread pools when it starts (though you can change this to dynamic sizing), mostly based on the number of available CPU cores.
If the thread pools are constantly full, it may indicate that CrateDB is overloaded and that there is too much "pressure" on the CPU.
CrateDB exposes statistics about its own thread pools via the sys.nodes table. You can query this table to get the number of currently running and queued threads for each pool on each node, like so:
SELECT thread_pools['active'] AS active_threads, thread_pools['queue'] AS queued_threads, thread_pools['threads'] AS max_threads FROM sys.nodes
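As a rough sketch of how to interpret those numbers (the rule of thumb and sample values here are this author's assumptions), a pool could be considered saturated when all of its threads are busy and work is queuing up:

```python
# Sketch: flag a thread pool as saturated when every thread is busy
# and requests are queuing. The parameters mirror the sys.nodes columns
# queried above: thread_pools['active'], ['queue'], and ['threads'].

def pool_saturated(active, queued, max_threads):
    return active >= max_threads and queued > 0

# Hypothetical samples:
print(pool_saturated(active=4, queued=0, max_threads=8))    # False: idle capacity
print(pool_saturated(active=8, queued=120, max_threads=8))  # True: overloaded
```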
CPU Utilization
Some tasks in CrateDB are memory intensive, whereas others are more CPU intensive.
The most CPU intensive tasks are: table indexing and shard recovery at startup. Additionally, handling a large number of client connections can be CPU intensive.
There are three aspects of CPU utilization:
- Operating system CPU usage
- Process CPU usage
- System load
Operating System CPU Usage
When CrateDB is the only computation intensive application running on the host, overall operating system CPU usage gives you a decent indication of how the CPU cores are being utilized.
CrateDB exposes this metric via the sys.nodes table, which you can query like so:
SELECT os['cpu']['used'] FROM sys.nodes
If there are other CPU intensive services running on the same machine (e.g. a client application), this metric will be less useful.
Process CPU Usage
If you want to monitor how much CPU CrateDB is using, as distinct from the overall operating system CPU usage, this metric is also exposed via the sys.nodes table, which you can query like so:
SELECT process['cpu']['percent'] FROM sys.nodes
System Load
On Linux systems, the system load is an indication of how many processes are waiting for resources like CPU or disk. This is a good high-level metric. However, as with operating system CPU usage, this will become less useful for monitoring CrateDB when other resource intensive services are running on the same host.
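As an illustration, on Unix-like systems you can read the load averages directly from Python's standard library (most monitoring agents collect this for you, so this is just a sketch):

```python
import os

# os.getloadavg() returns the 1-, 5-, and 15-minute load averages
# (available on Unix-like systems; not on Windows).
one, five, fifteen = os.getloadavg()
print(f"load averages: {one:.2f} {five:.2f} {fifteen:.2f}")

# A common rule of thumb: sustained load above the number of CPU cores
# suggests processes are waiting on resources.
cores = os.cpu_count()
print(f"cores: {cores}, possibly overloaded: {one > cores}")
```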
Disk Utilization
Disk utilization has two components:
- Disk usage
- Disk input/output
Disk Usage
The disk (or disks) on which CrateDB stores its data (defined by the path.data setting) need to be monitored if you want to make sure you never run out of disk space.
Additionally there are two thresholds in CrateDB, known as the low and high disk watermarks:
- The low disk watermark (configured by cluster.routing.allocation.disk.watermark.low) is the threshold at which CrateDB will stop allocating any more replica shards on that node.
- The high disk watermark (configured by cluster.routing.allocation.disk.watermark.high) is the threshold at which CrateDB will try to actively relocate replica shards away from that node.
If you are monitoring disk usage, it makes sense to set up some sort of alerting that takes these values into consideration.
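For example (a sketch only; the 85% and 90% values are this author's assumptions, so read the actual thresholds from your cluster's cluster.routing.allocation.disk.watermark.* settings), an alert could report which threshold has been crossed:

```python
# Sketch of watermark-aware disk alerting. The 85%/90% values are
# assumptions for illustration; read the real thresholds from your
# cluster.routing.allocation.disk.watermark.low/high settings.

LOW_WATERMARK = 85.0   # CrateDB stops allocating new replica shards
HIGH_WATERMARK = 90.0  # CrateDB actively relocates shards away

def disk_status(used_percent):
    if used_percent >= HIGH_WATERMARK:
        return "high watermark exceeded: shards will be relocated away"
    if used_percent >= LOW_WATERMARK:
        return "low watermark exceeded: no new replica shards allocated"
    return "ok"

print(disk_status(50.0))  # ok
print(disk_status(87.0))  # low watermark exceeded: no new replica shards allocated
print(disk_status(95.0))  # high watermark exceeded: shards will be relocated away
```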
Disk Input/Output
Disk input/output, or disk I/O, is how often and how much data is being written to or read from disk.
Disk I/O is often a performance bottleneck for CrateDB clusters, and monitoring it can verify whether this is the case.
Additionally, extremely high amounts of disk reads, in combination with slow queries, may indicate that CrateDB does not have enough memory, and so disk reads are not being cached often enough.
CrateDB exposes two disk I/O statistics (bytes read and bytes written) in the sys.nodes table. However, we recommend that you use a third-party tool for a more complete picture of disk I/O health, and so that you can continue to collect metrics even if CrateDB becomes unresponsive. Prometheus is one good option for this.
The sys.nodes table can be queried like so:
SELECT fs['total']['bytes_read'] / 1024.0 / 1024.0 AS read_mb, fs['total']['bytes_written'] / 1024.0 / 1024.0 AS write_mb FROM sys.nodes
While these figures are running totals, sampling them regularly allows you to calculate read and write throughput over each sampling interval.
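For instance, given two samples of a running total taken some seconds apart (the values below are hypothetical), throughput is just the delta divided by the interval:

```python
# Sketch: derive throughput from two samples of the running totals
# exposed in fs['total']['bytes_read'] / fs['total']['bytes_written'].
# The sample values are hypothetical.

def throughput_mb_per_s(total_t0, total_t1, interval_s):
    return (total_t1 - total_t0) / interval_s / 1024.0 / 1024.0

# Two samples of bytes_written, taken 10 seconds apart:
t0 = 52_428_800    # 50 MiB total at the first sample
t1 = 157_286_400   # 150 MiB total at the second sample
print(throughput_mb_per_s(t0, t1, 10))  # 10.0 (MiB/s)
```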
One additional disk-related metric you might want to consider is the number of open file descriptors. Most operating systems impose an upper limit on this number, so it may be a good idea to monitor this.
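On Unix-like systems, the limit itself can be read via Python's standard library, as an illustration (monitoring agents typically track the number of descriptors actually in use against this limit):

```python
import resource

# Soft and hard limits on open file descriptors for this process
# (RLIMIT_NOFILE; Unix-like systems only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptor limits: soft={soft}, hard={hard}")
```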
Network Utilization
Network utilization metrics are not as important as the previously mentioned metrics, but they can help you debug some problems. This is especially true when you are running CrateDB on hardware that you do not control, i.e. cloud environments, because you are not in full control of network performance.
With a distributed system like CrateDB, operations that require the involvement of multiple nodes are limited by the slowest network connection between those nodes. The more network latency or packet loss in a cluster, the slower the cluster. In fact, in some poor network performance situations, you may find that nodes are dropping out of the cluster because they are not able to respond quickly enough.
Monitoring the amount of data that is sent and received on each node as well as the number of sent, received, and retransmitted packets, will help you understand how stable your network performance is.
Another network-related metric to consider is the number of open connections. Again, most operating systems impose an upper limit on this number, so it may be a good idea to monitor this.
Since CrateDB does not expose any of these metrics, network statistics need to be gathered with a third-party tool.
CrateDB Cluster Integrity
So far we have only looked at external metrics, i.e. metrics that tell you about the host machine that CrateDB is running on.
External metrics will give you a decent picture of how healthy your cluster is. However, there is still the possibility that CrateDB may experience internal issues.
The two most important internal metrics for CrateDB are data health and cluster health. Both of these can be found in the CrateDB administration UI, in the status bar at the top of the screen.
There are three possible statuses for data health:
- Green: All data is replicated and available
- Yellow: There are unreplicated records
- Red: Some data is unavailable
And three possible statuses for cluster health:
- Green: Good configuration
- Yellow: Some configuration warnings
- Red: Some configuration errors
At the time of writing, CrateDB unfortunately does not yet provide a way to get either value via SQL. However, if the (unsupported) Elasticsearch API is enabled, you can query the cluster health API via HTTP.
We plan to expose this information via SQL in a future release.
Hosted solutions often provide a proprietary collection daemon that can collect a wide range of host metrics as well as Java application metrics via the JMX interface. These metrics are then sent to and stored by the service provider, who also provides web-based dashboards for analyzing the data.
In no particular order, some options you might want to consider are:
There are plenty of open source monitoring tools, if you'd prefer to run the monitoring software yourself.
In no particular order, some examples are:
Monitoring and alerting are a vital part of any production deployment of a distributed Java application, of which CrateDB is just one example.
There are three important principles:
- While the application itself might expose metrics, it is always better to gather those metrics from the host system directly
- Fewer metrics, chosen well, are better than many
- How you monitor is less important than what you monitor
In future posts, we will take a closer look at setting up monitoring and alerting using individual third-party tools.