Monitoring is a critical part of any complex system. If you’re not monitoring your system properly, you can’t be sure that it’s working the way it’s supposed to, and it’s hard to figure out what’s going on when there is a problem.
For most of FastMail's life, it's been monitored by a set of scripts that run regularly on each machine, look for anything that seems wrong and if a problem is found, report it. Most of the time that's by sending an email for engineers to look into later but in extreme cases, they would send an SMS and demand immediate attention.
These tests are fairly simple. The example I usually use is about disk space. Every time the monitor scripts run, they would check the amount of space free on each mail disk. If it's more than 92% full, it would emit a warning. More than 95%, it would wake someone.
The problem with this simple test is that it doesn't really say much about why the disk is filling or what the correct remedy is.
- It could be that a user is importing a lot of mail, so the usage has likely grown quite quickly and we'll soon run out of space if some isn't made available.
- It could be that the user is being flooded with mail, and blocking or deferring incoming mail for them will fix the problem.
- It could be that a remote replica is short on space, which isn't an immediate problem at all.
All of these cases are important, but not all demand immediate attention, and all have different methods to resolve the problem. But we've just woken someone and told them "the disk is nearly full", and it's up to them to figure out what's going on.
As one of the lucky people that would get woken up by this, it wasn't long before I started to get a sense of what information we didn't have that could really help here.
- How fast is the disk usage growing? How much time do I have before it reaches a critical level?
- Which user has a lot of activity on their mailbox (new messages appearing)? Are they doing an import right now? Are we delivering a lot of mail to them right now?
These are hard questions to answer with a simple per-host "is anything wrong right now?" check. Importers and deliver happen on different hosts to where the mail is stored. Tracking rates of change means you need to know previous values, which means some storage for the check scripts. To improve things, we needed something new.
A central metrics store
During 2015 we started looking seriously at a product or service that could do the job for us. There were lots of possibilities, each with pros and cons. Things we particularly wanted:
- A single place to store information about everything, from low-level host data (hardware health, disk usage) to system management and health info, to high-level app data (number and type of users)
- A store of historical data, so we can see how things have changed over time
- A way to ask questions including all types of data, so we can correlate events from different services (the "disk usage grows when the importer runs" scenario)
- Alert based on the answers to these questions
- Pretty and useful graphs of things that we can easily share with the team
- Minimal maintenance required
We did a few experiments, and eventually settled on Prometheus, which ticks pretty much all of these boxes.
Prometheus is what's known as a time-series database, which basically means that it stores the value of different things at a given point in time. Each thing it stores is called a "metric", and has a numeric value.
Using our disk usage example again, we might store two values: the total size of a disk, and the amount of free space. So our metrics (in the Prometheus data format) might look like:
disk_size 1968480002048 disk_free 1236837498880
Now we have these raw values stored, we can use Prometheus' query language to understand more about these values. For example, we could use a simple query to get a "percent full" value:
100 - 100 * disk_free / disk_size 37.16789108382110
Because Prometheus is constantly checking and storing these values as they change, we can also do more advanced things based on the history. For example, we can use the
deriv() function to find out how much the space on this disk changes each second (based on the last 15 minutes worth of values):
We can also use a separate product, Grafana, to graph these values. This is a very boring one of this disk value:
There's loads more things we can ask Prometheus about our metrics, and Grafana can graph pretty much everything we can query. Prometheus also comes with an alerting component, which can send emails, SMS messages and other stuff based on the results of queries. It's a fantastic piece of software and we're very happy with it.
It's all very well having a place to store all these metrics, but we still have to get them out of our systems. Prometheus' model is very simple: every few seconds, it connects to very simple web servers all over your network and requests a list of all the metrics they have to give.
The idea is that all of your internal services will have a HTTP port and be able to provide metrics about what they're doing, but that's not always possible. Some software is too simple to carry its own metrics, or does do its own metrics but presents them in some different way (often logfiles).
So, the Prometheus project and wider community have produced a huge number of "exporters", which interrogate applications or the OS to collect information about them, and then present those to Prometheus. We use a number of off-the-shelf exporters for off-the-shelf software we use (eg
node_exporter for the OS,
mysql_exporter for MySQL, etc), and we've written our own for a few things where off-the-shelf exporter didn't exist or didn't do what we wanted (PowerDNS, Postfix, tinydns, etc).
The most common style of exporter we have at the moment is one that monitors an application logfile, extracts information about events that occurred and presents them to Prometheus as counts. Many of our existing monitoring scripts already did that, and most of our own internal applications log lots about what they're doing, so it's been a very good transition step.
One of the most interesting applications for us, of course, is the Cyrus mail server, the place where your mail is stored, searched and accessed. When we started with Prometheus there was no good way to see inside and find out what it was doing. It does log lots about what it's doing, and we do watch those logs for problems, but there's so many things it does internally that we don't have a lot of visibility on. One of the things about gathering metrics is that you want information available when you have new questions to ask, new things you want to know. So we wanted to bring out as much information from Cyrus as we could, much more than what was currently available in the logs.
Adding Prometheus metrics to Cyrus
So I sat down with Ellie, my local Cyrus developer, and had a chat about it.
It turns out that Cyrus once had SNMP support. SNMP is an older protocol for remotely monitoring and managing things. It's still in wide use with network equipment. Cyrus still had support for it, but large parts of it didn't appear to work. Ellie had been keen to understand it for a while and either fix it or remove it, and this seemed like a great opportunity.
We spent an afternoon pooling our knowledge, me with my basic understanding of how Prometheus worked and what Cyrus is like to operate, her with her knowledge of Cyrus internals and what kind of monitoring other users were asking for, and worked out plan.
Cyrus already has a HTTP server (for CalDAV, CardDAV, etc) so getting it to serve metrics was not complicated. The toughest part is actually around Cyrus' architecture. The Cyrus model is a single process that accepts incoming connections, which it then hands off to another process for service. There's no central coordinating component for these processes; they just do their thing and use filesystem locks to make sure they don't get in each others' way when accessing mail data. That's part of how it gets its performance, so that’s great. The downside is that all these worker processes need to record metric events, so something somewhere needs to collect all the stats coming back from each process and combine it.
To do that, Ellie modified each Cyrus process to write its own stats out to a file, and then created a new Cyrus component called
promstatsd which collects all of these, combines them and prepares a metrics data file ready for Prometheus to come and fetch.
As for the metrics themselves, we chose to start with a few that already had internal counters inside Cyrus: active connection counts, message delivery counts and IMAP command counts. These aren't things we currently pay a lot of attention to but they might be useful in the future (as noted above, we want to collect everything we can, just in case) but Cyrus already knows them so it makes it easier to focus on adding the metrics support. They're also easy to test, since they are incremented when simple actions are taken: login, send a message, etc.
Cyrus is written in C, which is not widely used for server development anymore, so there were no good Prometheus metrics libraries available to use. Ellie wrote a very nice little metrics implementation for Cyrus that is designed to perform well and make it easy to add new metrics without adding overhead.
Put it all together, and we can now use Prometheus and Grafana to see what's happening inside Cyrus. For example, here's a graph of the number of active IMAP connections to a single Cyrus shard:
Now that we have the basic framework in place, we have a lot more metrics we want to add, particularly for the httpd server that will be used for JMAP, and improvements to the way the metrics are expressed to Prometheus to give us more querying options (eg, only count CalDAV requests, excluding CardDAV requests).
Adding support to Cyrus has been pretty exciting, partly because of the general excitement of being able to better understand what our servers are doing, but also because it's really interesting to keep updating this ancient project to support new ways of working.
We won't stop! We still have a long way to go to get all of our monitoring and alerting moved into Prometheus, but every week we move one or two little things over, or add one or two little new things, often finding minor problems or quirks that we didn't know about but can now see, and thus fix. Every time we do this we smile, because we learn a little more about how our systems really work and feel a little more confident that they're doing what they're supposed to. And with everything we know about our systems all available in one central place that everyone can see and work with, we’ll be able to sleep easy at night knowing that everything is running smoothly and any problems will be quickly noticed.