Monitoring and reporting are a critical part of any business that relies on technology. At any point in time you should know exactly what state your system is in. This window into the health of your system helps you diagnose and troubleshoot problems, plan for the future, and most importantly, prevent issues from happening in the first place.
We use several different monitoring and reporting tools here at Signal, each serving a distinct purpose.
Scout is a hosted server monitoring service that allows you to collect stats on just about anything. Scout uses a plugin-based approach to determine what stats you would like to collect for each of your servers. After setting up your server profile in Scout, you can configure that server profile with any number of plugins. An agent application runs on your server, determines what stats it needs to collect based on your server’s profile, and sends the data to Scout. The data can then be viewed in multiple ways, or can trigger alarms based on thresholds that you set.
At the time of this writing, Scout has 57 plugins! Plugins cover the absolute necessities (CPU usage, memory usage, disk utilization, disk usage, server load), popular databases (MySQL, MongoDB, CouchDB, Redis), web servers (Apache, Nginx, Mongrel), full text search engines (Sphinx, Elasticsearch), and more. And, if they don’t have the plugin you need, it is incredibly easy to create one using their API. We’ve created several, including a few now included in the official Scout plugin directory (RabbitMQ, CouchDB, and Elasticsearch).
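To make the plugin idea concrete, here is a minimal Python sketch of a profile-driven agent with threshold alerts. It is not Scout's agent or plugin API: the plugin functions, the `PLUGINS` registry, the stat names, and the hard-coded values are all made up for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical plugin type: a function that returns a dict of stats.
Plugin = Callable[[], Dict[str, float]]


def disk_plugin() -> Dict[str, float]:
    # Stand-in value; a real plugin would read this from the operating system.
    return {"disk_used_pct": 91.0}


def load_plugin() -> Dict[str, float]:
    # Stand-in value for the 1-minute load average.
    return {"load_1m": 0.42}


# Hypothetical registry of available plugins, keyed by name.
PLUGINS: Dict[str, Plugin] = {"disk": disk_plugin, "load": load_plugin}


def run_agent(profile: List[str]) -> Dict[str, float]:
    """Run only the plugins named in the server profile and merge their stats."""
    report: Dict[str, float] = {}
    for name in profile:
        report.update(PLUGINS[name]())
    return report


def triggered_alerts(report: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of stats that exceed their configured threshold."""
    return sorted(
        stat for stat, limit in thresholds.items() if report.get(stat, 0.0) > limit
    )


report = run_agent(["disk", "load"])
alerts = triggered_alerts(report, {"disk_used_pct": 90.0, "load_1m": 4.0})
print(alerts)
```

The server profile is just a list of plugin names, so adding a stat to a server means adding one entry to its profile rather than changing the agent.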
We are currently using Scout to monitor all of our production servers, which provides alerts for things like:
- high disk I/O utilization
- high server load
- excessively high numbers of messages in our various queues
- status of our databases and full text search cluster
- high memory usage for any of our processes
Pingdom is a service that provides a very simple yet critical function. It ensures that your site is up. Pingdom pings your site from several locations around the world to make sure it is reachable and returning a successful response. Whenever our site has gone down (which isn’t often!), Pingdom has always been the first to notify us, allowing us to take action immediately.
Pingdom also lets you define what should be considered a successful response. By default, it is an HTTP response of 200 (OK). However, you can configure it to post data to a given URL, and check the response body for some value. Utilizing this functionality, we have Pingdom hit a few of our critical APIs to make sure that they are not only reachable, but also returning the expected content.
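The same kind of check is easy to sketch in Python. This is not how Pingdom is implemented or configured; `is_successful` and `check_url` are hypothetical helpers that only illustrate the rule "status 200, plus an optional expected string in the body".

```python
import urllib.request
from typing import Optional


def is_successful(status: int, body: str, expected_text: str = "") -> bool:
    """Healthy means a 200 status and, if given, expected_text present in the body."""
    if status != 200:
        return False
    # An empty expected_text matches any body, so the status alone decides.
    return expected_text in body


def check_url(url: str, expected_text: str = "",
              post_data: Optional[bytes] = None, timeout: float = 10.0) -> bool:
    """Fetch url (as a POST when post_data is given) and apply is_successful."""
    try:
        with urllib.request.urlopen(url, data=post_data, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_successful(resp.status, body, expected_text)
    except OSError:
        # Covers unreachable hosts, timeouts, and HTTP error statuses,
        # which urllib raises as exceptions rather than returning a response.
        return False
```

Keeping the pass/fail rule in `is_successful`, separate from the network call, makes it possible to test the rule without touching a live site.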
Proby is a service we built here at Signal that monitors cron jobs and other scheduled tasks. In our efforts to find the right monitoring tool for us, we determined we’d be better off building it ourselves. We have several jobs, executed by cron, that run at various intervals (every minute, hourly, daily, etc.). Proby makes sure that these jobs start and finish when expected. If a job fails to start when expected, runs longer than expected, or fails, we are notified immediately via email. Proby also supports SMS notifications for more critical tasks.
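The three failure conditions above can be sketched in a few lines of Python. This is not Proby's code or API; the `TaskRun` fields and the one-minute grace period are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TaskRun:
    """Hypothetical record of one scheduled execution of a task."""
    expected_start: datetime
    max_runtime: timedelta
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    failed: bool = False


def problems(run: TaskRun, now: datetime,
             grace: timedelta = timedelta(minutes=1)) -> List[str]:
    """Return the reasons, if any, that someone should be notified about this run."""
    if run.started_at is None:
        # Allow a short grace period (an assumption) before calling the start late.
        if now > run.expected_start + grace:
            return ["did not start when expected"]
        return []
    issues: List[str] = []
    # For a run that is still in progress, measure the runtime up to now.
    end = run.finished_at or now
    if end - run.started_at > run.max_runtime:
        issues.append("ran longer than expected")
    if run.failed:
        issues.append("failed")
    return issues
```

A checker like this would run on its own schedule, independent of the jobs it watches, since a job that never starts cannot report its own absence.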
Task monitoring is just one of Proby’s features. Proby also keeps an execution history for each of your tasks, allowing you to see trends in runtimes, when an issue started, and when it was resolved. It can also store error messages for failed task executions. All of this information can be very valuable when diagnosing issues.
We currently have 39 scheduled tasks being monitored by Proby and sleep sound at night knowing that we’ll be notified immediately if any of them start to misbehave.
Monit and God
Monit and God are process monitoring frameworks that help us keep an eye on the processes that make up our system. Both allow you to define what to check for your process (memory usage, CPU utilization, HTTP responses, etc.) and what to do when those checks fail (send an email, restart the process, etc.). They also allow you to easily manage groups of similar processes. For example, we can easily restart a cluster of related processes with a single command.
While we started out using Monit, the engineering team at Signal has slowly been migrating all process monitoring to God. There are several reasons for this:
- God config is Ruby code, so it’s easy to apply the same set of config rules to multiple processes. With Monit, we need to generate an almost identical config section for each process.
- God provides better control over the environment the process is executed in.
- God provides better control for letting it know when a process is dead.
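The first point is easier to see with an example. The sketch below is Python, not God's Ruby DSL, and every name and limit in it is hypothetical. It only illustrates why config-as-code helps: the rules are written once and attached to each process in a loop.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str   # which measurement to check, e.g. "memory_mb"
    limit: float  # the check fails when the measurement exceeds this
    action: str   # what to do on failure, e.g. "restart" or "email"


# One shared rule set, written once (made-up limits).
SHARED_RULES: List[Rule] = [
    Rule("memory_mb", 300.0, "restart"),
    Rule("cpu_pct", 90.0, "email"),
]

# Attach the same rules to every process in the group (hypothetical names).
WATCHED: Dict[str, List[Rule]] = {
    name: SHARED_RULES for name in ["worker-1", "worker-2", "worker-3"]
}


def actions_for(process: str, sample: Dict[str, float]) -> List[str]:
    """Return the actions triggered by one sample of a process's metrics."""
    return [
        rule.action
        for rule in WATCHED[process]
        if sample.get(rule.metric, 0.0) > rule.limit
    ]


print(actions_for("worker-2", {"memory_mb": 512.0, "cpu_pct": 10.0}))
```

With a static config format, the equivalent would be three nearly identical sections, one per worker, that have to be kept in sync by hand or by a generator script.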
Regardless of what you use, process monitoring is essential. A rogue process can easily take down an entire machine if there is nothing to stop it.
We love Scout, but sometimes we need more granular data. If you’re looking at data within the past 5 hours, Scout will show you that data in 5-minute intervals. However, as the amount of time you’re evaluating increases, so does the interval at which Scout displays data. Scanning the data for the past day will show you data at 1-hour intervals. The large interval can make it difficult to investigate issues after they have occurred.
Graphite is an open source graphing application that was originally created at Orbitz (where most of our engineering team spent time). We use Graphite to monitor the most important parts of our system. Graphite lets you build graphs that display any of the data it has collected. You can easily add and remove metrics from the graph, letting you compare and contrast any of your data. The graphs also auto-update, allowing us to watch our system closely while it’s processing a large volume of work.
We have it configured to aggregate data at 1-minute intervals. And since we hold on to the data stored at that interval, we can easily see how the different parts of our system were behaving at any point in time. This has proven invaluable in identifying and eliminating bottlenecks in our system.
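To see why the interval matters, consider a short spike in a queue depth. The Python sketch below uses made-up numbers and plain averaging, which is an assumption: it is not a description of how Scout or Graphite actually consolidate data.

```python
from statistics import mean
from typing import List


def downsample(samples: List[float], bucket_size: int) -> List[float]:
    """Average consecutive samples into buckets of bucket_size points."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# One hour of per-minute queue depths with a 3-minute spike (made-up data).
per_minute = [10.0] * 60
per_minute[30:33] = [400.0, 400.0, 400.0]

five_minute = downsample(per_minute, 5)  # 12 points for the hour
hourly = downsample(per_minute, 60)      # a single point for the hour

print(max(per_minute), max(five_minute), hourly[0])
```

At one-minute resolution the spike shows at its full height of 400. Averaged over five minutes it shrinks to 244, and averaged over the hour it becomes 29.5, which is easy to mistake for normal load.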
While not technically a monitoring tool, PagerDuty plays a very important role in our monitoring solution. It makes sure the engineer on support knows when the shit is hitting the fan. PagerDuty works with several monitoring systems, including Scout and Pingdom. When one of the configured monitoring systems notifies PagerDuty of an alert, PagerDuty will contact the team according to the escalation policy you define.
Since PagerDuty supports rotating on-call schedules, we’ve uploaded our support schedule so PagerDuty always knows who is on support and who is on backup. When an alert comes in, PagerDuty immediately sends an email to the engineer on support. If that email remains unacknowledged after 10 minutes, PagerDuty will send an SMS message to the mobile phone of the engineer on support. If the alert still remains unacknowledged, PagerDuty repeats this process for the engineer on backup support. This entire flow is configurable to meet the needs of the particular team.
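Written out as a timeline, our escalation policy looks roughly like the Python sketch below. It is not PagerDuty's API or configuration format. The function names are hypothetical, and the 10-minute wait between the SMS to the primary engineer and the email to the backup is an assumption, since only the first wait is stated above.

```python
from typing import List, Optional, Tuple

# (minutes after the alert, channel, recipient)
Step = Tuple[int, str, str]


def escalation_plan(primary: str, backup: str, wait_minutes: int = 10) -> List[Step]:
    """Email, then SMS, for the primary engineer; then the same for the backup."""
    plan: List[Step] = []
    minute = 0
    for engineer in (primary, backup):
        plan.append((minute, "email", engineer))
        minute += wait_minutes
        plan.append((minute, "sms", engineer))
        # Assumed: the same wait applies before moving on to the next engineer.
        minute += wait_minutes
    return plan


def notifications_sent(plan: List[Step],
                       acknowledged_at: Optional[int] = None) -> List[Step]:
    """Steps that fire before the alert is acknowledged (all of them if never).

    acknowledged_at is in minutes after the alert and is expected to be positive.
    """
    if acknowledged_at is None:
        return list(plan)
    return [step for step in plan if step[0] < acknowledged_at]


plan = escalation_plan("alice", "bob")  # hypothetical engineer names
print(notifications_sent(plan, acknowledged_at=25))
```

Acknowledging the alert stops the chain, so an engineer who responds to the first email within 10 minutes never triggers the SMS or the backup notifications.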
Airbrake is a service that collects errors generated by other applications, and aggregates them for review. All of our applications use Airbrake to report errors. Before Airbrake, an application error simply resulted in an email to the tech team. This got the job done (alerted us there was a problem), but left much to be desired.
If we had an error in the part of our system that handles incoming or outgoing messages, our inboxes would be flooded with error emails. Airbrake’s duplicate error detection helps us avoid this problem, while still being able to clearly see the number of times a specific error has occurred. Airbrake also supports multiple projects, so all of our error reports show up in project-specific buckets instead of a single one (the developer’s inbox). And, since Airbrake auto-resolves a project’s errors when the project is deployed, we always know which errors are taking place in the current version of our applications, saving us the time of looking into errors that have already been resolved.
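The difference between "one email per error" and grouped errors can be sketched in Python as follows. This is not Airbrake's grouping algorithm, which we have not inspected. Treating the error class and message as the duplicate key is an assumption, and the sample events are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# (project, error class, message)
ErrorEvent = Tuple[str, str, str]


def bucket_by_project(events: Iterable[ErrorEvent]) -> Dict[str, Counter]:
    """Group events per project and count duplicates instead of listing each one."""
    buckets: Dict[str, Counter] = defaultdict(Counter)
    for project, error_class, message in events:
        # Assumed duplicate key: same error class and same message.
        buckets[project][(error_class, message)] += 1
    return dict(buckets)


# Made-up events: a burst of identical errors plus one unrelated error.
events = [("messaging", "TimeoutError", "gateway timed out")] * 500
events.append(("web", "KeyError", "user_id"))

for project, counts in bucket_by_project(events).items():
    for (error_class, message), count in counts.items():
        print(f"{project}: {error_class}: {message} (x{count})")
```

Five hundred identical failures collapse into a single line with a count, while the unrelated error in the other project stays visible instead of being buried in the flood.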
Use something different? Let us know!
Do you use any other tools to help monitor your system? If so, we’d love to hear about it! Please leave a comment stating what you use, and why you use it.