Quoth the runtime, "Segmentation Fault": On monitors and error detection

Earlier today, a colleague and I were discussing monitoring tools for Web services. He recently joined our team as a systems administrator, and I was filling him in on a homebrew monitoring service, done in the spirit of Big Brother, that I put together a couple of years ago to cover a gap in our existing monitor’s configuration. He praised its elegance, and we joked a bit about reusing it outside the company. It would need to be completely rebuilt in that case (though it contains no original ideas, being just a merger of Big Brother and Cacti, it remains the intellectual property of $EMPLOYER$), and it isn't clear whether I would even need such a service for Prophecy.

After thinking about it briefly, I realized that Project Seshat will deserve some kind of monitoring once I install it on my server (I guess I’ll just add that to the pile of TODOs). I also remembered that I have a WordPress instance running for the Cu Nim Gliding Club, in Okotoks, Alberta. Surely a production install of WordPress deserves monitoring, to make sure that Cu Nim's visitors can access the site.

So, while waiting at a restaurant for my wife and our dinner guests to arrive, I took to The Internet to look for any existing solutions for monitoring WordPress with, say, Nagios. I may not be familiar with many monitors, but I know enough about Nagios to know that it works well with heartbeats—URIs that indicate the health of a particular aspect of a service.

The first hit I found that wasn’t a plugin for one of the two was a blog entry describing how to manually set up a few monitors for a local WordPress instance. It explained how to configure Nagios to run a few basic service checks: that the host in question can serve HTTP, that it can access the MySQL server, and that WordPress is configured, the last being a single check on the homepage.

To me, this seems woefully incomplete. A single check to see that anything is returned by WordPress, even if you are separately checking on Apache and MySQL, strikes me as being little more than an “allswell” test. Certainly, success of this test can reasonably be taken to indicate good health of the system. Failure, however, could mean any number of things, each of which would need to be investigated to determine what has gone wrong and how urgent the fix is.

When I use a monitoring system, I want it to be able to tell me exactly what went wrong, to the best of its ability. I want it to be able to tell me when things are behaving out of the ordinary. I want it to tell me that, even though the page loaded, it took longer than some threshold that I've set. That would probably warrant a different level of concern and urgency than the page not loading at all, which is all that a single request with a short timeout can report. In short, I want more than just the night watch to call out, “twelve o’clock and all’s well!”
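To make the distinction concrete, here is a minimal sketch in Python of a check that reports a slow page separately from a page that fails to load. It is not taken from any existing plugin: the names `classify` and `check_page` and the default thresholds are my own assumptions, and the only borrowed convention is the Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
import http.client
import time
import urllib.request

# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(loaded, elapsed, warn_after):
    """Map one page-load attempt to a (status, message) pair.

    `warn_after` is a hypothetical threshold in seconds, chosen by the operator.
    """
    if not loaded:
        return CRITICAL, "CRITICAL - page did not load"
    if elapsed > warn_after:
        return WARNING, f"WARNING - page loaded in {elapsed:.2f}s"
    return OK, f"OK - page loaded in {elapsed:.2f}s"


def check_page(url, warn_after=2.0, timeout=10.0):
    """Fetch `url` once and classify the outcome.

    The timeout is deliberately longer than the warning threshold, so that a
    slow page is reported as slow rather than as down.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        loaded = True
    except (OSError, http.client.HTTPException):
        # Covers connection failures, timeouts, and HTTP error statuses,
        # which urlopen raises as HTTPError (a subclass of OSError).
        loaded = False
    elapsed = time.monotonic() - start
    return classify(loaded, elapsed, warn_after)
```

A wrapper script would print the message and exit with the status code, which is how a Nagios-style monitor would tell "slow" apart from "down". Keeping `classify` separate from the network call also means the thresholds can be exercised without a live site.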

The options available to accomplish this goal are myriad. First of all, yes, I want something in place to monitor the WordPress instance. But for original products, like Project Seshat, I would definitely like something not just more robust, but also more automatic. Project Alchemy is intended to create an audit trail for all edits without having to specifically issue calls to auditing methods in the controllers, and I would like monitoring to work the same way. I’d love to take a page from JavaMelody and create an aspect-oriented monitoring solution that can report request timing, method timing, and errors per request, and perhaps even send out notifications the first time an error of a particular severity occurs, rather than polling regularly to gather data the way Big Brother does.
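As a rough illustration of the idea, and not of how JavaMelody or any planned Seshat monitor works, here is a small Python sketch in which a decorator stands in for an aspect. The `Monitor` class and its `watch` method are hypothetical names, and the exception type is used as a stand-in for severity, which is an assumption on my part.

```python
import functools
import time
from collections import defaultdict


class Monitor:
    """Collects call timings and error counts, notifying once per new error."""

    def __init__(self, notify=print):
        self.notify = notify              # callable that receives a message string
        self.timings = defaultdict(list)  # function name -> durations in seconds
        self.errors = defaultdict(int)    # (function name, exception type) -> count

    def watch(self, func):
        """Wrap `func` so every call is timed and every exception is counted."""
        name = func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                key = (name, type(exc).__name__)
                self.errors[key] += 1
                if self.errors[key] == 1:
                    # Push a notification on the first occurrence only,
                    # instead of waiting for a poller to notice.
                    self.notify(f"first {key[1]} in {name}: {exc}")
                raise  # the caller still sees the original error
            finally:
                self.timings[name].append(time.perf_counter() - start)

        return wrapper
```

Decorating a controller method with `@monitor.watch` records its timing and errors without any monitoring calls inside the method body, which is the same property the audit trail is meant to have.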

Don’t get me wrong, it’s probably a huge undertaking. I don’t expect to launch Project Seshat with such a system in place (as much as I’d love to). But it’s certainly food for thought for what to work on next. And when Seshat does launch, I will want to have a few basic checks to make sure that it hasn’t completely fallen over. After all, so far, I’ve been adhering to the principle of “make it work, then make it pretty.” May as well keep it up.


Saturday, 27 October 2012
