
Wednesday, 29 October 2014

On taking the time to see what happened

I kind of want to watch the video of the failed Antares launch from last night, for the sake of context for the photo of the launch site that NASA recently published on Google+… but I saw the photos of the explosion on Twitter shortly after it happened, and just kind of shivered. Knowing perfectly well that no one was killed in the accident didn’t help—that was clearly a huge blow for the company. As someone on Twitter said, spaceflight is hard. NASA and Roscosmos have it pretty well sorted at this point, having been in the game longer than anyone else (with full credit due, of course, to CNSA and ISRO), so it really does bear pointing out that both Orbital Sciences and SpaceX have only been putting rockets into orbit since about 1990 and 2008 respectively. It’s not that SpaceX is necessarily doing launches and rockets better than Orbital. Orbital has way more experience. SpaceX just hasn’t had a rocket blow up on the pad yet. Hopefully, that doesn’t happen, but the reality is that there’s fundamentally little difference between a rocket and a bomb.

I’ve been trying to stay on top of the commercial American launches, almost just because the advent of commercial spaceflight is really exciting to me. I’ve seen two Falcon 9 launches so far—the most recent one, and (if I recall correctly) its first mission to the ISS—but I haven’t had the opportunity to see anything from Orbital Sciences and Wallops. Maybe it’s a stronger cultural affinity for Cape Canaveral; as far as I knew until the last year or so, the only launch facility NASA had was in Florida. But every time a Virginia launch is mentioned, I secretly hope that this time it’ll be visible from Toronto. I look at the maps of where the launch will be visible from, and at what angle, and I’m always a little disappointed that Toronto is well outside the arc. I took the opportunity to see the launch of STS-135 when my family travelled to Orlando for a Disney World/Universal Studios holiday. Our timing coincided with the launch date, and having never seen a Space Shuttle launch before, my wife, my sister-in-law, and I took my then-six-month-old son from Greater Orlando to Titusville to visit Kennedy Space Center. We didn’t make it to Titusville for the launch, but we saw it, pretty clearly, from the roughly twenty miles away that we were when the countdown hit the last few minutes. That was an incredible experience, even from that distance (because, by God, you can hear it), and I’d love to have that experience again.

I’m also interested in the Antares launch, and specifically its failure, from a process engineering perspective. A few people on Twitter and Google+ noted that, as soon as the rocket exploded, Orbital Sciences’ Twitter feed went silent. Around the same time, reports came in from NASA that the Orbital mission controllers were giving witness statements and preserving the telemetry they’d received from the rocket up until that point. In their business, this is absolutely critical for figuring out what caused the incident, so that it can be avoided in the future. Rockets are expensive, so having all that cash go up in flames is a disaster.

But in technology, we can certainly learn from this. So often, when something goes wrong on a server, particularly a production server, our first response is simply to fix it and get the website running again. Don’t get me wrong; this matters too—in an industry where companies can live or die on uptime, getting broken services fixed as soon as possible is important. But preventing the problem from happening again is equally important, because if you’re constantly fighting fires, you can’t improve your offering. When something goes wrong, and you have the option, your first response should be to remove the broken machine from the load balancer. Disconnect it from any message queues it might be listening to, but otherwise keep the environment untouched, so that you can perform some forensic analysis and discover what went wrong.

In addition to redundancy, you also need logging. Oh my good good God, you need logging. Yes, logs take up disk space. That’s what services like logrotate are for—logs take up space, sure, but gzipped logs take up a tenth of that space. And if you haven’t looked at those logs for, let’s say, six months… you probably have a solid enough service that you don’t need them any more. And if, for business reasons, you think you might… you can afford to buy more disks and archive your logs to tape. In the grand scheme of things, disks are cheap, but tape is cheaper.
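The retention half of that really is just a few lines of configuration. Here’s a minimal logrotate sketch along the lines of the argument above; the path and the numbers are placeholders, not anything from a real setup of mine:

    # /etc/logrotate.d/mywebapp  (hypothetical application logs)
    # Weekly rotation, roughly six months of history, gzipped after the first week.
    /var/log/mywebapp/*.log {
        weekly
        rotate 26
        compress
        delaycompress
        missingok
        notifempty
    }

Drop that into /etc/logrotate.d and the “logs eat my disk” excuse mostly evaporates.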

So, ultimately, what’s the takeaway for the software industry? Log everything you can. Track everything you can. And when the shit hits the fan, stop, and gather information before you do anything. I know cloud computing gives us the option (when we plan it out well) of just dropping a damaged cloud instance on the floor, spinning up a new one, and walking away, but if you do that without even trying to diagnose what went wrong, you’ll never fix it.
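The middle ground is to quarantine rather than destroy. If the sick machine sits behind something like nginx, taking it out of rotation while leaving it running for inspection is a one-word change; a minimal sketch, with a made-up pool name and addresses:

    # hypothetical nginx upstream pool
    upstream app_servers {
        server 10.0.0.11:8080;
        server 10.0.0.12:8080 down;   # the broken machine: no more traffic,
                                      # but still running, logs and all, for forensics
        server 10.0.0.13:8080;
    }

Reload nginx and that box stops taking requests, but everything on it stays exactly as it was when things went wrong.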

Saturday, 4 October 2014

A lesson learned the very hard way

Two nights ago, I took a quick look at a website I run with a few friends. It’s a sort of book recommendation site, where you describe some problem you’re facing in your life, and we recommend a book to help you through it. It’s fun to try to find just the right book for someone else, and it really makes you consider what you keep on your shelves.

But alas, it wasn’t responding well—the images were all fouled up, and when I tried to open up a particular article, the content was replaced by the text "GEISHA format" over and over again. So now I’m worried. Back to the homepage, and the entire thing—markup and everything—has been replaced by this text.

First things first: has anyone else ever heard of this attack? I can’t find a thing about it on Google, other than five or six other sites that were hit by it when Googlebot indexed them, one of them at least a year ago.

So anyway, I tried to SSH in, with no response. Pop onto my service provider’s control panel to access the console (much as I wish I had the machine colocated, or even physically present in my home, I just can’t afford the hardware and the bandwidth fees), and that isn’t looking good, either.

All right, restart the server.

Now HTTP has gone completely nonresponsive. And when I access the console, it’s booted into initramfs instead of a normal Linux login. This thing is hosed. So I click the “Rescue Mode” button on my control panel, but it just falls into an error state. I can’t even rescue the thing. At this point, I’m assuming I’ve been shellshocked.

Very well. Open a ticket with support, describing my symptoms, asking if there’s any hope of getting my data back. I’m assuming, at this point, the filesystem’s been shredded. But late the next morning, I hear back. They’re able to access Rescue Mode, but the filesystem can’t fsck properly. Not feeling especially hopeful, I switch on Rescue Mode and log in.

And everything’s there. My Maildirs, my Subversion repositories, and all the sites I was hosting. Holy shit!

I promptly copied all that important stuff down to my personal computer, over the course of a few hours, and allowed Rescue Mode to end, and the machine to restart into its broken state. All right, I think, this is my cosmic punishment for not upgrading the machine from Ubuntu Hardy LTS, and not keeping the security packages up to date. Reinstall a new OS, with the latest version of Ubuntu they offer, and keep the bastard thing up to date.

Except that it doesn’t quite work that well. On trying to rebuild the new OS image… it goes into an error state again.

Well and truly hosed.

I spun up a new machine, in a new DC, and I’m in the process of reinstalling all the software packages and restoring the databases. Subversion’s staying out; this is definitely the straw that broke the camel’s back in terms of moving my personal projects to Git. Mail comes last, because setting up mail is such a pain in the ass.
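For what it’s worth, the Subversion-to-Git move itself is the easy part; git-svn will carry the history across. Roughly like this, where the URLs are placeholders and --stdlayout assumes the usual trunk/branches/tags structure:

    # one-time import of an old Subversion project, history included
    git svn clone --stdlayout https://svn.example.org/repos/myproject myproject
    cd myproject
    git remote add origin git@git.example.org:me/myproject.git
    git push -u origin master

The painful part is everything else around it, not the repositories.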

And monitoring this time! And backups! Oh my God.

Let this be a lesson: if you aren’t monitoring it, you aren’t running a service. Keep backups (and, if you have the infrastructure, periodically test restoring from them). And keep your servers up to date. Especially security updates!
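Even a dumb nightly backup would have beaten what I had, which was nothing. A sketch of the sort of thing I mean, with made-up paths and hostnames, and assuming MySQL credentials in /root/.my.cnf and key-based SSH to the backup host:

    # /etc/cron.d/nightly-backup  (hypothetical)
    # Dump the databases, then ship configs, sites, and dumps to another machine.
    45 2 * * * root mysqldump --all-databases | gzip > /var/backups/mysql/all-$(date +\%F).sql.gz
    15 3 * * * root tar czf - /etc /var/www /var/backups/mysql | ssh backup@backup.example.org "cat > backups/web-$(date +\%F).tar.gz"

And the parenthetical above matters: a backup you’ve never actually restored from is a hope, not a backup.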

And, might I add, many many thanks to the Rackspace customer support team. They really saved my bacon here.

Saturday, 27 October 2012

On monitors and error detection

Earlier today, a colleague and I were discussing monitoring tools for Web services. He recently joined our team as a systems administrator, and I was filling him in on a homebrew monitoring service I put together a couple of years ago, in the spirit of Big Brother, to cover a gap in our existing monitor’s configuration. He praised its elegance, and we joked a bit about reusing it outside the company, about the fact that it would need to be completely rebuilt in that case (it isn’t built from original ideas, just a merger of Big Brother and Cacti, but it remains the intellectual property of $EMPLOYER$), and about whether I would even need such a service for Prophecy.

After thinking about it briefly, I realized that Project Seshat will deserve some kind of monitoring once I install it on my server—I guess I’ll just add that to the pile of TODOs—and I remembered that I also have a WordPress instance running for the Cu Nim Gliding Club, in Okotoks, Alberta. Surely a production install of WordPress deserves monitoring, in order to make sure that Cu Nim’s visitors can access the site.

So, while waiting at a restaurant for my wife and our dinner guests to arrive, I took to The Internet to look for any existing solutions for monitoring WordPress with, say, Nagios. I may not be familiar with many monitors, but I know enough about Nagios to know that it works well with heartbeats—URIs that indicate the health of a particular aspect of a service.

The first hit I found that wasn’t a plugin for one of the two was a blog entry describing how to manually set up a few monitors for a local WordPress instance. It explained how to configure Nagios to run a few basic service checks: that the host in question can serve HTTP, that it can access the MySQL server, and that WordPress is configured (a single check on the homepage).
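If I’m reading the standard plugins right, those three checks boil down to something like the following at the command line, which the Nagios service definitions would then just wrap (hostnames, credentials, and the expected string are placeholders):

    # run from the Nagios plugins directory, e.g. /usr/lib/nagios/plugins
    ./check_http -H www.example.org                        # Apache answers at all
    ./check_mysql -H db.example.org -u nagios -p secret    # MySQL is reachable
    ./check_http -H www.example.org -u / -s "Cu Nim"       # the homepage actually renders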

To me, this seems woefully incomplete. A single check to see that anything is returned by WordPress, even if you are separately checking on Apache and MySQL, strikes me as little more than an “all’s well” test. Certainly, success of this test can reasonably be taken to indicate good health of the system, but failure could mean any number of things, each of which would need to be investigated to determine what has gone wrong and how urgent the fix is.

When I use a monitoring system, I want it to be able to tell me exactly what went wrong, to the best of its ability. I want it to be able to tell me when things are behaving out of the ordinary. I want it to tell me that, even though the page loaded, it took longer than some threshold that I've set (which would probably warrant a different level of concern and urgency than the page not loading at all, which would be the case with a single request having a short timeout). In short, I want more than just the night watch to call out, “twelve o’clock and all’s well!”.
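Even the stock check_http plugin can get partway there; a response-time threshold and a content match are just extra flags (again, the hostname and expected string are placeholders):

    # warn if the homepage takes more than 2 seconds, go critical past 5 seconds,
    # and fail outright if the expected content never appears
    ./check_http -H www.example.org -u / -s "Cu Nim" -w 2 -c 5

That single check already distinguishes “slow” from “down”, which is most of what I’m asking for above.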

The options that I could take to accomplish this goal are myriad. First of all, yes, I want something in place to monitor the WordPress instance. But for original products, like Project Seshat, I would definitely like something not just more robust, but also more automatic. Project Alchemy is intended to create an audit trail for all edits without having to specifically issue calls to auditing methods in the controllers. I’d love to take a page from JavaMelody and create an aspect-oriented monitoring solution that can report request timing, method timing, errors per request, and perhaps even send out notifications the first time an error of a particular severity occurs, instead of the way Big Brother does it, where it polls regularly to gather data.
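To sketch the idea in Python, purely for illustration (none of these projects is tied to any particular language here): a decorator, which is about the poor man’s version of an aspect, wrapped around each controller method gets most of the way to request timing, method timing, and error counts without the controllers ever calling the monitoring code themselves.

    import functools
    import logging
    import time

    log = logging.getLogger("monitor")

    def monitored(func):
        """Report timing and errors for every call, without the wrapped
        code ever having to invoke the monitoring machinery itself."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            except Exception:
                # a first-occurrence notification for a given severity could hook in here
                log.exception("error in %s", func.__name__)
                raise
            finally:
                log.info("%s took %.3fs", func.__name__, time.monotonic() - start)
        return wrapper

    @monitored
    def show_entry(entry_id):
        ...  # the actual controller logic; it never has to know it’s being measured

Aggregate those log lines somewhere and you have the event-driven half of the picture that Big Brother’s regular polling can’t give you.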

Don’t get me wrong, it’s probably a huge undertaking. I don’t expect to launch Project Seshat with such a system in place (as much as I’d love to). But it’s certainly food for thought for what to work on next. And when Seshat does launch, I will want to have a few basic checks to make sure that it hasn’t completely fallen over. After all, so far, I’ve been adhering to the principle of “make it work, then make it pretty.” May as well keep it up.