Quoth the runtime, "Segmentation Fault": On taking the time to see what happened

I kind of want to watch the video of the failed Antares launch from last night, for the sake of context for the photo of the launch site that NASA recently published on Google+… but I saw the photos of the explosion on Twitter shortly after it happened, and just kind of shivered. Knowing perfectly well that no one was killed in the accident didn’t help—that was clearly a huge blow for the company. As someone on Twitter said, spaceflight is hard. NASA and Roscosmos have it pretty well sorted at this point, having been in the game longer than anyone else (with full credit due, of course, to CNSA and ISRO), so it really does bear pointing out that both Orbital Sciences and SpaceX have only been putting rockets into orbit since about 1990 and 2008 respectively. It’s not that SpaceX is necessarily doing launches and rockets better than Orbital. Orbital has way more experience. SpaceX just hasn’t had a rocket blow up on the pad yet. Hopefully, that doesn’t happen, but the reality is that there’s fundamentally little difference between a rocket and a bomb.

I’ve been trying to stay on top of the commercial American launches, almost just because the advent of commercial spaceflight is really exciting to me. I’ve seen two Falcon 9 launches so far—the most recent one, and (if I recall correctly) its first mission to the ISS—but I haven’t had the opportunity to see anything from Orbital Sciences and Wallops. Maybe it’s a stronger cultural affinity for Cape Canaveral; as far as I knew until the last year or so, the only launch facility NASA had was in Florida. But every time a Virginia launch is mentioned, I secretly hope that this time it’ll be visible from Toronto. I look at the maps of where the launch will be visible from, at what angle, and I’m always a little disappointed that Toronto is well outside the arc. I took the opportunity to see the launch of STS-135 when my family travelled to Orlando for a Disney World/Universal Studio holiday. Our timing coincided with the launch date, and having never seen a Space Shuttle launch before, my wife, my sister-in-law, and I took my then-six-month-old son from Greater Orlando to Titusville to visit Kennedy Space Center. We didn’t make it Titusville for the launch, but we saw it, pretty clearly, from the roughly twenty miles away that we were when the countdown hit the last few minutes. That was an incredible experience, even from that distance (because, by God, you can hear it), and I’d love to have that experience again.

I’m also interested in the Antares launch, and specifically its failure, from a process engineering perspective. A few people on Twitter and Google+ noted that, as soon as the rocket exploded, Orbital Sciences’ Twitter feed went silent. Reports came in from NASA about the same time that the Orbital mission controllers were giving witness statements and storing the telemetry they’d had from the rocket up until that point. In their business, this is absolutely critical for figuring out what caused the incident, so that it can be avoided in the future. Rockets are expensive, so having all that cash go up in flames is a disaster.

But in technology, we can certainly learn from this. So often, when something goes wrong on a server, particularly a production server, our first response is simply to fix it, and get the website running again. Don’t get me wrong; this is important, too—in an industry where companies can live or die on uptime, getting the broken services fixed as soon as possible is important. But preventing the problem from happening again is equally important, because if you’re constantly fighting fires, you can’t improve your offering. When something goes wrong, and you have the option, your first response should be to remove the broken machine from the load balance. Disconnect it from any message queues that it might be listening to, but otherwise keep the environment untouched, so that you can perform some forensic analysis and discover what went wrong. In addition to redundancy, you also need logging. Oh my good good God, you need logging. Yes, logs take up disk space. That’s what services like logrotate are for—logs take up space, sure, but gzipped logs take up a tenth of that space. And if you haven’t looked at those logs for, let’s say, six months…you probably have a solid enough service that you don’t need them any more. And if, for business reasons, you think you might… you can afford to buy more disks and archive your logs to tape. In the grand scheme of things, disks are cheap, but tape is cheaper.

So, ultimately, what’s the takeaway for the software industry? Log everything you can. Track everything you can. And when the shit hits the fan, stop, and gather information before you do anything. I know cloud computing gives us the option (when we plan it out well) of just dropping a damaged cloud instance on the floor, spinning up a new one, and walking away, but if you do that without even trying to diagnose what went wrong, you’ll never fix it.

Quoth the runtime, "Segmentation Fault"

Wednesday 29 October 2014

On taking the time to see what happened

1 comment:

Post a Comment

Pages

Labels

Older Posts

About Me