Sunday, 24 January 2010

In which software that wasn't properly engineered is shown to have killed people... again.

My wife showed me an article in today’s online New York Times (yesterday’s print edition) that got my blood boiling: a radiation therapy machine manufactured and programmed by Varian Medical Systems wound up massively overdosing and killing two patients in 2005, just as Therac-25 did back in the eighties. The bad part is that Therac-25 is a standard case study for CS students throughout the continent (and probably worldwide), so this shouldn’t have happened in the first place. The worst part is that Varian Medical Systems said it was pilot error. As a software developer and a computer scientist, I find incidents like these particularly poignant, as they demonstrate the urgent need for a cultural sea change within the industry.

A quick backgrounder: Therac-25 was involved in six known massive radiation overdoses, at various sites, that killed at least three patients. A careful study of the machines and their operating circumstances ultimately revealed that the control software was designed so poorly that it permitted almost precisely the same failure outlined in the Times’ article. In Therac-25’s case, the high-powered beam could be activated without a beam spreader in place to disperse the radiation, much as the Varian Medical Systems machine allowed the collimator to be left wide open during therapy. Furthermore, the Therac-25 user interface, like the Varian interface, sometimes showed the operator a different configuration from the one actually active in the machine, as the article mentions: “when the computer kept crashing, …the medical physicist did not realize that her instructions for the collimator had not been saved.” The same issue occurred with Therac-25, where operator instructions sometimes never arrived at the therapy machine. Finally, both interfaces allowed potentially lethal configurations to be activated without presenting any unmistakable warning of the danger to the operator.
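To make these failure modes concrete, here is a minimal sketch, in Python, of the kind of closed-loop interlock check that both designs apparently lacked. Everything in it is hypothetical: the names (MachineState, verify_safe_to_fire), the idea of reading the actual state back from the hardware, and the aperture limit are my own illustrative assumptions, not details of either vendor’s system.

```python
from dataclasses import dataclass
from enum import Enum

MAX_APERTURE_MM = 40.0  # illustrative limit only, not a real clinical value


class BeamMode(Enum):
    LOW_POWER = 1
    HIGH_POWER = 2


@dataclass(frozen=True)
class MachineState:
    beam_mode: BeamMode
    spreader_in_place: bool
    collimator_aperture_mm: float


class InterlockError(Exception):
    """Raised when the beam must not be enabled."""


def verify_safe_to_fire(commanded: MachineState, actual: MachineState) -> None:
    """Refuse to enable the beam unless every interlock is satisfied.

    `actual` must be read back from the hardware itself, never taken
    from the user interface's copy of the state; the UI/machine
    mismatch is exactly what both accidents had in common.
    """
    if actual != commanded:
        raise InterlockError("machine state does not match operator's instructions")
    if actual.beam_mode is BeamMode.HIGH_POWER and not actual.spreader_in_place:
        raise InterlockError("high-power beam requested without beam spreader in place")
    if actual.collimator_aperture_mm > MAX_APERTURE_MM:
        raise InterlockError(
            f"collimator aperture {actual.collimator_aperture_mm} mm exceeds safe limit"
        )
```

The point of the sketch is not the specific checks but the shape of the design: the software treats the operator’s instructions as a request, verifies the hardware’s actual state independently, and fails safe, refusing to fire, whenever the two disagree.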

The Therac-25 incident quickly became a standard case study in computer science and software engineering programs throughout Canadian and American universities, so it is shocking that this problem was able to happen again, and more shocking still that Varian Medical Systems and the hospitals in question deflected the blame onto the operators. The fact of the matter is that Varian’s machine can maim or even kill people when it fails to operate correctly. In such a system there is simply no room for operator error; the design itself must safeguard against it. Unfortunately, in shifting the blame, Varian Medical Systems has denied their responsibility to prevent untold future tragedies.
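What does safeguarding against operator error look like in practice? The crash described in the article, where the physicist’s collimator instructions were silently lost, is a storage-layer failure with a well-known defence: write the new plan atomically, then read it back and verify it before treatment can proceed. Here is a minimal sketch under assumed details; save_plan, the JSON format, and the file layout are all hypothetical, not anything from Varian’s actual system.

```python
import json
import os
import tempfile


def save_plan(path: str, plan: dict) -> None:
    """Persist the operator's treatment plan atomically, then verify it.

    Hypothetical sketch: the goal is that a crash mid-save leaves the
    previous plan intact, and a silent write failure is caught before
    therapy begins rather than discovered as an overdose.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over
    # the real file: os.replace() is atomic on POSIX systems, so the file
    # on disk is always either the complete old plan or the complete new one.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(plan, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    # Read back and compare: never assume the save succeeded.
    with open(path) as f:
        if json.load(f) != plan:
            raise IOError("saved plan does not match the operator's instructions")
```

In a real system the verification would extend end to end, all the way to the machine acknowledging the settings it will actually use; but even this much would have turned the crash in the article into a refused treatment instead of a fatal overdose.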

Had a professional engineer been overseeing the machine’s design, these deaths could have been prevented. Unfortunately, most professional engineering societies do not yet recognise the discipline of software engineering, primarily because it is exceedingly difficult to define exactly what software engineering entails. Yet for decades, software systems have occupied positions where life and death hang on their proper operation, and in these cases it is critical that professional engineers be involved in their design. These tragedies underscore the need for engineering societies to begin licensing and regulating the proper engineering of such software. By comparison, an aerospace engineer certifies that an airplane’s design is sound, and when an airframe fails, those certified plans typically reveal that the construction of (or more often, later repairs to) the plane did not adhere to the design. Correspondingly, in computer-controlled systems that can fail catastrophically, as Varian’s has, it is imperative that a professional engineer certify that the design—and in a computer program’s case, the implementation—is sound.

Varian Medical Systems’ response—merely to send warning letters reminding “users to be especially careful when using their equipment”—is appallingly insufficient. Varian Medical Systems is responsible for these injuries and deaths, due to their software’s faulty design and implementation, and I urge them to admit their fault. I recognise that it would be bad for their business, but it is their business practices that have cost lives and livelihoods. I think the least they could do is offer a mea culpa with a clear plan for how they will redesign their systems to prevent such incidents in the future.

The IEEE, the ACM, and professional engineering societies need to sit up and take notice of incidents such as this. That they are still happening, even with the careful studies that have been performed of similar tragedies, is undeniable proof that software engineers are necessary in our ever more technologically dependent society, and that software companies must, without exception, be willing to accept the blame when their poor designs cause such incidents. Medical therapy technology must be properly engineered, or we will certainly see history continue to repeat itself.

If this reads like a submission to the Op-Ed page, there’s a very good reason for that! It was, but they decided not to publish it. Oh well.
