Robustness

This entry is part 2 of 44 in the series Words

Robust. My dictionary defines it as

ro·bust, adj.

1. strong and healthy; hardy; vigorous: a robust young man; a robust faith; a robust mind.

2. strongly or stoutly built: his robust frame.

3. suited to or requiring bodily strength or endurance: robust exercise.

4. rough, rude, or boisterous: robust drinkers and dancers.

5. rich and full-bodied: the robust flavor of freshly brewed coffee.

For ED systems (including computer systems), what does “robust” mean?

Many ED computer systems are “rude” (in an interpersonal relations sense, that is; see the works of Alan Cooper such as The Inmates are Running the Asylum). But that’s not the point of this essay. Instead, let’s concentrate on the term “hardy.” If I look this up in my dictionary, I find a definition that reads

capable of enduring fatigue, hardship, exposure, etc.; sturdy; strong: hardy explorers of northern Canada.

The idea of graceful degradation under pressure was the topic of another essay, on brittleness. Robustness is in some sense the opposite of brittleness, but it is a bit different: a robust system can take multiple hits and keep on going. Think John Wayne, Rambo.

ED systems can get hit from multiple directions. Let’s consider some of them.

Power outage

Most hospitals have provisions for power outages, usually a generator system that kicks in quickly, with a backup generator to the backup generators. But we saw during Katrina that generators in the basement don’t work during floods, and that generators and fuel supplies designed to power the hospital for a day don’t work on the second or third day. (An interesting disaster-preparation sidebar: a short train with a couple of Diesel freight locomotives and a few tank cars of Diesel fuel and some power cables makes a dandy, mobile power plant, capable of powering a large town.)

And at my own Mercy Hospital in Pittsburgh, we once found that both the backup generator and the backup generator for the backup generator can fail at the same time. Though the ED nurses make jokes about all the search and rescue stuff I keep in my truck, that day they were happy that I could run out and grab several headlights. See also my article on emergency lighting for hospital disasters.

Even if the generators take over, there is a momentary loss of power that can crash computers. Most servers have a UPS or standby power supply to cover this brief outage, but do all of the computers in the ED have a UPS? What happens when the ED is busy and every PC in the ED reboots itself? Several times in a row? Just as there is an influx of patients?

Server crash

A recurrent thread in discussions of ED computer systems is “downtime.” It seems that some hospital-wide computer system vendors think that the hospital is basically shut down late at night and nobody minds if they take the system down for an hour or so. Those in the ED beg to differ, often in vehement terms.

An essential selling point for any ED computer system is “no planned downtime.” Still, unplanned downtime (seems oh so civilized compared to “crash”) does happen. If the server is down, can the individual PCs continue to operate in some local mode, caching some information until the server is up again?

Why don’t our systems allow people to complete a patient chart on the isolated PC, and then upload it once the server is up again? Yes, our operating systems don’t support this yet, but if we’re serious about IT disaster preparation, this should be a goal.
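
To make the idea concrete, here is a minimal sketch of "chart locally, upload later," written in Python. It is an illustration under stated assumptions, not how any existing ED system works: the ChartOutbox class, its folder of JSON files, and the upload function are all hypothetical names invented for this example, and the sketch assumes the upload function raises ConnectionError when the server cannot be reached.

```python
import json
from pathlib import Path


class ChartOutbox:
    """Hypothetical local outbox: charts are saved on the PC first, uploaded later."""

    def __init__(self, folder, upload):
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)
        # upload is an assumed callable(chart: dict) that raises
        # ConnectionError while the server is down.
        self.upload = upload

    def save(self, chart_id, chart):
        # Always write to local disk first, so a server outage never loses a chart.
        path = self.folder / f"{chart_id}.json"
        path.write_text(json.dumps(chart))
        return path

    def flush(self):
        """Try to upload every pending chart; return the ids that were sent."""
        sent = []
        for path in sorted(self.folder.glob("*.json")):
            chart = json.loads(path.read_text())
            try:
                self.upload(chart)
            except ConnectionError:
                break  # server still down: keep the remaining charts for later
            path.unlink()  # delete the local copy only after a successful upload
            sent.append(path.stem)
        return sent
```

The PC would call save() every time a chart is completed and flush() on a timer. While the server is down, flush() sends nothing and the charts simply wait on the local disk.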

Patient tracking usually relies on communications between individual PCs and the server. But – again, thinking in disaster mode – why not have a server-failure mode where the PCs in the ED can work as a mesh and still communicate tracking information with each other to keep some basic tracking going? Again, not supported by current operating systems and applications, but a good goal for the future.

This also brings up the question of diversity as a means to robustness, which I will discuss in an upcoming essay.

Single-point failure and redundancy

In terms of general systems theory, as defined in the works of Gerald Weinberg and others, a robust system is one in which you can break any single part and the system continues to function. This is a characteristic of many biological and ecological systems, driven by Darwinian evolution, and a particular interest of many generalists; see, for example, an overview of the research goals of the Santa Fe Institute.

Redundancy is one route to robustness. For example, many EDs use printers to print discharge instructions – and, in the 24/7/365 ED environment, these printers often get used much more heavily than their design specifications allow. They sometimes start jamming, and this jams up the ED: patients stand around clamoring for their discharge instructions, and ED doctors and nurses play printer service tech instead of doing patient care.

The most elegant solution to such problems is to have a single large hardware button in the ED that says “switch to backup printer.” One presses this button, and a backup printer, one that is already plugged in, full of paper, and tested on a regular basis, starts printing discharge instructions. Reproducing such a function in software would be better than nothing, but as we learn from the Pen-Ivory experiments, positional memory is everything, and positional memory for “the big green button above the printer” would be outstanding.
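
For the "better than nothing" software version, a minimal sketch in Python might look like the following. Everything here is hypothetical: the PrinterSwitch class and its method names are invented for illustration, and each printer is assumed to be nothing more than a function that accepts the text to print.

```python
class PrinterSwitch:
    """Hypothetical software stand-in for the 'switch to backup printer' button."""

    def __init__(self, primary, backup):
        # primary and backup are assumed callables that take the text to print.
        self.primary = primary
        self.backup = backup
        self.use_backup = False

    def press_backup_button(self):
        # One action, no menus: from now on every job goes to the backup printer.
        self.use_backup = True

    def reset_to_primary(self):
        # Called once the primary printer has been repaired.
        self.use_backup = False

    def print_document(self, text):
        printer = self.backup if self.use_backup else self.primary
        return printer(text)
```

The design choice worth noting is that the switch is a single piece of state that stays set until someone deliberately resets it, just as a hardware button would. Nobody has to pick a printer for each individual print job.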

Mean time between failures

A standard measure of hardware reliability is MTBF (mean time between failures). And, in a 24/7/365 ED context, even “planned” downtime is a failure of the system to meet users’ needs. Or, if the system depends on a printer to print discharge instructions, and the printer jams, that’s a failure too. I think that when we rate medical systems, especially ED systems, we need to actually measure and compare MTBF.
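
Measuring MTBF is not hard once failures are logged. In its usual form, MTBF is the total time the system was working divided by the number of failures. Here is a minimal sketch in Python; the mtbf function and the shape of the outage log (a list of start and end times) are assumptions made for this example, and the log is assumed to include planned downtime and printer jams as well as crashes.

```python
from datetime import datetime, timedelta


def mtbf(window_start, window_end, outages):
    """Mean time between failures over an observation window.

    outages is an assumed list of (start, end) datetime pairs, one per failure,
    all falling inside the window. Returns a timedelta, or None if there were
    no failures to average over.
    """
    if not outages:
        return None
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(outages)
```

For example, a 30-day window (720 hours) with three one-hour outages leaves 717 hours of uptime, for an MTBF of 239 hours.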

If we really want to be persnickety, we can look at what Jakob Nielsen has said about response times. By that standard, any time the system takes more than a second to respond (being generous here), the system has failed. I know some ED systems (Cerner Firstnet at one particular hospital, Pyramis ECG Management Software at another hospital) that have an MTBF measured in seconds (I exaggerate only slightly). Yes, this may be due to the implementation at that particular hospital – but it’s a failure nonetheless.
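
Counting slow responses as failures is equally easy to sketch. The Python function below is hypothetical, as is its input (a list of measured response times in seconds); the one-second limit is the generous threshold used above, and a response of exactly one second is not counted as a failure.

```python
def slow_response_rate(response_times, limit_seconds=1.0):
    """Fraction of responses that took longer than the limit."""
    if not response_times:
        return 0.0
    slow = sum(1 for t in response_times if t > limit_seconds)
    return slow / len(response_times)
```

Given measured times of 0.2, 0.5, 1.5 and 3.0 seconds, two of the four responses exceed one second, so the rate is 0.5.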

Next time: diversity.


This entry was posted by kconover on Tuesday, April 1st, 2008 at 4:46 pm and is filed under Disaster, Tutorials.
