TCS Daily

'Normalization of Deviance'

By Jeffrey Goldader - February 10, 2003 12:00 AM

"An Accident Rooted in History," was the title of Chapter VI of the Rogers Commission report on the Space Shuttle Challenger disaster. Originally meant to refer to the clues in past incidents involving the fragile O-ring seals in space shuttle booster rockets, the title was truer in a broader sense than the Commission originally recognized. It may haunt us still in our latest loss.

Following the Challenger investigation, Diane Vaughan, a sociologist at Boston College, was fascinated by the interaction between engineers and manager-engineers at NASA and its contractors. She wished to understand the patterns of those interactions, expecting to find that managers had knowingly broken safety rules in order to foster immediate program success. Instead, Vaughan found something far more troubling: she discovered that the decision to launch in very cold weather, which doomed the Challenger crew, was made in accord with engineering practices. She published her work in 1996 as the book "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA" (University of Chicago Press). Her findings are relevant not only to NASA and the shuttle, but to all high-technology programs.

Fifty years ago, determining risks and performance limits of an aircraft was quite difficult. Given the available technology, engineers had only a limited ability to determine (for example) the flying performance of a new aircraft before the prototypes actually flew. Extensive flight-testing was needed to determine actual performance and uncover design flaws. This often resulted in a damaged or destroyed test aircraft and dead pilots, particularly in the case of high-performance military aircraft.

Changes in Design, Testing, Risk Assessment

Beginning in the 1960s and continuing to the present day, a number of changes have altered the way increasingly complex machines (aircraft in our example) are designed, tested, and assessed for risk.

First, advances in computational techniques are allowing engineers to perform ever more high-fidelity simulations, uncovering flaws during the design stage.

Second, the high costs of complex systems mean that the old ways of testing until the machine fails can no longer be allowed. Instead, engineers compute approximate limits of the performance envelope, and then validate those computations via comparison with data from real situations.

Finally, interactions between components in many systems are simply so complex that they cannot possibly be fully tested on paper, in computers, or even during developmental testing. A significant component of knowledge about complex systems must be found during normal operation. The combination of performance data from calculations, controlled testing, and real-world situations makes up the "engineering database," from which the performance of the system under a wide range of conditions can be directly determined, or at least extrapolated. As Vaughan discovered, this is how the space shuttle is operated.

The genesis of the Challenger disaster was the failure of O-ring pressure seals in the solid rocket motors that help power the shuttle during the first two minutes of flight. Two phenomena should never have occurred in the joints: blow-by, the escape of hot gases past the O-rings, and erosion, the actual burning of the O-rings by those gases. However, engineers discovered that both were in fact occurring as early as the second flight of the shuttle program.

'Normalization of Deviance'

This is where Vaughan's study exposes the real failure behind the Challenger disaster, and sounds a warning to all of us involved in today's complex technologies: the "normalization of deviance." Although the joint was not behaving as expected, the results were not catastrophic. The observed behavior modified the engineers' understanding of the "expected" behavior of the joint: a significant part of the database consisted of operational experience that blow-by and erosion did occur, but the system was sufficiently robust that it withstood the behavior. Over time, the deviant behavior of the joint became expected, normal (if not desired) behavior.

As history tells, Challenger was launched at the coldest temperatures the program had ever experienced. The O-rings were so cold that they lost their elasticity and failed to seat properly; catastrophic erosion and blow-by occurred; and the O-rings failed to seal the joint. Although the joint had repeatedly failed, in dangerous ways, to behave as expected, the previous instances of erosion and blow-by had not led the engineers to conclude that their design was faulty. Instead, they had come to accept the unanticipated behavior as both normal and not imminently catastrophic, and were unable to prevent what seems in retrospect an inevitable tragedy.

Now we have lost another shuttle and seven more precious lives. The cause is not yet known, but much speculation centers on the possibility that insulating foam broke off the external fuel tank during launch, striking the shuttle's wing and damaging some of the heat-resistant tiles critical for atmospheric re-entry. During press briefings in the days immediately following the loss of Columbia, Space Shuttle Program Manager Ron Dittemore (who should be lauded as an example of courage and accountability) acknowledged the history of damaged tiles due to foam impacts. However, he insisted that the "database" included many impacts, and that the shuttles had all survived the incidents even when tile damage had been fairly widespread. An analysis of the debris hit on Columbia, conducted by engineers during the flight using what was known from previous impacts, concluded the damage should have been survivable.

Again, Vaughan's thesis is there: the unexpected became the expected, and the expected became the accepted. Foam should not fall off the tank, as it could be dangerous to the shuttle. But foam was indeed falling off - sometimes big pieces, sometimes small, from various places. Though it hit the shuttle, the damage to the tiles was not enough to cause a catastrophe. The head of the NASA Marshall Space Flight Center was quoted by the Associated Press as saying engineers were "comfortable" with the amount of foam coming off the tanks, and that they did not consider it a fundamental design flaw. Seen through that lens, the loss of Columbia seems eerily similar to the loss of her sister ship Challenger.

This second shuttle loss, though, and the potential connection with the systemic failures that enabled the first loss, should be a strong caution for all of us working with complex technologies. We cannot possibly completely test and debug aircraft, buildings, factories, refineries, trains, even rockets, before they are in actual use. But we must make sure that when the systems behave unexpectedly, in ways that could lead to disasters, we pay attention and correct the underlying defects. For high-performance systems in particular, where even normal operation stresses the systems very near their tolerance limits, we cannot allow ourselves to accept dangerously deviant behavior as normal. It is too costly a lesson to relearn.

Jeffrey Goldader received his PhD in Astronomy in 1995 from the University of Hawaii. Since 1998, he has been a Lecturer in Astronomy at the University of Pennsylvania.
