Continuing our anniversary retrospective of content we've created during the past 15 years, this time we offer up, as it originally appeared in May 2004, an article by a renowned industrial safety expert that cautioned about the steadily growing dependency that control systems had on software and, as a result, why it was imperative that we recognize the need to pay a lot more attention to software reliability. Eight years later, the incentive to do so remains vitally important.
Considering all the components used in today's generation of control systems, it's the root cause of hardware failure that gets studied most often. The root cause of software failure, on the other hand, is rarely studied or well understood.
In the field studies that have been conducted, some theories on what causes software failure have emerged, but even those are not widely known or followed by software engineers. Similarly, few practitioners know the rules of software reliability or take the time to understand how to create reliable software. Why? In part because software development tool producers work hard to make control software developers think it's easy to produce reliable software.
No one, however, can ignore the importance of software reliability, and as control systems grow in functionality and complexity, machine and production equipment builders must increasingly depend on software to carry the load.
We'll address these issues here and include examples of software failures, the root causes of those failures, some rules for avoiding those causes and some guidance in evaluating software reliability in control system products.
More Complex Control
Powerful new tools enable us to develop software-dependent control systems that are increasingly complex. Software reliability, the ability of this software to perform the expected function when needed, is essential. Yet, how often do we hear, "The network is down," or "My computer froze up--again," or "How long has this operator station been frozen?" Our experience with software is far from perfect.
As industry's dependency on software increases, so does the incentive to develop higher levels of software reliability.
Software Failure Happens
Consider why software fails. The next few examples offer some insight. The console of an industrial machine operator had functioned normally for two years. On one of a newly hired operator's first shifts, shortly after an alarm acknowledgment, his console stopped updating the CRT screen and would not respond to commands. The unit was powered down and successfully restarted, and no hardware failures were found.
With more than 400 units in the field and 8 million operating hours, the manufacturer found it difficult to believe that a significant software fault existed in such a mature product. An extensive testing procedure produced no further failures. A test engineer visited the site and interviewed the new operator. At this interview, the engineer noted, "This guy is very fast on the keyboard." That small observation allowed the problem to be traced, and further testing revealed that if the alarm acknowledgment key was struck within 32 msec of the alarm silence key, a software routine would overwrite a critical area of memory and the computer would fail.
At another plant an operator requested that a data file be displayed on the terminal and the computer failed. This was not a new request--the same data file had been displayed successfully on the system numerous times before.
The problem was traced to a software module that did not always append a terminating "null zero" to the end of the file name character string. On most occasions the file name was stored in memory that had been cleared, with zeros written into all locations. In those cases the operation succeeded and the software fault remained hidden. When the dynamic memory allocation algorithm chose memory that had not been cleared, the system failed. The failure therefore occurred only when two conditions coincided: the software module did not append the zero, and the memory allocated was in an uncleared area.
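The pattern behind this second failure can be sketched in a few lines of C. This is a minimal illustration, not the code from the system described; the function names and buffer sizes are hypothetical. It contrasts a copy routine that depends on the destination memory already holding zeros with one that always writes the terminator itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Faulty pattern (hypothetical): copies the characters of the name but
 * writes no terminator. It only "works" when dst already contains zeros. */
static void copy_name_unterminated(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Safer pattern (hypothetical): always writes the terminating zero and
 * truncates the name if it would not fit in the destination buffer. */
static void copy_name_terminated(char *dst, size_t dst_size,
                                 const char *src, size_t len)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    if (len >= dst_size) {
        len = dst_size - 1;      /* leave room for the terminator */
    }
    memcpy(dst, src, len);
    dst[len] = '\0';             /* do not rely on pre-cleared memory */
}
```

With the first routine, a buffer that happens to be zero-filled yields a valid string, which is why the fault can stay hidden through long periods of testing and use. With the second, the result no longer depends on what the memory allocator happened to return.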
Consider a third example in which a computer stopped working after it received a message on its communication network. The message came from a computer running an incompatible operating system. While the message used the correct "frame" format, that operating system used different data formats. Because the receiving computer did not check for a compatible data format, the data bits within the frame were incorrectly interpreted, and the computer failed within a few seconds.
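The missing check in this third example might look something like the following C sketch. The frame layout, the format identifier and every name here are assumptions made for illustration; the original article does not describe the actual protocol. The point is only that the receiver confirms the data format before interpreting any payload bits, and rejects the message otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame layout; none of these fields come from the article. */
#define FORMAT_ID_EXPECTED 0x01u

typedef struct {
    uint8_t format_id;      /* says how the payload bytes are encoded */
    uint8_t length;         /* number of valid payload bytes */
    uint8_t payload[32];
} frame_t;

typedef enum {
    FRAME_OK = 0,
    FRAME_BAD_FORMAT,
    FRAME_BAD_LENGTH
} frame_status_t;

/* Decode a 16-bit big-endian value from the payload, but only after the
 * data format and length have been checked. On rejection, *out is left
 * untouched so the caller never acts on misinterpreted bits. */
static frame_status_t decode_value(const frame_t *f, uint16_t *out)
{
    assert(f != NULL && out != NULL);
    if (f->format_id != FORMAT_ID_EXPECTED) {
        return FRAME_BAD_FORMAT;
    }
    if (f->length < 2 || f->length > sizeof f->payload) {
        return FRAME_BAD_LENGTH;
    }
    *out = (uint16_t)(((unsigned)f->payload[0] << 8) | f->payload[1]);
    return FRAME_OK;
}
```

A receiver built this way discards or logs a well-framed message from an incompatible sender instead of acting on its contents.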