Reliability in control software

By Dr. William Goble, Exida

Control Design magazine

The root cause of software failure is rarely studied or well understood. William Goble, writing for CONTROL DESIGN, believes operator safety and product quality depend on it. Can you count on your mission-critical software?


Considering all the components used in today's generation of control systems, it's the root cause of hardware failure that gets studied most often. The root cause of software failure, on the other hand, is rarely studied or well understood.

In the field studies that have been conducted, some theories on what causes software failure have emerged, but even those are not widely known or followed by software engineers. Similarly, few practitioners know the rules of software reliability or take the time to understand how to create reliable software. Why? In part because software development tool producers work hard to make control software developers think it's easy to produce reliable software.

No one, however, can ignore the importance of software reliability, and as control systems grow in functionality and complexity, machine and production equipment builders must increasingly depend on software to carry the load.

We'll address these issues here and include examples of software failures, the root causes of those failures, some rules for avoiding those causes and some guidance in evaluating software reliability in control system products.

More Complex Control
Powerful new tools enable us to develop software-dependent control systems that are increasingly complex. Software reliability, the ability of this software to perform the expected function when needed, is essential. Yet how often do we hear, "The network is down," or "My computer froze up--again," or "How long has this operator station been frozen?" Our experience with software is far from perfect.

As industry's dependency on software increases, so does the incentive to develop higher levels of software reliability.

Software Failure Happens
Consider why software fails. The next few examples offer some insight. The console of an industrial machine operator had functioned normally for two years. On one of a newly hired operator's first shifts, shortly after an alarm acknowledgment, his console stopped updating the CRT screen and would not respond to commands. The unit was powered down and successfully restarted, and no hardware failures were found.

With more than 400 units in the field and 8 million operating hours, the manufacturer found it difficult to believe that a significant software fault existed in such a mature product. An extensive testing procedure produced no further failures. A test engineer visited the site and interviewed the new operator. At this interview, the engineer noted, "This guy is very fast on the keyboard." That small observation allowed the problem to be traced, and further testing revealed that if the alarm acknowledgment key was struck within 32 msec of the alarm silence key, a software routine would overwrite a critical area of memory and the computer would fail.
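The reason ordinary testing missed this fault can be shown with a small sketch. Only the 32 msec window comes from the example above. The console model, the function names, and the choice to treat exactly 32 msec as safe are illustrative assumptions, not the manufacturer's code.

```python
# Hypothetical model of the faulty console. Only the 32 msec window is
# taken from the article; names and structure are assumptions.
RACE_WINDOW_MS = 32

def console_survives(gap_ms):
    """Return False when the acknowledge key follows the silence key
    inside the race window (assumed here to be gaps below 32 msec)."""
    return gap_ms >= RACE_WINDOW_MS

def sweep_key_gaps(gaps_ms):
    """Exercise the console model at each inter-key gap and
    collect the gaps that make it fail."""
    return [gap for gap in gaps_ms if not console_survives(gap)]

# Gaps typical of most operators (100 msec and up) never reach the fault.
typical_failures = sweep_key_gaps(range(100, 1000, 50))

# A deliberate sweep toward 0 msec exposes it.
fast_failures = sweep_key_gaps(range(0, 100, 4))

print("failures at typical speeds:", typical_failures)
print("failures at fast speeds:", fast_failures)
```

A test plan that only reproduces typical operator behavior covers the first sweep and none of the second, which is consistent with the extensive testing that found nothing.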

Figure: Rock Beats Scissors ... Strength Beats Stress. The strength curve indicates the chances of any particular strength value in a collection of products. The area under both curves represents failure conditions. When the product design produces higher strength levels, failure probability decreases.

At another plant an operator requested that a data file be displayed on the terminal and the computer failed. This was not a new request--the same data file had been displayed successfully on the system numerous times before.

The problem was traced to a software module that did not always append a terminating zero (null) character to the end of the file name character string. On most occasions the file name was stored in memory that had been cleared, with zeros written into all locations. In those cases the missing terminator did no harm, the operation succeeded and the software fault remained hidden. On the occasion that the dynamic memory allocation algorithm chose memory that had not been cleared, the system failed. The failure required both conditions at once: the module omitted the zero, and the memory allocated for the name had not been cleared.
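A small simulation makes the combination concrete. This is a sketch under stated assumptions: memory is modeled as a Python bytearray, and the buffer size, file name, and helper names are hypothetical rather than taken from the failed system.

```python
def store_name(memory, start, name, terminate):
    """Copy name into memory at start, optionally adding the zero terminator."""
    memory[start:start + len(name)] = name
    if terminate:
        memory[start + len(name)] = 0

def read_c_string(memory, start):
    """Read bytes from start up to the first zero byte, as C string routines do."""
    end = memory.find(0, start)
    if end == -1:
        raise ValueError("no terminating zero found in buffer")
    return bytes(memory[start:end])

name = b"DATA.LOG"                          # hypothetical file name

cleared = bytearray(16)                     # memory previously zeroed
uncleared = bytearray(b"X" * 15 + b"\x00")  # memory holding leftover bytes

# The faulty path: the terminator is omitted in both cases.
store_name(cleared, 0, name, terminate=False)
store_name(uncleared, 0, name, terminate=False)
hidden_fault = read_c_string(cleared, 0)      # zeros already there mask the bug
exposed_fault = read_c_string(uncleared, 0)   # leftover bytes run into the name

# The corrected path: always write the terminator.
repaired = bytearray(b"X" * 15 + b"\x00")
store_name(repaired, 0, name, terminate=True)
fixed = read_c_string(repaired, 0)

print(hidden_fault, exposed_fault, fixed)
```

In the cleared buffer the name reads back correctly even though the code is wrong, which is why the fault survived so many successful requests.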

Consider a third example in which a computer stopped working after it received a message on its communication network. The message came from a computer running an incompatible operating system. While the message used the correct "frame" format, the sending operating system used different data formats inside the frame. Because the receiving computer did not check for a compatible data format, the data bits within the frame were incorrectly interpreted. These events caused the computer to fail within a few seconds.
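The missing check can be sketched as follows. The frame layout here (a one-byte format identifier followed by a four-byte big-endian integer) and all names are assumptions made for illustration. The article does not describe the actual protocol.

```python
import struct

EXPECTED_FORMAT_ID = 1   # hypothetical identifier for this node's data format
FRAME_LENGTH = 5         # 1-byte format id + 4-byte signed integer

def decode_value(frame):
    """Decode a frame only after confirming its data format is compatible."""
    if len(frame) != FRAME_LENGTH:
        raise ValueError("unexpected frame length")
    format_id, value = struct.unpack(">Bi", frame)
    if format_id != EXPECTED_FORMAT_ID:
        # Reject rather than interpret bits laid out by a foreign system.
        raise ValueError(f"incompatible data format id {format_id}")
    return value

# A frame from a compatible sender.
good_frame = struct.pack(">Bi", 1, 1500)

# A frame of the same length from a sender using little-endian data.
foreign_frame = struct.pack("<Bi", 2, 1500)

accepted = decode_value(good_frame)
try:
    decode_value(foreign_frame)
    rejected = False
except ValueError:
    rejected = True

print(accepted, rejected)
```

Both frames have the same length and shape, so a check on the frame alone would pass them both. Only the check on the data format identifier separates them.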

Many examples of software failure are documented and most of them seem to contain some combination of events considered unlikely, rare or even impossible.

Stress vs. Strength

Reliability engineering provides the stress-vs.-strength concept. Failures occur when a stress is greater than a corresponding strength. While this concept comes from mechanical and civil engineering and is most frequently applied to stress as a mechanical force and strength as a structure's physical ability to resist that force, the same concept is applicable to software reliability.

A.C. Brombacher applies this concept to electronic hardware reliability. In his book, "Reliability by Design," Brombacher notes that failures occur when some stress or combination of stressors exceeds the associated strength (susceptibility) of the system (see the figure above). Stress, or the combination of stressors, is represented by a curve of the probability of any particular stress value. The strength curve indicates the chances of any particular strength value in a collection of products. The region where the two curves overlap, the area lying under both, represents failure conditions. Within a product, strength is the measure of resistance to stress. When the product design produces higher strength levels, the product is much less likely to fail.
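The effect of raising strength can be put in numbers with a short sketch. It assumes that stress and strength are independent and normally distributed, which is a common textbook simplification and not something stated in the article or attributed to Brombacher. The function name and the example values are hypothetical.

```python
import math

def failure_probability(stress_mean, stress_sd, strength_mean, strength_sd):
    """P(stress > strength), assuming independent normal distributions."""
    margin_mean = strength_mean - stress_mean
    margin_sd = math.sqrt(stress_sd ** 2 + strength_sd ** 2)
    z = margin_mean / margin_sd
    # Standard normal CDF evaluated at -z, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same stress in both cases; only the design strength changes (arbitrary units).
weaker_design = failure_probability(50, 5, 60, 5)
stronger_design = failure_probability(50, 5, 80, 5)

print(f"weaker design:   {weaker_design:.6f}")
print(f"stronger design: {stronger_design:.6f}")
```

Moving the strength curve away from the stress curve shrinks the overlap, and the computed failure probability falls with it.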
