As I read a news story on how India lost and found its Vikram Lander on the moon, I couldn't help thinking about some of the spectacular automation crashes and failures I've witnessed. Fortunately for India, it was not a total loss; its Chandrayaan-2 orbiter and its eight scientific instruments will likely be orbiting the moon and providing valuable information for years.
Like India's moon program, I have definitely lost a lander craft or two over my career, but the main ship still functioned after the bent parts were repaired. However, complete system failures do occur during machine startup or maintenance activities. Fortunately, I was just a spectator to some loud, flaming and truly destructive failure events.
A couple of decades ago, NASA's Mars Climate Orbiter crashed into Mars because the engineers were talking to the craft in English engineering units when they should have been speaking metric engineering units. Pound-force seconds of impulse are quite different from newton-seconds of impulse, causing the adjustment to be off by a factor of 4.45. NASA repeatedly sent the wrong information to correct the craft's motion and maybe didn't check for a proper response. Unfortunately, it often only takes one incorrect variable, closed contact or misplaced wire to crash an automated machine.
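To make that factor of 4.45 concrete, here is a minimal Python sketch, assuming the mismatch was between pound-force seconds and newton-seconds of impulse. The function names, the unit strings and the 5% tolerance are my own illustrative assumptions, not anything taken from NASA's software. The idea is simply that every value carries its unit, unknown units are rejected, and the observed response is compared against what was commanded.

```python
# 1 lbf = 4.4482216152605 N exactly, by definition of the pound and standard gravity.
LBF_S_TO_N_S = 4.4482216152605


def impulse_to_newton_seconds(value, unit):
    """Convert an impulse value to newton-seconds; refuse unknown units."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    # Fail loudly rather than guess which unit the sender meant.
    raise ValueError(f"unknown impulse unit: {unit!r}")


def response_is_plausible(commanded, observed, tolerance=0.05):
    """True if the observed response is within a fractional tolerance of the command.

    The 5% default is a hypothetical value chosen only for illustration.
    """
    if commanded == 0:
        return abs(observed) <= tolerance
    return abs(observed - commanded) <= tolerance * abs(commanded)


if __name__ == "__main__":
    commanded = 100.0  # intended impulse in N*s (made-up number)
    # The same number misread as lbf*s is about 4.45 times too large.
    misread = impulse_to_newton_seconds(commanded, "lbf*s")
    print(response_is_plausible(commanded, 102.0))    # small deviation
    print(response_is_plausible(commanded, misread))  # unit mix-up
```

A deviation of a few percent passes the check, while a response that is off by the pound-force-to-newton factor does not, which is the kind of basic test that catches a unit error early.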
It's hard to believe that engineers cannot check that the engineering units are correct in a $124 million spacecraft, but it happens. It is also hard to believe that a programmer does not include contacts (interlocks) in a drive-enable circuit to ensure the tooling is clear, but it's not that simple.
Some will say it is hard to test something flying through space 100 million miles away, but sending a command to test its response is about as basic as it gets, especially if it's off by a factor of 4.45. And it has to be done in the correct order. Here's an example of how not to do it.
Years ago, I was working at a machine builder as an integrator, programming and starting up a machine, when my competition came in to work on the large dial table next to me. The first thing he did was dry-cycle the eight stations on the dial table. Shortly after he started, within an hour of arriving, the dial table unexpectedly indexed and damaged every station on the machine. The damage was massive; everything was bent.
From my spectator position, I thought it was great, and the programmer’s failure was obvious. He didn't check the critical machine safety interlocks. In this case, the interlock is that all station tooling is clear of the dial. He should also have programmed an interlock so that, while the dial is indexing, the station tooling must stay clear or the dial must stop.
While the PLC program checks many sensors to ensure the tooling is clear of a potentially damaging motion, often only one "clear" interlock contact is used in series with an output coil to inhibit a dangerous and powerful machine motion.
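The two interlocks described for the dial table can be sketched in a few lines. This is a plain Python illustration of the logic only, not PLC code and not a safety-rated implementation; the `Station` class, the `tooling_clear` flag and the function names are hypothetical. It assumes each station reports one "clear" signal and that the function is evaluated once per scan.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    tooling_clear: bool  # hypothetical sensor: True when tooling is clear of the dial


def all_tooling_clear(stations):
    """True only if at least one station is configured and every one reports clear."""
    # all([]) is True in Python, so an empty list must be rejected explicitly.
    return bool(stations) and all(s.tooling_clear for s in stations)


def dial_drive_enable(index_request, indexing, stations):
    """Decide whether the dial drive may run on this scan.

    Start permissive: an index may begin only if all tooling is clear.
    Run permissive: an index in progress stops if any tooling leaves clear.
    """
    clear = all_tooling_clear(stations)
    return (index_request or indexing) and clear


if __name__ == "__main__":
    stations = [Station(f"station {n}", True) for n in range(1, 9)]
    print(dial_drive_enable(True, False, stations))  # all clear, index requested
    stations[3].tooling_clear = False                # one station extends mid-index
    print(dial_drive_enable(False, True, stations))  # drive must drop out
```

Putting the same "clear" condition in series with both the start request and the running state is what keeps a single missed check from letting the dial index into extended tooling.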
Just one wrong program bit and a machine motion can literally peel the tooling off a machine, which is much more common than crashing a spacecraft into a planet or moon. The same is true for a single relay contact or a misplaced wire.
The loudest and most destructive machine automation crash I’ve ever seen was at an appliance manufacturing plant. An integrator was starting up a multi-station, 100-foot-long walking-beam transfer that moved refrigerators through a final assembly and test system. The technician manually actuated a relay, and well over a million dollars of automated equipment was ripped from the mounts, including 10 large freestanding control panels.
The technician barely escaped with his life, but the resulting damage to the equipment was similar to a multi-car pileup on the freeway; and it certainly sounded like it.
I saw a similar thing happen in an automotive body shop. The walking-beam transfer cycled when about 30 robots were working on several vehicles. While not as catastrophic as the appliance line, blow torches were needed to cut up several vehicle frames to clear the resulting crash.
So, how do you keep that from happening? It's easy; carefully perform a well-thought-out test procedure, and use a safety relay in a Category 3, or similar, control-reliable circuit. Just as a safety circuit can be used to safely stop machine motion, it can be used to check that tooling is clear before motion is started. While they are two physically separate functions and circuits, the technology is the same.
It is important to predict and find those single-point failures before your automation makes the problem catastrophically obvious. Good design practices and a bit of failure analysis will help. For example, is it a problem to have 120 Vac and 24 Vdc directly adjacent to each other?
Some will say yes; electrical noise could be a problem. That's true, but another problem is incorrect wiring. Did I tell you about the integrator who accidentally connected 120 Vac to a 24 Vdc circuit and burned up 40 reed switches on a piece of test equipment? It started to burn as I was turning breakers on and testing voltages, and then a fire extinguisher became involved. As a reminder of this smoking automation, the shop smelled like an electrical fire for the better part of a week.
As NASA and others know, you cannot get it right 100% of the time, but that is the goal. Be careful with those units of measure, bits and wiring.