Trees and Rings

Tips for How to Keep Your Network Up and Running. Think Redundancy

Nov. 10, 2010

By Ian Verhappen, Industrial Automation Networks

With the increasing importance Ethernet has in today's automation networks, a key consideration in any design is the overall reliability of the resulting system.

To show how much getting things right matters for reliability, data from Datacom breaks down the percentage of failures affecting system reliability by layer of the OSI model (Figure 1). The lower you go in the OSI model, the more failures there are, and 72% of failures occur in the first three layers. These faults include hardware failures, cabling failures, power losses and programming misconfigurations, among others.

Failure Mirrors Reliability

Failures are reflective of reliability. "The word reliability refers to both the guarantee of delivery of a message from source to destination, and the integrity of the message itself," says Dick Caro, president of CMC Associates (www.cmc.us), chairman of ISA-50 and recognized industrial protocol expert. "In most cases, message integrity is assured by the error-check codes in the messaging protocol, and acknowledgement of the received message. The usual method used to correct transmission error and failure of a message to be acknowledged by the receiving node is to retransmit the message."
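The check-acknowledge-retransmit cycle Caro describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular industrial protocol implements it: CRC-32 from the standard zlib module stands in for the protocol's error-check code, and the names frame, check, send_with_retry and make_noisy_channel are hypothetical helpers invented for this example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 check code to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if its check code matches, otherwise None."""
    payload, received_crc = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received_crc:
        return payload
    return None  # corrupted: the receiver would not acknowledge


def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Retransmit until the message passes the check, up to max_attempts.

    `channel` is a hypothetical callable that takes the framed bytes and
    returns the bytes that arrive at the receiver (possibly corrupted).
    """
    for attempt in range(1, max_attempts + 1):
        received = check(channel(frame(payload)))
        if received is not None:  # receiver acknowledges a good message
            return received, attempt
    raise TimeoutError("message was never acknowledged")


def make_noisy_channel(bad_attempts: int):
    """Build a test channel that flips one bit on the first few sends."""
    state = {"calls": 0}

    def channel(data: bytes) -> bytes:
        state["calls"] += 1
        if state["calls"] <= bad_attempts:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


if __name__ == "__main__":
    message, attempts = send_with_retry(b"setpoint=42", make_noisy_channel(1))
    print(message, "delivered after", attempts, "attempts")
```

In this sketch the first transmission is corrupted, fails the check and is sent again; the second copy arrives intact. Real protocols add sequence numbers and timeouts so that a lost acknowledgement does not produce a duplicate message.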

Larry Thompson, owner and general manager of Electronic Systems Development and Training (ESdatCo), automation industry author and instructor, is exposed to a wide range of situations and confirms that running parallel programs, watchdog timers, checksums and so forth all aid in the detection of communication problems. "In most cases, encrypting network information will provide a more than sufficient check of network information reliability and integrity at the costs of additional processing and points of failure," he says. "However, one thing must be considered: Ethernet does not allow for duplicate packets, and any system of duplication must take this into account—hence, the need for software at the Data Link layer to make sure this does not happen."

With Ethernet-based networks and protocols, redundancy is the most commonly used method to maintain maximum uptime and still be able to deal with minor outages and failures. The statistics from Datacom (Figure 1) show why it is no surprise that the three main areas for Ethernet-based controls redundancy are physical, data link and network layer hardware and software.

Fast Recovery Essential

Automation environment recovery time typically needs to be less than 100 ms. Layer 2 redundancy protocols do two things: identify all the possible paths among the networking devices, and place the redundant paths in a blocking state to remove network loops. The Spanning Tree Algorithm (IEEE 802.1D) ensures only one active path for Ethernet packets, but with recovery times as long as 15 seconds it is too slow. So equipment used for controls typically also supports one or more of the following:

• Rapid Spanning Tree Protocol: Originally defined in IEEE 802.1w and currently standardized in IEEE 802.1D-2004, RSTP is an evolutionary leap over STP, with failover times in industrial networks ranging from about 250 ms to 12 seconds.
• Multiple Spanning Tree Protocol: Standardized in IEEE 802.1Q-2003, MSTP allows multiple Spanning Tree instances, with Virtual LANs mapped to those instances.
• Link Aggregation Control Protocol: Defined in IEEE 802.3ad, LACP allows the user to configure multiple Ethernet ports between Ethernet switches into a single virtual link. This permits load sharing of information between the links and is extremely fast in moving data from a failed port to an adjacent port if there is a link failure.
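The two jobs of a Layer 2 redundancy protocol described above, finding the possible paths and blocking the redundant ones, can be illustrated with a small Python sketch. It is a deliberate simplification: real STP and RSTP elect a root and choose paths by exchanging bridge IDs and path costs, whereas this example simply builds a breadth-first tree from a root chosen by hand. The switch names and the blocked_links function are hypothetical.

```python
from collections import deque


def blocked_links(links, root):
    """Return the links left out of a breadth-first spanning tree.

    links: list of (switch_a, switch_b) tuples, one per interswitch link.
    root:  name of the switch treated as the root of the tree.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()  # links that stay active (the tree)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors.get(node, [])):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((node, peer)))
                queue.append(peer)

    # Any link that is not part of the tree would form a loop, so block it.
    return [link for link in links if frozenset(link) not in forwarding]


if __name__ == "__main__":
    ring = [("SW1", "SW2"), ("SW2", "SW3"), ("SW3", "SW4"), ("SW4", "SW1")]
    print("Blocked in healthy ring:", blocked_links(ring, "SW1"))

    # Simulate a break between SW1 and SW2 and recompute.
    broken = [link for link in ring if link != ("SW1", "SW2")]
    print("Blocked after link failure:", blocked_links(broken, "SW1"))
```

In the healthy four-switch ring one link is held in reserve. After the simulated break nothing is blocked, because the reserve link has become part of the only remaining path. The recovery times quoted in this article are the time a real protocol needs to detect the failure and make that change.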

"A basic redundancy requirement for control systems is that every part of the communication network should be hooked up to a backup power supply with redundant power inputs," says Nick Sandoval, Moxa field application engineer. "The power supply is typically far more likely to fail than a switch." In addition, a completely redundant system consists of redundant switches, redundant communication ports and redundant device pairs. Table I summarizes the methodology Moxa (www.moxa.com) uses to determine the level of redundancy in its network designs.

One of Moxa's newest technologies to economically meet these redundancy requirements, Turbo Chain, connects several Ethernet switches together to form a daisy chain, in which a head switch and a tail switch (the edge switches at the two sides of the chain) are configured first (Figure 2). The remaining switches are configured as member switches. The two ends of the chain are connected to an Ethernet network such as Moxa's Turbo Ring. The network system will recover in less than 20 ms by activating the blocked path and backup path in the ring. Turbo Chain also allows for integration with other technologies such as RSTP and Turbo Ring networks.

Don't Skimp on Spare Ports

Sven Burkard, strategic and product marketing manager at Hirschmann Automation and Control (www.hirschmann-usa.com), recommends that "you plan for 5-10% spare ports and don't forget that managed switches are also a requirement for redundant media/data paths."

Dominic Iadonisi, industrial market manager, RuggedCom (www.ruggedcom.com), reaffirms this. "Since an unmanaged switch is in effect a blind switch, it is not possible to see how the network is performing and perform predictive maintenance based upon what you cannot see," he says. "Also, the ability to use port mirroring on a managed switch can assist with troubleshooting Application Level issues because you can use a protocol analyzer to see the application in operation."

Another advantage of a totally switched network is that nodes only communicate with the switch and never directly with each other. In switched Ethernet, each device is the only one that can access the medium connecting it to the switch, so it can forgo the collision-detection process and transmit at will, which improves the determinism of messaging. In the case of full-duplex communications, the end stations can transmit to the switch at the same time that the switch transmits to them, achieving a collision-free environment.

Interestingly enough, you can go overboard on redundancy as well. Too many connections between Ethernet switches can cause slowdowns in reconvergence of a network if there is a lost link or switch. Ring topologies typically use two interswitch links per switch; mesh topologies can use three links or more. It is not a good idea to have more than three links for edge switches in a mesh network environment.

Node of Confusion

A fieldbus expert and industrial network consultant since 1993, Rob Hulsebos is an engineer at controls manufacturer Delem (www.delem.com), and also has his own fieldbus consultancy eNode (www.enodenetworks.com). He puts the Ethernet for automation dilemma into some perspective, saying, "A lot has happened the past five years, especially in industrial Ethernet. Vendors, especially network equipment suppliers, are competing furiously to launch ever-quicker recovering redundant networks. Unfortunately, there’s not much standardization and the faster ones are still proprietary. Nobody seems to care much."

Because most systems are proprietary, it is hard to validate vendor claims on reliability and recovery times. Some vendors give only worst-case figures, on the assumption that if your network is smaller than the worst case, you will figure it is going to be faster.

There are many redundancy variations in Ethernet. Redundancy can be achieved via software or hardware, or both. Many people assume a ring is the only possible option for fast recovery, but Powerlink and EtherNet/IP have separate structures (even with multiple redundancy). EtherCAT has other ideas and Profinet yet others, just proving that there might be a standard at the Physical Layer, but after that it's a free-for-all. Even with duplicated hardware, if you use the same software everywhere, the system cannot handle a fault that hits both sections of the network in the same way. For even higher reliability against this type of fault, the software (along with different compilers, CPUs, underlying hardware, etc.) must be implemented twice. So there's redundancy and then more redundancy, if you want to pay for it, design it and, most importantly, maintain it once it is installed.

Regardless of how well you design your system, failures will happen. To plan for recovery of the switch and its configuration, Burkard says, the system needs a configuration recovery mechanism (USB flash storage or similar) that will give the user a Mean Time to Repair/Replace (MTTR) of less than 3 minutes by untrained personnel at 3 a.m.
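To see why a short MTTR matters, the standard steady-state availability formula, MTBF / (MTBF + MTTR), can be worked through in Python. This is a rough sketch only: the 100,000-hour MTBF and the 4-hour comparison repair time are assumed figures chosen for illustration, not numbers from Burkard or any vendor.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the switch is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected minutes of downtime per year for one switch."""
    unavailable = 1.0 - availability(mtbf_hours, mttr_hours)
    return unavailable * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    ASSUMED_MTBF_HOURS = 100_000  # hypothetical value, for illustration only
    scenarios = (
        ("3-minute swap with saved configuration", 3 / 60),
        ("4-hour manual reconfiguration (assumed)", 4.0),
    )
    for label, mttr_hours in scenarios:
        minutes = downtime_minutes_per_year(ASSUMED_MTBF_HOURS, mttr_hours)
        print(f"{label}: about {minutes:.2f} minutes of downtime per year")
```

With the same hardware failure rate, expected downtime scales almost directly with repair time, which is the argument for a configuration recovery mechanism that untrained personnel can use.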

The Physical Move

All the information and data packets still need to somehow get from A to B, and that's the job of the Physical Layer. The most commonly used options are copper (Cat 5/5e/6) or fiber, though wireless is starting to make some inroads. With all these choices at the Physical Layer, what is the impact of choosing one over the other?

"Resilient cables do not increase reliability; they only protect against cable breaks," Caro notes. "Since only one cable of the pair is used at a time, the reliability is the same as for a non-resilient cable."

Burkard has a range of recommendations in that regard. "Keep cable run distances to a minimum by using decentralized network architectures," he says. "Instead of a switch in the engineering office that requires multiple long runs onto the plant floor, bring the switch to the plant floor and use short patch cables to connect it and the devices."

Mesh networks use more fiber than ring networks, but typically can survive more network hits. Whether Cat 6 or Cat 5/5e, network performance effectively boils down to the signal-to-noise ratio at the receiver. The main result is that Cat 6 provides about 12 dB (or 16 times) better signal-to-noise ratio compared to Cat 5/5e over a wide frequency range. The installed cost for Cat 6 cabling can be about 20% higher than Cat 5e. Cost is also the largest deterrent to the use of fiber. Costs are presently about $80/port for copper 100 Mb/s, compared with about $400/port for a 100BaseFX switch. The added cost of media converters for multimode fiber is about $112/port for Fast Ethernet and $385/port for Gigabit Ethernet. The cost of optical networking equipment generally is two to four times higher than copper.
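The "12 dB (or 16 times)" figure above follows from the usual decibel conversion for power ratios. A short Python check, with function names that are only illustrative, shows the arithmetic and why it matters whether the ratio is of power or of voltage:

```python
def db_to_power_ratio(db: float) -> float:
    """Convert decibels to a power ratio: ratio = 10^(dB / 10)."""
    return 10 ** (db / 10)


def db_to_voltage_ratio(db: float) -> float:
    """Convert decibels to a voltage (amplitude) ratio: ratio = 10^(dB / 20)."""
    return 10 ** (db / 20)


if __name__ == "__main__":
    print(round(db_to_power_ratio(12), 1))    # 15.8, i.e. about 16 times
    print(round(db_to_voltage_ratio(12), 1))  # 4.0
```

So the 16-times figure is a ratio of signal power to noise power; expressed as signal amplitude, the same 12 dB is roughly a factor of four.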

With copper, the choice of solid or stranded conductors needs consideration. Solid is appropriate for most installations, while stranded provides extra flexibility for better handling in close environments and for robotic/continuous flex applications.

Jacob Jackson, design engineer at Assurity Design Group (www.assuritydg.com), always uses VLANs for building automation projects. "Many automation devices have poor TCP/IP stacks, making it hard for them to withstand or recover from a broadcast storm that might need a physical reset of the device to recover," he says.

Once a typical automation network gets beyond 30-35% saturation, the devices will not be able to talk, Jackson says. The way to prevent this is to manage ports on the VLAN and lock down any unused ports. Security is why his group doesn't use wireless and why the VLAN is important, he says, citing an example of an auditorium projector that connected wirelessly: "Students logged on and took control of the projector during class. Fun for the students, but not so much so for the instructor or potentially the rest of the devices on the network. This reinforced why to not use wireless access points for control with our building automation clients."
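Jackson's saturation threshold can be turned into a simple check. The sketch below assumes you can read a port's byte counter twice, some seconds apart, from a managed switch; how the counters are collected is left out, and the function names and the 30% default (the low end of his range) are choices made for this example.

```python
def utilization(bytes_delta: int, interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity used over the sampling interval.

    bytes_delta: bytes counted on the port between two readings.
    interval_s:  seconds between the two readings.
    link_bps:    link speed in bits per second (100e6 for Fast Ethernet).
    """
    return (bytes_delta * 8) / (interval_s * link_bps)


def is_saturated(util: float, threshold: float = 0.30) -> bool:
    """Flag a port at or above the threshold (default: low end of 30-35%)."""
    return util >= threshold


if __name__ == "__main__":
    # Hypothetical reading: 45,000,000 bytes in 10 seconds on a 100 Mb/s port.
    load = utilization(45_000_000, 10, 100e6)
    print(f"Utilization: {load:.0%}, saturated: {is_saturated(load)}")
```

A check like this is only possible on a managed switch, since an unmanaged switch exposes no counters to read.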

In the end, Burkard says, reliable networks all boil down to the six Ps: Proper Prior Planning Prevents Poor Performance. "And as we have shown, there is no magic wand to guarantee 100% reliability of your network," he says. "But if you execute good engineering design practices, you can get close."