"It's Armageddon," the frazzled operator said on the phone. "Our network is down and we're losing data. Help!"
OK, then. It was game on.
I felt confident that I could solve their problem because of the self-education I've immersed myself in over the past several years.
It's no secret that a system integrator or automation pro has to keep pace with technology developments such as Ethernet and other networks. But how about Cisco IOS or configuring a router and an access control list?
My customer has a network IT guy with whom I interface every week. I described the issue to him, and he thought it might be a broadcast storm that they had experienced a few times before, which caused a similar outage.
The difference here was that three wireless sites were brought online about an hour before the system went to hell in a handbasket.
We talked about the hardware and whether a port had gone bad, which would make the outage coincidental with the addition of the three sites rather than caused by it. We used a free tool called Wireshark to determine that the packet traffic looked mostly normal, although he did discover a rogue IP trying to break into their FTP server. Bonus find.
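Wireshark's display filters did the job in the moment, but the same kind of check can be sketched offline. Here's a minimal Python example, assuming a hypothetical export of the capture as (source IP, destination port) records, that counts which sources are hammering the FTP port:

```python
from collections import Counter

# Hypothetical packet records exported from a capture as
# (source IP, destination port) pairs. In practice you'd get
# these out of Wireshark or tshark, not hard-code them.
packets = [
    ("10.0.0.12", 502), ("10.0.0.14", 502),
    ("203.0.113.7", 21), ("203.0.113.7", 21),
    ("203.0.113.7", 21), ("10.0.0.12", 21),
]

FTP_PORT = 21
THRESHOLD = 3  # arbitrary cutoff for this sketch

# Count FTP connection attempts per source address.
ftp_hits = Counter(src for src, port in packets if port == FTP_PORT)
suspects = [ip for ip, n in ftp_hits.items() if n >= THRESHOLD]
print(suspects)  # the rogue IP stands out from normal plant traffic
```

Nothing fancy, but it shows why a capture is worth keeping even after the immediate question is answered.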
So if it wasn't a broadcast storm, then must it be one of the three new installs? I involved the supplier of the wireless setup.
"All is fine here," was the stock answer, but I am sure that some investigation was done.
I had also promoted some new code the day before, so you know what I was thinking. I stayed up until 4 a.m. poring over the VB code to be sure that I wasn't the cause, even though the system had worked well for 24 hours before the three sites came online.
The main symptom I observed, both in Wireshark and in my own application logs, was increased latency in the packet traffic from most of the 14 connected sites.
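Quantifying that latency is straightforward if the application logs both sides of each poll. A short Python sketch, using made-up timestamps in place of the real log data, that turns send/receive pairs into round-trip figures:

```python
import statistics

# Hypothetical (sent, received) timestamps in seconds, one pair
# per poll transaction, as pulled from application logs.
polls = [(0.00, 0.12), (1.00, 1.15), (2.00, 2.95), (3.00, 4.80)]

# Round-trip time for each transaction.
rtts = [rx - tx for tx, rx in polls]

print(f"median RTT: {statistics.median(rtts):.2f}s")
print(f"worst RTT:  {max(rtts):.2f}s")
```

A healthy baseline versus a bad day shows up immediately in the median, and the worst-case number tells you whether your timeouts even stand a chance.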
As a last resort, I made the timeouts on the software connection and transaction properties much higher, and it seemed to help, at least for 10 minutes. Then boom, down again.
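Raising the timeouts only masked whatever was dropping traffic, but the pattern itself is a standard one in polling code. A minimal sketch of it, with hypothetical names since the real system was VB and its driver API isn't shown here:

```python
import time

def call_with_timeout(transact, timeout_s, retries=2):
    """Run one poll transaction, retrying on timeout.

    `transact` is a hypothetical stand-in for the real driver call;
    it takes a timeout in seconds and raises TimeoutError when the
    remote site doesn't answer in time.
    """
    for attempt in range(retries + 1):
        try:
            return transact(timeout_s)
        except TimeoutError:
            if attempt == retries:
                raise  # out of retries; surface the failure
            time.sleep(0.1)  # brief backoff before trying again
```

Bumping `timeout_s` from, say, 2 seconds to 30 is the software equivalent of what I did: it buys tolerance for latency, but it can't fix the network underneath it.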
So here were two IT guys and me trying to solve an issue critical to the water chlorination control and monitoring operations for more than 10,000 people.
All the normal responses weren't getting us anywhere. We discussed the architecture's detailed network drawings, as well as a real-time look using WhatsUp Gold and IntraVue. There was nothing to give us an indication of what to do next.
I mentioned that we had Netgear switches at each location, and maybe a port blew up and was flooding the system. As I said it, it was obvious that I was grasping at straws, since Wireshark and IntraVue would have shown us that.
The problem lasted through a second day. Hair-pulling time. With all the talk about switches and the address tables they keep on board, we wondered whether those tables had gotten scrogged or held duplicate entries causing errant messaging. It wasn't evident in the Wireshark capture or in IntraVue.
Another consideration was that lying between the customer's DAQ system and the wireless remote PLCs was fiber to a third-party network provider. More switches and routers and power supplies — additional points of failure to consider.
The apparent solution to this problem was in the customer's network operations center (NOC). New switches had been installed there a few months before. While that in itself isn't a red flag, it was worth considering, since we'd never seen an outage like this one. Can they be trusted? I've had to reboot my managed switch at home a few times because its tables get scrambled. I wondered whether rebooting the switches in the NOC would change anything.
Well, who knew? We rebooted them, and after 12 hours of controlled testing and some sleep, we were in the clear.
While we weren't just throwing darts, we were scratching our heads. Without the tools we used, things would have been worse. I suspect that if we had used those tools more effectively, we might have found the answer sooner. Any thoughts about that?