I lied. The network nightmare I wrote about in January really wasn't fixed after all.
My customer is a water municipality with 15 remote sites, each running local PLC-based control for some operations along with local control of chlorine and treated water. We communicate with the mother ship over a public 802.11a wireless network.
In 2005, we used a lower-cost PLC with only serial communications. The server used custom software I'd written that used virtual serial port drivers to communicate with hardware remotely over the broadband wireless network.
Ethernet device servers connected to the remote serial ports on the PLCs at each location. Simple.
Well, yes, except that wireless latency was a really big issue for the virtual serial ports and, by extension, for the application software built on top of them.
I enlisted the help of Lynn Linse, who is the guru of communication software for all things wireless. He said this sort of latency required some very special code that takes just about everything into account. My code was good, but not good enough. We had various routines to telnet into the devices and reset them, and to monitor communications and reboot the server when repeated failures piled up.
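A simplified sketch of that kind of escalation logic, just to illustrate the idea: count consecutive failures per site, reset the device server after a few, and reboot the server if failures keep accumulating overall. The thresholds and return values here are made up for illustration, not the production values:

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; real values would be tuned per site.
RESET_AFTER = 3     # consecutive poll failures before resetting a device server
REBOOT_AFTER = 10   # accumulated failures before rebooting the server

@dataclass
class CommWatchdog:
    """Tracks poll failures per site and decides what action to escalate to."""
    failures: dict = field(default_factory=dict)
    total_failures: int = 0

    def record(self, site: str, ok: bool) -> str:
        """Record one poll result; return 'none', 'reset', or 'reboot'."""
        if ok:
            self.failures[site] = 0
            return "none"
        self.failures[site] = self.failures.get(site, 0) + 1
        self.total_failures += 1
        if self.total_failures >= REBOOT_AFTER:
            self.total_failures = 0
            return "reboot"  # e.g. restart the comm server process/machine
        if self.failures[site] >= RESET_AFTER:
            self.failures[site] = 0
            return "reset"   # e.g. telnet into the device server and reset it
        return "none"
```

The point is only that the recovery logic lived alongside the polling code, papering over problems rather than exposing their cause.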
The system had resident drivers installed that connected between the devices and the application code. We found that, regardless of the reliability of the wireless network, we had all kinds of issues with gathering data.
So, as I said, we moved from a virtual serial port device endpoint to a true Ethernet solution. This meant changing the application code to use the new environment. The serial port drivers and all the old code were uninstalled. The new code was running, and all should have been right with the world. Then it all blew up on us a day later.
I had no ideas or options to solve this problem. Back to the drawing board. Then I made a change to a remote PLC and noticed a very important piece of the puzzle: the network communications went to hell in a handbasket. I had no idea why that would happen.
I rebooted the server and everything came back to normal. So what I had thought was purely a network issue now looked like a network problem, a server problem, or both.
The kicker is that when the primary server had trouble talking to the PLCs out in the field over the radio network, the secondary server had the same issues. In other words, a new server running in a totally different environment saw the same communication problems and the same PLC-protocol timing issues. It can't be server-based, I figured.
We brought the rest of the sites online despite the problems. We used IntraVue, Wireshark, and normal ICMP stuff. Nothing seemed amiss.
We completed all our work in the field and then went back to the server office. We still had severe network issues and were unable to connect to the remote sites reliably for any length of time.
Has to be the network, right? Has to be connectivity issues. Can't be my application software. And, most importantly, it can't be local, but maybe it's the wireless network?
I used a piece of software from Sysinternals called TCPView, and I wondered why there were connections to ports and IP addresses that no longer existed. During the conversion, we had changed the IP addresses and the VLAN domains to different entities. Why was there a bunch of packets destined for the old IP addresses?
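TCPView showed this interactively, but the same check can be sketched in code: take a snapshot of active remote endpoints (e.g. scraped from netstat output) and flag any that fall inside address space retired during the conversion. The subnet and port numbers below are made up for illustration:

```python
import ipaddress

# Hypothetical: address space retired during the IP/VLAN conversion.
RETIRED_SUBNETS = [ipaddress.ip_network("192.168.10.0/24")]

def stale_connections(remotes):
    """Return (host, port) endpoints that still point at retired subnets.

    `remotes` is a list of (ip_string, port) tuples, such as you might
    parse out of a netstat snapshot or a TCPView export.
    """
    stale = []
    for host, port in remotes:
        addr = ipaddress.ip_address(host)
        if any(addr in net for net in RETIRED_SUBNETS):
            stale.append((host, port))
    return stale
```

A recurring hit from a check like this is exactly the smoking gun I was staring at: something on the server was still trying to reach addresses that no longer existed.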
Checking Device Manager in Windows, I found the virtual serial port drivers were still installed. Wow. So I removed them again and, what do you know, the years of sporadic network communication problems disappeared.
We chased many red herrings, including the secondary machine showing the same comm issues from a different environment at the same time as the original server. But resident drivers left over from a previous installation were the root cause.
I felt really incompetent. I got over it. Who knew a local issue could cause a global network meltdown? I do now.