Data in Moderation

Dec. 5, 2011

We Have More Data Than We Know What to Do With

Key Performance Indicators (KPIs) are signposts in any process or sequence. Data acquisition for KPIs is an age-old process that started on a clipboard. Process people would gather data hourly and write numbers into squares on forms that ended up in someone's lap for "mining."

Over the past 10 years or so, data has been retrieved from control devices, using Ethernet or bus systems, and now we have more data than we know what to do with. The migration to high-speed, high-bandwidth networks has permitted the data pile to grow exponentially.

Collecting data with no plan for using it is the first and most common mistake, according to John Weber, president of Software Toolbox. The company provides users and OEMs with data acquisition software products and services.

"I see too many people who log data because someone asked them to," Weber says. "Later, they learn no one is doing anything with it." Of course, with this amount of data to mine through, some worthwhile data points will get lost in the shuffle.

So are there best practices for data acquisition? Opinions are many, but there are some practices that are considered gospel.

First, identify what you want to look for, then allocate bandwidth and space to that task. Ethernet has fostered the idea that I get as much as I want, when I want, as fast as I want. Treat your network with respect, and it will serve you well.
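To put rough numbers on "allocate bandwidth and space," here is a back-of-the-envelope sketch in Python. The function names and the 16 bytes per sample (an 8-byte timestamp plus an 8-byte value) are my assumptions for illustration, not figures from any particular historian or database.

```python
SECONDS_PER_DAY = 86_400

def samples_per_day(points: int, interval_s: float) -> int:
    """Values logged per day for `points` tags, each sampled every `interval_s` seconds."""
    return int(points * SECONDS_PER_DAY / interval_s)

def storage_per_day_mb(points: int, interval_s: float, bytes_per_sample: int = 16) -> float:
    """Raw storage per day in megabytes. bytes_per_sample is an assumed figure."""
    return samples_per_day(points, interval_s) * bytes_per_sample / 1_000_000

if __name__ == "__main__":
    # Compare a one-second sample rate with a one-minute rate for 200 points.
    for interval in (1, 60):
        print(interval, samples_per_day(200, interval), storage_per_day_mb(200, interval))
```

Under those assumptions, 200 points sampled every second produce 17,280,000 values a day, while the same points sampled once a minute produce 288,000. Running the numbers before the logging starts is a cheap way to decide how fast is fast enough.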

It is very easy to bog down a communication network, Weber suggests. Respecting how fast you need your data is paramount to being a good network steward. Reading temperatures from a large process every second is not an essential thing to do.

If you are logging 200 points from a PLC, does the communication driver request each point as a separate request, or does it bundle many data points into a single request? OPC tries to be smart about how it grabs data, but it depends on the vendor and the drivers. Network saturation can happen quickly if you are not careful.
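The difference between one request per point and bundled requests is easy to sketch. The Python below is only an illustration: `bundle`, `requests_per_second` and the limit of 50 points per request are hypothetical, and a real driver's block size depends on the protocol and the vendor.

```python
from math import ceil

def bundle(tags: list, max_per_request: int) -> list:
    """Split a tag list into groups that could each travel in one request."""
    return [tags[i:i + max_per_request] for i in range(0, len(tags), max_per_request)]

def requests_per_second(n_points: int, max_per_request: int, scan_interval_s: float) -> float:
    """Request rate needed to read every point once per scan interval."""
    return ceil(n_points / max_per_request) / scan_interval_s

if __name__ == "__main__":
    tags = [f"N7:{i}" for i in range(200)]   # made-up register addresses
    print(len(bundle(tags, 1)), "requests per scan, one point each")
    print(len(bundle(tags, 50)), "requests per scan, bundled")
```

With 200 points on a one-second scan, a driver that asks for each point separately sends 200 requests a second. One that can pack 50 points into a request sends four.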

A PLC holds us hostage for data because it needs to be polled. Even if SCADA says you can just log an exception or a change, the driver still has to talk to the device to figure out if the data has changed so it can write it out to the database. If you can, set up an unsolicited communication method so that the device tells the SCADA when something has changed.
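Logging by exception can be sketched as a simple filter that sits between the poll and the database. This is a minimal Python illustration, assuming a fixed deadband per logger; the class name, the tag "TT101" and the 0.5-degree deadband are made up for the example. Note that every reading still has to be fetched from the device. Only the database write is skipped.

```python
class ChangeFilter:
    """Pass a value through only when it moves more than `deadband` from the last logged value."""

    def __init__(self, deadband: float = 0.0):
        self.deadband = deadband
        self._last_logged = {}  # tag -> last value written to the log

    def should_log(self, tag: str, value: float) -> bool:
        last = self._last_logged.get(tag)
        if last is None or abs(value - last) > self.deadband:
            self._last_logged[tag] = value
            return True
        return False

def log_by_exception(readings, deadband: float):
    """Return only the (tag, value) pairs that would be written to the database."""
    change_filter = ChangeFilter(deadband)
    return [(tag, value) for tag, value in readings if change_filter.should_log(tag, value)]

if __name__ == "__main__":
    polled = [("TT101", 20.0), ("TT101", 20.1), ("TT101", 20.6),
              ("TT101", 20.7), ("TT101", 19.9)]
    print(log_by_exception(polled, deadband=0.5))
```

Five polls produce three database rows here. An unsolicited scheme would move the same comparison into the device, so the network carries only the changes.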

Weber disagrees with me about logging only the changes. He maintains that most users want to see the full sample set, not just the points where a value changed, for reasons that only the users know. I don't think we need a mountain of data points to get to where we need to be. But it is important to be sure that we get the right stuff.

Using Microsoft Access for a gazillion-record database isn't the right thing to do. Use a true SQL database server as a minimum. Once the data is in there, you need to do something with it. Normally, it is a report that is based on a query. Is the database on a network with multiple users? It all makes a difference.
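A report based on a query might look like the sketch below. It uses Python's built-in sqlite3 module only so the example runs anywhere; it is not a recommendation of SQLite for a plant database. The `samples` table, its columns and the tag name are assumptions for illustration.

```python
import sqlite3

def hourly_report(rows):
    """Load (tag, timestamp, value) rows and return min/avg/max/count per tag per hour."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (tag TEXT, ts TEXT, value REAL)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        """
        SELECT tag,
               strftime('%Y-%m-%d %H:00', ts) AS hour_start,
               MIN(value), AVG(value), MAX(value), COUNT(*)
        FROM samples
        GROUP BY tag, hour_start
        ORDER BY tag, hour_start
        """
    )
    report = cursor.fetchall()
    conn.close()
    return report

if __name__ == "__main__":
    data = [
        ("TT101", "2011-12-05 08:00:00", 20.0),
        ("TT101", "2011-12-05 08:30:00", 22.0),
        ("TT101", "2011-12-05 09:00:00", 21.0),
    ]
    for line in hourly_report(data):
        print(line)
```

The query does the summarizing, so the report pulls a handful of rows instead of every raw sample. The same SELECT would need only minor changes to its date handling to run on a server database.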

Open source is coming on strong in the database world. The Apache Hadoop framework, for example, lets developers write distributed applications that store and process very large data sets across many machines, whether local or in the cloud.

Reporting tools such as Excel aren't as feature-rich as certain open-source tools. In fact, a content-managed web page (from WordPress, for instance) can serve as a real-time report using various tools and data. So it is a good thing to figure out how much data you have or need, the enterprise it will live in, and the database format the data will reside in. The devices and network will limit your scope.

Reporting on that data is the next step, and it could be real-time or historical. And there are historians everywhere, which are just databases, essentially. Once you have the data and reports, then what? Liability issues, process issues and maintenance issues can all be understood with good data.

Data is like good wine. Have just enough and you'll be a happy camper. Have too much and you'll get a hangover.