Are big data errors throwing off your analytics?

As devices, machines and people are being connected to well-planned cloud architectures, are we just creating big data errors that will throw off our analytics?

By Dave Perkon, technical editor

I’ve seen big data defined as huge sets of data that can be analyzed, using powerful computers and programs, to discover trends, patterns and associations. It's being used today to help understand human behavior and choices—to sell you something. It also feeds the analytics of predictive maintenance and other real-time knowledge to help improve and optimize production. Nowhere have I seen it defined as accurate data.

There are many uses of, and warnings about, big data analytics. No doubt collection and analysis of the data will develop and improve over time. At the same time, there are many risks in using the data.


As the yottabytes (1 trillion terabytes) of big data continue to expand, how many data errors are we efficiently collecting? Yottabytes say data is as data does. I'm sure errors are present in a significant percentage of the big data, if my recent experiences buying a house and visiting a doctor are any indication.

The home mortgage application started out with much of the data guessed at by the loan officer. He said it wasn't important for the information to be accurate. The errors included an incorrect Social Security number and income; even my wife's race and ethnicity were wrong. Two of the errors were a big deal and almost immediately caused loan-qualification issues. He then asked us to just electronically sign the incorrect documents. We didn't sign anything until it was corrected. I'm sure many sign anyway, locking in the digitized data errors.

Big data errors are also being efficiently created by our doctors and their staffs. My wife and I both have a diagnosis in our electronic medical records that can only be meant for other patients. The pharmacy we use has errors, too, and it cannot seem to correct them. How hard can it be to change a phone number? Apparently, multiple somewhat-duplicate big-data sources and easy connections make it difficult. I won't even bother to tell you about the data errors the homeowners insurance company had on my house. Much of the data is garbage, and maybe it always has been.


It's time to call out the elephant in the room. Hopefully the elephant is not on the plant floor. The big data that should be simple to record accurately in digital form is very wrong, everywhere. I pointed out to the creators of these big data errors, all humans, that if they are taking the time to record the data, it should be accurate; otherwise, the rest of us will be taking the time to correct it.

Big accuracy

I wonder what happens when there are yottabytes of possible errors in big data. Garbage in, garbage out, and clearly this cannot happen on the plant floor. It seems risky to make decisions based on data sets collected by people who think errors are okay. However, since data can be collected manually, semi-automatically or automatically, there is hope.

Step one, before collection, management or analysis, is to ensure the big data is correct, or we are wasting a lot of time and creating future errors. The big data mindset must encourage accurate data, or a whole software industry will be created to remove the big-data errors.

I hope the designers and programmers can limit the manual data collection because it is too easy to enter errors accidentally, on purpose or because they think it doesn't matter. The same is true for data collected semi-automatically via a pushbutton on an HMI, for example. Select any cause-of-failure answer, right or wrong, and the machine will continue to help production, right?

Even if the data is collected automatically, it must be properly defined and configured. And it must be worth collecting, or we'll be collecting worthless data with errors. Spend the time early in the design process to carefully identify and specify how data will be collected and how the accuracy of the data will be validated. Only then can management and analysis of the data follow.
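As a sketch of what "properly defined and configured" collection might look like, here is a minimal record validator that checks data at the point of collection, before it reaches the historian or the analytics. The field names, ranges and record layout are hypothetical, not from any particular system.

```python
# Hypothetical validation rules for a machine-data record.
# Define the acceptable ranges up front, during design, not after the fact.
VALID_RANGES = {
    "spindle_rpm": (0, 24000),
    "coolant_temp_c": (5, 95),
}

REQUIRED_FIELDS = {"machine_id", "timestamp", "spindle_rpm", "coolant_temp_c"}

def validate_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field}={value} outside [{low}, {high}]")
    return problems

# A record with an impossible spindle speed gets flagged instead of stored.
record = {"machine_id": "M7", "timestamp": "2017-06-01T10:00:00",
          "spindle_rpm": 99999, "coolant_temp_c": 40}
print(validate_record(record))
```

The point isn't the code; it's that the ranges and required fields exist at all, and were decided early in the design process rather than inferred later from a pile of suspect data.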

I'm pretty sure that even if the big data set is error-free, the analytics will create errors. Just because someone said the results show it will work doesn't mean it's true or correct. It's the same with the results of big data analytics. Add the big pile of correlations that big data analytics loves to provide, enough that something is statistically bound to match purely by chance, and very believable but erroneous results become likely. And the big data errors will only multiply these big analytics errors.
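To see how chance correlations arise, here is a quick illustration with nothing but random numbers (none of the values come from any real data set): given enough unrelated noise series, the best-matching one can look convincingly related to the quantity you care about.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
# A "target" metric (say, 30 days of scrap rate) that is pure noise.
target = [random.random() for _ in range(30)]

# 500 unrelated random "metrics" with no real relationship to the target.
best = max(
    pearson(target, [random.random() for _ in range(30)])
    for _ in range(500)
)
print(round(best, 2))  # a strong-looking correlation, purely by chance
```

With 500 candidate metrics and only 30 samples each, the strongest coincidental correlation is routinely well above what would pass a casual eyeball test, which is exactly why analytics results need to be questioned before anyone acts on them.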

People sometimes create the errors, but at the same time, people can translate the results of big data analytics into usable information. Don't just believe what the big data is telling you. It's not always right. If it were, I'd be rich from predicting future stock values by now. Make sure the data is as accurate as possible; question and confirm the analysis of it; and only then use it to make decisions and maybe a few predictions.



Homepage image courtesy of photoexplorer at
