Technology -> Big Data
By: Charles King, President, Pund-IT
Published: 21st March 2013
Copyright Pund-IT © 2013
GIGO (garbage in/garbage out) is one of the oldest and pithiest acronyms in IT. Reportedly coined by George Fuechsel, an IBM 305 RAMAC technician/instructor in New York, the term became widely used in the early 1960s and was emblematic of the fledgling IT industry. But the concept of GIGO is just as relevant today as it was half a century ago, perhaps even more so.
Why is that the case? Because in the burgeoning world of big data, the value of analyses intimately depends on the quality of the material analyzed. Consider it this way: Access to a large, even limitless amount of building materials can enable you to build a bigger house. But if the timber is rotten or the foundation stones are flawed, the entire structure is undermined.
That being the case, it seems odd that in the fervent enthusiasm around big data the subject of data quality governance doesn’t come up more often. It’s not rocket science by any means—understanding and defining data standards, and then consistently managing and cleansing information to meet those standards are common practices in traditional data warehouse and business intelligence efforts.
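The practice described above, defining data standards and then cleansing information to meet them, can be sketched in a few lines. The following is a minimal illustration, not any particular product's approach; the field names and rules are invented for the example.

```python
# A minimal sketch of rule-based data-quality checks: standards are
# expressed as per-field validation rules, and records that fail any
# rule are quarantined for cleansing. Field names and rules are
# illustrative assumptions, not a real schema.
import re

RULES = {
    "email": lambda v: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or "")),
    "country": lambda v: v in {"US", "UK", "DE"},  # stand-in reference list
}

def cleanse(records):
    """Split records into those meeting the standard and those needing repair."""
    clean, quarantine = [], []
    for rec in records:
        failures = [f for f, rule in RULES.items() if not rule(rec.get(f))]
        if failures:
            quarantine.append((rec, failures))  # record plus which rules it broke
        else:
            clean.append(rec)
    return clean, quarantine

good, bad = cleanse([
    {"email": "ann@example.com", "country": "US"},
    {"email": "not-an-email", "country": "FR"},
])
```

The point of the sketch is that the standards themselves are explicit and checkable, so the same rules can be applied consistently wherever the data lands.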
It may simply be that big data and the maintenance/management requirements of unstructured and semi-structured information assets are still evolving. But it also seems clear that without effective stewardship and governance, big data results and end users' trust will suffer. Modern technologies may allow and encourage the analysis of increasingly complex, ever larger volumes of information. But if data quality management is ignored, the resulting insights risk ending up on the garbage heap.
Posted: 22nd March 2013 | By James Kobielus :
Good discussion. Big Data doesn't change the "single version of the truth" equation--it only highlights the need to maintain data trustworthiness at extreme scales.
I call this imperative "peta-governance." Contrary to what many might think, you can indeed govern petabytes of data in a coherent manner. There is no inherent trade-off between the volume of the data set and the quality of the data maintained within.
Some believe that you can't scale out into the petabyte range without filling your Hadoop cluster, massively parallel data warehouse, and other nodes with junk data that is inconsistent, inaccurate, redundant, out of date, or nonconformed. That's simply not true.
The source of data quality problems in most organizations is usually the transactional source systems, whether those are your customer relationship management system, general ledger application, or whatever. These systems are usually in the terabyte range.
Any IT administrator who fails to keep the system of record cleansed, current, and consistent has lost half the battle. Sure, you can fix the issue downstream (to some degree) by aggregating, matching, merging, and cleansing data in intermediary staging databases. But the quality problem has everything to do with inadequate controls at the data's transactional source, and very little to do with the sheer volume of it.
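The downstream repair described above, matching and merging records in a staging store, might look like the following sketch. The key normalization, field names, and "newest value wins" survivorship rule are all assumptions made for illustration.

```python
# A hedged sketch of downstream match-and-merge in a staging database:
# records are matched on a normalized email key, then merged so that
# newer non-empty values survive. Field names and the survivorship
# rule are hypothetical, chosen only to illustrate the technique.
def normalize_key(rec):
    return rec["email"].strip().lower()

def merge(records):
    merged = {}
    # process oldest first so later (newer) records overwrite survivors
    for rec in sorted(records, key=lambda r: r["updated"]):
        current = merged.setdefault(normalize_key(rec), {})
        current.update({k: v for k, v in rec.items() if v})  # keep non-empty values
    return merged

golden = merge([
    {"email": "Ann@Example.com", "phone": "", "city": "Leeds", "updated": 1},
    {"email": "ann@example.com", "phone": "555-0100", "city": "", "updated": 2},
])
```

Even a toy version like this shows why such cleansing parallelizes well: each match key can be merged independently, so the work scales out across nodes.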
Downstream from the source of the problem, you can scale your data cleansing operations with a massively parallel deployment, but don't blame the cure for an illness that it didn't cause.
I'd like to call your attention to a two-part article on Big Data quality and governance that Tom Deutsch and I authored last year.
part 1: http://ibmdatamag.com/2012/08/big-data-data-qualitys-best-friend/
part 2: http://ibmdatamag.com/2012/08/big-data-data-qualitys-best-friend-3/
Posted: 26th March 2013 | By Charles King :
Good points, lucidly stated, James. I'll be sure to read your and Tom's articles.
Posted: 26th March 2013 | By David Corrigan :
This is a great article. We've noticed the same phenomenon of rushing to big data analysis without proper understanding or governance of the data - which is always a recipe for failure. There seems to be an assumption that governance, or quality, is only about structuring the data - and that the freedom from structure of big data technologies would "liberate" them from traditional governance needs. That is like changing your mode of transportation from a bicycle to an airplane but assuming that the law of gravity no longer applies - a bad assumption!
We have noticed a more mature view of governance taking hold. First, governance is an enabler of successful big data projects: by profiling and understanding all data (traditional and new sources), organizations can determine which data is of value for analysis, or, in some cases, which data is definitely not useful, and then let analytics do its work. Once data is understood, the appropriate level of governance (data quality, master data, lifecycle management, or privacy and security) may be applied.

The word "appropriate" is key. It isn't about governing all data to some absolute standard of correctness. In many big data cases, a minimum level of governance is required before proceeding to analysis; the level depends on the use case, but every use case has some need for governance. In keeping with your building analogy, we often borrow the building adage "measure twice, cut once", where measuring is understanding the data and applying appropriate governance, and cutting is big data analytics. As you point out, too many rush to make the cut and end up with unusable materials. Taking the time to understand your big data sources and determining what, if any, governance is required is ultimately what makes big data analytics projects successful.
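The "measure" step described in the comment above, profiling data to decide which fields merit governance before analysis, can be sketched simply. The completeness/cardinality metrics and the sample fields here are illustrative assumptions, not a prescribed methodology.

```python
# A minimal profiling sketch ("measure twice"): per-field completeness
# and cardinality help decide which fields are worth governing and
# analyzing, and which to exclude. Field data below is invented.
from collections import defaultdict

def profile(records):
    stats = defaultdict(lambda: {"filled": 0, "values": set()})
    for rec in records:
        for field, value in rec.items():
            s = stats[field]  # register the field even if every value is empty
            if value not in (None, ""):
                s["filled"] += 1
                s["values"].add(value)
    n = len(records)
    return {
        f: {"completeness": s["filled"] / n, "distinct": len(s["values"])}
        for f, s in stats.items()
    }

report = profile([
    {"id": 1, "segment": "retail", "fax": ""},
    {"id": 2, "segment": "retail", "fax": ""},
    {"id": 3, "segment": "", "fax": ""},
])
# a field with near-zero completeness (here "fax") is a candidate to
# exclude from analysis rather than govern to perfection
```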
Published by: IT Analysis Communications Ltd.
T: +44 (0)190 888 0760 | F: +44 (0)190 888 0761