I want to start this piece with the most important take-away for IT readers: take care that data governance does not get in the way of Big Data, rather than the other way around.
This may seem odd, given that I, among others, have been pointing out for some time that better data cleansing and the like are badly needed in enterprise data strategies in general. But data governance is not just a collection of techniques; it’s a whole philosophy of how to run your data-related IT activities. Necessarily, the IT department that focuses on data governance emphasizes risk: security risk, the risk of bad data, and the risk of letting parts of the business run amok in their independence and create a complicated tangle of undocumented data relationships. And that focus on risk can very easily conflict with Big Data’s focus on reward: on proactively identifying new data sources and digging deeper into the relationships between the data sources one has, in order to gain competitive advantage.
While there is no clear evidence that over-focus on data governance impedes Big Data strategies, and thereby the success of the organization, there is some suggestive data. Specifically, a recent Sloan Management Review article reported that the least successful organizations were those that focused on using Big Data analytics to cut costs and optimize business processes, while the most successful focused their Big Data analytics on understanding their customers better and using that understanding to drive new offerings. Data governance, as a risk-focused philosophy, is also a cost-focused and internally focused strategy. The task of carefully defining and controlling metadata seeks to cut the costs of duplicated effort and unnecessary bug fixes inherent in the Wild-West proliferation of line-of-business data stores. It can therefore constrain the spread of new, externally generated data types, such as social media data, that yield the greatest Big Data success for the enterprise.
Who’s to be master?
So, if we need to take care that data governance does not interfere with Big Data efforts, and yet things like data cleansing are clearly valuable, how can we coordinate the two better? I often find it useful in these situations to model the enterprise’s data handling as a sausage factory, in which indescribable pieces of data 'meat' are ground together to produce informational 'sausage'. I like to think of it as having six steps (more or less):
- Data entry – in which the main aim is data accuracy
- Data consolidation – in which we strive for consistency between the various pieces of data (accuracy plus consistency, in my definition, equals data quality)
- Data aggregation – in which we seek to widen the scope of users who can see the data
- Information targeting – in which we seek to make the data into information fitted to particular targeted users
- Information delivery – in which we seek to get the information to where it is needed in a timely fashion
- Information analysis – in which we try to present the information to the user in a format that allows maximum in-depth analytics.
Note that data governance as presently defined appears to affect only the first two steps of this process. And yet my previous studies of the sausage factory suggest that all of the steps should be targeted: improving only the first two offers only minor improvements in a process that tends to 'lose' ¾ of the valuable information along the way, with every step adding substantially to the loss.
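To make the arithmetic concrete, here is a minimal sketch of the sausage factory as a chain of stages that each retain some fraction of the valuable information. The equal-loss-per-stage assumption, the function names, and the per-stage numbers are purely illustrative and are not taken from my studies; only the six stage names and the rough ¾ overall loss come from the text above.

```python
# Hypothetical model of the six-step "sausage factory".
# Assumption (illustrative only): every stage loses the same fraction.
STAGES = [
    "data entry",
    "data consolidation",
    "data aggregation",
    "information targeting",
    "information delivery",
    "information analysis",
]

OVERALL_RETAINED = 0.25  # rough figure from the text: about 3/4 is lost end to end


def per_stage_retention(overall, n_stages):
    """Retention per stage if all stages lose the same fraction (an assumption)."""
    return overall ** (1.0 / n_stages)


def retained(stage_retention):
    """Fraction of valuable information that survives every stage in sequence."""
    result = 1.0
    for stage in STAGES:
        result *= stage_retention[stage]
    return result


r = per_stage_retention(OVERALL_RETAINED, len(STAGES))
baseline = {stage: r for stage in STAGES}

# Best case for governance alone: the first two stages lose nothing at all.
governed = dict(baseline)
governed["data entry"] = 1.0
governed["data consolidation"] = 1.0

print(f"per-stage retention: {r:.3f}")
print(f"baseline retained:   {retained(baseline):.3f}")
print(f"first two perfected: {retained(governed):.3f}")
```

Under these assumptions, even flawless data entry and consolidation would leave the factory retaining only about 40 percent of the valuable information, so most of the loss would still sit in the four later stages. Real per-stage losses are unlikely to be equal, so treat this as a way of framing the question rather than as a measurement.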
How does this apply to Big Data? The most successful users of Big Data, as noted above, actively seek out external data that is dirty and unconsolidated and yet is often more valuable than the organization’s 'tamed' data. Data governance, as the effective front end of the sausage factory, must therefore not exclude this Big Data in the name of data quality—it must find ways of making it 'good enough' that it can be fed into the following four steps. Or, as one particular database administrator told me, 'dirty' data should not just be discarded, as it can tell us about what our sausage factory is excluding that we need to know.
Data governance should also not, if at all possible, interfere with the four steps following data quality assurance. Widening scope widens security risks; but the benefits outweigh the risks. Information delivery that involves a new data type risks creating a 'zone of ignorance' where database governors don’t know what their analysts are doing; but the answer is not to exclude the data type until that distant date when it can be properly vetted.
Much of this can be done by using a data discovery or data virtualization tool to discover new data types and incorporate them in an enterprise metadata store semi-automatically. But that is not enough; IT needs to ensure that data governance accepts that Big Data exclusion is not an option and that the aim is not pure data but rather the best balance of valuable Big Data and data quality.
In Lewis Carroll’s Through the Looking-Glass, Humpty Dumpty uses the word “glory” in a very odd way, and Alice objects that he should not be allowed to. His reply, in paraphrase: “The question is, who’s to be master, you or the word?” In a similar way, if you practice both data governance and Big Data, you need to understand that you, with your need for Big Data customer insights from the outside world, need to be master, not the data governance enforcer.