The main thrust of Big Data is analytics. The simple fact is that until recently we never had the ability to analyze very large data heaps, because of practical barriers: the cost of buying and networking the hardware, the absence of analytics software that could run in parallel across server grids, the lack of fast scale-out databases, the lack of cloud deployment options, and so on. Most of what was needed has gradually emerged, and because of that Big Data is off and running.
The Upside and the Downside
We can look at this in two ways, from the positive side and the negative side. The positive side requires little explanation. Companies can quickly assemble Big Data pools that were previously too expensive or too slow to work with. In analyzing them they may discover extremely valuable knowledge that they can apply to good effect. This is the promise of Big Data, and everyone knows it.
I'll deal with the downside simply by providing a list:
- There is a whole new software stack to get to grips with, starting, for many companies, with Hadoop and its many children (HBase, Pig, Hive, Mahout, Flume, etc.). Hadoop is a little like the old woman who lived in a shoe: "She had so many children, she didn't know what to do." Right now Hadoop is highly capable but immature. There are effective ways to use it, but it is easy to abuse.
- There is a data flow issue. The old data flow (transactional systems -> data cleansing and staging -> data warehouse -> data marts -> personal data extracts) is now being displaced. That might be fine if it were simply a matter of rip and replace.
- Only people with extreme courage and an overdose of optimism will rip and replace, for many reasons. Ripping and replacing databases is rarely simple, and it usually yields few business payoffs beyond reducing the number of legacy systems.
- There is no proven "new" data flow model. The old data flow was about internal data. The new world embraces external data (partner data, unstructured data, social media data, web data, even event data). It is no longer clear what corporate data actually is.
- In-memory processing is a new factor in all of this. What it provides is speed, but it is not yet clear which in-memory technologies will prove strategic. We can now hold quite large databases in memory, treat memory as the prime source of that data, and process it far faster than ever before. But no consensus has yet emerged on the best way to leverage memory.
- Master Data Management was a credible idea, even though in practice it has been hard to pull off. When you add in unstructured data, social media data, and other external data, it is not clear how you reach a consensus on metadata and enable broad usage of such data.
- If you are wondering why I've not yet mentioned Data Governance, it's because I was saving the best for last. There are four strands worth worrying about: data security (who can use what data, and who can even view it), compliance (there are data laws, there are usage rules), data cleansing (a thorny problem when the data comes from other sources; there needs to be an audit trail) and data lifecycle (when do we throw the data away?).
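The old data flow mentioned in the list above can be pictured end to end. Here is a minimal sketch in plain Python, using toy in-memory stages; every function and field name is hypothetical, not any vendor's API:

```python
# Toy model of the traditional data flow:
# transactional systems -> cleansing/staging -> warehouse -> marts -> extracts.

def cleanse(rows):
    """Staging: normalize field names and drop malformed rows."""
    normalized = [{k.strip().lower(): v for k, v in row.items()} for row in rows]
    return [r for r in normalized if r.get("amount") is not None]

def load_warehouse(rows):
    """Warehouse: one consolidated store of cleansed facts."""
    return {"fact_sales": rows}

def build_mart(warehouse, region):
    """Data mart: a department- or subject-specific slice."""
    return [r for r in warehouse["fact_sales"] if r.get("region") == region]

def personal_extract(mart, fields):
    """Personal extract: only the columns one analyst needs."""
    return [{f: r[f] for f in fields} for r in mart]

# Transactional source data flowing through each stage in turn.
source = [
    {"Region": "emea", "Amount": 100},
    {"Region": "apac", "Amount": None},   # malformed; dropped in staging
]
staged = cleanse(source)
warehouse = load_warehouse(staged)
mart = build_mart(warehouse, "emea")
extract = personal_extract(mart, ["amount"])
```

The point of the sketch is structural: each stage consumes only the output of the one before it, which is exactly why ripping out any middle stage forces changes everywhere downstream.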
The World of Events
Behind all of this, a fundamental change is occurring in the world of IT. We are moving from the processing of transactions to the processing of events. Most machine-generated data, for example, is event data. Transactions are events too, but they are now in the minority. When we move to the "Internet of Things" there will be an explosion of event data way beyond its current volume, which already exceeds transaction data by a wide margin.
Put simply, Big Data is about events. They have become the atoms of data.
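One way to picture transactions shrinking into a minority is to model everything, transactions included, as timestamped events of different kinds. A minimal sketch (the `Event` type and its field names are my own illustration, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    source: str    # where it happened: app, sensor, payment system
    kind: str      # e.g. "click", "temperature.read", "payment"
    payload: dict  # whatever the event carries
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# A mixed stream: a transaction is just one kind of event among many.
stream = [
    Event("web", "click", {"page": "/pricing"}),
    Event("sensor-12", "temperature.read", {"celsius": 21.4}),
    Event("payments", "payment", {"amount": 49.99, "currency": "USD"}),
]

# The old transactional view is now a filter over the event stream.
transactions = [e for e in stream if e.kind == "payment"]
```

The design choice worth noting is that the transaction is recovered from the stream by filtering, rather than being the primary record type, which mirrors the shift the paragraph above describes.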