Big Data is mentioned in IT articles everywhere; it even makes appearances in the national press and on TV. It is famous. Naturally, IT professionals are mesmerized by it. There are challenges. There are opportunities. There are prizes to be won. But sadly, many IT professionals have no very large data heaps to manage and exploit. What gives?
Is something dramatically new happening with Big Data? Specifically: “Is the current growth in data volumes greater than it has been historically?” The short answer to this is “No, not really.”
The problem of data obesity
Examine data growth estimates and measurements by IDC and Gartner and you see that, historically, the volume of data that we store grows at about 55% every year. When the economy is in recession, it grows less, and in boom times it grows more, but it doesn’t vary much from that figure. That’s how it has been for years and it is also the case now. It grows at a hell of a clip all the time.
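A 55% compound growth rate is easy to underestimate. A quick back-of-the-envelope calculation (the 55% figure is the only input, taken from the estimates above) shows what it implies:

```python
import math

growth = 0.55  # average annual data growth rate cited above

# How long until the data you manage doubles?
doubling_time = math.log(2) / math.log(1 + growth)

# How much bigger is the data heap after a decade?
ten_year_factor = (1 + growth) ** 10

print(f"Doubling time: {doubling_time:.1f} years")        # about 1.6 years
print(f"Ten-year growth factor: {ten_year_factor:.0f}x")  # about 80x
```

In other words, at the historical average rate your data roughly doubles every year and a half, and a decade turns one terabyte into about eighty.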
Nevertheless, there are companies (Google, Yahoo!, Facebook, LinkedIn and quite a few others) that experience data growth rates much higher than that. But "there's nothing new under the sun," as Solomon once said. There have always been such 'data-obese' companies. In previous IT eras some companies (banks, telcos, retailers, etc.) experienced data growth that was well above the average, just as some experienced modest growth rates.
Take a look at the graph above, which was constructed by averaging the decline in the costs of various forms of storage over the past 5 years and extrapolating from there. The dotted red line, which represents data growth, easily outpaces the decline in the raw costs of all storage media. The costs of storage media, by the way, do not represent the whole cost of storage. In fact they represent maybe 25–30% of the total; other costs include manual labor (the cost of which is not falling but gradually rising), the cost of data center space, power costs, and the costs of downtime and recovery.
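You can see the squeeze in a couple of lines of arithmetic. The 55% growth rate comes from the estimates above; the 25% annual decline in media cost per gigabyte is an assumed illustrative figure, since the actual rate varies by medium:

```python
data_growth = 0.55   # average annual data growth (from the text)
cost_decline = 0.25  # assumed annual decline in media cost per GB
                     # (illustrative; the real figure varies by medium)

# Next year's media bill relative to this year's:
spend_multiplier = (1 + data_growth) * (1 - cost_decline)
print(f"Media spend changes by {spend_multiplier - 1:+.0%} per year")
```

Even under that assumption the media bill alone grows about 16% a year, and media is only a quarter or so of the total cost, with the labor component rising rather than falling.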
The question then is whether a company has a genuine 'Big Data' project going on, involving much larger volumes of data than normal, or whether it is simply experiencing average data growth. Either way, there is unrelenting pressure to manage an ever-increasing collection of data.
Shoop, Shoop, Hadoop
When we think of data, we tend to think of the databases and files where the data is stored. But over the past two decades most transactional data has come to have multiple uses beyond the application that creates it. It is shared. It has a life cycle. This leads us to the data warehouse. The once beguiling idea that we could syphon off data from transactional systems and accumulate it in a central data store that fed multiple BI applications has experienced a bruising encounter with reality in recent years.
In some sites, the attempt at a single data warehouse was sunk beneath the need to also have operational data stores (which held more up-to-date data) and the disease of 'spread marts,' with joyously empowered users and corporate departments breeding data marts like rabbits. The single data warehouse—some companies had multiple ones anyway—turned into a staging point for the creation of other data heaps. Big fleas had little fleas on their backs to bite 'em, and little fleas had lesser fleas, and so ad infinitum. From a service-to-the-user perspective this may have been healthy, but from a data management perspective it was not. As transactional data volumes grew, so did data warehouse volumes, and data mart volumes naturally followed suit. Data growth happened everywhere.
So, with an impressive drum roll, enter Hadoop and its oddly named children: HBase, Hive, Pig, Mahout, Flume, and many more (there was an old woman who lived in a shoe...). Whether a company uses Hadoop for new data, such as click streams from web sites, social media data, or floods of data from RFID tags, or whether it is simply being used to accommodate data that would otherwise be archived, is not the issue here. The issue is that Hadoop is a data lake that can take data from anywhere and everywhere without any need for preparation. It is a data lake that has tributaries, and it can be used as a data reservoir. And as a consequence, the traditional data warehouse idea is shattered.
This is what is new about Big Data. And it is reinforced by the fact that Hadoop, although limited in many ways when compared to a large traditional database engine, is capable of ingesting any kind of data. It can impersonate a database, it can impersonate a file system, and it can even mix the two together. Its popularity among IT folk is a testament to that, even though, in many ways, the whole software stack is immature. Hadoop has become a natural part of the IT ecosystem, and it is not going to suddenly disappear.
Consider the data flows
Imagine for one glorious moment that we have a single distributed data store that serves every single application we run. Imagine that different applications: office applications, transactional applications, BI applications, mobile applications—in short, all applications—use this one distributed data store to serve their every need. Within that imaginary data store there will be a great deal of activity, partly to keep all data consistent as it is updated and partly to create fast caches of data for specific applications—BI applications and analytic applications, for example—so that the data is served at the right latency. It will also need to provide backup and recovery.
That was a pleasant dream; now let’s return to reality. Oddly enough, that's the kind of data arrangement we actually have 'logically.' Sadly, the data is not in a single distributed data store; it is in multiple different data stores, some of which are built to provide the right speed of data access. In between these various data stores (traditional databases, column-store databases, Hadoop, document stores, etc.) we have a data transport system. Constant data flows feed staging areas or directly feed databases or data stores. And like the databases, these data flows have service levels that they need to meet.
That’s difficult enough, but just add in the average 55% data growth rate and you will quickly conclude that the database performance will deteriorate and the data flows will go slower as the data volumes increase, so there’s a constant need to buy more iron and to have faster software just to keep pace. It’s a balancing act, and if you fail to balance it, the service to the user begins to deteriorate.
So in summary, or perhaps in consolation, I suggest that irrespective of the actual volume of data, you have Big Data if you are encountering the problems I just described. Big Data is only big if you are having big problems managing it. Really Big Data means really Big Problems.