By: Robin Bloor, Co-Founder, The Bloor Group
Published: 20th March 2013
Copyright The Bloor Group © 2013
Big Data is mentioned in IT articles everywhere; it even makes appearances in the national press and on TV. It is famous. Naturally, IT professionals are mesmerized by this. There are challenges. There are opportunities. There are prizes to be won. But sadly, many IT professionals do not have to manage and exploit very large data heaps. What gives?
Is something dramatically new happening with Big Data? Specifically: “Is the current growth in data volumes greater than it has been historically?” The short answer to this is “No, not really.”
The problem of data obesity
Examine data growth estimates and measurements by IDC and Gartner and you see that, historically, the volume of data that we store grows at about 55% every year. When the economy is in recession, it grows less, and in boom times it grows more, but it doesn’t vary much from that figure. That’s how it has been for years and it is also the case now. It grows at a hell of a clip all the time.
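To put that rate in perspective, a quick compounding calculation helps. The sketch below assumes growth stays constant at roughly 55% a year, which is a simplification of the figure quoted above; the function names are illustrative only.

```python
import math

ANNUAL_GROWTH = 0.55  # approximate yearly growth in stored data, as quoted above


def years_to_multiply(factor, rate=ANNUAL_GROWTH):
    """Years needed for stored data to grow by `factor` at a constant annual rate."""
    return math.log(factor) / math.log(1.0 + rate)


def growth_after(years, rate=ANNUAL_GROWTH):
    """Multiple of today's volume held after `years` of constant growth."""
    return (1.0 + rate) ** years


if __name__ == "__main__":
    print(f"Doubling time: {years_to_multiply(2):.2f} years")
    print(f"Tenfold time:  {years_to_multiply(10):.2f} years")
    print(f"Volume after 5 years: {growth_after(5):.1f}x today's")
```

At a steady 55%, the volume doubles in a little over a year and a half and is roughly nine times larger after five years.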
Nevertheless, there are companies (Google, Yahoo!, Facebook, LinkedIn and quite a few others) that experience data growth rates much higher than that. But "there's nothing new under the sun," as Solomon once said. There always were such 'data-obese' companies. In previous IT eras some companies (banks, telcos, retailers, etc.) experienced data growth that was well above the average, just as some companies experienced modest growth rates.
Take a look at the graph above, which was constructed by averaging the decline in the costs of various forms of storage over the past 5 years and extrapolating from there. The dotted red line, which represents data growth, easily outpaces the decline in the raw costs of all storage media. The costs of storage media, by the way, do not represent the whole cost of storage. In fact they represent maybe 25–30% of the costs; other costs include manual labor (the cost of which is not falling but gradually rising), data center space, power, and downtime and recovery.
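The arithmetic behind that gap can be sketched as follows. The 55% growth figure comes from the text; the 25% annual decline in media cost per terabyte is purely a hypothetical placeholder, since the graph's actual figures are not reproduced here.

```python
DATA_GROWTH = 0.55         # annual growth in stored data (figure quoted in the article)
MEDIA_COST_DECLINE = 0.25  # ASSUMPTION: hypothetical yearly fall in media cost per TB


def media_spend_multiplier(years, growth=DATA_GROWTH, decline=MEDIA_COST_DECLINE):
    """Factor by which raw media spend changes after `years` of growth and price decline."""
    per_year = (1.0 + growth) * (1.0 - decline)
    return per_year ** years


def break_even_decline(growth=DATA_GROWTH):
    """Yearly price decline needed to keep raw media spend flat at the given growth."""
    return 1.0 - 1.0 / (1.0 + growth)


if __name__ == "__main__":
    print(f"Media spend after 5 years: {media_spend_multiplier(5):.2f}x")
    print(f"Break-even price decline:  {break_even_decline():.1%} per year")
```

Under these assumptions, media prices would have to fall by about 35% a year just to hold raw media spend flat, and that covers only the 25–30% of storage cost attributable to media.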
The question then is whether a company has a genuine 'Big Data' project going on involving much larger volumes of data than normal, or whether they are simply experiencing average data growth. Either way, there is an unrelenting pressure to manage an ever increasing collection of data.
Shoop, Shoop, Hadoop
When we think of data, we tend to think of the databases and files where the data is stored. But over the past two decades most transactional data has tended to have multiple uses, rather than being used only by the application that creates it. It is shared. It has a life cycle. This leads us to the area of data warehousing. The once beguiling idea that we could syphon off data from transactional systems and accumulate it into a central data store that fed multiple BI applications has experienced a bruising encounter with reality in recent years.
In some sites, the attempt at a single data warehouse was sunk beneath the need to also have operational data stores (that had more up-to-date data) and the disease of 'spread marts,' with joyously empowered users and corporate departments breeding data marts like rabbits. The single data warehouse (some companies had multiple ones anyway) turned into a staging point for the creation of other data heaps. Big fleas had little fleas on their backs to bite 'em, and little fleas had lesser fleas, and so ad infinitum. From a service-to-the-user perspective this may have been healthy, but from a data management perspective, it was not. As transactional data volumes grew, so did the data warehouse volumes, and the data mart volumes naturally followed suit. Data growth happened everywhere.
So, with an impressive drum roll, enter Hadoop and its oddly named children: HBase, Hive, Pig, Mahout, Flume, and many more (There was an old woman who lived in a shoe...). Whether a company uses Hadoop for new data, such as click streams from web sites, social media data or floods of data from RFID tags, or whether it is simply being used to accommodate data that would otherwise be archived, is not the issue here. The issue is that Hadoop is a data lake that can take data from anywhere and everywhere without any need of preparation. It is a data lake that has tributaries and it can be used as a data reservoir. And as a consequence, the traditional data warehouse idea is shattered.
This is what is new about Big Data. And it is reinforced by the fact that Hadoop, although limited in many ways when compared to a large traditional database engine, is capable of ingesting any kind of data. It is capable of impersonating a database and it is capable of impersonating a file system, and even mixing the two together. Its popularity among IT folk is a testament to that, even though, in many ways, the whole software stack is immature. Hadoop has become a natural part of the IT ecosystem, and it is not going to suddenly disappear.
Consider the data flows
Imagine for one glorious moment that we have a single distributed data store that serves every single application we run. Imagine that office applications, transactional applications, BI applications, mobile applications (in short, all applications) use this one distributed data store to serve their every need. Within that imaginary data store there will be a great deal of activity, partly to keep all data consistent as it is updated and partly to create fast caches of data for specific applications (BI applications and analytic applications, for example) so that the data is served at the right latency. It will also need to provide backup and recovery.
That was a pleasant dream; now let’s return to reality. Oddly enough, that's the kind of data arrangement we actually have 'logically'. Sadly, the data is not in a single distributed data store; it is in multiple different data stores, some of which are built to provide the right speed of data access. In between these various data stores (traditional databases, column store databases, Hadoop, document stores, etc.) we have a data transport system. Constant data flows feed staging areas or directly feed databases or data stores. And like the databases, these data flows have service levels that they need to meet.
That’s difficult enough, but just add in the average 55% data growth rate and you will quickly conclude that the database performance will deteriorate and the data flows will go slower as the data volumes increase, so there’s a constant need to buy more iron and to have faster software just to keep pace. It’s a balancing act, and if you fail to balance it, the service to the user begins to deteriorate.
So in summary, or perhaps in consolation, I suggest that irrespective of the actual volume of data, you have Big Data if you are encountering the problems I just described. Big Data is only big if you are having big problems managing it. Really Big Data means really Big Problems.
Posted: 20th March 2013 | By Azana Baksh :
Nice article, Robin. One other open source technology to mention is HPCC Systems, a data-intensive supercomputing platform for processing and solving big data analytical problems. Their open source Machine Learning Library and Matrix processing algorithms assist data scientists and developers with business intelligence and predictive analytics. Its integration with Hadoop, R and Pentaho extends further capabilities providing a complete solution for data ingestion, processing and delivery. In fact, a webhdfs implementation (web based API provided by Hadoop) was recently released. More at http://hpccsystems.com/h2h
Posted: 21st March 2013 | By Robin Bloor :
Azana, thanks for drawing my attention to HPCC Systems, I had not run across it. I personally think that machine learning is going to prove a very powerful direction for data analytics. We are only at the start of this trend right now, so it will be interesting to see how it develops. As regards machine learning, there is also Mahout and KNIME and commercial products like Skytree that are pushing the boundaries of what is possible. It may even become a crowded market.
Posted: 22nd March 2013 | By James Kobielus :
It's not clear how the reported IDC trend--volume of data that we store grows at about 55% every year--relates directly to Big Data. I suspect most of that new data is NOT flowing into big data, DWs, and other analytic databases. A lot of it is probably going into OLTP databases, content management systems, archives, and so forth. My educated hunch is that the volume of structured-data-centric DWs grows at a Moore's-Law-equivalent 50% every 18 months, not every year, and that the growth of unstructured-data-centric big-data platforms is much faster: perhaps 50% every 6 months.
Also, it's not clear that "data-obese" is a meaningful label. The term "obese" implies a normative judgment of "too big." But what's the yardstick of "too big" here? Consumes too much of your storage budget? How much is too much? What is "normal" anyway? The IDC graph doesn't state the most important metric: is the percentage of data storage costs in the IT budget going up, down, or staying the same?
I disagree with the assertion that the role of a consolidated data platform is unique to Hadoop. Clearly, it's long been the defining role of an enterprise data warehouse. Consequently, it doesn't "shatter" the "traditional data warehouse idea"--it reinforces and evolves it into the era of Big Data. Also, the fact that more data consolidation will happen on various Big Data platforms--MPP EDW, Hadoop playing EDW-like role, etc.--will control storage costs from verging into the "data-obesity" territory you cite.
I disagree with the assertion that the ideal of a "single data warehouse was sunk beneath the need to also have ODSs...and 'spread marts'." The ideal is still alive, but has become known, in some quarters, as a "logical data warehouse" with multiple specialized tiers, logical data marts, common data stewardship infrastructure, and virtualization/federation infrastructure. Physical centralization of all enterprise data into a single store has almost never been anybody's practical ideal. Citing that is invoking a convenient strawman.
Also, it's misleading to say that "Hadoop is a data lake that can take data from anywhere and everywhere without any need of preparation." In fact, many enterprises do high-volume multistructured-source ETL on Hadoop to prepare the data for loading into downstream analytic platforms. And many data-science sandboxes built on Hadoop rely on having multiple sources of data that are collected, cleansed, and otherwise prepared (via Hadoop-based ETL) on the same platform that supports model building and scoring.
On the latter point, see this blog I wrote last year: Data Scientists: How Big is Your Big Data Sandbox? (http://www.ibmbigdatahub.com/blog/data-scientists-how-big-your-big-data-sandbox)
Posted: 27th March 2013 | By Philip Howard :
Robin, I agree with Azana about HPCC (from Lexis Nexis): definitely worth looking at. The product has been around for a decade but only went open source in 2011 (Reed Elsevier, the parent company, don't really "get" big data or understand what they've got in HPCC, and blocked earlier moves to go open source).
Published by: IT Analysis Communications Ltd.