• Jump to Left Menu
  • Jump to Right Menu
  • Jump to Main Content
  • Jump to Footer
  • Accessibility Page
IT-Director.com Logo

 

Main navigation - go to a section of this website:

  • ARCHIVE
  • PAPERS
  • EVENTS
  • NEWSWIRE
  • BLOGS

  

Register For Membership | Member Login

 
 
DOMAINS
  • Business Issues
  • Channels
  • Enterprise
  • Services
  • SME
  • Technology
    • Applications
    • Big Data
    • Data Management
    • Infrastructure
    • Mobile
    • Personal Productivity
    • Security
    • Storage
    • Systems Mgmt
FEATURED EVENTS
  • Free Webinar - ISO 22301: The New Standard for Business Continuity Best Practice
    23rd May
    Webinar (online)
  • Telecoms Tech World
    4th June - 5th June
    London, United Kingdom
POPULAR PAPERS
  • FM, IT and Data Centres by Quocirca
  • Beyond Big Data - The New Information Economy by Quocirca
USEFUL LINKS
  • Last 7 Days
  • Archives
  • Top Articles
SHARE THIS PAGE
  • Delicious Icon Delicious
  • Digg Icon Digg
  • reddit Icon reddit
  • Facebook Icon Facebook
  • StumbleUpon Icon StumbleUpon
CONTENT FEED

Technology -> Data Management
RSS Feed:

RSS Icon

What is RSS?

RANDOM QUOTE
Raw Wit - "I live so far out of town the mailman mails me my letters." - Henny Youngman

PAGE TOOLS
ADVERTISEMENT
MORE FROM AUTHOR
  • May 2013
    IBM JSON
  • May 2013
    Busy bees at MapR
  • April 2013
    BLU Acceleration
  • April 2013
    What is Actian actually doing?
  • April 2013
    IBM boo-boo on big data
  • April 2013
    Big data storage options
  • March 2013
    De-duplication in a graph database
Analysis

What is Hadoop?

Philip Howard By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 2nd November 2011
Copyright Bloor Research © 2011
Logo for Bloor Research

This is the first of I don't know how many articles about "Big Data". I was going to start the first article by saying that big data is not the same as Hadoop but then I realised that I had better describe what Hadoop is first, as otherwise that statement wouldn't make sense. So you can take this as a sort of preface to the whole series.

Hadoop is a) an Apache Software Foundation development project and b) the original developer's son's stuffed toy elephant (hence the Hadoop logo and the menagerie of complementary projects named after animals). It has three bits to it: a storage layer, a framework for query processing and a function library. The last is not terribly important so I'll concentrate on the first two. There are also loads of things built on top of or around Hadoop (Hive, Pig, Zookeeper et al) that are not part of Hadoop per se - I'll come back to these in a further article.

The storage layer is HDFS (Hadoop distributed file system) and it stores data in blocks. Unlike a relational database, in which block sizes are typically 32Kb or less, in HDFS block sizes are, by default, 64Mb. This supports serial processing, as opposed to the random I/O that you find in a relational database. It is good for reading large volumes of data but useless for transaction processing or operational business intelligence.

Further, HDFS has redundancy built-in. Designed to run across a cluster of hundreds or thousands of low cost servers, HDFS expects those servers and their disk drives to break down on a regular basis. For this reason each data block is stored, by default, three times within a Hadoop environment. Note the implication of this: however much data you want to store you will require at least three times as much capacity as data (bearing in mind factors such as compression). This also has implications for processing functions, which are despatched to the servers on which the data resides rather than bringing the data to the processing, which is the norm in relational environments. A further point to note is that new data is always appended to what is already present: you cannot insert data.

OK, enough about HDFS for the time being, what about MapReduce? MapReduce is a programming framework for parallelising queries, which greatly speeds up the (batch - this is not suitable for real-time) processing of large datasets running, in conjunction with HDFS, on low-cost hardware. While it is usually thought of as having two steps (Map and Reduce) it actually has three. The first, the Map phase, reads the data and converts it into key/value pairs (a discussion of key/value databases will be the subject of a further article or articles). An example of a key/value pair might be "Name: Howard" "Employee number: 666" (just kidding) where Name is the key and Employee number is the value. Imagine that you had multiple employee files, then you would have a separate map task for each file.

Secondly, there is a process called Shuffle, which assigns the output from mapping tasks to specific Reduce tasks, which combine these results. All keys with the same value must be sent to the same reduce task.

There is obviously more to it than I have outlined here but these are the essentials. So, to return to the initial question: what is Hadoop? The short answer is that it is a support framework for MapReduce. The key word is "a": you can run MapReduce on Aster Data (Teradata) so you don't need Hadoop at all, you can deploy HBase, which is a column-oriented but non-relational database management system on top of HDFS, or you can deploy RainStor alongside HDFS, or you can replace HDFS with IBM's GPFS-SNC or there are various other options. Why you might want to do any of these things are subjects for another day but the important thing to remember at this stage is that the key element is the query processing provided by MapReduce rather than anything else.

Reader Comments

We have not received any comments against this entry. Why not be the first?

We automatically stop accepting comments 180 days after a post is published. If you would like to know more about this subject, please contact us and we'll try to help.

  • Contact
  • | Site Map
  • | Terms of Use
  • | Privacy Policy
  • | Cookie Policy

Published by: IT Analysis Communications Ltd.
T: +44 (0)190 888 0760 | F: +44 (0)190 888 0761