Hive, DataRush and Hadoop

By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 8th August 2011
Copyright Bloor Research © 2011

Hive provides SQL access to big data stored in Hadoop. However, it is extremely limited. For example, only equijoins are supported; neither indexes nor temporal types (dates, timestamps and so on) are supported; sub-queries in WHERE clauses are not allowed; and ORDER BY runs on a single reducer, which makes it very slow. And these are just a few examples. Moreover, native Hive isn't even multi-threaded (though ZettaSet's implementation is). So it is limited and it performs badly.
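
To see why a join on equality suits MapReduce so well, here is a minimal Python sketch of a reduce-side equijoin. It is not Hive's actual implementation; the function name, the toy data and the number of reducers are assumptions made for illustration. The point is that rows can be routed to a reducer by hashing the join key, so matching rows always meet on the same reducer. A condition such as "a.x < b.y" offers no such key to route on.

```python
from collections import defaultdict


def reduce_side_equijoin(left, right, num_reducers=4):
    """Join two lists of (key, value) pairs on equal keys, MapReduce style.

    Hypothetical helper for illustration only; not part of Hive or Hadoop.
    """
    # "Shuffle": send every row to the reducer chosen by hashing its join key.
    partitions = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    for key, value in left:
        partitions[hash(key) % num_reducers][key][0].append(value)
    for key, value in right:
        partitions[hash(key) % num_reducers][key][1].append(value)

    # "Reduce": each reducer pairs up the rows that share a key, independently
    # of every other reducer.
    joined = []
    for partition in partitions:
        for key, (left_values, right_values) in partition.items():
            for left_value in left_values:
                for right_value in right_values:
                    joined.append((key, left_value, right_value))
    return sorted(joined)


# Toy data (assumed): orders and customers keyed by customer id.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c")]
customers = [(1, "alice"), (3, "carol")]
result = reduce_side_equijoin(orders, customers)
```

Only customer 1 appears on both sides, so only that customer's two orders survive the join. The same sketch hints at the ORDER BY problem: a total ordering needs every row to pass through one place, which is exactly what hash partitioning avoids.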

You could say the same about Hadoop itself. If there is one thing that it is not famous for, it is performance (or, for that matter, manageability, ease of use or high availability, unless you use a distribution from someone like MapR or complementary software from a company such as ZettaSet).

So performance is an issue both for Hive and Hadoop.

Pervasive Software is aiming to address both of these issues, for anything data-intensive, through the use of DataRush. I have written about this before, as has my colleague David Norfolk, but a brief refresh is in order. Basically, DataRush is a high-performance engine that you could use for just about anything, but in this case it is used to speed up the performance of Hive, Hadoop or both. DataRush is seriously (and I mean seriously) fast, and the reason is that it has been designed to exploit all the parallelism implicit in multi-core servers.

As an aside, this is a major problem with almost all software vendors: they may write great software that parallelises across multiple servers, but almost no-one writes software that can also parallelise across multiple cores and scale to take advantage of however many cores you have. What's worse, performance typically degrades as you add cores if the software has not been designed for multi-core. To take a simple example, my son frequently complains that one of his favourite games runs slower on his quad-core laptop than it did when the game first came out, running on a single core. There is an overhead involved in having multiple cores, and unless software is written specifically to take account of those cores it will slow down.

Anyway, back to DataRush. DataRush is an engine that takes care of all that multi-core parallelism for you and lets you scale to however many cores you want (the company has 48-core servers running in its labs). Basically, you write your application and DataRush runs it for you, so that all that parallelism is hidden. Of course, the reason why most software developers don't write for multiple cores is that it's difficult. DataRush hides this complexity from you. In the case of Hive, getting back to the point, Pervasive reckons that in its first release what it calls TurboRush for Hive will run queries three times faster than Hive on its own. And you don't have to understand DataRush programming: that's all hidden away. Note that this doesn't fix the deficiencies in Hive: it just makes it run faster.

On the Hadoop front you can take either of two approaches. Either you can make MapReduce run faster by having it call out to DataRush, or you can implement DataRush in a distributed fashion in lieu of MapReduce, in conjunction with HDFS. The former approach runs some four times faster according to the MalStone B test (a standard performance benchmark that looks for security intrusion information within web logs), while the latter approach is an order of magnitude faster: 10 minutes against 0.5TB of data, versus 33 minutes for the former approach and 135 minutes for Hadoop alone, all on the same Hadoop cluster.
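
The speed-up claims can be checked with a little arithmetic. The short Python sketch below assumes my reading of the quoted timings: 135 minutes for plain Hadoop MapReduce, 33 minutes for MapReduce calling out to DataRush, and 10 minutes for DataRush running in place of MapReduce on HDFS. The constant and function names are made up for illustration.

```python
# Timings in minutes, as quoted for the MalStone B run against 0.5TB of data.
# The mapping of each figure to each approach is an assumption (see above).
HADOOP_MINUTES = 135
MAPREDUCE_WITH_DATARUSH_MINUTES = 33
DATARUSH_ON_HDFS_MINUTES = 10


def speedup(baseline_minutes, improved_minutes):
    """Return how many times faster the improved run is than the baseline."""
    return baseline_minutes / improved_minutes


calling_out = speedup(HADOOP_MINUTES, MAPREDUCE_WITH_DATARUSH_MINUTES)
replacing_mapreduce = speedup(HADOOP_MINUTES, DATARUSH_ON_HDFS_MINUTES)

print(f"MapReduce calling DataRush: {calling_out:.1f}x faster")
print(f"DataRush in lieu of MapReduce: {replacing_mapreduce:.1f}x faster")
```

On those figures the first approach comes out at a little over four times faster and the second at 13.5 times, which squares with "some 4 times" and "an order of magnitude".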

These figures are impressive. However, it's not just the performance. Of course you can build a 4,000-node Hadoop cluster, but if you can scale up the number of cores in each server by using DataRush, then you will need fewer servers: wouldn't you rather have a 500-node cluster of 32-core machines than those 4,000 quad-core machines? It will cost you less money, take up less floor space, and require less cooling and less power. Combine that with a ten or twelve times performance boost and you could be talking about a cluster that is measured in tens of servers, not hundreds or thousands, and that makes even more sense.
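
The sizing argument is easy to work through. The Python sketch below uses only the figures quoted above, plus one loudly stated assumption: that a ten or twelve times performance boost translates directly into needing ten or twelve times fewer nodes. Real clusters rarely scale that neatly, so treat the result as a back-of-the-envelope estimate. The function names are hypothetical.

```python
import math


def total_cores(nodes, cores_per_node):
    """Total number of cores in a cluster of identical machines."""
    return nodes * cores_per_node


def nodes_needed(cores_required, cores_per_node):
    """Smallest number of machines that supplies the required cores."""
    return math.ceil(cores_required / cores_per_node)


# 4,000 quad-core machines versus 32-core machines with the same core count.
baseline_cores = total_cores(4000, 4)
dense_nodes = nodes_needed(baseline_cores, 32)

# Assumption: an N-times performance boost means N times fewer nodes.
nodes_at_10x = math.ceil(dense_nodes / 10)
nodes_at_12x = math.ceil(dense_nodes / 12)

print(baseline_cores, dense_nodes, nodes_at_10x, nodes_at_12x)
```

The 4,000 quad-core machines and the 500 32-core machines both supply 16,000 cores, and under the linear assumption the boosted cluster comes down to somewhere between 42 and 50 machines: a cluster measured in tens.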

I have been impressed with DataRush for some time but, to be honest, it seemed like a solution looking for a problem. Well, the performance of Hive and Hadoop (and HBase: you can use DataRush as a high-speed loader) is a problem that needs addressing, and DataRush is doing exactly that.

Published by: Electronicdawn Ltd.