Technology -> Big Data
By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 21st March 2013
Copyright Bloor Research © 2013
It is apparent that there is significant demand for SQL in the NoSQL world. Whether it is HQL (the Hive version of SQL) or CQL (the Cassandra version) there are clearly people who like a declarative as opposed to a procedural approach to accessing big data.
However, there is a big problem here: the reason why SQL works as well as it does in traditional, usually relational, environments is that the databases it is accessing have optimisers. Those optimisers fundamentally do two things: first, they rewrite crappy SQL code into something that is as efficient as it can be. That's a big advantage: anybody can write poor code, whether you're using MapReduce or SQL, but with an optimiser that gets handled for you. Of course, there's a downside: it encourages sloppy programming but that's another story.
The second thing that a database optimiser does for you is to determine how best to perform joins and other SQL functions. Of course, you may not have the possibility of a join within a NoSQL database but that in itself is a restriction. Anyway, the optimiser is also optimising the performance of all the other SQL capabilities that are available and some of these, at least, will be available in NoSQL databases.
So the bottom line is that you aren't going to get much in the way of performance out of simply bolting SQL onto a NoSQL source unless you also design in an optimiser.
Interestingly, this is essentially what EMC has done with Hawq SQL in its Pivotal HD product, which is essentially a re-working of the EMC Greenplum parallel engine to support Hadoop. By coming from a conventional database background that already has an optimiser, extending or re-working it to support Hadoop environments is a great deal easier than simply creating a Hadoop optimiser from scratch, and this is precisely what Hawq offers.
Going a step further, it is worth remembering that Teradata's polymorphic storage for Aster Data provides you with the ability to use either a native Aster storage engine or an HDFS storage engine (or both) under Aster Data. And, of course, to do this effectively you need to extend the optimiser so that it supports the whole environment. See also my article "DB2: a relational epithet is no longer enough" in which I described the fact that DB2 already has three storage engines, two of which are NoSQL engines (XML storage and a graph store - though you would typically use XQuery and SPARQL to access these), and that it is likely to add more such stores in the future. Again, one of the reasons why adding new storage engines is no trivial matter is precisely because of the need to extend the optimiser to understand the different storage. However, once you do so you are likely to get much better performance than you would do from a NoSQL database with SQL but no optimiser.
Posted: 22nd March 2013 | By Robin Schumacher :
Interesting article. As an FYI: CQL with Cassandra is actually now faster than the prior Thrift interface (which still may be used with Cassandra) when prepared statements are used. See http://www.datastax.com/wp-content/uploads/2012/08/C2012-CQL-EricEvans.pdf for examples (performance graphs near the bottom of the presentation).
Posted: 28th March 2013 | By Niklas Bjorkman :
It is very true that more and more users of NoSQL realizes that we need better (standardized) ways to access the data. There are NoSQL databases that has build-in SQL optimizers, like Starcounter for instance. Running standard SQL on top of the objects in the database, using an optimizer, not only gives great performance, it also simplifies the way for users to access data.
We also noted that supporting standard SQL (92 or later), rather than a home brew version of SQL, is much appreciated by developers.
One problem is that the variety of products in the NoSQL market are huge. It can't be easy for a new ISV to select a database unless they really do their homework. Getting a NoSQL database that supports SQL is easy, but as you point out - are the SQL optimized? Then understanding how ACID applies to different databases. The ways of adding prefix or suffix to the C in ACID to squeeze "ACID compliant" into the description is sometimes on the cusp to ingenious.
Posted: 6th April 2013 | By Glenn :
I don't think that Philip is claiming that thrift is faster than CQL, rather that the NoSQL philosophy of speed over safety gets undone when you access NoSQL with something that looks like SQL. http://www.dynamicalsoftware.com/nosql/cassandra/cql goes into this in more depth and warns against developers starting to believe that csl is sql.
The messages above were all contributed by IT-Director.com readers. Whilst we take care to remove any posts deemed inappropriate, we can take no responsibility for these comments. If you would like a comment removed please contact our editorial team.
We automatically stop accepting comments 180 days after a post is published. If you would like to know more about this subject, please contact us and we'll try to help.
Published by: IT Analysis Communications Ltd.
T: +44 (0)190 888 0760 | F: +44 (0)190 888 0761