Teradata announced version 6.0 of Teradata Aster last year and, following a beta programme, it should be available shortly. If you have read any of the popular computing press articles about this then you may have come away misled, since there are a number of misinformed statements in at least some of the pieces that I have read. So, to begin with, let’s clarify the main features of this release, which are a) the SNAP Framework and b) SQL-GR.
To start with: the SNAP Framework. What this allows is a choice of storage options, namely row-based, column-based (that’s relational columns, not a NoSQL column store) or a file store. The file store is not HDFS (it's ADFS) but it is described as similar to HDFS: it is also a block store, but it additionally supports co-location and snapshots, which HDFS does not. Teradata describes it as complementary to Hadoop, though this may be more marketing speak than anything of great practical value.
It is when SQL-GR is introduced into the discussion that some articles have missed the point. This is a graph processing engine just as the company already has SQL and MapReduce processing engines: it is not a graph store or a graph database. The idea is that Teradata will provide pre-built graph functions that you can embed into SQL programs. Fine. Very nice. Now comes the tricky bit.
Teradata has stated that SQL-GR is based on BSP. BSP stands for Bulk Synchronous Parallel, which is a parallel computing model that was first put forward back in the 80s. Until recently it has been nothing much more than a theoretical exercise, primarily because of the performance overheads involved in what is known as barrier synchronisation (if any reader can explain, in words of one syllable, what this actually means, I would be grateful if they could append an appropriate comment to this article – I confess that this is an area somewhat outside of my comfort zone). However, there are now a number of graph implementations based on BSP – which I will discuss in a further article – and it appears, at least for graph analytics, that this limitation has been overcome.
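By way of a rough illustration of barrier synchronisation, here is a minimal Python sketch. It is not Teradata's implementation and has nothing to do with the SQL-GR API; the function name bsp_components, the way vertices are shared out among workers, and the choice of problem (labelling each vertex with the smallest vertex id in its connected component) are all assumptions made purely for the example. In BSP the computation proceeds in "supersteps": each worker computes on its own data and sends messages, then every worker waits at a barrier until all the others have finished, and only then are the messages read. The waiting is the overhead referred to above, since every superstep runs at the pace of the slowest worker.

```python
import threading


def bsp_components(num_vertices, edges, num_workers=2):
    """Label every vertex with the smallest vertex id in its connected component.

    Illustrative BSP loop only: the partitioning scheme and message format
    are assumptions for this sketch, not any product's design.
    """
    # Undirected adjacency lists.
    neighbours = [[] for _ in range(num_vertices)]
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    labels = list(range(num_vertices))   # each vertex starts with its own id
    # Worker w owns vertices w, w + num_workers, w + 2 * num_workers, ...
    owned = [range(w, num_vertices, num_workers) for w in range(num_workers)]
    outboxes = [[] for _ in range(num_workers)]  # per-worker (target, label) messages
    changed = [True] * num_workers               # did worker w update a label?
    barrier = threading.Barrier(num_workers)

    def worker(w):
        while True:
            # Phase 1 (compute and send): each owned vertex offers its
            # current label to all of its neighbours.
            outboxes[w] = [(n, labels[v]) for v in owned[w] for n in neighbours[v]]

            # Barrier: nobody reads messages until everybody has written them.
            barrier.wait()

            # Phase 2 (receive): adopt the smallest label sent to an owned vertex.
            changed[w] = False
            for box in outboxes:
                for target, label in box:
                    if target % num_workers == w and label < labels[target]:
                        labels[target] = label
                        changed[w] = True

            # Barrier: nobody starts the next superstep until all updates are in.
            barrier.wait()

            # Every worker sees the same flags here, so they all stop together.
            if not any(changed):
                return

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return labels


if __name__ == "__main__":
    # Two components, {0, 1, 2} and {3, 4}, plus the isolated vertex 5.
    print(bsp_components(6, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3, 5]
```

The two calls to barrier.wait() are the whole point: remove either one and a fast worker could read a half-written set of messages, or overwrite messages that a slow worker has not yet read.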
There are two further points that Teradata has been making, which are also liable to misinterpretation. The first is that using a graph engine, as opposed to a graph database, is better for discovery purposes – that is, what data scientists do. And the second is that, unlike other approaches, its solution is not memory-bound.
The first point is only partly true. For example, in a recent presentation to BBBT (Boulder BI Brains Trust), and probably also at the Teradata Influencer Day (unfortunately I was not well and was not able to attend), the company stated that graph databases are directed (that is, they support environments where you already know something about the relationships in the data) and are therefore not best suited to discovery. The examples they quoted were DB2 (which has just a triple store with no extra functionality right now) and Neo4j. That’s fair enough, except that YarcData’s Urika, which they didn’t mention, is a non-directed graph database. So the general statement they were making was incorrect. The company also lumped in SPARQL (but without the capitals) with the graph databases. Some of the press seem to think that SPARQL is yet another graph database when it is actually a query language, so that’s also misleading.
Secondly, other BSP-based graph products (Pregel, GoldenOrb and Giraph) run their processing in-memory. So does Urika. This certainly makes scaling more expensive but it doesn’t preclude scalability, which is what the thrust of Teradata’s comments has implied. In my view what Teradata Aster is offering is flexibility: you can scale up the memory if you want or need to but you don’t have to.
However, there is another point to consider. Suppose that you want to conduct sentiment analysis. And you want to do this by combining data from social media together with information from your call centre. The former you might store in Hadoop or, in Aster’s case, in its file store, while the latter is probably coming from a more conventional environment. This is where the SNAP Framework comes in, because you can combine data from these two sources in a single SQL-GR query. Using a graph database per se, you would have to move all of the data into the graph store, transforming the data as you did so. Thus, from this perspective, the approach taken by Teradata makes a lot of sense.
To summarise: I am very pleased to see the introduction of SQL-GR and I expect it to gain significant traction, but I could wish Teradata had been a little more careful with its announcements, presentations and statements. I could wish, too, that some members of the press were rather better informed. More generally, it is becoming clear that there is a significant difference between graph databases and graph analytics: I will return to this point in due course.