By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 23rd November 2007
Copyright Bloor Research © 2007
Vertica, as most readers probably know by now, offers a column-based approach to data warehousing. To be honest, I have written about the advantages of using columns to support analytic environments so many times since the late 90s (when I first encountered it) that I have got bored with explaining it. If you don’t understand that using columns means better performance, less disk space (partly due to the fact that it is easier to compress by column and partly because you don’t need indexes) and less administration then you haven’t been listening.
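For readers who want the column-versus-row point made concrete, here is a minimal sketch (all table and column names invented for illustration): pivoting rows into per-column arrays puts runs of repeated values next to each other, which is exactly what makes per-column compression, such as run-length encoding, so effective.

```python
# Illustrative sketch only: why column-wise storage compresses well.
# A row store keeps whole records together; a column store keeps each
# attribute in its own contiguous array, so runs of repeated values
# can be run-length encoded. Names below are invented.

rows = [
    ("2007-11-01", "EMEA", 120),
    ("2007-11-01", "EMEA", 95),
    ("2007-11-01", "APAC", 40),
    ("2007-11-02", "EMEA", 130),
]

# Pivot the rows into columns.
columns = {
    "date":   [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def rle_encode(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

# Low-cardinality (and especially sorted) columns shrink dramatically.
print(rle_encode(columns["date"]))    # [('2007-11-01', 3), ('2007-11-02', 1)]
print(rle_encode(columns["region"]))  # [('EMEA', 2), ('APAC', 1), ('EMEA', 1)]
```

Note also that because each column is self-contained and kept in sort order where possible, no separate index structure is needed to find values quickly — which is where the administration saving comes from.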
In any case, there are so many column-based databases now (Sybase IQ, Vertica, ParAccel, Calpont soon, Alterian, Kx Systems, SAND, Sensage) that this can now be regarded as a standard market in its own right.
So, leaving aside the column versus row argument, what is different about Vertica? The short answer is that it has been designed to run on a grid of many low-cost nodes, each with local disk storage, connected by one or two Gigabit Ethernet interconnects, rather than necessarily being implemented on a conventional massively parallel processing (MPP) back-end. You can also implement Vertica on top of a SAN configured as direct-attached storage if you wish.
However, it is the way that Vertica uses this grid that is important. Vertica distributes data across the nodes in the grid using what it calls “projections,” which are effectively the same thing as materialised views. As I mentioned previously, column databases like Vertica compress data very aggressively (often achieving a 90% compression ratio), and Vertica will use as much of the space saved as you give it to store multiple overlapping sets of columns (projections), with different sort orders, on different nodes in the grid. To put it simply, this means that you can have a group of columns sorted in one order on one node and in another order on another node.
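A toy sketch of the idea (node and column names are my own invention, not Vertica's): the same underlying data is held redundantly, but each physical copy carries a different subset of columns in a different sort order.

```python
# Illustrative sketch (invented names) of projections: overlapping sets
# of columns stored redundantly on different nodes, each copy physically
# ordered on a different sort key.

table = [
    ("bravo",  "EMEA", 95),
    ("acme",   "EMEA", 120),
    ("zenith", "APAC", 40),
]  # (customer, region, amount)

# node_a keeps all three columns sorted by customer; node_b keeps a
# two-column projection sorted by region. Same data, different layout.
node_a = sorted(table, key=lambda r: r[0])
node_b = sorted(((region, amount) for _, region, amount in table),
                key=lambda r: r[0])

print(node_a[0])  # ('acme', 'EMEA', 120) -- first row in customer order
print(node_b[0])  # ('APAC', 40) -- first row in region order
```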
Vertica uses the active redundancy built into the projections to parallelise querying (for better concurrency) and to support failover and recovery. So, the idea is that if you have a query that involves a join across several columns then your query is directed to the node that has the particular sort order needed for that combination of columns, or as close to it as possible. Clearly, if columns are pre-sorted in a way that best suits a particular query then you will get much better performance.
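The routing idea above can be sketched as follows — this is my own simplified illustration, not Vertica's actual planner: given the columns a query touches, pick the node whose projection has the longest matching prefix of pre-sorted columns.

```python
# Sketch of routing a query to the projection whose sort order best
# matches the query's columns. All names invented for illustration.

projections = {
    "node1": ["customer", "region", "amount"],  # sorted by customer first
    "node2": ["region", "amount"],              # sorted by region first
}

def choose_node(query_columns, projections):
    """Pick the node whose leading sort columns best match the query."""
    def score(sort_order):
        n = 0
        for col in sort_order:
            if col not in query_columns:
                break  # only a matching *prefix* of the sort order helps
            n += 1
        return n
    return max(projections, key=lambda node: score(projections[node]))

# A query joining on region and amount goes to node2, whose projection
# is already sorted that way; a customer lookup goes to node1.
print(choose_node({"region", "amount"}, projections))  # node2
print(choose_node({"customer"}, projections))          # node1
```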
Of course there are other notable features of Vertica: it processes data while it is still compressed (though it is not alone in this); it uses in-memory buffering to speed writes to the database; and it comes with its own DBDesigner, which will generate an optimised physical schema based on the logical data model, a training set of data and queries, the number of nodes in the grid, and the number of concurrent node failures the database must be able to tolerate to provide a highly available environment.
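Processing data while still compressed is worth a quick illustration. Under run-length encoding (one plausible scheme; the sketch and names are mine, not Vertica's), an aggregate can be computed directly on the runs, touching far fewer values than the decompressed column contains.

```python
# Sketch: computing an aggregate directly on a run-length encoded
# column, without decompressing it first. Illustrative only.

rle_amounts = [(10, 4), (25, 2), (10, 3)]  # (value, run_length) pairs

def rle_sum(encoded):
    """Sum a run-length encoded column without expanding it."""
    return sum(value * run for value, run in encoded)

def rle_count(encoded):
    """Row count of a run-length encoded column, again without expanding."""
    return sum(run for _, run in encoded)

# Nine logical rows summed by touching only three runs.
print(rle_sum(rle_amounts))    # 120
print(rle_count(rle_amounts))  # 9
```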
A further particularly nice feature, planned for the future, is intelligent query monitoring. Put simply, this will monitor the actual queries that hit the database so that you can build up an understanding of their patterns, link those patterns back to the projections in use and then, with DBDesigner, optimise sort orders (for example) to best match today’s query patterns. Note that the potential for this capability was designed into the environment in advance: it would be extremely difficult to retrofit such a facility.
On the commercial side, the product is available either as pure software (including a free “try before you buy” offer) or as a pre-installed package on HP hardware with Red Hat Linux. Something over half of the company’s beta sites have so far converted into paying customers, and I understand that some of these are now fully deployed as production systems. At present the company’s largest configurations are measured in tens of terabytes (raw data; the stored size will be significantly smaller because of compression) and Vertica expects to exceed 100TB within 12 months. On the user scalability side, the company reports successful tests with hundreds of concurrent users.
All in all then: a very promising start. But given the advantages of column-based approaches I am not surprised.
Published by: IT Analysis Communications Ltd.