By: David Norfolk, Practice Leader - Development, Bloor Research
Published: 3rd January 2008
Copyright Bloor Research © 2008
We know it's fashionable to find an IT crisis to worry about every year or so, but we really are facing one now: the end of the "free lunch". For years, you've been able to write poor-quality, bloated code, secure in the knowledge that next year's computers will be a bit faster and will run your bloated application fast enough.
Not any longer. This is because, although computers are still getting faster, they're not doing this by cranking up the clock speed of the CPU any more. Instead, CPUs will run more slowly but there will be more of them, so overall throughput will still increase year on year. This approach is inescapable, because heat production goes up with increasing clock speed rather faster than processing power does; and the largest datacentres (used by people like Amazon and eBay) are finding that the power available from the electricity grid (not just to run those CPUs but also the associated air-conditioning) is limiting growth.
The crisis comes because many applications written in the past, many of which are still in use, can't multiprocess very well. Put them on your new state-of-the-art computer and they run on only one of its CPUs and therefore slow down (the clock speed is lower). Writing programs that run on arbitrarily many CPUs is hard and many programmers can't do it very well. When they try, they find that things sometimes run in the wrong order (especially when the system is heavily loaded) or lock up solid, as one process waits for another to release resources while that other process is itself waiting for resources held by the first. A lot of software isn't designed for multiprocessing (multi-CPU) environments and many programmers aren't equipped (whether from lack of ability or lack of training) to rewrite it. It's a crisis recognised by Intel (which invented the free lunch with Moore's Law) and even by Microsoft, which once appeared to endorse the writing of bloatware as a rational response to the free lunch.
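The lock-up described above is usually called a deadlock. As a minimal sketch of one conventional remedy (not something taken from any product discussed in this article), the Java below always acquires locks in a fixed global order, so two threads working in opposite directions cannot each hold one lock while waiting for the other. The names `Account`, `transfer` and `runOpposingTransfers` are our own hypothetical illustrations.

```java
// Illustrative sketch only: Account and transfer are hypothetical names.
final class LockOrderingSketch {

    static final class Account {
        final int id;   // unique id, used to fix a global locking order
        long balance;

        Account(int id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    // Always lock the account with the lower id first. Threads moving money in
    // opposite directions then request the two locks in the same order, so
    // neither can hold one lock while waiting for the other.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    // Runs two threads transferring in opposite directions and returns the
    // combined balance, which the transfers should leave unchanged.
    static long runOpposingTransfers(int iterations) {
        Account a = new Account(1, 1000);
        Account b = new Account(2, 1000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) transfer(b, a, 1);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(runOpposingTransfers(100_000)); // prints 2000
    }
}
```

If `transfer` instead locked `from` and then `to`, the two threads could each take one lock and wait for ever for the other, typically only under load. That is exactly the kind of bug that tends to surface in production rather than in testing.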
However, there is hope. J2EE can handle multiprocessing quite well for small online transactions, and people are producing multiprocessing frameworks that let programmers write simple single-threaded programs which are then run on multiple CPUs, more or less transparently to the programmers. It is simply too dangerous to let programmers loose on locks and semaphores: there are too many opportunities for subtle bugs that only turn up in production, when things get stressed.
One good example of such a framework comes from what is perhaps an unexpected place: Pervasive Software, vendor of embedded databases going right back to Btrieve. In fact, its DataRush framework is a logical outcome of its database expertise. It is aimed at large batch-oriented database applications of the sort that are starting to require unacceptable amounts of time to run on conventional computers as data volumes shoot past the terabyte barrier. The DataRush approach is based on dataflow processing and an understanding of the properties of the data; data analysis or database specialists understand the approach easily, while programmers sometimes take a little longer. There is also a DataRush FAQ.
Nevertheless, we aren't going to get a new free lunch to replace the one that is now ending. People sometimes talk as though the multi-CPU issues will go away just as soon as we get cleverer compilers that can parallelise serial programs automatically, but this is an unrealistic expectation. Compilers will improve in their exploitation of parallel processing, but they can't (in general) anticipate patterns in the data the compiled program will be fed, nor can they know about activities which can or can't be parallelised safely for business reasons, unless you tell them with compiler ‘hints’ (and these imply that the programmer understands parallel processing and future data patterns).
Even DataRush, however, isn't a free lunch. Taking advantage of its parallelisation techniques currently involves writing Java code "customizers", which are invoked during the DataRush compile cycle. The customizers can use information such as the number of configured processors to partition data and control the parallelisation appropriately. These techniques are useful for what are traditionally called batch-oriented or data analytics applications.
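To make the idea of partitioning data by processor count concrete, here is a minimal sketch in plain Java. It does not use the DataRush customizer API, which we do not reproduce here; `PartitionSketch`, `partition` and `parallelSum` are hypothetical names, and the standard `java.util.concurrent` executor stands in for whatever the framework does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: this is NOT the DataRush customizer API.
final class PartitionSketch {

    // Splits [0, length) into at most `parts` contiguous ranges of near-equal
    // size. Each int[] is {startInclusive, endExclusive}.
    static List<int[]> partition(int length, int parts) {
        List<int[]> ranges = new ArrayList<>();
        int n = Math.max(1, Math.min(parts, length));
        int base = length / n;
        int extra = length % n;   // the first `extra` ranges get one more element
        int start = 0;
        for (int i = 0; i < n; i++) {
            int size = base + (i < extra ? 1 : 0);
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    // Sums the array by giving each worker one partition; every task is
    // ordinary single-threaded code with no locks of its own.
    static long parallelSum(long[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int[] range : partition(data.length, workers)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = range[0]; i < range[1]; i++) sum += data[i];
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) total += partial.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(cpus + " workers, sum = " + parallelSum(data, cpus));
    }
}
```

The point of the sketch is that the degree of parallelism is a parameter discovered at run time rather than something hard-coded, so the same program can spread its work over 2 CPUs or 200.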
The "dataflow" technique of computer processing that DataRush uses (based on Kahn networks and Parks scheduling; see the article by Jim Falgout and Matt Walker and its references) has been around for some time, although there isn't space to go into its technicalities in this article. However, it requires a new way of thinking about pipelined applications, a new process description language (DRXML) and a new graphical design approach in the Eclipse IDE (and programmers don't like change). DataRush doesn't make you throw away your existing programs and rewrite them from scratch, but it isn't a simple recompile or automated port either: some rewriting of code is necessary. And it probably requires a degree of professionalism from your programmers, who'll have to follow established good practice.
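For readers who want a feel for the dataflow style, the following is a rough sketch in plain Java rather than DataRush or DRXML. Each stage is a simple sequential process, and the stages communicate only through FIFO queues on which a read blocks until data arrives, which is the essence of a Kahn network. We assume bounded queues here, so writes also block when a queue is full. The class name, the queue capacity of 16 and the `END` marker are all our own assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only: a three-stage pipeline, not DataRush code.
final class DataflowSketch {

    // Hypothetical end-of-stream marker; real data must never use this value.
    private static final long END = Long.MIN_VALUE;

    // source -> square -> sum. Returns the sum of the squares of 1..count.
    static long run(int count) {
        BlockingQueue<Long> numbers = new ArrayBlockingQueue<>(16);
        BlockingQueue<Long> squares = new ArrayBlockingQueue<>(16);
        long[] result = new long[1];

        // Stage 1: emit 1..count, then the end marker.
        Thread source = new Thread(() -> {
            try {
                for (long i = 1; i <= count; i++) numbers.put(i);
                numbers.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: square each value as it arrives; take() blocks when empty.
        Thread square = new Thread(() -> {
            try {
                for (long v = numbers.take(); v != END; v = numbers.take()) {
                    squares.put(v * v);
                }
                squares.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: accumulate the squares until the end marker is seen.
        Thread sink = new Thread(() -> {
            try {
                long sum = 0;
                for (long v = squares.take(); v != END; v = squares.take()) {
                    sum += v;
                }
                result[0] = sum;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        square.start();
        sink.start();
        try {
            source.join();
            square.join();
            sink.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(run(10)); // prints 385
    }
}
```

None of the three stages contains an explicit lock or semaphore, yet all three can run on different CPUs at once. The result is also the same however the threads happen to be scheduled, because each stage sees its inputs in FIFO order.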
That said, there aren't any alternative magic solutions out there that we can see; other solutions have their own problems (typically, complexity of programming and/or expense). For the subset of problems DataRush is suited to, it delivers orders-of-magnitude improvements in throughput. For new applications, Jim Falgout (DataRush Solutions Architect at Pervasive) says that its approach fits well, in practice, with innovative technologies such as the Vertica column-oriented database and Azul's 768-core appliances. He has even contemplated producing a DataRush data processing appliance, superficially similar to those produced by Netezza but, in DataRush's case, running on commodity hardware and targeting data-intensive applications.
The issue of changing the programming culture is, in part, being addressed by using existing open source and standards-based environments. The DataRush GUI is based on Eclipse 3.3, it supports JMX performance monitoring, it exploits the parallel processing features of Java 6 and it now includes support for several scripting languages. It is supported on Windows XP, Windows Server 2003, Vista, Linux (Red Hat, SUSE and Azul), HP-UX, AIX and Solaris and is currently in Beta 2 (available for download; registration required). However, organisations will need to address the cultural issues internally as well, probably by providing interactive (face to face) training and by encouraging programmers to take part in the DataRush community.
The issue of parallel processing on multi-CPU processors (and think in terms of hundreds of CPUs, not just 4- or 8-way processors) is a real one and will require new skills from programmers and significant cultural change. DataRush seems, to us, to promise a cost-effective way of processing very large data volumes on modern multi-CPU processors, without the need to load your data into a heavyweight data warehouse-style database. And (despite the framework as a whole being in beta) it is already the basis for a shipping product called Pervasive Data Profiler, which performs computationally intensive calculations (such as Sum, Avg, Min, Max, Frequency Distributions, Tests, regulatory compliance checks, etc.) on arbitrary columns, in all the rows/records of a table simultaneously. And, intriguingly, perhaps DataRush's underlying processing paradigms will offer a more generalised way of thinking about applications in future, now that the shortcomings of the essentially serial von Neumann architecture are being recognised.
Published by: electronicdawn Ltd.