It is a curious idiosyncrasy of today’s standard analytical systems that the larger a dataset, the more guesswork is involved in its analysis.
The reason for this is that the amount of data that can be analysed at one time is constrained by the technical limitations of the database in which it is contained. So for larger datasets, statistical analysis must rely on proportionally smaller samples of data.
Until recently, this limitation was not too much of a handicap, because sampling is an appropriate statistical tool for the kinds of analysis generally required or considered possible, such as customer segmentation. Put simply, this is because those analyses seek broad trends, and if a trend is significant it should appear in a reasonably sized sample.
However, an increasing number of applications require systems that can identify not only trends but specific patterns among a few data points out of a superset of billions. Fraud detection is one: in order to catch fraud as it happens, banks and credit card companies must be able to spot the telltale signs among the many millions of transactions that are under way at any given time. It is not enough to take a sample of the data and hope that evidence of fraudulent activity happens to fall into that sample.
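The arithmetic behind that last point can be sketched in a few lines of Python. This is an illustration only, under a stated assumption that does not come from the article: each record is included in the sample independently with the same probability. The function names are hypothetical.

```python
def prob_pattern_fully_sampled(sample_fraction: float, pattern_size: int) -> float:
    """Chance that every record in a pattern of `pattern_size` records
    lands in the sample (assumes independent inclusion per record)."""
    return sample_fraction ** pattern_size


def prob_any_record_sampled(sample_fraction: float, pattern_size: int) -> float:
    """Chance that at least one record of the pattern lands in the sample."""
    return 1 - (1 - sample_fraction) ** pattern_size


# Hypothetical figures: a 1% sample and a fraud pattern made up of 3 transactions.
full = prob_pattern_fully_sampled(0.01, 3)   # about one in a million
partial = prob_any_record_sampled(0.01, 3)   # just under 3%
```

Under these assumptions, a 1% sample would capture all three transactions of a given pattern roughly once in a million attempts, which is why a pattern involving only a few data points cannot be left to sampling in the way a broad trend can.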
Similarly, for Internet companies to target advertising to users based on their browsing activity, it is not good enough to analyse a sample of the web traffic data. It was challenges of this kind that led search giant Google to develop MapReduce, a software framework that splits massive analysis jobs into component tasks, distributes them across a number of computers where each task is processed independently (the ‘map’ step), and then recombines the results of those tasks to complete the analysis (the ‘reduce’ step). It is a form of massively parallel processing (MPP).
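The map-then-reduce pattern can be illustrated with a small Python sketch. It is not Google’s or Aster Data’s implementation: threads on one machine stand in for separate computers, and the record layout and function names (`map_partition`, `shuffle`, `reduce_group`, `total_spend_per_card`) are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_partition(transactions):
    """Map step: emit one (card, amount) pair per transaction in a partition."""
    return [(t["card"], t["amount"]) for t in transactions]


def shuffle(mapped_partitions):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(item):
    """Reduce step: combine all values for one key into a single result."""
    key, values = item
    return key, sum(values)


def total_spend_per_card(partitions, workers=4):
    """Run the map step over every partition in parallel, then reduce per key.

    Threads stand in for the separate machines of a real cluster.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_partition, partitions))
        groups = shuffle(mapped)
        reduced = list(pool.map(reduce_group, groups.items()))
    return dict(reduced)


# Two partitions of hypothetical transaction records.
partitions = [
    [{"card": "A", "amount": 10}, {"card": "B", "amount": 5}],
    [{"card": "A", "amount": 7}],
]
totals = total_spend_per_card(partitions)
```

Because each partition is mapped independently and each key is reduced independently, every record is examined without any single machine having to hold the whole dataset. In a production framework the partitioning, distribution and grouping are handled by the system rather than by the analyst’s code.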
Aster Data is a Silicon Valley-based company that has so far made its name by combining the analytical power of MPP with the SQL databases in which a large proportion of business data is contained. It does this by ‘pushing’ the MapReduce functionality out to the data using SQL queries. Companies including social networking giant MySpace use Aster Data’s nCluster system to analyse their multi-terabyte data warehouses.
Now, though, the company claims that it is taking MPP technology a step further. Aster Data describes the latest version of nCluster, version 4.0, as the world’s first ‘parallel data-application server’. What this means, Aster claims, is that it can push application functionality of any kind, whether it is based on Java, C, C++, .NET, Perl or Python code, down into the database and execute it in a massively parallel fashion.
According to the company, this allows existing analytics-intensive applications to run unmodified at a scale and speed that have not previously been possible, and at considerably less cost than comparable data warehouse systems.
Companies already using nCluster 4.0 include Telefónica, a Spanish telecommunications provider that uses the system to analyse the call patterns of its customers. Here is an example of an analytical application for which sampling will not do: to truly analyse the call behaviour of a specific individual, every single call made in the network must be considered.
Other early adopters are FullTilt Poker, an online gambling site that uses nCluster 4.0 to seek out fraudulent activity in real time, and comScore, the Internet market research company, which analyses 160 billion rows of data a month.
According to Aster Data CEO Mayank Bawa, nCluster 4.0 allowed comScore to perform its analytics workload three times faster than any competing offering, at half the hardware cost. In addition, he says, by switching to nCluster the company cut response times for its analytics applications from five minutes to between five and ten seconds.
It is not long since appliance makers such as Netezza and DATAllegro caused a modicum of disruption in the data warehouse industry. If Aster Data’s claims are true, however, the disruption has barely begun.