An ideological schism has turned the once-peaceful niche of data management software into an open battlefield. On one side sit the traditionalists, for whom the relational database is the only viable platform for enterprise data.
On the other are the upstarts, who see non-relational databases, parallel processing framework MapReduce and distributed file system Hadoop – the core elements of so-called ‘big data’ analytics – as the one true vision of the future.
“I was at a large database conference at Stanford University recently, and there were shouting matches,” explains Stephen Brobst, chief technology officer of data warehouse vendor Teradata. “The Hadoop guys were saying, ‘relational databases are dead, SQL programming is for dinosaurs, long live the new kings Hadoop and MapReduce’.
“Then the relational database guys were saying, ‘You Hadoop guys are reinventing technology we threw away in the 1960s. You have no optimisation, no schema, no security; what are you thinking?’”
“As an engineer, my view is that when you see this kind of religious zealotry on either side, both sides are wrong,” Brobst says. “A good engineer is happy to use good ideas wherever they come from.”
One particularly useful idea from the MapReduce world is called ‘late binding’, Brobst says.
In a traditional analytics system, data is extracted from source systems in a way that anticipates the structure of that data. This is fine if the structure of the data does not change rapidly, but according to Brobst, data structures are increasingly fluid.
“A lot of people are talking about the ‘velocity of big data’ but if that just means that data values are updating quickly, it’s nothing new,” he says. “What’s new is the velocity of change in the structure of data.”
With late binding, data is extracted from source systems in its raw form and the structure is only imposed when a query is sent. “It requires a different kind of programming model to pick data apart at query execution time,” explains Brobst. “That’s where MapReduce comes into play.”
Auction site eBay uses MapReduce to exploit this effect for web analytics, Brobst says.
“eBay has over 10,000 tags that they track in their weblog data, and they want to track new things every day,” he says. “In the traditional world, the database administrators (DBAs) would have to update their extract, transform and load (ETL) programs every day and they wouldn’t be able to keep up.”
MapReduce means that eBay’s DBAs can extract the data in raw form, leaving it to the company’s analysts to define the structure at the point of query.
“The data scientists know what they are looking for – they want to see the correlation between the colour of text in a box and the click-through rate, for example,” he explains. “Late binding means they can apply the structure to the data at the time of query execution.”
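As a rough illustration of late binding, the sketch below keeps weblog lines in raw form and imposes structure only when a query runs, using a simple map step and reduce step in plain Python. It is a minimal sketch, not eBay's or Teradata's implementation: the log format, the tag names (`colour`, `clicked`, `layout`, `promo`) and the function names are all assumptions made up for this example.

```python
from collections import defaultdict
from urllib.parse import parse_qsl

# Raw weblog lines, stored exactly as they arrived: no schema was applied
# at load time. Format and tag names are hypothetical.
RAW_LOG = [
    "colour=red&clicked=1",
    "colour=blue&clicked=0&layout=wide",
    "colour=red&clicked=0",
    "colour=blue&clicked=1",
    "colour=red&clicked=1&promo=spring",
]


def map_phase(line):
    """Impose structure on one raw line at query time and emit (key, value) pairs."""
    tags = dict(parse_qsl(line))
    # Only the tags this query cares about are interpreted; others are ignored.
    if "colour" in tags and "clicked" in tags:
        yield tags["colour"], int(tags["clicked"])


def reduce_phase(pairs):
    """Group the emitted values by key and compute a click-through rate per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


def click_through_by_colour(lines):
    """Run the map step over every raw line, then reduce the results."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(click_through_by_colour(RAW_LOG))
```

In this sketch, new tags such as `layout` or `promo` can appear in the raw data without any change to the loading step; only a query that wants to use them needs to know they exist.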
It is benefits such as these that have prompted some technologists to reject relational databases altogether – the so-called ‘NoSQL’ movement. But what prompted Teradata to acquire big data analytics provider Aster Data in 2010, Brobst says, was the fact that it had not done this.
“What we liked about Aster Data was that they didn’t take the extreme view on either side,” he explains. “They were able to extend the SQL model with MapReduce and this late binding feature without throwing away the productivity, the tools and all the good things associated with relational databases.”
This is not to say that there is no role for Hadoop, the popular distributed file system that incorporates MapReduce, Brobst says. “Unlike the relational database bigots, we think Hadoop plays an important role and we are investing in integrating it,” he says. “We see our customers using Hadoop as a very low cost repository to store all their data forever.”
When a new technology triggers an ideological schism, as Hadoop and MapReduce appear to have done, the temptation is to take a side. If an organisation is to derive the maximum value from its technology investments, however, it would be well advised to take a more enlightened view.