What is big data?

The bookselling industry is centuries old, but its established operating model is crumbling around its ears. Having already been turned on its head by the Internet, the industry now faces even greater disruption from electronic books. But the rapid digitisation of what was once the quintessential paper-driven industry has also had benefits for booksellers.

According to Marc Parrish, vice president of retention and loyalty management at US bookseller Barnes & Noble, readers are buying more e-books than physical volumes. Such is the pace of e-book adoption that he expects e-book sales to overtake paper sales within two years, Parrish said at GigaOM’s Structure 2011 conference, held in New York in March.

That inversion of the purchasing process has huge ramifications for the business, but luckily for Barnes & Noble one advantage of e-book sales is that they create reams more data.

For example, having compared its electronic and physical sales, the company now knows that e-book buyers are more conservative in their book choices, sticking to a narrower range of titles than lovers of bound paper.

At the same time, by analysing data collected from its physical and online stores, Barnes & Noble has established that e-book buyers who continue to buy a broad range of books also tend to visit its stores more often. The ability to track customer behaviour online and in store is vital to Barnes & Noble going forward, says Parrish.

And as the shift toward e-books becomes more pronounced, its high street stores will be more valuable as a marketing tool, he notes. Eventually, even the shop tills could become redundant.

Book publishing is just one of many industries facing a tidal wave of operational data that poses a golden opportunity. If businesses can collate, process and analyse that data, they can detect and exploit trends to stay ahead of the market.

That data explosion, and the new generation of technologies required to make use of it, is what the IT industry now commonly refers to as ‘big data’.

Businesses have been accumulating growing volumes of data for decades, so what makes big data a new trend?

To answer that question, it is helpful to look at the businesses that experienced the phenomenon first: web companies such as Google, Yahoo and Facebook.

For these companies, even the slightest click by a user creates a stream of data, and their users number in the hundreds of millions. All that data must be captured, managed and processed at speeds that far exceed the capacity of today’s standard enterprise data management tools.

According to Bill McColl, CEO of analytics software vendor CloudScale, that is a world away from the traditional enterprise model of business intelligence, where companies typically track a few thousand entities (e.g. customers, stock items, devices). “What’s happened in the past five years has been an incredible exponential explosion,” he says.

Facebook has to keep track of hundreds of millions of separate entities in real time, creating an enormous number of events every second, he says. And the volumes that the world’s largest social network handles today will soon become commonplace.

Utility companies are deploying Smart Grids, made up of thousands of meters monitoring energy use throughout their network. Logistics companies are using sensors to track and manage the progress of goods in transit. Retailers are offering customers deals and discounts in exchange for the right to identify when they are in store via their smartphone.

IBM is collaborating with medics at Columbia University Medical Center and the University of Maryland School of Medicine to apply big data analytics tools to medical practice, while California-based start-up Apixio aims to improve information sharing between doctors by analysing everything from CT scans to emails.

What links these big data applications is the need to track millions of events per second, and to respond in real time. Utility companies will need to detect an uptick in consumption as soon as possible, so they can bring supplementary energy sources online quickly. If retailers are to capitalise on their customers’ location data, they must be able to respond as soon as they step through the door.
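The utility example can be illustrated with a short sketch in Python. This is only one possible approach, not a description of any vendor’s product: it assumes readings arrive one at a time and flags any reading that exceeds the average of the most recent readings by a chosen factor. The class name, window size, threshold and sample readings are all hypothetical.

```python
from collections import deque


class UptickDetector:
    """Flags a reading that is well above the average of recent readings.

    Hypothetical illustration: the window size and threshold are arbitrary.
    """

    def __init__(self, window=5, threshold=1.5):
        # Only the most recent `window` readings are kept in memory.
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Process one reading as it arrives; return True if it is an uptick."""
        is_uptick = False
        # Judge the new reading only once a full window of history exists.
        if len(self.readings) == self.readings.maxlen:
            baseline = sum(self.readings) / len(self.readings)
            is_uptick = value > self.threshold * baseline
        self.readings.append(value)
        return is_uptick


# Made-up meter readings, processed one by one rather than stored and queried.
detector = UptickDetector(window=3, threshold=1.5)
stream = [10, 11, 9, 10, 18, 12]
alerts = [reading for reading in stream if detector.observe(reading)]
print(alerts)
```

The point of the design is that each event is examined the moment it arrives and only a small window of history is held, so the response does not have to wait for a batch report.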

In a way, the term ‘big data’ fails to highlight the critical development. Yes, this trend will accelerate the expansion of data volumes, but those have always been growing. The significant change is the way in which the data is produced, and the way it must therefore be collected and analysed.

Most traditional BI vendors would argue, of course, that their high-end tools are well able to process large volumes of data, and to do so in real time. But proponents of big data technologies argue that these traditional BI tools were not designed for data created in the manner described above.

In the conventional model of business intelligence and analytics, data is cleaned, cross-checked and processed before it is analysed, and often only a sample of the data is used in the actual analysis. This is possible because the kind of data that is being analysed – sales figures or stock counts, for example – can easily be arranged in a pre-ordained database schema, and because BI tools are often used simply to create periodic reports.

According to Anjul Bhambhri, vice president for ‘big data products’ at IBM, this is not true of the kind of data that is now exploding in volume, such as website clicks, continuous meter readings or GPS tracking data.

With big data, “organisations are attempting to make sense of masses of data that has no real structure, no schema”, says Bhambhri. “This is data that can’t just be pushed into a structured repository.”

Any organisation having to make sense of these torrents of data using traditional tools will be overwhelmed, says Cloudscale’s McColl. “We need a different approach. The idea of storing it all in a database and then querying it is not going to work.”

Just as the web giants were the first to experience the big data phenomenon, they were also the first to build tools with which to handle it.

Google’s landmark publication of the principles of its GFS file system and MapReduce algorithm in 2004 laid the foundation for a new breed of data capture and management tools based on massively parallel processing. Both Yahoo and Facebook have added momentum by sharing some of their own expertise, mainly through the Apache Hadoop project.
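For readers unfamiliar with the MapReduce idea, the following is a minimal single-process sketch in Python. It is not Google’s or Hadoop’s implementation, which distribute the work across many machines; it only shows the three conceptual steps (map, group by key, reduce). The click log and all function names are invented for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit one (page, 1) pair for each click record."""
    _user, page = record
    yield (page, 1)


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence on an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# A made-up click log of (user, page) pairs.
clicks = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/books"),
    ("carol", "/home"),
]
page_counts = map_reduce(clicks)
print(page_counts)  # {'/home': 3, '/books': 1}
```

Because each map call looks at one record and each reduce call looks at one key, both steps can in principle be spread across many machines, which is what makes the approach suitable for massively parallel processing.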

Big data techniques may have been developed in the engineering departments of these web powerhouses, but they may well have remained there were it not for a class of companies making the technology accessible to mainstream businesses, says Donald Feinberg, a distinguished analyst at industry watcher Gartner.

“Most large organisations have one or two mathematicians or statisticians who would be capable of writing MapReduce algorithms,” but commercialised products from vendors such as DataStax and Cloudscale make these technologies a realistic proposition within the enterprise, he says.

Traditional data management vendors have been quick to spot an opportunity. Over the past year, these vendors have been on a big data acquisition spree: IBM snapped up Netezza for $1.6 billion; HP and EMC bought the privately held Vertica and Greenplum, respectively, for undisclosed fees; and in March 2011, Teradata spent $263 million on the 89% of in-database analytics start-up Aster Data it didn’t already own.

The combination of big data technologies and established enterprise IT vendors could prove a potent one. The strength of today’s big data tools is their ability to manage and analyse data that may not fit the existing data management infrastructure, says IBM’s Bhambhri. But in order to create something valuable, businesses need to be able to integrate that new information with existing data repositories within the enterprise.

“That way, businesses are really finding out things about their customers that they could never have known before,” she says.

Despite its industry-shaking potential, there is one rather prosaic reason that companies are contemplating big data, Gartner’s Feinberg says.

Until recently, few firms could justify the outlay needed to store gigantic volumes of data, knowing that much of it would be essentially useless. “Today, I can go online and buy a server with 1 terabyte of RAM for less than $100,000,” says Feinberg. “Big data is suddenly affordable.”

Henry Catchpole

Henry Catchpole runs Inform Direct, a company records management software company which simplifies the process of dealing with Companies House. The business was set up in 2013.
