How to build a big data infrastructure

Big data has dominated the technology headlines over the last couple of years. But in reality, most businesses are only just starting to get to grips with understanding what big data can actually do for them, let alone wrestling with how to build infrastructure systems able to deal with big data.

Vendors are introducing new big data technologies at such a pace that simply understanding what businesses need to do from an IT perspective to manage data, store it and make it useful is confusing and complex.

Common definitions of big data talk about data volume, variety and velocity, but there is one important characteristic that is missing: complexity.

Data management is really not a new thing. But today businesses have access to more sources of data than ever before, including social and online channels, to mix with the data they already hold about their customers and their behaviours.

In addition, there are a host of new technologies that enable them to analyse this data much faster than was possible before. The challenge lies in the variety of the data and the number of sources – is it structured or unstructured, and how do CIOs put it together to build a picture or develop relational insights?

It’s no longer simple transactional behaviour, and so it demands a rethinking of the infrastructure, tools and systems that IT has relied on for so many years. All this leads to increased complexity, and while this is a challenge for IT, it promises huge potential benefits for the business.

From an IT perspective, big data is about solving complex data problems that go beyond what current technologies, such as a single relational database, can handle.

Solving these computing problems at scale, and at speed, is also critical, as companies now expect near real-time insight from their big data (this is especially pertinent in digital marketing, where campaigns are becoming increasingly targeted and personalised). As a result, optimising the infrastructure across both hardware and software is crucial.

>See also: Intel and Cloudera: a big boost for big data?

One primary challenge is that big data falls into two separate use cases: real-time, transactional data analytics, and deeper-dive offline analysis, which more often runs in batch mode. Although many organisations want both processes in place, the optimal infrastructure for each is distinctly different.

Real-time use cases include interactive website session tracking, session management, and real-time and predictive analytics (such as recommendation engines).

These are ideal candidates for NoSQL- or SQL-based technologies that are specifically optimised for real-time performance. They are typically best suited to dedicated, bare-metal deployments, or hardware stacks tuned to the specific use case and built to enhance the performance of the data application.

For example, running a NoSQL database such as MongoDB, which requires high I/O throughput, on flash or SSD storage ensures the best performance where it really matters – speed.
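To make that concrete, a minimal sketch of such a real-time workload might look like the Python below. It assumes a MongoDB instance (ideally backed by flash or SSD storage) reachable locally, and the database, collection and field names are illustrative only, not a prescription for any particular deployment.

```python
# Minimal sketch of a real-time session-tracking workload against MongoDB.
# Assumes a local MongoDB instance, ideally on flash/SSD storage; database,
# collection and field names here are hypothetical examples.
from datetime import datetime, timezone

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
sessions = client["webapp"]["sessions"]

# Index the field queried on every page view so lookups stay fast
# as the collection grows.
sessions.create_index([("session_id", ASCENDING)])

def record_page_view(session_id, url):
    """Append a page view to the visitor's session document, creating it if needed."""
    sessions.update_one(
        {"session_id": session_id},
        {
            "$push": {"page_views": {"url": url, "at": datetime.now(timezone.utc)}},
            "$setOnInsert": {"started_at": datetime.now(timezone.utc)},
        },
        upsert=True,
    )

def current_session(session_id):
    """Fetch the live session state, e.g. to feed a recommendation engine."""
    return sessions.find_one({"session_id": session_id})

if __name__ == "__main__":
    record_page_view("abc123", "/products/ssd-servers")
    print(current_session("abc123"))
```

Every page view here is a single indexed read-modify-write, which is exactly the kind of I/O-bound, latency-sensitive pattern that benefits from fast storage and dedicated hardware.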

When analysts want to delve deeper into the data – looking to aggregate larger datasets for correlations to build predictive models, for example – batch-style analytics are well suited. Because these batch jobs may run infrequently, cloud-based solutions can be ideal, allowing significant cost savings by building infrastructure when it’s needed rather than keeping it running all the time.

>See also: 5 tips for turning big data into a valuable asset

Storing data in a cloud-based object store and running Hadoop across cloud servers for map-reduce analysis is a great example of where the economics of cloud are a huge benefit for workloads that do not need to run all the time (non-persistent workloads).
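As a rough illustration of that pattern, the Python sketch below implements a simple Hadoop Streaming job that counts page views per URL across a clickstream held in an object store. The tab-separated log format, bucket paths and job invocation in the comments are assumptions made for the example, not a recommendation for any specific platform.

```python
# mr_url_counts.py - a minimal Hadoop Streaming map-reduce job in Python.
# Illustrative sketch only: log format, bucket names and invocation are assumed.
#
# Example invocation against an S3-compatible object store:
#   hadoop jar hadoop-streaming.jar \
#     -files mr_url_counts.py \
#     -mapper "python3 mr_url_counts.py map" \
#     -reducer "python3 mr_url_counts.py reduce" \
#     -input s3a://example-bucket/clickstream/ \
#     -output s3a://example-bucket/url-counts/
import sys

def map_phase():
    # Emit one "url<TAB>1" pair per log line (assumed format: timestamp<TAB>url<TAB>...).
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")

def reduce_phase():
    # Streaming delivers mapper output sorted by key, so counts can be
    # summed for each URL in a single pass.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t")
        if url != current_url:
            if current_url is not None:
                print(f"{current_url}\t{count}")
            current_url, count = url, 0
        count += int(value)
    if current_url is not None:
        print(f"{current_url}\t{count}")

if __name__ == "__main__":
    map_phase() if sys.argv[1] == "map" else reduce_phase()
```

Because the input and output both live in the object store, the Hadoop cluster itself can be created for the duration of the job and torn down afterwards, which is where the cost savings come from.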

For platforms that must deal with both the persistent and non-persistent workloads a big data solution calls for, a hybrid cloud model is ideal. Hybrid cloud allows businesses to run big data processes on public cloud, private cloud and dedicated servers, working alone or integrated together, thereby meeting the requirements of both real-time transactional data and offline analysis.

Regardless of the specific data problem being addressed today, one thing is clear: there are more technologies and methods to gather, analyse and report on data than ever before.

And while organisations are increasingly coming to understand the improvements big data can bring to their businesses, they need a practical plan for building an IT infrastructure that can keep up with the use cases identified today, as well as provide a platform for the use cases of tomorrow.

Sourced from Toby Owen, head of technical strategy, Rackspace
