Leading-edge businesses have moved on from a period of experimenting with new capabilities such as Hadoop, and 2014 looks set to be the year of action rather than experimentation: running valuable analytics for tangible results.
The good news for later adopters is that there are many lessons they can learn from the innovators. Many of the practical issues are not about the technology; they are about basic data management principles and a focus on tangible business challenges. Nor does managing big data require a costly ‘big bang’ approach.
Analytics has played an important role in business for many years. Many organisations, however, have struggled to move beyond basic reporting and dashboards and use analytics to drive innovation and business value.
More specifically, whereas traditional analytics has focused on finding the answers to specific, known questions (why is my customer defection rate increasing?), big data offers the chance to enter uncharted territory and discover new questions and answers (what are the factors that are driving defection and how can I address them?).
This step-change in analytical capability is best achieved by first considering the types of specific and broader queries which the business wants to address (for example, improving customer retention or competitive performance).
Then, from this, explore the range of information sources that may contribute to finding an answer, whether internal or external, proprietary or open, existing in a data warehouse or ‘unstructured’ and lying in a test Hadoop system.
Only then should the organisation consider whether its existing technology is able to help answer these known and potentially unknown questions. The first step, in other words, is to define a good business problem area: specific, but not too narrow.
The elephant in the room
Hadoop is increasingly proving to be a low-cost, flexible way of acquiring and storing huge volumes of data, in part because it is open source. Organisations that have been trialling Hadoop now have a good feel for its strengths and weaknesses.
However, from an analytics point of view, it has inherent limitations and can’t cover all requirements.
For example, many elements of analytics need an assured response time, in particular the tactical queries that support operational decisions: deciding in real time what offer to make to someone on the phone or browsing the website, or even serving up a dashboard to hundreds of managers.
The relational data warehouse is much better suited to these known, predictable queries, partly because a structured data model benefits query performance and partly because relational databases have been highly refined for efficiency.
Hadoop, on the other hand, is less refined, but has the advantage of high-performance parallelism on commodity hardware. It also stores data in raw form (it is a file store, not a database) and adopts the concept of late binding: a data model is created only at the point of query, which saves unnecessary data manipulation.
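To make late binding concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The raw lines, field names and the `read_with_schema` helper are all hypothetical, invented for illustration. The point is only that the data is stored untouched, and a data model is imposed at the moment a question is asked.

```python
import json

# Hypothetical raw event lines, stored exactly as they arrived.
# No schema was applied when they were loaded.
RAW_LINES = [
    '{"customer": "c1", "event": "call", "duration": "120"}',
    '{"customer": "c2", "event": "web_visit", "page": "/pricing"}',
    'not valid json',
    '{"customer": "c1", "event": "call", "duration": "45"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) only at query time."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw stores can hold lines that do not parse; skip them
        if not isinstance(record, dict):
            continue
        # Keep only records that contain every field this query needs.
        if all(field in record for field in schema):
            yield {field: convert(record[field])
                   for field, convert in schema.items()}


# One question, one schema: total call seconds per customer.
call_schema = {"customer": str, "duration": int}
total_call_seconds = {}
for row in read_with_schema(RAW_LINES, call_schema):
    customer = row["customer"]
    total_call_seconds[customer] = (
        total_call_seconds.get(customer, 0) + row["duration"]
    )

print(total_call_seconds)
```

A different question, such as which pages each customer viewed, would pass a different schema over the same untouched lines. The trade-off is that the parsing cost is paid on every query rather than once at load time.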
For complex tasks (such as analysing multi-faceted survey data or complex web browsing data), or instances where the user has a complex query which isn’t easily coded in traditional SQL, the underlying MapReduce framework of Hadoop does, however, offer more flexibility than relational databases.
Equally, where a programmatic approach is needed (for example, where the user needs to iterate through the data, as when analysing social networks), Hadoop can also be extremely efficient.
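The shape of a MapReduce job can be sketched in a few lines of ordinary Python. This is an in-process illustration of the map, shuffle and reduce pattern only. It is not the Hadoop API, and the click data and function names are hypothetical. On a real cluster the same three phases would be spread across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical web browsing data: (user, page) clicks.
CLICKS = [("u1", "/home"), ("u2", "/home"), ("u1", "/pricing"), ("u1", "/home")]


def page_mapper(click):
    user, page = click
    yield page, user  # key by page so that visitors are grouped per page


def distinct_users(page, users):
    return len(set(users))  # arbitrary code, not limited to SQL aggregates


visitors_per_page = reduce_phase(
    shuffle(map_phase(CLICKS, page_mapper)), distinct_users
)
print(visitors_per_page)
```

Because the mapper and reducer are ordinary functions, they can contain whatever logic the analysis needs, which is where the flexibility comes from.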
Because each environment has different strengths, Hadoop is most useful when combined with traditional analytics in a data warehouse. The optimal approach is to ‘unify’ data across the two, using each for what it does best and managing data efficiently between them, so that overall activity is enhanced with minimal administration.
At a wider business and economic level, probably the most significant outcome of the big data revolution is that analytics is now being discussed at a senior level. Most C-level executives now acknowledge the importance of big data analytics and are increasingly looking to base decisions on fact-based, measurable data analysis.
However, for the business to truly get the best out of any investment in big data, the analytics system needs to be configured in a way that makes it easy to move data from one database to another.
A good way to achieve this is to integrate different platforms that help the business realise the maximum potential of different types of data. For example, ‘hot’ customer data should be stored in a traditional database that is easily available to a large number of users, whereas unstructured data could be stored and analysed in Hadoop.
This enables the business to pursue data-driven decisions now, whilst extending the capability to other parts of the business on the back of proven success, or discovering new customer insights that enhance the existing decision-making process.
Duncan Ross is the director of data science at Teradata UK