A guide to data mining with Hadoop

The idea of gaining knowledge through specialised analysis of mass data dates back to the beginnings of data collection in the 1960s; since then, both the volume of data processed and the sophistication of the questions businesses try to answer have steadily increased.

Through this progression from static to dynamic and now to proactive provision of information, the goal of knowledge discovery in databases (KDD) remains the same: to extract useful intelligence from data using techniques such as cluster analysis, classification and regression.

Early adopters of data mining were sectors that already had an affinity to data, such as the financial services industry and insurance companies. Retailers soon followed, using it for tracking inventory and various forms of customer relationship management. Now utility companies use smart meters to predict energy consumption, and health care providers use RFID chips in name tags to track how often doctors wash their hands during rounds, to help prevent the spread of disease.


With the advent of the Internet of Things and the transition from an analogue towards a digital society with an increasing number of data sources that create data at almost every interaction, data mining can become a commodity for almost every company.

How to start data mining

For a typical medium-sized business to benefit from its available data, the first step is, of course, to start collecting and storing that data. Depending on the amount and the application, this can initially be done on a rather small scale.

Most companies already have some form of enterprise data warehouse (EDW) in place, using it to create reports, like quarterly comparisons, for executive personnel and senior management.

The second step, as important as the infrastructure if not more so, is the architecture used to compile and sift through the data. Until recently, only a few large companies – such as IBM, Microsoft and SGI – provided architecture for data mining. Now, open source solutions have become increasingly viable and popular; among the most promising is the Apache Hadoop framework.

Advantages of using Hadoop

Apache Hadoop is a core component of any modern data architecture, allowing organisations to collect, store, analyse and manipulate even the largest amounts of data on their own terms – regardless of the source of that data, how old it is, where it is stored, or in what format.

Most companies need to modernise in order to use larger data sets for more advanced data mining methods such as web or text mining, and predictive analytics. Instead of replacing the complete infrastructure, most start optimising with Hadoop, as it seamlessly augments an existing EDW.

By design, Hadoop can grow as needed. If more data becomes available, it is very easy to add commodity hardware to the clusters it runs on. Because it requires no specialised systems, adding new servers is a relatively inexpensive task. A common pitfall is not starting with data mining at all because companies overestimate the initial capital expenditure required, sometimes by a factor of ten.

Hadoop integrates with most programs, including household names such as Microsoft Excel, which can be used to visualise the mined data quite easily. Its open source nature also provides a constant influx of new features, and the framework can be worked with in almost any programming language.
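One route to that language flexibility is Hadoop Streaming, which lets any executable that reads standard input and writes standard output act as a mapper or reducer. The following is a minimal word-count sketch in Python – the function names and sample lines are illustrative, not from the article, and in production the mapper and reducer would run as separate scripts submitted through the hadoop-streaming jar, with Hadoop itself handling the sort between them:

```python
from itertools import groupby

def mapper(lines):
    # Emit one tab-separated "word\t1" pair per word, as a streaming mapper would.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(pairs):
    # Sum counts per word; assumes input arrives sorted by key,
    # which Hadoop guarantees between the map and reduce phases.
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local dry run: emulate the shuffle with an in-memory sort.
    sample = ["big data on hadoop", "data mining with hadoop"]
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

The same pair of scripts would work unchanged on a laptop or on a thousand-node cluster, which is exactly the portability the framework is prized for.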

Hadoop was developed to analyse massive quantities of unstructured data, a task at which it is very adept while requiring comparatively few resources, making it a natural choice for big data applications.

Be creative – and create new data

The third crucial step is to refrain from 'cooking the data' – that is, using only part of it, or failing to store (or even throwing away) large portions of the available data because they are deemed useless. As before, some of the biggest challenges in data mining remain the evaluation and interpretation of the collected information.


Retailers traditionally limited themselves to basket analysis – tracking what people bought together in order to regulate inventory. With new data available, such as geolocation data from GPS in smartphones, analysing customers' in-store browsing behaviour also becomes possible – which in turn helps shape the store layout to profit from the most travelled pathways.
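The core of basket analysis is simply counting which items co-occur in the same transaction. A minimal sketch in Python (the baskets below are invented sample data, and a real retailer would run this kind of aggregation over Hadoop rather than in memory):

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    # Count how often each unordered pair of items appears in the same basket.
    counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so (a, b) and (b, a) are counted as one pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
]
print(pair_counts(baskets).most_common(1))  # → [(('bread', 'butter'), 2)]
```

Pairs that co-occur far more often than their individual frequencies predict are the candidates for joint promotions or adjacent shelf placement.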

Another interesting example of new uses arising from increased availability of data can be found in facility management. Previously, data might have been used simply to control the airflow of the air conditioning to keep the building temperature at an optimum. Now, using a larger array of data sources, the goal is to save energy by turning on the air conditioning only when an employee enters the building via a chip card, and automatically turning off lighting or airflow completely in parts of the building that are empty. The same system can also set the temperature in an office to the exact level its occupant usually prefers, thus adding comfort to efficiency.

If approached with an open mind and using an ever-increasing data set, along with the necessary hardware and a capable architecture such as Hadoop to extract the buried information, companies can indeed profit in a multitude of ways from data mining and realise unseen potential.


Sourced from Jim Walker, Hortonworks
