5 ways to get more out of Hadoop

For every business, time is of the essence – a minute can mean millions in some cases.

As organisations increasingly look to speed time to market, anticipate and respond to customers’ needs, and introduce new products and services, they need confidence that their decisions are based on information that’s fresh and accurate.

Operating with data that’s a week, a day or even a few hours old can be fatal.

Hadoop, the big data processing framework, is now used extensively to help businesses gain insight in real time.

Here are five ways developers can sharpen their use of the Hadoop framework.

1. Go faster

Just by moving from data integration jobs built with MapReduce to Apache Spark, you will be able to complete those jobs two and a half times faster.

Once you convert the jobs, if you add Spark-specific components for caching and partitioning, you can increase performance an additional five times.

From there, if you increase the amount of RAM on your hardware, you can do more in memory and see a ten-fold improvement in performance.
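As a rough back-of-the-envelope illustration, the sketch below applies the first two speed-ups quoted above to a hypothetical 100-minute MapReduce job. The baseline runtime is invented, and treating the 2.5x and 5x figures as compounding is an assumption; the article does not say how the ten-fold in-memory figure combines with the others, so it is left out.

```python
def runtime_after_speedups(baseline_minutes, speedups):
    """Return the runtime (in minutes) after each successive speed-up factor."""
    runtimes = []
    current = baseline_minutes
    for factor in speedups:
        # Assumption: each factor divides the runtime left by the previous step.
        current = current / factor
        runtimes.append(current)
    return runtimes


baseline = 100.0  # hypothetical MapReduce job runtime, in minutes
# 2.5x from moving to Spark, then 5x from caching and partitioning components
stages = runtime_after_speedups(baseline, [2.5, 5.0])
print(stages)
```

Under these assumptions the job would drop from 100 minutes to 40, and then to 8. Actual gains depend on the workload and the cluster.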

Overall, when you combine Hadoop with your traditional bulk-and-batch data integration jobs, you can dramatically improve performance.

2. Go real-time

It’s one thing to process data in bulk and batch; it’s another to process it in real time.

This is not about understanding what your customers did on your website yesterday. It’s about knowing what they are doing right now – and being able to influence those customers’ interactions immediately – before they leave your site.

One of the best things about Spark, and Spark Streaming, is that you now have one tool set that allows you to operate in bulk, in batch and in real time.

With data integration tools, you can design integration flows with one tool set across all of these systems, so that you can pull in data from historical sources such as Oracle and Salesforce, and then add real-time streaming data from websites, mobile devices and sensors.

The bulk-and-batch information may be stored in Hadoop, while real-time information can be stored in NoSQL databases. Regardless of where the data lives, mobile, analytic and web apps can use Spark SQL as a single query interface to search across all data sources for the right information.

3. Get smart

So, now you can process data in real time. But how about processing it intelligently in real time?

To raise the IQ of your queries, Spark offers machine learning which, for example, lets you personalise web content for each shopper, an approach that can nearly triple the number of page views.

Spark’s machine learning capabilities also allow you to deliver targeted offers, which can help double conversion rates. So, you are not only creating a better customer experience, but you are also driving more revenue.

For example, German retailer OTTO Group is using Spark to predict which online customers will abandon their shopping carts and then present them with incentive offers.

If you are a $12 billion company with the industry-standard cart abandonment rate of 50% to 70%, then even a small improvement can result in millions, or even billions, of dollars in extra revenue.
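To make that arithmetic concrete, here is a minimal sketch of the calculation. It assumes the $12 billion figure is annual revenue from completed carts, that revenue scales linearly with the share of carts completed, and that abandonment falls by one percentage point; none of these assumptions comes from the article, and the function name is purely illustrative.

```python
def extra_revenue(current_revenue, abandonment_rate, reduction_points):
    """Estimate extra revenue from lowering the cart abandonment rate.

    Assumption: revenue is proportional to the share of carts completed.
    """
    completed_share = 1.0 - abandonment_rate
    # Value of all carts started, including the abandoned ones
    total_cart_value = current_revenue / completed_share
    # Each point of abandonment recovered converts that share of all carts
    return total_cart_value * reduction_points


revenue = 12e9  # assumed: annual revenue from completed carts, in dollars
for rate in (0.5, 0.6, 0.7):
    gain = extra_revenue(revenue, rate, 0.01)  # one percentage point fewer abandoned carts
    print(f"abandonment {rate:.0%}: about ${gain / 1e6:,.0f} million extra")
```

Under these assumptions, a single percentage point is worth roughly $240 million to $400 million a year, and a reduction of several points would approach the billions.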

Simple design tools make it possible for companies of any size, not just large retailers like OTTO, to do real-time analytics and deliver an enhanced customer experience.

4. Stop hand coding

Everything discussed in the tips above can be hand-coded for Spark in Java or Scala. But there’s a better way. If you use a visual design interface, you can increase development productivity tenfold or more.

Designing jobs with a visual UI also makes it much easier to share work with colleagues. People can look at a job and understand what it is doing, which makes collaboration straightforward and re-use of development work simple.

5. Get a head start

You can start straight away by using a big data sandbox, a virtual machine with Spark pre-loaded and with a real-time streaming use case. And if you need it, there’s a simple guide that walks you through a step-by-step process, making it easy to pick things up and hit the ground running.


Sourced from Ashley Stirrup, Talend

Ben Rossi