Is Hadoop’s legacy in the cloud?

Jens Graupmann, VP of product management at Exasol, discusses whether Hadoop could now be considered legacy technology, as the enterprise of just five years ago differs so vastly from the enterprise of today.

It was sometime around 2010 that “big data” became a buzzword. Data was an untapped resource and technology was going to unlock the riches within it. Leading the hype was an open-source project called Hadoop. By using Hadoop, you could safely store and manipulate large amounts of data with commodity hardware. It was massively powerful and scalable, and a large community grew around it and developed it.

These days we don’t talk about commodity hardware so much. In fact, we don’t talk about hardware at all – by using the cloud, compute and storage have become things that you buy on demand. Analytics is a service, to be bought by the hour.

So, what’s happened to Hadoop? Does it have a place in the cloud? Why have so many companies abandoned their on-premise Hadoop installations in favour of the cloud?

Hadoop – the elephant that could

Hadoop’s origins can be traced to the Apache Nutch project in the early 2000s. Nutch was an open source web crawler built to index the web. It was part of the Apache Software Foundation, itself one of the pioneers of open source software.

At the time, the Nutch project was struggling to parallelise its web crawler. It worked well on one machine, but handling millions of webpages (“web-scale”) was out of reach. In December 2004, Google released a paper called “MapReduce: Simplified Data Processing on Large Clusters”, which described how Google had managed to index the rapidly growing volume of content on the web by spreading the workload across large clusters of commodity servers.
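The idea behind that paper can be illustrated with a small, single-machine sketch. The code below is not Hadoop or Nutch code; the function names and the two sample pages are assumptions made up for illustration. It builds a tiny search index in three steps: a map step that emits (word, page) pairs, a shuffle step that groups the pairs by word, and a reduce step that turns each group into a list of pages.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair for every word on a page."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group the emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse the values for one word into a sorted list of unique pages."""
    return word, sorted(set(doc_ids))


def build_index(pages):
    # On a real cluster, each page (or block of pages) would be mapped on a
    # different machine; here everything runs in one process for clarity.
    pairs = [pair for doc_id, text in pages.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample input, not taken from the article.
pages = {
    "page1": "big data on commodity hardware",
    "page2": "big clusters of commodity servers",
}
index = build_index(pages)
```

Because each map call looks at a single page and each reduce call looks at a single word, both steps can be spread across as many machines as are available, which is what made the approach attractive for a web crawler.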

It was the perfect fit for Nutch’s problems, and by July 2005 its core team had integrated MapReduce into Nutch. Not long after, the novel filesystem and the MapReduce software were spun out into their own project called Hadoop – famously named after the toy elephant that belonged to the project lead’s son.

The project accelerated in 2006 when Yahoo! used Hadoop to replace its search backend system. Soon after, it was adopted by Twitter, Facebook and LinkedIn too – in fact, it became the de facto way to work with web-scale data.

The technology was revolutionary at the time. Before Hadoop, storing large amounts of structured data was difficult and expensive. Most organisations just kept the most valuable data and discarded the rest. What Hadoop did was reduce the burden of data storage – for the first time it became cost-effective to store lots of data – “big” amounts of data.

Realisation – Hadoop is an ecosystem, not a solution

Lots of businesses both large and small set up Hadoop clusters and hoped to gain business insights or new data-based capabilities from their data. However, for many of them, the results were a disappointment.

More often than not, the Hadoop cluster was installed before they had a good idea of a use case for it. When they tried to execute on an idea – which was often business intelligence or analytics-based – they found Hadoop to be too slow for interactive queries.

What many people failed to realise is that Hadoop itself is more of a framework than a big data solution. Plus, with its broad ecosystem of complementary open source projects, Hadoop was too complicated for most businesses. Fully leveraging it needed a level of configuration and programming knowledge that could only be supplied by a dedicated team.

Even when there was a dedicated internal team, it sometimes needed something extra. For instance, one of Exasol’s clients, King Digital Entertainment, maker of the Candy Crush series of games, couldn’t get the most out of Hadoop. It wasn’t quick enough for the interactive BI queries that the internal data science team demanded. They needed an accelerator on a multi-petabyte Hadoop cluster that would allow their data scientists to interactively query the data.

Hadoop in the cloud

The world of data warehousing has changed in recent years, and Hadoop has had to adapt. The IT infrastructure of 2009-2013, when Hadoop was at the peak of its fame, differs greatly from the IT infrastructure of today. The public cloud didn’t even exist when Hadoop was created in early 2006; AWS only launched in March 2006. So, the IT landscape in which Hadoop had its formative years has changed immeasurably.

This has caused the way Hadoop is used to evolve. Most public cloud infrastructure providers now actively maintain and integrate a managed Hadoop platform. The most widely used example is AWS Elastic MapReduce, but Azure has HDInsight and Google Cloud Platform has Dataproc. These days the Hadoop-based cloud platform is most often used for machine learning, batch processing or ETL jobs.

Moving to the cloud has benefited Hadoop. The complicated set-up is taken care of, and it’s ready to be used immediately on demand. But Hadoop has competition: it is no longer the only option for secure, robust, cheap data storage. So it’s finding itself used for particular workloads rather than being the centre of the data universe, its usual on-premise incarnation.

What’s the future for Hadoop?

For certain organisations, Hadoop is still a great on-premise solution. We still see strong demand for on-premise solutions in our installations, including those integrating Hadoop clusters. The demand isn’t going away anytime soon. The simple fact is that if it’s working well, then there’s often no need to change it, and Hadoop is relatively easy to scale so it can grow with your business.

However, it seems the majority of businesses are now looking to run their own data warehouse using public cloud services. We just launched our enterprise-grade data warehouse on AWS, and this was entirely driven by customer demand: more and more businesses are asking for it. And for most of these businesses, Hadoop is just another tool in the cloud toolbox. When you need to run a job at scale it’s a great option, and in the cloud there’s a level of ease of use for Hadoop that hasn’t been enjoyed previously.

So, what does the future hold for Hadoop?

Hadoop was designed as a tool for a job. Originally, it was the means of building a web crawler to index the web. These days it’s best suited for batch processing, data enrichment jobs, or ETL at scale. It’s a great on-premise solution for those businesses which understand its strengths and weaknesses and need to store large amounts of data on commodity hardware.

Many on-premise technologies are finding themselves demoted to legacy technology. However, Hadoop’s legacy may well be in the cloud, where it has longevity and staying power. It’s a fantastic tool to have in your cloud toolbox when you need to run a batch job at scale. With the cloud, Hadoop enjoys a level of ease of use that hasn’t been enjoyed previously.

Written by Jens Graupmann, VP of product management at Exasol
