Is Hadoop’s position as the king of big data storage under threat?

Hadoop, the next-generation data platform, is pitched as the king of big data storage and processing. As with normal rulers there are potential usurpers to the throne, but Hadoop has defended it valiantly and now has a new set of allies in the form of emerging technologies.

Hadoop: the data storage king

Collecting, understanding and utilising big data is a fundamental requirement for businesses wishing to remain competitive. Those organisations that master big data will be kings and queens of their respective industries, by improving operational efficiency and customer experience.

Often businesses spend 95% of their time looking for the relevant data and only 5% of the time using it. This is neither efficient nor productive, and could quite conceivably lead to an organisation’s fall, or at the very least stagnation.

For a long time, Hadoop was seen as the most effective software solution for tackling big data. Its function: to store and process big data in a simple and fast manner. But is that still the case?

As more and more businesses become reliant on big data to thrive and survive, software like Hadoop will become an increasingly valued commodity.

Efficiency drive

In its most unflattering form, Hadoop is defined on searchcloudcomputing.com as an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project managed by the Apache Software Foundation.

This is an accurate and fair description, but it is hardly a love sonnet. Businesses that operate Hadoop will perhaps lean towards a more romantic view of the software that has made their business into a more efficient beast.

‘Hadoop is a unique architecture designed to enable organisations to gain new analytic insights and operational efficiencies,’ said Carole Murphy, product director for data security at HPE Security. ‘The resulting flexibility, performance and scalability are unprecedented.’

The platform has significant business benefits in storing and processing big data through, as Murphy reveals, ‘the use of multiple standard, low-cost, high-speed, parallel processing nodes operating on very large sets of data’.

Storing data is the first function Hadoop offers, as Forrester analyst Mike Gualtieri explains: ‘Hadoop lets you store files that are bigger than what can be stored on one particular node or server. So you can store very, very large files. It also lets you store many, many files.’

It allows a business to store data that was previously too expensive to keep. MapReduce, the second function of Hadoop, provides the framework for processing that data. It is here that Hadoop excels.

Moving data over a network can be painfully slow. MapReduce, operating within Hadoop, provides the solution to this painstaking process by moving the processing software to the data – it operates from the inside.
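For readers who want to see the idea rather than take it on trust, the following is a minimal sketch in plain Python. It is not Hadoop's actual API: the `nodes` dictionary and the functions `map_phase`, `shuffle` and `reduce_phase` are hypothetical names invented for illustration, and a real cluster would run the map step on separate machines rather than in one process. It shows a word count in which the map step runs against each node's local data and only the emitted key-value pairs travel onwards to be grouped and summed.

```python
from collections import defaultdict

# Hypothetical cluster: each "node" already stores its own slice of the data.
nodes = {
    "node-1": ["big data big insight", "data lake"],
    "node-2": ["big cluster", "data data data"],
}


def map_phase(local_lines):
    """Runs where the data lives and emits (word, 1) pairs."""
    return [(word, 1) for line in local_lines for word in line.split()]


def shuffle(pairs_per_node):
    """Groups the emitted values by key across all nodes."""
    grouped = defaultdict(list)
    for pairs in pairs_per_node:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapses each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# The processing function is sent to each node's data, not the other way round.
mapped = [map_phase(lines) for lines in nodes.values()]
result = reduce_phase(shuffle(mapped))

print(result["data"], result["big"])  # prints: 5 3
```

The point of the design is that the raw lines never leave the node that holds them; only the much smaller intermediate pairs do.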

In terms of how Hadoop works technically, the Apache Software Foundation describes it as a framework that is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
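A toy illustration may help here. The sketch below, in plain Python, is an assumption-laden model rather than Hadoop's real storage code: the function names, the eight-byte block size and the round-robin placement are all invented for the example, and a production cluster uses far larger blocks and a more sophisticated placement policy. It splits a file into blocks, keeps several copies of each block on different nodes, and then reads the file back while one node is down.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cuts a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(blocks, node_names, replication=3):
    """Toy round-robin placement: block i is copied to `replication` nodes."""
    placement = {name: {} for name in node_names}
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = node_names[(i + r) % len(node_names)]
            placement[node][i] = block
    return placement


def read_file(placement, num_blocks, failed=frozenset()):
    """Reassembles the file from any surviving replica of each block."""
    parts = []
    for i in range(num_blocks):
        for name, stored in placement.items():
            if name not in failed and i in stored:
                parts.append(stored[i])
                break
        else:
            # No live node holds this block, so the file cannot be rebuilt.
            raise IOError(f"block {i} lost: no live replica")
    return b"".join(parts)


data = b"a file too large for any single machine"
blocks = split_into_blocks(data, block_size=8)
node_names = ["node-1", "node-2", "node-3", "node-4"]
placement = place_replicas(blocks, node_names, replication=3)

# One node fails, yet the software still finds a copy of every block.
recovered = read_file(placement, len(blocks), failed={"node-2"})
print(recovered == data)
```

Because the recovery logic lives in software, the individual machines can be cheap and fallible; the file is only lost if every node holding a given block fails at once.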

Again, for those without a technical background, this is not the most romantic or accessible description.

What is important, some would suggest, is not how it works, but what benefits it creates for businesses seeking to tap into big data.

Gaining the advantage

So, how important is tapping into the proverbial big data gold mine for businesses?

‘Imperative, if you will,’ Jules Damji, Spark community evangelist at Databricks, told Information Age.

‘Data is everywhere, and it’s growing in velocity, volume and variety, in all sectors of business and in all industry verticals. Big data is the new competitive advantage, just as automation in manufacturing enables production at scale or just as IT innovation enables productivity at scale.’

Effectively utilising big data is fundamental in improving operational insights and efficiencies, and in understanding customer usage and behavioural patterns. Describing it as the Holy Grail for business is not an overstatement.

Hadoop offers businesses the ability to use this big data, with endless possibilities, as Naser Ali, marketing director, EMEA at MapR, told Information Age. It provides the opportunity to ‘write new applications for fraud detection, customer churn or product recommendation, or create whole new business models such as a replacement for taxi cabs via a smartphone app – leveraging GPS and payment systems – all of which need a common, distributed next-generation data platform’.

It provides, Ali went on, the same capabilities that supercomputing does but for a fraction of the cost. Apache Hadoop’s immense storage and processing power supports the big data movement by providing flexible, scalable performance for businesses looking to make the digital leap.

However, Hadoop was built with one flaw: data security ‘was not a key design criterion’, he said. As business operations become increasingly connected, with a reliance on big data, protecting data becomes more difficult and more important.

Unfortunately, Hadoop was not designed with security as its primary function.

‘By its nature,’ said Murphy, ‘Hadoop poses many unique challenges to properly securing this environment, not least of which include automatic and complex replication of data across multiple nodes once entered into the HDFS data store.’

Sitting target

The platform is vulnerable to cyber attacks and data leaks – it is too expansive and open to be fully protected.

It is, therefore, a desirable target for hackers because of the range and quantity of data it holds. This is a problem for businesses that will face much harsher sanctions in a world dominated by stricter regulations, like the impending EU General Data Protection Regulation (GDPR).

Murphy explained that a data-centric security approach to the problem is a possible solution. This entails ‘de-identifying the data as close to its source as possible, replacing the sensitive data elements with usable, yet de-identified, equivalents that retain their format, behaviour and meaning’.

‘This is also called “end-to-end data protection” and provides an enterprise-wide solution for data protection that extends beyond the Hadoop environment,’ explained Murphy.
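What a format-retaining, de-identified equivalent might look like can be sketched in a few lines of Python. This is an illustration only, not the product or method Murphy describes: the function name `deidentify` and the key are hypothetical, and the approach shown is a simple keyed, one-way substitution built on HMAC rather than true format-preserving encryption, which would be reversible by an authorised key holder. Digits are replaced with digits and letters with letters, while separators stay where they are.

```python
import hashlib
import hmac
import string


def deidentify(value: str, key: bytes) -> str:
    """Replaces each digit or ASCII letter with a keyed substitute of the same kind."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]  # keyed pseudo-random byte for this position
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)  # keep separators such as '-' or '@' in place
    return "".join(out)


key = b"demo-key-not-for-production"  # hypothetical key for the example
token = deidentify("4111-1111-1111-1111", key)
print(token)
```

The same input and key always yield the same token, so de-identified records can still be joined and counted downstream without exposing the original value.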

So, the adoption of Hadoop does present security concerns relating to the massive amount of stored data that resides in its system.

But there are security solutions in the form of IT infrastructure protection and, as mentioned, end-to-end data protection.

If these security systems can be effectively implemented then Hadoop has a crucial role to play in the future storage and processing capabilities of big data for businesses. Or does it?

Pretenders to the throne

Despite the imperious status of Hadoop and the clear benefits of using it to effectively store and process big data, alternative platforms have arisen in recent years that threaten to dethrone the king.

Damji suggests that Apache Spark ‘offers faster data processing capabilities’ and is ‘an easier and faster alternative to Hadoop’s MapReduce’. In terms of storage too, says Damji, there are other software systems just as capable, such as cloud storage, key-value stores or traditional RDBMS systems and warehouses.

Equally, Hadoop seems to falter in terms of functionality, suggests Patrick McFadin, chief evangelist for Apache Cassandra at DataStax. It is not a suitable technology for handling data transactions, he said, adding – unsurprisingly – that Apache Cassandra and other NoSQL offerings are better suited to the task than Hadoop or more traditional relational databases.

He also suggested that Cassandra and Spark can provide near real-time analytics when transactions are taking place, whereas Hadoop cannot.

‘Graph databases are better at handling relationships between objects and datasets, and making that data understandable,’ says McFadin. ‘Hadoop is great for batch-style analytics, but that’s not the only way that big data can be used.’

What does the future hold?

Hadoop, however, is by no means a spent commodity. Technologies and ecosystem platforms like Hadoop and the cloud are constantly evolving to meet the growing data demands of businesses and their customers.

The potential opportunities Hadoop presents for IT are expanding, in line with emerging technologies. With the emergence of the cloud and the host of applications it creates for new and increased revenues, for example, Hadoop offers the ability to manage these cloud-based benefits as a converged offering.

As more businesses move online to the cloud, this service becomes even more important.

Looking ahead, Hadoop has vast potential to support emerging technologies and the data they generate, such as data lakes, self-service data and the continuous data being streamed through the Internet of Things.

Hadoop is being used, and will continue to be used, to store and process the vast amount of data that has become so critical to business decisions and operations.

Monitoring the security of Hadoop’s platform is fundamental, and the question does remain: is there a way to utilise emerging technologies with Hadoop in a secure manner with regard to data and the customers that this data represents?

Damji suggests, however, that the alternatives are ‘unaffordably expensive, slow and likely to fail’.

He added, ‘Traditional systems cannot scale, cannot handle newer types of semi-structured data, cannot work well over complex hybrid environments, do not have developer mindshare and were designed for a different era.’

On this basis, Hadoop really is the king, and businesses are wise to this.
