Big data is probably one of the most misused terms of the last decade. It was widely promoted, discussed and spread by business managers, technical experts, and experienced academics. Slogans like ‘data is the new oil’ were widely accepted as unquestionable truth. But amid this hype, different ideas and solutions have been considered, and then rejected. Event stream processing may provide the answer.
The rise and fall of Hadoop
These beliefs helped push Hadoop, an open source distributed processing framework that manages data processing and storage for big data. Its stack, initially developed at Yahoo! and now maintained by the Apache Software Foundation, was recognised as ‘the’ big data solution.
Many companies started to offer commercial, enterprise-grade, supported versions of Hadoop, and it was eventually adopted across industries by both large Fortune 500 companies and medium-sized ones.
The prospect of analysing huge amounts of data generated by heterogeneous sources, in an attempt to boost competitiveness and profitability, provided an alluring reason to use Hadoop.
There was another factor: Hadoop was seen as a way to replace expensive legacy data warehouse installations, improving both performance and data availability while simultaneously reducing operational costs.
However, in recent years, a growing number of analysts who were focused on the big data market published articles suggesting that Hadoop was not all it had been cracked up to be.
Their critique can be summarised as follows:
● The deployment model is moving from on-premises solutions to hybrid, full-cloud and multi-cloud architectures. Hadoop technology was not built to be completely cloud-ready. Furthermore, cloud vendors have been selling cheaper solutions that are easier to manage and use for years;
● Machine learning technologies and platforms are quickly reaching maturity. The Hadoop stack was not designed around machine learning concepts, even if support in this area has grown over the years;
● The advanced and real-time analytics market is growing rapidly, but the Hadoop stack does not seem to be the best fit for implementing those innovative forms of analytics.
In short, analysts began to fear that Hadoop was no longer an innovative technology. To solve future challenges, something different was required.
Our own experience chimed with this. We found that solutions based on the Hadoop stack were hard and expensive to develop and maintain. Furthermore, it was hard to recruit professionals with the right skills and proven experience.
Consequently, moving systems from proof of concept and prototype status into real production seemed like an unreachable finish line.
There was another issue. Hadoop vendors tended to focus on the data lake. While big and complex organisations need a single de-normalised data repository, the projects designed to feed data lakes tend to take a long time to reach maturity. Most of these initiatives turned out to be expensive, from both an economic and a project governance point of view.
These complex repositories, or data lakes, are filled with historical data. If you are lucky, they can provide a series of snapshots of the last closing day. While that could be acceptable in many business scenarios, the enterprise world needs to react rapidly to wider developments. For this reason, companies are increasingly asking for more accurate and rapid insight, so that they can immediately forecast the possible outcomes and scenarios generated by the available set of input actions.
Event stream processing architecture is one potential solution
It is also clear that event stream processing could become invaluable in the following ways:
● the implementation of multi-cloud architectures (real-time or near-real-time integration of distributed data across different data centers and cloud vendors);
● the deployment and monitoring of machine learning models, with the benefit of real-time predictions;
● real-time data processing without losing accuracy while analysing historical data.
For these reasons, event streaming technologies are improving every day.
The majority of Hadoop vendors chose to answer these challenges by incorporating one of the streaming frameworks offered by the open-source landscape into their big data distribution. The selected solutions were normally Apache Storm or Apache Spark Streaming.
Unfortunately, by doing this, they often added even more complexity to their stack. The products on offer eventually included a large number of computational engines, making the choice of the right tool for the job painful for practitioners such as architects and developers.
Other vendors are trying new ways of combining bounded data sources (e.g. a file) and unbounded ones (e.g. an infinite incoming sequence of tweets), by using stream engines for batch processing as well.
What is the relationship between stream and batch processing?
While it’s almost impossible to run a stream processing job on top of a batch processing framework, the opposite is largely feasible. For instance, we can read a text file using a stream processing framework, translating each file row into a single event and processing it. A batch processing framework, on the other hand, cannot work on each single event, only on a set of events: to reach a similar result it has to be scheduled continuously.
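The asymmetry described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the API of any streaming engine: the function names (`events_from_lines`, `running_word_count`, `batch_word_count`) and the sample data are hypothetical. The same per-event operator consumes a bounded source (rows of a file) and an unbounded one (an endless generator), whereas the batch function needs the whole set of rows up front.

```python
import io
import itertools
from typing import Iterable, Iterator, List


def events_from_lines(lines: Iterable[str]) -> Iterator[str]:
    """Translate each non-empty row of a bounded source into a single event."""
    for line in lines:
        row = line.strip()
        if row:
            yield row


def running_word_count(events: Iterable[str]) -> Iterator[int]:
    """Stream operator: emit an updated total after every single event."""
    total = 0
    for event in events:
        total += len(event.split())
        yield total


def batch_word_count(rows: List[str]) -> int:
    """Batch operator: needs the complete set of rows before it can answer."""
    return sum(len(row.split()) for row in rows)


# Bounded source: an in-memory stand-in for a text file (hypothetical content).
bounded = io.StringIO("big data\nevent stream processing\n\nhadoop\n")
stream_totals = list(running_word_count(events_from_lines(bounded)))  # [2, 5, 6]

# Unbounded source: an endless generator; only the first three results are taken.
unbounded = (f"tweet number {i}" for i in itertools.count())
first_three = list(itertools.islice(running_word_count(unbounded), 3))  # [3, 6, 9]

# The batch job gives only the final answer, and only once all rows are available.
batch_total = batch_word_count(["big data", "event stream processing", "hadoop"])  # 6
```

Note that `batch_word_count` could not be handed the unbounded generator at all: it would never return. To approximate the streaming result it would have to be re-run on successive chunks of input, which is the continuous scheduling mentioned above.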
In conclusion, the usage of an event stream processing engine can provide considerable benefits. It can:
● work on both bounded data (data at rest) and unbounded data (data in motion);
● process data with tunable low latency (ranging from milliseconds to seconds) while still sustaining high throughput;
● offer different processing semantics (at-most-once, at-least-once or exactly-once);
● process heterogeneous data in a distributed fashion, scaling the system out horizontally.
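To make the processing semantics in the list above more concrete, here is a minimal Python simulation. It is a sketch under assumptions rather than how any particular engine implements them: the `run` function, the `EVENTS` list and the single simulated crash are all hypothetical, and the "exactly-once" case is approximated by at-least-once delivery plus a sink that deduplicates by event identity (the `seen` set stands in for durable state in the sink).

```python
from typing import List, Set

EVENTS = ["e0", "e1", "e2", "e3"]  # hypothetical input log


class Crash(Exception):
    """Simulated worker failure."""


def run(commit_first: bool, crash_at: int, dedupe: bool = False) -> List[str]:
    """Consume EVENTS with one simulated crash and restart; return what the sink holds."""
    sink: List[str] = []
    seen: Set[str] = set()   # assumed to survive restarts (durable sink state)
    committed = 0            # offset the consumer resumes from after a restart
    crashed = False

    while committed < len(EVENTS):
        offset = committed
        try:
            while offset < len(EVENTS):
                event = EVENTS[offset]
                if commit_first:
                    committed = offset + 1          # commit, then process
                    if offset == crash_at and not crashed:
                        crashed = True
                        raise Crash()               # event is committed but never written
                    sink.append(event)
                else:
                    if not (dedupe and event in seen):
                        sink.append(event)          # process, then commit
                        seen.add(event)
                    if offset == crash_at and not crashed:
                        crashed = True
                        raise Crash()               # event is written but not committed
                    committed = offset + 1
                offset += 1
        except Crash:
            continue  # restart from the last committed offset
    return sink


at_most_once = run(commit_first=True, crash_at=2)
# ['e0', 'e1', 'e3']: e2 is lost
at_least_once = run(commit_first=False, crash_at=2)
# ['e0', 'e1', 'e2', 'e2', 'e3']: e2 is duplicated
effectively_once = run(commit_first=False, crash_at=2, dedupe=True)
# ['e0', 'e1', 'e2', 'e3']: the duplicate is filtered by the sink
```

The only difference between the first two runs is whether the offset is committed before or after the write, which is why the choice of semantics is usually a configuration trade-off between possible loss and possible duplication.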
However, there is a dark side. It’s not as easy as it might seem to architect and develop solutions based on stream processing. Although such technologies are lightweight and usually need a less complex stack, they are not straightforward to use correctly at first.
Given its importance and benefits, event stream processing should instead be democratised by tackling these impediments: providing high-level self-service tools, enforcing best practices and patterns, and leveraging the big data stacks that companies often already have in place, so as to preserve the investments made in the past.
Roberto Bentivoglio is the CTO of Radicalbit, a specialised software house based in Milan. Its mission is to help companies with streaming technologies by offering tools that make it simple to implement vertical solutions based on such technologies. One of its key aims is to combine emerging event stream processing technologies with the power of artificial intelligence.