
Why event stream processing is leading the new ‘big data’ era

From the decline of key technologies to the new solutions on offer: Roberto Bentivoglio takes us swimming in the stream, with a technical explanation of event stream processing architectures as a future-proof approach to taking full control of your data strategy

Big data is probably one of the most misused terms of the last decade. It was widely promoted, discussed and spread by business managers, technical experts and experienced academics. Slogans like ‘data is the new oil’ were widely accepted as unquestionable truth. But amid this hype, different ideas and solutions have been considered and then rejected. Event stream processing may provide the answer.

The rise and fall of Hadoop

These beliefs helped push Hadoop, an open-source distributed processing framework that manages data processing and storage for big data. Its stack, initially developed at Yahoo! and later maintained by the Apache Software Foundation, was recognised as ‘the’ big data solution.

Many companies started to offer commercial, enterprise-grade, supported versions of Hadoop, and it was eventually adopted across industries by both large Fortune 500 companies and medium-sized businesses.

The prospect of analysing huge amounts of data generated by heterogeneous sources, in an attempt to boost competitiveness and profitability, provided an alluring reason to use Hadoop.

There was another factor: Hadoop was seen as a way to replace expensive legacy data warehouse installations and, in the process, to improve both performance and data availability while reducing operational costs.

However, in recent years, a growing number of analysts focused on the big data market have published articles suggesting that Hadoop was not all it had been cracked up to be.

Their critique can be summarised as follows:

● The deployment model is moving from on-premises solutions to hybrid, full and multi-cloud architectures. Hadoop technology was not designed to be completely cloud-ready. Furthermore, cloud vendors have for years been selling solutions that are cheaper and easier to manage and use;

● Machine learning technologies and platforms are quickly reaching maturity. The Hadoop stack was not designed around machine learning concepts, even though support in this area has grown over the years;

● The advanced and real-time analytics market is growing rapidly, but the Hadoop stack does not seem to be the best fit for implementing those innovative forms of analytics.

In short, analysts began to fear that Hadoop was no longer an innovative technology. To solve future challenges, something different was required.

Our own experience chimed with this: we found that solutions based on the Hadoop stack were hard and expensive to develop and maintain. Furthermore, it was hard to recruit professionals with the right skills and proven experience.

Consequently, moving systems from proof of concept or prototype status into real production seemed like an unreachable finish line.

There was another issue. Hadoop vendors tended to focus on the data lake. While big and complex organisations need a single de-normalised data repository, the projects designed to feed data lakes tend to take a long time to reach maturity. Most of these initiatives turned out to be expensive, from both an economic and a project governance point of view.

These complex repositories, or data lakes, are filled with historical data. If you are lucky, they can provide a series of snapshots up to the last closing day. While that could be acceptable in many business scenarios, the enterprise world needs to react rapidly to wider developments. For this reason, companies are increasingly asking for more accurate and rapid insight, so that they can immediately forecast the possible outcomes and scenarios generated by the available set of input actions.

Event stream processing architecture is one potential solution

It is clear that event stream processing could become invaluable in the following ways:

● the implementation of multi-cloud architectures (real-time or near-real-time integration of distributed data across different data centers and cloud vendors);
● the deployment and monitoring of machine learning models, enjoying the power of real-time predictions;
● real-time data processing without losing accuracy while analysing historical data.

For these reasons, event streaming technologies are improving every day.

The majority of Hadoop vendors chose to answer these challenges by incorporating one of the streaming frameworks available in the open-source landscape into their big data distribution. The selected solutions were normally Apache Storm or Apache Spark Streaming.

Unfortunately, by doing this, they often added even more complexity to their stack; the resulting products included a large number of computational engines, making the choice of the right tool for the job painful for practitioners such as architects and developers.

Other vendors are trying new ways of handling the combination of bounded (e.g. a file) and unbounded (e.g. an infinite incoming sequence of tweets) data sources, by using stream engines for batch processing as well.

What is the relationship between stream and batch processing?

While it’s almost impossible to run a stream processing job on top of a batch processing framework, the opposite is largely feasible. For instance, we can read a text file using a stream processing framework, translating each file row into a single event and processing it. A batch processing framework, on the other hand, cannot work on each single event, although it can process a set of events: to reach a similar result, it has to be continuously rescheduled.
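The contrast can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not how any particular engine works: the function names (`events_from_file`, `running_word_count`, `scheduled_batches`) and the sample rows are hypothetical, and an in-memory buffer stands in for the text file.

```python
import io
from typing import Iterable, Iterator, List, TextIO


def events_from_file(handle: TextIO) -> Iterator[str]:
    """Treat a bounded source (a text file) as a stream: one event per row."""
    for row in handle:
        yield row.rstrip("\n")


def running_word_count(events: Iterable[str]) -> Iterator[int]:
    """Stream style: emit an updated total after every single event."""
    total = 0
    for event in events:
        total += len(event.split())
        yield total


def batch_word_count(batch: List[str]) -> int:
    """Batch style: one result for a whole set of events."""
    return sum(len(event.split()) for event in batch)


def scheduled_batches(rows: List[str], batch_size: int) -> Iterator[int]:
    """Approximate streaming by re-running the batch job on successive slices."""
    for start in range(0, len(rows), batch_size):
        yield batch_word_count(rows[start:start + batch_size])


# Hypothetical sample data standing in for a text file.
text = "first event\nsecond event here\nthird\n"

# The stream job reacts to each row as it arrives.
stream_results = list(running_word_count(events_from_file(io.StringIO(text))))

# The batch job only produces a result once per scheduled run.
batch_results = list(scheduled_batches(text.splitlines(), batch_size=2))
```

The stream version produces a result per event, whereas the batch version only reports once per scheduled run, so its latency is bounded below by the scheduling interval.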

In conclusion, the usage of an event stream processing engine can provide considerable benefits. It can:

● work on both bounded data (data at rest) and unbounded data (data in motion);
● process data with a tunable low latency (ranging from milliseconds to seconds) while still sustaining high throughput;
● offer different processing semantics (at-most-once, at-least-once or exactly-once);
● process heterogeneous data in a distributed fashion, scaling out the system horizontally.
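The three processing semantics in the list above can be illustrated with a toy simulation. It is a sketch under stated assumptions rather than a description of any real engine: the `run` function, its parameters and the single simulated crash are all hypothetical, and exactly-once is approximated here by de-duplicating replayed events, whereas production engines typically rely on checkpoints or transactions.

```python
from typing import List, Set


def run(events: List[str], fail_on: str,
        commit_before_processing: bool, deduplicate: bool = False) -> List[str]:
    """Process events with one simulated crash and a restart from the last commit."""
    output: List[str] = []   # side effects visible downstream
    seen: Set[str] = set()   # only used when deduplicate=True
    committed = 0            # offset of the next event to read after a restart
    crashed = False          # the simulated crash happens only once

    while committed < len(events):
        position = committed
        try:
            while position < len(events):
                event = events[position]
                if commit_before_processing:
                    # At-most-once: commit first, so a crash loses the event.
                    committed = position + 1
                    if event == fail_on and not crashed:
                        crashed = True
                        raise RuntimeError("simulated crash before processing")
                    output.append(event)
                else:
                    # At-least-once: process first, so a crash replays the event.
                    if not (deduplicate and event in seen):
                        output.append(event)
                        seen.add(event)
                    if event == fail_on and not crashed:
                        crashed = True
                        raise RuntimeError("simulated crash before commit")
                    committed = position + 1
                position += 1
        except RuntimeError:
            continue  # restart from the last committed offset
    return output


events = ["a", "b", "c"]
at_most_once = run(events, "b", commit_before_processing=True)
at_least_once = run(events, "b", commit_before_processing=False)
exactly_once = run(events, "b", commit_before_processing=False, deduplicate=True)
```

With a crash on event `b`, committing before processing drops `b`, committing after processing delivers it twice, and adding de-duplication on top of replay yields each event once.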

However, there is a dark side. It’s not as easy as it might seem to architect and develop solutions based on stream processing. Although such technologies are lightweight and usually need a less complex stack, they are not straightforward to use correctly at first.

Given its importance and benefits, event stream processing should instead be democratised by tackling these impediments: providing high-level self-service tools, enforcing best practices and patterns, and leveraging the big data stacks that companies often already have in place, so as to preserve past investments.

Roberto Bentivoglio is the CTO of Radicalbit, a specialised software house based in Milan. Its mission is to help companies with streaming technologies by offering tools that simplify the implementation of vertical solutions based on them. One of the company’s key aims is to combine emerging event stream processing technologies with the power of artificial intelligence.
