Apache Spark: what it is, what it does, and why it matters

One of the newest (and perhaps most talked about) entrants to the big data landscape is Apache Spark, a tool that many view as a more powerful and accessible alternative to Hadoop.

Others see Spark as a powerful complement to Hadoop, with its own set of strengths, quirks and limitations. But what is the reality? Who uses Spark, and how does it differ from other data processing engines?

What is Spark?

Spark is a general-purpose data processing engine that can be used in a wide variety of circumstances. Application developers and data scientists can incorporate Spark into their applications to quickly query, analyse and transform data at scale.

From the start, Spark was optimised to run in memory, allowing it to process data far more quickly than alternatives like Hadoop’s MapReduce, which tends to write data to and from disk between each stage of processing.

Some claim that Spark running in memory can be up to 100 times faster than Hadoop MapReduce. Yet this comparison is not entirely fair as raw speed tends to be more important to Spark’s typical use cases than it is to batch processing, at which MapReduce-like solutions still excel.
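
To make the in-memory point concrete, here is a minimal sketch in Scala (one of the languages Spark supports). The file path and the "ERROR"/"timeout" filters are purely hypothetical; the point is that cache() lets the second query reuse data already held in memory rather than re-reading it from disk:

```scala
import org.apache.spark.sql.SparkSession

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-example")
      .master("local[*]")          // run locally; on a cluster the launcher sets this
      .getOrCreate()

    // Hypothetical input path; substitute your own data set
    val events = spark.read.textFile("data/events.log")

    // cache() asks Spark to keep the filtered data set in memory after the first pass,
    // so the two counts below do not each re-read the file from disk
    val errors = events.filter(_.contains("ERROR")).cache()

    println(s"total errors:   ${errors.count()}")
    println(s"timeout errors: ${errors.filter(_.contains("timeout")).count()}")

    spark.stop()
  }
}
```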

What is Spark capable of?

Spark can handle several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. It supports languages such as Java, Python, R and Scala, and that flexibility makes it well suited to a broad range of use cases.
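
As a rough illustration of how that distributed processing looks to a developer, here is a classic word count written in Scala. The input path is hypothetical; in practice it could point at a directory of files spread across a cluster, and Spark would split the work into partitions processed in parallel:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("data/books/*.txt")   // hypothetical path
      .flatMap(_.split("\\s+"))                    // break lines into words
      .map(word => (word, 1))                      // pair each word with a count of 1
      .reduceByKey(_ + _)                          // sum the counts across the cluster

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```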

> See also: What's eating Hadoop's lunch?

Spark is often used alongside Hadoop’s data storage module, HDFS, but it can integrate equally well with other popular data storage subsystems such as HBase, Cassandra, MapR-DB, MongoDB and Amazon’s S3.
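
A brief sketch of what that looks like in practice, assuming the relevant connectors (for example the S3 Hadoop connector) are on the classpath and using made-up paths and bucket names: the same DataFrame API is used regardless of where the data lives, with only the URI scheme changing.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("storage-example").getOrCreate()

// Same read API, different storage back ends (paths are hypothetical)
val fromHdfs = spark.read.parquet("hdfs:///data/transactions")
val fromS3   = spark.read.json("s3a://my-bucket/events/*.json")

println(fromHdfs.count() + fromS3.count())
```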

Typical use cases include:

Stream processing

Application developers increasingly have to manage and accommodate 'streams' of data. Streams of financial transactions, for example, can be processed in real time to identify – and refuse – potentially fraudulent transactions.
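
A minimal sketch of this idea using Spark Streaming in Scala. The socket source, the record format ("accountId,amount") and the fixed amount threshold are all hypothetical stand-ins; a production job would more likely consume a message bus such as Kafka and apply a real fraud model:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FraudFilter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fraud-filter").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))   // process the stream in 1-second batches

    // Hypothetical source: one "accountId,amount" record per line on a TCP socket
    val transactions = ssc.socketTextStream("localhost", 9999)

    val suspicious = transactions
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))
      .filter { case (_, amount) => amount > 10000.0 }   // crude stand-in for a fraud model

    suspicious.print()   // in practice: flag the transaction or write to a downstream system

    ssc.start()
    ssc.awaitTermination()
  }
}
```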

Machine learning

As data volumes grow, machine learning approaches become more feasible and increasingly accurate. Software can be trained to identify and act upon triggers within well-understood data sets before applying the same solutions to new and unknown data.

Spark’s ability to store data in memory and rapidly run repeated queries makes it a good choice for training machine learning algorithms.
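
Here is a minimal, hedged sketch of training a model with Spark’s machine learning library. The tiny hand-written data set and its two feature values are invented for illustration; the key point is that the training data is cached, because the optimiser makes repeated passes over the same records – exactly where in-memory storage helps:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object TrainModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("train-model").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up training set of (label, features); a real job would load millions of records
    val training = Seq(
      (1.0, Vectors.dense(120.0, 3.0)),
      (0.0, Vectors.dense(15.0, 1.0)),
      (1.0, Vectors.dense(300.0, 5.0)),
      (0.0, Vectors.dense(10.0, 0.0))
    ).toDF("label", "features").cache()   // cached: the optimiser iterates over this data

    val model = new LogisticRegression().setMaxIter(20).fit(training)
    println(s"coefficients: ${model.coefficients}")

    spark.stop()
  }
}
```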

Interactive analytics

Business analysts and data scientists want to explore their data by asking a question, viewing the result, and then either altering the initial question slightly or drilling deeper into results. Spark allows them to do this whether they are looking at stock prices or production line productivity.
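
A short sketch of that ask-refine-drill-down loop using Spark SQL. The data set, table and column names (symbol, close, trade_date) are hypothetical; the point is that the second, narrower question reuses data Spark already has loaded:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("interactive").getOrCreate()

// Hypothetical data set of daily stock prices
val prices = spark.read.parquet("hdfs:///data/stock_prices")
prices.createOrReplaceTempView("prices")

// First question: average closing price per symbol
spark.sql("SELECT symbol, avg(close) AS avg_close FROM prices GROUP BY symbol").show()

// Refined question: biggest price ranges this year, without reloading anything
spark.sql(
  """SELECT symbol, max(close) - min(close) AS price_range
     FROM prices
     WHERE trade_date >= '2015-01-01'
     GROUP BY symbol
     ORDER BY price_range DESC
     LIMIT 10""").show()
```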

Data integration

Data produced by different systems across a business is rarely clean or consistent enough to be simply and easily combined for reporting or analysis. Spark and Hadoop are increasingly used to reduce the cost and time required to clean and standardise data before loading it into a separate system for analysis, a process often described as extract, transform and load (ETL).
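
A compact sketch of such a pipeline in Scala. The source paths, column names and clean-up rules are all hypothetical; the shape – read raw data, standardise it, write a clean copy for the analytics system – is the common pattern:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("etl").getOrCreate()

// Extract: raw CSV exported from a hypothetical operational system
val raw = spark.read.option("header", "true").csv("hdfs:///landing/orders/*.csv")

// Transform: drop duplicates, normalise a few fields, discard rows with no customer id
val clean = raw
  .dropDuplicates("order_id")
  .withColumn("country", upper(trim(col("country"))))
  .withColumn("amount", col("amount").cast("double"))
  .na.drop(Seq("customer_id"))

// Load: write a columnar copy for the downstream analytics or reporting system
clean.write.mode("overwrite").parquet("hdfs:///warehouse/orders_clean")
```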

What sets Spark apart?

Drawn by the potential of Spark’s interactive querying and machine learning capabilities, a range of vendors have recognised the opportunity to extend their existing big data products.

Well-known companies have invested significant sums in the technology, and a growing number of start-ups are building businesses that depend in whole or in part upon Spark.

In addition, all the major Hadoop vendors have moved to support Spark alongside their existing products, and each vendor is working to add value for its customers. Courses on Spark are also in demand because of its potential to increase the value of a person’s skill set.

> See also: The future is prescriptive

There are many reasons to choose Spark but three are key:

Simplicity

Spark’s capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly and easily with data at scale. These APIs are well documented and structured in a way that makes it straightforward for data scientists and application developers to quickly put Spark to work.
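
For a flavour of that API style, here is a single fluent chain in Scala against a hypothetical sales data set (the path and column names are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("api-style").getOrCreate()
val sales = spark.read.parquet("hdfs:///warehouse/sales")   // hypothetical data set

// One chain expresses filter, aggregate and sort, whatever the size of the data
sales.filter(col("year") === 2015)
  .groupBy("region")
  .agg(sum("revenue").as("total_revenue"))
  .orderBy(desc("total_revenue"))
  .show(5)
```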

Support

Spark supports a range of programming languages and includes native support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond. In addition, the Apache Spark community is large, active and international.

Speed

Spark is ultimately designed for speed, which it achieves by operating both in memory and on disk.
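
One way this shows up in code: a data set can be persisted at a storage level that keeps as much as possible in memory and spills the rest to local disk rather than recomputing it. A minimal sketch, with a hypothetical input path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-example").getOrCreate()
val events = spark.read.parquet("hdfs:///data/events")   // hypothetical data set

// Keep partitions in memory where possible, spilling the remainder to disk
events.persist(StorageLevel.MEMORY_AND_DISK)

println(events.count())   // first action materialises and stores the data
println(events.count())   // later actions reuse the persisted copy
```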

Undoubtedly, Spark has a great deal of potential and will continue to gain momentum as business analysts and data scientists recognise its capabilities.

As with many big data offerings, it is not a one-size-fits-all solution and may not be the best choice for every data processing task.

However, over the coming year we are likely to see even more discussion about its future, along with further examples of its use being brought to market.

Sourced from Tug Grall, technical evangelist, MapR Technologies
