A comparative analysis of two data processing engines, Hadoop and Spark

What is Spark?

Spark is a fast, general-purpose engine for large-scale data processing: an execution engine built to run computations over big data sets quickly.
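
To make that concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the application name and numbers are arbitrary). It just starts a session and runs one small distributed computation.

```python
# Minimal Spark application sketch (PySpark assumed).
from pyspark.sql import SparkSession

# The SparkSession is the entry point to Spark.
spark = SparkSession.builder.appName("hello-spark").getOrCreate()

# Build a small in-memory dataset and run a distributed computation over it.
df = spark.range(1, 1_000_001)                 # DataFrame with a single "id" column
print(df.selectExpr("sum(id)").first()[0])     # the action triggers the computation

spark.stop()
```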

Spark vs. Hadoop

In this section, we will see how Hadoop and Spark differ in terms of speed, storage, and resource management.

What is Hadoop?

Hadoop consists of two core components: HDFS and MapReduce. HDFS provides reliable, scalable storage for big data sets, and MapReduce is a programming model for computing over them.

Spark runs programs up to 100 times faster than Hadoop MapReduce when running in memory, or 10x faster when running on disk. That comparison, however, was drawn by running iterative machine learning models on both engines, so actual performance depends on the use case. Having said that, Spark is definitely the engine to use when working on machine learning models, and with its MLlib library it's an ideal tool for data scientists.
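
As an illustration of the kind of iterative workload behind that benchmark, here is a small, hedged MLlib sketch (toy in-memory data, default parameters; a real pipeline would load and featurize a much larger dataset):

```python
# Hedged MLlib sketch: logistic regression on a toy in-memory dataset.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy data: (label, features). Real pipelines would featurize a large dataset.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.1, 1.2)),
     (1.0, Vectors.dense(2.2, 0.9))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10)   # iterative algorithm: benefits from in-memory data
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()
```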

But Hadoop won't become obsolete with time: the two have different applications, and we can't simply prefer one over the other; the choice largely depends on the use case.

Why is Spark fast?

It has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computation.
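
A hedged sketch of what that means in practice: transformations only build up the DAG, and nothing executes until an action is called (the column names here are illustrative).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

df = spark.range(0, 1000)

# Transformations are lazy: this only builds the execution plan (the DAG).
plan = (df.withColumn("bucket", F.col("id") % 10)
          .groupBy("bucket")
          .count())

plan.explain()   # inspect the plan Spark derived from the DAG
plan.show()      # an action like show()/collect() finally triggers execution

spark.stop()
```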

Storage?

For large data sets, Spark leverages existing distributed file systems such as Hadoop HDFS, cloud storage such as AWS S3, or even big data databases such as Cassandra. Spark can also read data from the local file system, but that's not ideal because the data then has to be available on every node in the cluster.
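
A hedged sketch of pointing Spark at different storage layers; all paths and bucket names below are hypothetical, and the S3/HDFS connectors and credentials must be configured on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

# HDFS (assuming a namenode reachable at this hypothetical address)
events_hdfs = spark.read.parquet("hdfs://namenode:8020/data/events")

# AWS S3 (assuming the hadoop-aws connector and credentials are set up)
events_s3 = spark.read.json("s3a://example-bucket/logs/2021/")

# Local files work too, but every worker node must see the same path.
local_csv = spark.read.csv("file:///tmp/sample.csv", header=True)
```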

MapReduce?

MapReduce consists of three main phases: map, shuffle, and reduce.

MapReduce is a programming model first developed at Google to facilitate distributed computation over large data sets. Spark uses the same MapReduce concepts; its goal was never to replace the MapReduce paradigm, but to replace Hadoop's implementation of it with a faster, more efficient one.
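
For example, the classic MapReduce word count can be expressed with Spark's RDD API; a hedged sketch with a tiny in-memory input:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "hadoop and spark", "spark uses memory"])

counts = (lines.flatMap(lambda line: line.split())   # "map" phase: emit words
               .map(lambda word: (word, 1))          # key/value pairs
               .reduceByKey(lambda a, b: a + b))     # shuffle + "reduce" phase

print(counts.collect())   # e.g. [('spark', 3), ('is', 1), ...]

spark.stop()
```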

One of the major reasons for Spark's speed is that it does this processing in memory.

Resource Management?

With Hadoop, when you want to run a MapReduce job you need a Hadoop cluster to execute it, and that cluster uses YARN for resource management. The Hadoop application negotiates with YARN to get the resources it requires for execution.

On the other hand, Spark comes with a built-in standalone resource manager, so it doesn't require YARN to manage clusters, nodes, or memory.

A Hadoop cluster is not required to run Spark jobs; Spark can run completely independently. We could keep one cluster for HDFS and a separate cluster for executing Spark jobs, but that takes more work to set up and configure. The second downside is that every job would have to copy data from the Hadoop cluster to the Spark cluster, since there is no data locality, and moving data over the network adds latency. It therefore makes sense to run Spark jobs on the same cluster as HDFS. That way, Spark can also leverage Hadoop's YARN to manage resources instead of using its out-of-the-box resource manager.
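
A hedged sketch of how that choice shows up in code: the same application can target YARN, Spark's standalone manager, or local mode just by changing the master URL. The host name below is hypothetical, and real deployments usually pass --master through spark-submit along with the matching cluster configuration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("resource-manager-sketch")
         # .master("yarn")                          # use Hadoop's YARN (requires Hadoop config on the classpath)
         # .master("spark://standalone-host:7077")  # Spark's built-in standalone manager (hypothetical host)
         .master("local[*]")                        # local mode, handy for testing
         .getOrCreate())

print(spark.sparkContext.master)
spark.stop()
```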

Now coming to the point!

Spark enhances the Hadoop stack rather than replacing it. Spark's main responsibility is to speed up computation, so you can convert your Hadoop MapReduce jobs into Spark jobs.

What problems does Spark try to address?

Now let's look at the use cases Spark works best for.

Spark addresses inefficiency in two areas: iterative machine learning and interactive data mining.

What is iterative processing?

It refers to executing a MapReduce job multiple times, or in simple terms, iterating over the same data set multiple times to get the desired output. Graph processing, PageRank, and logistic regression are examples of iterative data mining and machine learning algorithms.

In Hadoop, when we write a MapReduce job, each iteration reads its input from disk and writes temporary data back to disk. Reading the input and writing the intermediate output back to disk is unavoidable because that temporary data has to be stored somewhere, and the more intermediate data you read and write, the slower the execution becomes.

Spark optimizes this by keeping the intermediate data in memory instead of on disk (provided, of course, that the cluster has enough memory). That's how Spark manages to process data faster.
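
A hedged sketch of that idea: cache() keeps the data set in executor memory after the first pass, so later iterations don't go back to disk. The loop itself is just an illustrative computation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()

data = spark.range(0, 1_000_000).withColumn("value", F.rand(seed=42))
data.cache()            # keep the data in executor memory after the first pass

threshold = 0.0
for _ in range(5):      # each iteration reuses the cached data instead of re-reading disk
    threshold = (data.filter(F.col("value") > threshold)
                     .agg(F.avg("value"))
                     .first()[0])

print(threshold)
spark.stop()
```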

What about interactive data mining?

To perform interactive data mining you need a rich set of functions that let you run many different operations on the data, and Spark provides a wide range of built-in functions for this purpose.

With traditional MapReduce, you have to fit your logic into mapper and reducer programs, which can feel restrictive. Spark, on the other hand, provides built-in operators that cover most data transformation tasks.

Secondly, data mining often involves large data sets with many features, and Spark helps here by speeding up the computations so you get results in a short amount of time.
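
A hedged sketch of that interactive style, typically typed into the pyspark shell or a notebook; the column names and data are illustrative, not from a real dataset.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("interactive-sketch").getOrCreate()

bookings = spark.createDataFrame(
    [("hotel", "US", 120.0), ("flight", "US", 300.0),
     ("hotel", "UK", 90.0),  ("hotel", "US", 150.0)],
    ["product", "market", "amount"],
)

# No mapper/reducer boilerplate: filter, group, and aggregate directly.
(bookings.filter(F.col("amount") > 100)
         .groupBy("product", "market")
         .agg(F.count("*").alias("n"), F.avg("amount").alias("avg_amount"))
         .show())
```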

Reference

You may refer to this completely free course, Spark Starter Kit, to learn more and give your Spark journey a head start.

I hope this post taught you something about why and where to use Spark. Please share feedback in the comments section or via claps. I will get into the details of the different Spark components in upcoming posts.
