A comparative analysis of two data processing engines, Hadoop and Spark

What is Spark?

Spark is a fast, general-purpose engine for large-scale data processing: an execution engine built to run computations over big data sets quickly.
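
To make that concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the application name and numbers are arbitrary). It just starts a session and runs one small distributed computation.

```python
# Minimal Spark application sketch (PySpark assumed).
from pyspark.sql import SparkSession

# The SparkSession is the entry point to Spark.
spark = SparkSession.builder.appName("hello-spark").getOrCreate()

# Build a small in-memory dataset and run a distributed computation over it.
df = spark.range(1, 1_000_001)                 # DataFrame with a single "id" column
print(df.selectExpr("sum(id)").first()[0])     # the action triggers the computation

spark.stop()
```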

Spark vs. Hadoop

In this section, we will see how Hadoop and Spark differ in terms of speed, storage, and resource management.

What is Hadoop?

Hadoop consists of two core components: HDFS and MapReduce. HDFS provides reliable, scalable storage for big data sets, and MapReduce is a programming model for computing over them.

Spark runs programs up to 100 times faster than Hadoop MapReduce when running in memory, or 10x faster when running on disk. That comparison, however, was drawn by running iterative machine learning models on both engines, so actual performance depends on the use case. Having said that, Spark is definitely the engine to use when working on machine learning models, and with its MLlib library it's an ideal tool for data scientists.
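
As an illustration of the kind of iterative workload behind that benchmark, here is a small, hedged MLlib sketch (toy in-memory data, default parameters; a real pipeline would load and featurize a much larger dataset):

```python
# Hedged MLlib sketch: logistic regression on a toy in-memory dataset.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy data: (label, features). Real pipelines would featurize a large dataset.
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.1, 1.2)),
     (1.0, Vectors.dense(2.2, 0.9))],
    ["label", "features"],
)

lr = LogisticRegression(maxIter=10)   # iterative algorithm: benefits from in-memory data
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()
```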

But Hadoop won't become obsolete with time: the two have different applications, and we can't simply prefer one over the other; the choice largely depends on the use case.

Why is Spark fast?

It has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computation.
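
A hedged sketch of what that means in practice: transformations only build up the DAG, and nothing executes until an action is called (the column names here are illustrative).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

df = spark.range(0, 1000)

# Transformations are lazy: this only builds the execution plan (the DAG).
plan = (df.withColumn("bucket", F.col("id") % 10)
          .groupBy("bucket")
          .count())

plan.explain()   # inspect the plan Spark derived from the DAG
plan.show()      # an action like show()/collect() finally triggers execution

spark.stop()
```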

Storage?

For large data sets, Spark leverages existing distributed file systems such as Hadoop HDFS, cloud storage such as AWS S3, or even big data databases such as Cassandra. Spark can also read data from the local file system, but that's not ideal because the data then has to be available on every node in the cluster.
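
A hedged sketch of pointing Spark at different storage layers; all paths and bucket names below are hypothetical, and the S3/HDFS connectors and credentials must be configured on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

# HDFS (assuming a namenode reachable at this hypothetical address)
events_hdfs = spark.read.parquet("hdfs://namenode:8020/data/events")

# AWS S3 (assuming the hadoop-aws connector and credentials are set up)
events_s3 = spark.read.json("s3a://example-bucket/logs/2021/")

# Local files work too, but every worker node must see the same path.
local_csv = spark.read.csv("file:///tmp/sample.csv", header=True)
```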

MapReduce?

MapReduce consists of three main phases: map, shuffle, and reduce.

MapReduce is a programming model first developed at Google to facilitate distributed computation over large data sets. Spark uses the same MapReduce concepts; its goal was never to replace the MapReduce paradigm, but to replace Hadoop's implementation of it with a faster, more efficient one.
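
For example, the classic MapReduce word count can be expressed with Spark's RDD API; a hedged sketch with a tiny in-memory input:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "hadoop and spark", "spark uses memory"])

counts = (lines.flatMap(lambda line: line.split())   # "map" phase: emit words
               .map(lambda word: (word, 1))          # key/value pairs
               .reduceByKey(lambda a, b: a + b))     # shuffle + "reduce" phase

print(counts.collect())   # e.g. [('spark', 3), ('is', 1), ...]

spark.stop()
```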

One of the major reasons for Spark's speed is that it does this processing in memory.

Resource Management?

With Hadoop, when you want to run a MapReduce job you need a Hadoop cluster to execute it, and that cluster uses YARN for resource management. The Hadoop application negotiates with YARN to get the resources it requires for execution.

On the other hand, Spark comes with a built-in standalone resource manager, so it doesn't require YARN to manage clusters, nodes, or memory.

A Hadoop cluster is not required to run Spark jobs; Spark can run completely independently. We could keep one cluster for HDFS and a separate cluster for executing Spark jobs, but that takes more work to set up and configure. The second downside is that every job would have to copy data from the Hadoop cluster to the Spark cluster, since there is no data locality, and moving data over the network adds latency. It therefore makes sense to run Spark jobs on the same cluster as HDFS. That way, Spark can also leverage Hadoop's YARN to manage resources instead of using its out-of-the-box resource manager.
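
A hedged sketch of how that choice shows up in code: the same application can target YARN, Spark's standalone manager, or local mode just by changing the master URL. The host name below is hypothetical, and real deployments usually pass --master through spark-submit along with the matching cluster configuration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("resource-manager-sketch")
         # .master("yarn")                          # use Hadoop's YARN (requires Hadoop config on the classpath)
         # .master("spark://standalone-host:7077")  # Spark's built-in standalone manager (hypothetical host)
         .master("local[*]")                        # local mode, handy for testing
         .getOrCreate())

print(spark.sparkContext.master)
spark.stop()
```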

Now coming to the point!

Spark enhances the Hadoop stack rather than replacing it. Spark's main responsibility is to speed up computation, so you can convert your Hadoop MapReduce jobs into Spark jobs.

What problems does Spark try to address?

Now let's look at the use cases Spark works best for.

Spark addresses inefficiency in two areas: iterative machine learning and interactive data mining.

What is iterative processing?

It refers to executing a MapReduce job multiple times, or in simple terms, iterating over the same data set multiple times to get the desired output. Graph processing, PageRank, and logistic regression are examples of iterative data mining and machine learning algorithms.

In Hadoop, when we write a MapReduce job, each iteration reads its input from disk and writes temporary data back to disk. Reading the input and writing the intermediate output back to disk is unavoidable because that temporary data has to be stored somewhere, and the more intermediate data you read and write, the slower the execution becomes.

Spark optimizes this by keeping the intermediate data in memory instead of on disk (provided, of course, that the cluster has enough memory). That's how Spark manages to process data faster.
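
A hedged sketch of that idea: cache() keeps the data set in executor memory after the first pass, so later iterations don't go back to disk. The loop itself is just an illustrative computation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()

data = spark.range(0, 1_000_000).withColumn("value", F.rand(seed=42))
data.cache()            # keep the data in executor memory after the first pass

threshold = 0.0
for _ in range(5):      # each iteration reuses the cached data instead of re-reading disk
    threshold = (data.filter(F.col("value") > threshold)
                     .agg(F.avg("value"))
                     .first()[0])

print(threshold)
spark.stop()
```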

What about interactive data mining?

To perform interactive data mining you need a rich set of functions that let you run many different operations on the data, and Spark provides a wide range of built-in functions for this purpose.

With traditional MapReduce, you have to fit your logic into mapper and reducer programs, which can feel restrictive. Spark, on the other hand, provides built-in operators that cover most data transformation tasks.

Secondly, data mining often involves large data sets with many features, and Spark helps here by speeding up the computations so you get results in a short amount of time.
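
A hedged sketch of that interactive style, typically typed into the pyspark shell or a notebook; the column names and data are illustrative, not from a real dataset.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("interactive-sketch").getOrCreate()

bookings = spark.createDataFrame(
    [("hotel", "US", 120.0), ("flight", "US", 300.0),
     ("hotel", "UK", 90.0),  ("hotel", "US", 150.0)],
    ["product", "market", "amount"],
)

# No mapper/reducer boilerplate: filter, group, and aggregate directly.
(bookings.filter(F.col("amount") > 100)
         .groupBy("product", "market")
         .agg(F.count("*").alias("n"), F.avg("amount").alias("avg_amount"))
         .show())
```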

Reference

You may refer to this completely free course, Spark Starter Kit, to learn more and give your Spark journey a head start.

I hope this post taught you something about why and where to use Spark. Please share feedback in the comments section or via claps. I will get into the details of the different Spark components in upcoming posts.
