Fault tolerance in Apache Spark

We will cover fault-tolerant stream processing with Spark Streaming and Spark RDD fault tolerance. We will also look at the Spark Streaming write-ahead log, driver failure, and worker failure to understand how Apache Spark achieves fault tolerance.


Introduction to Fault Tolerance in Apache Spark

Before diving in, let's understand what a fault is and how Spark handles fault tolerance.

A fault is a failure, so fault tolerance in Apache Spark is the ability to keep operating and to recover lost data after a failure occurs. A fault-tolerant system must be redundant, because recovering lost data requires a redundant copy: the faulty data is reconstructed from the redundant data.

Spark RDD Fault Tolerance

Spark operates on data stored in fault-tolerant file systems such as HDFS or S3, so every RDD generated from fault-tolerant data is itself fault tolerant. This does not hold for streaming/live data received over the network, which is where fault tolerance in Spark is most needed. The basic fault-tolerance semantics of Spark are:

  • Since an Apache Spark RDD is an immutable dataset, each RDD remembers the lineage of deterministic operations that were applied to a fault-tolerant input dataset to create it.
  • If any partition of an RDD is lost due to a worker node failure, that partition can be recomputed from the original fault-tolerant dataset using the lineage of operations.
  • Assuming all RDD transformations are deterministic, the data in the final transformed RDD will always be the same, irrespective of failures in the Spark cluster.
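The lineage idea above can be illustrated without Spark at all: as long as the input dataset survives and every transformation is deterministic, a lost partition can be rebuilt by replaying the transformations. Below is a minimal plain-Python sketch of this recovery model; the names (`fault_tolerant_input`, `lineage`, `compute_partition`) are illustrative, not Spark API:

```python
# Conceptual sketch of lineage-based recovery (not the Spark API).
# A "partition" is recomputed by replaying deterministic transformations
# over the fault-tolerant input, much as Spark rebuilds a lost RDD partition.

fault_tolerant_input = [list(range(0, 5)), list(range(5, 10))]  # e.g. HDFS blocks

# The lineage: an ordered list of deterministic transformations.
lineage = [
    lambda part: [x * 2 for x in part],       # map
    lambda part: [x for x in part if x > 4],  # filter
]

def compute_partition(index):
    """Recompute partition `index` from the original input via the lineage."""
    part = fault_tolerant_input[index]
    for transform in lineage:
        part = transform(part)
    return part

# Normal computation.
partitions = [compute_partition(i) for i in range(2)]

# Simulate losing partition 1 on a failed worker, then recover it.
partitions[1] = None
partitions[1] = compute_partition(1)  # identical result: lineage is deterministic
print(partitions[1])  # → [10, 12, 14, 16, 18]
```

Because the transformations are pure functions of fault-tolerant input, recomputation always yields the same partition, which is exactly why determinism is required in the semantics above.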

To achieve fault tolerance for all generated RDDs, the received data is replicated among multiple Spark executors on worker nodes in the cluster. This leaves two types of data that need to be recovered in the event of a failure:

  • Data received and replicated – the data has been replicated to another node, so it can be retrieved from that replica when a failure occurs.
  • Data received but buffered for replication – the data has not been replicated yet, so the only way to recover it is to fetch it again from the source.
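To close the second gap (received but not yet replicated data), Spark Streaming provides a write-ahead log: received data is saved to a fault-tolerant file system before processing, so it can be replayed after a failure. A sketch of the relevant configuration, assuming a checkpoint directory has been set on a fault-tolerant file system (the HDFS path below is illustrative):

```properties
# Enable the Spark Streaming receiver write-ahead log (Spark 1.2+).
spark.streaming.receiver.writeAheadLog.enable=true
```

In application code this is paired with something like `streamingContext.checkpoint("hdfs://...")`, since the write-ahead log is stored under the checkpoint directory.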

Failures can occur on worker nodes as well as on the driver node.

  • Failure of a worker node – Worker nodes (the slave nodes) run the application code on the Spark cluster. Any worker node running an executor can fail, resulting in the loss of that executor's in-memory data. If any receivers were running on the failed node, their buffered data is lost as well.
  • Failure of the driver node – If the driver node running the Spark Streaming application fails, the SparkContext is lost, and all executors along with their in-memory data are lost with it.

Apache Mesos helps make the Spark master fault tolerant by maintaining backup masters. Mesos is open-source software that sits between the application layer and the operating system, making it easier to deploy and manage applications in large clustered environments. After a failure, executors are relaunched automatically, and Spark Streaming performs parallel recovery by recomputing the RDDs on the input data. Receivers are restarted by the workers when they fail.
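For driver failure specifically, Spark's cluster deploy mode can also restart a failed driver automatically when the application is submitted with the `--supervise` flag. A deployment sketch (the master URL, class name, and jar path are illustrative):

```shell
# Submit in cluster mode with driver supervision: if the driver process
# dies with a non-zero exit code, the cluster manager restarts it.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  my-streaming-app.jar
```

Combined with checkpointing, a supervised driver can rebuild its StreamingContext from the checkpoint directory after restart instead of starting from scratch.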

Fault Tolerance with Receiver-based sources

For input sources based on receivers, fault tolerance depends on both the failure scenario and the type of receiver. There are two types of receivers:

  • Reliable receiver – The source is acknowledged only once the received data has been replicated. If the receiver fails, the source does not receive an acknowledgment for the buffered data, so when the receiver is restarted the source resends that data. Hence, no data is lost due to the failure.
  • Unreliable receiver – The receiver does not send acknowledgments, so data buffered at the time of a worker or driver failure can be lost.
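The reliable-receiver contract can be sketched in plain Python: the source keeps every record until it is acknowledged, and acknowledgment is sent only after replication, so a receiver crash before replication merely causes a resend. All class and method names below are illustrative, not Spark API:

```python
# Conceptual sketch of a reliable receiver protocol (not the Spark API).

class Source:
    """Keeps every record until the receiver acknowledges it."""
    def __init__(self, records):
        self.unacked = list(records)

    def pending(self):
        return list(self.unacked)

    def ack(self, record):
        self.unacked.remove(record)


class ReliableReceiver:
    def __init__(self, source):
        self.source = source
        self.replicated = []  # records safely replicated to other nodes
        self.buffer = []      # received but not yet replicated

    def receive(self, record):
        self.buffer.append(record)

    def replicate_and_ack(self):
        # Only after replication succeeds does the source get an ack.
        for record in self.buffer:
            self.replicated.append(record)
            self.source.ack(record)
        self.buffer = []

    def crash(self):
        # Buffered (unreplicated, unacknowledged) data dies with the receiver.
        self.buffer = []


source = Source(["a", "b", "c"])
receiver = ReliableReceiver(source)

receiver.receive("a")
receiver.replicate_and_ack()      # "a" replicated, then acknowledged

receiver.receive("b")
receiver.crash()                  # "b" lost before replication: no ack was sent

# On restart, the source resends everything still unacknowledged.
for record in source.pending():
    receiver.receive(record)
receiver.replicate_and_ack()

print(sorted(receiver.replicated))  # → ['a', 'b', 'c']: nothing lost
```

An unreliable receiver is the same sketch without the `ack` path: after `crash()`, the source has no way to know "b" was never stored, which is exactly the loss scenario described above.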




Mike Frampton

IT Contractor - Currently Looking For Opportunities

7 yr

You mentioned mesos as a spark cluster manager but what about dcos ?

Piotr Czarnas

Founder @ DQOps open-source Data Quality platform | Detect any data quality issue and watch for new issues with Data Observability

7 yr

That is an interesting concept. We created our own support for a failover #Spark driver in #Querona. #Querona is a Data Virtualization engine and can use Spark or a whole cluster like Hortonworks HDP or Microsoft HDInsight for caching. When we deploy the driver on HDP or HDInsight, we are deploying and starting two drivers on both the head nodes (the active one and the passive). The drivers are sharing state by just using leader election code used by many other Hadoop applications - we are just sharing and monitoring a state in Zookeeper. Querona keeps connections to both drivers but only one driver actually starts a Spark context and executes queries on the cluster. In case of a node failure (for example when Azure was restarting a head node), one driver was just not reporting its state and the second driver was taking the "leader" role. Querona then resubmits the query and it is executed by the newly elected driver instance. So - when we compare it to this idea, we are not storing the state between executions but we just restart the execution. I am wondering how many projects actually have a failover driver architecture because we couldn't find any reference on the web when we were developing our solution.

Ram Chandra( Big Data Consultant )

Big data Solution Architect at Princeton

7 yr

Hi Malini, thank you for your article. Here I could see one thing about the replication factor. As per my knowledge, Spark won't write the data to disk; it will be in-memory. So the RDD itself recreates the data if it is lost. Please correct me if I misunderstood. Thanks and Regards, Ram.

Mohit Sharma (OCAJP, ITIL, AWSSAA, ELK)

Technical Architect, ELK,Bigdata and AWS & Azure Cloud Services at Tata Consultancy Services

7 yr

Worth reading!!


More articles by Malini Shukla
