Spark Internals
Learning through a question-and-answer format has been ingrained in us since childhood, as it promotes active engagement and curiosity about a subject. Despite using Apache Spark regularly, many users do not have a thorough understanding of how Spark works internally. I am going to cover Spark internals in a comprehensive way:
Q). What is Apache Spark?
Ans: Apache Spark is a general-purpose in-memory computing engine. Spark is a plug-and-play computing engine that needs two things to work with (see the sketch after this list):
1. Storage – local storage, HDFS, Amazon S3.
2. Resource Manager – YARN, Mesos, Kubernetes.
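For illustration, here is a minimal PySpark sketch (the master URL, app name, and input path are assumptions, not from this article) showing where those two pluggable pieces appear in code: the master URL selects the resource manager, and the input path selects the storage layer.

from pyspark.sql import SparkSession

# Resource manager: "local[*]" runs on local threads; "yarn" or a "k8s://..."
# master URL would hand the job to YARN or Kubernetes instead.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-internals-demo")   # hypothetical app name
    .getOrCreate()
)

# Storage: the path decides the storage layer; swap "file://" for
# "hdfs://..." or "s3a://..." as needed. The path below is made up.
df = spark.read.text("file:///tmp/input.txt")
print(df.count())

spark.stop()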
[ Note: Spark is a replacement for MapReduce.]
Q). What is the problem with MapReduce?
Ans: Every MapReduce job requires two disk accesses: one to read its input and one to write its output. This is the bottleneck with MapReduce. In an iterative algorithm, every iteration has to read from and write back to HDFS, so the job keeps paying that disk cost.
[ MapReduce works very slowly, and the reason is that a lot of disk I/O is required.]
Q). How does Spark resolve the MapReduce problem?
Ans: Spark reads the input from HDFS and stores the intermediate outputs in memory only. When all processing is done, it writes the final output to HDFS. [Spark solves the MapReduce problem using the “in-memory” concept.]
[ But in the case of MapReduce, intermediate outputs are also stored in HDFS. Because of this, it needs more disk I/O, and more disk I/O operations mean more computation time.]
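As a rough sketch of why this matters, the toy PySpark loop below (the data and the computation are made up) caches its dataset once and then runs several passes over it; each pass reads the cached, in-memory copy instead of going back to HDFS the way a chain of MapReduce jobs would.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

# cache() asks Spark to keep this RDD in memory after it is first computed.
data = sc.parallelize(range(1, 1001)).cache()

total = data.sum()                      # first action materializes and caches the data
for _ in range(5):                      # later passes reuse the in-memory copy,
    total += data.filter(lambda x: x % 2 == 0).count()   # no re-read from storage

print(total)
spark.stop()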
Q). How exactly is the data stored in Spark?
Ans: The basic unit that holds data in Spark is called an “RDD”. RDD stands for ‘Resilient Distributed Dataset’.
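As a small illustrative sketch (running in local mode, with made-up data), an RDD is a collection whose elements are split into partitions that can be processed on different nodes of the cluster:

from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")

# parallelize() distributes a local collection into an RDD with 3 partitions.
rdd = sc.parallelize(["a", "b", "c", "d", "e", "f"], numSlices=3)
print(rdd.getNumPartitions())   # 3 – the data is held as partitions
print(rdd.collect())            # gathers the distributed data back to the driver

sc.stop()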
Meaning of Resilient:
Resilient means that it is resilient to failures. If we lose an RDD (or a partition of it), Spark can recover it by recomputing it from its lineage.
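One way to see this resilience in practice (again a sketch, with an assumed input path): Spark records the lineage of each RDD, i.e. the chain of transformations that produced it, and replays that chain to rebuild a lost partition instead of relying on replicated copies.

from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

base = sc.textFile("file:///tmp/input.txt")     # hypothetical input path
words = base.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# toDebugString() shows the recorded lineage Spark would replay after a failure.
print(counts.toDebugString().decode("utf-8"))

sc.stop()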