Apache Spark

Apache Cluster Runtime Architecture

In an Apache cluster, such as those used for Hadoop or Spark, the runtime architecture includes several key components working in unison:

  1. Master Node(s): These manage and coordinate tasks. For Hadoop, this involves the NameNode for HDFS and the ResourceManager for YARN. For Spark, the Driver Node plays this role.
  2. Worker Nodes: These execute tasks and handle data storage. In Hadoop, DataNodes and NodeManagers perform these functions. In Spark, Executors run tasks and manage storage.
  3. Job Scheduling: Manages task distribution. Hadoop uses YARN for this purpose, while Spark relies on its internal Scheduler.
  4. Data Storage: Distributed systems like HDFS handle large datasets across the cluster.
  5. Communication: Nodes use protocols like RPC for data and task exchanges.
  6. Resource Management: Allocates CPU, memory, and storage across tasks and applications.

These elements collectively enable the efficient processing and management of large-scale data.
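To make these components concrete, here is a minimal PySpark sketch (the application name and numbers are illustrative): the driver process runs the script and builds the job, and the scheduler ships one task per partition to executors on the worker nodes.

```python
from pyspark.sql import SparkSession

# The driver process runs this script: it builds the job and coordinates the work.
spark = SparkSession.builder.appName("ClusterComponentsDemo").getOrCreate()

# The driver defines a distributed dataset split into 8 partitions...
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# ...and the scheduler turns the action below into tasks (one per partition)
# that run on executors hosted in containers on the worker nodes.
total = rdd.map(lambda x: x * 2).sum()
print(total)

spark.stop()
```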

Cluster

A cluster is a collection of interconnected computers or servers that work together as a single system to perform tasks more efficiently. In computing, clusters are used to enhance performance, reliability, and scalability.


https://docs.cloud.sdu.dk/Apps/spark-cluster.html

The container runs the main method of the ApplicationMaster, and there are two possibilities: Spark (Scala/Java) or PySpark.

If it is PySpark, it calls the JVM main method over a Py4J connection: PySpark is a Python wrapper around the Java/Scala Spark core, so the Python code invokes the Java application, which in turn runs the Scala Spark code inside the JVM.

So, in the case of PySpark, we have two drivers: the PySpark driver and the JVM driver.
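One way to see the two drivers in action is to reach through PySpark's Py4J gateway into the JVM. The `_jvm` attribute used below is an internal, non-public handle; it is shown here only to illustrate that the Python driver is talking to a separate JVM driver process.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Py4jBridgeDemo").getOrCreate()

# The PySpark driver communicates with the JVM driver over a Py4J gateway.
# `_jvm` is an internal attribute (not a stable public API); the call below
# executes inside the JVM driver, not in the Python process.
jvm_time = spark.sparkContext._jvm.java.lang.System.currentTimeMillis()
print("Current time reported by the JVM driver:", jvm_time)

spark.stop()
```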

In the case of a Spark (Scala/Java) application, there is only the JVM driver.

After the JVM driver starts, it goes to the YARN ResourceManager and asks for more containers on the worker nodes. The driver then launches a Spark executor in each of these containers, assigns work to the executor JVMs, and monitors them.
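The number and size of the executor containers the driver requests from YARN are set through configuration. A minimal sketch, with illustrative (not recommended) values:

```python
from pyspark.sql import SparkSession

# Running against YARN: the driver contacts the ResourceManager and requests
# executor containers according to the settings below.
spark = (
    SparkSession.builder
    .appName("ExecutorAllocationDemo")
    .master("yarn")
    .config("spark.executor.instances", "4")   # number of executor containers to request
    .config("spark.executor.memory", "2g")     # memory per executor container
    .config("spark.executor.cores", "2")       # CPU cores per executor
    .getOrCreate()
)
```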


In Apache Spark, there are two primary modes for running applications: client mode and cluster mode. Each mode determines how the application is deployed and managed within the cluster. Here's a concise overview:

Client Mode

  • Execution: The Spark driver runs on the machine where you submit the job (client machine), not on the cluster nodes.
  • Use Case: Ideal for interactive applications and development, where you need to monitor the job in real time.
  • Pros: Easier to debug and test. Since the driver runs on your local machine, you can monitor the job's progress in real time, which is particularly useful in interactive sessions where immediate feedback is needed.
  • Cons: The client machine needs a stable connection to the cluster for the life of the job and must have sufficient resources to run the driver.

Cluster Mode

  • Execution: The Spark driver runs on one of the cluster nodes. You submit the job from the client, but the driver runs inside the cluster.
  • Use Case: Suitable for production workloads and long-running jobs. The cluster handles the driver's execution, making it less dependent on the client machine.
  • Pros: More robust for production environments. Once the job is submitted, the client can disconnect and the cluster continues to run the job independently. This is beneficial for long-running jobs: it does not tie up client resources, require a constant connection, or risk job failure due to client-side issues.
  • Cons: Harder to debug and monitor the job since the driver runs inside the cluster.

Choosing between client and cluster mode depends on your specific needs for debugging, monitoring, and job execution.
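The mode is chosen when the application is submitted. Below is a minimal sketch; the file name and cluster details are hypothetical, and the two spark-submit invocations are shown as comments so the same script can be run in either mode.

```python
# deploy_mode_demo.py
#
# Client mode — the driver runs on the submitting machine:
#   spark-submit --master yarn --deploy-mode client deploy_mode_demo.py
#
# Cluster mode — the driver runs inside a container on a cluster node:
#   spark-submit --master yarn --deploy-mode cluster deploy_mode_demo.py

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeployModeDemo").getOrCreate()

# Where this prints depends on the mode: on your terminal in client mode,
# in the driver container's logs in cluster mode.
print("Deploy mode:", spark.conf.get("spark.submit.deployMode", "unknown"))

spark.stop()
```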


Why Monitoring is Different When the Driver is in the Cluster:

  1. Indirect Access to Logs: In cluster mode, the driver runs on a cluster node, so logs and other outputs are stored on that node. While these logs can be accessed through Spark's web UI or by logging into the cluster nodes, it requires extra steps compared to having everything available locally.
  2. Limited Use of Local Tools: You can't directly use local debugging tools on the cluster node where the driver is running. This limitation can make it harder to perform in-depth debugging or to use specific debugging features like breakpoints.
  3. Less Immediate Feedback: In cluster mode, there's typically a delay between submitting a job and getting feedback, as the job might be queued or have to wait for resources. This delay makes it less conducive to iterative development and real-time monitoring.

While cluster mode is more suited for production environments and can handle larger and longer-running jobs, the immediacy and ease of access provided by client mode make it better for debugging and developing applications.
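One way to soften the monitoring gap in cluster mode is to persist the driver's event log so finished jobs can be inspected through the Spark web UI / History Server, and to pull driver logs back with the YARN CLI. A sketch, with an assumed HDFS path:

```python
from pyspark.sql import SparkSession

# Persist the event log so a cluster-mode job can be reviewed later in the
# Spark History Server instead of logging into the node that ran the driver.
spark = (
    SparkSession.builder
    .appName("ClusterModeMonitoringDemo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")  # assumed HDFS path
    .getOrCreate()
)

# Driver/executor logs for a finished YARN application can also be fetched with:
#   yarn logs -applicationId <application_id>
```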


Note: A node is not an executor but a machine on which Spark runs the driver and/or one or more executors. The driver and an executor may also run on the same worker node, if need be.
