Apache Spark

Apache Cluster Runtime Architecture

In an Apache cluster, such as those used for Hadoop or Spark, the runtime architecture includes several key components working in unison:

  1. Master Node(s): These manage and coordinate tasks. For Hadoop, this involves the NameNode for HDFS and the ResourceManager for YARN. For Spark, the Driver Node plays this role.
  2. Worker Nodes: These execute tasks and handle data storage. In Hadoop, DataNodes and NodeManagers perform these functions. In Spark, Executors run tasks and manage storage.
  3. Job Scheduling: Manages task distribution. Hadoop uses YARN for this purpose, while Spark relies on its internal Scheduler.
  4. Data Storage: Distributed systems like HDFS handle large datasets across the cluster.
  5. Communication: Nodes use protocols like RPC for data and task exchanges.
  6. Resource Management: Allocates CPU, memory, and storage across tasks and applications.

These elements collectively enable the efficient processing and management of large-scale data.
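To make these components concrete, here is a minimal PySpark sketch (the application name and numbers are illustrative): the driver process runs the script and builds the job, and the scheduler ships one task per partition to executors on the worker nodes.

```python
from pyspark.sql import SparkSession

# The driver process runs this script: it builds the job and coordinates the work.
spark = SparkSession.builder.appName("ClusterComponentsDemo").getOrCreate()

# The driver defines a distributed dataset split into 8 partitions...
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# ...and the scheduler turns the action below into tasks (one per partition)
# that run on executors hosted in containers on the worker nodes.
total = rdd.map(lambda x: x * 2).sum()
print(total)

spark.stop()
```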

Cluster

A cluster is a collection of interconnected computers or servers that work together as a single system to perform tasks more efficiently. In computing, clusters are used to enhance performance, reliability, and scalability.


https://docs.cloud.sdu.dk/Apps/spark-cluster.html

The container runs the main method of the ApplicationMaster, and there are two possibilities: Spark (Scala/Java) or PySpark.

If it is PySpark, it calls the JVM main method over a Py4J connection: PySpark is a Python wrapper around the Java/Scala Spark core, so the Python code invokes the Java application, which in turn runs the Scala Spark code inside the JVM.

So, in the case of PySpark, we have two drivers: the PySpark driver and the JVM driver.
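One way to see the two drivers in action is to reach through PySpark's Py4J gateway into the JVM. The `_jvm` attribute used below is an internal, non-public handle; it is shown here only to illustrate that the Python driver is talking to a separate JVM driver process.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Py4jBridgeDemo").getOrCreate()

# The PySpark driver communicates with the JVM driver over a Py4J gateway.
# `_jvm` is an internal attribute (not a stable public API); the call below
# executes inside the JVM driver, not in the Python process.
jvm_time = spark.sparkContext._jvm.java.lang.System.currentTimeMillis()
print("Current time reported by the JVM driver:", jvm_time)

spark.stop()
```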

In the case of a Spark (Scala/Java) application, there is only the JVM driver.

After the JVM driver starts, it goes to the YARN ResourceManager and asks for more containers on the worker nodes. The driver then launches a Spark executor in each of these containers, assigns work to the executor JVMs, and monitors them.
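The number and size of the executor containers the driver requests from YARN are set through configuration. A minimal sketch, with illustrative (not recommended) values:

```python
from pyspark.sql import SparkSession

# Running against YARN: the driver contacts the ResourceManager and requests
# executor containers according to the settings below.
spark = (
    SparkSession.builder
    .appName("ExecutorAllocationDemo")
    .master("yarn")
    .config("spark.executor.instances", "4")   # number of executor containers to request
    .config("spark.executor.memory", "2g")     # memory per executor container
    .config("spark.executor.cores", "2")       # CPU cores per executor
    .getOrCreate()
)
```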


In Apache Spark, there are two primary modes for running applications: client mode and cluster mode. Each mode determines how the application is deployed and managed within the cluster. Here's a concise overview:

Client Mode

  • Execution: The Spark driver runs on the machine where you submit the job (client machine), not on the cluster nodes.
  • Use Case: Ideal for interactive applications and development, where you need to monitor the job in real time.
  • Pros: Easier to debug and test. Since the driver runs on your local machine, you can monitor the job's progress in real time, which is particularly useful in interactive sessions where immediate feedback is needed.
  • Cons: The client machine needs a stable connection to the cluster for the life of the job and must have sufficient resources to run the driver.

Cluster Mode

  • Execution: The Spark driver runs on one of the cluster nodes. You submit the job from the client, but the driver runs inside the cluster.
  • Use Case: Suitable for production workloads and long-running jobs. The cluster handles the driver's execution, making it less dependent on the client machine.
  • Pros: More robust for production environments. Once the job is submitted, the client can disconnect and the cluster continues to run the job independently. This is beneficial for long-running jobs: it does not tie up client resources, require a constant connection, or risk job failure due to client-side issues.
  • Cons: Harder to debug and monitor the job since the driver runs inside the cluster.

Choosing between client and cluster mode depends on your specific needs for debugging, monitoring, and job execution.
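The mode is chosen when the application is submitted. Below is a minimal sketch; the file name and cluster details are hypothetical, and the two spark-submit invocations are shown as comments so the same script can be run in either mode.

```python
# deploy_mode_demo.py
#
# Client mode — the driver runs on the submitting machine:
#   spark-submit --master yarn --deploy-mode client deploy_mode_demo.py
#
# Cluster mode — the driver runs inside a container on a cluster node:
#   spark-submit --master yarn --deploy-mode cluster deploy_mode_demo.py

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeployModeDemo").getOrCreate()

# Where this prints depends on the mode: on your terminal in client mode,
# in the driver container's logs in cluster mode.
print("Deploy mode:", spark.conf.get("spark.submit.deployMode", "unknown"))

spark.stop()
```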


Why Monitoring is Different When the Driver is in the Cluster:

  1. Indirect Access to Logs: In cluster mode, the driver runs on a cluster node, so logs and other outputs are stored on that node. While these logs can be accessed through Spark's web UI or by logging into the cluster nodes, it requires extra steps compared to having everything available locally.
  2. Limited Use of Local Tools: You can't directly use local debugging tools on the cluster node where the driver is running. This limitation can make it harder to perform in-depth debugging or to use specific debugging features like breakpoints.
  3. Less Immediate Feedback: In cluster mode, there's typically a delay between submitting a job and getting feedback, as the job might be queued or have to wait for resources. This delay makes it less conducive to iterative development and real-time monitoring.

While cluster mode is more suited for production environments and can handle larger and longer-running jobs, the immediacy and ease of access provided by client mode make it better for debugging and developing applications.
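One way to soften the monitoring gap in cluster mode is to persist the driver's event log so finished jobs can be inspected through the Spark web UI / History Server, and to pull driver logs back with the YARN CLI. A sketch, with an assumed HDFS path:

```python
from pyspark.sql import SparkSession

# Persist the event log so a cluster-mode job can be reviewed later in the
# Spark History Server instead of logging into the node that ran the driver.
spark = (
    SparkSession.builder
    .appName("ClusterModeMonitoringDemo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")  # assumed HDFS path
    .getOrCreate()
)

# Driver/executor logs for a finished YARN application can also be fetched with:
#   yarn logs -applicationId <application_id>
```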


Note: A node is not an executor but a machine on which Spark runs the driver and/or one or more executors. The driver and an executor may also run on the same worker node, if need be.
