SPARK

What is Spark?

Apache Spark is an open-source framework that processes large volumes of unstructured, semi-structured, and structured data for analytics. Its architecture is regarded as an alternative to the Hadoop and MapReduce architectures for big data processing. Spark’s data storage and processing frameworks, the RDD and the DAG, are used to store and process data, respectively. The Spark architecture consists of four components: the Spark driver, the executors, the cluster manager, and the worker nodes. It uses Datasets and DataFrames as the fundamental data storage mechanism to optimise the Spark process and big data computation.

Apache Spark Features

Apache Spark, a popular cluster computing framework, was created to accelerate data processing applications. It is an open-source framework that enables applications to run faster by utilising in-memory cluster computing. A cluster is a collection of nodes that communicate with each other and share data. Because of its implicit data parallelism and fault tolerance, Spark can be applied to a wide range of sequential and interactive processing demands.

  • Speed: Spark performs up to 100 times faster than MapReduce for processing large amounts of data. It is also able to divide the data into chunks in a controlled way.
  • Powerful Caching: A simple programming layer offers powerful caching and disk persistence capabilities (see the caching sketch after this list).
  • Deployment: It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
  • Real-Time: Because of its in-memory processing, it offers real-time computation and low latency.
  • Polyglot: Spark supports four languages: Java, Scala, Python, and R. You can write Spark code in any of them, and Spark also provides command-line shells in Scala and Python.
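
As a minimal illustration of the caching feature, here is a sketch in Scala (assuming a local Spark installation; logs.txt and the ERROR filter are placeholders). The first action materialises the filtered dataset and keeps it in memory, so the second action avoids re-reading from disk:

    import org.apache.spark.sql.SparkSession

    object CachingExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CachingExample")
          .master("local[*]")            // assumption: local run for illustration
          .getOrCreate()

        val lines  = spark.read.textFile("logs.txt")            // placeholder input
        val errors = lines.filter(_.contains("ERROR")).cache()  // keep in memory

        println(errors.count())                                 // first action populates the cache
        println(errors.filter(_.contains("timeout")).count())   // reuses cached data

        spark.stop()
      }
    }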

Two Main Abstractions of Apache Spark

The Apache Spark architecture consists of two main abstraction layers:

Resilient Distributed Datasets (RDD):

The RDD is Spark’s key tool for data computation and its fundamental data structure. It acts as an interface for immutable, partitioned data and lets Spark recompute that data in the event of a failure. There are two kinds of operations on RDDs: transformations and actions.
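
A brief sketch of the two kinds of operations, as a standalone local Scala application (names and values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("RddExample").setMaster("local[*]"))

        val numbers = sc.parallelize(1 to 10)    // an immutable, distributed dataset

        // Transformations are lazy: they only describe new RDDs.
        val squares = numbers.map(n => n * n)
        val evens   = squares.filter(_ % 2 == 0)

        // Actions trigger the actual computation.
        println(evens.collect().mkString(", "))  // 4, 16, 36, 64, 100
        println(evens.reduce(_ + _))             // 220

        sc.stop()
      }
    }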

Directed Acyclic Graph (DAG):

For each job, the driver converts the program into a DAG: a sequence of connections between nodes, with no cycles. On top of this core, the Apache Spark ecosystem includes components such as the core API, Spark SQL, Streaming and real-time processing, MLlib, and GraphX. Using the Spark shell, you can read large volumes of data interactively, and through the SparkContext you can run a job, inspect its tasks (the units of work), or cancel it.
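
You can inspect the DAG the driver builds by printing an RDD’s lineage. A sketch for the Scala spark-shell, where sc is predefined (input.txt is a placeholder file):

    val words  = sc.textFile("input.txt").flatMap(_.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)

    // toDebugString prints the lineage the driver turns into a DAG;
    // the indentation marks a new stage at the shuffle (reduceByKey).
    println(counts.toDebugString)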

Spark Architecture

The Apache Spark base architecture diagram is provided in the following figure:

When the Driver Program in the Apache Spark architecture executes, it calls the actual program of the application and creates a SparkContext, which contains all of the basic functions. The Spark Driver also includes several other components, such as the DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, which together translate user-written code into jobs that are actually executed on the cluster.
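
A minimal driver program might create its context as follows. This is a sketch: in current Spark versions a SparkSession wraps the SparkContext, and local[*] is an assumption for a local run rather than a real cluster:

    import org.apache.spark.sql.SparkSession

    object MyDriverProgram {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("MyDriverProgram")
          .master("local[*]")           // assumption: local run
          .getOrCreate()

        val sc = spark.sparkContext     // the SparkContext with the basic functions
        println(s"Application id: ${sc.applicationId}")

        spark.stop()
      }
    }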

The Cluster Manager manages the execution of the various jobs in the cluster, and the Spark Driver works in conjunction with it to control that execution. The Cluster Manager allocates resources for each job. Once a job has been broken down into smaller tasks, which are distributed to the worker nodes, the Spark Driver controls their execution.

Many worker nodes can be used to process an RDD created in the SparkContext, and the results can also be cached.

The Spark Context receives task information from the Cluster Manager and enqueues it on worker nodes.

The executors are in charge of carrying out these tasks, and their lifespan is the same as that of the Spark application. If we want to improve the performance of the system, we can increase the number of workers, so that a job can be divided into more parallel parts.
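
For instance, the number of executors and the task slots per executor can be raised through standard Spark configuration properties. A hedged sketch with illustrative values (spark.executor.instances applies on cluster managers such as YARN):

    import org.apache.spark.SparkConf

    // Illustrative values only; tune to the cluster's actual capacity.
    val conf = new SparkConf()
      .setAppName("ScaledApp")
      .set("spark.executor.instances", "8")  // more executors
      .set("spark.executor.cores", "4")      // more task slots per executor
      .set("spark.executor.memory", "4g")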

Spark Architecture Applications

A high-level view of the architecture of the Apache Spark application is as follows:

The Spark driver

In a driver process, the master node coordinates the workers and oversees the tasks. The application is split into jobs that are scheduled to be executed on executors in the cluster. The driver program calls the main application and creates a SparkContext (which acts as a gateway) that connects to the Spark cluster and monitors the jobs running on it. Everything is executed through the SparkContext.

Each Spark session has the SparkContext as its entry point. The Spark driver includes further components for executing jobs in the cluster and works with the cluster manager. Because Spark clusters can connect to different types of cluster managers, the context acquires worker nodes to execute tasks and store data. When a job is executed in the cluster, it is divided into stages, and the stages are divided into scheduled tasks.
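
A spark-shell sketch of this division (sc is predefined; the numbers are illustrative). The job below runs in two stages, split at the shuffle, and each stage runs one task per partition:

    val data  = sc.parallelize(1 to 1000, 4)  // 4 partitions -> 4 tasks per stage
    val pairs = data.map(n => (n % 10, n))    // narrow transformation: stays in stage 1
    val sums  = pairs.reduceByKey(_ + _)      // shuffle boundary: starts stage 2
    sums.collect()                            // the action submits the job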

The Spark executors

An executor is responsible for executing tasks and storing data in a cache. Executors register with the driver program at startup, and each has a number of slots for running tasks concurrently. The lifespan of executors ordinarily matches that of the Spark application, although they can also be allocated dynamically, being added as data is loaded and tasks run and removed when idle. Users’ tasks are executed inside the executor’s Java process, and the driver program monitors the executors as they work.
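
Dynamic allocation is switched on through configuration. A sketch with illustrative values (it generally assumes a shuffle service or an equivalent mechanism is available on the cluster):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("DynamicAllocationSketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "10")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  // remove idle executors
      .getOrCreate()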

Cluster Manager

The cluster manager allocates resources for Spark applications and controls the execution of jobs in the cluster. The driver works with it to acquire executors on the worker nodes. As noted under Deployment above, the cluster manager can be Mesos, Hadoop via YARN, or Spark’s own standalone manager.
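
In code, the choice of cluster manager is expressed through the master URL, while the application logic stays the same. A sketch with placeholder host names:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ChooseManager")
      .master("spark://master-host:7077")   // Spark's own standalone manager
      // .master("yarn")                    // Hadoop via YARN
      // .master("mesos://mesos-host:5050") // Apache Mesos
      // .master("local[*]")                // in-process, for development
      .getOrCreate()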

Worker Nodes

The worker (slave) nodes host the executors, processing tasks and returning the results to the SparkContext. The master node issues the tasks, and the worker nodes execute them. Boosting the number of worker nodes (from 1 to n) makes the process simpler: a job is divided into sub-jobs that run in parallel on multiple machines. In Spark, a partition is the unit of work: each task processes one partition, and each partition is assigned to one executor.
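
A spark-shell sketch of partitions as the unit of work (sc is predefined; big-input.txt is a placeholder):

    val rdd = sc.textFile("big-input.txt")  // placeholder input file
    println(rdd.getNumPartitions)           // one task per partition

    // Repartitioning changes how many parallel tasks the next stage gets.
    val wider = rdd.repartition(16)
    println(wider.getNumPartitions)         // 16 partitions -> 16 tasks per stage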


#HUQUO #SPARK
