Apache Spark - Data Engineering

An overview of how data and compute resources are distributed across a cluster, and how Spark optimizes that processing.

Spark is an alternative approach to the big data ecosystem. It builds on Hadoop's MapReduce concept and is designed to process large datasets in a distributed environment.

Rather than the traditional master-slave terminology, Spark's architecture is usually described as leader-member: a driver process coordinates the work, and executor processes on the worker nodes carry it out.

This distributed design brings us to Spark's core abstraction: the Resilient Distributed Dataset (RDD).

Resilient Distributed Datasets (RDDs): This is the basic unit that holds data in Apache Spark for processing and transformation. An RDD is an immutable (read-only) collection of data elements that can be operated on by many compute nodes simultaneously (Spark's parallel processing).

Each RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
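A minimal PySpark sketch of this idea, assuming a local Spark installation (the app name, data, and partition count are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitions-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a Python range, split into 4 logical partitions
numbers = sc.parallelize(range(1, 101), numSlices=4)

print(numbers.getNumPartitions())  # 4 -- each partition can be computed on a different node
print(numbers.sum())               # 5050 -- computed in parallel across the partitions

spark.stop()
```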

High-level Spark Architecture

Key Characteristics:

Horizontally scalable: Capacity can be increased by adding more worker nodes (compute resources) to the cluster.

Levels of Operation in Spark:

  1. Higher Level: Built on top of RDDs and recommended for most use cases. This generally means working with DataFrames and Spark SQL.
  2. Lower Level: The Spark Core (RDD) API, also based on RDDs. Code can be written in Python, Scala, Java, or R, and it is more complex because you handle the parallelism and the distribution of large data across the Spark architecture yourself (see the sketch after this list).
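A hedged sketch of the same word count at both levels (the app name, column name, and sample data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-levels-demo").getOrCreate()

# Higher level: DataFrame / Spark SQL -- declarative, optimized by the Catalyst engine
df = spark.createDataFrame([("spark",), ("hadoop",), ("spark",)], ["word"])
df.groupBy("word").count().show()

# Lower level: Spark Core (RDD) API -- you spell out the parallel operations yourself
rdd = spark.sparkContext.parallelize(["spark", "hadoop", "spark"])
print(rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect())

spark.stop()
```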

Other Higher-Level APIs Built on Spark Core:

  • Structured Streaming API (see the sketch after this list)
  • MLlib
  • GraphX
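As a small taste of one of these APIs, here is a hedged Structured Streaming sketch using the built-in "rate" test source (the app name and options are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in 'rate' source generates timestamped rows, handy for local experiments
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (stream.writeStream
               .format("console")    # print each micro-batch to the console
               .outputMode("append")
               .start())

query.awaitTermination(10)           # let it run for roughly ten seconds
query.stop()
spark.stop()
```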

Operations on Spark RDD:

There are two major types of operations performed on Spark RDDs:

  1. Transformation: Operations that build a new RDD from an existing one (for example, map or filter).
  2. Action: Operations that trigger execution and return the final expected output to the driver (for example, collect or count), as in the sketch below.
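A short hedged illustration of the difference, using common RDD operations (the values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

evens = nums.filter(lambda n: n % 2 == 0)   # transformation -> a new RDD, nothing runs yet
doubled = evens.map(lambda n: n * 2)        # transformation -> another new RDD, still nothing runs

print(doubled.collect())   # action -> triggers execution and returns [4, 8]
print(doubled.count())     # action -> 2

spark.stop()
```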

A new RDD is created for each transformation, and Spark keeps track of this lineage with a Directed Acyclic Graph (DAG). Transformations are lazy: they are not executed immediately. Instead, Spark records them and runs them only when an action is called, which gives it the chance to optimize the plan and minimize data shuffling and processing overhead.

For example, applying a heavy transformation to every record and only then filtering most of them out is not an optimized approach. Because evaluation is lazy, Spark can plan the job so the filter runs first and the heavy transformation touches only the records that survive, optimizing the data processing flow.
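A hedged sketch of lazy evaluation with the DataFrame API, where the Catalyst optimizer can reorder the plan (the app name, column names, and expensive expression are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)   # a single-column DataFrame of ids

# Both lines below only build up the logical plan -- nothing is computed yet
heavy = df.withColumn("expensive", F.sha2(F.col("id").cast("string"), 256))
small = heavy.filter(F.col("id") < 10)

small.explain()   # the optimized plan can evaluate the filter before the heavy expression
small.show()      # the action: only now does Spark execute the (optimized) plan

spark.stop()
```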

Life Cycle of RDD

Types of Transformations:

  1. Narrow Transformation: Data is processed within its own partition, without shuffling (for example, map or filter).
  2. Wide Transformation: A costly operation where data is shuffled across partitions before it is processed (for example, groupByKey or reduceByKey); see the sketch below.
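A hedged sketch contrasting the two on a small key-value RDD (the data and app name are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=2)

# Narrow: each output partition depends on a single input partition -- no shuffle
narrow = pairs.mapValues(lambda v: v * 10)

# Wide: records with the same key must be brought together -- data is shuffled across partitions
wide = pairs.reduceByKey(lambda a, b: a + b)

print(narrow.collect())   # [('a', 10), ('b', 20), ('a', 30), ('b', 40)]
print(wide.collect())     # [('a', 4), ('b', 6)] -- order may vary

spark.stop()
```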

We’ll discuss more about various transformations and their differences in the next blog. Stay tuned!

About me -

Hello! I’m Shoukath Ali, an aspiring data professional with a Master’s in Data Science and a Bachelor’s in Computer Science and Engineering.

If you have any queries or suggestions, please feel free to reach out to me at [email protected]

Connect with me on LinkedIn: www.dhirubhai.net/in/shoukath-ali-b6650576/

Disclaimer -

The views and opinions expressed on this blog are purely my own. Any product claim, statistic, quote, or other representation about a product or service should be verified with the manufacturer, provider, or party in question.
