Apache Spark - Data Engineering

An overview of how data and compute resources are distributed across a cluster, and how Spark optimizes that processing.

Spark is an alternative approach to the big data ecosystem. It builds on Hadoop's MapReduce concept and is designed to process large datasets in a distributed environment.

Rather than the traditional master-slave terminology, Spark's architecture is usually described as leader-member: a driver process coordinates the work, and executor processes on the worker nodes carry it out.

This distributed design brings us to Spark's core abstraction: the Resilient Distributed Dataset (RDD).

Resilient Distributed Datasets (RDDs): This is the basic unit that holds data in Apache Spark for processing and transformation. An RDD is an immutable (read-only) collection of data elements that can be operated on by many compute nodes simultaneously (Spark's parallel processing).

Each RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
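A minimal PySpark sketch of this idea, assuming a local Spark installation (the app name, data, and partition count are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-partitions-demo").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a Python range, split into 4 logical partitions
numbers = sc.parallelize(range(1, 101), numSlices=4)

print(numbers.getNumPartitions())  # 4 -- each partition can be computed on a different node
print(numbers.sum())               # 5050 -- computed in parallel across the partitions

spark.stop()
```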

High-level Spark Architecture

Key Characteristics:

Horizontally scalable: Capacity can be increased by adding more worker nodes (compute resources) to the cluster.

Levels of Operation in Spark:

  1. Higher Level: Built on top of RDDs and recommended for most use cases. This generally means working with DataFrames and Spark SQL.
  2. Lower Level: The Spark Core (RDD) API, also based on RDDs. Code can be written in Python, Scala, Java, or R, and it is more complex because you handle the parallelism and the distribution of large data across the Spark architecture yourself (see the sketch after this list).
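A hedged sketch of the same word count at both levels (the app name, column name, and sample data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-levels-demo").getOrCreate()

# Higher level: DataFrame / Spark SQL -- declarative, optimized by the Catalyst engine
df = spark.createDataFrame([("spark",), ("hadoop",), ("spark",)], ["word"])
df.groupBy("word").count().show()

# Lower level: Spark Core (RDD) API -- you spell out the parallel operations yourself
rdd = spark.sparkContext.parallelize(["spark", "hadoop", "spark"])
print(rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect())

spark.stop()
```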

Other Higher-Level APIs Built on Spark Core:

  • Structured Streaming API (see the sketch after this list)
  • MLlib
  • GraphX
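As a small taste of one of these APIs, here is a hedged Structured Streaming sketch using the built-in "rate" test source (the app name and options are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in 'rate' source generates timestamped rows, handy for local experiments
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (stream.writeStream
               .format("console")    # print each micro-batch to the console
               .outputMode("append")
               .start())

query.awaitTermination(10)           # let it run for roughly ten seconds
query.stop()
spark.stop()
```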

Operations on Spark RDD:

There are two major types of operations performed on Spark RDDs:

  1. Transformation: Operations that build a new RDD from an existing one (for example, map or filter).
  2. Action: Operations that trigger execution and return the final expected output to the driver (for example, collect or count), as in the sketch below.
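A short hedged illustration of the difference, using common RDD operations (the values are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

evens = nums.filter(lambda n: n % 2 == 0)   # transformation -> a new RDD, nothing runs yet
doubled = evens.map(lambda n: n * 2)        # transformation -> another new RDD, still nothing runs

print(doubled.collect())   # action -> triggers execution and returns [4, 8]
print(doubled.count())     # action -> 2

spark.stop()
```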

A new RDD is created for each transformation, and Spark keeps track of this lineage with a Directed Acyclic Graph (DAG). Transformations are lazy: they are not executed immediately. Instead, Spark records them and runs them only when an action is called, which gives it the chance to optimize the plan and minimize data shuffling and processing overhead.

For example, applying a heavy transformation to every record and only then filtering most of them out is not an optimized approach. Because evaluation is lazy, Spark can plan the job so the filter runs first and the heavy transformation touches only the records that survive, optimizing the data processing flow.
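A hedged sketch of lazy evaluation with the DataFrame API, where the Catalyst optimizer can reorder the plan (the app name, column names, and expensive expression are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(1_000_000)   # a single-column DataFrame of ids

# Both lines below only build up the logical plan -- nothing is computed yet
heavy = df.withColumn("expensive", F.sha2(F.col("id").cast("string"), 256))
small = heavy.filter(F.col("id") < 10)

small.explain()   # the optimized plan can evaluate the filter before the heavy expression
small.show()      # the action: only now does Spark execute the (optimized) plan

spark.stop()
```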

Life Cycle of RDD

Types of Transformations:

  1. Narrow Transformation: Data is processed within its own partition, without shuffling (for example, map or filter).
  2. Wide Transformation: A costly operation where data is shuffled across partitions before it is processed (for example, groupByKey or reduceByKey); see the sketch below.
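A hedged sketch contrasting the two on a small key-value RDD (the data and app name are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=2)

# Narrow: each output partition depends on a single input partition -- no shuffle
narrow = pairs.mapValues(lambda v: v * 10)

# Wide: records with the same key must be brought together -- data is shuffled across partitions
wide = pairs.reduceByKey(lambda a, b: a + b)

print(narrow.collect())   # [('a', 10), ('b', 20), ('a', 30), ('b', 40)]
print(wide.collect())     # [('a', 4), ('b', 6)] -- order may vary

spark.stop()
```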

We’ll discuss more about various transformations and their differences in the next blog. Stay tuned!

About me -

Hello! I’m Shoukath Ali, an aspiring data professional with a Master’s in Data Science and a Bachelor’s in Computer Science and Engineering.

If you have any queries or suggestions, please feel free to reach out to me at [email protected]

Connect with me on LinkedIn: www.dhirubhai.net/in/shoukath-ali-b6650576/

Disclaimer -

The views and opinions expressed on this blog are purely my own. Any product claim, statistic, quote, or other representation about a product or service should be verified with the manufacturer, provider, or party in question.
