Apache Spark - Data Engineering
Shoukath Ali Shaik
MSc in Data Science @ Indiana University Bloomington | 2x Microsoft Azure Certified | Aspiring Data Engineer | Data Scientist | Big Data Developer | Software Engineer | PySpark | GenAI | MLOps
Overview of how data and compute resources are distributed across clusters, with optimization.
Spark is a big data processing engine that builds on the ideas of Hadoop’s MapReduce and is designed to handle large datasets in a distributed environment.
Spark uses a leader-member architecture (the same pattern traditionally called master-slave): a driver program coordinates the work, and executors on the worker nodes carry it out.
To distribute data across the cluster, Spark introduces the concept of Resilient Distributed Datasets (RDDs).
Resilient Distributed Datasets (RDDs): This is the basic unit that holds data in Apache Spark for processing and transformation. An RDD is an immutable (read-only) collection of data elements that can be operated on by many compute nodes simultaneously (Spark's parallel processing).
Each RDD is divided into logical partitions, which can be processed on different nodes of a cluster.
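To make the idea of partitions concrete, here is a minimal pure-Python sketch. It does not use Spark: the names `split_into_partitions` and `square_partition` are hypothetical, and threads stand in for cluster nodes purely as an assumption for illustration. In PySpark, the closest real call is `sc.parallelize(data, numSlices)`, which distributes a local collection across partitions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a sequence into roughly equal, contiguous, read-only slices."""
    data = tuple(data)  # tuples are immutable, mirroring an RDD's read-only nature
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def square_partition(partition):
    """Work done independently on a single partition (one 'task')."""
    return tuple(x * x for x in partition)


partitions = split_into_partitions(range(1, 11), num_partitions=3)

# Each worker thread plays the role of a node processing its own partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    squared_partitions = list(pool.map(square_partition, partitions))

# Gather the per-partition results back into a single list.
result = [x for part in squared_partitions for x in part]

print(partitions)
print(result)
```

Each partition is processed without looking at any other partition, which is what allows the work to be spread over separate machines.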
Key Characteristics:
Horizontally scalable: Capacity can be increased by adding more machines (nodes) to the cluster, rather than by upgrading a single machine.
Levels of Operation in Spark:
Other Higher-Level APIs Built on Spark Core:
Operations on Spark RDD:
There are two major types of operations performed on Spark RDDs: transformations and actions.
Each transformation produces a new RDD, and Spark keeps track of the chain of transformations with the help of a Directed Acyclic Graph (DAG). Transformations are lazy: they are not executed when they are called, only recorded. The work runs when an action is performed, which lets Spark plan the whole chain at once and reduce data shuffling and processing overhead.
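The behaviour described above can be imitated with a small toy class. This is a sketch in plain Python, not Spark's implementation: `LazyDataset` and its internals are hypothetical names, and the recorded lineage here is a simple linear chain, the simplest case of a DAG. The method names `map`, `filter`, and `collect` are chosen to mirror the RDD methods of the same names.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, with a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)
        self._lineage = tuple(lineage)  # ordered (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a NEW dataset and computes nothing yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, never run here.
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element the mapped function actually touches


def double(x):
    calls.append(x)
    return x * 2


base = LazyDataset([1, 2, 3, 4])
doubled = base.map(double)                 # nothing runs
large = doubled.filter(lambda x: x > 4)    # still nothing runs
calls_before_action = len(calls)

output = large.collect()                   # the action triggers the work
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, output)
```

Note that `base` is never modified: every transformation hands back a new object, which is the immutability property of RDDs in miniature.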
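For example, running a heavy transformation on every record and only then filtering most of the records out wastes work. Because execution is deferred, Spark sees the whole chain of operations before running it. With the DataFrame and SQL APIs, the optimizer can move the filter ahead of the heavy transformation when doing so does not change the result. With plain RDDs, the operations run in the order they are written, so it is up to you to filter as early as possible.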
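A rough way to see why the ordering matters is to count how often the heavy step runs in each case. The sketch below is plain Python with made-up names (`heavy_transform`, `keep_even_ids`) and made-up data. It assumes the filter only inspects a field that the heavy transformation leaves unchanged, which is the condition under which swapping the two steps is safe.

```python
counter = {"heavy": 0}  # counts how many times the expensive step runs


def heavy_transform(record):
    """Stand-in for an expensive step; it leaves the id field untouched."""
    counter["heavy"] += 1
    record_id, value = record
    return (record_id, sum(i * i for i in range(value)))


def keep_even_ids(record):
    """Cheap filter that only looks at the id field."""
    return record[0] % 2 == 0


records = [(i, 1000) for i in range(10)]

# Order 1: heavy transformation on every record, then filter.
counter["heavy"] = 0
transform_then_filter = [r for r in map(heavy_transform, records) if keep_even_ids(r)]
calls_transform_first = counter["heavy"]

# Order 2: filter first, then transform only the surviving records.
counter["heavy"] = 0
filter_then_transform = [heavy_transform(r) for r in records if keep_even_ids(r)]
calls_filter_first = counter["heavy"]

print(calls_transform_first, calls_filter_first)
print(transform_then_filter == filter_then_transform)
```

Both orders give the same output, but filtering first runs the heavy step on half as many records in this example.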
Types of Transformations:
We’ll discuss more about various transformations and their differences in the next blog. Stay tuned!
About me -
Hello! I’m Shoukath Ali, an aspiring data professional, with a Master’s in Data Science and a Bachelor’s in Computer Science and Engineering.
If you have any queries or suggestions, please feel free to reach out to me at [email protected]
Connect with me on LinkedIn: www.dhirubhai.net/in/shoukath-ali-b6650576/
Disclaimer -
The views and opinions expressed on this blog are purely my own. Any product claim, statistic, quote, or other representation about a product or service should be verified with the manufacturer, provider, or party in question.