Introduction to Apache Spark
NIKHIL G R
Serving Notice Period, Cloud Data Engineer at TCS, 2x Microsoft Azure Cloud Certified, Python, Pyspark, Azure Databricks, ADLs, Azure Synapse, Azure Data factory, MySQL, Lake House, Delta Lake, Data Enthusiast
Apache Spark is a Distributed Computing Framework.
Before going into Apache Spark, let us understand the challenges with MapReduce and why Apache Spark came into existence.
Challenges of MapReduce
What is Apache Spark?
Apache Spark is a plug-and-play compute engine. Spark does not come with its own storage layer or resource manager.
Many people assume that Spark is an alternative to Hadoop, but that is wrong. Apache Spark is an alternative to MapReduce within the Hadoop ecosystem.
It is a General Purpose, In-Memory, Compute Engine.
General Purpose
In-Memory
Compute Engine
Apache Spark vs Databricks
Databricks is a company, and its product is also called Databricks.
It is Spark internally, but adds extra features on top (for example, an optimized runtime, collaborative notebooks, and managed clusters).
Apache Spark mainly provides two types of APIs:
RDDs
Drawbacks of RDD
DataFrames
Spark SQL
Spark Table
Transformations, Actions and Utility Functions
Transformations - These are operations applied on an RDD to create a new RDD.
There are two types of transformations: narrow (e.g. map(), filter()) and wide (e.g. groupByKey(), which requires a shuffle).
Actions - These are operations applied on an RDD that instruct Apache Spark to perform the computation and send the result back to the driver.
e.g. reduce(), count()
Utility Functions - These are built-in helper functions available for common operations.
e.g. cache(), printSchema()
Why are Transformations Lazy?
Consider a file file1 in HDFS containing more than 10 billion records, on which we perform certain operations to display a single record.
If transformations were not lazy, all 10 billion records would be loaded into memory just to display that one record. Because they are lazy, Spark waits until an action is called and only does the work the result actually requires.
SparkSession is the entry point for any Spark program. Before Spark 2, we had separate entry points such as SparkContext, SQLContext, and HiveContext.
After Spark 2, these have been bundled under one umbrella: SparkSession.
Credit: Sumit Mittal sir