The SparkSession
Manoj Chandrashekar
You control your Spark Application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes user-defined manipulations across the cluster, and there is a one-to-one correspondence between a SparkSession and a Spark Application. In Scala and Python, the variable is available as spark when you start the console. Let’s go ahead and look at the SparkSession in both Scala and Python:
spark
In Scala, you should see something like the following:
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@...
In Python, you’ll see something like this:
<pyspark.sql.session.SparkSession at 0x7efda4c1ccd0>
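The console creates this variable for you, but in a standalone application you construct the SparkSession yourself through its builder. Here is a minimal sketch in Python (the application name is just an illustrative placeholder):

# in Python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("my-app") \
    .getOrCreate()  # returns the existing session if one is already active

Using getOrCreate preserves the one-to-one correspondence described above: if a session already exists, you get that one back instead of a second one.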
Let’s now perform the simple task of creating a range of numbers. This range is just like a named column in a spreadsheet:
// in Scala
val myRange = spark.range(1000).toDF("number")
# in Python
myRange = spark.range(1000).toDF("number")
We created a DataFrame with one column containing 1,000 rows with values from 0 to 999. This range of numbers represents a distributed collection. When run on a cluster, each part of this range of numbers exists on a different executor. This is a Spark DataFrame.
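Because the collection is distributed, a couple of quick calls make this concrete. A short sketch, assuming the myRange DataFrame created above:

# in Python
myRange.rdd.getNumPartitions()  # how many partitions Spark split the range into
myRange.count()                 # an action that runs across the cluster and returns 1000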