The SparkSession

You control your Spark Application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application. In Scala and Python, the variable is available as spark when you start the console. Let’s go ahead and look at the SparkSession in both Scala and Python:

spark        

In Scala, you should see something like the following:

res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@...

In Python, you’ll see something like this:

<pyspark.sql.session.SparkSession at 0x7efda4c1ccd0>        
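If you are running Spark as a standalone application rather than through the interactive console, the spark variable is not created for you. Here is a minimal sketch in Python of building one yourself (the application name "MySparkApp" is just a placeholder):

# in Python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession when you are not in the shell.
# "MySparkApp" is a placeholder application name.
spark = (SparkSession.builder
         .appName("MySparkApp")
         .getOrCreate())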

Let’s now perform the simple task of creating a range of numbers. This range of numbers is just like a named column in a spreadsheet:

// in Scala
val myRange = spark.range(1000).toDF("number")

# in Python
myRange = spark.range(1000).toDF("number")        

We created a DataFrame with one column containing 1,000 rows with values from 0 to 999. This range of numbers represents a distributed collection. When run on a cluster, each part of this range of numbers exists on a different executor. This is a Spark DataFrame.
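As a quick sanity check (a small sketch, not part of the original walkthrough), you can count the rows and display a few of them:

# in Python
# Confirm the DataFrame's contents: 1,000 rows, values 0 to 999.
myRange.count()   # returns 1000
myRange.show(3)   # prints the first three values of the "number" column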

