Spark SQL DataFrame
Malini Shukla
Introduction to Spark SQL DataFrame
DataFrames first appeared in Spark release 1.3.0. A DataFrame is a Dataset organized into named columns. It is similar to a table in a relational database or a data frame in R/Python, and can be thought of as a relational table with rich optimization techniques behind it.
To play with DataFrames in Spark, install Apache Spark in standalone mode.
The idea behind the DataFrame is to allow processing of large amounts of structured data. A DataFrame contains rows with a schema, where the schema describes the structure of the data.
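For instance, here is a minimal sketch (assuming a local standalone installation) that builds a DataFrame from a small Scala collection and prints its schema; the column names and data are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Start a local SparkSession (standalone mode).
val spark = SparkSession.builder()
  .appName("DataFrameIntro")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// A small DataFrame with named columns, built from a local collection.
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// The schema describes the structure of the data.
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)

df.show()
```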
The DataFrame in Apache Spark improves on the RDD while retaining the RDD's core features: immutability, in-memory processing, resilience, and distributed computation. On top of these, it allows the user to impose a structure onto a distributed collection of data, and thus provides a higher-level abstraction.
We can build a DataFrame from different data sources, for example structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in several languages, including Scala, Java, Python, and R.
In both Scala and Java, we represent a DataFrame as a Dataset of rows. In the Scala API, DataFrame is simply a type alias for Dataset[Row]; in the Java API, the user uses Dataset&lt;Row&gt; to represent a DataFrame.
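A sketch of building DataFrames from a few of these sources, reusing the SparkSession and implicits from the example above; the file path, case class, and JDBC connection details are placeholders, not part of the original article:

```scala
// From a structured data file (the path is a placeholder).
val fromFile = spark.read.json("people.json")

// From an existing RDD of case-class instances.
case class Person(name: String, age: Int)
val personRdd = spark.sparkContext.parallelize(Seq(Person("Alice", 34)))
val fromRdd = personRdd.toDF()

// From an external database over JDBC (connection details are placeholders).
val fromDb = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "people")
  .option("user", "spark")
  .option("password", "secret")
  .load()
```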
Why DataFrame?
The DataFrame is one step ahead of the RDD, since it provides custom memory management and an optimized execution plan:
Learn more about Apache Spark RDD vs DataFrame vs DataSet.
Custom Memory Management: This is also known as Project Tungsten. A lot of memory is saved because the data is stored off-heap in binary format, which also removes garbage-collection overhead. Expensive Java serialization is avoided as well, since the data is stored in binary form and its in-memory schema is known.
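As an illustration, Tungsten's off-heap storage can be enabled through Spark configuration; the `spark.memory.offHeap.*` options below are standard Spark settings, but the 2g size is only an example, and the right value depends on your workload:

```scala
import org.apache.spark.sql.SparkSession

// Enable off-heap memory for Tungsten; the size is illustrative.
val sparkOffHeap = SparkSession.builder()
  .appName("TungstenOffHeap")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()
```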
Optimized Execution Plan: This is also known as the query optimizer (Catalyst). It creates an optimized execution plan for each query; once the optimized plan is ready, the final execution takes place on Spark's RDDs.
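You can inspect the plans the optimizer produces with `explain`. A brief sketch, continuing from the earlier example:

```scala
// Filter and project; nothing executes yet, Spark only builds a plan.
val adults = df.filter($"age" > 21).select("name")

// Print the parsed, analyzed, and optimized logical plans plus the physical plan.
adults.explain(true)
```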
Features of Apache Spark DataFrame
Some of the limitations of the Spark RDD were:
- It does not have any built-in optimization engine.
- There is no provision to handle structured data.
The DataFrame came into existence to overcome these limitations. Some of the key features of the DataFrame in Spark are:
- A DataFrame is a distributed collection of data organized into named columns. It is equivalent to a table in an RDBMS.
- It can deal with both structured and semi-structured data formats, for example Avro, CSV, Elasticsearch, and Cassandra. It also works with storage systems such as HDFS, Hive tables, and MySQL (see the sketch below).
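A brief sketch of these two points, reading a hypothetical CSV file and querying the result with SQL as if it were an RDBMS table; the file name and column names are assumptions for illustration:

```scala
// Read a hypothetical CSV file with a header row, inferring column types.
val employees = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("employees.csv")

// Register the DataFrame as a temporary view and query it like an RDBMS table.
employees.createOrReplaceTempView("employees")
spark.sql("SELECT name FROM employees WHERE salary > 50000").show()
```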
You can refer to this guide to learn about the Spark SQL optimization phases in detail.