Spark SQL DataFrame
Malini Shukla
Introduction to Spark SQL DataFrame
DataFrames first appeared in Spark release 1.3.0. A DataFrame is a Dataset organized into named columns. It is similar to a table in a relational database or a data frame in R/Python, and can be thought of as a relational table with rich optimization techniques behind it.
To play with DataFrames in Spark, install Apache Spark in standalone mode.
The idea behind the DataFrame is to allow processing of large amounts of structured data. A DataFrame contains rows with a schema, where the schema describes the structure of the data.
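For instance, here is a minimal sketch (assuming a local standalone installation) that builds a DataFrame from a small Scala collection and prints its schema; the column names and data are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Start a local SparkSession (standalone mode).
val spark = SparkSession.builder()
  .appName("DataFrameIntro")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// A small DataFrame with named columns, built from a local collection.
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// The schema describes the structure of the data.
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)

df.show()
```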
The DataFrame in Apache Spark improves on the RDD while retaining the RDD's core features: immutability, in-memory processing, resilience, and distributed computation. On top of these, it allows the user to impose a structure onto a distributed collection of data, and thus provides a higher-level abstraction.
We can build a DataFrame from different data sources, for example structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in several languages, including Scala, Java, Python, and R.
In both Scala and Java, we represent a DataFrame as a Dataset of rows. In the Scala API, DataFrame is simply a type alias for Dataset[Row]; in the Java API, the user uses Dataset&lt;Row&gt; to represent a DataFrame.
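A sketch of building DataFrames from a few of these sources, reusing the SparkSession and implicits from the example above; the file path, case class, and JDBC connection details are placeholders, not part of the original article:

```scala
// From a structured data file (the path is a placeholder).
val fromFile = spark.read.json("people.json")

// From an existing RDD of case-class instances.
case class Person(name: String, age: Int)
val personRdd = spark.sparkContext.parallelize(Seq(Person("Alice", 34)))
val fromRdd = personRdd.toDF()

// From an external database over JDBC (connection details are placeholders).
val fromDb = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "people")
  .option("user", "spark")
  .option("password", "secret")
  .load()
```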
Why DataFrame?
The DataFrame is one step ahead of the RDD, since it provides custom memory management and an optimized execution plan:
Learn more about Apache Spark RDD vs DataFrame vs DataSet.
Custom Memory Management: This is also known as Project Tungsten. A lot of memory is saved because the data is stored off-heap in binary format, which also removes garbage-collection overhead. Expensive Java serialization is avoided as well, since the data is stored in binary form and its in-memory schema is known.
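As an illustration, Tungsten's off-heap storage can be enabled through Spark configuration; the `spark.memory.offHeap.*` options below are standard Spark settings, but the 2g size is only an example, and the right value depends on your workload:

```scala
import org.apache.spark.sql.SparkSession

// Enable off-heap memory for Tungsten; the size is illustrative.
val sparkOffHeap = SparkSession.builder()
  .appName("TungstenOffHeap")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()
```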
Optimized Execution Plan: This is also known as the query optimizer (Catalyst). It creates an optimized execution plan for each query; once the optimized plan is ready, the final execution takes place on Spark's RDDs.
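You can inspect the plans the optimizer produces with `explain`. A brief sketch, continuing from the earlier example:

```scala
// Filter and project; nothing executes yet, Spark only builds a plan.
val adults = df.filter($"age" > 21).select("name")

// Print the parsed, analyzed, and optimized logical plans plus the physical plan.
adults.explain(true)
```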
Features of Apache Spark DataFrame
Some of the limitations of the Spark RDD were:
- It does not have any built-in optimization engine.
- There is no provision to handle structured data.
The DataFrame came into existence to overcome these limitations. Some of the key features of the DataFrame in Spark are:
- A DataFrame is a distributed collection of data organized into named columns. It is equivalent to a table in an RDBMS.
- It can deal with both structured and semi-structured data formats, for example Avro, CSV, Elasticsearch, and Cassandra. It also works with storage systems such as HDFS, Hive tables, and MySQL (see the sketch below).
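A brief sketch of these two points, reading a hypothetical CSV file and querying the result with SQL as if it were an RDBMS table; the file name and column names are assumptions for illustration:

```scala
// Read a hypothetical CSV file with a header row, inferring column types.
val employees = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("employees.csv")

// Register the DataFrame as a temporary view and query it like an RDBMS table.
employees.createOrReplaceTempView("employees")
spark.sql("SELECT name FROM employees WHERE salary > 50000").show()
```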
You can refer to this guide to learn about the Spark SQL optimization phases in detail.