Apache Spark (Big Data) DataFrame - Things to Know
Abhishek Choudhary
Data Infrastructure Engineering in RWE/RWD | Healthtech DhanvantriAI
What is the architecture of Apache Spark now?
What is the point of interaction in Spark?
Previously it was the RDD, but Spark has since introduced the DataFrame API, which is now considered the primary point of interaction with Spark. For things that cannot be done with a DataFrame, such as working with blobs or image recognition, you can still fall back to RDDs.
How does a job run in Spark?
What is partitioning in Apache Spark?
Partitioning is the main mechanism for using all of your hardware resources while executing a job.
More partitions = more parallelism
So, conceptually, you should check the number of slots in your hardware, i.e. how many tasks each executor can handle. Each partition lives on an executor and is processed as a separate task.
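As a rough illustration (assuming an existing SparkContext `sc` and SQLContext `sqlContext`, and illustrative sizes), you can inspect and change the number of partitions directly:

```python
# Ask for 8 partitions up front; each partition becomes one task.
rdd = sc.parallelize(range(1000000), 8)
print(rdd.getNumPartitions())        # 8

# The same idea with a DataFrame: more partitions = more parallel tasks,
# as long as there are enough executor slots (cores) to run them.
df = sqlContext.range(1000000)
print(df.rdd.getNumPartitions())
df = df.repartition(16)              # spread the work over 16 tasks
```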
How do the DataFrame APIs work?
With an RDD, the data looks like the following:
But in a DataFrame, the same data looks like this:
- So a DataFrame is more of a column structure, and each record is actually a row.
One feature of DataFrames is that if you cache a DataFrame, Spark can compress each column's values based on the type declared for that column; a string column, for example, can be compressed with string-specific compression techniques. This is a great performance optimization.
- You can run statistics naturally, since it works somewhat like SQL or a Python/R DataFrame.
- With an RDD, to process the last 7 days of data, Spark had to go through the entire dataset to find it; with a DataFrame you already have a time column to handle the situation, so Spark won't even look at records older than 7 days (see the sketch below).
- Easier to program.
- Better performance and storage use in the executor heap.
You need to write much less code to get to a result with a DataFrame, as the sketch below shows.
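For example, a minimal sketch of the "last 7 days" case above, assuming a hypothetical `logs` DataFrame with an `event_time` column:

```python
from pyspark.sql import functions as F

# Only records from the last 7 days; Spark can push this filter down
# and skip data that cannot match.
recent = logs.filter(F.col("event_time") >= F.date_sub(F.current_date(), 7))
recent.groupBy("user_id").count().show()

# The RDD equivalent would parse every record by hand and scan the full
# dataset, with no column pruning or data skipping.
```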
What is the most important thing about a DataFrame in Spark?
Spark DataFrames are lazily evaluated. Transformations only contribute to the query plan; they don't execute anything. Actions trigger execution of the query.
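A small sketch of that behaviour, assuming a hypothetical `events` DataFrame:

```python
clicks = events.filter(events["type"] == "click")   # transformation: nothing runs yet
users = clicks.select("user_id")                    # still nothing runs
users.explain()                                     # just prints the query plan
n = users.count()                                   # action: this triggers execution
```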
Everything a DataFrame does can be expressed in terms of one of the following fundamental transformations:
- mapPartitions()
- new ShuffledRDD
- zipPartitions()
What & Why Spark SQL?
While trying to use Dataframe, a user needed to use either SQLContext or HiveContext. It is highly recommended to use HiveContext. USE HiveContext.
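A minimal sketch of setting one up in PySpark 1.x (the file path is a placeholder; in Spark 2.x both contexts are replaced by SparkSession):

```python
from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)                    # assumes an existing SparkContext `sc`
df = hiveContext.read.json("/data/events.json")  # hypothetical input
df.registerTempTable("events")
hiveContext.sql("SELECT count(*) FROM events").show()
```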
What goes on behind the scenes of DataFrames?
Create DataFrames (from the right data sources).
How do you read data from JDBC?
- Supports any JDBC-compatible RDBMS: MySQL, Postgres, H2, etc.
- It supports predicate pushdown. If you only care about certain columns of a relational table, you don't have to read them all in and filter afterwards; the DataFrame tells the data source to prune the columns and return only the data you need (see the sketch after this list).
- The same Spark application can be shared among multiple JDBC servers.
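A hedged sketch of a JDBC read in PySpark; the URL, table name, and credentials below are placeholders:

```python
orders = (sqlContext.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "spark_user")
          .option("password", "secret")
          .load())

# Column pruning and predicate pushdown: only the selected columns and
# matching rows are requested from the database.
orders.select("order_id", "amount").filter("amount > 100").explain()
```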
How does a DataFrame ensure it reads less data?
- You can skip partitions while reading the data with a DataFrame.
- Using Parquet
- Skipping data using statistics (i.e. min, max)
- Using partitioning (i.e. year = 2015/month = 06…), as sketched below
- Pushing predicates into storage systems.
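A sketch of the partitioning point, assuming a hypothetical `events` DataFrame with `year` and `month` columns and placeholder paths:

```python
# Write the data partitioned into year=.../month=... folders.
events.write.partitionBy("year", "month").parquet("/data/events_parquet")

# A filter on the partition columns only touches the matching folders;
# the rest of the data is never read.
df = sqlContext.read.parquet("/data/events_parquet")
june_2015 = df.filter("year = 2015 AND month = 6")
```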
What is Parquet?
Parquet should be the source format for any operation or ETL. So if the data is in a different format, the preferred approach is to convert the source to Parquet and then process it.
If a dataset arrives as JSON or a comma-separated file, first run an ETL step to convert it to Parquet (sketched below).
- It limits I/O, so it scans/reads only the columns that are needed.
- Parquet uses a columnar layout, so it compresses better and saves space.
So Parquet stores the first column's values together, then the next column's, and so on. If we have 3 columns and a SQL query touches only 2 of them, Parquet won't even consider reading the 3rd.
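A minimal sketch of that "convert to Parquet first" advice, with hypothetical paths and column names:

```python
# One-off ETL: read the raw JSON once and persist it as Parquet.
raw = sqlContext.read.json("/raw/clicks.json")
raw.write.parquet("/warehouse/clicks")

# Later queries only read the columns they actually need.
clicks = sqlContext.read.parquet("/warehouse/clicks")
clicks.select("user_id", "url").show()
```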
How does Parquet store data?
Parquet stores data together with metadata in a footer.
Each row group has its metadata recorded in the footer.
When reading Parquet, Spark has to auto-discover the layout of the data.
So Spark runs a first (separate) job that reads the footers to understand the layout of the data, and a second job that actually accesses the data, or the required columns of it.
What are the features of Parquet?
- Metadata merging. If you add or remove columns in the data files, Spark can scan the file metadata to understand the change, so there is no need to tell Spark about the change in the data format (sketched below).
- Auto-discovery of data that has been partitioned into folders.
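A short sketch of both features, with hypothetical paths:

```python
# Schema merging: files written with different (but compatible) schemas
# are combined; the resulting schema is the union found in the footers.
merged = sqlContext.read.option("mergeSchema", "true").parquet("/warehouse/clicks")
merged.printSchema()

# Partition auto-discovery: reading the root folder picks up
# year=2015/month=06/... directories as `year` and `month` columns.
events = sqlContext.read.parquet("/data/events_parquet")
```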
What is the technique behind caching a DataFrame?
Spark SQL re-encodes the data into byte buffers before caching it, so that there is less pressure on the GC.
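A couple of common ways to do this in practice (a sketch; `table_x` is a placeholder name):

```python
# Cache the DataFrame itself; the columnar, compressed buffers are built
# when an action first materializes it.
df.cache()
df.count()

# Or cache it as a table and query it with SQL (Spark 1.x API shown;
# Spark 2.x uses createOrReplaceTempView).
df.registerTempTable("table_x")
sqlContext.cacheTable("table_x")
sqlContext.sql("SELECT count(*) FROM table_x").collect()
```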
These are just the basics; I will try to add more.