Big Data and Spark difference between questionnaire: Part 3

Saikrishna Cheruvu

Lead Developer | Data Engineer | MLOPS | ex@ BOFA

Published: July 4, 2021

This article continues the series.

Part 1 link: Big Data and Spark difference between questionnaire: Part 1

Part 2 link: Big Data and Spark difference between questionnaire: Part 2

StructType vs StructField

StructType and StructField are both used to specify a schema.

A StructType is a collection of StructField objects.

Using these two classes, we can create complex columns: nested structs, arrays, and maps.

Both classes are imported from pyspark.sql.types:

from pyspark.sql.types import StructType, StructField

Calling the printSchema() method on a DataFrame shows StructType columns as “struct”.

UNION vs UNION ALL

UNION

Combines two or more DataFrames or table objects.

The result contains only unique records. In Spark SQL this is what UNION does; the DataFrame union() method keeps duplicates, so call distinct() afterwards to get the same effect.

Both inputs must have the same attributes (columns), in the same order.

UNION ALL

Combines two or more SQL tables or DataFrames.

Appends the data sets vertically.

The result may contain duplicate records.
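
To see the difference without a cluster, the sketch below runs both statements with Python's built-in sqlite3 module. This is a swapped-in engine, not Spark, but Spark SQL applies the same UNION and UNION ALL rules. The table names t1 and t2 and their rows are invented for the example. The helper spark_equivalents is also hypothetical; it only shows how the same two results are usually written with the DataFrame API.

```python
import sqlite3

# Two small tables with the same columns in the same order; (2, 'b') is in both.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, name TEXT);
    CREATE TABLE t2 (id INTEGER, name TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (2, 'b'), (3, 'c');
""")

# UNION removes the duplicate row.
union_rows = conn.execute(
    "SELECT id, name FROM t1 UNION SELECT id, name FROM t2 ORDER BY id"
).fetchall()

# UNION ALL simply appends the second result to the first.
union_all_rows = conn.execute(
    "SELECT id, name FROM t1 UNION ALL SELECT id, name FROM t2 ORDER BY id"
).fetchall()
conn.close()

print(len(union_rows), len(union_all_rows))


def spark_equivalents(df1, df2):
    """df1, df2: existing PySpark DataFrames with matching columns (assumed)."""
    union_all_df = df1.union(df2)         # keeps duplicates, like UNION ALL
    union_df = df1.union(df2).distinct()  # drops duplicates, like UNION
    return union_df, union_all_df
```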

map() vs mapPartitions()

map() and mapPartitions() are both transformations.

Both take a function that holds the processing logic.

They are mostly used to call user-defined functions.

map() is a row-level operation: the function is called once per element.

mapPartitions() calls the function once per partition, which suits heavy initialization such as opening a database connection: the setup runs once per partition instead of once per row. This improves job performance when heavyweight initialization meets a large dataset.

Neither map() nor mapPartitions() aggregates. With map() the row count is the same before and after the transformation, although the number of columns (attributes) may grow. mapPartitions() is normally used the same way, but its function is free to return more or fewer elements than it received.
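
The initialization cost described above can be sketched as follows. TaxLookup is a hypothetical stand-in for an expensive resource, and per_row, per_partition, and run_on_spark are names invented for this example. So that the sketch runs without a cluster, two plain lists play the role of two partitions; on Spark the same functions would be passed to rdd.map() and rdd.mapPartitions().

```python
from typing import Iterable, Iterator, Tuple


class TaxLookup:
    """Hypothetical stand-in for a costly resource (e.g. a database connection)."""

    created = 0  # local-demo counter only; it is not shared across Spark executors

    def __init__(self) -> None:
        TaxLookup.created += 1
        self.rate = 0.5

    def enrich(self, row: Tuple[str, float]) -> Tuple[str, float, float]:
        item, price = row
        return (item, price, price * self.rate)  # same row, one extra column


def per_row(row: Tuple[str, float]) -> Tuple[str, float, float]:
    # For rdd.map(per_row): called once per element, so setup runs once per row.
    return TaxLookup().enrich(row)


def per_partition(rows: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float, float]]:
    # For rdd.mapPartitions(per_partition): receives an iterator over one
    # partition and yields results, so setup runs once per partition.
    lookup = TaxLookup()
    for row in rows:
        yield lookup.enrich(row)


def run_on_spark(rdd):
    """rdd: an existing PySpark RDD of (item, price) tuples (assumed)."""
    return rdd.map(per_row), rdd.mapPartitions(per_partition)


# Local check without a cluster: two lists play the role of two partitions.
partitions = [[("pen", 2.0), ("book", 10.0)], [("bag", 20.0)]]

map_result = [per_row(row) for part in partitions for row in part]
map_setups = TaxLookup.created

TaxLookup.created = 0
partition_result = [out for part in partitions for out in per_partition(iter(part))]
partition_setups = TaxLookup.created

print(map_setups, partition_setups)
```

Both styles produce the same rows; only the number of times the resource is set up differs.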

foreachPartition vs foreach

foreachPartition() and foreach() are both actions in Spark. Both are often used to update accumulators.

When foreachPartition() is applied to a Spark DataFrame, it executes the supplied function once for each partition of the DataFrame. It is mainly used to save DataFrame results to RDBMS tables, publish them to Kafka topics, and so on.

Use foreachPartition() when the work needs heavy initialization, such as a database connection or a Kafka producer, because the initialization happens once per partition rather than once per element as with foreach(). foreach() is an action, not a transformation, and is mostly used to update accumulator variables.
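
A minimal sketch of the two write patterns follows. FakeSink is a hypothetical in-memory stand-in for an RDBMS connection or Kafka producer, and save_row, save_partition, and run_on_spark are invented names. The local check at the bottom uses plain lists as partitions so it runs without a cluster; run_on_spark shows how the same functions, plus an accumulator, would be wired to an existing SparkSession and DataFrame.

```python
class FakeSink:
    """Hypothetical stand-in for an RDBMS connection or Kafka producer."""

    opened = 0    # local-demo state only; on Spark this would live on each executor
    written = []

    def __init__(self):
        FakeSink.opened += 1

    def write(self, row):
        FakeSink.written.append(row)

    def close(self):
        pass


def save_row(row):
    # For df.foreach(save_row): one connection per element.
    sink = FakeSink()
    try:
        sink.write(row)
    finally:
        sink.close()


def save_partition(rows):
    # For df.foreachPartition(save_partition): one connection per partition.
    sink = FakeSink()
    try:
        for row in rows:
            sink.write(row)
    finally:
        sink.close()


def run_on_spark(spark, df):
    """spark: an existing SparkSession; df: any DataFrame (both assumed)."""
    counter = spark.sparkContext.accumulator(0)
    df.foreach(lambda row: counter.add(1))  # action: returns None
    df.foreachPartition(save_partition)     # action: returns None
    return counter.value


# Local check without a cluster: two lists play the role of two partitions.
partitions = [[1, 2], [3]]

for part in partitions:
    for row in part:
        save_row(row)
opened_by_foreach = FakeSink.opened

FakeSink.opened, FakeSink.written = 0, []
for part in partitions:
    save_partition(iter(part))
opened_by_foreach_partition = FakeSink.opened

print(opened_by_foreach, opened_by_foreach_partition)
```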

explode: array vs map columns

The explode(e: Column) function can be applied to both array columns and map columns.

When an array is passed to this function, it creates a new column, named “col” by default, that holds one array element per row. When a map is passed, it creates two new columns, “key” and “value”, and each entry in the map becomes its own row.

Rows whose array or map is null or empty are dropped from the result.

Ref : https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/

Thank you. To be continued (3/5).
