Big Data and Spark difference between questionnaire: Part 3

This article continues the series.

Part 1 link: Big Data and Spark difference between questionnaire: Part 1

Part 2 link: Big Data and Spark difference between questionnaire: Part 2

StructType vs StructField

StructType and StructField are both used to specify a schema.

A StructType is a collection of StructFields.

Using these two classes we can create complex columns: nested structs, arrays, and maps.

Both classes are imported from pyspark.sql.types:

from pyspark.sql.types import StructType, StructField

Calling the printSchema() method on a DataFrame shows StructType columns as "struct".

UNION vs UNION ALL

UNION

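Combines two or more DataFrames or SQL tables.

In SQL, the result contains only unique records. Note that DataFrame.union() in Spark does not remove duplicates: it behaves like SQL UNION ALL, so call distinct() on the result to get SQL UNION semantics.

Both inputs must have the same number of columns, with compatible types, in the same order, because columns are matched by position rather than by name.

UNION ALL

Combines two or more SQL tables or DataFrames.

It appends the data sets vertically.

Duplicate records are kept.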
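The SQL difference can be sketched with Python's built-in sqlite3 module, used here only as a stand-in SQL engine rather than Spark. The tables a and b and their rows are hypothetical. The row (2, 'y') exists in both tables, so it shows up once under UNION and twice under UNION ALL.

```python
import sqlite3

# In-memory database with two hypothetical tables that share one row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (2, 'y'), (3, 'z');
""")

# UNION removes duplicate rows from the combined result.
union_rows = conn.execute(
    "SELECT id, name FROM a UNION SELECT id, name FROM b ORDER BY id"
).fetchall()

# UNION ALL simply appends the second result to the first.
union_all_rows = conn.execute(
    "SELECT id, name FROM a UNION ALL SELECT id, name FROM b ORDER BY id"
).fetchall()
conn.close()

print(union_rows)      # 3 rows
print(union_all_rows)  # 4 rows, (2, 'y') appears twice
```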

map() vs mapPartitions()

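map() and mapPartitions() are both transformations. In PySpark they are available on RDDs (for a DataFrame, through df.rdd).

Both take a function that holds the processing logic.

Mostly they are used to call user-defined functions.

map() is a row-level operation: the function is called once per element.

mapPartitions() calls the function once per partition and passes it an iterator over that partition's elements. It suits heavy initialization, such as opening a database connection, because the setup runs once per partition instead of once per row. This improves job performance on larger datasets.

Neither one aggregates. map() always returns exactly one output per input, so the row count does not change, although the number of columns (attributes) may increase. With mapPartitions() the row count also stays the same as long as the function yields one output per input element.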
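The cost difference can be illustrated without a cluster. The sketch below is a plain-Python model, not Spark itself: an RDD is imitated by a list of partitions, and expensive_setup is a hypothetical stand-in for heavy initialization. It counts how often the setup runs in each style.

```python
# Plain-Python model of an RDD: a list of partitions, each a list of rows.
partitions = [[1, 2, 3], [4, 5], [6]]

setup_calls = {"map": 0, "mapPartitions": 0}


def expensive_setup(kind):
    """Hypothetical stand-in for heavy initialization (e.g. a DB connection)."""
    setup_calls[kind] += 1
    return lambda row: row * 10


def per_row(row):
    # map-style function: receives one row, so the setup repeats for every row.
    convert = expensive_setup("map")
    return convert(row)


def per_partition(rows):
    # mapPartitions-style function: receives an iterator over a whole
    # partition, so the setup runs once and is reused for each row.
    convert = expensive_setup("mapPartitions")
    for row in rows:
        yield convert(row)


mapped = [[per_row(r) for r in part] for part in partitions]
mapped_by_partition = [list(per_partition(iter(part))) for part in partitions]

print(setup_calls)
```

Both styles produce the same rows; only the number of setup calls differs (one per row versus one per partition). On a real RDD the same two functions would be passed as rdd.map(per_row) and rdd.mapPartitions(per_partition).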

foreachPartition vs foreach

foreachPartition() and foreach() are both actions in Spark. They return nothing and are used for side effects, such as updating accumulators or writing to external systems.

When foreachPartition() is applied to a Spark DataFrame, it executes the given function once for each partition of the DataFrame. This operation is mainly used to save the DataFrame result to RDBMS tables, publish it to Kafka topics, etc.

Use foreachPartition() when heavy initialization is needed, such as a database connection or a Kafka producer, because it initializes once per partition rather than once per element as foreach() does. The foreach() action is mostly used to update accumulator variables.
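The once-per-partition pattern might look like the following sketch. It is a plain-Python model rather than Spark code: FakeConnection is a hypothetical stand-in for a database connection or Kafka producer, and the loop at the bottom imitates what foreachPartition() would do across partitions.

```python
class FakeConnection:
    """Hypothetical stand-in for a database connection or Kafka producer."""

    opened = 0  # counts how many connections were created in total

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, row):
        self.sent.append(row)

    def close(self):
        pass


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
written = []  # collects everything "written", for inspection only


def write_partition(rows):
    # Open one connection for the whole partition, write every row,
    # then close it. Nothing is returned: this is a pure side effect.
    conn = FakeConnection()
    try:
        for row in rows:
            conn.send(row)
    finally:
        written.extend(conn.sent)
        conn.close()


# Imitates foreachPartition: the function is called once per partition.
for part in partitions:
    write_partition(iter(part))

print(FakeConnection.opened, written)
```

Six rows are written through three connections. A foreach()-style function would receive a single row at a time, so opening the connection inside it would create six.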

explode array vs map columns

explode(e: Column) is a single function that accepts either an array column or a map column.

When an array is passed to this function, it creates a new column, named "col" by default, that contains the array elements, one per row. When a map is passed, it creates two new columns, "key" and "value", and each entry in the map becomes its own row.

Rows whose array or map is null or empty are dropped from the result.
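The row-level behaviour described above can be modelled in plain Python. This is a sketch of the semantics only, not Spark's implementation; the helper names explode_array and explode_map and the sample rows are hypothetical.

```python
def explode_array(rows):
    """Yield one (name, element) row per array element."""
    for name, values in rows:
        # A None or empty array yields nothing, so the input row disappears.
        for value in values or []:
            yield (name, value)


def explode_map(rows):
    """Yield one (name, key, value) row per map entry."""
    for name, mapping in rows:
        # A None or empty map yields nothing as well.
        for key, value in (mapping or {}).items():
            yield (name, key, value)


array_rows = [("james", ["java", "scala"]), ("anna", []), ("maria", None)]
map_rows = [("james", {"hair": "black", "eye": "brown"}), ("anna", None)]

exploded_array = list(explode_array(array_rows))
exploded_map = list(explode_map(map_rows))

print(exploded_array)
print(exploded_map)
```

Only "james" survives in both results, because the other rows hold an empty or null collection.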

Ref : https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/

Thank you. To be continued (3/5).