Big Data and Spark difference questionnaire: Part 3
This article continues the series.
StructType vs StructField
StructType and StructField are both used to specify a schema.
StructType is a collection of StructFields.
Using these two classes, we can create complex columns: nested structs, arrays, and maps.
Where are both classes imported from?
from pyspark.sql.types import StructType, StructField
If we call the printSchema() method on the DataFrame, StructType columns are shown as "struct".
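A minimal sketch of both classes together; the field names and sample data are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# StructType is the schema container; each StructField defines one column
schema = StructType([
    StructField("name", StructType([          # a nested struct column
        StructField("first", StringType(), True),
        StructField("last", StringType(), True),
    ]), True),
    StructField("age", IntegerType(), True),
])

df = spark.createDataFrame([(("James", "Smith"), 30)], schema)
df.printSchema()
# root
#  |-- name: struct (nullable = true)
#  |    |-- first: string (nullable = true)
#  |    |-- last: string (nullable = true)
#  |-- age: integer (nullable = true)
```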
UNION vs UNION ALL
UNION
Combines two or more DataFrames or table objects.
The result contains only unique records.
Make sure both objects have the same attributes, in the same sequence order.
UNION ALL:
It combines two or more SQL tables or DataFrames.
It appends the datasets vertically.
Duplicated data is possible in the result.
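A small sketch of the difference with two toy DataFrames. Note that in the DataFrame API, union() itself keeps duplicates (UNION ALL semantics), so SQL-style UNION needs an explicit distinct():

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-demo").getOrCreate()

df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "val"])

# DataFrame.union() appends vertically and keeps duplicates (UNION ALL)
union_all = df1.union(df2)            # 4 rows; (2, "b") appears twice

# For SQL UNION semantics, de-duplicate explicitly
union = df1.union(df2).distinct()     # 3 rows
```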
map() vs mapPartitions()
map() and mapPartitions() are both transformations.
With map() or mapPartitions(), we pass a function that implements the logic.
User-defined functions are mostly called through these methods.
map() is a row-level operation: the function runs once per element.
mapPartitions() runs the function once per partition, so heavy initializations such as database connections happen per partition instead of row by row. This improves job performance when you are dealing with heavyweight initialization on larger datasets.
Neither map() nor mapPartitions() aggregates, so the row count will not change before and after the transformation, though the columns (attributes) may increase.
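A minimal sketch of the two transformations on an RDD; the doubling logic is just a stand-in for real business logic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-demo").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)

# map(): the function is invoked once per element
doubled = rdd.map(lambda x: x * 2)

# mapPartitions(): the function is invoked once per partition and
# receives an iterator, so any expensive setup (e.g. opening a
# database connection) is paid once per partition, not once per row
def double_partition(rows):
    # hypothetical heavy initialization would go here
    for x in rows:
        yield x * 2

doubled_too = rdd.mapPartitions(double_partition)
print(doubled.collect())      # [2, 4, 6, 8]
print(doubled_too.collect())  # [2, 4, 6, 8]
```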
foreachPartition vs foreach
foreachPartition() and foreach() are both actions in Spark. Both are mostly used to manipulate accumulators.
When foreachPartition() is applied on a Spark DataFrame, it executes the specified function once for each partition of the DataFrame. This operation is mainly used to save the DataFrame result to RDBMS tables, produce it to Kafka topics, etc.
Use the foreachPartition() action when heavy initialization such as a database connection or a Kafka producer is needed, since it initializes once per partition rather than once per element (as foreach() does). The foreach() action is mostly used to update accumulator variables.
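A sketch of both actions on a trivial DataFrame; the sink calls are left as comments because the real connection API depends on your RDBMS or Kafka client:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-demo").getOrCreate()
df = spark.range(10)  # one column "id" with values 0..9

# foreach(): runs once per row; the classic use is an accumulator
acc = spark.sparkContext.accumulator(0)
df.foreach(lambda row: acc.add(row.id))
print(acc.value)  # 45

# foreachPartition(): runs once per partition, so a heavy resource
# (DB connection, Kafka producer) is created once per partition
def save_partition(rows):
    # conn = create_connection()   # hypothetical heavy setup
    for row in rows:
        pass                       # conn.write(row)
    # conn.close()

df.foreachPartition(save_partition)
```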
explode array vs map columns
Both array and map columns can be passed to the explode(e: Column) function.
When an array is passed to this function, it creates a new default column "col" that contains all the array elements. When a map is passed, it creates two new columns, one for the key and one for the value, and each element in the map is split into rows.
It ignores elements that are null or empty.
Ref : https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/
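A short sketch of both cases, with made-up sample data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

df = spark.createDataFrame(
    [("James", ["Java", "Scala"], {"hair": "black"}),
     ("Anna", [], None)],
    ["name", "languages", "props"],
)

# Exploding an array: one row per element, default column name "col";
# Anna's empty array produces no rows
df.select("name", explode("languages")).show()

# Exploding a map: one row per entry, with "key" and "value" columns
df.select("name", explode("props")).show()
```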
Thank you. To be continued (3/5).