Spark Tidbits - Lesson 2
During my undergraduate education at the University of Rhode Island, I took an assembly language course. If we wanted to compute the Pythagorean theorem, we would move the value of A into two registers, multiply them, and move the squared result into memory. We would repeat the steps for the value of B. Then we would move the results into two separate registers and add the two values. Finally, we would take the square root of the sum to find the value of C.
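For readers who think better in code than in registers, here is a minimal Python sketch of the same sequence of steps; the values of a and b are just sample numbers of my own choosing.
import math
# move the value of A into a variable and square it
a = 3.0
a_squared = a * a
# repeat the steps for the value of B
b = 4.0
b_squared = b * b
# add the two squares and take the square root to find C
c = math.sqrt(a_squared + b_squared)
print(c)  # 5.0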
What do assembly language and data engineering in Spark have in common? Data processing is all about moving data from one place to another. In Spark, we read files into memory, change the contents of that memory, and write the results back to disk. In assembly, we move bits between memory locations and general-purpose registers.
Let's go back to the weather dataset, which has two CSV files: one for low temperatures and one for high temperatures. The first solution uses three code blocks to combine the two files and save the result as a delta table.
#
# 1 - Read low temp file
#
# location
path = "Files/Raw/Weather/low_temps.csv"
# read file
df_low = spark.read.format("csv").option("header","true").load(path)
# rename columns
df_low = df_low.withColumnRenamed("temp", "low_temp")
The code above and below loads the two files into separate DataFrames named df_low and df_high.
#
# 2 - Read high temp file
#
# location
path = "Files/Raw/Weather/high_temps.csv"
# read file
df_high = spark.read.format("csv").option("header","true").load(path)
# rename columns
df_high = df_high.withColumnRenamed("temp", "high_temp")
df_high = df_high.withColumnRenamed("date", "date2")
The code below joins the two DataFrames on the date columns, drops the duplicate date column, and saves the result as a delta table named tbl_temps_01.
#
# 3 - combine sets and save as delta table
#
# join dataframes + drop extra column
df_temps = df_low.join(df_high, df_low["date"] == df_high["date2"]).drop("date2")
# save as a delta table
df_temps.write.format("delta").saveAsTable("tbl_temps_01")
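To confirm that the write succeeded, one option is to read the new table back and inspect it. The snippet below is a small sketch of that check, not part of the original lesson.
#
# Optional - verify the new delta table
#
df_check = spark.table("tbl_temps_01")
print(df_check.count())
df_check.show(5)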
The interesting thing is that we can create the same delta table with one large block of Spark SQL code. Note that reading the CSV files directly in SQL does not apply the header option, so the columns come through as _c0 and _c1 and the header row must be filtered out in the where clause.
%%sql
--
-- 4 - read high + low temps, join sets, save as delta table
--
create table tbl_temps_02 as
select
    l.date,
    l.low_temp,
    h.high_temp
from
    (
        select
            _c0 as date,
            _c1 as low_temp
        from CSV.`Files/Raw/Weather/low_temps.csv`
    ) as l
    join
    (
        select
            _c0 as date,
            _c1 as high_temp
        from CSV.`Files/Raw/Weather/high_temps.csv`
    ) as h
    on l.date = h.date
where
    l.date <> 'date'
Both the PySpark DataFrame code and the Spark SQL statement create a delta table with the same results. The question is: which syntax does your company prefer and want to maintain in the future?
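If you want proof that the two tables really do match, a quick comparison can be run with the DataFrame API. The check below is a minimal sketch of my own, assuming both delta tables already exist; exceptAll returns the rows of one set that are missing from the other, so an empty result in both directions means the tables are identical.
#
# Optional - confirm the two tables hold the same data
#
df_01 = spark.table("tbl_temps_01")
df_02 = spark.table("tbl_temps_02")
print(df_01.exceptAll(df_02).count())  # expect 0
print(df_02.exceptAll(df_01).count())  # expect 0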
Next time, we will investigate how to create a calculated column using both syntaxes.