Spark Tidbits - Lesson 2
During my undergraduate education at the University of Rhode Island, I took an assembly language course. If we wanted to compute the Pythagorean theorem, we would move the value of A into two registers, multiply them, and move the squared result into memory. We would repeat the steps for the value of B. Then we would move the results into two separate registers and add the two values. Finally, we would take the square root of the sum to find the value of C.
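For readers who think better in code than in registers, here is a minimal Python sketch of the same sequence of steps; the values of a and b are just sample numbers of my own choosing.
import math
# move the value of A into a variable and square it
a = 3.0
a_squared = a * a
# repeat the steps for the value of B
b = 4.0
b_squared = b * b
# add the two squares and take the square root to find C
c = math.sqrt(a_squared + b_squared)
print(c)  # 5.0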
What do assembly language and data engineering in Spark have in common? Data processing is all about moving data from one place to another. In Spark, we read files into memory, change the contents of that memory, and write the results back to disk. In assembly, we move bits between memory locations and general-purpose registers.
Let's go back to the weather dataset, which has two CSV files: one for low temperatures and one for high temperatures. The first solution uses three code blocks to combine the two files and save the result as a delta table.
#
# 1 - Read low temp file
#
# location
path = "Files/Raw/Weather/low_temps.csv"
# read file
df_low = spark.read.format("csv").option("header","true").load(path)
# rename columns
df_low = df_low.withColumnRenamed("temp", "low_temp")
The code above and below loads the two files into separate DataFrames named df_low and df_high.
#
# 2 - Read high temp file
#
# location
path = "Files/Raw/Weather/high_temps.csv"
# read file
df_high = spark.read.format("csv").option("header","true").load(path)
# rename columns
df_high = df_high.withColumnRenamed("temp", "high_temp")
df_high = df_high.withColumnRenamed("date", "date2")
The code below joins the two DataFrames on the date columns, drops the duplicate date column, and saves the result as a delta table named tbl_temps_01.
#
# 3 - combine sets and save as delta table
#
# join dataframes + drop extra column
df_temps = df_low.join(df_high, df_low["date"] == df_high["date2"]).drop("date2")
# save as a delta table
df_temps.write.format("delta").saveAsTable("tbl_temps_01")
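To confirm that the write succeeded, one option is to read the new table back and inspect it. The snippet below is a small sketch of that check, not part of the original lesson.
#
# Optional - verify the new delta table
#
df_check = spark.table("tbl_temps_01")
print(df_check.count())
df_check.show(5)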
The interesting thing is that we can create the same delta table with one large block of Spark SQL code. Note that reading the CSV files directly in SQL does not apply the header option, so the columns come through as _c0 and _c1 and the header row must be filtered out in the where clause.
%%sql
--
-- 4 - read high + low temps, join sets, save as delta table
--
create table tbl_temps_02 as
select
    l.date,
    l.low_temp,
    h.high_temp
from
    (
        select
            _c0 as date,
            _c1 as low_temp
        from CSV.`Files/Raw/Weather/low_temps.csv`
    ) as l
    join
    (
        select
            _c0 as date,
            _c1 as high_temp
        from CSV.`Files/Raw/Weather/high_temps.csv`
    ) as h
    on l.date = h.date
where
    l.date <> 'date'
Both the PySpark DataFrame code and the Spark SQL statement create a delta table with the same results. The question is: which syntax does your company prefer and want to maintain in the future?
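If you want proof that the two tables really do match, a quick comparison can be run with the DataFrame API. The check below is a minimal sketch of my own, assuming both delta tables already exist; exceptAll returns the rows of one set that are missing from the other, so an empty result in both directions means the tables are identical.
#
# Optional - confirm the two tables hold the same data
#
df_01 = spark.table("tbl_temps_01")
df_02 = spark.table("tbl_temps_02")
print(df_01.exceptAll(df_02).count())  # expect 0
print(df_02.exceptAll(df_01).count())  # expect 0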
Next time, we will investigate how to create a calculated column using both syntaxes.