Spark Tidbits - Lesson 7
Having company data stored in a lakehouse allows business users to gain insight into patterns. For instance, the manager at Adventure Works wants to know which products have exceeded a quarter of a million dollars in monthly sales in the past.
In the last lesson, we looked at the Sales LT schema, which used a full load pattern. The Bronze tables contained multiple versions of the truth (an audit trail), while the Silver tables contained only the most recent data. Please see my article on SQL Server Central for details.
The above image shows the silver tables in the Adventure Works lakehouse. The code below joins several of these tables into a view. Right now, views are not supported in the Lakehouse explorer.
%%sql
--
-- 1 - create view named rpt_prepared_data
--
CREATE OR REPLACE VIEW rpt_prepared_data AS
-- choose fields
SELECT
    d.CalendarYear AS RptYear,
    d.MonthNumberOfYear AS RptMonth,
    s.SalesTerritoryGroup AS RptRegion,
    COALESCE(p.ModelName, p.EnglishProductName) AS ModelNo,
    f.OrderQuantity AS Quantity,
    CAST(f.ExtendedAmount AS DECIMAL(10, 4)) AS Amount
-- sales
FROM
    silver_fact_internet_sales_full AS f
-- dates
INNER JOIN
    silver_dim_date_full AS d
ON
    f.OrderDateKey = d.DateKey
-- products
INNER JOIN
    silver_dim_product_full AS p
ON
    f.ProductKey = p.ProductKey
-- customers
INNER JOIN
    silver_dim_customer_full AS c
ON
    f.CustomerKey = c.CustomerKey
-- geographies
INNER JOIN
    silver_dim_geography_full AS g
ON
    c.GeographyKey = g.GeographyKey
-- territories
INNER JOIN
    silver_dim_sales_territory_full AS s
ON
    g.SalesTerritoryKey = s.SalesTerritoryKey;
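For readers who prefer the DataFrame API, the same join can be expressed without SQL. The sketch below is my own and is not part of the original lesson; it assumes the six silver tables shown above, and the names df_prepared and tmp_prepared_data are mine.

#
# Optional - the same join expressed with DataFrame methods
# (a sketch of my own, not part of the original lesson)
#

from pyspark.sql.functions import coalesce

# load the silver tables shown above
f = spark.table("silver_fact_internet_sales_full")
d = spark.table("silver_dim_date_full")
p = spark.table("silver_dim_product_full")
c = spark.table("silver_dim_customer_full")
g = spark.table("silver_dim_geography_full")
s = spark.table("silver_dim_sales_territory_full")

# join the fact table to the dimensions and pick the reporting fields
df_prepared = (
    f.join(d, f.OrderDateKey == d.DateKey)
     .join(p, f.ProductKey == p.ProductKey)
     .join(c, f.CustomerKey == c.CustomerKey)
     .join(g, c.GeographyKey == g.GeographyKey)
     .join(s, g.SalesTerritoryKey == s.SalesTerritoryKey)
     .select(
         d.CalendarYear.alias("RptYear"),
         d.MonthNumberOfYear.alias("RptMonth"),
         s.SalesTerritoryGroup.alias("RptRegion"),
         coalesce(p.ModelName, p.EnglishProductName).alias("ModelNo"),
         f.OrderQuantity.alias("Quantity"),
         f.ExtendedAmount.cast("decimal(10, 4)").alias("Amount"),
     )
)

# a session-scoped view; unlike the SQL view above, it is not persisted to the catalog
df_prepared.createOrReplaceTempView("tmp_prepared_data")

Note the trade-off: a temporary view created this way lives only for the Spark session, whereas the SQL view is saved to the catalog.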
If we execute the show tables command, we get a list of tables and views.
%%sql
--
-- 2 - show tables / views
--
SHOW TABLES
The screenshot below shows that the view named rpt_prepared_data does exist in the catalog but cannot be seen in the explorer.
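If you want to confirm this from code rather than a screenshot, the catalog API can list views directly. This small check is my own addition; spark.catalog.listTables() is a standard PySpark call that reports each object's tableType.

#
# Optional - confirm the view exists in the catalog
# (my own check, not part of the original lesson)
#
for tbl in spark.catalog.listTables():
    if tbl.tableType == "VIEW":
        print(f"found view: {tbl.name}")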
Today, we are going to see how to find sales by year, month, region, and model number that exceed 250 thousand dollars in value. The Spark SQL code below uses the GROUP BY, HAVING, and ORDER BY clauses to solve this problem.
%%sql
--
-- 3 - Spark SQL - Months with sales > 250K
--
SELECT
    RptYear,
    RptMonth,
    RptRegion,
    ModelNo,
    SUM(Quantity) AS TotalQty,
    SUM(Amount) AS TotalAmt
FROM
    rpt_prepared_data
GROUP BY
    RptYear,
    RptMonth,
    RptRegion,
    ModelNo
HAVING
    TotalAmt > 250000
ORDER BY
    RptYear,
    RptMonth,
    RptRegion,
    ModelNo
The image below shows that there were four months in the year 2011 in which the Road-150 bike exceeded this sales threshold. Additionally, the Mountain-200 bike achieved similar results in the year 2013.
The same results can be achieved with Spark DataFrames. First, we use the spark.sql method to place all the records from the previously defined view named rpt_prepared_data into a DataFrame named df_raw.
There are several methods that you should get familiar with:

- groupBy - groups the rows by the specified columns
- agg - applies aggregate functions to columns
- alias - renames a column for the end user
- filter - plays the role of the HAVING clause here
- orderBy - orders the resulting data set
- asc - applies ascending order
- desc - applies descending order

These methods are combined in the code below.
#
# 4 - Spark Dataframes - Months with sales > 250K
#

# grab library (note: this wildcard import shadows Python's built-in sum with Spark's sum)
from pyspark.sql.functions import *

# grab view data
df_raw = spark.sql('select * from rpt_prepared_data')

# summarize & filter
df_qry = (
    df_raw
    .groupBy("RptYear", "RptMonth", "RptRegion", "ModelNo")
    .agg(sum("Quantity").alias("TotalQty"), sum("Amount").alias("TotalAmt"))
    .filter(col('TotalAmt') > 250000)
    .orderBy(desc("RptYear"), desc("RptMonth"), asc("RptRegion"), asc("ModelNo"))
)

# show results
display(df_qry)
The same results are produced using the DataFrame methods.
The same tables and views can be created in Azure Synapse Analytics using Spark. The image below shows the same eight records we have been working with. The major difference is that views are supported in the lake database in Azure Synapse.
In a nutshell, aggregation of the Adventure Works tables can be done with both Spark SQL and Spark DataFrames. The same results can be achieved regardless of the coding syntax you choose.
I find views very useful, and I raised this issue with the Fabric product team in March 2024. Views do not show up in the Lakehouse explorer, and they cannot be used from the SQL analytics endpoint. Please ask for this feature to be added, since it already exists in both Azure Databricks and Azure Synapse.
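Until that happens, one possible workaround (my own suggestion, not an official fix) is to materialize the view as a Delta table, which does appear in the Lakehouse explorer and the SQL analytics endpoint. The table name rpt_prepared_data_tbl below is hypothetical.

#
# Optional workaround - materialize the view as a table (my own
# suggestion; the table name rpt_prepared_data_tbl is hypothetical)
#
df = spark.sql("select * from rpt_prepared_data")
df.write.mode("overwrite").saveAsTable("rpt_prepared_data_tbl")

The trade-off is that a materialized table must be refreshed whenever the silver tables change, whereas a view always reflects the current data.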