Spark Tidbits - Lesson 7
Having company data stored in a lakehouse allows business users to gain insight into patterns. For instance, the manager at Adventure Works wants to know which products have exceeded a quarter of a million dollars in monthly sales in the past.
In the last lesson, we looked at the Sales LT schema, which used a full load pattern. The Bronze tables contained multiple versions of the truth (an audit trail), while the Silver tables contained only the most recent data. Please see my article on SQL Server Central for details.
The above image shows the silver tables in the Adventure Works lakehouse. The code below joins several of these tables into a view. Right now, views are not supported in the Lakehouse explorer.
%%sql
--
-- 1 - create view named rpt_prepared_data
--
CREATE OR REPLACE VIEW rpt_prepared_data AS
-- choose fields
SELECT
    d.CalendarYear AS RptYear,
    d.MonthNumberOfYear AS RptMonth,
    s.SalesTerritoryGroup AS RptRegion,
    COALESCE(p.ModelName, p.EnglishProductName) AS ModelNo,
    f.OrderQuantity AS Quantity,
    CAST(f.ExtendedAmount AS DECIMAL(10, 4)) AS Amount
-- sales
FROM
    silver_fact_internet_sales_full AS f
-- dates
INNER JOIN
    silver_dim_date_full AS d
ON
    f.OrderDateKey = d.DateKey
-- products
INNER JOIN
    silver_dim_product_full AS p
ON
    f.ProductKey = p.ProductKey
-- customers
INNER JOIN
    silver_dim_customer_full AS c
ON
    f.CustomerKey = c.CustomerKey
-- geographies
INNER JOIN
    silver_dim_geography_full AS g
ON
    c.GeographyKey = g.GeographyKey
-- territories
INNER JOIN
    silver_dim_sales_territory_full AS s
ON
    g.SalesTerritoryKey = s.SalesTerritoryKey;
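For readers who prefer the DataFrame API, the same join can be expressed without SQL. The sketch below is my own and is not part of the original lesson; it assumes the six silver tables shown above, and the names df_prepared and tmp_prepared_data are mine.

#
# Optional - the same join expressed with DataFrame methods
# (a sketch of my own, not part of the original lesson)
#

from pyspark.sql.functions import coalesce

# load the silver tables shown above
f = spark.table("silver_fact_internet_sales_full")
d = spark.table("silver_dim_date_full")
p = spark.table("silver_dim_product_full")
c = spark.table("silver_dim_customer_full")
g = spark.table("silver_dim_geography_full")
s = spark.table("silver_dim_sales_territory_full")

# join the fact table to the dimensions and pick the reporting fields
df_prepared = (
    f.join(d, f.OrderDateKey == d.DateKey)
     .join(p, f.ProductKey == p.ProductKey)
     .join(c, f.CustomerKey == c.CustomerKey)
     .join(g, c.GeographyKey == g.GeographyKey)
     .join(s, g.SalesTerritoryKey == s.SalesTerritoryKey)
     .select(
         d.CalendarYear.alias("RptYear"),
         d.MonthNumberOfYear.alias("RptMonth"),
         s.SalesTerritoryGroup.alias("RptRegion"),
         coalesce(p.ModelName, p.EnglishProductName).alias("ModelNo"),
         f.OrderQuantity.alias("Quantity"),
         f.ExtendedAmount.cast("decimal(10, 4)").alias("Amount"),
     )
)

# a session-scoped view; unlike the SQL view above, it is not persisted to the catalog
df_prepared.createOrReplaceTempView("tmp_prepared_data")

Note the trade-off: a temporary view created this way lives only for the Spark session, whereas the SQL view is saved to the catalog.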
If we execute the show tables command, we get a list of tables and views.
%%sql
--
-- 2 - show tables / views
--
SHOW TABLES
The screenshot below shows that the view named rpt_prepared_data does exist in the catalog but cannot be seen in the explorer.
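If you want to confirm this from code rather than a screenshot, the catalog API can list views directly. This small check is my own addition; spark.catalog.listTables() is a standard PySpark call that reports each object's tableType.

#
# Optional - confirm the view exists in the catalog
# (my own check, not part of the original lesson)
#
for tbl in spark.catalog.listTables():
    if tbl.tableType == "VIEW":
        print(f"found view: {tbl.name}")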
Today, we are going to see how to find sales by year, month, region, and model number that exceed 250 thousand dollars in value. The Spark SQL code below uses the GROUP BY, HAVING, and ORDER BY clauses to solve this problem.
%%sql
--
-- 3 - Spark SQL - Months with sales > 250K
--
SELECT
    RptYear,
    RptMonth,
    RptRegion,
    ModelNo,
    SUM(Quantity) AS TotalQty,
    SUM(Amount) AS TotalAmt
FROM
    rpt_prepared_data
GROUP BY
    RptYear,
    RptMonth,
    RptRegion,
    ModelNo
HAVING
    TotalAmt > 250000
ORDER BY
    RptYear,
    RptMonth,
    RptRegion,
    ModelNo
The image below shows that there were four months in the year 2011 in which the Road-150 bike exceeded this sales threshold. Additionally, the Mountain-200 bike achieved similar results in the year 2013.
The same results can be achieved with Spark DataFrames. First, we use the spark.sql method to place all the records from the previously defined view named rpt_prepared_data into a DataFrame named df_raw.
There are several methods that you should get familiar with:

- groupBy - groups the rows by the specified columns
- agg - applies aggregate functions to columns
- alias - renames a column for the end user
- filter - plays the role of the HAVING clause here
- orderBy - orders the resulting data set
- asc - applies ascending order
- desc - applies descending order

These methods are combined in the code below.
#
# 4 - Spark Dataframes - Months with sales > 250K
#

# grab library (note: this wildcard import shadows Python's built-in sum with Spark's sum)
from pyspark.sql.functions import *

# grab view data
df_raw = spark.sql('select * from rpt_prepared_data')

# summarize & filter
df_qry = (
    df_raw
    .groupBy("RptYear", "RptMonth", "RptRegion", "ModelNo")
    .agg(sum("Quantity").alias("TotalQty"), sum("Amount").alias("TotalAmt"))
    .filter(col('TotalAmt') > 250000)
    .orderBy(desc("RptYear"), desc("RptMonth"), asc("RptRegion"), asc("ModelNo"))
)

# show results
display(df_qry)
The same results are produced using the DataFrame methods.
The same tables and views can be created in Azure Synapse Analytics using Spark. The image below shows the same eight records we have been working with. The major difference is that views are supported in the lake database in Azure Synapse.
In a nutshell, aggregation of the Adventure Works tables can be done with both Spark SQL and Spark DataFrames. The same results can be achieved regardless of the coding syntax you choose.
I find views very useful, and I raised this issue with the Fabric product team in March 2024. Views do not show up in the Lakehouse explorer, and they cannot be used from the SQL analytics endpoint. Please ask for this feature to be added, since it already exists in both Azure Databricks and Azure Synapse.
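Until that happens, one possible workaround (my own suggestion, not an official fix) is to materialize the view as a Delta table, which does appear in the Lakehouse explorer and the SQL analytics endpoint. The table name rpt_prepared_data_tbl below is hypothetical.

#
# Optional workaround - materialize the view as a table (my own
# suggestion; the table name rpt_prepared_data_tbl is hypothetical)
#
df = spark.sql("select * from rpt_prepared_data")
df.write.mode("overwrite").saveAsTable("rpt_prepared_data_tbl")

The trade-off is that a materialized table must be refreshed whenever the silver tables change, whereas a view always reflects the current data.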