Spark Tidbits - Lesson 12
Designing a Power BI report that will be used by the whole company requires planning, testing, and deployment. Many times, managers want a quick ad-hoc report that supplies insights within a few minutes. How can we leverage Fabric Lakehouse Python notebooks to provide those answers?
Today, we are going to use the Adventure Works dataset. The bronze layer has the raw data and the silver layer has the refined data. The code below lists the tables whose names start with 'silver'.
#
# 1 - Show silver tables in lakehouse
#
# grab data frame
df1 = spark.sql("show tables")
# filter rows
df1 = df1.filter("tableName like 'silver%'")
# show results
display(df1)
The image below shows the silver tables in the Fabric Lakehouse named "lh_adv_wrks".
Data modelers like to use both star and snowflake schemas. However, report designers like to de-normalize dimensions into views or tables that reduce the number of joins. The Fabric Lakehouse does support views. However, there are two issues that currently exist. First, views do not show up in the lakehouse explorer. Second, views do not show up in the SQL analytics endpoint or the Semantic Model.
#
# 2 - Views - does not show in explorer or sql analytics end point
#
# del view
stmt2 = "DROP VIEW IF EXISTS silver_vw_products"
df2 = spark.sql(stmt2)
# add view
stmt1 = """
SELECT
p.ProductKey,
p.ProductAlternateKey,
p.WeightUnitMeasureCode,
p.SizeUnitMeasureCode,
p.EnglishProductName,
p.StandardCost,
p.FinishedGoodsFlag,
p.Color,
p.SafetyStockLevel,
p.ReorderPoint,
p.ListPrice,
p.Size,
p.SizeRange,
p.Weight,
p.DaysToManufacture,
p.ProductLine,
p.DealerPrice,
p.Class,
p.Style,
p.ModelName,
p.StartDate,
p.EndDate,
p.Status,
sc.ProductSubcategoryKey,
sc.ProductSubcategoryAlternateKey,
sc.EnglishProductSubcategoryName,
pc.ProductCategoryKey,
pc.EnglishProductCategoryName
FROM
silver_dim_product_full as p
JOIN
silver_dim_product_subcategory_full as sc
ON
p.ProductSubcategoryKey = sc.ProductSubcategoryKey
JOIN
silver_dim_product_category_full as pc
ON
pc.ProductCategoryKey = sc.ProductCategoryKey;
"""
# create physical view
stmt2 = 'CREATE OR REPLACE VIEW silver_vw_products AS ' + stmt1
df2 = spark.sql(stmt2)
# create temporary view
df2 = spark.sql(stmt1)
df2.createOrReplaceTempView("tmp_silver_dim_products")
The above code collapses the product, product subcategory, and product category tables into both a persisted (physical) view and a temporary view.
#
# 3 - Show silver product tables in lakehouse
#
# grab data frame
df3 = spark.sql("show tables")
# filter rows
df3 = df3.filter("tableName like '%silver%product%'")
# show results
display(df3)
The output below shows both the tables and views, regardless of whether they are permanent or temporary.
The new notebooks in Fabric have a quick visuals feature. Let's take a look at this new feature now.
%%sql
--
-- 4 - quick visuals using fabric notebook feature
--
select * from tmp_silver_dim_products limit 10
The output from inspecting the List Price column is shown below. Do not believe everything you see. The minimum value and average value do not match the output from Spark SQL. Why is this?
The first rule of the quick statistics (Inspect) window is that only the first 1,000 records are used to generate the values. The total number of rows in our dataset is less than this limit, so sampling is not the source of the difference here. The second, unwritten rule is that nulls are converted to zeros. Let's look at both Spark SQL and Spark dataframes to see if they display the correct results. Please see the documentation for details on the Inspect feature of notebooks.
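To see why treating nulls as zeros shifts the minimum and the average, here is a minimal sketch that compares the two calculations side by side against the temporary view from step 2. The column aliases are just illustrative.
#
# Extra - compare averages with nulls excluded vs. nulls treated as zero
#
# import functions
from pyspark.sql import functions as F
# grab just the list price column from the temporary view
df_lp = spark.table("tmp_silver_dim_products").select("ListPrice")
# aggregates excluding nulls (Spark default) vs. after coalescing nulls to zero
display(
    df_lp.agg(
        F.avg("ListPrice").alias("avg_nulls_excluded"),
        F.avg(F.coalesce(F.col("ListPrice"), F.lit(0))).alias("avg_nulls_as_zero"),
        F.min("ListPrice").alias("min_nulls_excluded"),
        F.min(F.coalesce(F.col("ListPrice"), F.lit(0))).alias("min_nulls_as_zero")
    )
)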
The Spark SQL below uses the temporary view to generate a variety of statistics. The minimum value is 2.29 and the mean is 747.66. These calculations exclude the null values entirely.
%%sql
--
-- 5 - calculate statistics using spark functions
--
select
count(ListPrice) as cnt1,
round(min(ListPrice), 2) as min1,
round(max(ListPrice), 2) as max1,
round(std(ListPrice), 2) as std1,
round(sum(ListPrice), 2) as tot1,
round(sum(ListPrice) / count(ListPrice), 2) as avg1
from
tmp_silver_dim_products;
The output from executing the Spark SQL is shown below.
This time, we want to return the results of the SQL as a dataframe. The summary method displays statistics we might be interested in. We will be looking at the Product Key, Color and List Price columns.
#
# 6 - calculate statistics using spark function
#
df6 = spark.sql("select ProductKey, Color, ListPrice from tmp_silver_dim_products")
display(df6.summary())
The image below shows the output from executing the dataframe method.
Because views cannot be used in the Semantic Model of the Lakehouse, we are going to create a gold table for reporting. The code below joins several customer-related tables into a table named "gold_customer_sales".
#
# 7 - create gold reporting table
#
# del table
stmt7 = "DROP TABLE IF EXISTS gold_customer_sales"
df7 = spark.sql(stmt7)
# add table
stmt7 = """
CREATE TABLE gold_customer_sales AS
SELECT
p.EnglishProductCategoryName
,Coalesce(p.ModelName, p.EnglishProductName) AS Model
,c.CustomerKey
,s.SalesTerritoryGroup AS Region
, ABS(CAST((datediff(current_date(), c.BirthDate) / 365.25) AS INT)) as Age
,CASE
WHEN c.YearlyIncome < 40000 THEN 'Low'
WHEN c.YearlyIncome > 60000 THEN 'High'
ELSE 'Moderate'
END AS IncomeGroup
,d.CalendarYear
,d.FiscalYear
,d.MonthNumberOfYear AS Month
,f.SalesOrderNumber AS OrderNumber
,f.SalesOrderLineNumber AS LineNumber
,f.OrderQuantity AS Quantity
,f.ExtendedAmount AS Amount
FROM
silver_fact_internet_sales_full as f
INNER JOIN
silver_dim_date_full as d
ON
f.OrderDateKey = d.DateKey
INNER JOIN
tmp_silver_dim_products as p
ON
f.ProductKey = p.ProductKey
INNER JOIN
silver_dim_customer_full as c
ON
f.CustomerKey = c.CustomerKey
INNER JOIN
silver_dim_geography_full as g
ON
c.GeographyKey = g.GeographyKey
INNER JOIN
silver_dim_sales_territory_full as s
ON
g.SalesTerritoryKey = s.SalesTerritoryKey
"""
df7 = spark.sql(stmt7)
We can verify that the table was created in the lakehouse using the code below.
#
# 8 - Show gold tables in lakehouse
#
# grab data frame
df8 = spark.sql("show tables")
# filter rows
df8 = df8.filter("tableName like 'gold%'")
# show results
display(df8)
The image below shows one table exists in the gold layer for reporting.
Our manager wants to know the sales that happened in December 2010 by region and product model. The Spark query below gives us that answer for multiple years and months.
#
# 9 - summary by year, month, region, and model.
#
stmt9 = """
SELECT
CalendarYear as RptYear,
Month as RptMonth,
Region as RptRegion,
Model as ModelNo,
SUM(Quantity) as TotalQty,
SUM(Amount) as TotalAmt
FROM
gold_customer_sales
GROUP BY
CalendarYear,
Month,
Region,
Model
ORDER BY
CalendarYear,
Month,
Region
"""
# get data frame
df9 = spark.sql(stmt9)
# make temp view
df9.createOrReplaceTempView("rpt_sales_by_region_model")
Looking at the first seven rows of the result set gives us the answer our manager wants.
%%sql
--
-- 10 - Show December 2010 data
--
select *
from rpt_sales_by_region_model limit 7
The output below shows the total quantity and total amount for each product by year, month and region.
While the details in the table are nice, upper management is just worried about revenue by region. In fact, they want to see this information visually. How can we generate custom charts using Python code in a Fabric Lakehouse notebook?
There are several reporting libraries we can use. I am going to talk about seaborn and matplotlib today.
The first visualization our manager wants is a pie chart showing sales by region for the year 2013. First, we are going to group by year and category as a string and calculate total sales. This resulting Spark dataframe is converted into a Pandas dataframe.
#
# 11 - convert to pandas
#
# spark sql stmt
stmt11 = """
select cast(RptYear as string) || '-' || RptRegion as category, sum(TotalAmt) as value
from rpt_sales_by_region_model
where RptYear = 2013
group by cast(RptYear as string) || '-' || RptRegion
"""
# spark + pandas df
df_spark = spark.sql(stmt11)
df_pandas = df_spark.toPandas()
The code below uses seaborn to change the theme. The plotting is done with matplotlib.
#
# 12 - pie chart for 2013 sales by region
#
# import libs
import seaborn as sns
import matplotlib.pyplot as plt
# pie chart using Seaborn
sns.set_theme(style="whitegrid")
plt.figure(figsize=(6, 6))
plt.pie(df_pandas["value"], labels=df_pandas["category"], autopct='%1.1f%%')
plt.title("Sales by Region")
plt.show()
Executing the code produces our pie chart.
Another manager would like the information in a bar chart with actual sales numbers, not the percentage that each division contributes to the bottom line of the company.
#
# 13 - bar chart for 2013 sales by region
#
# import libs
import matplotlib.pyplot as plt
# create chart
plt.bar(df_pandas["category"], df_pandas["value"], label="Sales (Millions)", color='b')
plt.plot()
# add labels + titles
plt.xlabel("year & region")
plt.ylabel("total sales")
plt.title("Sales by Region")
plt.legend()
# show values
for index, value in enumerate(df_pandas["value"]):
plt.text(index - 0.125, value, str(round(value/1000000, 2)) + " M")
# show chart
plt.show()
The above code is a little more complex since we wanted the actual values shown at the top of each bar. By default, scientific notation is used on the y-axis and the categorical information is on the x-axis.
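If we would rather have the y-axis ticks display millions instead of scientific notation, a quick sketch using matplotlib's FuncFormatter is shown below. It reuses the df_pandas dataframe from step 11, and the labels are just illustrative.
#
# Extra - bar chart with the y-axis formatted in millions (sketch)
#
# import libs
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
# create chart on an explicit axes object
fig, ax = plt.subplots()
ax.bar(df_pandas["category"], df_pandas["value"], color='b')
# format the y-axis ticks as millions instead of scientific notation
ax.yaxis.set_major_formatter(FuncFormatter(lambda v, pos: f"{v / 1e6:.1f} M"))
# add labels + title
ax.set_xlabel("year & region")
ax.set_ylabel("total sales (millions)")
ax.set_title("Sales by Region")
# show chart
plt.show()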
A quick look at this chart shows that the North America region had the highest sales (6.55 million dollars) and the Pacific region had the lowest (4.34 million).
This article showed three ways to gain quick insights: the Fabric notebook Inspect feature, Seaborn charts, and Matplotlib visuals. Please vote on my two Fabric ideas that are currently open: idea 1 - fix how Inspect handles nulls, and idea 2 - show views in the lakehouse explorer. There are many more libraries that can be used to generate quick visuals. Please see the MS Learn documentation for more information.