Spark Tidbits - Lesson 12
Designing a Power BI report that will be used by the whole company requires planning, testing, and deployment. Many times, managers want a quick ad-hoc report that supplies insights within a few minutes. How can we leverage Fabric Lakehouse Python notebooks to provide those answers?
Today, we are going to use the Adventure Works dataset. The bronze layer has the raw data and the silver layer has the refined data. The code below lists the tables whose names start with 'silver'.
#
# 1 - Show silver tables in lakehouse
#
# grab data frame
df1 = spark.sql("show tables")
# filter rows
df1 = df1.filter("tableName like 'silver%'")
# show results
display(df1)
The image below shows the silver tables in the Fabric Lakehouse named "lh_adv_wrks".
Data modelers like to use both star and snowflake schemas. However, report designers like to de-normalize dimensions into views or tables that reduce the number of joins. The Fabric Lakehouse does support views. However, there are two issues that currently exist. First, views do not show up in the lakehouse explorer. Second, views do not show up in the SQL analytics endpoint or the Semantic Model.
#
# 2 - Views - does not show in explorer or sql analytics end point
#
# del view
stmt2 = "DROP VIEW IF EXISTS silver_vw_products"
df2 = spark.sql(stmt2)
# add view
stmt1 = """
SELECT
p.ProductKey,
p.ProductAlternateKey,
p.WeightUnitMeasureCode,
p.SizeUnitMeasureCode,
p.EnglishProductName,
p.StandardCost,
p.FinishedGoodsFlag,
p.Color,
p.SafetyStockLevel,
p.ReorderPoint,
p.ListPrice,
p.Size,
p.SizeRange,
p.Weight,
p.DaysToManufacture,
p.ProductLine,
p.DealerPrice,
p.Class,
p.Style,
p.ModelName,
p.StartDate,
p.EndDate,
p.Status,
sc.ProductSubcategoryKey,
sc.ProductSubcategoryAlternateKey,
sc.EnglishProductSubcategoryName,
pc.ProductCategoryKey,
pc.EnglishProductCategoryName
FROM
silver_dim_product_full as p
JOIN
silver_dim_product_subcategory_full as sc
ON
p.ProductSubcategoryKey = sc.ProductSubcategoryKey
JOIN
silver_dim_product_category_full as pc
ON
pc.ProductCategoryKey = sc.ProductCategoryKey;
"""
# create physical view
stmt2 = 'CREATE OR REPLACE VIEW silver_vw_products AS ' + stmt1
df2 = spark.sql(stmt2)
# create temporary view
df2 = spark.sql(stmt1)
df2.createOrReplaceTempView("tmp_silver_dim_products")
The above code collapses the product, product subcategory, and product category tables into both a persisted (physical) view and a temporary view.
#
# 3 - Show silver product tables in lakehouse
#
# grab data frame
df3 = spark.sql("show tables")
# filter rows
df3 = df3.filter("tableName like '%silver%product%'")
# show results
display(df3)
The output below shows both the tables and views, regardless of whether they are permanent or temporary.
The new notebooks in Fabric have a quick visuals feature. Let's take a look at this new feature now.
%%sql
--
-- 4 - quick visuals using fabric notebook feature
--
select * from tmp_silver_dim_products limit 10
The output from inspecting the List Price column is shown below. Do not believe everything you see. The minimum value and average value do not match the output from Spark SQL. Why is this?
The first rule of the quick statistics (Inspect) window is that only the first 1,000 records are used to generate the values. The total number of rows in our dataset is less than this limit, so sampling is not the source of the difference here. The second, unwritten rule is that nulls are converted to zeros. Let's look at both Spark SQL and Spark dataframes to see if they display the correct results. Please see the documentation for details on the Inspect feature of notebooks.
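To see why treating nulls as zeros shifts the minimum and the average, here is a minimal sketch that compares the two calculations side by side against the temporary view from step 2. The column aliases are just illustrative.
#
# Extra - compare averages with nulls excluded vs. nulls treated as zero
#
# import functions
from pyspark.sql import functions as F
# grab just the list price column from the temporary view
df_lp = spark.table("tmp_silver_dim_products").select("ListPrice")
# aggregates excluding nulls (Spark default) vs. after coalescing nulls to zero
display(
    df_lp.agg(
        F.avg("ListPrice").alias("avg_nulls_excluded"),
        F.avg(F.coalesce(F.col("ListPrice"), F.lit(0))).alias("avg_nulls_as_zero"),
        F.min("ListPrice").alias("min_nulls_excluded"),
        F.min(F.coalesce(F.col("ListPrice"), F.lit(0))).alias("min_nulls_as_zero")
    )
)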
The Spark SQL below uses the temporary view to generate a variety of statistics. The minimum value is 2.29 and the mean is 747.66. These calculations exclude the null values entirely.
%%sql
--
-- 5 - calculate statistics using spark functions
--
select
count(ListPrice) as cnt1,
round(min(ListPrice), 2) as min1,
round(max(ListPrice), 2) as max1,
round(std(ListPrice), 2) as std1,
round(sum(ListPrice), 2) as tot1,
round(sum(ListPrice) / count(ListPrice), 2) as avg1
from
tmp_silver_dim_products;
The output from executing the Spark SQL is shown below.
This time, we want to return the results of the SQL as a dataframe. The summary method displays statistics we might be interested in. We will be looking at the Product Key, Color and List Price columns.
#
# 6 - calculate statistics using spark function
#
df6 = spark.sql("select ProductKey, Color, ListPrice from tmp_silver_dim_products")
display(df6.summary())
The image below shows the output from executing the dataframe method.
Because views cannot be used in the Semantic Model of the Lakehouse, we are going to create a gold table for reporting. The code below joins several customer-related tables into a table named "gold_customer_sales".
#
# 7 - create gold reporting table
#
# del table
stmt7 = "DROP TABLE IF EXISTS gold_customer_sales"
df7 = spark.sql(stmt7)
# add table
stmt7 = """
CREATE TABLE gold_customer_sales AS
SELECT
p.EnglishProductCategoryName
,Coalesce(p.ModelName, p.EnglishProductName) AS Model
,c.CustomerKey
,s.SalesTerritoryGroup AS Region
, ABS(CAST((datediff(current_date(), c.BirthDate) / 365.25) AS INT)) as Age
,CASE
WHEN c.YearlyIncome < 40000 THEN 'Low'
WHEN c.YearlyIncome > 60000 THEN 'High'
ELSE 'Moderate'
END AS IncomeGroup
,d.CalendarYear
,d.FiscalYear
,d.MonthNumberOfYear AS Month
,f.SalesOrderNumber AS OrderNumber
,f.SalesOrderLineNumber AS LineNumber
,f.OrderQuantity AS Quantity
,f.ExtendedAmount AS Amount
FROM
silver_fact_internet_sales_full as f
INNER JOIN
silver_dim_date_full as d
ON
f.OrderDateKey = d.DateKey
INNER JOIN
tmp_silver_dim_products as p
ON
f.ProductKey = p.ProductKey
INNER JOIN
silver_dim_customer_full as c
ON
f.CustomerKey = c.CustomerKey
INNER JOIN
silver_dim_geography_full as g
ON
c.GeographyKey = g.GeographyKey
INNER JOIN
silver_dim_sales_territory_full as s
ON
g.SalesTerritoryKey = s.SalesTerritoryKey
"""
df7 = spark.sql(stmt7)
We can verify that the table was created in the lakehouse using the code below.
#
# 8 - Show gold tables in lakehouse
#
# grab data frame
df8 = spark.sql("show tables")
# filter rows
df8 = df8.filter("tableName like 'gold%'")
# show results
display(df8)
The image below shows one table exists in the gold layer for reporting.
Our manager wants to know the sales that happened in December 2010 by region and product model. The Spark query below gives us that answer for multiple years and months.
#
# 9 - summary by year, month, region, and model.
#
stmt9 = """
SELECT
CalendarYear as RptYear,
Month as RptMonth,
Region as RptRegion,
Model as ModelNo,
SUM(Quantity) as TotalQty,
SUM(Amount) as TotalAmt
FROM
gold_customer_sales
GROUP BY
CalendarYear,
Month,
Region,
Model
ORDER BY
CalendarYear,
Month,
Region
"""
# get data frame
df9 = spark.sql(stmt9)
# make temp view
df9.createOrReplaceTempView("rpt_sales_by_region_model")
Looking at the first seven rows of the result set gives us the answer our manager wants.
%%sql
--
-- 10 - Show December 2010 data
--
select *
from rpt_sales_by_region_model limit 7
The output below shows the total quantity and total amount for each product by year, month and region.
While the details in the table are nice, upper management is just worried about revenue by region. In fact, they want to see this information visually. How can we generate custom charts using Python code in a Fabric Lakehouse notebook?
There are several reporting libraries we can use. I am going to talk about seaborn and matplotlib today.
The first visualization our manager wants is a pie chart showing sales by region for the year 2013. First, we are going to group by year and category as a string and calculate total sales. This resulting Spark dataframe is converted into a Pandas dataframe.
#
# 11 - convert to pandas
#
# spark sql stmt
stmt11 = """
select cast(RptYear as string) || '-' || RptRegion as category, sum(TotalAmt) as value
from rpt_sales_by_region_model
where RptYear = 2013
group by cast(RptYear as string) || '-' || RptRegion
"""
# spark + pandas df
df_spark = spark.sql(stmt11)
df_pandas = df_spark.toPandas()
The code below uses seaborn to change the theme. The plotting is done with matplotlib.
#
# 12 - pie chart for 2013 sales by region
#
# import libs
import seaborn as sns
import matplotlib.pyplot as plt
# pie chart using Seaborn
sns.set_theme(style="whitegrid")
plt.figure(figsize=(6, 6))
plt.pie(df_pandas["value"], labels=df_pandas["category"], autopct='%1.1f%%')
plt.title("Sales by Region")
plt.show()
Executing the code produces our pie chart.
Another manager would like the information in a bar chart with actual sales numbers, not the percentage that each division contributes to the bottom line of the company.
#
# 13 - bar chart for 2013 sales by region
#
# import libs
import matplotlib.pyplot as plt
# create chart
plt.bar(df_pandas["category"], df_pandas["value"], label="Sales (Millions)", color='b')
plt.plot()
# add labels + titles
plt.xlabel("year & region")
plt.ylabel("total sales")
plt.title("Sales by Region")
plt.legend()
# show values
for index, value in enumerate(df_pandas["value"]):
plt.text(index - 0.125, value, str(round(value/1000000, 2)) + " M")
# show chart
plt.show()
The above code is a little more complex since we wanted the actual values shown at the top of each bar. By default, scientific notation is used on the y-axis and the categorical information is on the x-axis.
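If we would rather have the y-axis ticks display millions instead of scientific notation, a quick sketch using matplotlib's FuncFormatter is shown below. It reuses the df_pandas dataframe from step 11, and the labels are just illustrative.
#
# Extra - bar chart with the y-axis formatted in millions (sketch)
#
# import libs
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
# create chart on an explicit axes object
fig, ax = plt.subplots()
ax.bar(df_pandas["category"], df_pandas["value"], color='b')
# format the y-axis ticks as millions instead of scientific notation
ax.yaxis.set_major_formatter(FuncFormatter(lambda v, pos: f"{v / 1e6:.1f} M"))
# add labels + title
ax.set_xlabel("year & region")
ax.set_ylabel("total sales (millions)")
ax.set_title("Sales by Region")
# show chart
plt.show()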
A quick look at this chart shows that the North America region had the highest sales (6.55 million dollars) and the Pacific region had the lowest (4.34 million).
This article showed three ways to gain quick insights: the Fabric notebook Inspect feature, Seaborn charts, and Matplotlib visuals. Please vote on my two Fabric ideas that are currently open: idea 1 - fix how Inspect handles nulls, and idea 2 - show views in the lakehouse explorer. There are many more libraries that can be used to generate quick visuals. Please see the MS Learn documentation for more information.