Spark Tidbits - Lesson 7

Having company data stored in a lakehouse allows business users to gain insight into patterns. For instance, the manager at Adventure Works wants to know which products have exceeded a quarter of a million dollars in monthly sales in the past.


In the last lesson, we looked at the Sales LT schema, which was loaded using a full load pattern. The bronze tables contained multiple versions of the truth (an audit trail), while the silver tables contained only the most recent data. Please see my article on SQL Server Central for details.



The above image shows the silver tables in the Adventure Works lakehouse. The code below joins several of these tables into a view. Right now, views are not supported in the lakehouse explorer.

%%sql

--
--  1 - create view named rpt_prepared_data
--

create or replace view rpt_prepared_data as

-- choose fields
select
  d.CalendarYear as RptYear,
  d.MonthNumberOfYear as RptMonth,
  s.SalesTerritoryGroup as RptRegion,
  coalesce(p.ModelName, p.EnglishProductName) as ModelNo,
  f.OrderQuantity as Quantity,
  cast(f.ExtendedAmount as decimal(10, 4)) as Amount

-- sales
from
  silver_fact_internet_sales_full as f

-- dates
inner join
  silver_dim_date_full as d
on
  f.OrderDateKey = d.DateKey

-- products
inner join
  silver_dim_product_full as p
on
  f.ProductKey = p.ProductKey

-- customers
inner join
  silver_dim_customer_full as c
on
  f.CustomerKey = c.CustomerKey

-- geographies
inner join
  silver_dim_geography_full as g
on
  c.GeographyKey = g.GeographyKey

-- territories
inner join
  silver_dim_sales_territory_full as s
on
  g.SalesTerritoryKey = s.SalesTerritoryKey;

If we execute the show tables command, we get a list of tables and views.

%%sql

--
--  2 - show tables / views
--

show tables        

The screenshot below shows that the view named rpt_prepared_data does exist in the catalog but cannot be seen in the explorer.



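Even though the explorer hides views, we can verify that the view exists by querying the Spark catalog from PySpark. Here is a minimal sketch using the standard spark.catalog.listTables method; the exact fields returned may vary slightly by runtime.

#
#  2a - sanity check - list views in the Spark catalog
#

# listTables() returns both tables and views; keep only the views
for tbl in spark.catalog.listTables():
    if tbl.tableType == "VIEW":
        print(tbl.name, tbl.tableType)
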
Today, we are going to see how to find the sales by year, month, region, and model number that exceed 250 thousand dollars in value. The Spark SQL code below uses the group by, having, and order by clauses to solve this problem. Note that Spark SQL lets the having clause reference the column alias TotalAmt defined in the select list.

%%sql

--
--  3 - Spark SQL - Months with sales > 250K
--

SELECT 
  RptYear,
  RptMonth,
  RptRegion,
  ModelNo,
  SUM(Quantity) as TotalQty,
  SUM(Amount) as TotalAmt
FROM 
  rpt_prepared_data 
GROUP BY
  RptYear,
  RptMonth,
  RptRegion,
  ModelNo
HAVING 
  TotalAmt > 250000
ORDER BY
  RptYear,
  RptMonth,
  RptRegion,
  ModelNo        


The image below shows there were four occasions in 2011 on which the Road-150 bike exceeded this sales threshold. Additionally, the Mountain-200 bike achieved similar results in 2013.



The same results can be achieved with Spark dataframes. First, we use the spark.sql method to place all the records from the previously defined view named rpt_prepared_data into a dataframe named df_raw.

There are several methods that you should become familiar with:

  • groupBy - groups the data by the specified columns
  • agg - applies aggregate functions to the grouped columns
  • alias - renames a column for the end user
  • filter - the dataframe equivalent of the having clause
  • orderBy - orders the resulting data set
  • asc - applies ascending order
  • desc - applies descending order

#
#  4 - Spark Dataframes - Months with sales > 250K
#

# import only the functions needed (avoids shadowing Python's built-in sum)
from pyspark.sql.functions import col, sum as sum_, asc, desc

# grab view data
df_raw = spark.sql('select * from rpt_prepared_data')

# summarize & filter
df_qry = (
    df_raw
    .groupBy("RptYear", "RptMonth", "RptRegion", "ModelNo")
    .agg(sum_("Quantity").alias("TotalQty"), sum_("Amount").alias("TotalAmt"))
    .filter(col("TotalAmt") > 250000)
    .orderBy(desc("RptYear"), desc("RptMonth"), asc("RptRegion"), asc("ModelNo"))
)

# show results
display(df_qry)

The dataframe query returns the same rows as the Spark SQL query; the only difference is that this version sorts the most recent year and month first.


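As a side note, the filter method also accepts a SQL expression string, so the having condition can be written almost verbatim. Here is a minimal sketch that reuses df_raw from step 4 and keeps the ascending sort of the original SQL query.

#
#  4a - alternate filter using a SQL expression string
#

from pyspark.sql.functions import sum as sum_

# same aggregation, but the predicate is passed as a string
df_alt = (
    df_raw
    .groupBy("RptYear", "RptMonth", "RptRegion", "ModelNo")
    .agg(sum_("Quantity").alias("TotalQty"), sum_("Amount").alias("TotalAmt"))
    .filter("TotalAmt > 250000")
    .orderBy("RptYear", "RptMonth", "RptRegion", "ModelNo")
)

# show results
display(df_alt)
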
The same tables and views can be created in Azure Synapse Analytics using Spark. The image below shows the same 8 records we have been working with.


The major difference is that views are supported in the lake database in Azure Synapse.

In a nutshell, aggregation of the Adventure Works tables can be done with both Spark SQL and Spark Dataframes. The same results can be achieved regardless of coding syntax.

I find views very useful. I brought this issue up with the Fabric product team in March 2024: views do not show up in the lakehouse explorer, and they cannot be used from the SQL endpoint. Please request this feature, since it already exists in both Azure Databricks and Azure Synapse.
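
Until that happens, one possible workaround is to materialize the view into a Delta table, which the SQL endpoint can read. The sketch below assumes the rpt_prepared_data view from step 1; the table name rpt_prepared_data_tbl is just an illustration.

#
#  5 - hypothetical workaround - materialize the view as a table
#

# read the view and write it out as a Delta table visible to the SQL endpoint
df_view = spark.sql('select * from rpt_prepared_data')
df_view.write.mode("overwrite").format("delta").saveAsTable("rpt_prepared_data_tbl")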


