Running an Open Footprint Lakehouse on Microsoft Fabric

Written in partnership with Peter Kowalchuk

Why Open Footprint on Microsoft Fabric?

Microsoft Fabric is an end-to-end, unified analytics platform that integrates various data and analytics tools, enabling organizations to create, share, and visualize data with ease.

OneLake is often likened to "the OneDrive for data": a single, unified SaaS data lake for the entire organization, with no need to build one from scratch. OneLake organizes data into domains, indexes it for discovery, and ensures compliance with governance standards, all while providing full access through industry-standard APIs.

Tables within OneLake are based on Delta Lake technology, allowing for the creation of advanced analytics solutions. These tables are part of the lakehouse architecture, which combines the benefits of data lakes and data warehouses.

Enabling Open Footprint data on Delta tables in Fabric OneLake has numerous advantages, including:

1. A flexible Delta Lake architecture that brings the power of Notebooks and Spark Jobs to processing emissions data in near real time and at massive scale.

2. Out-of-the-box connectivity with Microsoft tools and business products such as Azure AI, Copilot Studio, Power BI, Power Apps, Dynamics 365, Office 365, and Teams. There is no need to build connectors, as Fabric connectivity is supported natively.

3. Connectivity with third-party data platforms such as Snowflake and Databricks, and across all major clouds.

4. Real-time data streaming capabilities in Fabric that enable reading activity from all major sources out of the box.

5. Out-of-the-box integration with Microsoft Sustainability Manager through the ESG Data Estate solution on Microsoft Fabric.

Open Footprint and OneLake, a match made in heaven

The Open Footprint Forum, under The Open Group, focuses on developing open, vendor-neutral industry standards for consistent and accurate measurement and reporting of environmental footprint data. The forum aims to develop a single set of data and metadata definitions, known as the Open Footprint Data Model, so that emissions data can be more easily shared and aggregated. The model is platform-independent and can be deployed on OSDU, relational databases, and data lakes.

We converted the DDL script provided by the Open Footprint community for relational databases to SparkSQL to create the tables. In our assessment, we observed a couple of things in the DDL script that need to be updated due to differences in DDL syntax and the objects supported by OneLake Delta tables. We can't provide the full script here due to licensing limitations; if you are an Open Footprint member, feel free to reach out to your Microsoft representative. The changes we made were as follows (a sketch of this conversion pass appears after the example table below):

1. Removed schema creation and usage statements, as they are not supported.

2. Removed primary and foreign keys, as they are not supported.

3. Removed comments and metadata defined at the table and column levels.

4. Changed all names to lowercase where they weren't already.

5. Removed the schema prefix from the table name in each CREATE statement.

6. Removed the schema version suffix from the table name in each CREATE statement.

7. Replaced data types such as varchar(10) with the equivalent Spark data types.

For example, the converted productfootprint table looks like this:

CREATE TABLE productfootprint (
    productfootprintid STRING,
    comment STRING,
    companyname STRING,
    companyids INTEGER,
    created STRING,
    environmentalproductdeclarationid STRING,
    pcf INTEGER,
    precedingproductfootprintid STRING,
    productcategorycpc INTEGER,
    productdescription STRING,
    productids INTEGER,
    productnamecompany STRING,
    specversion STRING,
    status STRING,
    statuscomment STRING,
    updated STRING,
    validityperiodend STRING,
    validityperiodstart STRING,
    validfrom STRING,
    validto STRING,
    version STRING
);
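For illustration only (this is not the licensed script; the file names, comment syntax, and version-suffix pattern are all assumptions), the conversion pass described in the list above can be sketched as a small Python script:

    import re

    # minimal sketch of the DDL conversion pass described in the list above
    with open("openfootprint_relational_ddl.sql") as src:   # assumed input file name
        ddl = src.read()

    # 1. drop schema creation/usage statements
    ddl = re.sub(r"(?im)^\s*(create schema|use)\b.*?;", "", ddl)
    # 2. drop primary/foreign key constraints
    # (a real pass would also tidy any dangling commas this leaves behind)
    ddl = re.sub(r"(?im)^\s*(primary key|foreign key|constraint)\b.*", "", ddl)
    # 3. drop table/column comments (assumed inline COMMENT '...' syntax)
    ddl = re.sub(r"(?is)comment\s+'[^']*'", "", ddl)
    # 4. lowercase everything (simplification: Spark SQL accepts lowercase keywords too)
    ddl = ddl.lower()
    # 5. strip the schema prefix from table names
    ddl = re.sub(r"create table \w+\.", "create table ", ddl)
    # 6. strip an assumed _v<N> version suffix from table names
    ddl = re.sub(r"(create table \w+)_v\d+", r"\1", ddl)
    # 7. map varchar(n) to the Spark STRING type
    ddl = re.sub(r"varchar\(\d+\)", "STRING", ddl)

    with open("openfootprint_spark_ddl.sql", "w") as dst:   # assumed output file name
        dst.write(ddl)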

Running the script in the lakehouse creates the following tables in Microsoft Fabric (only partially shown due to licensing constraints).

Creating the Open Footprint Tables in Fabric with Spark SQL
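As a minimal sketch of running such a script from a Fabric notebook (the file path is an assumption, and the spark session is the one the notebook provides):

    # minimal sketch: execute each statement of the converted DDL in a Fabric notebook
    rows = spark.read.text("Files/openfootprint_spark_ddl.sql").collect()   # assumed path
    script = "\n".join(row.value for row in rows)

    # split on semicolons and run each CREATE TABLE statement
    for statement in script.split(";"):
        if statement.strip():
            spark.sql(statement)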

Once the tables were created, we took the further step of uploading the csv files for the reference data; first, they are uploaded to the Fabric Lakehouse as files. Our experience is that some of the csv files have missing columns or differences in naming.

Uploading Reference Data csv Files

To test the loading, we changed the column and table names to all lowercase and ran the following PySpark code to load the data, which went smoothly.

PySpark Code to Load Reference Data
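The code itself appears in the screenshot above; as a minimal sketch of that kind of load (the Files/referencedata folder and file name are assumptions), it looks roughly like this:

    # minimal sketch: load one reference-data csv into its matching delta table
    df = (spark.read
              .option("header", True)
              .csv("Files/referencedata/productfootprint.csv"))   # assumed path

    # normalize column names to lowercase so they match the table schema
    df = df.toDF(*[c.lower() for c in df.columns])

    # append the rows into the lakehouse delta table
    df.write.mode("append").format("delta").saveAsTable("productfootprint")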

Fabric automatically provides a SQL analytics endpoint once the Delta Lake tables are created, which can be accessed as if it were a SQL Server relational database.

SQL Analytics Endpoint for Delta Tables
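As an example, here is a minimal sketch of querying that endpoint from Python with pyodbc; the server and database values are placeholders you would copy from the endpoint's connection details:

    import pyodbc

    # placeholders: copy the real values from the lakehouse's SQL analytics endpoint page
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<your-endpoint>.datawarehouse.fabric.microsoft.com;"
        "Database=<your-lakehouse>;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # read a few rows back through the relational endpoint
    for row in conn.execute("SELECT TOP 5 productfootprintid, status FROM productfootprint"):
        print(row)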

Next Steps

1. The reference data csv's have inconsistencies with the table schemas in terms of column naming (_ vs -), missing columns, etc.; these should be cleaned up.

2. Sample data needs to be loaded and tested; this can easily be done using parquet or csv files, which are natively supported by Spark Jobs (a minimal sketch follows this list).

3. We are looking for good AI use cases; please add a comment if you have any ideas you want us to test.
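As a minimal sketch of such a sample-data load (the parquet file and its location are hypothetical):

    # minimal sketch: load hypothetical sample data from parquet into a delta table
    sample = spark.read.parquet("Files/sampledata/productfootprint.parquet")   # assumed path
    sample.write.mode("append").format("delta").saveAsTable("productfootprint")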

Dzmitry Krotau

BA Technical Lead | AI & Digital Health Expert | Driving Innovation in Healthcare, Biotech & Life Sciences | #AI #HealthcareTechnology #DigitalHealth #AIinHealthcare #Innovation

9 months

Great news, Kadri Umay! It looks like a very smart integration of Open Footprint Data model with Microsoft Fabric. This appears to be a killer feature for OFP as the platform could be engaged with MS ecosystem easily!

Sun Maria Lehmann

Team and people first! Technology Geek, Data Enthusiast, Chair of Boards

9 months

What an achievement, great news and progress! Excited to follow the forward development too…

Arminder Singh

Director at KPMG Climate Data & Technology | Delivery Lead | Digital Transformation | Energy Efficiency and GHG Emissions Management | The Wharton School

9 months

This is super interesting! Thanks Kadri Umay for sharing

Bertrand Rioux

Technology Consultant and Strategist

10 months

Hi Kadri Umay I produced the OFP DDLs from the OFP modeling tool I am maintaining. I’ve reached out to Peter to schedule a call to discuss how we can streamline the integration with Delta Lake for Microsoft Fabric.
