Microsoft Fabric: Data Transformation for Product Attributes Management

Context:

Organizations often manage product data in a structured but complex format where attributes are stored as key-value pairs across multiple rows in a database. This format, while flexible, makes it difficult to perform analytics, reporting, and downstream processing. For meaningful use, such as categorization, supplier analysis, or compliance checks, these attributes must be transformed into a tabular format with clearly defined columns.

Objective:

The goal is to transform raw product attribute data stored in a database table (raw.productmaster_classification_e1auspm) into a well-structured, enriched format suitable for reporting, analysis, and integration with downstream systems. The process should:

  • Filter and extract relevant product attributes.
  • Pivot the data to convert rows of attribute-value pairs into columns.
  • Handle missing values to ensure data integrity.
  • Provide meaningful and user-friendly column names for better understanding.
  • Add new contextual information (e.g., supplier region) to enrich the dataset.

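The core idea behind the pivot step can be illustrated with a tiny plain-Python sketch (the sample rows, products, and values below are made up for illustration only):

```python
# Hypothetical sample of attribute-value rows, one attribute per row,
# using the column names from the raw table described below.
rows = [
    {"PRODUCTMASTER_CLASSIFICATION_ID": 1, "ATNAM": "Z_CRP_VENDORNAME", "ATWRT": "Acme"},
    {"PRODUCTMASTER_CLASSIFICATION_ID": 1, "ATNAM": "Z_CRP_HAZARDOUS", "ATWRT": "N"},
    {"PRODUCTMASTER_CLASSIFICATION_ID": 2, "ATNAM": "Z_CRP_VENDORNAME", "ATWRT": "Globex"},
]

# Pivot: one record per product, attribute names become keys (i.e. columns).
pivoted = {}
for r in rows:
    pivoted.setdefault(r["PRODUCTMASTER_CLASSIFICATION_ID"], {})[r["ATNAM"]] = r["ATWRT"]

print(pivoted[1])  # {'Z_CRP_VENDORNAME': 'Acme', 'Z_CRP_HAZARDOUS': 'N'}
```

Spark's pivot() performs the same reshaping at scale, which is what the solution below relies on.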
Input:

  • A raw data table (raw.productmaster_classification_e1auspm) containing the following key columns:
  • PRODUCTMASTER_CLASSIFICATION_ID: A unique identifier for the product classification.
  • ATNAM: Attribute names describing product characteristics.
  • ATWRT: Attribute values associated with the attribute names.
  • ATCOD: Additional attribute-related codes.

Output:

  • A transformed and enriched DataFrame (pmc_df) with:
  • Key product attributes as columns.
  • Null values filled with appropriate default values.
  • Columns renamed with meaningful aliases.
  • A new column (Supplier_Region) added with a static value.

Challenges:

  • Data Volume and Performance:

The raw table may contain millions of rows, necessitating efficient filtering, grouping, and transformation.

  • Handling Inconsistent Data:

Null or missing values in attribute columns can lead to incomplete records.

Some attributes may not be present for all products.

  • Dynamic Data Structure:

The list of relevant attributes (ATNAM values) may change over time, requiring flexibility in the code.

  • Data Enrichment:

Additional contextual information, such as supplier region, needs to be added to the dataset.


Requirements:

Transformations:

Filter the data to include only relevant attributes (ATNAM values).

Pivot rows into columns using the ATNAM field as headers, aggregating the first ATWRT value (or, for Z_CRP_LARGEORDERQTY, the first ATCOD value) as the respective cell value.

Null Handling:

Replace missing values with predefined defaults to ensure data completeness.

Column Renaming:

Map raw attribute names to user-friendly column names, e.g., Z_CRP_BUDGETGROUP to bgc.
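One way to keep this mapping maintainable is a single dict from raw names to aliases, from which select expressions can be generated; adding an attribute then becomes a one-line change. The mapping below is a hypothetical subset of the full list used in the solution:

```python
# Hypothetical subset of the raw-name -> friendly-name mapping.
ALIASES = {
    "Z_CRP_BUDGETGROUP": "bgc",
    "Z_CRP_HAZARDOUS": "Hazardous_Flag",
    "Z_CRP_VENDORNAME": "Supplier",
}

# Build "`src` AS `dst`" expressions usable with df.selectExpr(*exprs).
exprs = [f"`{src}` AS `{dst}`" for src, dst in ALIASES.items()]
```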

Data Enrichment:

Add a static column (Supplier_Region) with the value "XYZ" for additional context.

Output Validation:

Ensure the resulting DataFrame is well-structured, free of null values (where defaults are provided), and contains the required attributes.

Use Case Examples:

Supplier Analysis:

Enable analysis of products based on suppliers and regions.

Compliance Reporting:

Extract hazard-related flags (Hazardous_Flag) for regulatory submissions.

Inventory Management:

Use Large_Order_Quantity and Item_Description for stock optimization.

Proposed Solution: The code uses Apache Spark to process the raw data efficiently, employing:

Lambda Functions:

Modular functions for filtering, pivoting, filling nulls, renaming columns, and adding new columns.

DataFrame Transformations:

Filters, group-by operations, and joins to reshape the data.

Enrichment:

Adding context with new fields (Supplier_Region) and renaming columns for usability.

This solution addresses the need to convert raw attribute data into a clean, tabular, and enriched format, ready for diverse business use cases.

from pyspark.sql.functions import col, first, lit

# In a Microsoft Fabric notebook a SparkSession named `spark` and the
# `display` helper are provided automatically, so none is created here.

# Load the raw data into a DataFrame
df = spark.table("raw.productmaster_classification_e1auspm")

# Attributes to extract; extend this list as the data model evolves
relevant_atnams = [
    "Z_CRP_BUDGETGROUP", "Z_CRP_DAFLRESTRICTED", "Z_CRP_HAZARDOUS",
    "Z_CRP_ITEMDESCRIPTION", "Z_CRP_LARGEORDERQTY", "Z_CRP_MANUFACTURERNAME",
    "Z_CRP_PML", "Z_CRP_SERVICELINE", "Z_CRP_SIO",
    "Z_CRP_SUPPLIERPARTID", "Z_CRP_VENDORNAME",
]

# Define lambda functions
filter_atnam = lambda df, conditions: df.filter(col("ATNAM").isin(conditions))

# All attributes except Z_CRP_LARGEORDERQTY carry their value in ATWRT;
# Z_CRP_LARGEORDERQTY carries it in ATCOD and is pivoted separately.
pivot_atwrt = lambda df: (
    df.filter(col("ATNAM") != "Z_CRP_LARGEORDERQTY")
    .groupBy("PRODUCTMASTER_CLASSIFICATION_ID")
    .pivot("ATNAM")
    .agg(first("ATWRT"))
)
pivot_atcod = lambda df: (
    df.filter(col("ATNAM") == "Z_CRP_LARGEORDERQTY")
    .groupBy("PRODUCTMASTER_CLASSIFICATION_ID")
    .pivot("ATNAM")
    .agg(first("ATCOD"))
)
fill_nulls = lambda df, fill_values: df.fillna(fill_values)
add_aliases = lambda df: df.select(
    col("Z_CRP_BUDGETGROUP").alias("bgc"),
    col("Z_CRP_DAFLRESTRICTED").alias("dafl_restricted"),
    col("Z_CRP_HAZARDOUS").alias("Hazardous_Flag"),
    col("Z_CRP_ITEMDESCRIPTION").alias("Item_Description"),
    col("Z_CRP_LARGEORDERQTY").alias("Large_Order_Quantity"),
    col("Z_CRP_MANUFACTURERNAME").alias("Manufacturer"),
    col("Z_CRP_PML").alias("pml"),
    col("Z_CRP_SERVICELINE").alias("Service_Line"),
    col("Z_CRP_SIO").alias("SIO"),
    col("Z_CRP_VENDORNAME").alias("Supplier"),
    col("Z_CRP_SUPPLIERPARTID").alias("Supplier_Part_ID"),
)
add_new_column = lambda df, column_name, value: df.withColumn(column_name, lit(value))

# Use the lambda functions
filtered_df = filter_atnam(df, relevant_atnams)
pivot_df_atwrt = pivot_atwrt(filtered_df)
pivot_df_atcod = pivot_atcod(filtered_df)
pivot_df = pivot_df_atwrt.join(
    pivot_df_atcod, on="PRODUCTMASTER_CLASSIFICATION_ID", how="left"
)
# String columns default to the literal "NULL"; the numeric quantity to 0
pivot_df = fill_nulls(pivot_df, {
    "Z_CRP_BUDGETGROUP": "NULL", "Z_CRP_DAFLRESTRICTED": "NULL",
    "Z_CRP_HAZARDOUS": "NULL", "Z_CRP_ITEMDESCRIPTION": "NULL",
    "Z_CRP_LARGEORDERQTY": 0, "Z_CRP_MANUFACTURERNAME": "NULL",
    "Z_CRP_PML": "NULL", "Z_CRP_SERVICELINE": "NULL", "Z_CRP_SIO": "NULL",
    "Z_CRP_SUPPLIERPARTID": "NULL", "Z_CRP_VENDORNAME": "NULL",
})
pmc_df = add_aliases(pivot_df)
pmc_df = add_new_column(pmc_df, "Supplier_Region", "XYZ")

# Show the result
display(pmc_df)


This code uses lambda functions to:

  1. Filter the DataFrame for specific ATNAM values.
  2. Pivot the DataFrame for ATWRT values.
  3. Pivot the DataFrame for the ATCOD value of Z_CRP_LARGEORDERQTY.
  4. Join the pivoted DataFrames.
  5. Handle null values with default values.
  6. Add alias names to the columns.
  7. Add a new column named Supplier_Region with a default value.
  8. Display the resulting DataFrame.


Conclusion:

This code effectively transforms and enriches the product master classification data by filtering, pivoting, handling null values, and adding new columns. The use of lambda functions makes the code modular and reusable, allowing for easy adjustments and extensions in the future. This structured DataFrame can now be used for further analysis or reporting purposes.

If you have any further questions or need additional modifications, feel free to ask!


More articles by RAJEEV KUMAR