Ingest data from On-Premise SQL Server using Microsoft Fabric Notebook

Introduction

This article explores a method for ingesting data from an on-premise SQL Server into Microsoft Fabric using PySpark in a Fabric notebook. The aim is to demonstrate the technique and offer inspiration for similar data ingestion scenarios.

Prerequisites

The solution outlined in this article uses the following setup:

  1. SQL Server on Azure VM: The SQL Server is hosted on an Azure Virtual Machine, simulating an on-premise environment within a cloud infrastructure. This setup provides a balance between on-premise control and cloud flexibility.
  2. AdventureWorksDW2022 Database: The SQL Server is equipped with the AdventureWorksDW2022 database. This sample database offers a variety of data scenarios that are ideal for demonstrating data ingestion processes.
  3. Network Configuration: Inbound port 1433 is configured to allow connections. This is essential for remote access to the SQL Server, enabling data transfer and communication from external sources (a quick connectivity check is sketched below).
  4. Azure Key Vault: Secrets for the SQL Server’s connection details—specifically, the server address (sqlserver), username, and password—are securely stored in an Azure Key Vault. This method ensures secure access management to the SQL Server credentials, facilitating a secure and efficient connection process.
  5. Lakehouse Creation: Additionally, it's necessary to create a lakehouse to support the storage and analysis of ingested data.

This configuration was chosen because it demonstrates the data ingestion process with PySpark and a Microsoft Fabric notebook end to end. It is not the only option: readers can adapt the approach to their own environments, preferences, and security requirements.
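
Before running the notebook, it is worth confirming that port 1433 on the SQL Server VM is actually reachable from your environment. The snippet below is a minimal sketch using Python's standard socket module; the hostname sql-vm.example.com is a placeholder, not a value from the original setup.

import socket

# Placeholder address - replace with the public hostname or IP of the SQL Server VM
host = "sql-vm.example.com"
port = 1433  # default SQL Server port opened in the network configuration

try:
    # Attempt a TCP connection with a short timeout
    with socket.create_connection((host, port), timeout=5):
        print(f"Port {port} on {host} is reachable.")
except OSError as err:
    print(f"Could not reach {host}:{port} - {err}")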

Notebook

Libraries

from trident_token_library_wrapper import PyTridentTokenLibrary  # access tokens and secrets from Azure Key Vault
from pyspark.sql.functions import *  # Spark SQL functions
from pyspark.sql import SparkSession  # entry point for the Spark session

The PyTridentTokenLibrary is required to obtain an access token and read secrets from Azure Key Vault, ensuring secure retrieval of the SQL Server credentials.

Function - Get secret from Key Vault

# Name of the Key Vault
nameOfKeyVault = 'kv-keyvault-01'
keyvault = f'https://{nameOfKeyVault}.vault.azure.net/'

# Function definition
## Function for getting a secret from Key Vault
def get_secret_from_keyvault(secret_name):
    access_token = PyTridentTokenLibrary.get_access_token("keyvault")  # Get access token for Key Vault
    return PyTridentTokenLibrary.get_secret_with_token(keyvault, secret_name, access_token)  # Return secret

## Wrapper function for getting a secret
def get_secret(secret_name):
    secret = get_secret_from_keyvault(secret_name)

    if secret:
        return secret
    else:
        return "Invalid secret name"

This code retrieves secrets from an Azure Key Vault using the PyTridentTokenLibrary. It sets up the Key Vault's URL, then defines two functions: get_secret_from_keyvault obtains a secret by first getting an access token for the Key Vault and then fetching the secret using this token. get_secret acts as a wrapper to this process, returning the secret if found or a message indicating an invalid name if not. This approach simplifies secure access to sensitive information stored in the Key Vault.

Variables

sqlserver = get_secret('sqlserver')
database = "AdventureWorksDW2022"
username = get_secret('username')
password = get_secret('password')        

This code retrieves the SQL Server address, username, and password from Azure Key Vault using the get_secret function, and sets the database to "AdventureWorksDW2022". It's used to securely access database connection details.
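
Optionally, the retrieved values can be sanity-checked without printing the secrets themselves. This is a small added sketch, not part of the original notebook:

# Confirm each secret was retrieved, without exposing its value
for name, value in {"sqlserver": sqlserver, "username": username, "password": password}.items():
    status = "OK" if value and value != "Invalid secret name" else "MISSING"
    print(f"{name}: {status}")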

SQL Query - List of tables

sql = '''
    select s.TABLE_NAME
    from AdventureWorksDW2022.INFORMATION_SCHEMA.tables s
    where 1=1
        and s.TABLE_SCHEMA = 'dbo'
        and s.TABLE_TYPE = 'BASE TABLE'
        and (
            s.TABLE_NAME like 'Dim%' or
            s.TABLE_NAME like 'Fact%'
        )
'''        

This SQL query selects the names of tables from the "AdventureWorksDW2022" database that are in the "dbo" schema and are of the type "BASE TABLE". It specifically looks for tables whose names start with "Dim" or "Fact".
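
If the set of prefixes ever needs to change, the filter can be built from a Python list instead of being hard-coded. This is an optional variation, not part of the original notebook:

# Build the LIKE conditions from a configurable list of table-name prefixes
prefixes = ["Dim", "Fact"]
like_clauses = " or ".join(f"s.TABLE_NAME like '{p}%'" for p in prefixes)

sql = f'''
    select s.TABLE_NAME
    from AdventureWorksDW2022.INFORMATION_SCHEMA.tables s
    where 1=1
        and s.TABLE_SCHEMA = 'dbo'
        and s.TABLE_TYPE = 'BASE TABLE'
        and ({like_clauses})
'''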

Data frame - List of tables

# Initialize Spark session with specific configurations
spark = SparkSession.builder.appName("JDBCExample") \
    .config("spark.sql.parquet.vorder.enabled", "true") \
    .config("spark.microsoft.delta.optimizeWrite.enabled", "true") \
    .getOrCreate()

# Read data from JDBC source
df = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
    .option("query", sql) \
    .option("user", username) \
    .option("password", password) \
    .load()        

This code initializes a Spark session with V-Order and optimized-write options enabled, then reads from the SQL Server over JDBC into a DataFrame df. Because the query option is used, df contains only the result of the SQL query above: the names of the "Dim" and "Fact" tables in the database, retrieved with the server address, database name, and credentials obtained earlier.
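
Before ingesting the tables, the DataFrame can be previewed to confirm which tables were found. This is an optional check, not part of the original notebook:

# Preview the list of table names returned by the query
df.show(truncate=False)
print(f"Number of tables to ingest: {df.count()}")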

Save data to Delta tables

table_names = df.select("TABLE_NAME").collect()

for row in table_names:
    table = row.TABLE_NAME
    df_table = spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
        .option("dbtable", table) \
        .option("user", username) \
        .option("password", password) \
        .load()

    df_table.write.mode("overwrite").format("delta").save(f"Tables/{table}")        

This code collects the table names from the DataFrame df and iterates over them, reading each table from the SQL Server database into a Spark DataFrame df_table via JDBC. It then writes each df_table in Delta format to the lakehouse's Tables directory, one folder per table name, overwriting any existing data; tables written to this location appear as managed tables in the lakehouse.
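
To verify the ingestion, each Delta table can be read back and its row count printed. A minimal sketch, assuming the notebook is still attached to the same lakehouse:

# Read each Delta table back from the lakehouse and print its row count
for row in table_names:
    table = row.TABLE_NAME
    row_count = spark.read.format("delta").load(f"Tables/{table}").count()
    print(f"{table}: {row_count} rows")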

Output

The provided image showcases a lakehouse with data successfully ingested from an on-premise SQL Server. Utilizing PySpark and Microsoft Fabric Notebook, tables prefixed with "Dim" and "Fact" from the AdventureWorksDW2022 database have been organized into Delta format within the lakehouse.

Conclusion

This article presents a detailed method for ingesting data from an on-premise SQL Server into a distributed data processing environment using Microsoft Fabric Notebook and PySpark. By leveraging the PyTridentTokenLibrary for secure access to Azure Key Vault and employing JDBC for data extraction, this approach demonstrates a secure and efficient way to connect to, query, and store data from specific tables within the AdventureWorksDW2022 database. The process highlights the importance of secure credential management and the flexibility of using Spark for data processing and storage in Delta format.


Lukáš Karlovsky

Power BI | FABRIC | Power Query | SQL | Data | DWH | Excel | VBA | Instructor

LinkedIn Profile


Joyful Craftsmen

Joyful Craftsmen is a data design and data engineering team. We have been learning and building data solutions since 2014, and we focus on our vision of doing the data craft as smartly as possible. Humanity, teamwork and joy are part of our game, regardless of whether you are our partner or one of us. Altogether, we are valued for delivering cutting-edge data solutions, on time and in top quality.

https://joyfulcraftsmen.com

https://www.dhirubhai.net/company/joyful-craftsmen

Dennes Torres

Data Platform MVP (2020-2023) | Certified Expert in Fabric Analytics Engineering, Azure Data Engineering, Solutions Architecture, DevOps, and Development

9 months ago

On-premises environments don't and shouldn't allow a web connection to a SQL Server port. We need the Power BI data gateway, which relies on Service Bus Relay for this.

Adam Poslední

Fabric | Power Apps | Power Automate | SharePoint

9 months ago

Inspiring article! I am just worried that not many security admins would allow opening that port to the public (perhaps to a specific IP range, e.g. Azure resources?). I would love the option to use the Azure on-premises data gateway programmatically and connect to the gateway from the notebook directly. Sadly, that is allowed only for Logic Apps and the Power Platform.
