Ingest data from On-Premise SQL Server using Microsoft Fabric Notebook

Introduction

This article explores a method for ingesting data from an on-premise SQL Server into Microsoft Fabric using PySpark in a Fabric notebook. The aim is to demonstrate the technique and offer inspiration for similar data ingestion scenarios.

Prerequisites

The solution outlined in this article uses the following setup:

  1. SQL Server on Azure VM: The SQL Server is hosted on an Azure Virtual Machine, simulating an on-premise environment within a cloud infrastructure. This setup provides a balance between on-premise control and cloud flexibility.
  2. AdventureWorksDW2022 Database: The SQL Server is equipped with the AdventureWorksDW2022 database. This sample database offers a variety of data scenarios that are ideal for demonstrating data ingestion processes.
  3. Network Configuration: Inbound port 1433 is configured to allow connections. This is essential for remote access to the SQL Server, enabling data transfer and communication from external sources (a quick connectivity check is sketched below).
  4. Azure Key Vault: Secrets for the SQL Server’s connection details—specifically, the server address (sqlserver), username, and password—are securely stored in an Azure Key Vault. This method ensures secure access management to the SQL Server credentials, facilitating a secure and efficient connection process.
  5. Lakehouse Creation: Additionally, it's necessary to create a lakehouse to support the storage and analysis of ingested data.

This configuration was chosen because it demonstrates the data ingestion process with PySpark and a Microsoft Fabric notebook end to end. It is not the only option: readers can adapt the approach to their own environments, preferences, and security requirements.
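
Before running the notebook, it is worth confirming that port 1433 on the SQL Server VM is actually reachable from your environment. The snippet below is a minimal sketch using Python's standard socket module; the hostname sql-vm.example.com is a placeholder, not a value from the original setup.

import socket

# Placeholder address - replace with the public hostname or IP of the SQL Server VM
host = "sql-vm.example.com"
port = 1433  # default SQL Server port opened in the network configuration

try:
    # Attempt a TCP connection with a short timeout
    with socket.create_connection((host, port), timeout=5):
        print(f"Port {port} on {host} is reachable.")
except OSError as err:
    print(f"Could not reach {host}:{port} - {err}")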

Notebook

Libraries

from trident_token_library_wrapper import PyTridentTokenLibrary  # access tokens and secrets from Azure Key Vault
from pyspark.sql.functions import *  # Spark SQL functions
from pyspark.sql import SparkSession  # entry point for the Spark session

The PyTridentTokenLibrary is required to obtain an access token and read secrets from Azure Key Vault, ensuring secure retrieval of the SQL Server credentials.

Function - Get secret from Key Vault

# Name of the Key Vault
nameOfKeyVault = 'kv-keyvault-01'
keyvault = f'https://{nameOfKeyVault}.vault.azure.net/'

# Function definition
## Function for getting a secret from Key Vault
def get_secret_from_keyvault(secret_name):
    access_token = PyTridentTokenLibrary.get_access_token("keyvault")  # Get access token for Key Vault
    return PyTridentTokenLibrary.get_secret_with_token(keyvault, secret_name, access_token)  # Return secret

## Wrapper function for getting a secret
def get_secret(secret_name):
    secret = get_secret_from_keyvault(secret_name)

    if secret:
        return secret
    else:
        return "Invalid secret name"

This code retrieves secrets from an Azure Key Vault using the PyTridentTokenLibrary. It sets up the Key Vault's URL, then defines two functions: get_secret_from_keyvault obtains a secret by first getting an access token for the Key Vault and then fetching the secret using this token. get_secret acts as a wrapper to this process, returning the secret if found or a message indicating an invalid name if not. This approach simplifies secure access to sensitive information stored in the Key Vault.

Variables

sqlserver = get_secret('sqlserver')
database = "AdventureWorksDW2022"
username = get_secret('username')
password = get_secret('password')        

This code retrieves the SQL Server address, username, and password from Azure Key Vault using the get_secret function, and sets the database to "AdventureWorksDW2022". It's used to securely access database connection details.
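
Optionally, the retrieved values can be sanity-checked without printing the secrets themselves. This is a small added sketch, not part of the original notebook:

# Confirm each secret was retrieved, without exposing its value
for name, value in {"sqlserver": sqlserver, "username": username, "password": password}.items():
    status = "OK" if value and value != "Invalid secret name" else "MISSING"
    print(f"{name}: {status}")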

SQL Query - List of tables

sql = '''
    select s.TABLE_NAME
    from AdventureWorksDW2022.INFORMATION_SCHEMA.tables s
    where 1=1
        and s.TABLE_SCHEMA = 'dbo'
        and s.TABLE_TYPE = 'BASE TABLE'
        and (
            s.TABLE_NAME like 'Dim%' or
            s.TABLE_NAME like 'Fact%'
        )
'''        

This SQL query selects the names of tables from the "AdventureWorksDW2022" database that are in the "dbo" schema and are of the type "BASE TABLE". It specifically looks for tables whose names start with "Dim" or "Fact".
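
If the set of prefixes ever needs to change, the filter can be built from a Python list instead of being hard-coded. This is an optional variation, not part of the original notebook:

# Build the LIKE conditions from a configurable list of table-name prefixes
prefixes = ["Dim", "Fact"]
like_clauses = " or ".join(f"s.TABLE_NAME like '{p}%'" for p in prefixes)

sql = f'''
    select s.TABLE_NAME
    from AdventureWorksDW2022.INFORMATION_SCHEMA.tables s
    where 1=1
        and s.TABLE_SCHEMA = 'dbo'
        and s.TABLE_TYPE = 'BASE TABLE'
        and ({like_clauses})
'''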

Data frame - List of tables

# Initialize Spark session with specific configurations
spark = SparkSession.builder.appName("JDBCExample") \
    .config("spark.sql.parquet.vorder.enabled", "true") \
    .config("spark.microsoft.delta.optimizeWrite.enabled", "true") \
    .getOrCreate()

# Read data from JDBC source
df = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
    .option("query", sql) \
    .option("user", username) \
    .option("password", password) \
    .load()        

This code initializes a Spark session with V-Order and optimized-write options enabled, then reads from the SQL Server over JDBC into a DataFrame df. Because the query option is used, df contains only the result of the SQL query above: the names of the "Dim" and "Fact" tables in the database, retrieved with the server address, database name, and credentials obtained earlier.
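
Before ingesting the tables, the DataFrame can be previewed to confirm which tables were found. This is an optional check, not part of the original notebook:

# Preview the list of table names returned by the query
df.show(truncate=False)
print(f"Number of tables to ingest: {df.count()}")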

Save data to Delta tables

table_names = df.select("TABLE_NAME").collect()

for row in table_names:
    table = row.TABLE_NAME
    df_table = spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
        .option("dbtable", table) \
        .option("user", username) \
        .option("password", password) \
        .load()

    df_table.write.mode("overwrite").format("delta").save(f"Tables/{table}")        

This code collects the table names from the DataFrame df and iterates over them, reading each table from the SQL Server database into a Spark DataFrame df_table via JDBC. It then writes each df_table in Delta format to the lakehouse's Tables directory, one folder per table name, overwriting any existing data; tables written to this location appear as managed tables in the lakehouse.
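
To verify the ingestion, each Delta table can be read back and its row count printed. A minimal sketch, assuming the notebook is still attached to the same lakehouse:

# Read each Delta table back from the lakehouse and print its row count
for row in table_names:
    table = row.TABLE_NAME
    row_count = spark.read.format("delta").load(f"Tables/{table}").count()
    print(f"{table}: {row_count} rows")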

Output

The provided image showcases a lakehouse with data successfully ingested from an on-premise SQL Server. Utilizing PySpark and Microsoft Fabric Notebook, tables prefixed with "Dim" and "Fact" from the AdventureWorksDW2022 database have been organized into Delta format within the lakehouse.

Conclusion

This article presents a detailed method for ingesting data from an on-premise SQL Server into a distributed data processing environment using Microsoft Fabric Notebook and PySpark. By leveraging the PyTridentTokenLibrary for secure access to Azure Key Vault and employing JDBC for data extraction, this approach demonstrates a secure and efficient way to connect to, query, and store data from specific tables within the AdventureWorksDW2022 database. The process highlights the importance of secure credential management and the flexibility of using Spark for data processing and storage in Delta format.


Lukáš Karlovsky

Power BI | FABRIC | Power Query | SQL | Data | DWH | Excel | VBA | Instructor

LinkedIn Profile


Joyful Craftsmen

Joyful Craftsmen is a data design and data engineering team. We have been learning and building data solutions since 2014, and we focus on our vision of doing the data craft as smartly as possible. Humanity, teamwork and joy are part of our game, regardless of whether you are our partner or one of us. Altogether, we are valued for delivering cutting-edge data solutions, on time and in top quality.

https://joyfulcraftsmen.com

https://www.dhirubhai.net/company/joyful-craftsmen

Dennes Torres

Data Platform MVP (2020-2023) | Certified Expert in Fabric Analytics Engineering, Azure Data Engineering, Solutions Architecture, DevOps, and Development

9 months ago

On-premises environments don't and shouldn't allow a web connection to a SQL Server port. We need the Power BI data gateway, which relies on Service Bus Relay for this.

Adam Poslední

Fabric | Power Apps | Power Automate | SharePoint

9 months ago

Inspiring article! I am just worried that not many security admins would allow opening that port to the public (perhaps to a specific IP range, e.g. Azure resources?). I would love the option to use the Azure on-premises data gateway programmatically and connect to the gateway from the notebook directly. Sadly, that is allowed only for Logic Apps and the Power Platform.
