Ingest data from On-Premise SQL Server using Microsoft Fabric Notebook
Lukáš Karlovsky
Fabric | Power BI | Power Query | SQL | PySpark | Data | PowerShell | DWH | Excel | VBA | Instructor
Introduction
This article explores a method for importing data with PySpark and a Microsoft Fabric notebook, showing an approach to ingest data from an on-premises SQL Server. The aim is to offer inspiration for this data ingestion technique.
Prerequisites
For the solution outlined in this article, the following setup was used:
This configuration was chosen because it demonstrates the data ingestion process with PySpark and a Microsoft Fabric notebook well. It is not the only option: the approach can be adapted to other environments or preferences, and different configurations can be explored to meet specific requirements.
Notebook
Libraries
from trident_token_library_wrapper import PyTridentTokenLibrary  # Fabric token library used to access Azure Key Vault
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
The PyTridentTokenLibrary is required to access secrets from Azure Key Vault, ensuring secure retrieval of the SQL Server credentials.
Function - Get secret from Key vault
# Name of the Key Vault
nameOfKeyVault = 'kv-keyvault-01'
keyvault = f'https://{nameOfKeyVault}.vault.azure.net/'

# Function definition
## Function for getting a secret from Key Vault
def get_secret_from_keyvault(secret_name):
    access_token = PyTridentTokenLibrary.get_access_token("keyvault")  # Get access token for Key Vault
    return PyTridentTokenLibrary.get_secret_with_token(keyvault, secret_name, access_token)  # Return secret

## Function for getting a secret
def get_secret(secret_name):
    #resource = "https://analysis.windows.net/powerbi/api" # Resource for Power BI API
    secret = get_secret_from_keyvault(secret_name)
    if secret:
        return secret
    else:
        return "Invalid secret name"
This code retrieves secrets from an Azure Key Vault using the PyTridentTokenLibrary. It sets up the Key Vault's URL, then defines two functions: get_secret_from_keyvault obtains a secret by first getting an access token for the Key Vault and then fetching the secret using this token. get_secret acts as a wrapper to this process, returning the secret if found or a message indicating an invalid name if not. This approach simplifies secure access to sensitive information stored in the Key Vault.
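As a quick sanity check, the wrapper can be called directly in a notebook cell. The secret name below is only an example; use whatever names actually exist in your Key Vault.
# Hypothetical secret name - replace with one that exists in your Key Vault
value = get_secret('sqlserver')
if value == "Invalid secret name":
    print("Secret not found - check the secret name and the Key Vault URL")
else:
    print("Secret retrieved successfully")  # avoid printing the secret itself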
Variables
sqlserver = get_secret('sqlserver')
database = "AdventureWorksDW2022"
username = get_secret('username')
password = get_secret('password')
This code retrieves the SQL Server address, username, and password from Azure Key Vault using the get_secret function, and sets the database to "AdventureWorksDW2022". It's used to securely access database connection details.
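Because the same JDBC URL is used in both reads below, it can optionally be built once right after these variables. This is only a small refactor sketch; the encrypt and trustServerCertificate options are assumptions and depend on how your SQL Server's TLS is configured.
# Optional: build the JDBC URL once and reuse it in the reads below.
# encrypt/trustServerCertificate are assumptions - adjust them to your
# SQL Server's TLS setup, or omit them to keep the driver defaults.
jdbc_url = (
    f"jdbc:sqlserver://{sqlserver};"
    f"databaseName={database};"
    "encrypt=true;trustServerCertificate=true"
)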
SQL Query - List of tables
sql = '''
select s.TABLE_NAME
from AdventureWorksDW2022.INFORMATION_SCHEMA.tables s
where 1=1
and s.TABLE_SCHEMA = 'dbo'
and s.TABLE_TYPE = 'BASE TABLE'
and (
s.TABLE_NAME like 'Dim%' or
s.TABLE_NAME like 'Fact%'
)
'''
This SQL query selects the names of tables from the "AdventureWorksDW2022" database that are in the "dbo" schema and are of the type "BASE TABLE". It specifically looks for tables whose names start with "Dim" or "Fact".
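If other naming conventions need to be covered later, the same filter can also be generated from a list of prefixes instead of editing the SQL by hand. This is just an optional sketch; the prefix list itself is an assumption.
# Optional: generate the LIKE filter from a list of prefixes
# (the list is an assumption - adjust it to your naming conventions)
prefixes = ["Dim", "Fact"]
like_filter = " or ".join(f"s.TABLE_NAME like '{p}%'" for p in prefixes)

sql = f'''
select s.TABLE_NAME
from AdventureWorksDW2022.INFORMATION_SCHEMA.tables s
where 1=1
and s.TABLE_SCHEMA = 'dbo'
and s.TABLE_TYPE = 'BASE TABLE'
and ({like_filter})
'''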
Data frame - List of tables
# Initialize Spark session with specific configurations
spark = SparkSession.builder.appName("JDBCExample") \
    .config("spark.sql.parquet.vorder.enabled", "true") \
    .config("spark.microsoft.delta.optimizeWrite.enabled", "true") \
    .getOrCreate()

# Read data from JDBC source
df = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
    .option("query", sql) \
    .option("user", username) \
    .option("password", password) \
    .load()
This code initializes a Spark session with the configurations shown and reads the result of the SQL query from the JDBC source (SQL Server) into a DataFrame df. After this step, df holds the list of "Dim" and "Fact" table names from the database, retrieved using the SQL Server address, database name, and user credentials.
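Before ingesting anything, it can be worth checking which tables ended up in the list; a quick look at the DataFrame is enough.
# Inspect which tables will be ingested before running the load loop
print(f"{df.count()} tables found")
df.show(truncate=False)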
Save data to Delta tables
table_names = df.select("TABLE_NAME").collect()

for row in table_names:
    table = row.TABLE_NAME
    df_table = spark.read.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
        .option("dbtable", table) \
        .option("user", username) \
        .option("password", password) \
        .load()
    df_table.write.mode("overwrite").format("delta").save(f"Tables/{table}")
This code iterates over the table names collected from the DataFrame df, reads each table from the SQL Server database into a Spark DataFrame df_table over JDBC, and writes it in Delta format to the lakehouse's Tables directory, one folder per table, overwriting any existing data. Tables written to this location appear as managed tables in the Fabric lakehouse.
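If a single table fails to load (for example, because of an unsupported data type), the loop above stops at that point. A hedged variation, not part of the original notebook, wraps each table in a try/except so the remaining tables are still ingested.
# Variation with per-table error handling (a sketch, assuming the same
# variables as above): keep going if one table fails and report it at the end
failed = []
for row in table_names:
    table = row.TABLE_NAME
    try:
        df_table = spark.read.format("jdbc") \
            .option("url", f"jdbc:sqlserver://{sqlserver};databaseName={database}") \
            .option("dbtable", table) \
            .option("user", username) \
            .option("password", password) \
            .load()
        df_table.write.mode("overwrite").format("delta").save(f"Tables/{table}")
    except Exception as e:
        failed.append((table, str(e)))
        print(f"Failed to ingest {table}: {e}")

print(f"Done. {len(failed)} table(s) failed.")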
Output
The provided image showcases a lakehouse with data successfully ingested from an on-premises SQL Server. Using PySpark and a Microsoft Fabric notebook, tables prefixed with "Dim" and "Fact" from the AdventureWorksDW2022 database have been organized in Delta format within the lakehouse.
Conclusion
This article presents a detailed method for ingesting data from an on-premises SQL Server into a distributed data processing environment using Microsoft Fabric Notebook and PySpark. By leveraging the PyTridentTokenLibrary for secure access to Azure Key Vault and employing JDBC for data extraction, this approach demonstrates a secure and efficient way to connect to, query, and store data from specific tables within the AdventureWorksDW2022 database. The process highlights the importance of secure credential management and the flexibility of using Spark for data processing and storage in Delta format.
Lukáš Karlovsky
Power BI | FABRIC | Power Query | SQL | Data | DWH | Excel | VBA | Instructor
Joyful Craftsmen
Joyful Craftsmen is a data design and data engineering team. We've been learning and building data solutions since 2014. Our vision is to do the data craft as smartly as possible. Humanity, teamwork and joy are part of our game, regardless of whether you are our partner or one of us. Altogether, we are valued for delivering cutting-edge data solutions on time and in top quality.
Data Platform MVP (2020-2023) | Certified Expert in Fabric Analytics Engineering, Azure Data Engineering, Solutions Architecture, DevOps, and Development
9 months ago: On-premises environments don't and shouldn't allow a web connection to a SQL Server port. We need the Power BI data gateway, which relies on Service Bus relay for this.
Fabric | Power Apps | Power Automate | SharePoint
9 months ago: Inspiring article! I am just worried that not many security admins would allow opening that port to the public (perhaps to a specific IP range, e.g. Azure resources?). I would love the option to use the on-premises data gateway programmatically and connect to the gateway from the notebook directly. Sadly, it is allowed only for Logic Apps and Power Platform.