Copy Tables from On-Premise SQL Server to Azure Data Lake | Azure Data Engineering Project Guide [Part 3]

Ever struggled with copying numerous tables from an on-premise SQL Server database to your Azure Data Lake? You're not alone! This manual process can be time-consuming and error-prone. But fear not, data engineers! Here's a powerful approach using Azure Data Factory (ADF) to automate copying all tables at once.

If you haven't already read Part 1 and Part 2 of this series, I highly recommend doing so before proceeding with this article. This installment builds upon the content covered in Part 1 and Part 2, providing a seamless continuation of the series.

Step-by-Step Guide with Actionable Insights

  • Use a SQL script to fetch all table names and schema names from the SQL Server database.

SELECT s.name AS SchemaName,
       t.name AS TableName
FROM sys.tables t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE s.name = 'SalesLT'

  • Create a new pipeline in Azure Data Factory called "Copy All Tables".

  • Add a Lookup activity to the pipeline to execute the SQL script and fetch the table/schema names from the database.

  • We'll reuse the same table-discovery SQL query directly in the Lookup activity. Open the Lookup activity's Settings tab, where you'll find the option to create a dataset; we need one dataset pointing at our on-premise SQL Server database.

  • Choose SQL Server as the dataset type and name it 'SqlDBTables'. Then select the linked service created in the previous article to connect to the on-premise SQL Server. Keep the remaining options at their defaults and click 'OK'.
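
For reference, this is roughly what the 'SqlDBTables' dataset looks like if you open its code (JSON) view in ADF Studio. The linked service name 'SqlServerLinkedService' is a placeholder for whatever you named the linked service in Part 2, and no table is specified because the Lookup will supply its own query:

{
    "name": "SqlDBTables",
    "properties": {
        "description": "Linked service name below is a placeholder; use the one created in Part 2.",
        "linkedServiceName": {
            "referenceName": "SqlServerLinkedService",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "SqlServerTable",
        "typeProperties": {}
    }
}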

  • Next, look at a useful option in the Lookup activity: 'Use query'. By default the activity reads from a table, but it can instead run a query or a stored procedure. Select 'Query', then copy the SQL script we just ran in SSMS and paste it into the query box. At run time, the Lookup activity will execute this same query against the on-premise SQL Server.

  • Verify the setup with the 'Preview data' option, which shows the rows the query returns. Also remember to uncheck 'First row only' so the Lookup returns every schema/table pair rather than just the first row.
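
Put together, the Lookup activity's JSON (as seen in the pipeline's code view) looks roughly like the sketch below. The activity name 'Lookup All Tables' is a placeholder I'm using for illustration; the key parts are the pasted query and firstRowOnly set to false:

{
    "name": "Lookup All Tables",
    "description": "Activity name is a placeholder; query and firstRowOnly=false match the steps above.",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "SELECT s.name AS SchemaName, t.name AS TableName FROM sys.tables t INNER JOIN sys.schemas s ON t.schema_id = s.schema_id WHERE s.name = 'SalesLT'"
        },
        "dataset": {
            "referenceName": "SqlDBTables",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}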

  • To ensure everything's working smoothly, run the pipeline with the debug option and check the Lookup activity's input and output in the output window. The input confirms the SQL query used, while the output is a JSON structure containing all schema and table names. This list is the roadmap for copying every table from the source.
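
Assuming the AdventureWorksLT sample database used earlier in this series, the Lookup's debug output has the shape below (trimmed; only the first three of the ten SalesLT tables are shown). The value array is exactly what the ForEach activity will iterate over in the next step:

{
    "count": 10,
    "value": [
        { "SchemaName": "SalesLT", "TableName": "Address" },
        { "SchemaName": "SalesLT", "TableName": "Customer" },
        { "SchemaName": "SalesLT", "TableName": "CustomerAddress" }
    ]
}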

  • Add a ForEach activity. Search for "ForEach" in the ADF Activities pane and drag it onto the canvas. Rename it to "For Each Schema Table" for clarity, then connect the success (green) output of the Lookup activity to it. This ensures the loop only runs if the Lookup successfully retrieves the table list.

  • Within the ForEach loop's settings, open the "Items" field and use "Add Dynamic Content". Select "Activity outputs" followed by the Lookup activity. This creates an expression that retrieves the list of tables (the value array) from the Lookup activity's output. The ForEach loop will iterate through that list, one schema/table pair per iteration, enabling you to copy tables one by one; the resulting configuration is sketched below.
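
In code view, the ForEach activity ends up looking roughly like this. 'Lookup All Tables' is the placeholder name used for the Lookup sketch above; the JSON that ADF generates also contains an "activities" array inside typeProperties holding the nested Copy activity, which is omitted here and shown a little further down:

{
    "name": "For Each Schema Table",
    "description": "Nested Copy activity omitted here for brevity; see the Copy activity sketch below.",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "Lookup All Tables",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Lookup All Tables').output.value",
            "type": "Expression"
        }
    }
}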

  • Nest the Copy Data Activity within the ForEach Loop. Access the "Activities" section and click the edit icon (pencil) on the ForEach Loop activity. Drag and drop the Copy Data activity into the Loop container. Rename the Copy Data activity to "Copy Each Table" for clarity.

  • Configure the Copy Data activity's source dynamically. Instead of creating a static source dataset, click "New" and choose "SQL Server" as the connector. Name it "SQL Server Copy" and reuse the existing linked service connection to your on-premise SQL Server. Since we'll use dynamic content later, leave the table name blank. Click "OK" to create the source dataset.

  • Make the Copy Data activity's source query dynamic. Switch the "Use query" option to "Query", click "Add Dynamic Content", and paste an expression like @{concat('SELECT * FROM ', item().SchemaName, '.', item().TableName)}. The item() function returns the element of the Lookup output that the ForEach loop is currently processing, so on each iteration the expression resolves to a query for that specific schema and table, letting the activity copy each table individually.
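
The source half of the "Copy Each Table" activity then looks roughly like this in code view. The input dataset is written here as 'SqlServerCopy' (without spaces) to stand in for the "SQL Server Copy" dataset created above, and the sink mapping is left minimal because it's covered in the next steps:

{
    "name": "Copy Each Table",
    "description": "Source dataset reference is a placeholder for the dataset created above; sink parameters are shown later.",
    "type": "Copy",
    "inputs": [
        { "referenceName": "SqlServerCopy", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": {
                "value": "@{concat('SELECT * FROM ', item().SchemaName, '.', item().TableName)}",
                "type": "Expression"
            }
        },
        "sink": {
            "type": "DelimitedTextSink"
        }
    }
}

On the iteration where the current item is SalesLT/Address, for example, the query resolves to SELECT * FROM SalesLT.Address.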

  • Move on to configuring the Copy Data activity's destination (sink). Click on "Sink" and select "New." Choose "Azure Data Lake Storage" as the connector and hit "Continue." Maintain the default format CSV and click "Continue" again. Rename the sink dataset to "CSVTable" for clarity. Reuse the existing linked service connection to your Data Lake. Specify the file path using "Browse" and navigate to the desired bronze container where you want to store the copied tables. Finally, click "OK" to create the sink dataset.

  • Now, let's define the desired folder structure within the Data Lake. We want the tables organized by schema and table name.

1) Create Dynamic Parameters:

Open the "CSV Table" sink dataset and navigate to the "Parameters" tab. Create two new parameters: "schemaname" and "tablename".

2) Assign Dynamic Values:

In the "Value" section for each parameter, click "Add Dynamic Content." Use the expressions @item().SchemaName and @item().TableName to link the parameter values to the corresponding data from the Lookup activity (outer loop).

3) Build the Folder Path:

Open the "CSVTable" sink dataset again and navigate to the "Connection" tab. In the "File Path" field, use dynamic content to construct the path. Click "Add Dynamic Content" and enter a Python expression like: @{concat(dataset().schemaname, '/',dataset().tablename)} This expression concatenates the container name (obtained from the dataset), schema name (from the parameter), and table name (from the parameter) to create the desired directory structure.

4) Set the File Name:

In the same "File Path" field, use another dynamic content expression to define the filename format. You can use: @{concat(dataset().tablename,'.csv')} This expression combines the table name with ".csv" to create the desired file extension.

  • Publish and trigger the pipeline to execute the copy of all tables to the Data Lake.
  • Head over to the "Monitor" tab to track the pipeline's execution. Clicking the pipeline name reveals a detailed view of each activity's status. You can see that the Lookup activity has finished, and the ForEach Loop is currently running 10 activities in parallel, each copying a separate table. Successfully completed activities in the "Activity Runs" section indicate tables that have been loaded.

  • After the run succeeds, verify the outcome in the Data Lake. Refreshing the bronze container shows a folder named "SalesLT" (the schema name). Inside, you'll find all 10 tables with their designated directory structure and corresponding CSV files. Opening the "Address" folder confirms the presence of "Address.csv", and similar files exist for all copied tables.

Key Takeaways and Benefits:

  • Automation Advantage: This approach eliminates the need for manual scripting and repetitive tasks, saving you valuable time and effort.
  • Error Reduction: Repetitive manual processes are prone to errors. ADF automates the entire process, minimizing the chance of human error.
  • Scalability and Flexibility: This ADF solution can easily handle copying a large number of tables efficiently. Additionally, it can be adapted to fit your specific data landscape.


Next Task For You

Don't forget to like and share this article to spread the knowledge among your data engineer network! :)

