In this second installment, we dive into the pivotal process of data ingestion: moving data from an on-premises SQL Server to Azure Data Lake Gen2 using Azure Data Factory. This article builds on the introduction and environment setup covered in Part 1, so if you haven't explored that yet, we encourage you to start there before continuing with this stage.
The Journey Begins: Establishing the Connection
Our adventure starts with establishing a connection between ADF and your on-premises SQL Server. This is achieved using a component called an integration runtime (IR).
- Self-hosted integration runtime: Ideal for on-premises data sources. We'll walk you through the installation process, ensuring a secure and efficient connection.
- Auto-resolve integration runtime: Perfect for cloud-based resources like Azure Data Lake. Later on, we'll explain how ADF manages this connection automatically.
Creating a Self-Hosted IR via the UI
- Navigate to the Azure Data Factory user interface homepage and click on the Manage tab located in the leftmost pane.
- From the options presented, select Integration runtimes and then click on +New.
- On the Integration runtime setup page, select Azure, Self-Hosted and click Continue.
- On the next page, select Self-Hosted to specify the creation of a Self-Hosted Integration Runtime and click Continue.
- Now, you'll be prompted to configure the self-hosted IR: Provide a name for your Integration Runtime and finalize the setup by clicking Create.
- To complete the setup via the UI: Click the link under Option 1 to run the express setup on the machine that can reach your SQL Server, or follow the steps under Option 2 for a manual setup. The instructions here follow the express setup (Option 1). Once the downloaded application finishes installing, open it to start the Self-Hosted Integration Runtime setup; the application downloads the necessary files and handles key-based authentication automatically, streamlining the process. A scripted alternative to this registration step is sketched below.
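If you'd rather script the registration than click through the express setup, the Azure SDK for Python exposes the same operations. The snippet below is a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages and placeholder subscription, resource group, factory, and runtime names: it registers the self-hosted IR in the factory and prints the authentication key you would paste into the locally installed runtime.

```python
# Minimal sketch: register a self-hosted IR and fetch its auth key with the Azure SDK.
# The subscription, resource group, factory, and IR names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

subscription_id = "<subscription-id>"
rg_name = "<resource-group>"
df_name = "<data-factory-name>"
ir_name = "OnPremSqlServerIR"  # hypothetical IR name reused in later snippets

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Register the self-hosted IR definition inside the data factory.
ir_resource = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Runtime for the on-premises SQL Server"
    )
)
adf_client.integration_runtimes.create_or_update(rg_name, df_name, ir_name, ir_resource)

# Retrieve an authentication key; paste it into the IR application installed on the
# on-premises machine to link the local node to this runtime.
keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, ir_name)
print("Auth key for manual registration:", keys.auth_key1)
```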
Building the Bridge: The Copy Data Activity
Now comes the exciting part - data movement! We'll showcase the copy data activity, the workhorse that transfers data from your SQL Server tables to the designated location in Azure Data Lake. We'll explore:
- Navigating to the Author tab, we locate the Pipelines section and click the plus icon to create a new pipeline. To avoid confusion down the line, we assign it a descriptive name such as "Copy Pipeline".
- The heart of our pipeline lies in the Copy Data activity. By searching for and dragging the Copy Data activity into the workspace, we set the stage for data transfer. Initially, we opt to copy a single table, such as the "Address" table from our SQL Server database. Renaming the activity to "Copy Address Table," we proceed to configure the source and sink.
- Before proceeding, we need to define a source dataset representing our SQL Server database. Clicking on the Source option, we're prompted to create a new source dataset. Selecting SQL Server as our data store, we provide essential details such as the server name (e.g., localhost), database name, and authentication credentials. For security, we use Azure Key Vault to store and retrieve the password (the linked services involved are sketched in code after this list).
- With the dataset in place, we select the source table (e.g., the Address table).
- On the other end of the spectrum lies our destination, Azure Data Lake Gen2. Here, we create a sink dataset for storing the copied data. Selecting an appropriate data format, such as Parquet or CSV, ensures efficient storage and retrieval. We establish a linked service connection to our Azure Data Lake storage account, specifying the desired container for data storage (both datasets are sketched in code after this list).
- With all configurations in place, we execute our pipeline using the Debug option. This initiates the data copying process, transferring the "Address" table from our SQL Server database to Azure Data Lake. Upon successful completion, we verify that the copied file is present in the designated container within the Data Lake (a scripted version of the pipeline and a triggered run appear below).
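To make the walkthrough concrete, here is a hedged sketch of the same connections defined in code rather than in the UI, using the azure-mgmt-datafactory package. The vault URL, storage account, secret name, database name, and linked service names are placeholders, not values from the article; the SQL Server linked service routes through the self-hosted IR registered earlier and reads its password from Key Vault, mirroring the UI configuration.

```python
# Sketch: linked services for Key Vault, on-premises SQL Server, and ADLS Gen2.
# All names, URLs, and secrets below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    AzureKeyVaultLinkedService,
    AzureKeyVaultSecretReference,
    IntegrationRuntimeReference,
    LinkedServiceReference,
    LinkedServiceResource,
    SqlServerLinkedService,
)

subscription_id, rg_name, df_name = "<subscription-id>", "<resource-group>", "<data-factory-name>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# 1) Key Vault linked service: lets ADF resolve secrets at runtime.
kv_ls = LinkedServiceResource(
    properties=AzureKeyVaultLinkedService(base_url="https://<your-vault>.vault.azure.net/")
)
adf_client.linked_services.create_or_update(rg_name, df_name, "ls_keyvault", kv_ls)

# 2) SQL Server linked service: connects via the self-hosted IR and pulls the
#    password from Key Vault instead of storing it in the factory.
sql_ls = LinkedServiceResource(
    properties=SqlServerLinkedService(
        connection_string="Integrated Security=False;Data Source=localhost;Initial Catalog=<database-name>;",
        user_name="<sql-user>",
        password=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(type="LinkedServiceReference", reference_name="ls_keyvault"),
            secret_name="<sql-password-secret>",
        ),
        connect_via=IntegrationRuntimeReference(
            type="IntegrationRuntimeReference", reference_name="OnPremSqlServerIR"
        ),
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "ls_sqlserver", sql_ls)

# 3) Data Lake Gen2 linked service: the cloud sink, reached through the
#    auto-resolve Azure integration runtime by default. An account key is used
#    here for brevity; a managed identity is a common alternative.
adls_ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<storage-account>.dfs.core.windows.net",
        account_key="<storage-account-key>",
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "ls_adls", adls_ls)
```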
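The source and sink datasets from the walkthrough can be defined the same way. This sketch assumes the linked services from the previous snippet already exist in the factory; the container, folder, file, and schema names are illustrative placeholders rather than values prescribed by ADF.

```python
# Sketch: source (SQL Server table) and sink (Parquet in ADLS Gen2) datasets.
# Assumes the linked services "ls_sqlserver" and "ls_adls" already exist.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLocation,
    DatasetResource,
    LinkedServiceReference,
    ParquetDataset,
    SqlServerTableDataset,
)

subscription_id, rg_name, df_name = "<subscription-id>", "<resource-group>", "<data-factory-name>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Source dataset: the Address table exposed through the SQL Server linked service.
src_ds = DatasetResource(
    properties=SqlServerTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ls_sqlserver"
        ),
        table_name="SalesLT.Address",  # assumption: adjust schema/table to your database
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "ds_address_sql", src_ds)

# Sink dataset: a Parquet file in a Data Lake Gen2 container.
sink_ds = DatasetResource(
    properties=ParquetDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ls_adls"
        ),
        location=AzureBlobFSLocation(
            file_system="raw",        # placeholder container name
            folder_path="address",
            file_name="Address.parquet",
        ),
        compression_codec="snappy",
    )
)
adf_client.datasets.create_or_update(rg_name, df_name, "ds_address_parquet", sink_ds)
```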
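Finally, the Copy Data activity, the pipeline that contains it, and a triggered run take only a few lines. This sketch follows the pattern of the official Azure Data Factory Python quickstart and relies on the placeholder datasets above; note that create_run executes the published pipeline, which is the programmatic counterpart of the interactive Debug run described in the walkthrough.

```python
# Sketch: a pipeline with a single Copy Data activity, published and triggered.
# Assumes the datasets "ds_address_sql" and "ds_address_parquet" already exist.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity,
    DatasetReference,
    ParquetSink,
    PipelineResource,
    SqlServerSource,
)

subscription_id, rg_name, df_name = "<subscription-id>", "<resource-group>", "<data-factory-name>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# The Copy Data activity: read from the SQL Server dataset, write Parquet to the lake.
copy_activity = CopyActivity(
    name="Copy Address Table",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_address_sql")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_address_parquet")],
    source=SqlServerSource(),
    sink=ParquetSink(),
)

# Wrap the activity in a pipeline and publish it to the factory.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "Copy Pipeline", pipeline)

# Trigger a run and poll until it completes.
run = adf_client.pipelines.create_run(rg_name, df_name, "Copy Pipeline", parameters={})
status = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id).status
while status in ("Queued", "InProgress"):
    time.sleep(15)
    status = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id).status
print("Pipeline run finished with status:", status)
```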
Conclusion
By meticulously following these steps, data engineers can streamline the process of data ingestion, enabling seamless transfer of data from on-premises SQL Server databases to Azure Data Lake Gen2. In the next part of our data engineering journey, we'll explore scaling up the pipeline to handle multiple tables effortlessly.
Stay tuned for more insights and practical guides on data engineering techniques. Follow me for updates on future articles and dive deeper into the world of data engineering with Azure Data Factory. Let's unlock the true potential of data together.
#AzureDataFactory #DataIngestion #CloudMigration #DataEngineering