How to bring SQL Server data into Microsoft Fabric
Nikola Ilic
I make music from the data | Data Mozart | MVP Data Platform | Pluralsight Author | O'Reilly Instructor | MCT
Options, options, options… Having the possibility to perform a certain task in multiple different ways is usually a great “problem” to have, although very often not every option is equally effective. And Microsoft Fabric is all about “options”… You want to ingest the data? No problem, you can use notebooks, pipelines, Dataflows, or T-SQL. Data transformation needed? No worries at all – again, you may leverage notebooks, T-SQL, Dataflows… Data processing, you asked? Lakehouse (Spark), Warehouse (SQL), Real-Time Intelligence (KQL), Power BI… The choice is yours again.
In a nutshell, almost every single task in Microsoft Fabric can be completed in multiple ways, and there is no “right” or “wrong” tool, as long as it gets the job done (of course, as efficiently as possible).
Therefore, this article is not about bringing on-prem SQL Server data to Fabric in the “right” way, but rather an overview of the options we currently have at our disposal. The main motivation for examining these options and writing about them is the question I’ve been asked frequently in recent months: “We have our data (or chunks of data) stored in an on-prem SQL Server. How can we leverage this data within the Microsoft Fabric landscape?”
So, let’s check what we can do as of today (October 2024)…
Prerequisites
On-Premises Data Gateway
First things first… To be able to bring on-prem data into the cloud, you need a special piece of software called the on-premises data gateway, which you can download for free from here. Explaining the on-premises data gateway is out of the scope of this article, but in a nutshell, you can think of it as a bridge between your non-cloud data sources (e.g. SQL Server, Oracle, etc.) and various cloud services, such as Power BI, Logic Apps, etc.
The gateway is responsible for ensuring a secure connection to on-prem data sources and a reliable data transfer from on-prem sources to the cloud.
Simply said, you MUST have a gateway installed to be able to bring your SQL Server data into Microsoft Fabric.
Connection to SQL Server data source
Once you have the on-prem gateway up and running, you need to create and configure a connection to the specific instance of the SQL Server. In this step, you must provide details such as server name, database name, and login credentials.
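By the way, if you’re not sure which exact values to enter, a quick T-SQL check against the instance (for example, from SSMS or sqlcmd) confirms the server and database names before you configure the connection. This is a minimal sketch, assuming you can already connect to the instance locally and that you’re using the AdventureWorks sample database referenced throughout this article:

-- Confirm the exact server/instance name to enter in the connection dialog
SELECT @@SERVERNAME AS ServerName;

-- Confirm the database name
SELECT name FROM sys.databases WHERE name LIKE 'AdventureWorks%';

-- List the tables you plan to bring into Fabric
USE AdventureWorksDW2020;
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME;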
In the illustration above, you can see how to access Manage Connections and gateways from the Admin portal in Microsoft Fabric. You may also notice that I already configured a connection to the Adventure Works 2020 database on my SQL Server local instance.
Since we installed a gateway and configured the connection, we can now move forward and learn how to bring the data from SQL Server to Microsoft Fabric.
Option #1: Dataflows Gen2
This is a low-code/no-code approach for bringing (any) data into Fabric. For anyone who has ever worked with Power Query in Power BI (both in Power BI Desktop and/or Power Query Online), the entire scenery will look very familiar. From a user interface perspective, nothing changed in Dataflows Gen2, except one key thing – you can now choose the output for your data!
If you go to the Data Factory experience, there is an option to create a new Dataflow Gen2:
I’ll select SQL Server as a data source and then configure my connection to the local instance of SQL Server and AdventureWorksDW2020 database. Since I’ve already created the connection, it will automatically appear in the dialog window:
From there, the process is fairly straightforward – you choose the tables and/or views you want to bring into Fabric, and you can optionally apply data transformations using the rich set of options in the Power Query Editor:
Finally, the key step in the entire process is to set the data destination. Currently, you can choose between a lakehouse, a warehouse, a KQL database (Kusto), and an Azure SQL database:
Once you are happy with all the settings, you publish the dataflow, and, depending on the amount of data that needs to be transferred from the on-prem SQL Server to Fabric, it may take some time before the data is available in the form of delta tables in the Fabric destination you specified. By default, all of the tables will be v-ordered and ready for further downstream consumption by all Fabric engines, including Direct Lake mode for Power BI.
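Once the dataflow has refreshed, a quick way to sanity-check the result is to query the new delta tables through the SQL analytics endpoint of the destination. A minimal sketch, assuming the data landed in a lakehouse named SQL_Server_Data and that DimCustomer was one of the tables you brought over (both names are just placeholders for this example):

-- Run against the SQL analytics endpoint of the destination lakehouse
SELECT TOP (10) * FROM SQL_Server_Data.dbo.DimCustomer;

-- Compare the row count with the source table in SQL Server to confirm a complete transfer
SELECT COUNT(*) AS RowsTransferred FROM SQL_Server_Data.dbo.DimCustomer;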
Option #2: Pipelines
With Dataflows Gen2, it was super easy! Let’s now examine another option – Data pipeline.
In a nutshell, a pipeline plays two different roles in the Microsoft Fabric ecosystem: it moves (copies) data from a source to a destination, and it orchestrates other Fabric items (notebooks, dataflows, stored procedures, etc.) into a single workflow. For this task, we’re interested in the first role – data movement.
And I’ll immediately tell you one thing to keep in mind if you choose to use pipelines for bringing your SQL Server data into Fabric: the experience differs significantly depending on whether your destination is a lakehouse or a warehouse, as you’ll see in the two scenarios below.
Option #2a: Lakehouse as a data destination
Similar to Dataflows Gen2, this one is fairly straightforward. After I’ve chosen to create a new Pipeline in the Data Factory experience, and then the Copy data assistant, I’ll be prompted to enter the data source connection details:
I’ll then select the tables and/or views I want to bring from my Adventure Works database, and I’m done. I can also write a T-SQL query to create a brand-new delta table in the Fabric lakehouse:
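To illustrate the query option, here is a rough sketch of the kind of T-SQL you could paste into the Copy data assistant instead of picking individual tables – the query runs against the source AdventureWorksDW2020 database and lands a single, pre-aggregated delta table in the lakehouse (table and column names follow the standard AdventureWorksDW sample schema):

-- Hypothetical source query entered in the Copy data assistant
SELECT
    p.EnglishProductName,
    d.CalendarYear,
    SUM(f.SalesAmount)   AS TotalSales,
    SUM(f.OrderQuantity) AS TotalQuantity
FROM dbo.FactInternetSales AS f
INNER JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
INNER JOIN dbo.DimDate    AS d ON d.DateKey = f.OrderDateKey
GROUP BY p.EnglishProductName, d.CalendarYear;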
Next, I’ll choose a Lakehouse as the data destination. I can either select an existing lakehouse (which I’ll do for the sake of this example) or create a new lakehouse specifically for this data (which might be useful because of certain limitations I’ll introduce soon).
I’ll then configure destination details and column data types, and I’m good to go:
After a few seconds, I’m able to see new delta tables in my Fabric lakehouse:
From here, I can leverage the data from these tables for all Fabric workloads, including Direct Lake for Power BI.
This was super easy and straightforward, life is good:)
Option #2b: Warehouse as a data destination
Let’s say that your tool of choice when working with Microsoft Fabric is a Warehouse. You would expect that the same process explained above is relevant in this scenario as well, right? After all, the official documentation doesn’t say a word about it being any different, right? Well, I have to disappoint you…
Let’s create a new pipeline and repeat all the steps from the “lakehouse scenario”, except for the data destination, where I’ll choose to move my SQL Server data into the Fabric warehouse:
As soon as I click “Next”, I see an error message at the bottom, written in red:
It requires me to adjust the staging connection. So, essentially, what it asks me to do is to define an intermediate staging area (an Azure Storage account), and then connect to this external account (external to Fabric) to bring the data into Warehouse! What the f…rench toast?!
A few things to mention here: first of all, I consider this a VERY BAD user experience, and I already provided this feedback internally to Microsoft. The answer I received was that the warehouse connector relies on the COPY INTO command to write into the Fabric warehouse. If the source is something that is not supported by COPY INTO (such as an on-prem SQL Server, for example), it is necessary to stage the data first and then run COPY INTO from that staging area (an external storage account). So, it is a connector limitation…
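To make the limitation more tangible, here is a rough sketch of what a staged load into the warehouse looks like – this is not the exact statement the connector generates, and the storage account, path, and SAS token are pure placeholders. The copy activity first lands the data as files in the staging storage account, and the warehouse then ingests those files with something along these lines:

-- Rough sketch of the staged load; all names and secrets below are placeholders
COPY INTO dbo.DimCustomer
FROM 'https://mystagingaccount.blob.core.windows.net/staging/DimCustomer/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<SAS token>')
);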
But I still consider it a bad user experience, and let me explain why: Fabric is “sold” as a SaaS solution (“Everything just works!”). And that is very true for the “use the pipeline to bring SQL Server data to Fabric” scenario with a Lakehouse, but it’s not true for the Warehouse scenario. Why would I (and when I say I, I mean the Fabric user, my clients, etc.) need to set up and configure any additional storage myself?! This is SaaS, right? So, I would expect this intermediate storage (staging) to be provisioned for me behind the scenes if it’s required “by design” of the Warehouse workloads (connector limitations, etc.).
What should I tell clients who want to bring SQL Server data into Fabric with a pipeline? “Folks, you should not use Warehouse, because then you need to configure XYZ in addition… But, hey, if you use Lakehouse, it just works…” That’s an inconsistency IMO between the two, and it should be explicitly mentioned in the official docs, so that everyone knows what the prerequisites are if they plan to go the “Warehouse route”.
As already mentioned, I provided this feedback internally to Microsoft, and let’s hope that this is just a temporary limitation, which will be resolved soon…
As a current workaround, you can either use a Dataflow Gen2 to bring your on-prem SQL Server data directly into the Warehouse, or use a pipeline to bring the data into a Lakehouse and then write cross-database queries to combine the SQL Server data from the lakehouse with the other data in the warehouse.
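The cross-database workaround works because the warehouse and the lakehouse’s SQL analytics endpoint sit in the same workspace, so you can reference both with three-part names. A minimal sketch, assuming a lakehouse named SQL_Server_Data (holding the copied DimCustomer table) and a warehouse named Sales_WH with a FactSales table – all of these names are placeholders:

-- Run in the Fabric warehouse: join warehouse data with the SQL Server data landed in the lakehouse
SELECT
    c.FirstName,
    c.LastName,
    SUM(s.SalesAmount) AS TotalSales
FROM Sales_WH.dbo.FactSales AS s
INNER JOIN SQL_Server_Data.dbo.DimCustomer AS c
    ON c.CustomerKey = s.CustomerKey
GROUP BY c.FirstName, c.LastName;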
What about Shortcuts, Mirroring, Notebooks…
Fine, now that we’ve learned all the caveats of bringing on-prem SQL Server data to Fabric by using Dataflows Gen2 and/or Pipelines, let’s examine how the other data ingestion options measure up against this task. The short answer, as of today: Shortcuts and Mirroring aren’t available for on-prem SQL Server data, and Notebooks currently lack support for connecting to on-prem data sources, so none of them can help with this particular task right now.
Conclusion
Bringing data from on-premises SQL Server databases to Microsoft Fabric is one of the key requirements for many companies considering a move to Fabric.
Since Fabric is all about options – don’t forget, one single task can be performed in multiple different ways – I hope this article sheds more light on the various tools and features that can be leveraged to make your SQL Server data available in Fabric. As you’ve seen, the choice can be very nuanced, and depending on your specific use case, you might want to consider certain limitations/workarounds before you decide which path to take.
Thanks for reading!