Getting Adventureworks to Fabric

Edit 20.09.2024: I've realized that I made a mistake with the code. The code I used works for AdventureWorksDW2022, but not for AdventureWorks2022 (no DW). I didn't realize until now that I had exported the DW version and not the "normal" version, so I've updated the code accordingly. My apologies for this oversight! I have now also provided the files as a zip file: https://github.com/kaysauter/getting-adventureworks-to-fabric

--

Today, I am starting a series of blog posts on how to get the AdventureWorks database into Fabric. The series will cover the journey from exporting the AdventureWorks database to CSV files, to loading the data into a Fabric Lakehouse, and then into the Data Warehouse. Along the way, I will discuss some approaches and tips.

If you don't know the AdventureWorks database: it is a sample database by Microsoft, used to demonstrate the capabilities of SQL Server and referenced in many of Microsoft's examples, tutorials and documentation. You can download the database here.

There are many versions of it; I'll be using the AdventureWorks2022 database, which I have on a local SQL Server instance. The first step is to export the data to CSV files. If you're using a different database, please make sure it isn't a production database, as exporting all data to CSV files can be an expensive process. This is not to say that it can't be done in a production environment, but test it in a non-production environment first. That way you'll see how long it takes and how much CPU and memory it uses on your system. So, let me walk you through the Python code via the comments within it:

# Import the libraries we need
import pyodbc
import pandas as pd
import os

# Define the folder path to where the csv files will be exported to
csv_dir = r"C:\Users\OneDrive\your\csv\path"

# Create the 'csv' directory if it doesn't exist
os.makedirs(csv_dir, exist_ok=True)

# Build connection string to the database
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=AdventureWorks2022;"
    "Trusted_Connection=yes;"
    "Encrypt=yes;"
    "TrustServerCertificate=yes;"
)

# Use a context manager to handle the connection
try:
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()

        # Get all SQL Server table names
        cursor.execute(
            "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
        )
        tables = cursor.fetchall()

        # We will exclude these tables from the export
        exclude_tables = {"dbo.AWBuildVersion", "dbo.DatabaseLog", "dbo.ErrorLog"}

        # Loop through each table and export its data to a CSV file
        for table in tables:
            schema_name = table[0]
            table_name = table[1]
            full_table_name = f"{schema_name}.{table_name}"

            # Skip excluded tables
            if full_table_name in exclude_tables:
                continue

            # Get column names and types
            cursor.execute(
                f"SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = '{schema_name}' AND TABLE_NAME = '{table_name}'"
            )
            columns = cursor.fetchall()

            # Build the query with conversion for unsupported types
            select_columns = []
            for col in columns:
                col_name = col[0]
                col_type = col[1]
                if col_type in ("xml", "geography", "hierarchyid"):
                    select_columns.append(
                        f"CAST([{col_name}] AS NVARCHAR(MAX)) AS [{col_name}]"
                    )
                else:
                    select_columns.append(f"[{col_name}]")

            columns_str = ", ".join(select_columns)
            query = f"SELECT {columns_str} FROM {full_table_name}"

            try:
                df = pd.read_sql(query, conn)
                csv_path = os.path.join(csv_dir, f"{schema_name}.{table_name}.csv")
                df.to_csv(csv_path, index=False)
            except Exception as e:
                print(f"Failed to export {full_table_name}: {e}")
except pyodbc.Error as e:
    print(f"Database connection failed: {e}")        

I am assuming that you already have a Fabric capacity. If you don't, you may be eligible for a free trial; in that case, you may want to follow the steps described on Microsoft's documentation website here.

Next, go into your Fabric capacity. Here, you should create a new workspace and a new lakehouse:

Click on the "Create Workspace" button in the left menu bar (1) which opens a fold-out window. After this, click on the green "+ Create Workspace" button (2) in the bottom left corner.

Next, another flap-out window opens on the right side. Here, you can enter your workspace name; the description is optional. If you can, I advise using the trial capacity that may be available to you. To do this, click on the radio button a bit further down in that window. At the bottom of the window, click the green "Apply" button to create the workspace. The other options are not needed for this tutorial.


Go into the workspace you just created: click on "Workspaces" and you'll find it in the flap-out window on the left.

Next, click the [+ New] button in the top-left menu bar. This opens a drop-down window from which you can choose a new lakehouse. Click on the "Lakehouse" option.



Next, enter your lakehouse name into the pop-up window that appears. I recommend using a name like adventureworks_bronze, which makes it easier to understand what this lakehouse contains and that it is the bronze layer. I'll explain the layer concept in a later blog post, so for now we don't have to worry about that. If you see the option "Lakehouse schemas", which is currently in public preview, I recommend not selecting it: it makes the setup a bit more complicated than we need for this tutorial. I'll cover it in a later post, but for now, let's skip it. Click the green "Create" button to create the lakehouse.

Now that we've created it, we can upload our files. In the middle of the screen (1), we can upload the files, but we could also right-click the "Files" folder (2) on the left side of the screen. The advantage of the second method is that you actually see the folder your files will be uploaded to, and you can create a subfolder there if you want. For this tutorial, I am going to keep things simple and upload the files directly into the root folder. In the flap-out window on the right (3), you can browse to your local folder and select all the files you've just created with the Python script above. If you choose the option "Overwrite if files already exist", you can upload the files again and overwrite the existing ones.
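
If you prefer to script this step instead of clicking through the UI, the CSVs can also be written to the lakehouse's Files folder over the OneLake endpoint. The following is only a minimal sketch: it assumes the azure-identity and azure-storage-file-datalake packages are installed, that your account has access to the workspace, and the workspace and lakehouse names are placeholders you need to replace with your own:

# Minimal sketch: upload the exported CSVs to the lakehouse "Files" folder
# via the OneLake (ADLS Gen2-compatible) endpoint. Names below are placeholders.
import os
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

workspace_name = "your_workspace_name"                     # placeholder
lakehouse_files = "adventureworks_bronze.Lakehouse/Files"  # lakehouse Files path
csv_dir = r"C:\Users\OneDrive\your\csv\path"

# OneLake exposes a Data Lake Storage-compatible endpoint;
# the workspace acts as the file system (container).
service = DataLakeServiceClient(
    account_url="https://onelake.dfs.fabric.microsoft.com",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client(workspace_name)

for file_name in os.listdir(csv_dir):
    if not file_name.endswith(".csv"):
        continue
    with open(os.path.join(csv_dir, file_name), "rb") as f:
        file_client = file_system.get_file_client(f"{lakehouse_files}/{file_name}")
        file_client.upload_data(f, overwrite=True)
        print(f"Uploaded {file_name}")

Either way, the result is the same: the CSV files end up under Files in the lakehouse.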

So that's it for today; I'll be back with the next blog post in this series soon!

This post was first published on LinkedIn in the Kay On Data article/newsletter. My blog on kayondata.com is currently under maintenance, so the post will be published there later, too.

ADIGUN SUNDAY

Information Technology Specialist @ Atax Systems / Member / Volunteer of Data Science Nigeria

3 months ago

Good work sir

John Younie BSc(Hons)

Senior Business Intelligence Consultant at NATO Communications and Information Agency (NCI Agency)

3 months ago

This is an awesome endeavour, I'm just starting my Fabric journey so I will be following along.

Alexander Potts

Dev turned data engineer. Will write SQL for food.

3 months ago

adventureworks and its siblings like northwind and wwi are great sources for any data work/study

Luca Zavarella

Head of Data & AI | Microsoft MVP for AI and Data Platform | Book Author | Contributor at Towards Data Science

3 months ago

Great initiative! Analysts new to Fabric absolutely need tutorials like this. Just a note: as you probably already know, data managed by Python is strictly bound to the available RAM of your machine, so if you need to extract a huge table to CSV, the process will surely crash. I had the need to export huge tables, so I developed a PowerShell module that simplifies the process: https://github.com/lucazav/sql-bulk-export Just my 2 cents for those who want to play with their huge data in Fabric.
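
For reference, pandas can also stream a large table to disk in chunks so the whole result never has to fit in RAM at once. A minimal sketch, reusing the conn, query and csv_path variables from the script above:

# Chunked variant of the export step: stream the table in batches of
# 100,000 rows and append each batch to the CSV file.
first_chunk = True
for chunk in pd.read_sql(query, conn, chunksize=100_000):
    chunk.to_csv(
        csv_path,
        index=False,
        mode="w" if first_chunk else "a",  # write the header only once
        header=first_chunk,
    )
    first_chunk = False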

Artur König

POWER BI or DIE Podcast & Streams | Microsoft Data Plattform @ BI or DIE | Self-Service, Agile BI and Analytics | Power BI Fanboy | Sketchnotes | YouTube | TDWI Expert & Author

3 months ago

Great way to get some demo data!
