Preparing to Take the DP-900 Exam: Hands-on Experience Creating a Notebook in Azure Synapse Analytics
Tshepho Rapholo
Azure & O365 Support Engineer | Cloud Command Analyst at Cloud Direct
I am currently preparing to take my DP-900 exam. I have been learning from Microsoft Learn content at https://learn.microsoft.com/en-us/credentials/certifications/azure-data-fundamentals/?practice-assessment-type=certification, watching the LinkedIn course Prepare for the Azure Data Fundamentals (DP-900) Certification Exam (linkedin.com), and further decided to work through the hands-on labs from Whizlabs at https://www.whizlabs.com/microsoft-azure-certification-dp-900/ to familiarize myself with the resources used in Azure data analytics and how they are used.
In this lab, I learned how to create a notebook, connect it to an Apache Spark pool, and run Spark SQL queries against files and tables.
Below is the architecture diagram for creating a Notebook in Azure Synapse Analytics.
Creating a Synapse Analytics workspace
Log in to the Azure portal and use the search bar to search for Azure Synapse Analytics, as per the screenshot below.
Select Azure Synapse Analytics; you will be directed to the workspace page.
You can now create a Synapse workspace by clicking either of the options shown in the screenshot below.
You will be redirected to the creation page.
You can create a new resource group or select an existing one if you already have one.
Choose your workspace name (remember the workspace name below when choosing your resource name).
Choose your preferred region (I have chosen East US)
For Data Lake Storage Gen2, select From subscription, as per the screenshot below.
Now you need to select or create a new storage account and file system; below is the definition of each.
Click Next to go to the Security tab.
For the authentication method, select both local and Microsoft Entra ID authentication.
Create your admin login credentials as preferred (screenshot below) and leave the other settings at their defaults.
Click on Review + create.
Your Synapse configuration will be validated, and you should see a success message as shown below:
Now you can click Create and wait while the Synapse deployment is in progress (screenshot below).
Once everything goes well, you will get "Your deployment is complete."
This means your Synapse workspace has been deployed successfully.
Now go back to your newly created Synapse workspace and select New Apache Spark pool.
Enter your Apache Spark pool details as per the screenshot below, or your preferred details according to your requirements.
Leave automatic pausing enabled (this feature releases resources after a set idle period, reducing the overall cost of your Apache Spark pool).
Click Review + create (the new Apache Spark pool gets validated).
Deployment in Progress
Deployment of New Apache Spark Pool Completed
Now we need to go back into the Azure Synapse workspace and access Synapse Studio.
Below is a screenshot of Synapse Studio opening.
Once Synapse Studio opens, hover over the icon tabs on the left, select Develop, click the plus sign, and select Notebook. Name the notebook, attach it to the Apache Spark pool created earlier, and click Publish.
Copy the command below and paste it into the cell to create a simple Spark DataFrame object to manipulate, then click on Run all. Make sure to delete the empty cell above.
# Create a small DataFrame with state, age, and salary columns
new_rows = [('CA', 22, 45000), ('WA', 35, 65000), ('WA', 50, 85000)]
demo_df = spark.createDataFrame(new_rows, ['state', 'age', 'salary'])
demo_df.show()
Once the code has finished running, information appears below the cell showing how long it took to run and the details of its execution. The data now exists in a DataFrame; from there, you can use the data in many different ways, as shown in the sketch below.
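For example, here is a minimal sketch (assuming the demo_df created above) of two common DataFrame operations, filtering rows and counting them:
# Keep only rows where age is greater than 30 (an illustrative filter, not a lab step)
demo_df.filter(demo_df.age > 30).show()
# Count the rows in the DataFrame
print(demo_df.count())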
Now, to create a Spark table, a CSV file, and a Parquet file, all with copies of the data, run the following command in another cell by clicking on + Code. Make sure to replace <<TheNameOfAStorageAccountFileSystem>> and <<TheNameOfAStorageAccount>> with the file system name and storage account name that you created while creating the Synapse Analytics workspace. You can find the storage account name and the file system name under Storage accounts in the Azure portal.
# Register the DataFrame as a temporary view so it can be queried with Spark SQL
demo_df.createOrReplaceTempView('demo_df')
# Write the data as CSV to the workspace's default storage
demo_df.write.csv('demo_df', mode='overwrite')
# Write the data as Parquet to the Data Lake Storage Gen2 account
demo_df.write.parquet('abfss://<<TheNameOfAStorageAccountFileSystem>>@<<TheNameOfAStorageAccount>>.dfs.core.windows.net/demodata/demo_df', mode='overwrite')
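As an optional sanity check (a sketch, reusing the same <<...>> placeholders as above), you can read the Parquet output back into a new DataFrame to confirm the write succeeded:
# Read the Parquet files back from Data Lake Storage Gen2
parquet_df = spark.read.parquet('abfss://<<TheNameOfAStorageAccountFileSystem>>@<<TheNameOfAStorageAccount>>.dfs.core.windows.net/demodata/demo_df')
parquet_df.show()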
Now, navigate to the Data hub.
Go to the Linked tab; under Azure Data Lake Storage Gen2, select your Synapse Analytics workspace, and then select your file system. You will see that it contains the demodata folder, which was created when you ran the above command.
Now, open the folder by double-clicking on it. Notice that for both the CSV and Parquet write operations, a directory is created containing many partitioned files (see the optional sketch below).
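This happens because Spark writes one file per partition of the DataFrame. If you would rather produce a single output file, one optional approach (a sketch, not part of the lab steps; 'demo_df_single' is a hypothetical path) is to coalesce the DataFrame to one partition before writing:
# Coalesce to a single partition so the write produces one data file
demo_df.coalesce(1).write.csv('demo_df_single', mode='overwrite')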
Now, go back to your notebook, copy the command below, and paste it into a new cell. Run the command to list the tables in the pool.
%%sql
SHOW TABLES
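The %%sql magic runs the cell contents as Spark SQL. If you prefer, the same query can also be issued from a Python cell, for example:
# Equivalent Spark SQL call from a Python cell
spark.sql('SHOW TABLES').show()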
Now, to see the data in demo_df, run the following command in a new cell:
%%sql
SELECT * FROM demo_df
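As a further illustrative query (a sketch against the same temporary view, not a required lab step), you can also aggregate the data directly in Spark SQL, for example the average salary per state:
%%sql
SELECT state, AVG(salary) AS avg_salary FROM demo_df GROUP BY state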
You can also view the data in the form of a chart. In the View switcher, select Chart.
This brings us to the end of the lab.
What have we achieved throughout the lab?