Fabric Data Science with Github Copilot in VS Code
There is a lot of excitement surrounding Microsoft Fabric's Copilot, but in the interim there is another avenue for exploring the world of copiloting alongside Fabric. That alternative approach is precisely what this blog post aims to cover.
Prerequisites
I have already enabled GitHub Copilot in VS Code using the steps mentioned here and here. I also suggest you enable Copilot Chat inside VS Code using this link. Once you have set up all the prerequisites for GitHub Copilot in VS Code, let's complete the setup for the Fabric VS Code integration. Follow this documentation for detailed instructions. The setup is summarized below:
If you have followed all the instructions and completed the setup, you should see something similar to this:
You will notice the following items:
It is worth mentioning that the Synapse VS Code extension ships with the synapse-spark-kernel. This kernel provides the ability to execute code cells atop the remote Fabric Spark compute. Upon selecting this kernel, the extension seamlessly intercepts all PySpark API calls at runtime and translates them into the appropriate HTTP calls to the remote Spark compute, while pure Python code executes locally. This ensures efficient utilization of both local and remote resources.
Let's start building a data science notebook. For this exercise, I will use the Titanic dataset, which is the "Hello World" of the data science world. Keeping it very simple!
I have already downloaded the data from Kaggle (link here) and uploaded it into my Fabric workspace.
Next, I will create a new data science notebook and attach the lakehouse that already contains my Titanic datasets.
Once you click the VS Code button at the top of the notebook, it will open VS Code and you will see something similar to this:
Once you execute the cell, you will see the output as follows:
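That first cell simply loads the Titanic CSV into a pandas DataFrame and previews it. Here is a minimal sketch; the lakehouse path in the comment is illustrative, and a tiny stand-in frame with the same columns is constructed so the snippet runs standalone:

```python
import pandas as pd

# In the Fabric notebook, the first cell typically reads the uploaded CSV from
# the attached lakehouse, e.g. (path is illustrative, not the author's exact one):
# df = pd.read_csv("/lakehouse/default/Files/titanic/train.csv")

# Stand-in frame with a few representative Titanic columns:
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived":    [0, 1, 1, 0],
    "Pclass":      [3, 1, 3, 1],
    "Sex":         ["male", "female", "female", "male"],
    "Age":         [22.0, 38.0, 26.0, None],
    "Fare":        [7.25, 71.28, 7.92, 53.10],
})
print(df.head())
```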
If you look at the above image, you will note that there is an option to "Launch Data Wrangler". It is similar to the one you get in a Fabric Data Science notebook inside your workspace. Let's explore the Data Wrangler experience within VS Code.
The experience and functionality are very similar to those in the Fabric workspace.
Click on any column and you can see all the operations you can perform on it, alongside the data summary and statistics.
Let's try performing some operations on it. We perform the following steps:
Look at the steps we performed and the corresponding code.
We can copy the code and run it in our notebook. The code runs very fast because it leverages my machine's local Python kernel instead of Fabric's remote Spark kernel.
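Data Wrangler emits the recorded steps as a plain pandas function you can paste into the notebook. The specific steps below (dropping low-signal columns, filling missing ages, one-hot encoding) are illustrative examples of typical Titanic cleanup, not the exact code generated in the screenshot:

```python
import pandas as pd

def clean_data(df):
    # Drop identifier/free-text columns with little predictive signal
    df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    # Fill missing ages with the median age
    df["Age"] = df["Age"].fillna(df["Age"].median())
    # One-hot encode the categorical columns (dtype=int keeps them numeric)
    df = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)
    return df

# Small stand-in for the raw Titanic frame so the sketch runs standalone:
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived":    [0, 1, 1],
    "Pclass":      [3, 1, 3],
    "Name":        ["A", "B", "C"],
    "Sex":         ["male", "female", "female"],
    "Age":         [22.0, None, 26.0],
    "Ticket":      ["t1", "t2", "t3"],
    "Fare":        [7.25, 71.28, 7.92],
    "Cabin":       [None, "C85", None],
    "Embarked":    ["S", "C", "S"],
})
df_clean = clean_data(raw.copy())
print(df_clean.head())
```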
Now, we will explore GitHub Copilot for exploratory data analysis (EDA). I will use GitHub Copilot Chat to perform EDA using natural language. First, I ask it to create a correlation chart between columns, using my cleansed dataframe as context. Surprisingly, it was very quick to suggest a correlation chart using the seaborn package. I copy the code and run it. Err! I get an error message!
In the above image you can see the prompt chat in the left window. It shows the code for my request. When executed, it shows an error message: "No module named 'seaborn'". Normally, I would immediately search Stack Overflow for solutions. Luckily, GitHub Copilot is super powerful: I click on Quick Fix, select 'Fix using Copilot', and it automatically suggests the code change needed to remove the error.
I accept the changes and add them to my code. After installing seaborn on both the remote and local environments, I run the code. After removing some non-numeric columns, I compute the correlation plot with seaborn, and below is the output.
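The key fix here is restricting the correlation to numeric columns, since `.corr()` cannot handle string columns. A minimal sketch (the stand-in frame and column values are illustrative, and the seaborn call is left as a comment since it only produces a plot):

```python
import pandas as pd

# Stand-in for the cleansed Titanic frame; dummy columns are ints so they
# survive the numeric-only selection below.
df_clean = pd.DataFrame({
    "Survived":   [0, 1, 1, 0, 1],
    "Pclass":     [3, 1, 3, 1, 2],
    "Age":        [22.0, 38.0, 26.0, 35.0, 27.0],
    "Fare":       [7.25, 71.28, 7.92, 53.10, 11.13],
    "Sex_female": [0, 1, 1, 0, 1],
})

# Keep only numeric columns -- passing string columns to .corr() is what
# causes the kind of error described above.
corr = df_clean.select_dtypes(include="number").corr()
print(corr["Survived"].sort_values(ascending=False))

# To visualise as a heatmap (requires seaborn):
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```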
As you can see, 'Survived' has a high correlation with 'Sex_female' and 'Fare'. Now, let's say you are short of ideas for exploratory analysis; you can ask Copilot for them! I did that, and here is the result:
Why stop here? Let's ask for code for each of the suggestions. To my surprise, Copilot provides the results equally fast. Snapshot below.
I use the code suggested for "Proportions of passengers who survived by age group". Below is the result:
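The suggested snippet boils down to binning ages and averaging the survival flag per bin. A sketch of that computation, with illustrative age bins and a tiny synthetic frame (not the actual Copilot output or real Titanic data):

```python
import pandas as pd

df = pd.DataFrame({
    "Age":      [4, 15, 25, 35, 45, 58, 66, 30, 8, 52],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Bin ages into groups, then take the mean of the 0/1 Survived flag per bin --
# the mean of a binary column is exactly the survival proportion.
bins = [0, 12, 18, 35, 60, 100]
labels = ["child", "teen", "young adult", "adult", "senior"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)
rates = df.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rates)
```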
Next, we will create a machine learning model. Let's ask Copilot to build an ML model and report its accuracy. Here is the result:
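A Copilot-generated model for this task typically follows the standard scikit-learn pattern: train/test split, fit, score. The sketch below uses synthetic Titanic-like features (the data, feature names, and label rule are all illustrative, not the screenshot's code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic Titanic-like features; the label loosely mimics the real pattern
# (women and higher-fare passengers survive more often), plus noise.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "Pclass":     rng.integers(1, 4, n),
    "Sex_female": rng.integers(0, 2, n),
    "Age":        rng.uniform(1, 80, n).round(1),
    "Fare":       rng.uniform(5, 100, n).round(2),
})
y = ((X["Sex_female"] + (X["Fare"] > 50).astype(int)
      + rng.uniform(0, 1, n)) > 1.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {acc:.3f}")
```

Swapping the algorithm, as asked of Copilot next, is a one-line change: replace `LogisticRegression(max_iter=1000)` with `DecisionTreeClassifier()` from `sklearn.tree`.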
We can also start a chat with Copilot and ask it to change the algorithm to DecisionTree. Here is the result:
As you can see, it understands where to modify the code, highlighting the changes in green/red. We can Accept or Discard accordingly.
Finally, I ask Copilot to log the parameters and model using MLflow. It rewrites the code as shown below:
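The rewrite essentially wraps training in an MLflow run and calls the logging APIs. A hedged sketch of that shape, on tiny stand-in data (the run name, metric, and data are illustrative, and the logging is guarded so the snippet still runs where MLflow is not set up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training data (columns: Pclass, Sex_female);
# the real notebook uses the cleansed Titanic frame.
X = np.array([[1, 0], [3, 1], [2, 1], [1, 1], [3, 0], [2, 0]])
y = np.array([0, 1, 1, 1, 0, 0])

params = {"max_iter": 1000, "solver": "lbfgs"}
model = LogisticRegression(**params).fit(X, y)

# Log parameters, a metric, and the fitted model to an MLflow run.
# Guarded so the sketch runs even without MLflow installed/configured.
try:
    import mlflow
    import mlflow.sklearn
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "titanic-model")
except Exception as exc:
    print(f"Skipping MLflow logging: {exc}")
```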
To summarize, this blog post shows how to increase data scientists' productivity by using Fabric Data Science notebooks along with Visual Studio Code and GitHub Copilot. It guides you through enabling GitHub Copilot, the Synapse VS Code extension, and the Fabric integration. It then showcases a workflow using the Titanic dataset, including Data Wrangler for easy transformations. It highlights GitHub Copilot's role in exploratory data analysis and ML model building, underlining its error-fixing capabilities. Overall, the post demonstrates how these tools streamline tasks from data handling to machine learning, enhancing productivity.