Fabric Data Science with GitHub Copilot in VS Code


There is a lot of excitement surrounding Microsoft Fabric's Copilot, but in the interim there is another avenue for exploring the world of copiloting alongside Fabric. Delving into that alternative approach is precisely what this blog post aims to do.

Prerequisites

I have already enabled GitHub Copilot in VS Code using the steps mentioned here and here. I also suggest you enable Copilot Chat inside VS Code using this link. Once you have set up all the prerequisites for GitHub Copilot in VS Code, let's complete the setup for the Fabric VS Code integration. Follow this documentation for detailed instructions. The setup is summarized below:

  1. Install the Synapse VS Code extension. Complete the prerequisites and make sure JAVA_HOME and Conda are in your system PATH variables.
  2. If you face problems during the Conda installation, I suggest you follow this Stack Overflow thread: python - 'Conda' is not recognized as internal or external command - Stack Overflow
  3. Familiarize yourself with the features of the Synapse VS Code extension, especially around listing and publishing notebooks.
  4. You can either open a notebook directly from VS Code or click the VS Code button within a Fabric notebook, as shown below.


VS Code extension in Fabric Notebook
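
Before going further, you can sanity-check the prerequisites from step 1. A quick, purely illustrative Python sketch:

```python
import os
import shutil

# Hypothetical sanity check for the Synapse VS Code extension prerequisites:
# JAVA_HOME must be set, and conda must be resolvable on the system PATH.
java_home = os.environ.get("JAVA_HOME")
print(f"JAVA_HOME: {java_home or 'NOT SET'}")
print(f"conda on PATH: {shutil.which('conda') or 'NOT FOUND'}")
```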

If you have followed all the instructions and completed the setup, you should see something similar to this:


Synapse VS Code extension

You will notice the following items:

  1. The Fabric workspace, which holds all your notebooks and Lakehouses.
  2. The Synapse VS Code extension icon. Once you click on it, you can see all the notebooks, Spark Job Definitions, and Lakehouses.
  3. The GitHub Copilot Chat extension to assist you while writing code.

It is worth mentioning that the Synapse VS Code extension ships with the synapse-spark-kernel. This kernel provides the ability to execute code cells on the remote Fabric Spark compute. Once you select this kernel, the extension intercepts all PySpark API calls at runtime and translates them into the appropriate HTTP calls to the remote Spark compute, while pure Python code executes locally. This ensures efficient utilization of both local and remote resources.

synapse-spark-kernel
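
To make the local/remote split concrete, here is a minimal sketch (the `spark` session is pre-provisioned by the kernel; this is illustrative, not the extension's documented behavior in detail):

```python
# With the synapse-spark-kernel selected, a PySpark call like this is
# intercepted by the extension and forwarded over HTTP to the remote
# Fabric Spark compute.
spark.range(5).show()

# Plain Python, by contrast, runs in the local interpreter.
import platform
print("Running locally on:", platform.node())
```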

Let's start building a data science notebook. For this exercise, I will use the Titanic dataset, which is the "Hello World" of the data science world. Keeping it very simple!

I have already downloaded the data from Kaggle (link here) and uploaded it into my Fabric workspace.


Data uploaded to Fabric Workspace

Next, I will create a new data science notebook and attach the lakehouse which already has my Titanic datasets.


Data science Notebook

Once you click the VS Code button at the top of the notebook, it will open VS Code and you will see something similar to this:


Data Science Notebook in VS Code
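
The first cell simply loads the training data. A minimal sketch, assuming train.csv sits under a Files/titanic/ folder in the attached Lakehouse (the path is illustrative):

```python
# The `spark` session is pre-provisioned by the synapse-spark-kernel.
df = spark.read.csv("Files/titanic/train.csv", header=True, inferSchema=True)
train_df = df.toPandas()  # pandas copy, which Data Wrangler can open
display(train_df)
```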

Once you execute the cell, you will see output like the following:


Train Dataset display

If you look at the above image, you will note that there is an option to "Launch Data Wrangler". It is similar to the one you get in a Fabric Data Science notebook inside your workspace. Let's explore the Data Wrangler experience within VS Code.

The experience and functionality are very similar to those in the Fabric workspace.

Data Wrangler in VS Code

Click on any column and you can see all the operations that you can perform on it alongside Data Summary and Statistics.


Operations and Data Summary for the column Cabin

Let's try performing some operations on it. We do the following steps:

  1. Change the data type from text to int or float.
  2. Remove missing values.
  3. One-hot encode a categorical variable.
  4. Drop some unwanted columns.

Look at the steps we perform and the corresponding code.


Data Wrangling inside VS Code

We can copy the code and run it in our notebook. The code runs very fast because it leverages the Python kernel on my machine instead of the remote Spark kernel of Fabric.
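
The generated code is plain pandas. A sketch of what Data Wrangler might emit for the four steps above (the column choices are illustrative; the actual code depends on the steps you record):

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Change data type: 'Fare' to float (illustrative choice)
    df = df.astype({"Fare": "float64"})
    # 2. Remove rows with missing values in 'Age' and 'Embarked'
    df = df.dropna(subset=["Age", "Embarked"])
    # 3. One-hot encode the categorical 'Sex' column
    #    (produces 'Sex_female' and 'Sex_male')
    df = pd.get_dummies(df, columns=["Sex"], dtype=int)
    # 4. Drop unwanted columns
    df = df.drop(columns=["Cabin", "Ticket", "Name"])
    return df

train_df_clean = clean_data(train_df.copy())
```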

Now we will explore GitHub Copilot for exploratory data analysis (EDA). I will use GitHub Copilot Chat to perform EDA using natural language. First, I ask Copilot to use my cleansed dataframe as context and create a correlation plot between columns. Surprisingly, it was very quick to suggest a correlation chart using the seaborn (sns) package. I copy the code and run it. Err! I get an error message!


Error message while I execute the Copilot chat code

In the above image you can see the prompt chat in the left window, showing the code for my ask. When executed, it produces an error message: "No module named 'seaborn'". Normally, I would immediately search Stack Overflow for solutions. Luckily, GitHub Copilot is super powerful: I click on Quick Fix, select 'Fix using Copilot', and it automatically suggests the code change needed to remove the error.


Copilot Suggested changes.

I accept the changes and add them to my code. After installing seaborn on both the remote and local environments, I run the code. After removing some non-numeric columns, I get the correlation plot using seaborn, shown below.


Correlation plot using seaborn
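
For reference, the code that produced this plot was roughly the following sketch (names assume the cleansed dataframe from the Data Wrangler step):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only numeric columns, then draw the correlation matrix as a heatmap.
numeric_df = train_df_clean.select_dtypes(include="number")
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation between numeric columns")
plt.show()
```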

As you can see, 'Survived' has a high correlation with 'Sex_female' and 'Fare'. Now, if you are short of ideas for exploratory analysis, you can ask Copilot! I did that, and here is the result:


EDA suggestions from Copilot

Why stop here? Let's ask for code for each of the suggestions. To my surprise, Copilot provides the results equally fast. Snapshot below.


Code suggestions from Copilot.

I use the code suggested for "Proportions of passengers who survived by age group". Below is the result:


Proportions of passengers who survived by age group
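
The suggested code boils down to binning 'Age' and averaging 'Survived' per bin. A sketch (the bin edges are my assumption, not necessarily what Copilot generated):

```python
import pandas as pd

# Bin passengers by age and compute the share who survived in each bin.
bins = [0, 12, 18, 35, 60, 80]
labels = ["Child", "Teen", "Young adult", "Adult", "Senior"]
train_df_clean["AgeGroup"] = pd.cut(train_df_clean["Age"], bins=bins, labels=labels)
survival_by_age = train_df_clean.groupby("AgeGroup", observed=True)["Survived"].mean()
print(survival_by_age)
survival_by_age.plot(kind="bar", ylabel="Proportion survived")
```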

Next, we will create a machine learning model. Let's ask Copilot to build an ML model and find its accuracy. Here is the result:


ML Model code from Copilot
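
A sketch along the lines of Copilot's suggestion: predict 'Survived' from the numeric features of the cleansed dataframe (the exact model and features in the screenshot may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Use the numeric columns as features, 'Survived' as the target.
X = train_df_clean.select_dtypes(include="number").drop(columns=["Survived"])
y = train_df_clean["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```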

We can also start a chat with Copilot and ask it to change the algorithm to DecisionTree. Here is the result:


Chat with Copilot

As you can see, it understands where to modify the code, highlighting the changes in green/red. We can Accept or Discard them accordingly.
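
In effect, the accepted change is just a swap of the estimator; the rest of the training cell stays intact. Something like:

```python
from sklearn.tree import DecisionTreeClassifier

# Same train/test split as before; only the model class changes.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```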

Finally, I ask Copilot to log the parameters and the model using MLflow. It rewrites the code as shown below:


Using MLflow
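
A sketch of the rewritten cell: log a parameter, the accuracy metric, and the fitted model (Fabric provides a built-in MLflow tracking store; the experiment name here is illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

mlflow.set_experiment("titanic-survival")
with mlflow.start_run():
    # Record which algorithm was used, how it scored, and the model itself.
    mlflow.log_param("model_type", type(model).__name__)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```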

To summarize, this blog post explains how to increase a data scientist's productivity by using Fabric Data Science notebooks along with Visual Studio Code and GitHub Copilot. It walks through enabling GitHub Copilot, the Synapse VS Code extension, and the Fabric integration, then showcases a workflow using the Titanic dataset, including Data Wrangler for easy operations. It highlights GitHub Copilot's role in exploratory data analysis and ML model building, underlining its error-fixing capabilities. Overall, the post demonstrates how these tools streamline tasks, from data handling to machine learning, enhancing productivity.


