Fabric Data Science with GitHub Copilot in VS Code


There is a lot of excitement surrounding Microsoft Fabric's Copilot, but in the interim there is another avenue for exploring the world of copiloting alongside Fabric. Delving into that alternative approach is precisely what this blog post aims to do.

Prerequisites

I have already enabled GitHub Copilot in VS Code using the steps mentioned here and here. I also suggest you enable Copilot Chat inside VS Code using this link. Once you have set up all the prerequisites for GitHub Copilot in VS Code, let's complete the setup for the Fabric VS Code integration. Follow this documentation for detailed instructions. The setup is summarized below:

  1. Install the Synapse VS Code extension. Complete the prerequisites and make sure JAVA_HOME and Conda are in your system PATH variables.
  2. If you face problems during the Conda installation, I suggest you follow this Stack Overflow thread: python - 'Conda' is not recognized as internal or external command - Stack Overflow
  3. Familiarize yourself with the features of the Synapse VS Code extension, especially around listing and publishing notebooks.
  4. You can either open a notebook directly from VS Code or click the VS Code button within a Fabric notebook, as shown below.


VS Code extension in Fabric Notebook
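
Before going further, you can sanity-check the prerequisites from step 1. A quick, purely illustrative Python sketch:

```python
import os
import shutil

# Hypothetical sanity check for the Synapse VS Code extension prerequisites:
# JAVA_HOME must be set, and conda must be resolvable on the system PATH.
java_home = os.environ.get("JAVA_HOME")
print(f"JAVA_HOME: {java_home or 'NOT SET'}")
print(f"conda on PATH: {shutil.which('conda') or 'NOT FOUND'}")
```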

If you have followed all the instructions and completed the setup, you should see something similar to this:


Synapse VS Code extension

You will notice the following items:

  1. The Fabric workspace, which holds all your notebooks and Lakehouses.
  2. The Synapse VS Code extension icon. Once you click on it, you can see all the notebooks, Spark Job Definitions, and Lakehouses.
  3. The GitHub Copilot Chat extension to assist you while writing code.

It is worth mentioning that the Synapse VS Code extension ships with the synapse-spark-kernel. This kernel provides the ability to execute code cells on the remote Fabric Spark compute. Once you select this kernel, the extension intercepts all PySpark API calls at runtime and translates them into the appropriate HTTP calls to the remote Spark compute, while pure Python code executes locally. This ensures efficient utilization of both local and remote resources.

synapse-spark-kernel
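
To make the local/remote split concrete, here is a minimal sketch (the `spark` session is pre-provisioned by the kernel; this is illustrative, not the extension's documented behavior in detail):

```python
# With the synapse-spark-kernel selected, a PySpark call like this is
# intercepted by the extension and forwarded over HTTP to the remote
# Fabric Spark compute.
spark.range(5).show()

# Plain Python, by contrast, runs in the local interpreter.
import platform
print("Running locally on:", platform.node())
```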

Let's start building a data science notebook. For this exercise, I will use the Titanic dataset, which is the "Hello World" of the data science world. Keeping it very simple!

I have already downloaded the data from Kaggle (link here) and uploaded it into my Fabric workspace.


Data uploaded to Fabric Workspace

Next, I will create a new data science notebook and attach the lakehouse which already has my Titanic datasets.


Data science Notebook

Once you click the VS Code button at the top of the notebook, it will open VS Code and you will see something similar to this:


Data Science Notebook in VS Code
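
The first cell simply loads the training data. A minimal sketch, assuming train.csv sits under a Files/titanic/ folder in the attached Lakehouse (the path is illustrative):

```python
# The `spark` session is pre-provisioned by the synapse-spark-kernel.
df = spark.read.csv("Files/titanic/train.csv", header=True, inferSchema=True)
train_df = df.toPandas()  # pandas copy, which Data Wrangler can open
display(train_df)
```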

Once you execute the cell, you will see output like the following:


Train Dataset display

If you look at the above image, you will note that there is an option to "Launch Data Wrangler". It is similar to the one you get in a Fabric Data Science notebook inside your workspace. Let's explore the Data Wrangler experience within VS Code.

The experience and functionality are very similar to those in the Fabric workspace.

Data Wrangler in VS Code

Click on any column and you can see all the operations that you can perform on it alongside Data Summary and Statistics.


Operations and Data Summary for the column Cabin

Let's try performing some operations on it. We do the following steps:

  1. Change the data type from text to int or float.
  2. Remove missing values.
  3. One-hot encode a categorical variable.
  4. Drop some unwanted columns.

Look at the steps we perform and the corresponding code.


Data Wrangling inside VS Code

We can copy the code and run it in our notebook. The code runs very fast because it leverages the Python kernel on my machine instead of the remote Spark kernel of Fabric.
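
The generated code is plain pandas. A sketch of what Data Wrangler might emit for the four steps above (the column choices are illustrative; the actual code depends on the steps you record):

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Change data type: 'Fare' to float (illustrative choice)
    df = df.astype({"Fare": "float64"})
    # 2. Remove rows with missing values in 'Age' and 'Embarked'
    df = df.dropna(subset=["Age", "Embarked"])
    # 3. One-hot encode the categorical 'Sex' column
    #    (produces 'Sex_female' and 'Sex_male')
    df = pd.get_dummies(df, columns=["Sex"], dtype=int)
    # 4. Drop unwanted columns
    df = df.drop(columns=["Cabin", "Ticket", "Name"])
    return df

train_df_clean = clean_data(train_df.copy())
```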

Now we will explore GitHub Copilot for exploratory data analysis (EDA). I will use GitHub Copilot Chat to perform EDA using natural language. First, I ask Copilot to use my cleansed dataframe as context and create a correlation plot between columns. Surprisingly, it was very quick to suggest a correlation chart using the seaborn (sns) package. I copy the code and run it. Err! I get an error message!


Error message while I execute the Copilot chat code

In the above image you can see the prompt chat in the left window, showing the code for my ask. When executed, it produces an error message: "No module named 'seaborn'". Normally, I would immediately search Stack Overflow for solutions. Luckily, GitHub Copilot is super powerful: I click on Quick Fix, select 'Fix using Copilot', and it automatically suggests the code change needed to remove the error.


Copilot Suggested changes.

I accept the changes and add them to my code. After installing seaborn on both the remote and local environments, I run the code. After removing some non-numeric columns, I get the correlation plot using seaborn, shown below.


Correlation plot using seaborn
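
For reference, the code that produced this plot was roughly the following sketch (names assume the cleansed dataframe from the Data Wrangler step):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only numeric columns, then draw the correlation matrix as a heatmap.
numeric_df = train_df_clean.select_dtypes(include="number")
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation between numeric columns")
plt.show()
```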

As you can see, 'Survived' has a high correlation with 'Sex_female' and 'Fare'. Now, if you are short of ideas for exploratory analysis, you can ask Copilot! I did that, and here is the result:


EDA suggestions from Copilot

Why stop here? Let's ask for code for each of the suggestions. To my surprise, Copilot provides the results equally fast. Snapshot below.


Code suggestions from Copilot.

I use the code suggested for "Proportions of passengers who survived by age group". Below is the result:


Proportions of passengers who survived by age group
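
The suggested code boils down to binning 'Age' and averaging 'Survived' per bin. A sketch (the bin edges are my assumption, not necessarily what Copilot generated):

```python
import pandas as pd

# Bin passengers by age and compute the share who survived in each bin.
bins = [0, 12, 18, 35, 60, 80]
labels = ["Child", "Teen", "Young adult", "Adult", "Senior"]
train_df_clean["AgeGroup"] = pd.cut(train_df_clean["Age"], bins=bins, labels=labels)
survival_by_age = train_df_clean.groupby("AgeGroup", observed=True)["Survived"].mean()
print(survival_by_age)
survival_by_age.plot(kind="bar", ylabel="Proportion survived")
```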

Next, we will create a machine learning model. Let's ask Copilot to build an ML model and find its accuracy. Here is the result:


ML Model code from Copilot
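
A sketch along the lines of Copilot's suggestion: predict 'Survived' from the numeric features of the cleansed dataframe (the exact model and features in the screenshot may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Use the numeric columns as features, 'Survived' as the target.
X = train_df_clean.select_dtypes(include="number").drop(columns=["Survived"])
y = train_df_clean["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```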

We can also start a chat with Copilot and ask it to change the algorithm to DecisionTree. Here is the result:


Chat with Copilot

As you can see, it understands where to modify the code, highlighting the changes in green/red. We can Accept or Discard them accordingly.
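
In effect, the accepted change is just a swap of the estimator; the rest of the training cell stays intact. Something like:

```python
from sklearn.tree import DecisionTreeClassifier

# Same train/test split as before; only the model class changes.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```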

Finally, I ask Copilot to log the parameters and the model using MLflow. It rewrites the code as shown below:


Using MLflow
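
A sketch of the rewritten cell: log a parameter, the accuracy metric, and the fitted model (Fabric provides a built-in MLflow tracking store; the experiment name here is illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score

mlflow.set_experiment("titanic-survival")
with mlflow.start_run():
    # Record which algorithm was used, how it scored, and the model itself.
    mlflow.log_param("model_type", type(model).__name__)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```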

To summarize, this blog post explains how to increase a data scientist's productivity by using Fabric Data Science notebooks along with Visual Studio Code and GitHub Copilot. It walks through enabling GitHub Copilot, the Synapse VS Code extension, and the Fabric integration, then showcases a workflow using the Titanic dataset, including Data Wrangler for easy operations. It highlights GitHub Copilot's role in exploratory data analysis and ML model building, underlining its error-fixing capabilities. Overall, the post demonstrates how these tools streamline tasks, from data handling to machine learning, enhancing productivity.


