Get Started with Data Science in Microsoft Fabric
Image provided by Microsoft

Data science is at the core of making informed decisions within any organization. Whether you're managing stock levels in a supermarket to prevent waste, or analyzing customer data to craft personalized offers, data science provides the tools to unlock valuable insights. However, taking a data science project from concept to deployment can be a complex task, requiring the right tools for each phase of the process. This is where Microsoft Fabric comes into play. As a unified platform, Microsoft Fabric simplifies the data science workflow by offering an all-in-one solution for managing and executing data projects. From data ingestion to visualization, Fabric provides a seamless experience, enabling data scientists to focus on deriving actionable insights and building AI models.

In this article, we'll explore how Microsoft Fabric can support you throughout the data science lifecycle, particularly in developing machine learning models. You'll also get hands-on experience with tools and features that Microsoft Fabric offers for each step of your journey.

The Role of Machine Learning in Data Science

A common way to extract insights from data is by visualizing it. When dealing with complex datasets, you may need to dive deeper to identify intricate patterns. This deeper analysis is essential for generating valuable insights, and as a data scientist, you can use these patterns to train machine learning models.

Machine learning models are designed to recognize patterns and make predictions based on the data provided. For example, with these models, you can predict how many products you're likely to sell in the coming week. However, training the model is only one part of the data science process.

Before diving into a typical data science process, it's crucial to understand the common machine learning models you can train. These models play a key role in generating new insights and making more accurate predictions.

Exploring Common Machine Learning Models

The purpose of machine learning is to train models that can identify patterns in large amounts of data. These patterns can then be used to make predictions, providing new insights that enable more informed decision-making.

The possibilities with machine learning may seem endless, so let’s begin by understanding the four most common types of machine learning models:

Image provided by Microsoft

  1. Classification: Used to predict categorical values, such as determining whether a customer may churn.
  2. Regression: Used to predict numerical values, such as the price of a product.
  3. Clustering: Organizes similar data points into clusters or groups, allowing for segmentation of information.
  4. Forecasting: Predicts future numerical values based on time-series data, such as expected sales for the upcoming month.

To decide which type of machine learning model you need to train, it’s essential to first understand the business problem and the available data. This preliminary analysis helps determine the correct approach and optimize the desired outcomes.

Understanding the Data Science Process

To train a machine learning model, the process commonly involves the following steps:

Image provided by Microsoft

  1. Define the problem: Together with business users and analysts, decide what the model should predict and what defines its success.
  2. Get the data: Find data sources and store that data in a lakehouse for access and processing.
  3. Prepare the data: Load the data from the lakehouse into a notebook, explore, clean, and transform the data according to the model's requirements.
  4. Train the model: Choose an algorithm and hyperparameter values through trial and error, tracking your experiments with MLflow.
  5. Generate insights: Use batch scoring to generate the requested predictions from the model.

As a data scientist, most of your time will be spent preparing the data and training the model.

You can prepare and train a model using open-source libraries in the programming language of your choice. For example, if you're working with Python, you can use Pandas and NumPy to prepare the data and train models with libraries such as Scikit-Learn, PyTorch, or SynapseML.
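
As a quick illustration, here is a minimal sketch of that pattern, assuming a hypothetical sales dataset (the file path and column names are placeholders, not anything Fabric provides):

# A minimal sketch: prepare data with Pandas, train with Scikit-Learn.
# "sales.csv" and its column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("sales.csv").dropna()           # load and drop missing rows
X = df[["price", "promotion", "week"]].values    # hypothetical features
y = df["units_sold"].values                      # hypothetical label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))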

When experimenting, it’s important to maintain an overview of all the models you've trained and understand how your choices influence the model's success. By tracking your experiments with MLflow in Microsoft Fabric, you can easily manage and deploy the models you've developed.

Explore and Process Data with Microsoft Fabric

Data is the cornerstone of data science, especially when the goal is to train a machine learning model. Typically, models perform better as the size of the training dataset increases. However, in addition to the quantity of data, the quality of the data is also crucial.

To ensure both the quality and quantity of your data, it is worthwhile to utilize Microsoft Fabric's robust data ingestion and processing engines. The platform offers flexibility, allowing you to choose between a low-code or code-first approach when setting up essential data ingestion, exploration, and transformation pipelines.

These tools facilitate the preparation of your dataset, enabling data scientists to focus on deep analysis and the effective training of their machine learning models. By using Microsoft Fabric, you can ensure that your data workflow is optimized in terms of both scale and quality, supporting more accurate and effective predictive modeling.

Ingest Your Data into Microsoft Fabric

To work with data in Microsoft Fabric, the first step is to ingest the data. You can ingest data from multiple sources, including both local and cloud data sources. For instance, you can ingest data from a CSV file stored on your local machine or from Azure Data Lake Storage (Gen2).

Once you connect to a data source, you can save the data into a Microsoft Fabric lakehouse. This lakehouse serves as a central location to store any structured, semi-structured, and unstructured files. It provides a unified storage solution that simplifies data management.

With your data stored in the lakehouse, you can easily connect to it whenever you need to access it for exploration or transformation. This centralized approach not only streamlines your workflow but also ensures that all your data is readily available for analysis and model training.
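
For example, once a CSV file has been ingested into the lakehouse attached to your notebook, loading it takes only a couple of lines of PySpark (a sketch; the file name is a hypothetical placeholder):

# Read a file from the attached lakehouse's Files section.
# "Files/sales.csv" is a hypothetical path.
df = spark.read.csv("Files/sales.csv", header=True, inferSchema=True)
display(df)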

Explore and Transform Your Data

As a data scientist, you may be most comfortable writing and executing code in notebooks. Microsoft Fabric provides a familiar notebook experience powered by Spark compute.

Apache Spark is an open-source parallel processing framework designed for large-scale data processing and analytics. It allows you to handle vast amounts of data efficiently.

Notebooks in Microsoft Fabric are automatically attached to Spark compute. When you run a cell in a notebook for the first time, a new Spark session is initiated. This session persists as you run subsequent cells, enabling a continuous workflow. To optimize costs, the Spark session will automatically stop after a period of inactivity, but you also have the option to stop the session manually.

While working in a notebook, you can choose the programming language that best suits your needs. For data science workloads, you will likely be using PySpark (Python) or SparkR (R). This flexibility allows you to leverage the tools and libraries you are most comfortable with while taking advantage of Spark's powerful processing capabilities.

Image provided by Microsoft

Prepare Your Data with the Data Wrangler

To expedite the process of exploring and transforming your data, Microsoft Fabric provides the easy-to-use Data Wrangler.

Upon launching the Data Wrangler, you will receive a descriptive overview of the data you are working with. This includes summary statistics that help identify any issues, such as missing values.

To clean your data, you can utilize any of the built-in data-cleaning operations available in the Data Wrangler. When you select an operation, a preview of the result and the corresponding code is automatically generated for you. This feature allows you to see the impact of your changes in real time.

Once you have selected all the necessary operations for data cleaning and transformation, you can easily export these transformations as code and execute them on your data. This seamless integration not only streamlines your workflow but also enhances productivity by minimizing the amount of manual coding required.
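
The exported code is plain Pandas, wrapped in a function you can rerun or adapt. A sketch of what a generated cell might look like (the specific cleaning steps depend on the operations you selected):

import pandas as pd

# Sketch of the kind of function Data Wrangler exports; the steps shown
# here (dropping duplicates and missing values) are illustrative only.
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()
    df = df.dropna()
    return df

# Hypothetical input dataframe for illustration
df = pd.DataFrame({"sales": [10, 10, None, 25], "store": ["A", "A", "B", "C"]})
df_clean = clean_data(df.copy())
print(df_clean)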

Train and Score Models with Microsoft Fabric

Once you have ingested, explored, and preprocessed your data, you can proceed to use that data to train a model. Training a model is an iterative process, and it's essential to track your work effectively.

Microsoft Fabric integrates seamlessly with MLflow, allowing you to easily track and log your work. This integration enables you to review your progress at any time, helping you determine the best approach for training your final model. By tracking your work, you ensure that your results are easily reproducible, which is crucial for validating your findings.

Any work you want to track can be organized as experiments in MLflow. This structured approach not only keeps your workflow organized but also allows you to compare different models and their performances efficiently. With Microsoft Fabric and MLflow, you can streamline the model training process, making it easier to refine your approach and achieve optimal results.

Understand Experiments

Whenever you train a model in a notebook that you want to track, you create an experiment in Microsoft Fabric.

An experiment can consist of multiple runs, with each run representing a specific task executed in a notebook, such as training a machine learning model. This structure allows you to keep your work organized and facilitates easy comparisons between different approaches.

For instance, if you are training a machine learning model for sales forecasting, you might experiment with various training datasets using the same algorithm. Each time you train a model with a different dataset, you create a new experiment run. This way, you can systematically compare the performance of each run to identify the best-performing model.

By leveraging the concept of experiments in Microsoft Fabric, you can enhance your model development process, ensuring that you make data-driven decisions based on thorough analysis and experimentation.
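
In code, this maps to one MLflow experiment containing one run per training attempt. A minimal sketch, assuming two hypothetical training datasets generated here as random arrays:

import mlflow
import numpy as np
from sklearn.linear_model import LinearRegression

mlflow.set_experiment("sales-forecasting")  # hypothetical experiment name

# Two hypothetical training datasets to compare
datasets = {
    "last-quarter": (np.random.rand(100, 3), np.random.rand(100)),
    "full-year": (np.random.rand(400, 3), np.random.rand(400)),
}

# One run per dataset, all grouped under the same experiment
for name, (X, y) in datasets.items():
    with mlflow.start_run(run_name=name):
        mlflow.autolog()  # logs parameters, metrics, and the model
        LinearRegression().fit(X, y)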

Start Tracking Metrics

To effectively compare experiment runs, you can track parameters, metrics, and artifacts for each run in Microsoft Fabric.

All parameters, metrics, and artifacts you track in an experiment run are displayed in the experiments overview. You can view each experiment run individually in the Run details tab or compare across multiple runs using the Run list. This organized approach makes it easier to analyze your results and draw meaningful conclusions.

Image provided by Microsoft

By tracking your work with MLflow, you gain valuable insights into your model training iterations. This capability allows you to assess which configuration yielded the best model for your specific use case. Whether it's optimizing hyperparameters or evaluating different algorithms, having a clear overview of your tracked metrics ensures that you make informed decisions based on data.
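
Autologging covers most situations, but you can also log values explicitly when you need full control over what is tracked. A minimal sketch (the parameter, metric, and file names are hypothetical):

import mlflow

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)   # a hyperparameter you chose
    mlflow.log_metric("rmse", 12.3)         # a metric you computed yourself
    # mlflow.log_artifact("residuals.png")  # any local file, such as a plot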

Understand Models

After you train a model, the next step is to use it for scoring. Scoring involves applying the model to new data to generate predictions or insights. When you train and track a model using MLflow, the artifacts associated with the model, along with its metadata, are stored within the experiment run.

In Microsoft Fabric, you can save these artifacts as a model. By registering your model artifacts in Microsoft Fabric, you gain the ability to easily manage your models over time.

Whenever you train a new model and save it under the same name, you effectively add a new version to the model. This versioning system allows you to keep track of improvements, experiment with different configurations, and revert to previous versions if necessary. With this structured approach to model management, you can ensure that your best-performing models are easily accessible and organized.

Image provided by Microsoft
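
If you prefer code over the UI, you can also register a tracked model with the MLflow API. A sketch, assuming a run that logged its model under the artifact path "model" (the run ID and model name are placeholders):

import mlflow

run_id = "<run-id-of-your-best-run>"  # placeholder: copy this from the experiment
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "model-sales-forecast")  # hypothetical name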

Use a Model to Generate Insights

To leverage a model for generating predictions, you can utilize the PREDICT function in Microsoft Fabric. This function is designed to seamlessly integrate with MLflow models, enabling you to generate batch predictions with ease.

For example, imagine you receive weekly sales data from several stores. Based on historical sales data, you’ve trained a model that predicts the sales for the upcoming week, using the sales figures from previous weeks as input. After tracking this model with MLflow and saving it in Microsoft Fabric, you can automate the prediction process.

Whenever new weekly sales data arrives, you simply use the PREDICT function to allow the model to generate the forecast for the next week. The forecasted sales data is then stored as a table in a lakehouse. This table can be visualized in a Power BI report, making the insights easily accessible for business users to consume.

By using the PREDICT function in conjunction with MLflow and Power BI, you can create a streamlined workflow that transforms raw data into actionable insights, supporting informed decision-making across the organization.
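
As a rough illustration, PREDICT can be used from a notebook by wrapping a registered MLflow model in a Spark transformer. A minimal sketch, assuming a registered model named sales-forecast and a Spark dataframe new_sales_df with the expected feature columns (all names here are hypothetical):

from synapse.ml.predict import MLFlowTransformer

model = MLFlowTransformer(
    inputCols=["week_minus_1", "week_minus_2"],  # hypothetical feature columns
    outputCol="predicted_sales",
    modelName="sales-forecast",                  # hypothetical registered model
    modelVersion=1,
)

# Score the new data and save the results as a lakehouse table
predictions = model.transform(new_sales_df)
predictions.write.format("delta").mode("overwrite").saveAsTable("sales_forecast")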

Let's Take a Lab

In this lab, you will take a hands-on approach to data science by ingesting data, exploring it in a notebook, processing it with the Data Wrangler, and training two types of machine learning models. By completing all these steps, you'll gain valuable experience with the data science features available in Microsoft Fabric.

Throughout the lab, you will have the opportunity to learn about various components such as notebooks, the Data Wrangler, experiments, and models. This practical experience will enhance your understanding of machine learning and model tracking within the Microsoft Fabric environment.

The lab is designed to be completed in approximately 20 minutes, providing you with a quick yet comprehensive introduction to the capabilities of Microsoft Fabric for data science.

Create a workspace

Before working with data in Fabric, create a workspace with the Fabric trial enabled.

  • Navigate to the Microsoft Fabric home page at https://app.fabric.microsoft.com/home?experience=fabric in a browser.
  • Select Synapse Data Science.
  • In the menu bar on the left, select Workspaces.
  • Create a new workspace with a name of your choice, selecting a licensing mode that includes Fabric capacity (Trial, Premium, or Fabric).
  • When your new workspace opens, it should be empty.

Create a notebook

To run code, you can create a notebook. Notebooks provide an interactive environment in which you can write and run code (in multiple languages).

  • In the Synapse Data Science home page, create a new Notebook.
  • Select the first cell (which is currently a code cell), and then in the dynamic tool bar at its top-right, use the M↓ button to convert the cell to a markdown cell.
  • Use the Edit button to switch the cell to editing mode, then delete the content and enter the following text:

# Data science in Microsoft Fabric        

Get the data

Now you’re ready to run code to get data and train a model. You’ll work with the diabetes dataset from the Azure Open Datasets. After loading the data, you’ll convert the data to a Pandas dataframe: a common structure for working with data in rows and columns.

  • In your notebook, use the + Code icon below the latest cell output to add a new code cell to the notebook.
  • Enter the following code in the new code cell.
  • Use the Run cell button on the left of the cell to run it. Alternatively, you can press SHIFT + ENTER on your keyboard to run a cell.

# Azure storage access info for the diabetes open dataset
blob_account_name = "azureopendatastorage"
blob_container_name = "mlsamples"
blob_relative_path = "diabetes"
blob_sas_token = r"" # Blank since the container allows anonymous access

# Build the path and set the Spark config to access blob storage
wasbs_path = f"wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}"
spark.conf.set(f"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net", blob_sas_token)
print("Remote blob path: " + wasbs_path)

# Read the Parquet data; Spark evaluates lazily, so no data is loaded yet
df = spark.read.parquet(wasbs_path)

display(df)

The output shows the rows and columns of the diabetes dataset.

  • There are two tabs at the top of the rendered table: Table and Chart. Select Chart.
  • Select Customize chart at the top right of the chart to change the visualization.
  • Change the chart to the following settings:
      • Chart Type: Box plot
      • Key: Leave empty
      • Values: Y
  • Select Apply to render the new visualization and explore the output.

Prepare the data

Now that you have ingested and explored the data, you can transform the data. You can either run code in a notebook, or use the Data Wrangler to generate code for you.

  • The data is loaded as a Spark dataframe. While the Data Wrangler accepts either Spark or Pandas dataframes, it is currently optimized to work with Pandas. Therefore, you will convert the data to a Pandas dataframe. Run the following code in your notebook:

df = df.toPandas()
df.head()        

  • Select Data Wrangler in the notebook ribbon, and then select the df dataset. When Data Wrangler launches, it generates a descriptive overview of the dataframe in the Summary panel.
  • Select the Y column in the Data Wrangler. Note that there is a decrease in frequency for the 220-240 bin. The 75th percentile, 211.5, roughly aligns with the transition between the two regions in the histogram. Let's use this value as the threshold for low and high risk.
  • Navigate to the Operations panel, expand Formulas, and then select Create column from formula.
  • Create a new column with the following settings:
      • Column name: Risk
      • Column formula: (df['Y'] > 211.5).astype(int)
  • Review the new column Risk that is added to the preview. Verify that the count of rows with value 1 is roughly 25% of all rows (as it's the 75th percentile of Y).
  • Select Apply.
  • Select Add code to notebook.
  • Run the cell with the code that is generated by Data Wrangler.
  • Run the following code in a new cell to verify that the Risk column is shaped as expected:

df_clean.describe()        

Train machine learning models

Now that you've prepared the data, you can use it to train a machine learning model to predict diabetes. You can train two different types of models with this dataset: a regression model (predicting Y) or a classification model (predicting Risk). You'll train the models using the scikit-learn library and track them with MLflow.

Train a regression model

  • Run the following code to split the data into a training and test dataset, and to separate the features from the label Y you want to predict:

from sklearn.model_selection import train_test_split
    
X, y = df_clean[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df_clean['Y'].values
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)        

  • Add another new code cell to the notebook, enter the following code in it, and run it:

import mlflow
experiment_name = "diabetes-regression"
mlflow.set_experiment(experiment_name)        

The code creates an MLflow experiment named diabetes-regression. Your models will be tracked in this experiment.

  • Add another new code cell to the notebook, enter the following code in it, and run it:

from sklearn.linear_model import LinearRegression
    
with mlflow.start_run():
    mlflow.autolog()

    model = LinearRegression()
    model.fit(X_train, y_train)

The code trains a regression model using Linear Regression. Parameters, metrics, and artifacts are automatically logged with MLflow.
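
If you also want to check performance on the held-out test set directly in the notebook, you can score it yourself (an optional sketch; autologging already records the training metrics):

# Optional: evaluate the regression model on the test set
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("R^2:", r2_score(y_test, y_pred))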

Train a classification model

  • Run the following code to split the data into a training and test dataset, and to separate the features from the label Risk you want to predict:

from sklearn.model_selection import train_test_split
    
X, y = df_clean[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df_clean['Risk'].values
    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)        

  • Add another new code cell to the notebook, enter the following code in it, and run it:

import mlflow
experiment_name = "diabetes-classification"
mlflow.set_experiment(experiment_name)        

The code creates an MLflow experiment named diabetes-classification. Your models will be tracked in this experiment.

  • Add another new code cell to the notebook, enter the following code in it, and run it:

from sklearn.linear_model import LogisticRegression
    
with mlflow.start_run():
    mlflow.sklearn.autolog()

    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)        

The code trains a classification model using Logistic Regression. Parameters, metrics, and artifacts are automatically logged with MLflow.
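
As with the regression model, you can optionally score the test set yourself (a sketch; autologging already records the training metrics):

# Optional: evaluate the classification model on the test set
from sklearn.metrics import accuracy_score

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))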

Explore your experiments

Microsoft Fabric keeps track of all your experiments and allows you to explore them visually.

  • Navigate to your workspace from the hub menu bar on the left.
  • Select the diabetes-regression experiment to open it.
  • Review the Run metrics to explore how accurate your regression model is.
  • Navigate back to the home page and select the diabetes-classification experiment to open it.
  • Review the Run metrics to explore the accuracy of the classification model. Note that the metrics are different because you trained a different type of model. You can also query the same run data from code, as shown in the sketch below.
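
A minimal sketch of that query using the MLflow search API (the experiment name matches the one created earlier; the metric columns come from autologging):

import mlflow

# List all runs of the regression experiment as a Pandas dataframe
runs = mlflow.search_runs(experiment_names=["diabetes-regression"])

# Show only the autologged metric columns
print(runs.filter(like="metrics.", axis=1).head())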

Save the model

After comparing machine learning models that you’ve trained across experiments, you can choose the best performing model. To use the best performing model, save the model and use it to generate predictions.

  • Select Save as ML model in the experiment ribbon.
  • Select Create a new ML model in the newly opened pop-up window.
  • Select the model folder.
  • Name the model model-diabetes, and select Save.
  • Select View ML model in the notification that appears at the top right of your screen when the model is created. You can also refresh the window. The saved model is linked under ML model versions.

Note that the model, the experiment, and the experiment run are linked, allowing you to review how the model is trained.

Save the notebook and end the Spark session

Now that you’ve finished training and evaluating the models, you can save the notebook with a meaningful name and end the Spark session.

  • In the notebook menu bar, use the Settings icon to view the notebook settings.
  • Set the Name of the notebook to Train and compare models, and then close the settings pane.
  • On the notebook menu, select Stop session to end the Spark session.

Clean up resources

In this exercise, you have created a notebook and trained a machine learning model. You used scikit-learn to train the model and MLflow to track its performance.

If you’ve finished exploring your model and experiments, you can delete the workspace that you created for this exercise.

  • In the bar on the left, select the icon for your workspace to view all of the items it contains.
  • In the menu on the toolbar, select Workspace settings.
  • In the General section, select Remove this workspace.

Conclusion

Microsoft Fabric provides a comprehensive and centralized workspace to facilitate data science projects from beginning to end. The journey begins with defining the problem, followed by identifying and ingesting the necessary data into the platform. Once your data is ingested, you can explore and prepare it using powerful tools like notebooks or the intuitive Data Wrangler.

As you progress through your data science project, you can effectively track your work with experiments to ensure that your model training is organized and reproducible. Finally, the built-in PREDICT function enables you to leverage your trained models to generate valuable insights from new data.

By utilizing Microsoft Fabric, data scientists can streamline their workflows, enhance collaboration, and ultimately derive actionable insights that drive informed decision-making within their organizations. Whether you're a seasoned data scientist or just starting, Microsoft Fabric offers the tools you need to succeed in your data-driven endeavors.


