Get Started with Data Science in Microsoft Fabric
Rafael Luz
Azure Cloud Solution Architect - Data, AI & Machine Learning | Data Architect | Data Engineer | Data Scientist | Trusted Advisor | Leading Expert in Innovative AI Solutions
Data science is at the core of making informed decisions within any organization. Whether you're managing stock levels in a supermarket to prevent waste or analyzing customer data to craft personalized offers, data science provides the tools to unlock valuable insights. However, taking a data science project from concept to deployment can be a complex task, requiring the right tools for each phase of the process. This is where Microsoft Fabric comes into play. As a unified platform, Microsoft Fabric simplifies the data science workflow by offering an all-in-one solution for managing and executing data projects. From data ingestion to visualization, Fabric provides a seamless experience, enabling data scientists to focus on deriving actionable insights and building AI models.
In this article, we'll explore how Microsoft Fabric can support you throughout the data science lifecycle, particularly in developing machine learning models. You'll also get hands-on experience with tools and features that Microsoft Fabric offers for each step of your journey.
Understanding the Data Science Process
A common way to extract insights from data is by visualizing it. When dealing with complex datasets, you may need to dive deeper to identify intricate patterns. This deeper analysis is essential for generating valuable insights, and as a data scientist, you can use these patterns to train machine learning models.
Machine learning models are designed to recognize patterns and make predictions based on the data provided. For example, with these models, you can predict how many products you're likely to sell in the coming week. However, training the model is only one part of the data science process.
Before diving into a typical data science process, it's crucial to understand the common machine learning models you can train. These models play a key role in generating new insights and making more accurate predictions.
Exploring Common Machine Learning Models
The purpose of machine learning is to train models that can identify patterns in large amounts of data. These patterns can then be used to make predictions, providing new insights that enable more informed decision-making.
The possibilities with machine learning may seem endless, so let's begin by understanding the four most common types of machine learning models:

- Classification: predict a categorical value, such as whether a customer is likely to churn.
- Regression: predict a numerical value, such as the price of a product.
- Clustering: group similar data points into clusters, or segments.
- Forecasting: predict future numerical values based on time-series data, such as expected sales for the coming month.
To decide which type of machine learning model you need to train, it’s essential to first understand the business problem and the available data. This preliminary analysis helps determine the correct approach and optimize the desired outcomes.
Following the Data Science Process
To train a machine learning model, the process commonly involves the following steps:

1. Define the problem: determine what the model should predict and what counts as success.
2. Get the data: identify data sources and obtain access.
3. Prepare the data: explore the data by reading it into a dataframe, then clean and transform it based on the model's requirements.
4. Train the model: choose an algorithm and hyperparameter values, iterating through experiments.
5. Generate insights: use the model to score new data and produce predictions.
As a data scientist, most of your time will be spent preparing the data and training the model.
You can prepare and train a model using open-source libraries in the programming language of your choice. For example, if you're working with Python, you can use Pandas and NumPy to prepare the data, and train models with libraries such as Scikit-Learn, PyTorch, or SynapseML.
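As a quick illustration of that stack, here's a minimal sketch using synthetic data (the column names and values are purely illustrative) that prepares a dataframe with Pandas and NumPy and trains a Scikit-Learn model:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data: weekly sales as a function of price and promotions
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "price": rng.uniform(1, 10, 100),
    "promotions": rng.integers(0, 2, 100),
})
data["weekly_sales"] = 100 - 5 * data["price"] + 20 * data["promotions"] + rng.normal(0, 5, 100)

# Prepare the features and target, then fit a simple regression model
X, y = data[["price", "promotions"]], data["weekly_sales"]
model = LinearRegression().fit(X, y)
print(model.coef_)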
When experimenting, it’s important to maintain an overview of all the models you've trained and understand how your choices influence the model's success. By tracking your experiments with MLflow in Microsoft Fabric, you can easily manage and deploy the models you've developed.
Explore and Process Data with Microsoft Fabric
Data is the cornerstone of data science, especially when the goal is to train a machine learning model. Typically, models perform better as the size of the training dataset increases; however, the quality of the data is just as crucial as its quantity.
To ensure both the quality and quantity of your data, it is worthwhile to utilize Microsoft Fabric's robust data ingestion and processing engines. The platform offers flexibility, allowing you to choose between a low-code or code-first approach when setting up essential data ingestion, exploration, and transformation pipelines.
These tools facilitate the preparation of your dataset, enabling data scientists to focus on deep analysis and the effective training of their machine learning models. By using Microsoft Fabric, you can ensure that your data workflow is optimized in terms of both scale and quality, supporting more accurate and effective predictive modeling.
Ingest Your Data into Microsoft Fabric
To work with data in Microsoft Fabric, the first step is to ingest the data. You can ingest data from multiple sources, including both local and cloud data sources. For instance, you can ingest data from a CSV file stored on your local machine or from Azure Data Lake Storage (Gen2).
Once you connect to a data source, you can save the data into a Microsoft Fabric lakehouse. This lakehouse serves as a central location to store any structured, semi-structured, and unstructured files. It provides a unified storage solution that simplifies data management.
With your data stored in the lakehouse, you can easily connect to it whenever you need to access it for exploration or transformation. This centralized approach not only streamlines your workflow but also ensures that all your data is readily available for analysis and model training.
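For example, once a notebook is attached to a lakehouse, saving ingested data as a Delta table can be as simple as the following sketch (the file and table names here are illustrative):

# Read a CSV file uploaded to the lakehouse's Files section
df = spark.read.csv("Files/sales.csv", header=True, inferSchema=True)

# Save the data as a Delta table in the lakehouse for later use
df.write.mode("overwrite").format("delta").saveAsTable("sales")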
Explore and Transform Your Data
As a data scientist, you may be most comfortable writing and executing code in notebooks. Microsoft Fabric provides a familiar notebook experience powered by Spark compute.
Apache Spark is an open-source parallel processing framework designed for large-scale data processing and analytics. It allows you to handle vast amounts of data efficiently.
Notebooks in Microsoft Fabric are automatically attached to Spark compute. When you run a cell in a notebook for the first time, a new Spark session is initiated. This session persists as you run subsequent cells, enabling a continuous workflow. To optimize costs, the Spark session will automatically stop after a period of inactivity, but you also have the option to stop the session manually.
While working in a notebook, you can choose the programming language that best suits your needs. For data science workloads, you will likely be using PySpark (Python) or SparkR (R). This flexibility allows you to leverage the tools and libraries you are most comfortable with while taking advantage of Spark's powerful processing capabilities.
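For instance, a typical PySpark cell might read a lakehouse table and inspect it (assuming a table named sales exists in the attached lakehouse):

# Read a table from the attached lakehouse into a Spark DataFrame
df = spark.read.table("sales")

# Inspect the schema and a sample of rows
df.printSchema()
display(df.limit(10))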
Prepare Your Data with the Data Wrangler
To expedite the process of exploring and transforming your data, Microsoft Fabric provides the easy-to-use Data Wrangler.
Upon launching the Data Wrangler, you will receive a descriptive overview of the data you are working with. This includes summary statistics that help identify any issues, such as missing values.
To clean your data, you can utilize any of the built-in data-cleaning operations available in the Data Wrangler. When you select an operation, a preview of the result and the corresponding code is automatically generated for you. This feature allows you to see the impact of your changes in real time.
Once you have selected all the necessary operations for data cleaning and transformation, you can easily export these transformations as code and execute them on your data. This seamless integration not only streamlines your workflow but also enhances productivity by minimizing the amount of manual coding required.
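The exported code takes the form of a Pandas function you can run in a notebook cell. As a rough sketch (the actual operations and comments depend on what you selected in the Data Wrangler):

import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing values (an example Data Wrangler operation)
    df = df.dropna()
    # Drop duplicate rows (another example operation)
    df = df.drop_duplicates()
    return df

df_clean = clean_data(df.copy())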
Train and Score Models with Microsoft Fabric
Once you have ingested, explored, and preprocessed your data, you can proceed to use that data to train a model. Training a model is an iterative process, and it's essential to track your work effectively.
Microsoft Fabric integrates seamlessly with MLflow, allowing you to easily track and log your work. This integration enables you to review your progress at any time, helping you determine the best approach for training your final model. By tracking your work, you ensure that your results are easily reproducible, which is crucial for validating your findings.
Any work you want to track can be organized as experiments in MLflow. This structured approach not only keeps your workflow organized but also allows you to compare different models and their performances efficiently. With Microsoft Fabric and MLflow, you can streamline the model training process, making it easier to refine your approach and achieve optimal results.
Understand Experiments
Whenever you train a model in a notebook that you want to track, you create an experiment in Microsoft Fabric.
An experiment can consist of multiple runs, with each run representing a specific task executed in a notebook, such as training a machine learning model. This structure allows you to keep your work organized and facilitates easy comparisons between different approaches.
For instance, if you are training a machine learning model for sales forecasting, you might experiment with various training datasets using the same algorithm. Each time you train a model with a different dataset, you create a new experiment run. This way, you can systematically compare the performance of each run to identify the best-performing model.
By leveraging the concept of experiments in Microsoft Fabric, you can enhance your model development process, ensuring that you make data-driven decisions based on thorough analysis and experimentation.
Start Tracking Metrics
To effectively compare experiment runs, you can track parameters, metrics, and artifacts for each run in Microsoft Fabric.
All parameters, metrics, and artifacts you track in an experiment run are displayed in the experiments overview. You can view each experiment run individually in the Run details tab or compare across multiple runs using the Run list. This organized approach makes it easier to analyze your results and draw meaningful conclusions.
By tracking your work with MLflow, you gain valuable insights into your model training iterations. This capability allows you to assess which configuration yielded the best model for your specific use case. Whether it's optimizing hyperparameters or evaluating different algorithms, having a clear overview of your tracked metrics ensures that you make informed decisions based on data.
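Alongside MLflow's autologging, you can also log values explicitly within a run. A minimal sketch (the parameter and metric names and values are illustrative):

import mlflow

with mlflow.start_run():
    # Log a hyperparameter choice and an evaluation metric for this run
    mlflow.log_param("estimator", "LinearRegression")
    mlflow.log_metric("rmse", 53.2)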
Understand Models
After you train a model, the next step is to use it for scoring. Scoring involves applying the model to new data to generate predictions or insights. When you train and track a model using MLflow, the artifacts associated with the model, along with its metadata, are stored within the experiment run.
In Microsoft Fabric, you can save these artifacts as a model. By registering your model artifacts in Microsoft Fabric, you gain the ability to easily manage your models over time.
Whenever you train a new model and save it under the same name, you effectively add a new version to the model. This versioning system allows you to keep track of improvements, experiment with different configurations, and revert to previous versions if necessary. With this structured approach to model management, you can ensure that your best-performing models are easily accessible and organized.
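You can save a model from an experiment run in the Fabric UI, or register it from code with MLflow. A minimal sketch, assuming X_train and y_train already exist and using an illustrative model name:

import mlflow
from sklearn.linear_model import LinearRegression

with mlflow.start_run():
    model = LinearRegression().fit(X_train, y_train)
    # Log the trained model and register it under a name; saving under the
    # same name again creates a new version of the model
    mlflow.sklearn.log_model(model, "model", registered_model_name="sales-forecast")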
Use a Model to Generate Insights
To leverage a model for generating predictions, you can utilize the PREDICT function in Microsoft Fabric. This function is designed to seamlessly integrate with MLflow models, enabling you to generate batch predictions with ease.
For example, imagine you receive weekly sales data from several stores. Based on historical sales data, you’ve trained a model that predicts the sales for the upcoming week, using the sales figures from previous weeks as input. After tracking this model with MLflow and saving it in Microsoft Fabric, you can automate the prediction process.
Whenever new weekly sales data arrives, you simply use the PREDICT function to allow the model to generate the forecast for the next week. The forecasted sales data is then stored as a table in a lakehouse. This table can be visualized in a Power BI report, making the insights easily accessible for business users to consume.
By using the PREDICT function in conjunction with MLflow and Power BI, you can create a streamlined workflow that transforms raw data into actionable insights, supporting informed decision-making across the organization.
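In a Fabric notebook, PREDICT is exposed through the MLFlowTransformer class. A minimal sketch for the sales scenario above, where the model name, version, and column names are illustrative and new_weekly_sales is assumed to be a Spark DataFrame containing the model's input columns:

from synapse.ml.predict import MLFlowTransformer

# Wrap the saved model so it can score a Spark DataFrame in batch
model = MLFlowTransformer(
    inputCols=["store", "prior_week_sales"],
    outputCol="forecast",
    modelName="sales-forecast",
    modelVersion=1,
)

# Score the new weekly data and save the results as a lakehouse table,
# ready to be visualized in a Power BI report
predictions = model.transform(new_weekly_sales)
predictions.write.mode("overwrite").format("delta").saveAsTable("sales_forecast")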
Let's Take a Lab
In this lab, you will take a hands-on approach to data science by ingesting data, exploring it in a notebook, processing it with the Data Wrangler, and training two types of machine learning models. By completing all these steps, you'll gain valuable experience with the data science features available in Microsoft Fabric.
Throughout the lab, you will have the opportunity to learn about various components such as notebooks, the Data Wrangler, experiments, and models. This practical experience will enhance your understanding of machine learning and model tracking within the Microsoft Fabric environment.
The lab is designed to be completed in approximately 20 minutes, providing you with a quick yet comprehensive introduction to the capabilities of Microsoft Fabric for data science.
Create a workspace
Before working with data in Fabric, create a workspace with the Fabric trial enabled.
Create a notebook
To run code, you can create a notebook. Notebooks provide an interactive environment in which you can write and run code in multiple languages. You can also add markdown cells to document your work; for example, convert the first cell to markdown and enter:

# Data science in Microsoft Fabric
Get the data
Now you’re ready to run code to get data and train a model. You’ll work with the diabetes dataset from the Azure Open Datasets. After loading the data, you’ll convert the data to a Pandas dataframe: a common structure for working with data in rows and columns.
# Azure storage access info for the diabetes open dataset
blob_account_name = "azureopendatastorage"
blob_container_name = "mlsamples"
blob_relative_path = "diabetes"
blob_sas_token = r""  # Blank since the container allows anonymous access

# Build the blob path and set the Spark config to access blob storage
wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name), blob_sas_token)
print("Remote blob path: " + wasbs_path)

# Read the Parquet data; Spark evaluates lazily, so nothing is loaded until needed
df = spark.read.parquet(wasbs_path)
display(df)
The output shows the rows and columns of the diabetes dataset.
Prepare the data
Now that you have ingested and explored the data, you can transform the data. You can either run code in a notebook, or use the Data Wrangler to generate code for you.
# Convert the Spark DataFrame to a Pandas dataframe for local exploration
df = df.toPandas()
df.head()
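The cleaning code itself would typically come from the Data Wrangler. As a minimal stand-in, the following sketch creates df_clean and also derives the binary Risk label used by the classification model later (using the median of Y as the threshold is an illustrative assumption, not a value from the original lab):

# Create a cleaned copy of the data; in practice, use the code you
# exported from the Data Wrangler
df_clean = df.dropna().drop_duplicates()

# Derive a binary risk label from the numeric target Y; the median
# threshold here is an illustrative assumption
df_clean["Risk"] = (df_clean["Y"] > df_clean["Y"].median()).astype(int)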
# Inspect summary statistics of the cleaned dataframe
df_clean.describe()
Train machine learning models
Now that you've prepared the data, you can use it to train a machine learning model to predict diabetes. You can train two different types of models with this dataset: a regression model (predicting the numeric target Y) or a classification model (predicting the binary Risk label). You'll train the models using the scikit-learn library and track them with MLflow.
Train a regression model
from sklearn.model_selection import train_test_split
X, y = df_clean[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df_clean['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
import mlflow
experiment_name = "diabetes-regression"
mlflow.set_experiment(experiment_name)
The code creates an MLflow experiment named diabetes-regression. Your models will be tracked in this experiment.
from sklearn.linear_model import LinearRegression

with mlflow.start_run():
    mlflow.autolog()

    model = LinearRegression()
    model.fit(X_train, y_train)
The code trains a regression model using linear regression. Parameters, metrics, and artifacts are automatically logged with MLflow.
Train a classification model
from sklearn.model_selection import train_test_split
X, y = df_clean[['AGE','SEX','BMI','BP','S1','S2','S3','S4','S5','S6']].values, df_clean['Risk'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
import mlflow
experiment_name = "diabetes-classification"
mlflow.set_experiment(experiment_name)
The code creates an MLflow experiment named diabetes-classification. Your models will be tracked in this experiment.
from sklearn.linear_model import LogisticRegression

with mlflow.start_run():
    mlflow.sklearn.autolog()

    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)
The code trains a classification model using logistic regression. Parameters, metrics, and artifacts are automatically logged with MLflow.
Explore your experiments
Microsoft Fabric keeps track of all your experiments and allows you to explore them visually.
Save the model
After comparing the machine learning models you've trained across experiments, you can choose the best-performing one. To use it, save the model and then use it to generate predictions.
Note that the model, the experiment, and the experiment run are linked, allowing you to review how the model was trained.
Save the notebook and end the Spark session
Now that you’ve finished training and evaluating the models, you can save the notebook with a meaningful name and end the Spark session.
Clean up resources
In this exercise, you have created a notebook and trained a machine learning model. You used scikit-learn to train the model and MLflow to track its performance.
If you’ve finished exploring your model and experiments, you can delete the workspace that you created for this exercise.
Conclusion
Microsoft Fabric provides a comprehensive and centralized workspace to facilitate data science projects from beginning to end. The journey begins with defining the problem, followed by identifying and ingesting the necessary data into the platform. Once your data is ingested, you can explore and prepare it using powerful tools like notebooks or the intuitive Data Wrangler.
As you progress through your data science project, you can effectively track your work with experiments to ensure that your model training is organized and reproducible. Finally, the built-in PREDICT function enables you to leverage your trained models to generate valuable insights from new data.
By utilizing Microsoft Fabric, data scientists can streamline their workflows, enhance collaboration, and ultimately derive actionable insights that drive informed decision-making within their organizations. Whether you're a seasoned data scientist or just starting, Microsoft Fabric offers the tools you need to succeed in your data-driven endeavors.