Python package installation in Microsoft Fabric Notebooks

When working in distributed environments such as Spark clusters in Microsoft Fabric, required Python packages must be installed on all nodes (driver and executors) so that your code runs without errors. Most people reach for pip install, but the question is: should you use !pip or %pip?

Understanding !pip vs. %pip

  • !pip: Executes as a shell command and installs packages only on the driver node. This means that while the driver has access to the installed package, the executor nodes do not, leading to potential errors during distributed computations.
  • %pip: Functions as an IPython magic command, installing the specified package on both the driver and all executor nodes. This ensures that all parts of the Spark cluster have access to the necessary libraries.

So, %pip is the preferred choice for installing packages in Microsoft Fabric Spark notebooks.
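
For example, in a notebook cell (the package name here is simply the one used later in this article):

# Installs the package only on the driver node; executors will not see it
!pip install ibis-framework

# Installs the package on the driver and all executor nodes for the current session
%pip install ibis-framework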

Practical Example

Consider a scenario where you need to install the 'ibis-framework' library and use it across your Spark cluster.

The code below shows whether an installed Python package is available only on the driver node or on the executor nodes as well.

import os
import pkg_resources
import pandas as pd

def get_package_status(node_type, package_name="ibis-framework"):
    """
    Check the installation status of a package on a node.
    """
    try:
        package = pkg_resources.get_distribution(package_name)
        return {
            "NodeType": node_type,
            "NodeID": os.environ.get("NM_HOST", f"{node_type}_ID_Unknown"),
            "PackageStatus": f"{package.project_name}=={package.version}"
        }
    except pkg_resources.DistributionNotFound:
        return {
            "NodeType": node_type,
            "NodeID": os.environ.get("NM_HOST", f"{node_type}_ID_Unknown"),
            "PackageStatus": f"{package_name} is not installed"
        }

def verify_ibis_framework():
    """
    Verify the installation of the `ibis-framework` package on the driver and executors.
    """
    # Check on Driver
    driver_result = get_package_status("Driver")

    # Check on Executors
    def check_on_executor(_):
        return get_package_status("Executor")

    # Distribute many small tasks so the check runs across the executor nodes
    # (`spark` is the SparkSession that Fabric notebooks provide by default)
    executor_results = spark.sparkContext.parallelize(range(10000), 10000).map(check_on_executor).collect()

    # Combine results into a DataFrame
    results = [driver_result] + executor_results
    
    return pd.DataFrame(results).drop_duplicates()

# Run verification and display results
result_df = verify_ibis_framework()

display(result_df)        

As anticipated, using !pip resulted in the package being installed only on the driver node, leaving the executor nodes without it, which ultimately causes distributed jobs that rely on the package to fail.


Using %pip ensured the library was installed on all nodes in the cluster, allowing the jobs to complete successfully without any errors.
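
To see the failure mode in practice, here is a minimal sketch (the function name is illustrative, and it assumes the default spark session available in Fabric notebooks) of a task that imports the package inside executor code. After a !pip installation it raises ModuleNotFoundError on the executors; after %pip it returns the installed version:

def use_ibis_on_executor(_):
    # This import runs on an executor, not the driver; it fails with
    # ModuleNotFoundError if the package was installed with !pip (driver-only)
    import ibis
    return ibis.__version__

versions = spark.sparkContext.parallelize(range(4), 4).map(use_ibis_on_executor).collect()
print(set(versions))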

Another key reason to avoid !pip is that it does not work in notebooks executed from Fabric pipelines. Instead, use %pip together with the _inlineInstallationEnabled parameter in the notebook activity settings.
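
A rough sketch of that setup (the exact way the parameter is configured in your pipeline may differ): the notebook activity passes _inlineInstallationEnabled as a Boolean base parameter set to True, and the notebook itself keeps using %pip.

# In the pipeline's notebook activity, add a base parameter
# (assumption: configured in the activity's settings UI or JSON):
#   Name: _inlineInstallationEnabled   Type: bool   Value: True
#
# The notebook cell then installs inline as usual:
%pip install ibis-framework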

%pip and High-Concurrency Sessions

In high-concurrency (HC) sessions, %pip is not supported, and you must use !pip for installing libraries (though the exact reason for this limitation is still under investigation). For a more reliable solution in HC mode, it’s recommended to create and use pre-configured environments instead.

Reference: Microsoft docs - Manage Apache Spark libraries - Microsoft Fabric | Microsoft Learn


