Python package installation in Microsoft Fabric Notebooks
Harshadeep Guggilla
Data Engineer | Microsoft Azure and Fabric | Cloudera On-Prem | ML, AI Enthusiast | Trying to learn new things every day
When working in a distributed environment like a Spark cluster in Microsoft Fabric, ensuring that required Python packages are installed across all nodes (driver and executors) is essential for your code to run without errors. People normally use pip install to install Python packages, but the question is: should you use !pip or %pip?
Understanding !pip vs. %pip
- !pip: Executes as a shell command and installs packages only on the driver node. This means that while the driver has access to the installed package, the executor nodes do not, leading to potential errors during distributed computations.
- %pip: Functions as an IPython magic command, installing the specified package on both the driver and all executor nodes. This ensures that all parts of the Spark cluster have access to the necessary libraries.
So, %pip is the preferred choice for installing packages in Microsoft Fabric Spark notebooks.
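For illustration, the two forms look like this in a notebook cell (ibis-framework, the library used in the example below, is just a sample package here):

# Runs as a shell command: installs only on the driver node
!pip install ibis-framework

# Runs as an IPython magic command: installs on the driver and all executor nodes
%pip install ibis-framework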
Practical Example
Consider a scenario where you need to install the 'ibis-framework' library and use it across your Spark cluster.
The following code shows whether an installed Python package is available only on the driver node or on the executor nodes as well.
import os
import pkg_resources
import pandas as pd

def get_package_status(node_type, package_name="ibis-framework"):
    """
    Check the installation status of a package on a node.
    """
    try:
        package = pkg_resources.get_distribution(package_name)
        return {
            "NodeType": node_type,
            "NodeID": os.environ.get("NM_HOST", f"{node_type}_ID_Unknown"),
            "PackageStatus": f"{package.project_name}=={package.version}"
        }
    except pkg_resources.DistributionNotFound:
        return {
            "NodeType": node_type,
            "NodeID": os.environ.get("NM_HOST", f"{node_type}_ID_Unknown"),
            "PackageStatus": f"{package_name} is not installed"
        }

def verify_ibis_framework():
    """
    Verify the installation of the `ibis-framework` package on the driver and executors.
    """
    # Check on the driver
    driver_result = get_package_status("Driver")

    # Check on the executors
    def check_on_executor(_):
        return get_package_status("Executor")

    # Distribute the task across many partitions so it reaches every executor node
    executor_results = spark.sparkContext.parallelize(range(10000), 10000).map(check_on_executor).collect()

    # Combine results into a DataFrame
    results = [driver_result] + executor_results
    return pd.DataFrame(results).drop_duplicates()

# Run the verification and display the results
result_df = verify_ibis_framework()
display(result_df)
As anticipated, using !pip resulted in the package being installed only on the driver node, leaving the executor nodes without it, which ultimately causes distributed jobs to fail.
Using %pip ensured the library was installed on all nodes in the cluster, allowing the jobs to complete successfully without any errors.
Another key reason to avoid using !pip is that it doesn’t work with notebooks executed in Fabric pipelines. Instead, you can use %pip along with the _inlineInstallationEnabled parameter in the notebook activity settings.
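As a rough sketch of that configuration (parameter name as given in the Microsoft Learn page referenced at the end; the exact labels in the notebook activity UI may differ slightly), the activity's base parameters would include:
- Name: _inlineInstallationEnabled
- Type: bool
- Value: True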
%pip and High-Concurrency Sessions
In high-concurrency (HC) sessions, %pip is not supported, and you must use !pip for installing libraries (though the exact reason for this limitation is still under investigation). For a more reliable solution in HC mode, it’s recommended to create and use pre-configured environments instead.
Reference: Microsoft docs - Manage Apache Spark libraries - Microsoft Fabric | Microsoft Learn