Listing Blobs in Azure Databricks: A Performance Comparison
Umar Asif Qureshi
Senior Consultant Data & AI | Building Scalable Azure Cloud Solutions | Data Engineer
I recently had to fetch a list of blobs for processing in Azure, and since there are several ways to do it, I wondered which method would perform best. To find out, I ran an experiment against a storage account with more than 500 blobs, repeating each method for 10 iterations: first fetching the container names, then listing the blobs within each container.
Below are the core methods I tested, each using a different approach to interact with Azure Blob Storage:
# dbutils.fs.ls (Databricks utility)
def list_blobs_dbutils(container_name):
    mount_point = f"/mnt/{container_name}"
    blobs = dbutils.fs.ls(mount_point)
    return [blob.path for blob in blobs]

# list_blob_names() (Azure Blob SDK)
def list_blob_names_sdk(container_name):
    container_client = blob_service_client.get_container_client(container_name)
    return [name for name in container_client.list_blob_names()]

# list_blobs() (Azure Blob SDK)
def list_blobs_sdk(container_name):
    container_client = blob_service_client.get_container_client(container_name)
    return [blob.name for blob in container_client.list_blobs()]

# walk_blobs() (Azure Blob SDK)
def walk_blobs_sdk(container_name):
    container_client = blob_service_client.get_container_client(container_name)
    return [blob.name for blob in container_client.walk_blobs()]
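The SDK-based functions above assume a `blob_service_client` already exists, and the test harness below assumes a `container_names` list. A minimal setup sketch, assuming connection-string authentication (the connection string here is a placeholder; in Databricks you would typically pull it from a secret scope):

```python
from azure.storage.blob import BlobServiceClient

# Placeholder credential -- replace with your own, e.g. fetched via
# dbutils.secrets.get() from a secret scope.
connection_string = "<your-storage-connection-string>"
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# Fetch the container names once, up front.
container_names = [c.name for c in blob_service_client.list_containers()]
```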
Each iteration recorded the execution time and the number of blobs returned.
# Number of iterations for the test
iterations = 10
# Lists to store execution times
dbutils_times = []
list_blob_names_times = []
list_blobs_times = []
walk_blobs_times = []
# Select a container for performance testing
# You can choose any container with a reasonable number of blobs
test_container = container_names[5] # Replace with desired container name if needed
for i in range(iterations):
    print(f"\nIteration {i+1}:")

    # Test dbutils.fs.ls
    dbutils_time, dbutils_blob_count = test_dbutils_fs_ls(test_container)
    if dbutils_time is not None:
        dbutils_times.append(dbutils_time)
        print(f"dbutils.fs.ls time: {dbutils_time:.6f} seconds, Blob count: {dbutils_blob_count}")
    else:
        print("dbutils.fs.ls failed.")
        dbutils_times.append(float('nan'))  # Use NaN to indicate failure

    # Test list_blob_names()
    blob_names_time, blob_names_count = test_list_blob_names(test_container)
    list_blob_names_times.append(blob_names_time)
    print(f"list_blob_names() time: {blob_names_time:.6f} seconds, Blob count: {blob_names_count}")

    # Test list_blobs()
    blobs_time, blobs_count = test_list_blobs(test_container)
    list_blobs_times.append(blobs_time)
    print(f"list_blobs() time: {blobs_time:.6f} seconds, Blob count: {blobs_count}")

    # Test walk_blobs()
    walk_time, walk_count = test_walk_blobs(test_container)
    walk_blobs_times.append(walk_time)
    print(f"walk_blobs() time: {walk_time:.6f} seconds, Blob count: {walk_count}")
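The `test_*` wrappers called in the loop aren't shown in full; here is a minimal sketch of how such a timing helper could look. The name `time_listing` and its exact behavior are my assumption: it times a single call to one of the listing functions defined earlier and returns `(elapsed_seconds, blob_count)`, or `(None, None)` on failure so the harness can record NaN.

```python
import time

def time_listing(list_fn, container_name):
    """Time one call to a blob-listing function.

    Returns (elapsed_seconds, item_count), or (None, None) if the
    call raises -- letting the harness record NaN for failed runs.
    """
    try:
        start = time.perf_counter()
        items = list_fn(container_name)
        elapsed = time.perf_counter() - start
        return elapsed, len(items)
    except Exception as exc:
        print(f"Listing failed for '{container_name}': {exc}")
        return None, None

# Each test_* wrapper is then just a thin binding, e.g.:
# test_list_blob_names = lambda c: time_listing(list_blob_names_sdk, c)
```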
The results are shown and explained below:
dbutils.fs.ls
Easiest for quick, Databricks-native listings but had the highest overhead (especially on the first call, perhaps due to mounting).
list_blob_names()
Often the fastest. This method returns only names (no metadata), so it’s lightweight if you just need the blob identifiers.
list_blobs()
Provides more details than list_blob_names() (e.g., metadata if requested), so it’s slightly slower in some cases.
walk_blobs()
Useful for recursively traversing any pseudo-directory structure. Performance is mid-range but can be extremely convenient for deeper hierarchies.
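For those deeper hierarchies, `walk_blobs()` yields a `BlobPrefix` item for each virtual "directory" level rather than descending automatically. A hedged sketch of a full recursive traversal (the helper name `walk_all` is mine; `container_client` is the client obtained earlier via `get_container_client`):

```python
from azure.storage.blob import BlobPrefix

def walk_all(container_client, prefix=""):
    """Recursively yield every blob name under `prefix`, descending into
    the BlobPrefix items that walk_blobs() emits for pseudo-directories."""
    for item in container_client.walk_blobs(name_starts_with=prefix, delimiter="/"):
        if isinstance(item, BlobPrefix):
            yield from walk_all(container_client, prefix=item.name)
        else:
            yield item.name
```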
Overall, if you want the fastest approach to simply enumerate blob names in a container, list_blob_names() is typically your best bet. But if you’re working within Databricks and just need a quick listing, dbutils.fs.ls is a one-liner—albeit with a bit more overhead. And if you need recursive listing or more details, walk_blobs() or list_blobs() can be your go-to.
Practical takeaway? If you’re running performance-sensitive operations—like large-scale batch jobs that need frequent blob listings—list_blob_names() is a great choice. However, if you’re just doing quick explorations within Databricks notebooks and want an easy, built-in solution, dbutils.fs.ls remains a convenient option. Ultimately, choosing the right listing method depends on how often you need to list blobs, how you intend to use the results, and your performance constraints.