Listing Blobs in Azure Databricks: A Performance Comparison

I recently had to fetch a list of blobs for processing in Azure, and I wondered which method would be fastest, since there are several ways to do it. To find out, I ran an experiment against a storage account holding more than 500 blobs, repeating each method for 10 iterations: first fetching the container names, then listing the blobs within each container.

Below are the core methods I tested, each using a different approach to interact with Azure Blob Storage:

  • dbutils.fs.ls
  • list_blob_names()
  • list_blobs()
  • walk_blobs()

# Setup: the SDK-based functions below assume a BlobServiceClient.
# Here it is built from a connection string read from an environment
# variable (adjust to your own auth method, e.g. DefaultAzureCredential).
import os
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# dbutils.fs.ls (Databricks utility) — assumes the container is mounted at /mnt/<name>
def list_blobs_dbutils(container_name):
    mount_point = f"/mnt/{container_name}"
    blobs = dbutils.fs.ls(mount_point)
    return [blob.path for blob in blobs]

# list_blob_names() (Azure Blob SDK) — yields bare blob names only
def list_blob_names_sdk(container_name):
    container_client = blob_service_client.get_container_client(container_name)
    return list(container_client.list_blob_names())

# list_blobs() (Azure Blob SDK) — yields BlobProperties objects
def list_blobs_sdk(container_name):
    container_client = blob_service_client.get_container_client(container_name)
    return [blob.name for blob in container_client.list_blobs()]

# walk_blobs() (Azure Blob SDK) — traverses the hierarchy one level at a time
def walk_blobs_sdk(container_name):
    container_client = blob_service_client.get_container_client(container_name)
    return [blob.name for blob in container_client.walk_blobs()]

Each iteration measured execution time and blob count returned.

# Number of iterations for the test
iterations = 10

# Lists to store execution times
dbutils_times = []
list_blob_names_times = []
list_blobs_times = []
walk_blobs_times = []

# Select a container for performance testing
# You can choose any container with a reasonable number of blobs
test_container = container_names[5]  # Replace with desired container name if needed

for i in range(iterations):
    print(f"\nIteration {i+1}:")
    
    # Test dbutils.fs.ls
    dbutils_time, dbutils_blob_count = test_dbutils_fs_ls(test_container)
    if dbutils_time is not None:
        dbutils_times.append(dbutils_time)
        print(f"dbutils.fs.ls time: {dbutils_time:.6f} seconds, Blob count: {dbutils_blob_count}")
    else:
        print("dbutils.fs.ls failed.")
        dbutils_times.append(float('nan'))  # Use NaN to indicate failure
    
    # Test list_blob_names()
    blob_names_time, blob_names_count = test_list_blob_names(test_container)
    list_blob_names_times.append(blob_names_time)
    print(f"list_blob_names() time: {blob_names_time:.6f} seconds, Blob count: {blob_names_count}")
    
    # Test list_blobs()
    blobs_time, blobs_count = test_list_blobs(test_container)
    list_blobs_times.append(blobs_time)
    print(f"list_blobs() time: {blobs_time:.6f} seconds, Blob count: {blobs_count}")
    
    # Test walk_blobs()
    walk_time, walk_count = test_walk_blobs(test_container)
    walk_blobs_times.append(walk_time)
    print(f"walk_blobs() time: {walk_time:.6f} seconds, Blob count: {walk_count}")        

The results are shown and explained below:


[Charts: per-iteration execution times and average execution times for the four methods]


dbutils.fs.ls

Easiest for quick, Databricks-native listings but had the highest overhead (especially on the first call, perhaps due to mounting).

list_blob_names()

Often the fastest. This method returns only names (no metadata), so it’s lightweight if you just need the blob identifiers.

list_blobs()

Provides more details than list_blob_names() (e.g., metadata if requested), so it’s slightly slower in some cases.

walk_blobs()

Useful for recursively traversing any pseudo-directory structure. Performance is mid-range but can be extremely convenient for deeper hierarchies.
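To illustrate what "pseudo-directory" traversal means: `walk_blobs()` treats the `/` in blob names as a delimiter and yields one hierarchy level at a time, collapsing deeper names into prefix entries. Here is a rough pure-Python illustration of that grouping (no SDK calls; the function name and sample blob names are my own):

```python
def first_level(blob_names, delimiter="/"):
    """Group flat blob names the way walk_blobs() presents the top level:
    names without the delimiter are blobs; names containing it collapse
    into a single '<prefix>/' pseudo-directory entry."""
    entries = []
    seen_prefixes = set()
    for name in blob_names:
        if delimiter in name:
            prefix = name.split(delimiter, 1)[0] + delimiter
            if prefix not in seen_prefixes:
                seen_prefixes.add(prefix)
                entries.append(prefix)
        else:
            entries.append(name)
    return entries

names = ["root.txt", "logs/2024/a.log", "logs/2024/b.log", "data/x.csv"]
print(first_level(names))  # → ['root.txt', 'logs/', 'data/']
```

With the real SDK, each prefix entry would itself be walkable, which is what makes `walk_blobs()` convenient for deep hierarchies even though it issues more requests than a flat listing.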

Overall, if you want the fastest approach to simply enumerate blob names in a container, list_blob_names() is typically your best bet. But if you’re working within Databricks and just need a quick listing, dbutils.fs.ls is a one-liner—albeit with a bit more overhead. And if you need recursive listing or more details, walk_blobs() or list_blobs() can be your go-to.

Practical takeaway? If you’re running performance-sensitive operations—like large-scale batch jobs that need frequent blob listings—list_blob_names() is a great choice. However, if you’re just doing quick explorations within Databricks notebooks and want an easy, built-in solution, dbutils.fs.ls remains a convenient option. Ultimately, choosing the right listing method depends on how often you need to list blobs, how you intend to use the results, and your performance constraints.
