How to query Azure Blob using DuckDB
Query Azure blob using DuckDB

How to query Azure Blob using DuckDB

DuckDB is quickly establishing itself as the default query engine for csv file and parquet files, not only in File store but also in Object stores. Thanks to it super fast vectorized query engines.

Steps below to setup an Azure Service Principal to query Azure Blob using DuckDB

Register an app in Azure

  1. Sign in to the Azure portal. Browse to Applications > App registrations then select New registration.
  2. Name the application, for example "example-app".
  3. Copy the Directory (tenant) ID value and Client ID for later use.

https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-service-principal-portal

Assign a role to the application

  1. Select Subscriptions
  2. Select IAM
  3. Add Role Assignment
  4. In the Role tab, select the Storage Blob Data Reader role

https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-service-principal-portal

Setup Authentication

  1. Browse to Identity > Applications > App registrations, then select your application.
  2. Select Certificates & secrets.
  3. Select Client secrets, and then Select New client secret.
  4. Copy the Client secret for later use.

https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-service-principal-portal

Query csv file

Now that we have setup the credentials we are ready to query the csv file in Python.

import duckdb
from adlfs.spec import AzureBlobFileSystem

active_directory_application_id = "Your Application ID"
active_directory_application_secret = "Your Client Secret"
active_directory_tenant_id = "Your Tenant ID"
accountname = "Your Storage account name"

connection = duckdb.connect()

connection.register_filesystem(AzureBlobFileSystem(account_name=accountname, tenant_id=active_directory_tenant_id, client_id = active_directory_application_id, client_secret = active_directory_application_secret ))

query = connection.sql('''
  SELECT count(*) FROM read_csv_auto('abfs://container/path/blob.csv')
''')
print(query.fetchall())        

That all it takes to query a csv file, you can also query parquet file using read_parquet.

要查看或添加评论,请登录

Vivek Anandaraman的更多文章

社区洞察

其他会员也浏览了