How to query Azure Blob using DuckDB
Vivek Anandaraman
Help Project Managers Estimate and Track AWS Cost during Build using Jira | Mentor | Speaker
DuckDB is quickly establishing itself as the default query engine for csv file and parquet files, not only in File store but also in Object stores. Thanks to it super fast vectorized query engines.
Steps below to setup an Azure Service Principal to query Azure Blob using DuckDB
Register an app in Azure
Assign a role to the application
Setup Authentication
Query csv file
Now that we have setup the credentials we are ready to query the csv file in Python.
import duckdb
from adlfs.spec import AzureBlobFileSystem
active_directory_application_id = "Your Application ID"
active_directory_application_secret = "Your Client Secret"
active_directory_tenant_id = "Your Tenant ID"
accountname = "Your Storage account name"
connection = duckdb.connect()
connection.register_filesystem(AzureBlobFileSystem(account_name=accountname, tenant_id=active_directory_tenant_id, client_id = active_directory_application_id, client_secret = active_directory_application_secret ))
query = connection.sql('''
SELECT count(*) FROM read_csv_auto('abfs://container/path/blob.csv')
''')
print(query.fetchall())
That all it takes to query a csv file, you can also query parquet file using read_parquet.