How to query Azure Blob using DuckDB

Vivek Anandaraman

Help Project Managers Estimate and Track AWS Cost during Build using Jira | Mentor | Speaker

Published: January 20, 2024

DuckDB is quickly establishing itself as the default query engine for CSV and Parquet files, not only on local file stores but also in object stores, thanks to its fast vectorized query engine.

The steps below set up an Azure service principal for querying Azure Blob Storage with DuckDB.

Register an app in Azure

Sign in to the Azure portal. Browse to Applications > App registrations, then select New registration.
Name the application, for example "example-app".
Copy the Directory (tenant) ID and Application (client) ID values for later use.

https://learn.microsoft.com/en-us/entra/identity-platform/howto-create-service-principal-portal

Assign a role to the application

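Select Subscriptions, then select your subscription.
Select Access control (IAM).
Select Add > Add role assignment.
In the Role tab, select the Storage Blob Data Reader role.
In the Members tab, select the application you registered, then review and assign.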

Setup Authentication

Browse to Identity > Applications > App registrations, then select your application.
Select Certificates & secrets.
Select Client secrets, and then select New client secret.
Copy the client secret value for later use; it is not shown again after you leave the page.
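
Rather than pasting the copied values directly into a script, one option is to read them from environment variables. The following is a minimal sketch, not part of the original walkthrough: the helper name load_azure_kwargs and the four environment variable names are assumptions chosen for illustration. The returned keys match the keyword arguments passed to AzureBlobFileSystem in the query script below.

```python
import os

# Hypothetical variable names (an assumption); use whatever your environment defines.
ENV_TO_KWARG = {
    "AZURE_STORAGE_ACCOUNT": "account_name",
    "AZURE_TENANT_ID": "tenant_id",
    "AZURE_CLIENT_ID": "client_id",
    "AZURE_CLIENT_SECRET": "client_secret",
}


def load_azure_kwargs(environ=os.environ):
    """Return keyword arguments for AzureBlobFileSystem, read from environ."""
    # Fail early with a clear message if any value is absent or empty.
    missing = [name for name in ENV_TO_KWARG if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return {kwarg: environ[name] for name, kwarg in ENV_TO_KWARG.items()}
```

With the variables set, the filesystem could then be created as AzureBlobFileSystem(**load_azure_kwargs()), which keeps the client secret out of the source file.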

Query csv file

Now that the credentials are set up, we are ready to query the CSV file in Python.

import duckdb
from adlfs.spec import AzureBlobFileSystem

active_directory_application_id = "Your Application ID"
active_directory_application_secret = "Your Client Secret"
active_directory_tenant_id = "Your Tenant ID"
accountname = "Your Storage account name"

connection = duckdb.connect()

# Register an fsspec filesystem so DuckDB can resolve abfs:// URLs
connection.register_filesystem(
    AzureBlobFileSystem(
        account_name=accountname,
        tenant_id=active_directory_tenant_id,
        client_id=active_directory_application_id,
        client_secret=active_directory_application_secret,
    )
)

query = connection.sql('''
  SELECT count(*) FROM read_csv_auto('abfs://container/path/blob.csv')
''')
print(query.fetchall())

That is all it takes to query a CSV file. You can also query Parquet files using read_parquet.
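
To illustrate the Parquet case, here is a small sketch that builds the same count query for either file type, choosing read_csv_auto or read_parquet from the blob's extension. The helper name build_count_query is hypothetical and not part of DuckDB or adlfs; it only assembles the SQL string, which would then be run on the connection from the script above.

```python
def build_count_query(container, blob_path):
    """Build a row-count query for a CSV or Parquet blob addressed via abfs://."""
    url = f"abfs://{container}/{blob_path.lstrip('/')}"
    # Double any single quotes so the URL is a valid SQL string literal.
    literal = url.replace("'", "''")
    lowered = blob_path.lower()
    if lowered.endswith(".parquet"):
        reader = "read_parquet"
    elif lowered.endswith(".csv"):
        reader = "read_csv_auto"
    else:
        raise ValueError(f"unsupported file type: {blob_path}")
    return f"SELECT count(*) FROM {reader}('{literal}')"
```

For example, connection.sql(build_count_query("container", "path/blob.parquet")).fetchall() would count the rows of a Parquet blob, assuming the filesystem has been registered as shown earlier.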
