Querying Parquet, CSV Using DuckDB and Python on Amazon S3
Remesh Govind N M
VP Data Eng. | AWS Certified Architect | Software Delivery | Helping Startups / IT Driven companies with Data Integration, Big Data, Mobile Applications, iOS, Android, Cloud, Web
Introduction:
This article will show you how to access Parquet and CSV files stored on Amazon S3 with DuckDB, a highly efficient in-process analytical database. The examples are written in Python. We will connect to the S3 bucket, load data into DuckDB, and then query and filter the data without the Pandas library. This approach provides significant performance benefits when working with large datasets, because DuckDB optimizes data processing and uses parallel execution.
Prerequisites:
Be sure to have these items before proceeding:
1. Python 3.x installed on your machine.
2. The DuckDB Python package installed
(this can be done via pip: "pip install duckdb"; depending on your OS it may be "pip3 install duckdb", since pip sometimes defaults to Python 2).
3. Access credentials (access key, secret key, and bucket name) for the S3 bucket you want to target. You can get a free AWS account here.
4. Internet access (duh!)
To state the obvious: some code lines may require you to scroll horizontally.
Step 1: Install Required Libraries
The boto3 library is required to connect to S3 and access Parquet or CSV files. Install it with pip by running the following command:
# shell
pip install boto3
If you are a Jupyter notebook / Google Colab person, then try:
! pip install boto3
Why? The ! prefix passes the command to the shell.
Step 2: Connecting to S3
We need to supply the required credentials in order to interact with Amazon S3. Import the necessary modules in a Python script:
import duckdb
import boto3
# AWS credentials
ACCESS_KEY = 'your-access-key'
SECRET_KEY = 'your-secret-key'
BUCKET_NAME = 'your-bucket-name'
# Create S3 Client
s3_client = boto3.client(
's3',
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY
)
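One caveat worth flagging: DuckDB reads s3:// URLs through its own httpfs extension, not through boto3, so the credentials also need to be handed to DuckDB itself. The boto3 client above is still handy for bucket operations such as listing objects. A minimal sketch of the DuckDB side (the region value below is an assumption; use your bucket's actual region):
# Configure DuckDB's httpfs extension so s3:// paths resolve
duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")
duckdb.execute("SET s3_region='us-east-1'")  # assumption: adjust to your bucket's region
duckdb.execute(f"SET s3_access_key_id='{ACCESS_KEY}'")
duckdb.execute(f"SET s3_secret_access_key='{SECRET_KEY}'")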
Step 3: Loading Parquet or CSV Data into DuckDB
The duckdb.query() function will be used to load the Parquet and CSV files into DuckDB. It accepts a SQL query and returns the results as a DuckDB relation.
Let's say we have two files named mypqdata.parquet and mycsvdata.csv at the root of the bucket. Execute the following code to load them:
# Load Parquet file
parquet_data = duckdb.query(
    f"SELECT * FROM parquet_scan('s3://{BUCKET_NAME}/mypqdata.parquet')"
)
# Load CSV file
csv_data = duckdb.query(
    f"SELECT * FROM read_csv_auto('s3://{BUCKET_NAME}/mycsvdata.csv')"
)
Tip: scroll across to read.
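If you would rather materialize the S3 data into a local DuckDB table once, instead of re-scanning S3 on every query, something like the following should work. The table name sales here is just an illustration:
# Hypothetical: copy the Parquet data from S3 into a local DuckDB table once
duckdb.execute(f"""
    CREATE TABLE sales AS
    SELECT * FROM parquet_scan('s3://{BUCKET_NAME}/mypqdata.parquet')
""")
print(duckdb.query("SELECT COUNT(*) FROM sales").fetchall())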
Step 4: Filtering and Querying Data
After loading the Parquet and CSV files into DuckDB, we can perform various query and filtering operations. DuckDB lets us use familiar SQL syntax.
Here is an example of querying the Parquet data. Because the query runs in the same Python scope, DuckDB's replacement scan lets us reference the parquet_data relation by its variable name:
# Query Parquet data
result = duckdb.query("SELECT * FROM parquet_data WHERE department = 'Sales'")
Here's an example of filtering the CSV data:
# Filter CSV data
filtered_result = duckdb.query("SELECT * FROM csv_data WHERE column_name = 'desired_value'")
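The same filter can also be written with DuckDB's relational API instead of a SQL string, which is handy for building queries up step by step. A small sketch, reusing the placeholder column and value from the example above:
# Equivalent filter via the relational API; relations are chainable and lazy
filtered = csv_data.filter("column_name = 'desired_value'")
print(filtered.limit(5).fetchall())  # peek at the first five matching rows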
Step 5: Fetching and Processing Results
The fetchall() method can be used to retrieve and process the query results. It returns the results as a list of tuples, where each tuple represents one row in the result set.
Pro tip: if you are coming from an Apache Spark background, this is your df.collect() call.
# Get and process results
rows = result.fetchall()
for row in rows:
    # Process the row
    print(row)
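For result sets too large to pull into a single list, rows can be fetched one at a time instead. A sketch assuming fetchone() on the relation, which returns None once the rows are exhausted:
# Stream rows one at a time instead of materializing the whole result
while True:
    row = result.fetchone()
    if row is None:
        break
    print(row)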
Summary:
I hope this article has helped you gain some quick insight into DuckDB. Thanks for reading. Please comment and let me know if there is anything else you would like to see covered.