Querying Parquet, CSV Using DuckDB and Python on Amazon S3
Remesh Govind N M
VP Data Eng. | AWS Certified Architect | Software Delivery | Helping Startups / IT Driven companies with Data Integration, Big Data, Mobile Applications, iOS, Android, Cloud, Web
Introduction:
This article will show you how to access Parquet and CSV files stored on Amazon S3 with DuckDB, a highly efficient in-process analytical database. The examples are written in Python. We will connect to the S3 bucket, load data into DuckDB, and then query and filter the data without the Pandas library. This approach provides significant performance benefits when working with large datasets, because DuckDB optimizes data processing and uses parallel execution.
Prerequisites:
Be sure to have these items before proceeding:
1. Python 3.x installed on your machine.
2. The DuckDB Python package installed
(this can be done via pip: "pip install duckdb"; depending on your OS it may be "pip3 install duckdb", since pip sometimes defaults to Python 2).
3. Access credentials (access key, secret key, and bucket name) for the S3 bucket you want to target. You can get a free AWS account here.
4. Internet access (duh!)
To state the obvious: some code lines may require you to scroll horizontally.
Step 1: Install Required Libraries
The boto3 library is required to connect to S3 and access Parquet or CSV files. Install it with pip by running the following command:
# shell
pip install boto3
If you are a Jupyter notebook / Google Colab person, then try:
! pip install boto3
Why? The ! prefix passes the command to the shell.
Step 2: Connecting to S3
We need to supply the required credentials in order to interact with Amazon S3. Import the necessary modules in a Python script:
import duckdb
import boto3
# AWS credentials
ACCESS_KEY = 'your-access-key'
SECRET_KEY = 'your-secret-key'
BUCKET_NAME = 'your-bucket-name'
# Create S3 Client
s3_client = boto3.client(
's3',
aws_access_key_id=ACCESS_KEY,
aws_secret_access_key=SECRET_KEY
)
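One caveat worth flagging: DuckDB reads s3:// URLs through its own httpfs extension, not through boto3, so the credentials also need to be handed to DuckDB itself. The boto3 client above is still handy for bucket operations such as listing objects. A minimal sketch of the DuckDB side (the region value below is an assumption; use your bucket's actual region):
# Configure DuckDB's httpfs extension so s3:// paths resolve
duckdb.execute("INSTALL httpfs")
duckdb.execute("LOAD httpfs")
duckdb.execute("SET s3_region='us-east-1'")  # assumption: adjust to your bucket's region
duckdb.execute(f"SET s3_access_key_id='{ACCESS_KEY}'")
duckdb.execute(f"SET s3_secret_access_key='{SECRET_KEY}'")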
Step 3: Loading Parquet or CSV Data into DuckDB
The duckdb.query() function will be used to load the Parquet and CSV files into DuckDB. It accepts a SQL query and returns the results as a DuckDB relation.
Let's say we have two files named mypqdata.parquet and mycsvdata.csv at the root of the bucket. Execute the following code to load them:
# Load Parquet file
parquet_data = duckdb.query(
    f"SELECT * FROM parquet_scan('s3://{BUCKET_NAME}/mypqdata.parquet')"
)
# Load CSV file
csv_data = duckdb.query(
    f"SELECT * FROM read_csv_auto('s3://{BUCKET_NAME}/mycsvdata.csv')"
)
Tip: scroll across to read.
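If you would rather materialize the S3 data into a local DuckDB table once, instead of re-scanning S3 on every query, something like the following should work. The table name sales here is just an illustration:
# Hypothetical: copy the Parquet data from S3 into a local DuckDB table once
duckdb.execute(f"""
    CREATE TABLE sales AS
    SELECT * FROM parquet_scan('s3://{BUCKET_NAME}/mypqdata.parquet')
""")
print(duckdb.query("SELECT COUNT(*) FROM sales").fetchall())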
Step 4: Filtering and Querying Data
After loading the Parquet and CSV files into DuckDB, we can perform various query and filtering operations. DuckDB lets us use familiar SQL syntax.
Here is an example of querying the Parquet data. Because the query runs in the same Python scope, DuckDB's replacement scan lets us reference the parquet_data relation by its variable name:
# Query Parquet data
result = duckdb.query("SELECT * FROM parquet_data WHERE department = 'Sales'")
Here's an example of filtering the CSV data:
# Filter CSV data
filtered_result = duckdb.query("SELECT * FROM csv_data WHERE column_name = 'desired_value'")
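The same filter can also be written with DuckDB's relational API instead of a SQL string, which is handy for building queries up step by step. A small sketch, reusing the placeholder column and value from the example above:
# Equivalent filter via the relational API; relations are chainable and lazy
filtered = csv_data.filter("column_name = 'desired_value'")
print(filtered.limit(5).fetchall())  # peek at the first five matching rows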
Step 5: Fetching and Processing Results
The fetchall() method can be used to retrieve and process the query results. It returns the results as a list of tuples, where each tuple represents one row in the result set.
Pro tip: if you are coming from an Apache Spark background, this is your df.collect() call.
# Get and process results
rows = result.fetchall()
for row in rows:
    # Process the row
    print(row)
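For result sets too large to pull into a single list, rows can be fetched one at a time instead. A sketch assuming fetchone() on the relation, which returns None once the rows are exhausted:
# Stream rows one at a time instead of materializing the whole result
while True:
    row = result.fetchone()
    if row is None:
        break
    print(row)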
Summary:
I hope this article has helped you gain some quick insight into DuckDB. Thanks for reading. Please comment and let me know if there is anything else you would like to see covered.