Querying Parquet, CSV Using DuckDB and Python on Amazon S3



Introduction:

This article shows how to access Parquet and CSV files stored on Amazon S3 using DuckDB, a highly efficient in-process analytical database. The examples are written in Python. We will connect to the S3 bucket, load data into DuckDB, and then run queries and filters without the Pandas library. This approach offers significant performance benefits when working with large datasets, because DuckDB optimizes data processing and uses parallel execution.


Prerequisites:

Be sure to have these items before proceeding:


1. Python 3.x installed on your machine.

2. The DuckDB Python package installed (via pip: "pip install duckdb"; depending on your OS this may be "pip3 install duckdb", since pip sometimes defaults to Python 2).

3. Access credentials (access key, secret key, and bucket name) for the S3 bucket you want to target. You can sign up for a free AWS account if you don't already have one.

4. Internet access (obviously).

To state the obvious: some code blocks may require you to scroll horizontally.

Step 1: Install Required Libraries

The boto3 library is required to connect to S3 and access Parquet or CSV files. Install it with pip by running the following command:


# shell
pip install boto3        

If you are a Jupyter notebook / Google Colab person, then try:

! pip install boto3 
        

Why? The "!" prefix passes the command to the shell.
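
To confirm both packages are available before going further, a quick sanity check like this can help (just a sketch; it assumes the installs above succeeded):

# Sanity check: import both libraries and print their versions
import duckdb
import boto3

print("duckdb version:", duckdb.__version__)
print("boto3 version:", boto3.__version__)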

Step 2: Connecting to S3

We need to provide credentials in order to interact with Amazon S3. Start by importing the necessary modules into a Python script:

import duckdb
import boto3

# AWS credentials
ACCESS_KEY = 'your-access-key'
SECRET_KEY = 'your-secret-key'
BUCKET_NAME = 'your-bucket-name'

# Create S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)
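
One thing worth noting: the boto3 client lets your Python code talk to S3, but DuckDB reads s3:// URLs through its own httpfs extension, which needs the credentials as well. A minimal sketch of that configuration (assuming the bucket lives in us-east-1; adjust the region for your account) looks like this:

# Configure DuckDB's default connection to read s3:// paths via the httpfs extension
duckdb.execute("INSTALL httpfs;")
duckdb.execute("LOAD httpfs;")
duckdb.execute("SET s3_region='us-east-1';")  # assumption: the bucket is in us-east-1
duckdb.execute(f"SET s3_access_key_id='{ACCESS_KEY}';")
duckdb.execute(f"SET s3_secret_access_key='{SECRET_KEY}';")

Because the module-level duckdb.execute() and duckdb.query() calls share the same default in-memory connection, these settings apply to the queries in the next steps.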



Step 3: Loading Parquet or CSV Data into DuckDB

The duckdb.query() function is used to load the Parquet and CSV files into DuckDB. It accepts a SQL query and returns the result as a DuckDB relation.


Let's say we have two files, mypqdata.parquet and mycsvdata.csv, in the root of the bucket. Execute the following code to load them:


# Load Parquet file
parquet_data = duckdb.query(
    f"SELECT * FROM parquet_scan('s3://{BUCKET_NAME}/mypqdata.parquet')"
)

# Load CSV file
csv_data = duckdb.query(
    f"SELECT * FROM read_csv_auto('s3://{BUCKET_NAME}/mycsvdata.csv')"
)

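At this point parquet_data and csv_data are DuckDB relations; nothing is materialized until you fetch from them. A quick way to peek at the loaded Parquet data (a sketch, assuming the file loaded successfully):

# Preview the first few rows of the Parquet relation
parquet_data.limit(5).show()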


Step 4: Filtering and Querying Data

After loading the Parquet and CSV files into DuckDB, we can perform various queries and filtering operations. DuckDB allows us to use familiar SQL syntax, and its Python API can reference the relation variables (parquet_data, csv_data) directly by name in a query.


Here is an example of querying Parquet data.



# Query Parquet data


result = duckdb.query("SELECT * FROM parquet_data WHERE department = 'Sales'")        

Here's an example for filtering CSV data.

# Filter CSV data

filtered_result = duckdb.query("SELECT * FROM csv_data WHERE column_name = 'desired_value'")        
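
The same SQL surface supports aggregation and grouping as well. For example, a grouped count over the Parquet data (a sketch; the department column comes from the example above, and any other grouping column would work the same way):

# Count rows per department in the Parquet data
dept_counts = duckdb.query(
    "SELECT department, COUNT(*) AS n FROM parquet_data GROUP BY department ORDER BY n DESC"
)
print(dept_counts.fetchall())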


Step 5: Fetching Results and Processing

The fetchall() method can be used to retrieve and process the query results. It returns the results as a list of tuples, where each tuple represents one row in the result set.

Pro tip: if you are coming from an Apache Spark background, this is the equivalent of your df.collect() call.



# Get and process results
rows = result.fetchall()

for row in rows:
    # Process the row
    print(row)
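
fetchall() pulls everything into Python tuples, which can get heavy for large result sets. If you do want a DataFrame at the end of the pipeline, the DuckDB Python API also exposes df(), which materializes the result as a Pandas DataFrame (a sketch; it requires pandas to be installed):

# Materialize the query result as a Pandas DataFrame instead of tuples
df = result.df()
print(df.head())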


Summary:

I hope this article has helped you gain some quick, practical knowledge of DuckDB. Thanks for reading. Please comment and let me know if there is anything else you would like to see covered.

Some articles that might be of Interest

DuckDB with PyArrow

Polars

Apache Spark vs Polars
