Evaluating Which Python Library Is Best Suited for Bulk Inserts into Aurora PostgreSQL | Speed Comparison
Authors
Soumil Nitin Shah (Data Collection and Processing Team Lead)
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable and high-performance software applications in Python. I have a YouTube channel where I teach people about Data Science, Machine Learning, Elasticsearch, and AWS. I work as a Data Engineer Lead at JobTarget, where I spend most of my time developing the ingestion framework and creating microservices and scalable architecture on AWS.
Hari Om Dubey (Consultant Software Engineer, Python Developer)
I have completed a Master's in Computer Application, and I have 5 years of experience developing software applications using Python and the Django framework. I love to code in Python, and creating a solution for a problem through code excites me. I have been working at JobTarget for the past 8 months as a Software Engineer on the Data Team.
Overview:
Amazon Aurora PostgreSQL is a fully managed, PostgreSQL-compatible, and ACID-compliant relational database engine that combines the speed, reliability, and manageability of Amazon Aurora with the simplicity and cost-effectiveness of open-source databases. Aurora PostgreSQL is a drop-in replacement for PostgreSQL and makes it simple and cost-effective to set up, operate, and scale your new and existing PostgreSQL deployments, freeing you to focus on your business and applications. In this article, I test various Python libraries for speed and performance and share the findings with the community.
Introduction
The most well-known PostgreSQL database adapter for Python is called Psycopg. The full implementation of the Python DB API 2.0 specification and thread safety are its key features (several threads can share the same connection). It was created for heavily multi-threaded programs that make a lot of concurrent "INSERT" or "UPDATE" requests and perform a lot of cursor creation and destruction.
Psycopg 2 is efficient and secure because it is primarily implemented in C as a libpq wrapper. It includes "COPY TO/COPY FROM" capabilities, client-side and server-side cursors, asynchronous communication, and notifications. A versatile object adaptation system allows the mapping of Python types to PostgreSQL data types to be extended and customized.
SQLAlchemy is an open-source SQL toolkit and object-relational mapper for Python, released under the MIT License. It provides application developers with the full power and flexibility of SQL, offering a complete set of well-known enterprise-level persistence patterns designed for fast, efficient database access, expressed in a straightforward, Pythonic domain language.
The ease of implementation, speedy code development, and lack of need for prior SQL knowledge are all reasons why SQLAlchemy is so well-liked. The high-level approach is emphasized in practically all online programming tutorials and courses for this reason. The majority of working software engineers also appear to choose SQLAlchemy.
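To make the high-level approach concrete, here is a minimal sketch of an "insert many" with SQLAlchemy Core: passing a list of row dicts to a single `execute()` call lets SQLAlchemy batch the INSERT. The table definition mirrors the test table used in this article; the Aurora connection URL in the comment is a placeholder:

```python
# Illustrative sketch: SQLAlchemy Core "insert many" (placeholder names).
from sqlalchemy import Column, MetaData, String, Table, create_engine, insert

metadata = MetaData()
user_table = Table(
    "user", metadata,
    Column("first_name", String(256)),
    Column("last_name", String(256)),
    Column("address", String(256)),
    Column("text", String(256)),
    Column("id", String(256)),
    Column("city", String(256)),
    Column("state", String(256)),
)

def insert_many(engine, rows):
    """Insert a batch of row dicts in one execute() call; SQLAlchemy
    dispatches it as an efficient multi-row ("executemany") insert."""
    with engine.begin() as conn:  # begin() commits on success
        conn.execute(insert(user_table), rows)

# For Aurora the engine URL would look like (placeholder credentials):
# engine = create_engine("postgresql+psycopg2://user:pass@host:5432/db")
```

The same code runs unchanged against any SQLAlchemy-supported backend, which is part of why it needs no prior SQL knowledge.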
Aurora PostgreSQL Configuration
For the entire test, we shall use the following configuration for Aurora.
Create a Table
CREATE TABLE IF NOT EXISTS public.user
(
first_name character varying(256) COLLATE pg_catalog."default",
last_name character varying(256) COLLATE pg_catalog."default",
address character varying(256) COLLATE pg_catalog."default",
text character varying(256) COLLATE pg_catalog."default",
id character varying(256) COLLATE pg_catalog."default",
city character varying(256) COLLATE pg_catalog."default",
state character varying(256) COLLATE pg_catalog."default"
);
Figure 1: Creating the SQL table
After creating the table, we insert data into it in batches using several libraries commonly used with Aurora, measure the time each one takes to insert the batches, and derive insights into which is faster.
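The measurement itself can be sketched as a small timing harness (a minimal illustration; `insert_fn` stands for whichever library call is being benchmarked):

```python
import time

def time_batches(insert_fn, batches):
    """Call insert_fn on each batch and return per-batch wall-clock seconds."""
    timings = []
    for batch in batches:
        start = time.perf_counter()  # monotonic, suited to short intervals
        insert_fn(batch)
        timings.append(time.perf_counter() - start)
    return timings
```

Running the same harness with each library's insert function keeps the comparison apples-to-apples: only `insert_fn` changes between runs.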
Results
It won't matter much if you are working with tiny amounts of data. But as the data grows, it becomes well worth exploring these alternative methods, which can speed up the process by up to 13 times.
SQLAlchemy Insert Many
Comparison
Graph: Comparison of the SQLAlchemy and psycopg2 Python libraries for bulk inserts into Aurora
Conclusion
SQLAlchemy should be the primary option when bulk-inserting items into AWS Aurora, because the tests make it obvious that it is the faster way to insert data into Aurora PostgreSQL. Compared to psycopg2's executemany, SQLAlchemy is almost 60 to 70% faster. For a batch size of 30,000 rows, the bulk insert took around 1,248 seconds with psycopg2 (executemany), 19.4 seconds with psycopg2 (execute_batch), and only 1.5 seconds with SQLAlchemy.
Future Work
I will run further tests to confirm that these results are accurate. Later, I would also like to evaluate read speed: which library is better for bulk reads?
The code base used for the tests can be found here:
Do not hesitate to fork the repository, contribute your discoveries, and submit a merge request.
https://github.com/soumilshah1995/Evaluating-Which-Python-library-is-Best-suitable-for-Bulk-insert-into-Aurora-Postgres-SQL-Speed-Co/tree/main
The log file saved during the test is included as well.
References
Aurora PostgreSQL Insert Many Performances Test Using Various Python Library. Accessed 27 Oct. 2022.
“SQLAlchemy.” Pypi, pypi.org/project/SQLAlchemy. Accessed 27 Oct. 2022.
“Psycopg2.” Psycopg2, pypi.org/project/psycopg2. Accessed 27 Oct. 2022.
“Pandas to PostgreSQL Using Psycopg2: Bulk Insert Performance Benchmark.” Naysan, naysan.ca/2020/05/09/pandas-to-postgresql-using-psycopg2-bulk-insert-performance-benchmark. Accessed 27 Oct. 2022.
“Improve Your Psycopg2 Executions for PostgreSQL in Python.” Datacareer, www.datacareer.de/blog/improve-your-psycopg2-executions-for-postgresql-in-python. Accessed 27 Oct. 2022.