Evaluating Which Python Library Is Best Suited for Bulk Inserts into Aurora PostgreSQL | Speed Comparison
Authors
Soumil Nitin Shah (Data Collection and Processing Team Lead)
I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive expertise in developing scalable and high-performance software applications in Python. I have a YouTube channel where I teach people about Data Science, Machine Learning, Elasticsearch, and AWS. I work as a Data Engineer Lead at JobTarget, where I spend most of my time developing the ingestion framework and creating microservices and scalable architecture on AWS.
Hari Om Dubey (Consultant Software Engineer, Python Developer)
I have completed a Master's in Computer Application, and I have 5 years of experience developing software applications using Python and the Django framework. I love to code in Python, and creating a solution for a problem through code excites me. I have been working at JobTarget for the past 8 months as a Software Engineer on the Data Team.
Overview:
Amazon Aurora PostgreSQL is a fully managed, PostgreSQL-compatible, and ACID-compliant relational database engine that combines the speed, reliability, and manageability of Amazon Aurora with the simplicity and cost-effectiveness of open-source databases. Aurora PostgreSQL is a drop-in replacement for PostgreSQL and makes it simple and cost-effective to set up, operate, and scale your new and existing PostgreSQL deployments, freeing you to focus on your business and applications. In this article, I test various Python libraries for speed and performance and share the findings with the community.
Introduction
The most well-known PostgreSQL database adapter for Python is called Psycopg. The full implementation of the Python DB API 2.0 specification and thread safety are its key features (several threads can share the same connection). It was created for heavily multi-threaded programs that make a lot of concurrent "INSERT" or "UPDATE" requests and perform a lot of cursor creation and destruction.
Psycopg 2 is efficient and secure because it is primarily implemented in C as a libpq wrapper. It includes "COPY TO/COPY FROM" capabilities, client-side and server-side cursors, asynchronous communication, and notifications. A versatile object adaptation system allows the mapping of Python types to PostgreSQL data types to be extended and customized.
SQLAlchemy is an open-source SQL toolkit and object-relational mapper for Python, released under the MIT License. It provides application developers with the full power and flexibility of SQL, offering a complete set of well-known enterprise-level persistence patterns designed for fast, efficient database access, expressed in a straightforward, Pythonic domain language.
The ease of implementation, speedy code development, and lack of need for prior SQL knowledge are all reasons why SQLAlchemy is so well-liked. The high-level approach is emphasized in practically all online programming tutorials and courses for this reason. The majority of working software engineers also appear to choose SQLAlchemy.
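To make the high-level approach concrete, here is a minimal sketch of an "insert many" with SQLAlchemy Core: passing a list of row dicts to a single `execute()` call lets SQLAlchemy batch the INSERT. The table definition mirrors the test table used in this article; the Aurora connection URL in the comment is a placeholder:

```python
# Illustrative sketch: SQLAlchemy Core "insert many" (placeholder names).
from sqlalchemy import Column, MetaData, String, Table, create_engine, insert

metadata = MetaData()
user_table = Table(
    "user", metadata,
    Column("first_name", String(256)),
    Column("last_name", String(256)),
    Column("address", String(256)),
    Column("text", String(256)),
    Column("id", String(256)),
    Column("city", String(256)),
    Column("state", String(256)),
)

def insert_many(engine, rows):
    """Insert a batch of row dicts in one execute() call; SQLAlchemy
    dispatches it as an efficient multi-row ("executemany") insert."""
    with engine.begin() as conn:  # begin() commits on success
        conn.execute(insert(user_table), rows)

# For Aurora the engine URL would look like (placeholder credentials):
# engine = create_engine("postgresql+psycopg2://user:pass@host:5432/db")
```

The same code runs unchanged against any SQLAlchemy-supported backend, which is part of why it needs no prior SQL knowledge.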
Aurora PostgreSQL Configuration
For the entire test, we shall use the following configuration for Aurora.
Create a Table
CREATE TABLE IF NOT EXISTS public.user
(
first_name character varying(256) COLLATE pg_catalog."default",
last_name character varying(256) COLLATE pg_catalog."default",
address character varying(256) COLLATE pg_catalog."default",
text character varying(256) COLLATE pg_catalog."default",
id character varying(256) COLLATE pg_catalog."default",
city character varying(256) COLLATE pg_catalog."default",
state character varying(256) COLLATE pg_catalog."default"
);
Figure 1: Creating the SQL table
After creating the table, we insert data into it in batches using several libraries commonly used with Aurora, measure the time each one takes to insert the batches, and derive insights into which is faster.
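The measurement itself can be sketched as a small timing harness (a minimal illustration; `insert_fn` stands for whichever library call is being benchmarked):

```python
import time

def time_batches(insert_fn, batches):
    """Call insert_fn on each batch and return per-batch wall-clock seconds."""
    timings = []
    for batch in batches:
        start = time.perf_counter()  # monotonic, suited to short intervals
        insert_fn(batch)
        timings.append(time.perf_counter() - start)
    return timings
```

Running the same harness with each library's insert function keeps the comparison apples-to-apples: only `insert_fn` changes between runs.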
Results
It won't matter much if you are working with tiny amounts of data. But as the data grows, it becomes well worth exploring these alternative methods, which can speed up the process by up to 13 times.
SQLAlchemy Insert Many
Comparison
Graph: Comparison of the SQLAlchemy and psycopg2 Python libraries for bulk inserts into Aurora
Conclusion
SQLAlchemy should be the primary option when bulk-inserting items into AWS Aurora, because the tests make it obvious that it is the faster way to insert data into Aurora PostgreSQL. Compared to psycopg2's executemany, SQLAlchemy is almost 60 to 70% faster. For a batch size of 30,000 rows, the bulk insert took around 1,248 seconds with psycopg2 (executemany), 19.4 seconds with psycopg2 (execute_batch), and only 1.5 seconds with SQLAlchemy.
Future Work
I will run further tests to confirm that these results are accurate. Later, I would also like to evaluate read speed: which library is better for bulk reads?
The code base used for the tests can be found here:
Do not hesitate to fork the repository, contribute your discoveries, and submit a merge request.
https://github.com/soumilshah1995/Evaluating-Which-Python-library-is-Best-suitable-for-Bulk-insert-into-Aurora-Postgres-SQL-Speed-Co/tree/main
The log file saved during the test is included as well.
References
Aurora PostgreSQL Insert Many Performances Test Using Various Python Library. Accessed 27 Oct. 2022.
“SQLAlchemy.” Pypi, pypi.org/project/SQLAlchemy. Accessed 27 Oct. 2022.
“Psycopg2.” Psycopg2, pypi.org/project/psycopg2. Accessed 27 Oct. 2022.
“Pandas to PostgreSQL Using Psycopg2: Bulk Insert Performance Benchmark.” Naysan, naysan.ca/2020/05/09/pandas-to-postgresql-using-psycopg2-bulk-insert-performance-benchmark. Accessed 27 Oct. 2022.
“Improve Your Psycopg2 Executions for PostgreSQL in Python.” Datacareer, www.datacareer.de/blog/improve-your-psycopg2-executions-for-postgresql-in-python. Accessed 27 Oct. 2022.