How to process a 100 GB file, or "you can embrace the unembraceable, but only in parts".


Hello everyone,

I want to share my experience of handling large (very large) files. I tested a few of my ideas and, in the end, got results, but it would be nice to hear other advice and thoughts as well.

The task was simple: there was a 110 GB CSV file with 500 million rows. It was essentially a database, and I needed to process it so that it could be searched easily.

The structure shown here differs from the one I was actually working with, but I simplified it to this level to make the structure easier to understand: [screenshot]

1. For convenience, I tried loading it into Power BI. The database loaded, but then everything stopped.


2. If you open such a file for viewing (F3 in a file manager) rather than for editing (F4), you can see what is inside. Structurally, it consisted of 6 columns: {'ID', 'Name', 'Experience', 'Contact', 'Project', 'Year'}. My first thought, later confirmed, was that it is essentially a table merged from three others, possibly even with some data in JSON format.

When viewed, it looked something like this: [screenshot]

Hypothetically, the structure looked like this: [screenshot]

Here are the 3 separate databases from which it was composed: [screenshots]

3. It was decided to cut off the 'Project' and 'Year' columns. I did this in Python using the 'csv' and 'chardet' libraries. Since the whole file could not be loaded into memory, I processed it in small chunks: I set the chunk size to 100,000 rows and immediately wrote each chunk back out to CSV. I initially worked from an external USB 3.2 hard drive with a write speed of around 800 MB/s. The entire processing took about 2 hours, and in the end the file shrank to 50 GB!
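A minimal sketch of that chunked trimming, with hypothetical file names, assuming the delimiter is ';' and the encoding has already been identified (for example with chardet):

import csv

input_path = 'DT_full.csv'        # hypothetical name of the 110 GB source file
output_path = 'DT_trimmed.csv'    # output without the 'Project' and 'Year' columns
chunk_size = 100_000              # rows written out per chunk

with open(input_path, 'r', encoding='utf-8-sig', newline='') as src, \
     open(output_path, 'w', encoding='utf-8', newline='') as dst:
    reader = csv.reader(src, delimiter=';')
    writer = csv.writer(dst, delimiter=';')
    buffer = []
    for row in reader:
        buffer.append(row[:4])            # keep only ID, Name, Experience, Contact
        if len(buffer) >= chunk_size:
            writer.writerows(buffer)      # flush the chunk to disk
            buffer = []
    if buffer:
        writer.writerows(buffer)          # flush the remainder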


4. In practice, what I encountered looked something like this: [screenshot]

As you can see, there are many duplicates. Removing them and splitting the data into two tables promised to reduce the size significantly and make it more manageable.

5. The next plan seemed great: set up an SQL server, push my file into it with Python, remove the duplicates with SQL queries, split the data into two tables, and join them to get a working database. After studying the topic, I understood that the choice was between MySQL and PostgreSQL. As I understood it then, PostgreSQL copes better with very large databases (over 1 TB) and complex queries, while MySQL has simpler commands and performs slightly better on merely large databases. In the end it became clear that there is no simple answer to which is better, but I settled on MySQL. Another advantage of MySQL is that it is more widely used, so it is easier to find answers to questions.

After reading the instructions, I started installing MySQL, got a bit confused by the components, and decided it would be faster to do everything in Python.


6. I started removing duplicates with Python. When I code in Python, it takes me at least 10 attempts even for a small program, and with nearly half a billion rows each attempt can take a very long time. So I made a shortened version of my database, keeping only the first 1 million rows. I did this in the same way, with the csv and chardet libraries, simply reading the required number of lines sequentially: num_lines_to_read = 1_000_000. You can then test your algorithms on these shortened samples, estimate how long they take, and extrapolate to the total time.
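A minimal sketch of how such a sample can be cut, with hypothetical file names:

num_lines_to_read = 1_000_000

with open('DT_trimmed.csv', 'r', encoding='utf-8-sig') as src, \
     open('DT_sample.csv', 'w', encoding='utf-8') as dst:
    for i, line in enumerate(src):
        if i >= num_lines_to_read:
            break
        dst.write(line)    # copy the first million lines as they are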



7. I faced some difficulties with the Cyrillic encoding. Usually encoding='utf-8' solves everything, but here I had to use encoding='utf-8-sig' (UTF-8 with a BOM).
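A quick way to check what you are dealing with is to let chardet guess the encoding from the first piece of the file (a sketch, with a hypothetical file name):

import chardet

with open('DT_trimmed.csv', 'rb') as f:
    raw = f.read(100_000)                  # a sample is enough for detection
print(chardet.detect(raw))                 # e.g. {'encoding': 'UTF-8-SIG', ...}

with open('DT_trimmed.csv', 'r', encoding='utf-8-sig') as f:
    header = f.readline()                  # the BOM is stripped automatically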


8. I removed duplicates with the csv and chardet libraries, working in chunks. It worked fine on small samples, but on larger ones I kept hitting rows with errors that stopped everything. Here I also had to strip double quotes and split each line by the delimiter myself (it is a CSV file, after all), because extra columns kept appearing. I tried to run this kind of check before processing (a sketch is below), but after a day and a half I realized that Python alone could not handle it all, and I had to install MySQL.
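A sketch of the kind of per-row check and cleanup I mean, with the column count, delimiter and file names as assumptions:

import csv

EXPECTED_COLUMNS = 4     # ID, Name, Experience, Contact
bad_rows = 0

with open('DT_sample.csv', 'r', encoding='utf-8-sig', newline='') as src, \
     open('DT_sample_clean.csv', 'w', encoding='utf-8', newline='') as dst:
    writer = csv.writer(dst, delimiter=';')
    for line in src:
        fields = line.rstrip('\n').replace('"', '').split(';')   # strip quotes, split manually
        if len(fields) != EXPECTED_COLUMNS:
            bad_rows += 1                                         # count and skip malformed rows
            continue
        writer.writerow(fields)

print('skipped rows:', bad_rows)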


9. Setting up MySQL: Thanks to the YouTube channel @Amit.Thinks and his 10:25-minute video "Python MySQL Database Connectivity - MySQL Connector | 2022 Complete Guide," I quickly did everything. You just need to install:

- MySQL Server

- MySQL Workbench

- MySQL Shell

Then simply configure the product: keep port 3306 and set a password for the database. Then, through the Windows search "Edit the system Environment Variables", add the path to the MySQL bin folder to the Path variable. Next, check 'mysql --version' in the command line; if the path is set correctly, you will see the MySQL version. After that, you can connect to the server with 'mysql -u root -p', enter the password, and play with the commands. Of course, it is more convenient to do any manipulations in MySQL Workbench.

Then you need to install the connector through which Python will communicate with the database: 'pip install mysql-connector-python'.

Now, connect to the database:

import mysql.connector

conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='********',
    database='*****'
)


And perform all the manipulations.


10. To place the physical database somewhere other than the installation disk, you need to change it in the "my.ini" file. Make the changes after stopping the MySQL server: run services.msc (type it in the console) and stop the "MySQL80" service. Then find and change the parameter "datadir=E:/...". To be able to load data files from a folder on disk, you also need to list that folder in my.ini in the "secure-file-priv=" parameter. The temporary files can be moved too, via "tmpdir=". "my.ini" is much like CONFIG.SYS, which was used to configure the system in DOS.
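Roughly, the relevant lines in my.ini end up looking like this (the E:/ paths here are hypothetical examples):

[mysqld]
datadir=E:/mysql/data
secure-file-priv=E:/mysql/uploads
tmpdir=E:/mysql/tmp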


11. Attempts to load the CSV file into MySQL. According to the instructions, you just need to create a cursor in Python and execute the command that loads the CSV file into MySQL. Here you have to specify data types and field lengths. The data types were clear, but for the field lengths I had to run a DAX query after loading part of the data into Power BI. I ended up with: cursor.execute("CREATE TABLE IF NOT EXISTS a2 (ID VARCHAR(15), Name VARCHAR(100), Experience VARCHAR(10), Contact VARCHAR(66))"). Then you specify the load into the table. But it turned out that, because of constant problems, the data would not load: sometimes an extra column in a row was enough to make the import fail.

I will say right away that you should make sure the connection is closed if the program hits an error; otherwise you may have problems with the database:


try:
    ...  # all the work with the cursor and the connection goes here
except mysql.connector.Error as error:
    print("Error: {}".format(error))
finally:
    # Close the cursor and connection
    if 'cursor' in locals() and cursor is not None:
        cursor.close()
    if 'connection' in locals() and connection.is_connected():
        connection.close()


12. I realized there was an error somewhere in the data, so I tried another program. I used "re" together with "csv" to preserve Cyrillic characters while removing other non-ASCII characters:

# Preserve Cyrillic symbols while removing non-ASCII characters
cleaned_value = re.sub(r'[^\x00-\x7F\x80-\xFF\u0400-\u04FF]', '', column_value)

I replaced delimiters: writer.writerow([elem.replace(';', ',') for elem in row]). I changed the encoding: input_encoding = 'utf-8-sig', output_encoding = 'utf-8'. And I simply cut out suspicious rows.


13. Then I simply split the 50 GB into 10 files of 5 GB each and started loading them into MySQL one by one. 8 of the 10 pieces loaded; 2 did not. I then looked at the specific problem rows and saw values like '"Dave, Oliver"': the comma was treated as a column separator, and the quotes were not recognized as enclosing a single field. It turns out you just need to add OPTIONALLY ENCLOSED BY '"' to the load query. Besides that, there were a few other broken rows, which I simply decided to cut out.
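For reference, a sketch of what such a load query can look like with that option; the table a2 is the one created above, while the file path and delimiter are assumptions:

cursor = conn.cursor()    # conn is the connection opened earlier

load_query = (
    "LOAD DATA INFILE 'E:/mysql/uploads/DT_part01.csv' "
    "INTO TABLE a2 "
    "FIELDS TERMINATED BY ',' "
    "OPTIONALLY ENCLOSED BY '\"' "
    "LINES TERMINATED BY '\\n'"
)
cursor.execute(load_query)
conn.commit()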


14. Now that all parts of the database are loaded and I finally managed to run the SQL query to remove duplicates, I decided to test it on a "small" 2 GB file. I ran the process, and after 17 hours of work, I realized it was taking too long. Of course, I understood that MySQL on my local computer, Core i7, 8th Gen with 32 GB of memory, isn't quite the same as on a powerful server. But I hoped it would be faster than 9 days just to remove duplicates.
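The duplicate removal itself can be written in several ways; a minimal sketch of one common approach, selecting distinct rows into a new table (the name a2_unique is hypothetical, and cursor/conn are the ones from above):

cursor.execute(
    "CREATE TABLE a2_unique AS "
    "SELECT DISTINCT ID, Name, Experience, Contact FROM a2"
)
conn.commit()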


15. I moved all the databases to my SSD to speed up the process and made the corresponding change in "my.ini". I also increased innodb_buffer_pool_size from 128 MB to 2 GB, and then to 16 GB. It did not help.

I calculated that technically I could rent a virtual machine with 128 GB of memory for about $50 per month, but that meant dealing with setting up yet another service.

During my attempts, MySQL "broke" on me, and I had to reinstall it. After a couple more tries, I thought it would be faster to do this processing in Python.

16. Now everything looked more realistic. I had 10 files of 5 GB each, each of which could fit into memory (loading 1 GB of CSV takes about 2.2 GB of RAM) and be processed with the "pandas" library:

import pandas as pd

df = pd.read_csv('DT_part01.csv', encoding='utf-8')
df['ID'] = df['ID'].astype(str)
df['Contact'] = df['Contact'].astype(str)
unique_df = df.drop_duplicates(subset=['ID', 'Contact'])
unique_df.to_csv('DT_part01_proc.csv', index=False, encoding='utf-8')

From each 5 GB file, I got about 1.5 GB. Then I combined these 10 files into a 15 GB file and tried to remove duplicates again using "pandas," but I ran out of memory.


17. Then I split the database into two: {'ID', 'Name', 'Experience', 'Contact'} -> [Name]{'ID', 'Name', 'Experience'} + [Contact]{'ID', 'Contact'}, and ran duplicate removal on each of them. In the end I was left with only 6 GB, so I finally had what I wanted: a cleaned, trimmed database.
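A sketch of that split-and-deduplicate step; the combined file is read in chunks so that only the two narrower tables have to be held in memory (file names and chunk size are assumptions):

import pandas as pd

name_parts, contact_parts = [], []
for chunk in pd.read_csv('DT_all.csv', encoding='utf-8', dtype=str, chunksize=1_000_000):
    # Keep only the columns of each target table and drop duplicates within the chunk
    name_parts.append(chunk[['ID', 'Name', 'Experience']].drop_duplicates())
    contact_parts.append(chunk[['ID', 'Contact']].drop_duplicates())

# Concatenate the reduced chunks and deduplicate again across chunk boundaries
df_name = pd.concat(name_parts).drop_duplicates()
df_contact = pd.concat(contact_parts).drop_duplicates()

df_name.to_csv('DT_name.csv', index=False, encoding='utf-8')
df_contact.to_csv('DT_contact.csv', index=False, encoding='utf-8')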


18. Processing and presenting the results. The simplest way was to load my two tables with "pandas" and then merge the DataFrames (for small databases it is just a super tool), and work with the result as with a single database. In my case, the two CSV files, 6 GB on disk, took about 13-15 GB once loaded into memory, and after the merge the total grew to 28-29 GB. I could then simply search through it. But I wanted the process to be faster (loading into RAM took about 6 minutes, and the merge another 3-4) and to use less memory.
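A minimal sketch of that merge, assuming the two files from the previous step:

import pandas as pd

df_name = pd.read_csv('DT_name.csv', dtype=str, encoding='utf-8')
df_contact = pd.read_csv('DT_contact.csv', dtype=str, encoding='utf-8')

# Join the two tables back together on the shared ID column
df_full = pd.merge(df_name, df_contact, on='ID', how='inner')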


19. I managed to save my merged file as CSV. Now I could simply load it without spending time on the merge, but it still used too much RAM. So I saved the merged table not as CSV but in other formats: "Feather" reduced the size to 4.3 GB, and "Parquet" to 3.2 GB. But when loading the "Parquet" file into "pandas", I saw that although the file on disk was smaller, in memory it expanded to the same size as the CSV.
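A sketch of saving and reloading in those formats (pandas needs pyarrow installed for Feather, and pyarrow or fastparquet for Parquet; the file names are hypothetical):

import pandas as pd

# Reload the merged table that was saved earlier as CSV
df_full = pd.read_csv('DT_merged.csv', dtype=str, encoding='utf-8')

df_full.to_feather('DT_full.feather')    # about 4.3 GB on disk in my case
df_full.to_parquet('DT_full.parquet')    # about 3.2 GB on disk in my case

df_from_feather = pd.read_feather('DT_full.feather')
df_from_parquet = pd.read_parquet('DT_full.parquet')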


20. I had to complicate the Python program slightly. I load the two CSV files ([Name] and [Contact]) separately, then use "set_index" to turn the search columns into the index. I enter part of a name to search for; through "df_name.index.str.contains" I find all similar values in the [Name] table and put them into "name_find", then look up all matching 'ID' values in the [Contact] table and put them into "contact_find". Finally, I merge "name_find" and "contact_find" into the resulting table and save it as CSV. Searching by contact has an extra stage: first I find all matching contacts in the [Contact] table, build a separate 'ID' table from them, read [Contact] again, and only after that search in [Name] and merge.
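A sketch of the name-search path under these assumptions (the column names are the simplified ones from above, and the query string is just an example):

import pandas as pd

df_name = pd.read_csv('DT_name.csv', dtype=str, encoding='utf-8').set_index('Name')
df_contact = pd.read_csv('DT_contact.csv', dtype=str, encoding='utf-8')

query = 'Oliver'    # part of the name to search for

# All rows whose Name contains the query (case-insensitive)
name_find = df_name[df_name.index.str.contains(query, case=False, na=False)].reset_index()

# All contacts belonging to the matched IDs
contact_find = df_contact[df_contact['ID'].isin(name_find['ID'])]

# Final result: matched names joined with their contacts
result = pd.merge(name_find, contact_find, on='ID', how='left')
result.to_csv('search_result.csv', index=False, encoding='utf-8')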


21. I converted the final Python script into an EXE file.
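One common way to do this, for example, is PyInstaller (the script name here is hypothetical):

pip install pyinstaller
pyinstaller --onefile search_tool.py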

