Reducing memory usage in Pandas
As a data scientist, I use the Python Pandas library a lot to manipulate my data. Sometimes that data is extremely large once loaded into Pandas, such as when I read a CSV file with millions of rows and dozens of columns. A DataFrame that large makes every manipulation slow and memory-hungry.
A quick solution to this problem is to examine your DataFrame for "object" columns. Object columns are a common source of high memory usage, because Pandas stores every row of an object column as a full Python string object. The "category" dtype avoids this: Pandas stores each unique value only once and replaces every row with a small integer code that points at the corresponding category. Casting an object column to "category" is a good idea when the unique values comprise less than 50% of the total values in the column.
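As a rough sketch (the file name here is a placeholder, and the 50% threshold is the heuristic above), you can measure each column's footprint and convert the object columns that qualify:

```python
import pandas as pd

# "data.csv" is a placeholder for your own large file.
df = pd.read_csv("data.csv")

# Per-column memory in bytes; deep=True measures the actual string
# contents of object columns, not just the 8-byte pointers to them.
print(df.memory_usage(deep=True))

# Cast object columns to "category" when their unique values make up
# less than 50% of the column.
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() / len(df[col]) < 0.5:
        df[col] = df[col].astype("category")
```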
So, a column of 5 million "Yes"s and "No"s (5 million string objects under the object dtype) is encoded as just two stored strings plus 5 million one-byte integer codes, significantly compressing the size of that column.
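A minimal demonstration of that claim (the exact byte counts vary by platform and Pandas version, but the ratio is dramatic):

```python
import pandas as pd

s = pd.Series(["Yes", "No"] * 2_500_000)  # 5 million strings

# Object dtype: memory_usage(deep=True) reports the size of every
# string object, typically a few hundred megabytes for this Series.
print(s.memory_usage(deep=True))

# Category dtype: two stored strings plus 5 million int8 codes,
# roughly 5 MB in total.
print(s.astype("category").memory_usage(deep=True))
```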