Reducing memory usage in Pandas

As a data scientist, I use the Python Pandas library heavily to manipulate data. Sometimes a dataset becomes extremely large once loaded into Pandas, such as a CSV file with millions of rows and dozens of columns.

A large in-memory footprint can make every manipulation painfully slow.

A quick win here is to examine your dataframe for "object" columns. Object columns are a common source of high memory usage because Pandas stores every row as a full Python string object, even when the same value repeats millions of times. We can avoid this by casting object columns to the "category" dtype instead: Pandas then stores each unique value only once and replaces every row with a small integer code pointing to it. This is a good idea when the unique values in an object column make up less than 50% of the column's total values.
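Here is a minimal sketch of that workflow. The file name data.csv is a placeholder for your own data, and the 0.5 threshold is just the 50% rule of thumb from above:

```python
import pandas as pd

# Hypothetical input file; substitute your own large CSV.
df = pd.read_csv("data.csv")

# deep=True counts the actual Python string objects inside
# object columns, not just the 8-byte pointers to them.
print(df.memory_usage(deep=True))

# Cast each object column to category when fewer than half
# of its values are unique.
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() / len(df[col]) < 0.5:
        df[col] = df[col].astype("category")

print(df.memory_usage(deep=True))
```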

So a column of 5 million "Yes" and "No" values (5 million separate string objects) collapses to just two stored strings plus 5 million one-byte integer codes, significantly compressing the size of that column.
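To make that concrete, here is a quick demonstration with random data. The memory figures in the comments are approximate and will vary with your Python and Pandas versions:

```python
import numpy as np
import pandas as pd

# Build the 5-million-row Yes/No column from the example.
n = 5_000_000
s = pd.Series(np.random.choice(["Yes", "No"], size=n))

# Object dtype: one Python string object per row (roughly hundreds of MB).
print(s.memory_usage(deep=True))

# Category dtype: one int8 code per row plus the two unique strings
# (roughly 5 MB).
print(s.astype("category").memory_usage(deep=True))
```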
