Pandas: 5 Very Different Performances, more than 2000x


Our dataset consists of 1631 records of hotel locations in New York. Below are timings from 5 different ways to call the haversine() function, which computes the great-circle distance between two points given as (lat, lon).
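Here is a sketch of the function, following the cited upside.com guide (the miles constant and formula are from that guide):

    import numpy as np

    def haversine(lat1, lon1, lat2, lon2):
        """Great-circle distance in miles between two (lat, lon) points.

        Written entirely with np.* ufuncs, so the exact same body works on
        scalars, Pandas Series, and raw numpy arrays.
        """
        MILES = 3959  # mean Earth radius in miles
        lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
        return MILES * 2 * np.arcsin(np.sqrt(a))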

'haversine_looping' is the way a lot of us might have written it in Pandas: C-style looping plus indexing with .iloc and friends.
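In code, that looks something like this (the 'latitude'/'longitude' column names and the fixed reference point are assumptions, following the cited guide):

    def haversine_looping(df):
        # C-style: walk an integer index and pull each row out with .iloc
        distances = []
        for i in range(len(df)):
            distances.append(haversine(40.671, -73.985,   # assumed reference point
                                       df.iloc[i]['latitude'],
                                       df.iloc[i]['longitude']))
        return distances

    df['distance'] = haversine_looping(df)   # ~40ms on the 1631-row dataset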

Just using iterrows():

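A minimal sketch of the iterrows() version, under the same assumptions:

    # iterrows() hands us each row directly, avoiding the repeated .iloc lookups
    distances = []
    for _, row in df.iterrows():
        distances.append(haversine(40.671, -73.985,
                                   row['latitude'], row['longitude']))
    df['distance'] = distances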

The runtime went from 40ms to 14ms! Wow, can it make that much of a difference? Yes it can.

Now let's use apply(), with a lambda instead of a 'def':

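Sketched under the same assumptions:

    # apply() calls the lambda once per row; no hand-written loop, no .iloc
    df['distance'] = df.apply(
        lambda row: haversine(40.671, -73.985,
                              row['latitude'], row['longitude']),
        axis=1)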

Runtime went from 40ms -> 14ms -> 6.2ms! Besides, apply() is more friendly to parallel Pandas-like implementations, because we no longer impose a sequential execution order on the records being processed.

Now, use Pandas vectorization. It is so fast that I had to loop the timing 10 times:

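The whole change is at the call site, under the same assumed columns; haversine() itself is untouched:

    # Pass whole Series: the np.* ufuncs inside haversine() broadcast over
    # all 1631 rows in a single call, at C speed
    df['distance'] = haversine(40.671, -73.985,
                               df['latitude'], df['longitude'])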

We are down to 0.31ms! Not bad, considering we started at more than 40ms.

One might ask: where is the vectorization? Look at haversine(). We are passing in entire Pandas Series and calling haversine() once! This is the power of Python operator overloading and dynamic dispatch.

Can we do better than this? Yes, by using numpy directly:

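Same call again, with .values stripping each Series down to its raw array:

    # Operating on bare numpy.ndarrays skips the Series indexing machinery
    df['distance'] = haversine(40.671, -73.985,
                               df['latitude'].values,
                               df['longitude'].values)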

.values is a property on a Series that gives you the underlying numpy.ndarray. We are now at 0.02ms, over 10x faster than the previous Pandas-based vectorization. I must admit I was amazed myself.

Kudos to https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6. I only typed in the code, made the output more understandable, and made the different implementations easier to comprehend.

Remember our original runtime (more than 40ms); the final winner is 0.02ms, a speedup of over 2000x :). In case it is not obvious, I never had to change the haversine() function across the 5 different methods.

In Python, the best way to loop is not to loop at all.

Comments:

thanks, manny!

Naser Tamimi (Senior Data Scientist | GenAI @ AWS), 4 years ago:
What a nice comparison of doing the same thing using different tools and getting different performances.

Maksym Voitko (AI | Data Engineering & Back End & MLOps | Python, Big Data, AWS, GCP | Angel Investor), 5 years ago:
Awesome improvements! I will definitely try the last improvement to boost the performance in all my apps.

Addendum on Modin: what I like about Modin is how it splits the input, so it can handle both datasets with many rows and datasets with many columns, using a Partition Manager:
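A minimal sketch of dropping Modin in (assumes a supported engine such as Ray is installed; the CSV path is hypothetical):

    import modin.pandas as pd   # drop-in replacement for `import pandas as pd`

    # Modin's Partition Manager splits the frame along both axes, so
    # row-heavy and column-heavy datasets both get parallelized
    df = pd.read_csv('nyc_hotels.csv')   # hypothetical dataset path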
