Pandas: 5 Very Different Performances, more than 2000x

[Image: timing output for the five implementations]

Our dataset consists of 1,631 records of hotel locations in New York. The above is the output from 5 different ways to call the haversine function, which computes the great-circle distance between two points given as (lat, lon).

haversine_looping is the way a lot of us might have written it in Pandas: a C-style loop with positional indexing via .iloc and the like.
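The original code was shown only as screenshots, so here is a minimal sketch of the function and the looping version. The column names (latitude, longitude) and the fixed reference point are my assumptions, not from the original:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles between two points given in degrees.
    # Uses only NumPy ufuncs and arithmetic operators, so it works
    # unchanged on scalars, Series, and ndarrays.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))  # 3959 mi ~ Earth's radius

def haversine_looping(df):
    # C-style loop: a positional .iloc lookup on every iteration.
    distances = []
    for i in range(len(df)):
        distances.append(
            haversine(40.671, -73.985,  # reference point (assumed)
                      df.iloc[i]["latitude"], df.iloc[i]["longitude"]))
    return distances
```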

Just using iterrows():

[Image: iterrows() implementation and timing]

The runtime went from 40 ms to 14 ms! Wow, can it really make that much of a difference? Yes, it can.
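A sketch of the iterrows() variant, under the same assumed column names and reference point; haversine() is repeated so the snippet runs standalone:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_iterrows(df):
    # iterrows() hands us each row as a Series, so there is no repeated
    # positional .iloc lookup inside the loop body.
    return [haversine(40.671, -73.985, row["latitude"], row["longitude"])
            for _, row in df.iterrows()]
```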

Now let's use apply(), with a lambda instead of a named def:

[Image: apply() with lambda implementation and timing]

Runtime went from 40 to 14 to 6.2 ms! Besides, apply() is more parallelism-friendly for Pandas-like implementations, because it does not impose a sequential order on how the records are processed.
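A sketch of the apply() version (same assumed columns and reference point, haversine() repeated for self-containment):

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_apply(df):
    # apply() with axis=1 calls the lambda once per row; Pandas drives
    # the iteration, with no user-level loop bookkeeping.
    return df.apply(
        lambda row: haversine(40.671, -73.985,
                              row["latitude"], row["longitude"]),
        axis=1)
```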

Now, use Pandas vectorization. It is so fast I had to loop 10 times just to measure it:

[Image: vectorized Pandas implementation and timing]

We are down to 0.31 ms! Not bad, considering we started above 40 ms.

One might ask: where is the vectorization? Well, look at haversine(). We pass in the entire Pandas Series and call haversine() just once! This is the power of Python operator overloading and dynamic dispatch: the same arithmetic operators and ufuncs now operate on whole columns at C speed.
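The vectorized call can be sketched like this (same assumed columns and reference point). Nothing inside haversine() changes; only the arguments do:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_vectorized(df):
    # One call, whole columns: each operator and ufunc inside haversine()
    # now runs over the entire Series at once instead of per row.
    return haversine(40.671, -73.985, df["latitude"], df["longitude"])
```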

Can we do better than this? Yes, by using NumPy directly:

[Image: NumPy implementation and timing]

.values is a property on the Series that gives you the underlying numpy.ndarray. We are now at 0.02 ms, over 10x faster than the previous Pandas-based vectorization. I must admit, this one amazed even me.
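A sketch of the NumPy variant (same assumptions as above). The only change from the Pandas-vectorized call is handing in raw arrays instead of Series:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_numpy(df):
    # .values exposes the underlying ndarrays, skipping the Series
    # indexing and alignment machinery entirely.
    return haversine(40.671, -73.985,
                     df["latitude"].values, df["longitude"].values)
```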

Kudos to https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6. I only typed in the code, made the output more understandable, and made the different implementations easier to compare.

Remember our original runtime (> 42 ms); the final winner is 0.02 ms :). In case it is not obvious: I never had to change the haversine() function across the 5 different methods.

In Python the best way to loop is not to loop at all.

Comments:

"thanks, manny!"

Naser Tamimi, Data Engineer @ Expedia (3 years ago):
"What a nice comparison of doing the same thing using different tools and getting different performances."

Maksym Voitko, AI | Data Engineering & Back End & MLOps | Python, Big Data, AWS, GCP | Angel Investor (4 years ago):
"Awesome improvements! I will definitely try the last improvement to boost the performance in all my apps."

Modin: What I like about Modin is how it splits the input, using a Partition Manager, so it can handle both datasets with many rows and datasets with many columns.
