Pandas: 5 Very Different Performances, more than 2000x

[Image: timing output for the five implementations]

Our dataset consists of 1,631 records of hotel locations in New York. The above is the output from 5 different ways to call the haversine function, which computes the great-circle distance between two points given as (lat, lon).

haversine_looping is the way a lot of us might have written it in Pandas: a C-style loop with positional indexing via .iloc and the like.
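The original code was shown only as screenshots, so here is a minimal sketch of the function and the looping version. The column names (latitude, longitude) and the fixed reference point are my assumptions, not from the original:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles between two points given in degrees.
    # Uses only NumPy ufuncs and arithmetic operators, so it works
    # unchanged on scalars, Series, and ndarrays.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))  # 3959 mi ~ Earth's radius

def haversine_looping(df):
    # C-style loop: a positional .iloc lookup on every iteration.
    distances = []
    for i in range(len(df)):
        distances.append(
            haversine(40.671, -73.985,  # reference point (assumed)
                      df.iloc[i]["latitude"], df.iloc[i]["longitude"]))
    return distances
```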

Just using iterrows():

[Image: iterrows() implementation and timing]

The runtime went from 40 ms to 14 ms! Wow, can it really make that much of a difference? Yes, it can.
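A sketch of the iterrows() variant, under the same assumed column names and reference point; haversine() is repeated so the snippet runs standalone:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_iterrows(df):
    # iterrows() hands us each row as a Series, so there is no repeated
    # positional .iloc lookup inside the loop body.
    return [haversine(40.671, -73.985, row["latitude"], row["longitude"])
            for _, row in df.iterrows()]
```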

Now let's use apply(), with a lambda instead of a named def:

[Image: apply() with lambda implementation and timing]

Runtime went from 40 to 14 to 6.2 ms! Besides, apply() is more parallelism-friendly for Pandas-like implementations, because it does not impose a sequential order on how the records are processed.
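A sketch of the apply() version (same assumed columns and reference point, haversine() repeated for self-containment):

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_apply(df):
    # apply() with axis=1 calls the lambda once per row; Pandas drives
    # the iteration, with no user-level loop bookkeeping.
    return df.apply(
        lambda row: haversine(40.671, -73.985,
                              row["latitude"], row["longitude"]),
        axis=1)
```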

Now, use Pandas vectorization. It is so fast I had to loop 10 times just to measure it:

[Image: vectorized Pandas implementation and timing]

We are down to 0.31 ms! Not bad, considering we started above 40 ms.

One might ask: where is the vectorization? Well, look at haversine(). We pass in the entire Pandas Series and call haversine() just once! This is the power of Python operator overloading and dynamic dispatch: the same arithmetic operators and ufuncs now operate on whole columns at C speed.
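The vectorized call can be sketched like this (same assumed columns and reference point). Nothing inside haversine() changes; only the arguments do:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_vectorized(df):
    # One call, whole columns: each operator and ufunc inside haversine()
    # now runs over the entire Series at once instead of per row.
    return haversine(40.671, -73.985, df["latitude"], df["longitude"])
```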

Can we do better than this? Yes, by using NumPy directly:

[Image: NumPy implementation and timing]

.values is a property on the Series that gives you the underlying numpy.ndarray. We are now at 0.02 ms, over 10x faster than the previous Pandas-based vectorization. I must admit, this one amazed even me.
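A sketch of the NumPy variant (same assumptions as above). The only change from the Pandas-vectorized call is handing in raw arrays instead of Series:

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Same haversine as before, repeated so this sketch is self-contained.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

def haversine_numpy(df):
    # .values exposes the underlying ndarrays, skipping the Series
    # indexing and alignment machinery entirely.
    return haversine(40.671, -73.985,
                     df["latitude"].values, df["longitude"].values)
```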

Kudos to https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6. I only typed in the code, made the output more understandable, and made the different implementations easier to compare.

Remember our original runtime (> 42 ms); the final winner is 0.02 ms :). In case it is not obvious: I never had to change the haversine() function across the 5 different methods.

In Python the best way to loop is not to loop at all.

Comments:

"thanks, manny!"

Naser Tamimi, Data Engineer @ Expedia (3 years ago):
"What a nice comparison of doing the same thing using different tools and getting different performances."

Maksym Voitko, AI | Data Engineering & Back End & MLOps | Python, Big Data, AWS, GCP | Angel Investor (4 years ago):
"Awesome improvements! I will definitely try the last improvement to boost the performance in all my apps."

Modin: What I like about Modin is how it splits the input, using a Partition Manager, so it can handle both datasets with many rows and datasets with many columns.
