Pandas: 5 Very Different Performances, more than 2000x
Our dataset consists of 1631 records of hotel locations in New York. The above is the output from 5 different ways of calling the haversine function, which computes the great-circle distance between 2 points given as (lat, lon).
'haversine_looping' is the way a lot of us might have written it in Pandas: C-style looping plus indexing with .iloc and the like.
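A minimal sketch of what this looping style might look like. The column names `latitude` and `longitude`, the two sample rows, and the reference point (40.671, -73.985) are assumptions made for illustration, not the dataset used for the timings.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    earth_radius_miles = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return earth_radius_miles * 2 * np.arcsin(np.sqrt(a))

def haversine_looping(df, lat0, lon0):
    """C-style loop: index every row by position with .iloc."""
    distances = []
    for i in range(len(df)):
        d = haversine(lat0, lon0, df.iloc[i]["latitude"], df.iloc[i]["longitude"])
        distances.append(d)
    return distances

# Hypothetical sample data standing in for the hotel records.
df = pd.DataFrame({"latitude": [40.7580, 40.6892],
                   "longitude": [-73.9855, -74.0445]})
df["distance"] = haversine_looping(df, 40.671, -73.985)
print(df)
```

Every pass through the loop builds a row object via `.iloc[i]` and then looks up two labels in it, which is a large part of why this style tends to be the slowest.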
Just using iterrows():
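A sketch of the iterrows() variant, under the same assumptions as any example here: hypothetical `latitude`/`longitude` columns, two made-up rows, and an arbitrary reference point.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    earth_radius_miles = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return earth_radius_miles * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample data standing in for the hotel records.
df = pd.DataFrame({"latitude": [40.7580, 40.6892],
                   "longitude": [-73.9855, -74.0445]})

# iterrows() yields (index, row) pairs, so no positional indexing is needed.
distances = [
    haversine(40.671, -73.985, row["latitude"], row["longitude"])
    for _, row in df.iterrows()
]
df["distance"] = distances
print(df)
```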
The runtime went from 40ms to 14ms! Wow. Can it make that much of a difference? Yes, it can.
Now let's use apply() and even use lambda (instead of a 'def'):
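A sketch of the apply() plus lambda form. The column names, sample rows, and reference point are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    earth_radius_miles = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return earth_radius_miles * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample data standing in for the hotel records.
df = pd.DataFrame({"latitude": [40.7580, 40.6892],
                   "longitude": [-73.9855, -74.0445]})

# axis=1 hands each row to the lambda; no explicit loop or ordering is written.
df["distance"] = df.apply(
    lambda row: haversine(40.671, -73.985, row["latitude"], row["longitude"]),
    axis=1,
)
print(df)
```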
Runtime went from 40 -> 14 -> 6.2ms! Besides, apply() is friendlier to parallel, Pandas-like implementations because it does not impose a sequential order on how the records are processed.
Now, use Pandas vectorization. It is so fast that I have to loop 10 times to time it:
We are down to 0.31ms! Not bad considering we started at more than 40ms.
One might ask, where is the vectorization? Look at the call to haversine(): we pass entire Pandas Series and call haversine() once. This is the power of Python operator overloading and dynamic dispatch.
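A sketch of the vectorized call, timed over 10 repetitions with the standard-library `timeit` module. The column names, sample rows, and reference point are assumptions, and the timing on two rows will not resemble the numbers quoted in this article.

```python
import timeit

import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars, Series, or ndarrays."""
    earth_radius_miles = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return earth_radius_miles * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample data standing in for the hotel records.
df = pd.DataFrame({"latitude": [40.7580, 40.6892],
                   "longitude": [-73.9855, -74.0445]})

# One call: whole columns go in, a whole column comes out.
df["distance"] = haversine(40.671, -73.985, df["latitude"], df["longitude"])

# Average over 10 runs, since a single run is too short to time reliably.
runs = 10
total = timeit.timeit(
    lambda: haversine(40.671, -73.985, df["latitude"], df["longitude"]),
    number=runs,
)
print(f"{total / runs * 1000:.3f} ms per call")
```

The arithmetic operators and NumPy functions inside `haversine` operate element-wise on a Series, which is why the same function body serves both the scalar and the column case.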
Can you do better than this? Yes, we can use numpy directly:
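A sketch of the NumPy variant, again with assumed column names, sample rows, and reference point. The only change from the Series-based call is that plain arrays are passed in.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars, Series, or ndarrays."""
    earth_radius_miles = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return earth_radius_miles * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample data standing in for the hotel records.
df = pd.DataFrame({"latitude": [40.7580, 40.6892],
                   "longitude": [-73.9855, -74.0445]})

# Pass the underlying ndarrays so the math skips Series index handling.
dist = haversine(40.671, -73.985, df["latitude"].values, df["longitude"].values)
df["distance"] = dist
print(type(dist).__name__, dist)
```

Current pandas documentation recommends `Series.to_numpy()` over `.values`; either should work for this sketch.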
The .values property on a Series gives you the underlying numpy.ndarray. We are now at 0.02ms. That is over 10x faster than the previous Pandas-based vectorization. I must admit I was amazed by this myself.
Kudos to https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6. I only typed in the code, made the output more understandable, and made the different implementations easier to compare.
Remember our original runtime (> 42ms); the final winner is 0.02ms :). In case it is not obvious, I never had to change the haversine() function for the 5 different methods.
In Python the best way to loop is not to loop at all.
thanks, manny!
3 years ago: What a nice comparison of doing the same thing using different tools and getting different performance.
4 years ago: Awesome improvements! I will definitely try the last improvement to boost the performance in all my apps.
This article on KDnuggets is nice: https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
Modin: What I like about Modin is how it splits the input, using a Partition Manager, so that it can handle both datasets with many rows and datasets with many columns: