6th method for Pandas: swifter

In my previous Pandas article I showed 5 different ways to apply a haversine() function to a small dataset of New York hotel locations. The haversine() function is just an example of an expensive function you need to apply to a dataset.

I decided to try 'swifter', a Pandas acceleration package that tries to be smart about when to apply different strategies (vectorization vs. Dask parallelism across multiple cores) by sampling the actual dataset and functor.


swifter.apply() took 0.91 ms, which is much better than df.apply() at 5.47 ms. The code change is literally: import swifter, then use df.swifter.apply() instead of df.apply(). However, directly using numpy still kills it at 0.02 ms. In this case swifter decided not to use Dask. In a future post I will show an example of swifter picking Dask to parallelize the apply() across cores.
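The one-line swap can be sketched as below. This is a toy stand-in, not the original benchmark: the two-row DataFrame and the reference point are made up, and the haversine implementation is just the standard formula; the import is guarded in case swifter is not installed.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles via the standard haversine formula."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(d))

# Toy stand-in for the NY hotel dataset (column names assumed).
df = pd.DataFrame({"lat": [40.75, 40.76], "lng": [-73.99, -73.98]})

# Row-wise apply: the slow baseline.
dist_apply = df.apply(
    lambda r: haversine(40.671, -73.985, r["lat"], r["lng"]), axis=1)

# The one-line change: df.swifter.apply instead of df.apply.
try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    dist_swifter = df.swifter.apply(
        lambda r: haversine(40.671, -73.985, r["lat"], r["lng"]), axis=1)
except ImportError:
    dist_swifter = dist_apply  # swifter not installed; fall back for the sketch
```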

I replicated the input 500 times to get about 800k rows:
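Replicating the input like this is a one-liner with pd.concat. A sketch with a toy 3-row frame (the original used the 1631-row hotel table, so 500 copies give ~815k rows):

```python
import pandas as pd

df = pd.DataFrame({"lat": [40.75, 40.76, 40.77],
                   "lng": [-73.99, -73.98, -73.97]})

# Stack 500 copies of the input to stress-test the apply strategies.
big = pd.concat([df] * 500, ignore_index=True)
print(len(big))  # 3 * 500 = 1500 rows here
```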


One can see the last 3 methods are almost neck and neck. Method 5 uses Pandas' vectorization directly, while method 6 (the last one) retrieves the numpy array behind the Series. Nice to see 'swifter' automatically came up with a strategy that gives basically the same final result. That means we do not have to hand-tune whether to use vectorization vs. Dask, etc. Of course, this is one data point.
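The only difference between methods 5 and 6 is what the vectorized function is handed: Pandas Series vs. the underlying numpy arrays. A minimal sketch (toy data and column names assumed; the haversine formula is the standard one):

```python
import numpy as np
import pandas as pd

def haversine(lat1, lng1, lat2, lng2):
    # Standard haversine formula; numpy ufuncs accept scalars, Series, or ndarrays.
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(d))

df = pd.DataFrame({"lat": [40.75, 40.76], "lng": [-73.99, -73.98]})

# Method 5: vectorize over the Pandas Series directly.
via_series = haversine(40.671, -73.985, df["lat"], df["lng"])

# Method 6: vectorize over the numpy arrays behind the Series
# (skips Pandas' per-operation overhead).
via_numpy = haversine(40.671, -73.985,
                      df["lat"].to_numpy(), df["lng"].to_numpy())
```

Both compute identical results; method 6 only shaves off the Series bookkeeping.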


Update: I replicated the input 500 times to get > 800k rows and re-ran the test; the results are appended to the article. Methods 4, 5, and 6 are basically neck and neck: 4 uses 'swifter', 5 uses Pandas vectorization, and 6 uses numpy's vectorization.

Maksym Voitko

AI | Data Engineering & Back End & MLOps | Python, Big Data, AWS, GCP | Angel Investor

5y

Have you ever faced cases when swifter beat numpy?

Kevin Tran

Senior Data Scientist | LinkedIn Top Voice 2019 in Data Science & Analytics

5y

Moral of the story: Numpy is still currently the king for many computations in many instances. Other shiny tools like Swifter, Dask, and Modin are in general faster than Pandas itself. However, there are cases where the gain is small, so it is kind of hit or miss. In Numpy we trust :)
