6th method for Pandas: swifter
In my previous Pandas article I showed 5 different ways to apply a heaverside() function to a small dataset of NY hotel locations. The heaverside() function is just an example of an expensive function you need to apply to a dataset.
I decided to try swifter, a Pandas acceleration package that tries to be smart about which strategy to apply (vectorization vs. Dask parallelism across multiple cores) by sampling the actual dataset and function.
swifter.apply() took 0.91 ms, which is much better than the 5.47 ms for df.apply(). The code change is literally to import swifter and then use df.swifter.apply() instead of df.apply(). However, using numpy directly still kills it at 0.02 ms. In this case swifter decided not to use Dask. In a future post I will show an example of swifter picking Dask to parallelize the apply() across cores.
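A minimal sketch of that code change is below. It assumes the expensive function is a haversine great-circle distance (a stand-in for the heaverside() function from the previous article); the column names, the reference point, and the sample coordinates are made up for illustration. The import is guarded so the sketch still runs, using plain df.apply(), when swifter is not installed.

```python
import numpy as np
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (assumed stand-in for heaverside())."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))


# Hypothetical data: 50 made-up points around New York.
df = pd.DataFrame({
    "latitude": np.linspace(40.60, 40.90, 50),
    "longitude": np.linspace(-74.05, -73.75, 50),
})


def row_distance(row):
    # Distance from a fixed (made-up) reference point to each row.
    return haversine(40.671, -73.985, row["latitude"], row["longitude"])


# Baseline: plain Pandas row-wise apply.
plain = df.apply(row_distance, axis=1)

# The swifter version differs only in the accessor.
if HAVE_SWIFTER:
    fast = df.swifter.apply(row_distance, axis=1)
else:
    fast = df.apply(row_distance, axis=1)

print(np.allclose(plain.to_numpy(), fast.to_numpy()))
```

Both calls should return the same distances; only the time taken differs.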
I replicated the input 500 times to get about 800k rows:
One can see the last 3 methods are almost neck and neck. Method 5 uses Pandas' vectorization directly, whereas method 6 (the last one) retrieves the numpy array behind the Series. It is nice to see that swifter automatically came up with a strategy that gives basically the same final result. That means we do not have to hand-tune whether to use vectorization vs. Dask, etc. Of course, this is only one data point.
I replicated the input 500 times to get > 800k rows and re-ran the test. The results are appended to the article. Methods 4, 5, and 6 are basically neck and neck: 4 uses swifter, 5 uses Pandas vectorization, and 6 uses numpy's vectorization.
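The difference between methods 5 and 6, and the 500x replication, can be sketched roughly as follows. This is an illustration under assumptions, not the benchmark code from the article: the haversine function stands in for heaverside(), and the tiny data frame, column names, and reference point are made up.

```python
import numpy as np
import pandas as pd


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars, Series, or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))


# Hypothetical small input (the real dataset is not reproduced here).
small = pd.DataFrame({
    "latitude": [40.7128, 40.7580, 40.6892],
    "longitude": [-74.0060, -73.9855, -74.0445],
})

# Replicate the input 500 times, giving a fresh 0..n-1 index.
big = pd.concat([small] * 500, ignore_index=True)

# Method 5 style: pass whole Pandas Series, so the result is a Series.
by_series = haversine(40.671, -73.985, big["latitude"], big["longitude"])

# Method 6 style: pass the numpy arrays behind the Series.
by_array = haversine(40.671, -73.985,
                     big["latitude"].to_numpy(),
                     big["longitude"].to_numpy())

print(type(by_series).__name__, type(by_array).__name__)
print(np.allclose(by_series.to_numpy(), by_array))
```

The two calls compute the same values. The numpy version skips the index handling that Pandas does on every Series operation, which is one plausible reason it comes out slightly ahead.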
AI | Data Engineering & Back End & MLOps | Python, Big Data, AWS, GCP | Angel Investor
(5 years ago) Have you ever faced cases where swifter beat numpy?
Senior Data Scientist | LinkedIn Top Voice 2019 in Data Science & Analytics
(5 years ago) Moral of the story: Numpy is still the king for many computations in many cases. Other shiny tools like Swifter, Dask, and Modin are in general faster than Pandas itself. However, there are cases where the gain is small, so it is kind of hit or miss. In Numpy we trust :)