6th method for Pandas: swifter

In my previous Pandas article I showed 5 different ways to apply a haversine() function to a small dataset of New York hotel locations. The haversine() function is just an example of an expensive function you need to apply to a dataset.

I decided to try 'swifter', a Pandas acceleration package that tries to be smart about when to apply different strategies (vectorization vs. Dask parallelism across multiple cores) by sampling the actual dataset and functor.


swifter.apply() took 0.91 ms, which is much better than df.apply() at 5.47 ms. The code change is literally: import swifter, then use df.swifter.apply() instead of df.apply(). However, directly using numpy still kills it at 0.02 ms. In this case swifter decided not to use Dask. In a future post I will show an example of swifter picking Dask to parallelize the apply() across cores.
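The one-line swap can be sketched as below. This is a toy stand-in, not the original benchmark: the two-row DataFrame and the reference point are made up, and the haversine implementation is just the standard formula; the import is guarded in case swifter is not installed.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles via the standard haversine formula."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(d))

# Toy stand-in for the NY hotel dataset (column names assumed).
df = pd.DataFrame({"lat": [40.75, 40.76], "lng": [-73.99, -73.98]})

# Row-wise apply: the slow baseline.
dist_apply = df.apply(
    lambda r: haversine(40.671, -73.985, r["lat"], r["lng"]), axis=1)

# The one-line change: df.swifter.apply instead of df.apply.
try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    dist_swifter = df.swifter.apply(
        lambda r: haversine(40.671, -73.985, r["lat"], r["lng"]), axis=1)
except ImportError:
    dist_swifter = dist_apply  # swifter not installed; fall back for the sketch
```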

I replicated the input 500 times to get about 800k rows:
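Replicating the input like this is a one-liner with pd.concat. A sketch with a toy 3-row frame (the original used the 1631-row hotel table, so 500 copies give ~815k rows):

```python
import pandas as pd

df = pd.DataFrame({"lat": [40.75, 40.76, 40.77],
                   "lng": [-73.99, -73.98, -73.97]})

# Stack 500 copies of the input to stress-test the apply strategies.
big = pd.concat([df] * 500, ignore_index=True)
print(len(big))  # 3 * 500 = 1500 rows here
```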


One can see the last 3 methods are almost neck and neck. Method 5 uses Pandas' vectorization directly, while method 6 (the last one) retrieves the numpy array behind the Series. Nice to see 'swifter' automatically came up with a strategy that gives basically the same final result. That means we do not have to hand-tune whether to use vectorization vs. Dask, etc. Of course, this is one data point.
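The only difference between methods 5 and 6 is what the vectorized function is handed: Pandas Series vs. the underlying numpy arrays. A minimal sketch (toy data and column names assumed; the haversine formula is the standard one):

```python
import numpy as np
import pandas as pd

def haversine(lat1, lng1, lat2, lng2):
    # Standard haversine formula; numpy ufuncs accept scalars, Series, or ndarrays.
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(d))

df = pd.DataFrame({"lat": [40.75, 40.76], "lng": [-73.99, -73.98]})

# Method 5: vectorize over the Pandas Series directly.
via_series = haversine(40.671, -73.985, df["lat"], df["lng"])

# Method 6: vectorize over the numpy arrays behind the Series
# (skips Pandas' per-operation overhead).
via_numpy = haversine(40.671, -73.985,
                      df["lat"].to_numpy(), df["lng"].to_numpy())
```

Both compute identical results; method 6 only shaves off the Series bookkeeping.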


Update: I replicated the input 500 times to get > 800k rows and re-ran the test; the results are appended to the article. Methods 4, 5, and 6 are basically neck and neck: 4 uses 'swifter', 5 uses Pandas vectorization, and 6 uses numpy's vectorization.

Maksym Voitko

AI | Data Engineering & Back End & MLOps | Python, Big Data, AWS, GCP | Angel Investor

5y

Have you ever faced cases when swifter beat numpy?

Kevin Tran

Senior Data Scientist | LinkedIn Top Voice 2019 in Data Science & Analytics

5y

Moral of the story: Numpy is still currently the king for many computations in many instances. Other shiny tools like Swifter, Dask, and Modin are in general faster than Pandas itself. However, there are cases where the gain is small, so it is kind of hit or miss. In Numpy we trust :)
