Choosing the Right Python Framework: Pandas, SFrame or Dask
Introduction
I’ve recently been working on evaluating some of the Python data processing tools, primarily Pandas, SFrame and Dask. Pandas and Dask are PyData tools, while SFrame comes from Turi as part of the GraphLab Create package.
A little intro to each tool:
Pandas is probably the most popular tool in Python for data processing. It is an in-memory tool, which means that all the processing happens in RAM. It is really rich in the methods it provides, and it supports a lot of data formats – CSV, HDF5, SAS, etc.
SFrame, like pandas, is a computation library that provides dataframes to work on. “It is a scalable, out-of-core dataframe, which allows you to work with datasets that are larger than the amount of RAM on your system.” This essentially means that you can compute on datasets that are larger than memory, or keep computing until you consume the entire RAM and disk space.
Dask is a flexible parallel computing library that provides a dataframe data structure. It is an out-of-core, scalable framework. It uses NumPy and pandas under the hood and introduces multi-core, parallel chunked processing.
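As a rough illustration of the difference in flavour, this is how loading a single CSV looks in each framework; the file name is a placeholder, and the SFrame import assumes the standalone open-source `sframe` package:

```python
import pandas as pd
import sframe                      # standalone open-source SFrame package
import dask.dataframe as dd

pdf = pd.read_csv("data.csv")              # pandas: loads the whole file into RAM
sf = sframe.SFrame.read_csv("data.csv")    # SFrame: out-of-core dataframe backed by disk
ddf = dd.read_csv("data.csv")              # Dask: lazy dataframe split into partitions
```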
What did we expect from our tests?
1. We wanted to understand which tool to use in which situation, along with each tool's capabilities for solving the problem.
2. We also wanted to understand whether a single tool is sufficient for the entire task, or whether we would need help from other packages to cover the workflow end to end.
Tests Performed
The aim was simple: to benchmark the performance of these tools and frameworks.
- To do so, we ran the same tests on sample data of three sizes: 2k records, 5k records and 50k records.
- We then ran the same test suite over the full data, which was more than 2.5M records per file.
- We worked on 10 files at a time; each round of tests was performed by concatenating the data read from the 10 files (a sketch of this pattern follows the list).
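A sketch of the multi-file pattern, with made-up file names: pandas reads each file and concatenates eagerly in memory, while Dask can treat a glob of files as one lazy dataframe.

```python
import glob
import pandas as pd
import dask.dataframe as dd

files = sorted(glob.glob("data/part_*.csv"))[:10]   # hypothetical file layout

# pandas: read each file and concatenate the results in memory
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Dask: one lazy dataframe over all matching files, read chunk by chunk at compute time
ddf = dd.read_csv("data/part_*.csv")
```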
Tests performed were:
- Reading files of different sizes in the different frameworks
- Applying transformations to the data: validation, type conversion
- Creating columns from sample business rules (derived by equations, if/else, etc.)
- Summarizing/aggregating the data: basic statistics using group by / order by
- Reporting capabilities of the frameworks: graphs, pictures, etc.
- Daily data tasks such as subsetting, de-duplication, sorting, etc.
- Writing to a data format: CSV or the framework's native format (a pandas sketch of these steps follows the list)
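To make the test steps concrete, here is roughly what they look like in pandas; the column names (`amount`, `region`) and file paths are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data/part_01.csv")

# Transformation: type conversion and validation
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

# New column from a simple business rule (if/else style)
df["high_value"] = np.where(df["amount"] > 1000, 1, 0)

# Summarizing/aggregating with group by
summary = df.groupby("region")["amount"].agg(["count", "mean", "sum"])

# Daily tasks: subsetting, de-duplication, sorting
subset = df[df["high_value"] == 1].drop_duplicates().sort_values("amount")

# Writing back out to CSV
subset.to_csv("high_value.csv", index=False)
```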
Results
The results made it clear which frameworks work best for us. Pandas and SFrame work best overall because they are very well established and constantly evolving. Dask is a modern framework from PyData and is getting a lot of attention. It is very good at processing data larger than RAM and at working in parallel threads. It is good to have in your stack, but its usage will be limited for now.
See results captured below:
Pandas
PROs:
1. It is a great tool to use when your data can fit in memory (RAM).
2. It is amazingly fast and very programmer friendly.
3. This framework has the easiest implementation.
4. Very well documented, with a lot of resources available on the internet.
CONs:
1. If your data gets bigger, pandas starts throwing out-of-memory errors.
2. You need to be mindful of not consuming all the available memory; if you have 16 GB of RAM of which 10 GB is free, I suggest not consuming more than 9 GB (a chunked-reading sketch follows this list).
3. You need help from other libraries such as scikit-learn or statsmodels for advanced statistical tasks and machine learning.
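One common way to stay inside that memory budget is to process a large CSV in fixed-size chunks rather than loading it all at once; a minimal sketch, with the file and column names assumed:

```python
import pandas as pd

total = 0
# Read the file 500,000 rows at a time so only one chunk is in RAM at any point
for chunk in pd.read_csv("data/big_file.csv", chunksize=500000):
    total += chunk["amount"].sum()
print(total)
```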
SFrame
PROs:
1. SFrame can work with data that is larger than memory.
2. Very well documented, and Turi maintains a lot of tutorials.
3. Easy implementation.
4. Fantastic graph creation and visualization.
5. It's a complete package; it ships with a lot of machine learning libraries, etc.
6. Lazy computation (see the sketch after this list).
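A minimal SFrame sketch, assuming the standalone open-source `sframe` package and made-up column names; the aggregate helpers follow the GraphLab Create API, so treat the exact module path as an assumption:

```python
import sframe
from sframe import aggregate as agg

# Out-of-core load: the data is spilled to disk instead of living entirely in RAM
sf = sframe.SFrame.read_csv("data/part_01.csv")

# Derived column via apply
sf["high_value"] = sf["amount"].apply(lambda x: 1 if x > 1000 else 0)

# Group-by aggregation
summary = sf.groupby("region", {"mean_amount": agg.MEAN("amount"),
                                "n": agg.COUNT()})

# Save in SFrame's native on-disk format
sf.save("part_01_sframe")
```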
CONs:
1. It is available free for educational purposes only
2. No parallel processing
Dask
PROs:
1. Dask works very well with data that is larger than memory.
2. It can perform tasks in parallel, and you can even tell Dask how many cores to use.
3. Dask computes lazily, so nothing happens until an action is performed (see the sketch after this list).
4. It can be used for distributed computing; however, the distributed scheduler is now a separate project from Dask.
5. Dask also works well with other distributed computing tools such as Spark.
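A minimal Dask sketch of the lazy, parallel behaviour; the glob pattern, column names and worker count are assumptions, and the scheduler keyword reflects recent Dask versions:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/part_*.csv")              # lazy: nothing is read yet

ddf["high_value"] = ddf["amount"] > 1000          # still lazy
summary = ddf.groupby("region")["amount"].mean()  # still lazy

# Work only happens here; run it on a thread pool with four workers
result = summary.compute(scheduler="threads", num_workers=4)
```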
CONs:
1. Dask lets you use threads, but you'll need pandas 0.18.0 or later.
2. Dask DataFrame implements a commonly used subset of the pandas API, not all of it.
3. Dask dataframes are only updateable (e.g. adding a new column) from version 0.10.2 onwards.
Conclusion
I am going with pandas for datasets that fit in memory and SFrame for anything that spills over memory. I am keeping Dask as an option for tasks that involve processing larger chunks of data in parallel.
Comments
- Also, I don't think SFrame con no. 2 is right. SFrame uses all the available cores to process the data, so it does do parallel processing, while pandas does not parallelize processing. Hence, on an 8-CPU server, SFrame is about 8 times faster than pandas for the same apply function.
- Puneet Tripathi: 1) SFrame has been released as open source (https://github.com/turi-code/SFrame), so it can be used for free. However, not all machine learning algorithms are available in the open-source release. And since Turi was bought by Apple in July 2016, those algorithms cannot be used, as the Turi offering can no longer be purchased.