Choosing the Right Python Framework: Pandas, SFrame or Dask
Introduction
I’ve recently been working on evaluating some of the Python data processing tools, primarily Pandas, SFrame and Dask. Pandas and Dask are PyData tools, while SFrame comes from Turi as part of the GraphLab Create package.
A little intro to each tool:
Pandas is probably the most popular tool in Python for data processing. It is an in-memory tool, which means that all the processing happens in RAM. It is really rich in the methods it provides, and it supports a lot of data formats – CSV, HDF5, SAS, etc.
SFrame, like pandas, is a computation library that provides dataframes to work on. “It is a scalable, out-of-core dataframe, which allows you to work with datasets that are larger than the amount of RAM on your system.” This essentially means that you can compute on datasets that are larger than memory, or keep computing until you consume the entire RAM and disk space.
Dask is a flexible parallel computing library that provides a dataframe data structure. It is an out-of-core, scalable framework. It uses NumPy and pandas under the hood and introduces multi-core, parallel chunked processing.
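As a rough illustration of the difference in flavour, this is how loading a single CSV looks in each framework; the file name is a placeholder, and the SFrame import assumes the standalone open-source `sframe` package:

```python
import pandas as pd
import sframe                      # standalone open-source SFrame package
import dask.dataframe as dd

pdf = pd.read_csv("data.csv")              # pandas: loads the whole file into RAM
sf = sframe.SFrame.read_csv("data.csv")    # SFrame: out-of-core dataframe backed by disk
ddf = dd.read_csv("data.csv")              # Dask: lazy dataframe split into partitions
```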
What did we expect from our tests?
1. We wanted to understand which tool to use in which situation, along with each tool's capabilities for solving the problem.
2. We also wanted to understand whether a single tool is sufficient for the entire task, or whether we would need help from other packages to cover the workflow end to end.
Tests Performed
The aim was simple: to benchmark the performance of these tools and frameworks.
- To do so, we ran the same tests on sample data of three sizes: 2k records, 5k records and 50k records.
- We then ran the same test suite over the full data, which was more than 2.5M records per file.
- We worked on 10 files at a time; each round of tests was performed by concatenating the data read from the 10 files (a sketch of this pattern follows the list).
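A sketch of the multi-file pattern, with made-up file names: pandas reads each file and concatenates eagerly in memory, while Dask can treat a glob of files as one lazy dataframe.

```python
import glob
import pandas as pd
import dask.dataframe as dd

files = sorted(glob.glob("data/part_*.csv"))[:10]   # hypothetical file layout

# pandas: read each file and concatenate the results in memory
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Dask: one lazy dataframe over all matching files, read chunk by chunk at compute time
ddf = dd.read_csv("data/part_*.csv")
```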
Tests performed were:
- Reading files of different sizes in the different frameworks
- Applying transformations to the data: validation, type conversion
- Creating columns from sample business rules (derived by equations, if/else, etc.)
- Summarizing/aggregating the data: basic statistics using group by / order by
- Reporting capabilities of the frameworks: graphs, pictures, etc.
- Daily data tasks such as subsetting, de-duplication, sorting, etc.
- Writing to a data format: CSV or the framework's native format (a pandas sketch of these steps follows the list)
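To make the test steps concrete, here is roughly what they look like in pandas; the column names (`amount`, `region`) and file paths are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data/part_01.csv")

# Transformation: type conversion and validation
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

# New column from a simple business rule (if/else style)
df["high_value"] = np.where(df["amount"] > 1000, 1, 0)

# Summarizing/aggregating with group by
summary = df.groupby("region")["amount"].agg(["count", "mean", "sum"])

# Daily tasks: subsetting, de-duplication, sorting
subset = df[df["high_value"] == 1].drop_duplicates().sort_values("amount")

# Writing back out to CSV
subset.to_csv("high_value.csv", index=False)
```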
Results
The results made it clear which frameworks work best for us. Pandas and SFrame work best overall because they are very well established and constantly evolving. Dask is a modern framework from PyData and is getting a lot of attention. It is very good at processing data larger than RAM and at working in parallel threads. It is good to have in your stack, but its usage will be limited for now.
See results captured below:
Pandas
PROs:
1. It is a great tool to use when your data can fit in memory (RAM).
2. It is amazingly fast and very programmer friendly.
3. This framework has the easiest implementation.
4. Very well documented, with a lot of resources available on the internet.
CONs:
1. If your data gets bigger, pandas starts throwing out-of-memory errors.
2. You need to be mindful of not consuming all the available memory; if you have 16 GB of RAM of which 10 GB is free, I suggest not consuming more than 9 GB (a chunked-reading sketch follows this list).
3. You need help from other libraries such as scikit-learn or statsmodels for advanced statistical tasks and machine learning.
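One common way to stay inside that memory budget is to process a large CSV in fixed-size chunks rather than loading it all at once; a minimal sketch, with the file and column names assumed:

```python
import pandas as pd

total = 0
# Read the file 500,000 rows at a time so only one chunk is in RAM at any point
for chunk in pd.read_csv("data/big_file.csv", chunksize=500000):
    total += chunk["amount"].sum()
print(total)
```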
SFrame
PROs:
1. SFrame can work with data that is larger than memory.
2. Very well documented, and Turi maintains a lot of tutorials.
3. Easy implementation.
4. Fantastic graph creation and visualization.
5. It's a complete package; it ships with a lot of machine learning libraries, etc.
6. Lazy computation (see the sketch after this list).
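A minimal SFrame sketch, assuming the standalone open-source `sframe` package and made-up column names; the aggregate helpers follow the GraphLab Create API, so treat the exact module path as an assumption:

```python
import sframe
from sframe import aggregate as agg

# Out-of-core load: the data is spilled to disk instead of living entirely in RAM
sf = sframe.SFrame.read_csv("data/part_01.csv")

# Derived column via apply
sf["high_value"] = sf["amount"].apply(lambda x: 1 if x > 1000 else 0)

# Group-by aggregation
summary = sf.groupby("region", {"mean_amount": agg.MEAN("amount"),
                                "n": agg.COUNT()})

# Save in SFrame's native on-disk format
sf.save("part_01_sframe")
```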
CONs:
1. It is available free for educational purposes only
2. No parallel processing
Dask
PROs:
1. Dask works very well with data that is larger than memory.
2. It can perform tasks in parallel, and you can even tell Dask how many cores to use.
3. Dask computes lazily, so nothing happens until an action is performed (see the sketch after this list).
4. It can be used for distributed computing; however, the distributed scheduler is now a separate project from Dask.
5. Dask also works well with other distributed computing tools such as Spark.
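A minimal Dask sketch of the lazy, parallel behaviour; the glob pattern, column names and worker count are assumptions, and the scheduler keyword reflects recent Dask versions:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/part_*.csv")              # lazy: nothing is read yet

ddf["high_value"] = ddf["amount"] > 1000          # still lazy
summary = ddf.groupby("region")["amount"].mean()  # still lazy

# Work only happens here; run it on a thread pool with four workers
result = summary.compute(scheduler="threads", num_workers=4)
```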
CONs:
1. Dask lets you use threads, but you'll need pandas 0.18.0 or later.
2. Dask DataFrame implements a commonly used subset of the pandas API, not all of it.
3. Dask dataframes are only updateable (e.g. adding a new column) from version 0.10.2 onwards.
Conclusion
I am going with pandas for datasets that fit in memory and SFrame for anything that spills over memory. I am keeping Dask as an option for tasks that involve processing larger chunks of data in parallel.
Comments
- Also, I don't think SFrame con no. 2 is right. SFrame uses all the available cores to process the data, so it does do parallel processing, while pandas does not parallelize processing. Hence, on an 8-CPU server, SFrame is about 8 times faster than pandas for the same apply function.
- Puneet Tripathi: 1) SFrame has been released as open source (https://github.com/turi-code/SFrame), so it can be used for free. However, not all machine learning algorithms are available in the open-source release. And since Turi was bought by Apple in July 2016, those algorithms cannot be used, as the Turi offering can no longer be purchased.