Different random forest packages in R

One of the important steps in using analytics to generate insights is model fitting. Typical projects involve a lot of data cleaning so that the model achieves high accuracy when it is applied. Competitions are all about data cleaning and models. Many models can be fitted to data under different conditions, and one of the most intuitive is the decision tree. Decision trees classify data into buckets through a series of “decisions” on the feature values. Most competitions start by benchmarking against the results of an ensemble of trees, known as a random decision forest. Random forests, as they are usually called, train each tree on a bootstrap sample of the data and combine the trees’ predictions, which makes them one of the best-known examples of ‘bagging’ techniques. R, the popular language for model fitting, has a variety of random forest packages available. Let’s discuss a few of them (this list is in no way exhaustive).

  • randomForest - The ‘classic’ package in R, which implements the basic random forest logic and is really robust. The package is very user friendly and lets the user tune settings such as the number of trees and the size of the trees (through nodesize and maxnodes). It can optionally compute feature importance and proximity measures. Feature importance is based on the increase in error when the values of one feature are permuted in the OOB (out-of-bag) data while everything else is kept the same. The proximity measure, on the other hand, is a matrix whose (i,j) element is the fraction of trees in which observations i and j fall in the same terminal node. The package can be used for classification or regression problems and is easy to learn.
  • cforest - A function in the party package that is computationally more expensive than randomForest and often better in terms of accuracy. It builds the forest from conditional inference trees, which select the split variable with a statistical test rather than by directly maximizing an impurity measure. At the same time it is slower and can handle less data for the same memory. It then combines the trees using a weighted average to get the final ensemble. However, the main reason cforest gives more reliable predictions is that it produces unbiased trees. randomForest has the drawback that its simpler algorithm is biased towards features with many possible cut points, so continuous features or features with many categories tend to be preferred. Whenever you have large computational resources at your disposal, do use cforest for accuracy.
  • obliqueRF - “Oblique” forests are an underrated, advanced yet useful concept based on splitting nodes with hyperplanes (linear combinations of features) instead of thresholds on single features. They can easily outperform randomForest, especially when the features are numerical or we have spectral data. Just like randomForest, oblique forests are governed by the subspace dimension (the number of features tried at each split) and the ensemble size (the number of trees). However, since they make oblique cuts rather than orthogonal ones, each recursive binary split is found by fitting a linear model such as ridge regression. I have seen a cool implementation of oblique random forests as the prize-winning code in a Kaggle competition! Hence oblique random forests sure pack a punch. obliqueRF does end up having a higher bias and lower variance than randomForest.
  • ParallelForest - ParallelForest is an implementation of random forests that uses parallel computing. The package provides functions such as grow.forest. It’s pretty handy when there are millions of rows in the training set. A data set that took days to fit with the randomForest package was handled by ParallelForest in under an hour. However, there are still doubts about whether the accuracy is the same for both packages under all conditions and whether classification can be implemented using parallel processing. (Another package, bigrf, also uses multi-threading and caching for very large data, but it was built to handle very large data sets rather than to speed up processing.)
  • randomUniformForest - This package produces unpruned trees and is useful for regression, classification and unsupervised learning. If cforest is slower but more accurate than randomForest, then randomUniformForest falls at the other end as the faster but slightly less accurate version. The trees have lower correlation, which results in lower bias but higher variance. Moreover, the cut points are drawn from a uniform distribution. Since we don’t care much about bias, as perfectly randomized trees will cancel it out, randomUniformForest is useful in situations where the features themselves follow specified distributions.
  • randomForestSRC - Survival, Regression and Classification (SRC) are the three types of models for which this package provides a unified function. Additionally, there are multivariate and unsupervised extensions as well as parallel processing through OpenMP. I have come to use this package whenever there is doubt about the best approach for fitting a model to the data. Coupled with missing value imputation, the package provides a first-look model that is useful for further exploration and deep-dive analysis.
  • ranger - ranger comes to the rescue when you have high-dimensional data and want a memory-efficient yet fast implementation of random forests. The name ranger comes from RANdom forest GEneRator. I have mainly used ranger to build models quickly and to find optimal parameter values through parameter tuning.
  • Rborist - Rborist is a high-performance implementation of random forests. Compared to the original randomForest, this package optimizes the algorithm so that model fitting is performed with less data movement within memory, which creates opportunities for scaling up performance. Hence, as the number of features increases, the processing time increases only linearly (as opposed to the exponential increase expected for randomForest). The package also supports missing value imputation. Hence, in projects where we generate a lot of features ourselves, this package becomes more suitable.
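
To make the options described for randomForest concrete, here is a minimal sketch in R that fits a classic randomForest with importance and proximity switched on, then fits a comparable ranger model. The iris data set, the seed and the parameter values (ntree = 500, mtry = 2, nodesize = 5) are illustrative assumptions rather than recommendations, and both packages are assumed to be installed.

```r
# A minimal sketch, not a benchmark. Assumes the randomForest and ranger
# packages are installed, e.g. install.packages(c("randomForest", "ranger")).
library(randomForest)
library(ranger)

set.seed(42)  # make the bootstrap samples reproducible

# Classic randomForest on the built-in iris data.
rf <- randomForest(
  Species ~ ., data = iris,
  ntree      = 500,    # number of trees
  mtry       = 2,      # features tried at each split
  nodesize   = 5,      # minimum size of terminal nodes (controls tree size)
  importance = TRUE,   # compute permutation importance from the OOB data
  proximity  = TRUE,   # compute the proximity matrix
  oob.prox   = FALSE   # base proximity on all rows, not only OOB rows
)

print(rf)                         # OOB error estimate and confusion matrix
imp  <- importance(rf, type = 1)  # mean decrease in accuracy per feature
prox <- rf$proximity              # prox[i, j]: fraction of trees in which
                                  # rows i and j share a terminal node

# A comparable model with ranger.
rg <- ranger(
  Species ~ ., data = iris,
  num.trees  = 500,
  mtry       = 2,
  importance = "permutation",
  seed       = 42
)

rg$prediction.error     # OOB misclassification rate
rg$variable.importance  # permutation importance per feature
```

With oob.prox = FALSE, prox is a 150 x 150 matrix that follows the proximity definition given in the randomForest entry. Comparing imp with rg$variable.importance is a quick sanity check that both packages rank the features similarly on the same data.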

Since the idea was first suggested in the 1990s, random forests have become a popular method of model fitting and are used in various forms. There are even more implementations, such as rotationForest (which fits trees on principal components of the features), xgboost (extreme gradient boosting, a clever tree-based technique that uses boosting), rFerns (random ferns, useful for comparing images) and regularized random forests. This article will be useful for those who have gone through decision tree and basic random forest concepts and are willing to learn about their different variations in R.

P.S. Let me know if there are other implementations out there.



Harshvardhan .

PhD Candidate in Business Analytics | University of Tennessee | HP Inc. | IIM Indore

3 years ago

Great article. There's also spark which is based on Apache Spark. https://spark.apache.org/docs/latest/api/R/spark.randomForest.html

Collin Ching

Data Scientist at WBD

6 years ago

will add that Rborist outperforms Ranger on lower-dim data with large sample sizes, whereas Ranger is better with data with more features

Dinar Mingaliev

Data Scientist at Drei Österreich

6 years ago

Thanks! I personally like performance of cforest and obliqueRF. Are there similar algorithms in Python? and do you have accuracy benchmarks for the algorithms?

龚自清

China Pacific Insurance (中国太平洋保险) - Big Data Modeling and Analytics Manager

7 years ago

Thank u for sharing! Do u think cforest could beat xgboost or lightgbm in terms of accuracy?

Ziyue Gao

Head of Data & Operations @ Elven

7 years ago

Thanks for your post! It's really helpful to R users who want to use RF!



More articles by Madhur Modi

  • Problems with data
  • Software Project management
  • Hands-on Spectral clustering in R