XGBoost — The Undisputed GOAT!
Ojash Shrestha
Senior Software Engineer | MS CS @ CSUF '24 | Prev @ Neplopers | MIT - HMS Bootcamp '19 | MPP Grad - Data Science '17
In this article, we’ll learn about XGBoost: its background, its widespread use in competitions such as Kaggle’s, and the foundations of the algorithm, so you can build an intuitive understanding of how it works.
XGBoost
XGBoost is a highly flexible, portable, and efficient library for ensemble learning with decision trees, built on the gradient boosting framework. It implements machine learning algorithms under gradient boosting and solves data science problems accurately and quickly with parallel tree boosting, also known as Gradient Boosting Machines (GBM) or Gradient Boosted Decision Trees (GBDT). It is extremely portable and cross-platform: the very same code runs on major distributed environments such as Hadoop, MPI, and SGE, enabling it to solve problems with well over a billion examples.
In recent years, XGBoost has dominated applied machine learning and every major competition, including those on Kaggle. Thanks to the speed and performance this algorithm provides, its adoption and popularity have grown exponentially over the last five years. It is a highly optimized algorithm featuring parallel processing, tree pruning, native handling of missing values, and regularization to avoid bias and overfitting.
First of all, if we break down the name itself, XGBoost stands for eXtreme Gradient Boosting. It was created by Tianqi Chen and Carlos Guestrin at the University of Washington in 2014. Over time, numerous developers have contributed to this open-source library, which now provides bindings for a range of programming languages such as Python, R, C++, Java, Julia, Perl, and Scala on all major operating systems, including Windows, Linux, and macOS. It started as a research project within the Distributed (Deep) Machine Learning Community and has since become the go-to algorithm for teams participating in machine learning competitions.
To put XGBoost's popularity in perspective: according to Kaggle's blog, 17 of the 29 challenge-winning solutions it featured used XGBoost. Eight of those used XGBoost alone to train the model, while the rest combined XGBoost in an ensemble with neural networks. The second most popular method, deep neural networks, appeared in only 11 winning solutions. The gap between 11 and 17 demonstrates the efficiency of XGBoost and its outstanding results.
Understanding the foundations of XGBoost with a real-world analogy
Let us build an intuitive understanding of XGBoost. To do this, we'll walk through a series of examples, using the US presidential election as an analogy to grasp the underlying concepts that build up to XGBoost.
Decision Tree
Every voter has underlying concerns they want the president of their country to address and support. These concerns form a set of criteria, such as gun control, stance on immigration, and conservative or liberal views, that voters in a particular state weigh one question at a time; working through such criteria step by step is the analogy for a decision tree.
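As a minimal sketch of this idea, the snippet below fits a tiny decision tree with scikit-learn; the "voter issue" features and the labels are purely illustrative, not real data.

```python
# Minimal decision-tree sketch (scikit-learn); the "voter issue" features
# and the labels are made up purely for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [supports_gun_control, favors_immigration_reform, leans_conservative]
X = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
y = np.array([1, 0, 0, 1])  # toy labels: 1 = candidate A, 0 = candidate B

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)
print(tree.predict([[1, 1, 1]]))  # classify a new, unseen voter profile
```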
Bagging (Bootstrap Aggregating)
We know that each state victory is just one piece of the presidential run; it is the aggregation of wins across different states that delivers the presidency. Similarly, bagging trains many models on different samples of the data and aggregates all of their outputs into the final decision.
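A minimal sketch of bagging, assuming scikit-learn and a synthetic dataset: BaggingClassifier's default base learner is a decision tree, so each tree is fit on its own bootstrap sample and their votes are aggregated.

```python
# Minimal bagging sketch: scikit-learn's BaggingClassifier fits many trees,
# each on its own bootstrap sample, and aggregates their votes.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data

bagging = BaggingClassifier(
    n_estimators=50,   # 50 bootstrap samples, hence 50 trees (default base learner)
    random_state=0,
)
bagging.fit(X, y)
print(bagging.score(X, y))  # accuracy of the aggregated vote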
Random Forest
Random forest is based on the bagging algorithm; the comparative difference is that each split considers only a randomly selected subset of the features. This can be related to the different criteria and inclinations of various states: most Midwestern states lean on conservative values, while the West and East Coasts are more liberal. However, even California and Massachusetts vote the way they do for different reasons.
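A minimal sketch of a random forest on synthetic data, again assuming scikit-learn; the max_features parameter is what restricts each split to a random subset of features.

```python
# Minimal random-forest sketch: bagging plus a random subset of features
# considered at every split (controlled by max_features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # each split sees only a random subset of features
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))
```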
Boosting
Boosting refers to a dynamic, sequential evaluation approach in which the evaluation criteria are continually updated with feedback from each passing election, so that every new round learns from, and improves on, the previous one.
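As a hedged illustration of boosting in general (not XGBoost itself), the sketch below uses scikit-learn's AdaBoostClassifier on synthetic data: each new learner gives more weight to the examples its predecessors misclassified.

```python
# Boosting sketch: learners are added sequentially, each paying more
# attention to the examples the previous ones got wrong (AdaBoost here
# is just a simple stand-in for the general boosting idea).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data

booster = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
booster.fit(X, y)
print(booster.score(X, y))
```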
Gradient Boosting
Gradient boosting is a special case of boosting in which the error is minimized by following a gradient. To continue the analogy, consider the consequential early contest in Iowa. The state is small and more rural than the American average, yet it is where both Democrats and Republicans begin choosing their presidential candidate, helping the front-runner gain momentum further down the race. By weeding out less qualified or less appealing candidates early, each party becomes more confident about which candidate has the best chance of victory; similarly, our algorithm adds each new model to correct the errors left by the previous ones, so the overall error is steadily minimized.
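A minimal sketch of gradient boosting with scikit-learn on synthetic data; the parameter values are illustrative, not tuned.

```python
# Gradient-boosting sketch: each new tree is fit to the residual errors
# (gradients) of the current ensemble, gradually driving the loss down.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data

gbm = GradientBoostingClassifier(
    n_estimators=100,   # number of sequential trees
    learning_rate=0.1,  # how strongly each tree corrects the previous ones
    max_depth=3,
    random_state=0,
)
gbm.fit(X, y)
print(gbm.score(X, y))
```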
XGBoost
You see the X in XGBoost, right? It literally stands for eXtreme Gradient Boosting: basically gradient boosting on steroids, and it earns the "extreme" for a reason. XGBoost reduces error significantly while requiring minimal computing resources and time, thanks to its combination of software and hardware optimization techniques.
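A minimal sketch of training XGBoost itself through its scikit-learn-style wrapper, assuming the xgboost package is installed (pip install xgboost) and using synthetic data; the parameter values are illustrative, not tuned.

```python
# Minimal XGBoost training sketch via the scikit-learn-compatible wrapper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,   # boosting rounds
    learning_rate=0.1,
    max_depth=4,
    n_jobs=-1,          # use all CPU cores for parallel tree construction
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # held-out accuracy
```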
Ensemble Learning
When we talk about XGBoost, we concur it is an approach to ensemble learning. But what exactly is Ensemble Learning and why do we use it?
Ensemble learning can be understood as the process of generating and combining numerous different models, say classifiers or experts, to solve a specific computational intelligence problem. It helps produce better predictive performance: lower error on regression problems and higher accuracy on classification problems. For a long time we assumed that choosing the right single algorithm was the most important task; ensemble learning has shown that gathering models into ensembles can produce an even better outcome.
Ensemble Methods in Machine Learning
The ensemble method is this approach of combining several base models so that one optimal predictive model is obtained. Using ensemble methods such as bagging and random forests, instead of building one single model and hoping it provides the best prediction, we train many decision trees and average their outputs into one final model for prediction. Decision trees are not the only base learners used in ensemble learning, but they are among the most common. The ensemble approach keeps evolving in machine learning, combining numerous techniques and models to produce the optimal predictive model for different use cases.
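As a hedged sketch of combining several base models into one ensemble, the snippet below uses scikit-learn's VotingClassifier with soft voting, which averages the predicted class probabilities of the base models; the choice of base models and the synthetic data are illustrative.

```python
# Ensemble-method sketch: several base models combined by soft voting,
# i.e., averaging their predicted class probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average probabilities instead of taking a hard majority vote
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```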
Why is XGBoost so good?
This outstanding performance of XGBoost can be attributed to various factors. Broadly, they fall into two categories: optimizations at the system or hardware level, and enhancements at the algorithm level.
XGBoost penalizes overfitting with L1 and L2 regularization, i.e., the penalties used in lasso and ridge regression. It can also handle missing values natively, something most other algorithms cannot do; this sparsity awareness helps it outperform them. Besides, the distributed weighted quantile sketch algorithm efficiently proposes candidate split points in the dataset, and the cross-validation built into the library removes the need to explicitly pin down the number of boosting iterations. These approaches help XGBoost stand out algorithmically.
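The sketch below, assuming synthetic data and illustrative parameter values, shows the features just mentioned: L1/L2 regularization via the alpha and lambda parameters, native handling of missing values (NaN), and the built-in cross-validation helper xgb.cv with early stopping.

```python
# Sketch of XGBoost's algorithmic features: L1/L2 regularization, native
# missing-value handling, and built-in cross-validation with early stopping.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # synthetic data
X[::17, 3] = np.nan  # inject missing values; XGBoost learns a default branch for them

dtrain = xgb.DMatrix(X, label=y)  # NaN is treated as "missing" by default
params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "eta": 0.1,      # learning rate
    "alpha": 0.5,    # L1 (lasso-style) regularization on leaf weights
    "lambda": 1.0,   # L2 (ridge-style) regularization on leaf weights
}

# Built-in cross-validation: early stopping picks the number of boosting rounds.
cv_results = xgb.cv(params, dtrain, num_boost_round=500,
                    nfold=5, early_stopping_rounds=20)
print(len(cv_results))  # boosting rounds kept after early stopping
```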
Besides, it isn't the algorithmic level alone that makes XGBoost so attractive. Its modest hardware requirements come from its parallel processing implementation, which significantly improves run time compared with other algorithms. Moreover, its depth-first tree growth, bounded by the max_depth parameter and followed by backward pruning, is highly effective at improving computational performance. Cache awareness and enhancements such as out-of-core computing let it handle data frames too large to fit into memory. This hardware optimization, built into the algorithm itself, makes XGBoost hard to resist for solving machine learning problems.
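A hedged sketch of the system-level knobs described above, using the native xgboost API on synthetic data; nthread, max_depth, and the histogram-based tree method shown here are illustrative settings rather than tuned values.

```python
# Sketch of XGBoost's system-level knobs: parallel split finding (nthread),
# depth-limited growth followed by backward pruning (max_depth), and the
# memory-friendly histogram tree method.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)  # synthetic data
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "nthread": 4,           # parallel split finding across CPU cores
    "max_depth": 6,         # grow trees depth-first to this limit, then prune backward
    "tree_method": "hist",  # histogram-based splits: faster and lighter on memory
}
booster = xgb.train(params, dtrain, num_boost_round=100)
print(booster.predict(dtrain)[:5])  # predicted probabilities for the first few rows
```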
Read the full article in C# Corner: