Effective XGBoost by Matt Harrison

Ever felt overwhelmed by the multitude of boosting models out there? XGBoost, CatBoost, AdaBoost, LightGBM - they all deliver comparable results, and it's hard to pick a clear winner. That's why I decided to focus on just one - XGBoost. Amidst the sea of machine learning resources, I found a book dedicated entirely to XGBoost. It's one of the few books that delve into the intricacies of this powerful model. So, if you're curious about classification and want to explore XGBoost with me, keep reading!

Chapter 1: Introduction and expectations

Chapter 2: The datasets are introduced and prepared for modelling. This process is fairly involved and introduces the reader to pipelines, thereby providing a glimpse into production environments.
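
To give a flavour of what that looks like, here is a minimal, generic sketch of a scikit-learn preprocessing pipeline. This is my own illustration with made-up column names, not code from the book.

```python
# Generic preprocessing pipeline sketch (illustrative column names,
# not the book's actual features).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "years_experience"]    # hypothetical columns
categorical_cols = ["education", "country"]   # hypothetical columns

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# X_train is assumed to be a pandas DataFrame with the columns above.
# X_train_ready = preprocess.fit_transform(X_train)
```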

Chapter 3: This chapter focuses on Exploratory Data Analysis (EDA), arguably the most critical task in the process. It could save you hours of trouble. The reader is introduced to the dataset through beautiful visualizations that reveal the data in all its glory. When done right, you can already identify what the important features are and how they need to be shaped further for your model. The exercises encourage further exploration.

Chapter 4: Tree creation - Using simple synthetic data, the intuition behind separation between labels is explored using various techniques.

Chapter 5: Stumps on real data – The method from the previous chapter is applied to the chosen Kaggle data with a tree depth of 1 (and hence a stump). At this point, one can appreciate that the most significant factor we noted in our EDA is chosen by the model to separate the label. Continuing with this thought, one also notes that as the depth of our tree is increased, we will uncover the next most strongly correlated variable, and so on.

Chapter 6: Model complexity and Hyperparameters – Here, the concepts of underfitting and overfitting (bias/variance, simple/complex) are explored as a continuation of the previous chapter. The depth hyperparameter alone is tuned to show that there is an improvement in model accuracy. The pros and cons of focusing on model accuracy are also clarified with visualizations and suitable explanations.
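
As a rough sketch of the kind of experiment this chapter runs (my own generic code, not the book's), you can sweep the depth and watch training and validation accuracy diverge:

```python
# Sketch: sweep max_depth and compare train vs. validation accuracy.
# X_train, X_val, y_train, y_val are assumed to exist already.
from xgboost import XGBClassifier

for depth in [1, 2, 4, 6, 8]:
    model = XGBClassifier(max_depth=depth, n_estimators=100)
    model.fit(X_train, y_train)
    print(depth,
          round(model.score(X_train, y_train), 3),   # training accuracy
          round(model.score(X_val, y_val), 3))       # validation accuracy
```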

Chapter 7: Tree Hyperparameters – Models with the same single hyperparameter (depth) are revisited, and practical aspects of model validation and testing are explored. The simplest way of exploring a multi-hyperparameter space - a brute-force grid search - is also tried. Cross-validation strategies across training and validation sets are discussed, along with why they are effective in achieving accuracy in the real world. The Yellowbrick library is used to visually contrast the training and validation curves, and we notice how the sweet spot for depth is ascertained.
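
A bare-bones grid search over depth might look like this (illustrative only; the book also uses Yellowbrick's validation-curve plot, which I have not reproduced here):

```python
# Sketch: brute-force grid search over a single hyperparameter.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

grid = GridSearchCV(
    XGBClassifier(n_estimators=100),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6, 8]},
    cv=5,                   # 5-fold cross-validation
    scoring="accuracy",
)
grid.fit(X_train, y_train)  # X_train, y_train assumed to exist
print(grid.best_params_, grid.best_score_)
```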

Chapter 8: Random Forest – The reader is introduced to sklearn's Random Forest and how the trees impact accuracy. Moving on from there, XGBoost's Random Forest classifier is introduced and compared with the earlier sklearn model. While we are introduced to a fair number of XGBoost hyperparameters here, we also learn how they may be tweaked to achieve LightGBM-like behaviour! We again use Yellowbrick, custom code, and dtreeviz to visualize our trees, and we achieve comparable accuracy between our training and validation scores, which is a pretty big deal.
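
If you want to try the comparison yourself, the two estimators share the scikit-learn interface. A generic sketch, not the book's exact settings:

```python
# Sketch: sklearn's random forest vs. XGBoost's random-forest mode.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBRFClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
xgb_rf = XGBRFClassifier(n_estimators=100, random_state=42)

for name, model in [("sklearn RF", rf), ("XGBoost RF", xgb_rf)]:
    model.fit(X_train, y_train)          # X_train, y_train assumed to exist
    print(name, model.score(X_val, y_val))
```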

Chapter 9: XGBoost – We progress from the random forest concept to extreme gradient boosting. Here, we continue our journey by learning about prediction probabilities. After all, there is no such thing as 100% probability (i.e., certainty) in the real world. At this point, we are still looking at only two hyperparameters and trying to understand how the splits happen.
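
The probability side of this is easy to poke at yourself; a minimal sketch (my own, not the book's code):

```python
# Sketch: class probabilities from a small XGBoost model.
from xgboost import XGBClassifier

model = XGBClassifier(max_depth=2, n_estimators=50)
model.fit(X_train, y_train)              # X_train, y_train assumed to exist

proba = model.predict_proba(X_val)[:5]   # first five rows of the validation set
print(proba)     # each row: [P(class 0), P(class 1)], summing to 1
```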

Chapter 10: Early Stopping – In machine learning, techniques like early stopping mean that you do not waste computing resources and time. We explore ways to visualize the impact of the early stopping hyperparameter, which limits the number of trees, and develop an intuition for choosing the optimal value. We are also introduced to evaluation metrics and understand how they work in conjunction with early stopping.
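
A hedged sketch of the idea (note that the place the early_stopping_rounds argument goes has moved between XGBoost versions; recent releases take it in the constructor):

```python
# Sketch: stop adding trees once the validation metric stops improving.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=1000,             # upper bound on the number of trees
    early_stopping_rounds=20,      # stop after 20 rounds with no improvement
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)        # boosting round with the best validation score
```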

Chapter 11: This chapter explores the salient hyperparameters of XGBoost, and the learning rate hyperparameter in particular. The consistency that we can achieve with k-fold cross-validation, which was discussed briefly in chapter 6, becomes clearer with the breakdown of the individual fold scores on our dataset.
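
Seeing the per-fold scores takes only a line or two with scikit-learn (generic sketch):

```python
# Sketch: k-fold cross-validation scores for one learning-rate setting.
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

model = XGBClassifier(learning_rate=0.1, n_estimators=200, max_depth=4)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())   # overall level and consistency
```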

Chapter 12: Hyperopt – Until now we have been using GridSearchCV, and here we are introduced to the powerful (and complex) Hyperopt library. I have struggled with this library in the past, and I came away from this chapter with a better understanding of its usage and visualization techniques. I especially enjoyed the Bayesian search visualization, which clearly showed the search focusing in on the best loss.
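
For readers new to Hyperopt, the basic pattern is short. This is a generic sketch with a made-up search space, not the book's:

```python
# Sketch: TPE (Bayesian-style) search over two XGBoost hyperparameters.
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

space = {
    "max_depth": hp.quniform("max_depth", 2, 8, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    model = XGBClassifier(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        n_estimators=200,
    )
    score = cross_val_score(model, X_train, y_train, cv=3).mean()
    return -score          # hyperopt minimises, so negate the accuracy

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)
```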

Chapter 13: Step-wise tuning with Hyperopt – The search for optimal hyperparameters is a long and painful one, and here's where step-wise tuning comes in handy. We make logical groups of hyperparameters, explore them individually, save the results, and bring them together. This can obviously save enormous amounts of time, since we explore only one group of parameters at a time.

Chapter 14: Do you have enough data? – You should never ask a data scientist this question. In this chapter, learning curves are explored in detail and appropriate interpretations are made.
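
scikit-learn's learning_curve makes the "do I have enough data?" question concrete. A minimal sketch of my own:

```python
# Sketch: score the model on growing fractions of the training data.
import numpy as np
from sklearn.model_selection import learning_curve
from xgboost import XGBClassifier

sizes, train_scores, val_scores = learning_curve(
    XGBClassifier(n_estimators=100, max_depth=4),
    X_train, y_train,                      # assumed to exist
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)
print(sizes)                       # absolute training-set sizes used
print(val_scores.mean(axis=1))     # mean validation score per size
```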

Chapter 15: Model evaluation – Accuracy is the default metric reported by classification models, and obviously this does not cover all needs. Various evaluation metrics are discussed (confusion matrix, precision, recall, F1), along with visualizations such as the ROC curve, cumulative gains curve, and lift curve, to relate them to real-world problems.
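
The core metrics are only a few imports away (generic sketch; model, X_val, y_val are assumed to exist from earlier):

```python
# Sketch: common classification metrics beyond accuracy.
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)

preds = model.predict(X_val)
proba = model.predict_proba(X_val)[:, 1]

print(confusion_matrix(y_val, preds))
print(classification_report(y_val, preds))   # precision, recall, F1 per class
print(roc_auc_score(y_val, proba))
```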

Chapter 16: Training for different metrics – In the real world, one has to optimize the metric that solves the business problem, while a model's default objective may not line up with that metric. The reader learns how to train and tune for a particular metric to address the business need.
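
In practice that often comes down to telling the search which metric to optimise. An illustrative sketch (my own, not necessarily the book's approach):

```python
# Sketch: optimise F1 instead of accuracy during the hyperparameter search.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

grid = GridSearchCV(
    XGBClassifier(n_estimators=200),
    param_grid={"max_depth": [3, 5, 7]},
    scoring="f1",        # pick the metric the business actually cares about
    cv=5,
)
grid.fit(X_train, y_train)   # X_train, y_train assumed to exist
```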

Chapter 17: Model Interpretation – Logistic regression, a decision tree, and XGBoost are used to rank the importance of features, and these rankings are compared. The interesting technique of surrogate models is used to interpret the XGBoost predictions.
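
Each of those models exposes its importances slightly differently; a rough sketch of the comparison (not the book's exact code):

```python
# Sketch: three views of feature importance side by side.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
xgb = XGBClassifier(n_estimators=200).fit(X_train, y_train)

# Note: logistic-regression coefficients are only comparable if features
# are on similar scales.
importances = pd.DataFrame({
    "logreg_coef": abs(logreg.coef_[0]),
    "tree_importance": tree.feature_importances_,
    "xgb_importance": xgb.feature_importances_,
}, index=X_train.columns)          # X_train assumed to be a DataFrame
print(importances.sort_values("xgb_importance", ascending=False))
```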

Chapter 18: xgbfir – The xgbfir library is used to understand the interactions between the various features in the model. After due exploration, we create a model with selected features alone to compare the performance.

Chapter 19: Exploring SHAP – This library is a must in your toolkit for interpreting models (especially black-box models like XGBoost). We develop an understanding of this powerful tool and visualize with and without feature interactions.
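
The SHAP workflow for a tree model is only a few lines (minimal sketch; the fitted model and X_val are assumed from earlier, and the column name in the commented line is hypothetical):

```python
# Sketch: SHAP values for an XGBoost model.
import shap

explainer = shap.TreeExplainer(model)        # fitted XGBoost model
shap_values = explainer.shap_values(X_val)

shap.summary_plot(shap_values, X_val)        # global view: which features matter
# shap.dependence_plot("age", shap_values, X_val)   # hypothetical column name
```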

Chapter 20: Better models with ICE, PDP, monotonic constraints and calibration - We note the monotonic and non-monotonic behaviour of features, and use these observations to add constraints to our model, noting the improvement in model accuracy. We also calibrate the model using the sigmoid and isotonic methods and examine the calibration curves. Interpreting these curves helps us ascertain the reliability of the model's predictions.
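
Both ideas are essentially single parameters in code. A generic sketch, where the constraint signs and feature count are purely illustrative:

```python
# Sketch: monotonic constraints plus probability calibration.
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

# One constraint per feature: 1 = increasing, -1 = decreasing, 0 = none.
# The string below assumes a four-feature dataset and is illustrative only.
model = XGBClassifier(n_estimators=200, monotone_constraints="(1,0,-1,0)")
model.fit(X_train, y_train)            # X_train, y_train assumed to exist

# Wrap the model to calibrate its probabilities ("sigmoid" or "isotonic").
calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
```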

Chapter 21: Serving models with MLflow – If you have been following along with the book until now, you will realize that you will be hard-pressed to compare the different models. This is a common problem and can become a nightmare in production. MLflow helps to track, compare, and quickly deploy the model of choice in production. Here, this library is explored - and of course, Docker!
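
The tracking side of MLflow is just a few calls. A hedged sketch; the run name, parameter names, and metric value are examples, not the book's:

```python
# Sketch: log a run's parameters, metric, and model artifact with MLflow.
import mlflow
import mlflow.xgboost

with mlflow.start_run(run_name="xgb_depth4"):
    mlflow.log_param("max_depth", 4)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("val_accuracy", 0.87)     # illustrative value
    mlflow.xgboost.log_model(model, "model")    # fitted model from earlier
```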

At this point, you should have all the tools necessary to start working on real-world classification problems. Congratulations!

Who is this book for? Familiarity with Python is needed. Some exposure to general machine learning concepts would help, but it is not a must since detailed explanations are provided. The author's experience in training shines through, and this book is beautifully annotated with visualizations and explanations that help the reader get into the mind of a master. Compare this to watching a World Championship chess game with commentary from world-class grandmasters: we get a peek into how a problem is dissected by a professional. If you have basic ML knowledge and work through the concepts outlined in the book (along with the exercises), you can expect to become proficient and confident in classification and boosting.

While the author titles this book Effective XGBoost, it could easily be called effective classification, since XGBoost is only the tool of choice while classification is explored deeply and extensively. To me, this is a book to come back to and relish again.

Book code: https://github.com/mattharrison/effective_xgboost_book

Did this review help you? How could I make this better? Please let me know!

Andrew Howe

Price and profit optimization. Protection gap filling.

11 months ago

It really did help - thanks.
