Understanding CatBoost!
Damien Benveniste, PhD
Founder @ TheAiEdge | Follow me to learn about Machine Learning Engineering, Machine Learning System Design, MLOps, and the latest techniques and news about the field.
CatBoost might be the easiest supervised learning algorithm to use today on large tabular data. It is highly parallelizable, it automatically deals with missing values and categorical variables, and, even more than XGBoost, it is built to prevent overfitting. Throw some data at it and, without much work, you are likely to get near state-of-the-art results. This assumes your data is training-ready, but even then, it is almost too good to be true!
I made a previous video explaining how the gradient boosting algorithm works and another explaining how XGBoost works. CatBoost builds on top of those two algorithms, so make sure to check them out. Now, let's dig into how CatBoost works!
CatBoost was developed by Yandex in 2017: CatBoost: unbiased boosting with categorical features. They realized that the boosting process induces a special case of data leakage. To prevent it, they developed two new techniques: expanding mean target encoding and ordered boosting.
Target encoding is a technique to convert a categorical variable into a numerical one. It lets you build a feature with a simple, easy-to-learn relationship to the target variable. I always found Leave-One-Out (LOO) target encoding to be a powerful yet simple method: each row is encoded with the mean target of its category, computed over every other row. It is also a very risky method: it is so easy to overfit! The trick is to smooth the encoding by taking a weighted mean of the category mean and the global mean and cross-validating the weighting parameter: such a pain when you need to do that for every categorical feature! It is also tricky to apply to the test set without data leakage. One advantage, though, is that you effectively linearize a variable that originally had a highly non-linear relationship to the target. You can read more about it here: “Getting Deeper into Categorical Encodings for Machine Learning”. Here is the category_encoders Python package: “Leave One Out”.
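As a minimal sketch of that smoothed LOO encoding (assuming a pandas DataFrame df with a categorical column col and a numeric 'target' column; loo_encode and the weight w are names I made up):

import pandas as pd

def loo_encode(df, col, target='target', w=10.0):
    # Per-category sum and count of the target, broadcast back to each row.
    sums = df.groupby(col)[target].transform('sum')
    counts = df.groupby(col)[target].transform('count')
    # Drop the current row's own target, then blend with the global mean;
    # w is the smoothing weight you would cross-validate.
    return (sums - df[target] + w * df[target].mean()) / (counts - 1 + w)

With w = 0 this reduces to plain LOO (and breaks on single-row categories); a larger w pulls rare categories toward the global mean.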
Another powerful technique is Expanding Mean Target Encoding. The idea is very similar to LOO, but instead of averaging over all the other rows, we take a running (cumulative) average of the target over the rows seen so far, always excluding the current row's own target. When predicting on unseen data, we just use the full mean encoding of the categorical value.
It is the method used by CatBoost. In the category_encoders package, they call it the "CatBoost encoder". It overfits far less, requires no hyperparameter to cross-validate, and takes three lines of code. Such a simple method!
cumsum = df.groupby(col)['target'].cumsum() - df['target']  # sum of the previous targets in the same category
cumcount = df.groupby(col)['target'].cumcount()              # number of previous rows in the same category
df[col + '_encoded'] = cumsum / cumcount                     # NaN for a category's first row (0/0)
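At prediction time there is no ordering to respect, so we use the full training mean per category. A sketch, reusing the df and col names above (test_df is a hypothetical hold-out frame):

category_means = df.groupby(col)['target'].mean()  # full per-category training means
global_mean = df['target'].mean()
test_df[col + '_encoded'] = test_df[col].map(category_means).fillna(global_mean)  # unseen categories fall back to the global mean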
In CatBoost itself, each tree is trained with a different permutation of the training data.
The target encoding is computed using only the rows that appear before the current row in that specific permutation. Each permutation therefore generates a different encoding, which reduces overfitting.
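To make that concrete, here is a sketch of generating one permutation's encoding, again reusing the df and col names from the snippet above (the '_perm0' column name is mine):

shuffled = df.sample(frac=1, random_state=0)  # one random permutation of the rows
cumsum = shuffled.groupby(col)['target'].cumsum() - shuffled['target']
cumcount = shuffled.groupby(col)['target'].cumcount()
df[col + '_perm0'] = cumsum / cumcount  # pandas realigns the values to the original row order by index

Repeating this with a different random_state per tree gives each tree its own encoding of the same categorical feature.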
When we learn the trees, we need to compute gradients and Hessians from the samples in each tree node.
Ordered boosting avoids computing a sample's residuals (gradients and Hessians) with a model that was trained on that sample: only samples that appear earlier in the specific permutation of the training data are used.
This means that each sample yields a different gradient and Hessian in each permutation of the training data, leading to trees that are more robust to overfitting.
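CatBoost maintains supporting models per permutation to do this efficiently. The toy sketch below is my own simplification, not CatBoost's actual code: for a squared-error loss, each sample's residual comes from a prediction that uses only the samples preceding it in the permutation, with a running mean of earlier targets standing in for a model trained on that prefix.

import numpy as np

def ordered_residuals(y, perm):
    # Prediction for the i-th sample in the permutation uses only
    # samples 0..i-1 of that permutation, never the sample itself.
    y_perm = y[perm].astype(float)
    prefix_sum = np.concatenate(([0.0], np.cumsum(y_perm)[:-1]))
    prefix_count = np.maximum(np.arange(len(y)), 1)  # avoid 0/0 on the first sample
    residuals = np.empty(len(y))
    residuals[perm] = y_perm - prefix_sum / prefix_count  # negative gradient of squared error
    return residuals

y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
perm = np.random.default_rng(0).permutation(len(y))
print(ordered_residuals(y, perm))

Averaging such residuals over several permutations gives each sample a gradient estimate that never depends on its own target.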