Understanding CatBoost!
Damien Benveniste, PhD
Founder @ TheAiEdge | Follow me to learn about Machine Learning Engineering, Machine Learning System Design, MLOps, and the latest techniques and news about the field.
CatBoost might be the easiest supervised learning algorithm to use today on large tabular data. It is highly parallelizable, it automatically deals with missing values and categorical variables, and, even more than XGBoost, it is built to prevent overfitting. Throw some data at it and, without much work, you are likely to get near state-of-the-art results. This assumes your data is training-ready, but even then, it is almost too good to be true!
I made a previous video explaining how the gradient boosting algorithm works and another explaining how XGBoost works. CatBoost builds on top of those two algorithms, so make sure to check them out. Now, let's dig into how CatBoost works!
CatBoost was developed by Yandex in 2017: CatBoost: unbiased boosting with categorical features. They realized that the boosting process induces a special case of data leakage. To prevent it, they developed two new techniques: expanding mean target encoding and ordered boosting.
Target encoding is a technique to convert a categorical variable into a numerical one. It lets you build a feature with a simple, easy-to-learn relationship to the target variable. I always found Leave-One-Out (LOO) target encoding to be a powerful yet simple method: each row is encoded with the mean target of its category, computed over every other row. It is also a very risky method: it is so easy to overfit! The trick is to smooth the encoding by taking a weighted mean of the category mean and the global mean and cross-validating the weighting parameter: such a pain when you need to do that for every categorical feature! It is also tricky to apply to the test set without data leakage. One advantage, though, is that you effectively linearize a variable that originally had a highly non-linear relationship to the target. You can read more about it here: “Getting Deeper into Categorical Encodings for Machine Learning”. Here is the category_encoders Python package: “Leave One Out”.
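As a minimal sketch of that smoothed LOO encoding (assuming a pandas DataFrame df with a categorical column col and a numeric 'target' column; loo_encode and the weight w are names I made up):

import pandas as pd

def loo_encode(df, col, target='target', w=10.0):
    # Per-category sum and count of the target, broadcast back to each row.
    sums = df.groupby(col)[target].transform('sum')
    counts = df.groupby(col)[target].transform('count')
    # Drop the current row's own target, then blend with the global mean;
    # w is the smoothing weight you would cross-validate.
    return (sums - df[target] + w * df[target].mean()) / (counts - 1 + w)

With w = 0 this reduces to plain LOO (and breaks on single-row categories); a larger w pulls rare categories toward the global mean.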
Another powerful technique is Expanding Mean Target Encoding. The idea is very similar to LOO, but instead of averaging over all the other rows, we take a running (cumulative) average of the target over the rows seen so far, always excluding the current row's own target. When predicting on unseen data, we just use the full mean encoding of the categorical value.
It is the method used by CatBoost. In the category_encoders package, they call it the "CatBoost encoder". It overfits far less, requires no hyperparameter to cross-validate, and takes three lines of code. Such a simple method!
cumsum = df.groupby(col)['target'].cumsum() - df['target']  # sum of the previous targets in the same category
cumcount = df.groupby(col)['target'].cumcount()              # number of previous rows in the same category
df[col + '_encoded'] = cumsum / cumcount                     # NaN for a category's first row (0/0)
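At prediction time there is no ordering to respect, so we use the full training mean per category. A sketch, reusing the df and col names above (test_df is a hypothetical hold-out frame):

category_means = df.groupby(col)['target'].mean()  # full per-category training means
global_mean = df['target'].mean()
test_df[col + '_encoded'] = test_df[col].map(category_means).fillna(global_mean)  # unseen categories fall back to the global mean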
In CatBoost itself, each tree is trained with a different permutation of the training data.
The target encoding is computed using only the rows that appear before the current row in that specific permutation. Each permutation therefore generates a different encoding, which reduces overfitting.
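To make that concrete, here is a sketch of generating one permutation's encoding, again reusing the df and col names from the snippet above (the '_perm0' column name is mine):

shuffled = df.sample(frac=1, random_state=0)  # one random permutation of the rows
cumsum = shuffled.groupby(col)['target'].cumsum() - shuffled['target']
cumcount = shuffled.groupby(col)['target'].cumcount()
df[col + '_perm0'] = cumsum / cumcount  # pandas realigns the values to the original row order by index

Repeating this with a different random_state per tree gives each tree its own encoding of the same categorical feature.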
When we learn the trees, we need to compute gradients and Hessians from the samples in each tree node.
Ordered boosting avoids computing a sample's residuals (gradients and Hessians) with a model that was trained on that sample: only samples that appear earlier in the specific permutation of the training data are used.
This means that each sample yields a different gradient and Hessian in each permutation of the training data, leading to trees that are more robust to overfitting.
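CatBoost maintains supporting models per permutation to do this efficiently. The toy sketch below is my own simplification, not CatBoost's actual code: for a squared-error loss, each sample's residual comes from a prediction that uses only the samples preceding it in the permutation, with a running mean of earlier targets standing in for a model trained on that prefix.

import numpy as np

def ordered_residuals(y, perm):
    # Prediction for the i-th sample in the permutation uses only
    # samples 0..i-1 of that permutation, never the sample itself.
    y_perm = y[perm].astype(float)
    prefix_sum = np.concatenate(([0.0], np.cumsum(y_perm)[:-1]))
    prefix_count = np.maximum(np.arange(len(y)), 1)  # avoid 0/0 on the first sample
    residuals = np.empty(len(y))
    residuals[perm] = y_perm - prefix_sum / prefix_count  # negative gradient of squared error
    return residuals

y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
perm = np.random.default_rng(0).permutation(len(y))
print(ordered_residuals(y, perm))

Averaging such residuals over several permutations gives each sample a gradient estimate that never depends on its own target.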