Common XGBoost Mistakes to Avoid
Indrajit S.
Senior Data Scientist @ Citi | GenAI | Kaggle Competition Expert | PhD Research Scholar in Data Science
Using Default Hyperparameters
- Why Wrong: Different datasets need different settings
- Fix: Always tune learning_rate, max_depth, min_child_weight based on your data size and complexity
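A minimal tuning sketch using scikit-learn's RandomizedSearchCV with the XGBClassifier wrapper; the synthetic data, parameter ranges, and iteration budget are illustrative assumptions, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Synthetic data stands in for your dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Illustrative search space -- adjust ranges to your data size and complexity
param_dist = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 4, 6, 8],
    "min_child_weight": [1, 3, 5, 10],
}

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=300, tree_method="hist", random_state=42),
    param_distributions=param_dist,
    n_iter=20,            # number of random combinations to try
    scoring="roc_auc",
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```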
Not Handling Class Imbalance
- Why Wrong: Leads to biased models favoring majority class
- Fix: Use scale_pos_weight for binary problems, or per-class sample_weight in fit() for multiclass
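A short sketch of the common scale_pos_weight heuristic (number of negatives divided by number of positives) on an assumed imbalanced binary dataset; the metric choice is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Imbalanced toy data: roughly 5% positives
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=42)

# Heuristic: scale_pos_weight = (# negatives) / (# positives)
neg, pos = np.bincount(y)
model = XGBClassifier(
    scale_pos_weight=neg / pos,   # up-weights the minority (positive) class
    eval_metric="aucpr",          # PR-AUC is more informative under imbalance
    random_state=42,
)
model.fit(X, y)
```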
Ignoring Feature Importance
- Why Wrong: Redundant/noisy features hurt performance
- Fix: Use feature_importances_ to remove low-impact features
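A rough sketch of importance-based pruning via feature_importances_; the 0.01 cutoff is an arbitrary assumption and should be validated with cross-validation before dropping anything.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=42)

model = XGBClassifier(n_estimators=200, random_state=42).fit(X, y)

# Drop features whose importance falls below a chosen threshold
importances = model.feature_importances_
keep = importances > 0.01          # threshold is an assumption; confirm with CV
X_reduced = X[:, keep]
print(f"kept {keep.sum()} of {X.shape[1]} features")
```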
Overfitting with Deep Trees
- Why Wrong: Deep trees memorize training data
- Fix: Limit max_depth (3-10), use early stopping
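A minimal sketch combining a shallow max_depth with early stopping on a held-out validation set; the depth, learning rate, and patience values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Shallow trees plus early stopping guard against memorizing the training set
model = XGBClassifier(
    max_depth=4,                 # within the commonly suggested 3-10 range
    n_estimators=1000,           # upper bound; early stopping picks the real count
    learning_rate=0.05,
    early_stopping_rounds=50,    # constructor arg in xgboost >= 1.6; pass to fit() on older versions
    eval_metric="logloss",
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("best iteration:", model.best_iteration)
```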
Wrong Evaluation Metric
- Why Wrong: Default metrics may not match business goals
- Fix: Choose appropriate eval_metric (auc, error, rmse)
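A quick sketch of setting eval_metric explicitly and monitoring it on a validation set; AUC is used here only as an example of matching the metric to the actual goal.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# eval_metric should reflect the business goal, not just the library default
model = XGBClassifier(eval_metric="auc")   # e.g. ranking quality; "aucpr" under heavy imbalance
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(model.evals_result()["validation_0"]["auc"][-1])
```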
Not Scaling Features
- Why Wrong: Tree splits are scale-invariant, so scaling rarely changes accuracy, but extreme outlier values can still hurt interpretability and any scale-sensitive steps in the pipeline (e.g., the gblinear booster)
- Fix: Clip or transform extreme values; RobustScaler handles outliers better than StandardScaler
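A minimal sketch of optional scaling inside a scikit-learn pipeline; RobustScaler is assumed here because it is less sensitive to outliers than StandardScaler.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Scaling is optional for tree boosters, but RobustScaler tames extreme outliers
# (and becomes important if you switch to the scale-sensitive gblinear booster)
pipe = make_pipeline(RobustScaler(), XGBClassifier(random_state=42))
pipe.fit(X, y)
```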
Insufficient Cross-Validation
- Why Wrong: Single train-test split may give unreliable results
- Fix: Use k-fold CV with appropriate stratification
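A short sketch of stratified 5-fold cross-validation with cross_val_score; the imbalanced toy data and ROC-AUC scoring are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)

# Stratification keeps the class ratio stable in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(XGBClassifier(random_state=42), X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())   # report the spread, not a single split's number
```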
Memory Issues with Large Datasets
- Why Wrong: The exact split-finding algorithm enumerates every candidate split, which is memory- and compute-intensive on large data
- Fix: Set tree_method='hist' (the default in recent XGBoost releases) and lower max_bin
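A minimal sketch of the histogram tree method with a reduced max_bin; the bin count shown is illustrative and trades a little split precision for memory.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=100000, n_features=50, random_state=42)

# Histogram-based split finding buckets feature values into bins,
# cutting memory use and training time on large datasets
model = XGBClassifier(
    tree_method="hist",   # the default in recent XGBoost releases
    max_bin=128,          # default is 256; lower it further if memory is tight
    random_state=42,
)
model.fit(X, y)
```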
#DataScience #XGBoost #MachineLearning