What Is XGBoost and How Does It Actually Work Internally?

Indrajit S.

Senior Data Scientist @ Citi | GenAI | Kaggle Competition Expert | PHD research scholar in Data Science

Published: July 26, 2020

What Is XGBoost?

It is an efficient implementation of gradient boosting (GB). I don't think it contains any new mathematical breakthrough; it is a well-engineered version of GB designed to make good use of multiple CPU cores and hardware caches.

It is a penalized gradient boosting method with a variety of base learners.
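To make "penalized" concrete: for tree base learners, the XGBoost documentation describes a regularized objective of roughly the following form. The symbols here are introduced for illustration and do not appear in the text above: l is the training loss, f_k is the k-th tree, T is its number of leaves, w_j are its leaf weights, and gamma and lambda are penalty strengths.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}
```

The first sum rewards fitting the data; the second penalizes trees with many leaves or large leaf weights, which is what distinguishes this from unpenalized gradient boosting.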

XGBoost, short for (extreme) gradient boosting, is a fast, portable, and distributed implementation of the gradient boosting (trees) algorithm.

What is gradient boosting? => It is an ensemble (i.e. meta) machine learning algorithm that builds a strong model from many weaker ones, added sequentially. Each new weak model is fit to the negative gradient of the loss with respect to the current ensemble's predictions, which amounts to gradient descent in function space. When the weak models are decision trees, the method is called gradient boosted trees.
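The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not XGBoost's implementation: it assumes squared-error loss (so the negative gradient is just the residual), a single numeric feature, and depth-1 trees (stumps). The function names and the toy data are made up for the example.

```python
# Minimal gradient boosting with decision stumps and squared-error loss.
# Illustrative only: XGBoost adds regularization, second-order gradient
# information, and many systems-level optimizations that are not shown here.

def fit_stump(xs, targets):
    """Fit a one-split regression tree on a 1-D feature by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predictor built by adding stumps one at a time."""
    base = sum(ys) / len(ys)          # initial constant model
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - pred)^2, the negative gradient with
        # respect to pred is the residual y - pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction the new weak learner suggests.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (made up): a noisy, roughly linear relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8, 8.1]

weak = gradient_boost(xs, ys, n_rounds=1)
strong = gradient_boost(xs, ys, n_rounds=200)
print("training MSE after 1 round:   ", mse(weak, xs, ys))
print("training MSE after 200 rounds:", mse(strong, xs, ys))
```

A single stump is a poor model on its own, but the training error keeps falling as more stumps are added, which is the "strong model from many weak ones" behaviour described above.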

Why is it fast? => The library is written in C++ and offers many useful wrappers in higher-level languages (Python and R for instance).

What does portable and distributed mean in this context? Portable means that it can be easily used in different settings/architectures. Distributed means that it can run on multiple cores (single machine) and/or on multiple servers (in a cluster).

If you want to understand some of the theory behind boosted trees, I refer you to the excellent introduction on the official website.

Now for the second part of the question: the best way to optimize its (hyper)parameters.

As with many real-world applications, the answer is invariably “it depends”.

In fact, it depends on the size of the dataset at hand, the metric to optimize, the features used, the computational power available, and many more things.

That being said, here is a systematic strategy that you can adapt (that I often use myself):

Start with “good candidates” for the problem you are trying to solve. By good candidates, I mean standard hyperparameter values that experts have been using and that are known to yield good results. This is enough in a lot of cases. For a table of “good candidates”, I refer you to the table in this great article.

If the performance is still not acceptable by your standards, try random search and/or grid search. This type of approach is highly parallelizable (such problems are sometimes called embarrassingly parallel), and you are only limited by the number of nodes available in your computing cluster.

Finally, you can get clever by trying one of the many Bayesian optimization techniques. If you are a Python user, I recommend trying hyperopt and/or skopt. Other similar libraries exist, of course.

Alternatively, you can decide that you aren’t good at this and look for an “HPOaaS” solution (short for hyperparameter optimization as a service). SigOpt and/or DataRobot are interesting ones to try.
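As a rough sketch of the random-search step in the strategy above, the loop below samples hyperparameters independently on each trial and keeps the best result. Everything here is an assumption made for illustration: the search ranges are not recommendations, and `validation_loss` is a toy stand-in for "train a model with these parameters and score it on held-out data". Only the parameter names (`learning_rate`, `max_depth`, `subsample`) correspond to real XGBoost settings.

```python
import math
import random

# Hypothetical search space: each entry draws one value from a random generator.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
    "max_depth": lambda rng: rng.randint(2, 10),             # integer, inclusive bounds
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
}


def validation_loss(params):
    """Toy objective standing in for a real train-and-validate run.

    In practice this would fit a model with `params` and return a
    validation metric; here it is an arbitrary smooth function.
    """
    return ((math.log10(params["learning_rate"]) + 1) ** 2
            + (params["max_depth"] - 6) ** 2 / 16
            + (params["subsample"] - 0.8) ** 2)


def random_search(n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in SEARCH_SPACE.items()}
        loss = validation_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_trials=100)
print(best_params, best_loss)
```

Because no trial depends on any other, the loop body can be farmed out to as many workers as are available, which is why this approach parallelizes so easily. Bayesian optimization libraries replace the independent sampling with a model that proposes the next configuration based on earlier results.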
