Powering Predictive Precision: XGBoost and LightGBM
https://taylorwells.com.au/price-prediction-algorithm/

In the ever-evolving landscape of machine learning and data science, the arsenal of tools available to data scientists and analysts continues to expand. Among these, two standout algorithms, XGBoost and LightGBM, have gained immense popularity and have proven their mettle in various predictive tasks. Notably, they have been instrumental in revolutionizing the prediction of car prices. This article delves into these robust algorithms and explores how they have become exceptionally effective in predicting car prices, showcasing their significance in the field of machine learning.


Click here to access the Dashboard

Click here to access the EDA + Analysis in Python

Click here to see the ML source code


XGBoost and LightGBM: A Brief Overview

Before we delve into their role in car price prediction, let's briefly introduce both XGBoost and LightGBM.

XGBoost (Extreme Gradient Boosting):

XGBoost is a powerful gradient boosting framework designed for optimizing machine learning tasks. It is renowned for its efficiency and effectiveness on structured data. Key strengths include native handling of missing data, built-in regularization, and parallel processing, which make it highly versatile for a wide range of applications.
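To make those strengths concrete, here is a minimal, illustrative sketch (not part of the original walkthrough; the tiny dataset and parameter values are made up). XGBoost's scikit-learn wrapper accepts missing values natively, exposes L1/L2 regularization, and builds trees in parallel:

import numpy as np
import xgboost as xgb

# Tiny illustrative dataset containing a missing value; XGBoost handles NaNs natively
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

demo_model = xgb.XGBRegressor(
    n_estimators=50,
    reg_alpha=0.1,   # L1 regularization
    reg_lambda=1.0,  # L2 regularization
    n_jobs=-1        # parallel tree construction across all CPU cores
)
demo_model.fit(X, y)
print(demo_model.predict(X))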

LightGBM:

LightGBM, or Light Gradient Boosting Machine, is an open-source, distributed, high-performance gradient boosting framework. Developed by Microsoft, LightGBM is known for its speed and efficiency, making it an excellent choice for large datasets. It utilizes histogram-based learning, which helps accelerate the training process while preserving accuracy.
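For comparison, here is a minimal LightGBM sketch (again purely illustrative, with synthetic data; max_bin is the knob behind histogram-based learning, controlling how finely continuous features are bucketed):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

demo_lgbm = lgb.LGBMRegressor(
    n_estimators=100,
    num_leaves=31,   # leaf-wise growth is capped by leaves rather than depth
    max_bin=255,     # number of histogram bins per feature
    n_jobs=-1
)
demo_lgbm.fit(X, y)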

Car Price Prediction: A Complex Challenge

Predicting car prices is a complex task due to the myriad of variables involved. Factors such as make, model, age, mileage, condition, and even the current market trends all play a significant role in determining a car's worth. Traditional regression models often struggle with this complexity. However, XGBoost and LightGBM have emerged as robust contenders in this domain.

Steps:

The data preprocessing has already been done; the process can be accessed through the links above.

Important Libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# Suppress noisy deprecation/future warnings during experimentation
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

import xgboost as xgb

Feature Selection

## Feature Selection
correlation_matrix = df.select_dtypes(include='number').corr()

threshold = 0.8
corr_features = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            corr_features.add(colname)

df.drop(corr_features, axis=1, inplace=True)

# Visualize the correlation matrix 
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()        

The goal of this code is to remove features that exhibit a high correlation (above a specified threshold of 0.8) with other features. First, it calculates the correlation matrix for numerical columns in the DataFrame using Pearson correlation coefficients. The corr() function computes this matrix.

Next, it iterates through the correlation matrix, identifying pairs of features with a correlation magnitude greater than the threshold. For each pair exceeding the threshold, it adds the feature with higher index (column) to a set called corr_features. This set will contain features that need to be removed due to high correlation with other features.

After identifying the correlated features, it drops them from the DataFrame using df.drop() along the columns (axis=1).

Finally, the code visualizes the correlation matrix (computed before the drop) as a heatmap using seaborn and matplotlib. The heatmap color-codes the correlation coefficients, providing a visual representation of the relationships between features. This visualization helps in understanding the strength and direction of the correlations, aiding interpretation of the correlation-based feature selection.


A correlation matrix is a table in which every cell contains the correlation coefficient between a pair of variables in a data set: +1 indicates a perfect positive linear relationship, 0 indicates no linear relationship, and -1 indicates a perfect negative linear relationship. It is most commonly used when building regression models.

It is recommended to avoid highly correlated columns, as they add redundancy and complexity to the model without contributing much additional information.

Label Encoding

Label encoding is a technique used to convert categorical data into numerical format, making it compatible with machine learning algorithms. In this process, each unique category in a categorical feature is assigned a unique numerical label. For instance, categories like 'red', 'green', and 'blue' might be assigned the labels 0, 1, and 2, respectively. However, it's important to note that label encoding introduces an ordinal relationship between categories, which can mislead the model. For categories without a natural ordering, techniques like one-hot encoding represent them without implying any hierarchy. Label encoding is quick and simple, making it a commonly used preprocessing step in data preparation for machine learning tasks. The following code is used to encode the dataset.

# Encode each categorical column as integer labels
# (the same encoder object is refit for every column)
le = LabelEncoder()
df['city'] = le.fit_transform(df['city'])
df['make'] = le.fit_transform(df['make'])
df['transmission'] = le.fit_transform(df['transmission'])
df['registered'] = le.fit_transform(df['registered'])
df['registration status'] = le.fit_transform(df['registration status'])
df['fuel'] = le.fit_transform(df['fuel'])
df['color'] = le.fit_transform(df['color'])
df['model'] = df['model'].astype(str)
df['model'] = le.fit_transform(df['model'])
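For the unordered categories mentioned above (such as color or fuel), a one-hot alternative can be sketched with pandas. This is only an option to consider, not a step the original pipeline performs, and the column names are simply the ones used in this dataset:

# One-hot encode selected categorical columns instead of label encoding them
df_onehot = pd.get_dummies(df, columns=['color', 'fuel'], drop_first=True)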

Data Split and Normalize

# Separate features and target
x = df.drop(['price'], axis=1)
y = df['price']

# Features to scale (everything except 'year' and the target)
X_norm = df.drop(['year', 'price'], axis=1)

## normalize the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(X_norm)
X_norm = pd.DataFrame(df_scaled, columns=X_norm.columns, index=X_norm.index)

# Recombine the untouched 'year' column with the scaled features
x = pd.concat([x['year'], X_norm], axis=1)

This code prepares the data by splitting features and target, normalizing the features (excluding 'year' and 'price'), and then concatenating the normalized features with the 'year' column to create the final set of features for modeling.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

This code uses scikit-learn's train_test_split function to partition the features (x) and target (y) into training and testing sets. The training set (x_train and y_train) is used to train the model, while the testing set (x_test and y_test) is used to evaluate its performance and generalization to unseen data.

XGBoost

model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=200, max_depth=6, random_state=42)

model.fit(x_train, y_train)

In this code snippet, we are creating and training a predictive model using XGBoost, a popular gradient boosting framework. First, we define our model using xgb.XGBRegressor(). This sets up an instance of the XGBoost Regressor, specifying our objective as 'reg:squarederror,' indicating that we're aiming to minimize the mean squared error during training. We set the number of estimators (trees) to 200, controlling the complexity of the model, and the maximum depth of each tree to 6, which helps prevent overfitting. Additionally, we set the random state to 42 to ensure reproducibility of results.

Next, we proceed to train our model using the fit method. We provide the training features (x_train) and their corresponding target values (y_train) to the fit function. This process involves the model learning the patterns and relationships in the training data, enabling it to make accurate predictions for unseen data.

The Role of Hyperparameter Tuning

Both XGBoost and LightGBM require proper hyperparameter tuning to unleash their full potential. Hyperparameter tuning involves adjusting various settings of the algorithm to optimize its performance. This step is crucial in ensuring that the model can effectively learn from the data and make precise predictions.

param_grid = {'learning_rate': [0.05, 0.1, 0.2],
              'n_estimators': [50, 100, 200],
              'max_depth': [-1, 3, 5, 7],
              'num_leaves': [31, 63, 127, 255]}

In this code, param_grid defines a grid of hyperparameters for tuning. It includes different values for key parameters: the learning rate, the number of estimators (trees), the maximum tree depth (where -1 means no depth limit in LightGBM), and the number of leaves (primarily a LightGBM parameter). The purpose is to explore various combinations of these hyperparameters during the model tuning process, optimizing the model's performance for the given task. This hyperparameter grid aids in finding the best configuration, ultimately improving predictive accuracy and robustness.

from sklearn.model_selection import GridSearchCV

## initialize the grid search
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)

The GridSearchCV will perform a cross-validated grid search using the provided parameter grid (param_grid). It splits the data into 5 folds for cross-validation (cv=5), meaning it will evaluate each combination of hyperparameters five times, using a different fold for validation each time. The n_jobs=-1 parameter allows the search to be parallelized, leveraging all available CPU cores for faster processing.

Ultimately, this grid search will help us identify the best combination of hyperparameters for our XGBoost Regressor model based on the evaluation metric specified (often mean squared error for regression tasks). This fine-tuning aims to enhance the model's predictive performance and generalize well to unseen data.

grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_

Next, we use the tuned model (best_model) to predict the target values (y_pred) for the test set (x_test), as sketched below. We then calculate the mean squared error (mse) to quantify prediction accuracy by comparing the predicted values (y_pred) against the actual test values (y_test). This metric helps us assess how well the model performs on unseen data, providing a measure of predictive effectiveness.
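A minimal sketch of that prediction step, assuming grid_search has been fitted and best_model extracted as above:

# Predict on the held-out test set with the tuned model
y_pred = best_model.predict(x_test)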

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)        

Now we're evaluating our model's performance using common regression metrics. We first calculate the mean squared error (mse) to quantify prediction accuracy, then compute the root mean squared error (rmse) for a more interpretable measure in the same units as the target. We also determine the mean absolute error (mae), which gauges prediction accuracy from a different perspective that is less sensitive to outliers. Finally, we calculate the R-squared (r_squared) value, which indicates how well the model fits the data by conveying the proportion of variance in the target variable explained by the features. Together, these metrics offer a comprehensive view of the model's predictive capabilities and its alignment with the true target values.

from sklearn.metrics import explained_variance_score
print(explained_variance_score(y_test, y_pred))

We're using the explained_variance_score function from scikit-learn to calculate the explained variance of our predictions (y_pred) against the actual target values (y_test). The explained variance score indicates the proportion of variance in the target variable that the model's predictions explain. It measures the model's ability to capture the underlying patterns and variability in the data, with higher values indicating a better fit. Printing this score gives us insight into how well the model accounts for the variance in the target variable.


Comparing XGBoost and LightGBM

The same steps were followed to create a LightGBM model.
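A hedged sketch of that analogous LightGBM workflow, assuming the lightgbm package, the same train/test split, and the param_grid defined earlier (the exact settings used in the original notebook may differ):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Base LightGBM regressor, tuned with the same grid search routine as XGBoost
lgb_model = lgb.LGBMRegressor(objective='regression', random_state=42)
lgb_search = GridSearchCV(lgb_model, param_grid, cv=5, n_jobs=-1)
lgb_search.fit(x_train, y_train)

best_lgb = lgb_search.best_estimator_
lgb_pred = best_lgb.predict(x_test)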

In a real-world scenario, the effectiveness of machine learning algorithms is often judged by their practical applications. A fascinating aspect of XGBoost and LightGBM is that, when properly tuned, they perform remarkably similarly in car price prediction. This is a testament to the sophistication of both algorithms and their ability to adapt to the intricacies of the task.

XGBoost: Performance

The results are as follows:

Mean Squared Error (MSE): 231761848891.70505

Root Mean Squared Error (RMSE): 481416.50251285016

Mean Absolute Error (MAE): 278907.4528850518

R-squared (R2): 0.9609000365264161

Explained Variance score: 0.9590632147449937

The XGBoost model performs well in terms of explaining the variance in car prices, as evidenced by the high R-squared and explained variance scores. However, there is room for improvement in reducing the prediction errors, as indicated by the relatively high MSE, RMSE, and MAE. Further optimization and fine-tuning of the model's hyperparameters may help in achieving more accurate predictions.

LightGBM: Performance

Results:

Mean Squared Error (MSE): 239113435514.25488

Root Mean Squared Error (RMSE): 488992.2652908274

Mean Absolute Error (MAE): 277097.64916909754

R-squared (R2): 0.9596597686834162

Explained Variance score: 0.9596597686834162

The results are very similar to those of XGBoost.

Additional tips for reducing the RMSE

  • Use a validation set. When tuning the hyperparameters of our model, it is important to use a validation set to evaluate the model's performance on unseen data. This helps us avoid overfitting.
  • Use early stopping. Early stopping is a technique that can be used to prevent overfitting. It works by stopping the training process early if the model's performance on the validation set stops improving.
  • Use regularization. Regularization reduces the complexity of the model and helps prevent overfitting. There are a number of different regularization techniques that can be used in XGBoost and LightGBM, such as L1 and L2 regularization. A sketch combining these tips is shown after this list.
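A minimal sketch combining these three tips for the XGBoost model, assuming a recent xgboost version where early_stopping_rounds is a constructor argument (the split size and parameter values are illustrative):

from sklearn.model_selection import train_test_split
import xgboost as xgb

# Carve a validation set out of the training data
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

es_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=1000,          # generous cap; early stopping picks the effective number
    max_depth=6,
    learning_rate=0.05,
    reg_alpha=0.1,              # L1 regularization
    reg_lambda=1.0,             # L2 regularization
    early_stopping_rounds=50,   # stop when validation error fails to improve for 50 rounds
    random_state=42
)
es_model.fit(x_tr, y_tr, eval_set=[(x_val, y_val)], verbose=False)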

Important Features

import matplotlib.pyplot as plt

# Feature importances plot
plt.figure(figsize=(8, 6))
feature_importances = pd.Series(model.feature_importances_, index=x.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Top 10 Feature Importances')
plt.show()        

In this bar chart, we can clearly see the top 10 most important features. This visualization helps us understand which features have the most significant influence on the model's predictions, aiding in feature selection and model interpretation.

Conclusion

XGBoost and LightGBM have undoubtedly revolutionized predictive modeling and numerous other machine learning applications. Their efficacy, efficiency, and versatility make them indispensable tools for data scientists and analysts worldwide. As these algorithms continue to evolve and adapt to new challenges, their popularity in the field of machine learning is sure to endure. The journey of XGBoost and LightGBM is an exemplary tale of how innovation, dedication, and community collaboration have transformed the landscape of predictive analytics.


