Powering Predictive Precision: XGBoost and LightGBM
https://taylorwells.com.au/price-prediction-algorithm/

In the ever-evolving landscape of machine learning and data science, the arsenal of tools available to data scientists and analysts continues to expand. Among these, two standout algorithms, XGBoost and LightGBM, have gained immense popularity and have proven their mettle in various predictive tasks. Notably, they have been instrumental in revolutionizing the prediction of car prices. This article delves into these robust algorithms and explores how they have become exceptionally effective in predicting car prices, showcasing their significance in the field of machine learning.


Click here to access the Dashboard

Click here to access the EDA + Analysis in Python

Click here to see the ML source code


XGBoost and LightGBM: A Brief Overview

Before we delve into their role in car price prediction, let's briefly introduce both XGBoost and LightGBM.

XGBoost (Extreme Gradient Boosting):

XGBoost is a powerful gradient boosting framework designed for optimizing machine learning tasks. It is renowned for its efficiency and effectiveness on structured data. Key strengths include native handling of missing data, built-in regularization, and parallel processing, which make it highly versatile for a wide range of applications.
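To make those strengths concrete, here is a minimal, illustrative sketch (not part of the original walkthrough; the tiny dataset and parameter values are made up). XGBoost's scikit-learn wrapper accepts missing values natively, exposes L1/L2 regularization, and builds trees in parallel:

import numpy as np
import xgboost as xgb

# Tiny illustrative dataset containing a missing value; XGBoost handles NaNs natively
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

demo_model = xgb.XGBRegressor(
    n_estimators=50,
    reg_alpha=0.1,   # L1 regularization
    reg_lambda=1.0,  # L2 regularization
    n_jobs=-1        # parallel tree construction across all CPU cores
)
demo_model.fit(X, y)
print(demo_model.predict(X))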

LightGBM:

LightGBM, or Light Gradient Boosting Machine, is an open-source, distributed, high-performance gradient boosting framework. Developed by Microsoft, LightGBM is known for its speed and efficiency, making it an excellent choice for large datasets. It utilizes histogram-based learning, which helps accelerate the training process while preserving accuracy.
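For comparison, here is a minimal LightGBM sketch (again purely illustrative, with synthetic data; max_bin is the knob behind histogram-based learning, controlling how finely continuous features are bucketed):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)

demo_lgbm = lgb.LGBMRegressor(
    n_estimators=100,
    num_leaves=31,   # leaf-wise growth is capped by leaves rather than depth
    max_bin=255,     # number of histogram bins per feature
    n_jobs=-1
)
demo_lgbm.fit(X, y)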

Car Price Prediction: A Complex Challenge

Predicting car prices is a complex task due to the myriad of variables involved. Factors such as make, model, age, mileage, condition, and even the current market trends all play a significant role in determining a car's worth. Traditional regression models often struggle with this complexity. However, XGBoost and LightGBM have emerged as robust contenders in this domain.

Steps:

The data preprocessing has already been done; the process can be accessed through the links above.

Important Libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

# Suppress noisy deprecation/future warnings during experimentation
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

import xgboost as xgb

Feature Selection

## Feature Selection
correlation_matrix = df.select_dtypes(include='number').corr()

threshold = 0.8
corr_features = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            corr_features.add(colname)

df.drop(corr_features, axis=1, inplace=True)

# Visualize the correlation matrix 
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()        

The goal of this code is to remove features that exhibit a high correlation (above a specified threshold of 0.8) with other features. First, it calculates the correlation matrix for numerical columns in the DataFrame using Pearson correlation coefficients. The corr() function computes this matrix.

Next, it iterates through the correlation matrix, identifying pairs of features with a correlation magnitude greater than the threshold. For each pair exceeding the threshold, it adds the feature with higher index (column) to a set called corr_features. This set will contain features that need to be removed due to high correlation with other features.

After identifying the correlated features, it drops them from the DataFrame using df.drop() along the columns (axis=1).

Finally, the code visualizes the correlation matrix (computed before the drop) as a heatmap using seaborn and matplotlib. The heatmap color-codes the correlation coefficients, providing a visual representation of the relationships between features. This visualization helps in understanding the strength and direction of the correlations, aiding interpretation of the correlation-based feature selection.


A correlation matrix is a table in which every cell contains the correlation coefficient between a pair of variables in a data set: +1 indicates a perfect positive linear relationship, 0 indicates no linear relationship, and -1 indicates a perfect negative linear relationship. It is most commonly used when building regression models.

It is recommended to avoid highly correlated columns, as they add redundancy and complexity to the model without contributing much additional information.

Label Encoding

Label encoding is a technique used to convert categorical data into numerical format, making it compatible with machine learning algorithms. In this process, each unique category in a categorical feature is assigned a unique numerical label. For instance, categories like 'red', 'green', and 'blue' might be assigned the labels 0, 1, and 2, respectively. However, it's important to note that label encoding introduces an ordinal relationship between categories, which can mislead the model. For categories without a natural ordering, techniques like one-hot encoding represent them without implying any hierarchy. Label encoding is quick and simple, making it a commonly used preprocessing step in data preparation for machine learning tasks. The following code is used to encode the dataset.

# Encode each categorical column as integer labels
# (the same encoder object is refit for every column)
le = LabelEncoder()
df['city'] = le.fit_transform(df['city'])
df['make'] = le.fit_transform(df['make'])
df['transmission'] = le.fit_transform(df['transmission'])
df['registered'] = le.fit_transform(df['registered'])
df['registration status'] = le.fit_transform(df['registration status'])
df['fuel'] = le.fit_transform(df['fuel'])
df['color'] = le.fit_transform(df['color'])
df['model'] = df['model'].astype(str)
df['model'] = le.fit_transform(df['model'])
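For the unordered categories mentioned above (such as color or fuel), a one-hot alternative can be sketched with pandas. This is only an option to consider, not a step the original pipeline performs, and the column names are simply the ones used in this dataset:

# One-hot encode selected categorical columns instead of label encoding them
df_onehot = pd.get_dummies(df, columns=['color', 'fuel'], drop_first=True)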

Data Split and Normalize

# Separate features and target
x = df.drop(['price'], axis=1)
y = df['price']

# Features to scale (everything except 'year' and the target)
X_norm = df.drop(['year', 'price'], axis=1)

## normalize the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(X_norm)
X_norm = pd.DataFrame(df_scaled, columns=X_norm.columns, index=X_norm.index)

# Recombine the untouched 'year' column with the scaled features
x = pd.concat([x['year'], X_norm], axis=1)

This code prepares the data by splitting features and target, normalizing the features (excluding 'year' and 'price'), and then concatenating the normalized features with the 'year' column to create the final set of features for modeling.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

This code uses scikit-learn's train_test_split function to partition the features (x) and target (y) into training and testing sets. The training set (x_train and y_train) is used to train the model, while the testing set (x_test and y_test) is used to evaluate its performance and generalization to unseen data.

XGBoost

model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=200, max_depth=6, random_state=42)

model.fit(x_train, y_train)

In this code snippet, we are creating and training a predictive model using XGBoost, a popular gradient boosting framework. First, we define our model using xgb.XGBRegressor(). This sets up an instance of the XGBoost Regressor, specifying our objective as 'reg:squarederror,' indicating that we're aiming to minimize the mean squared error during training. We set the number of estimators (trees) to 200, controlling the complexity of the model, and the maximum depth of each tree to 6, which helps prevent overfitting. Additionally, we set the random state to 42 to ensure reproducibility of results.

Next, we proceed to train our model using the fit method. We provide the training features (x_train) and their corresponding target values (y_train) to the fit function. This process involves the model learning the patterns and relationships in the training data, enabling it to make accurate predictions for unseen data.

The Role of Hyperparameter Tuning

Both XGBoost and LightGBM require proper hyperparameter tuning to unleash their full potential. Hyperparameter tuning involves adjusting various settings of the algorithm to optimize its performance. This step is crucial in ensuring that the model can effectively learn from the data and make precise predictions.

param_grid = {'learning_rate': [0.05, 0.1, 0.2],
              'n_estimators': [50, 100, 200],
              'max_depth': [-1, 3, 5, 7],
              'num_leaves': [31, 63, 127, 255]}

In this code, param_grid defines a grid of hyperparameters for tuning. It includes different values for key parameters: the learning rate, the number of estimators (trees), the maximum tree depth (where -1 means no depth limit in LightGBM), and the number of leaves (primarily a LightGBM parameter). The purpose is to explore various combinations of these hyperparameters during the model tuning process, optimizing the model's performance for the given task. This hyperparameter grid aids in finding the best configuration, ultimately improving predictive accuracy and robustness.

from sklearn.model_selection import GridSearchCV

## initialize the grid search
grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)

The GridSearchCV will perform a cross-validated grid search using the provided parameter grid (param_grid). It splits the data into 5 folds for cross-validation (cv=5), meaning it will evaluate each combination of hyperparameters five times, using a different fold for validation each time. The n_jobs=-1 parameter allows the search to be parallelized, leveraging all available CPU cores for faster processing.

Ultimately, this grid search will help us identify the best combination of hyperparameters for our XGBoost Regressor model based on the evaluation metric specified (often mean squared error for regression tasks). This fine-tuning aims to enhance the model's predictive performance and generalize well to unseen data.

grid_search.fit(x_train, y_train)
best_model = grid_search.best_estimator_

Next, we use the tuned model (best_model) to predict the target values (y_pred) for the test set (x_test), as sketched below. We then calculate the mean squared error (mse) to quantify prediction accuracy by comparing the predicted values (y_pred) against the actual test values (y_test). This metric helps us assess how well the model performs on unseen data, providing a measure of predictive effectiveness.
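A minimal sketch of that prediction step, assuming grid_search has been fitted and best_model extracted as above:

# Predict on the held-out test set with the tuned model
y_pred = best_model.predict(x_test)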

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)        

Now we're evaluating our model's performance using common regression metrics. We first calculate the mean squared error (mse) to quantify prediction accuracy, then compute the root mean squared error (rmse) for a more interpretable measure in the same units as the target. We also determine the mean absolute error (mae), which gauges prediction accuracy from a different perspective that is less sensitive to outliers. Finally, we calculate the R-squared (r_squared) value, which indicates how well the model fits the data by conveying the proportion of variance in the target variable explained by the features. Together, these metrics offer a comprehensive view of the model's predictive capabilities and its alignment with the true target values.

from sklearn.metrics import explained_variance_score
print(explained_variance_score(y_test, y_pred))

We're using the explained_variance_score function from scikit-learn to calculate the explained variance of our predictions (y_pred) against the actual target values (y_test). The explained variance score indicates the proportion of variance in the target variable that the model's predictions explain. It measures the model's ability to capture the underlying patterns and variability in the data, with higher values indicating a better fit. Printing this score gives us insight into how well the model accounts for the variance in the target variable.


Comparing XGBoost and LightGBM

The same steps were followed to create a LightGBM model.
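A hedged sketch of that analogous LightGBM workflow, assuming the lightgbm package, the same train/test split, and the param_grid defined earlier (the exact settings used in the original notebook may differ):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

# Base LightGBM regressor, tuned with the same grid search routine as XGBoost
lgb_model = lgb.LGBMRegressor(objective='regression', random_state=42)
lgb_search = GridSearchCV(lgb_model, param_grid, cv=5, n_jobs=-1)
lgb_search.fit(x_train, y_train)

best_lgb = lgb_search.best_estimator_
lgb_pred = best_lgb.predict(x_test)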

In a real-world scenario, the effectiveness of machine learning algorithms is often judged by their practical applications. A fascinating aspect of XGBoost and LightGBM is that, when properly tuned, they perform remarkably similarly in car price prediction. This is a testament to the sophistication of both algorithms and their ability to adapt to the intricacies of the task.

XGBoost: Performance

The results are as follows:

Mean Squared Error (MSE): 231761848891.70505

Root Mean Squared Error (RMSE): 481416.50251285016

Mean Absolute Error (MAE): 278907.4528850518

R-squared (R2): 0.9609000365264161

Explained Variance score: 0.9590632147449937

The XGBoost model performs well in terms of explaining the variance in car prices, as evidenced by the high R-squared and explained variance scores. However, there is room for improvement in reducing the prediction errors, as indicated by the relatively high MSE, RMSE, and MAE. Further optimization and fine-tuning of the model's hyperparameters may help in achieving more accurate predictions.

LightGBM: Performance

Results:

Mean Squared Error (MSE): 239113435514.25488

Root Mean Squared Error (RMSE): 488992.2652908274

Mean Absolute Error (MAE): 277097.64916909754

R-squared (R2): 0.9596597686834162

Explained Variance score: 0.9596597686834162

The results are very similar to those of XGBoost.

Additional tips for reducing the RMSE

  • Use a validation set. When tuning the hyperparameters of our model, it is important to use a validation set to evaluate the model's performance on unseen data. This helps us avoid overfitting.
  • Use early stopping. Early stopping is a technique that can be used to prevent overfitting. It works by stopping the training process early if the model's performance on the validation set stops improving.
  • Use regularization. Regularization reduces the complexity of the model and helps prevent overfitting. There are a number of different regularization techniques that can be used in XGBoost and LightGBM, such as L1 and L2 regularization. A sketch combining these tips is shown after this list.
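A minimal sketch combining these three tips for the XGBoost model, assuming a recent xgboost version where early_stopping_rounds is a constructor argument (the split size and parameter values are illustrative):

from sklearn.model_selection import train_test_split
import xgboost as xgb

# Carve a validation set out of the training data
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

es_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=1000,          # generous cap; early stopping picks the effective number
    max_depth=6,
    learning_rate=0.05,
    reg_alpha=0.1,              # L1 regularization
    reg_lambda=1.0,             # L2 regularization
    early_stopping_rounds=50,   # stop when validation error fails to improve for 50 rounds
    random_state=42
)
es_model.fit(x_tr, y_tr, eval_set=[(x_val, y_val)], verbose=False)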

Important Features

import matplotlib.pyplot as plt

# Feature importances plot
plt.figure(figsize=(8, 6))
feature_importances = pd.Series(model.feature_importances_, index=x.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Top 10 Feature Importances')
plt.show()        

In this bar chart, we can clearly see the top 10 most important features. This visualization helps us understand which features have the most significant influence on the model's predictions, aiding in feature selection and model interpretation.

Conclusion

XGBoost and LightGBM have undoubtedly revolutionized predictive modeling and numerous other machine learning applications. Their efficacy, efficiency, and versatility make them indispensable tools for data scientists and analysts worldwide. As these algorithms continue to evolve and adapt to new challenges, their popularity in the field of machine learning is sure to endure. The journey of XGBoost and LightGBM is an exemplary tale of how innovation, dedication, and community collaboration have transformed the landscape of predictive analytics.


