Continuous value prediction with decision forest algorithm
Salman Srizon
Manager @ US Bangla, xDaraz (Alibaba Group), xUPAY | Analytics Stack development, Data-driven strategy, customer segmentation | E-comm, Fintech, Q-comm, Soft Dev.
House price prediction is a classic example of a continuous value prediction problem from the regression family. I took this rich dataset to explore a decision forest algorithm I came across while searching for a suitable algorithm for an office task. I found this Kaggle notebook by Gus Martins very informative, where he solves the problem with TensorFlow Decision Forests (TF-DF). TF-DF is a library built on TensorFlow that implements decision forest algorithms. Decision forests are a family of machine learning algorithms for classification, regression, and ranking tasks. They include models like Random Forests and Gradient Boosted Trees (GBT), which perform strongly on tabular data and are known for handling numerical and categorical features (and missing values) natively.
In this article, I mimic some of the processes from that notebook and add a few extra steps to understand how TF-DF works for continuous value prediction. So let's jump into it.
Importing libraries
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print("Tensorflow v" + tf.__version__)
print("Tensorflow Decision Forests v" + tfdf.__version__)
Tensorflow v2.17.0
Tensorflow Decision Forests v1.10.0
Version compatibility between TensorFlow and TF-DF is very important here. I spent a couple of hours hunting for compatible versions and finally found that TF v2.17.0 works with TF-DF v1.10.0.
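For reference, pinning both versions at install time avoids that hunt. A minimal sketch, assuming a Colab/Kaggle-style notebook environment:
# Pinning both packages avoids the TF / TF-DF version mismatch mentioned above
!pip install tensorflow==2.17.0 tensorflow_decision_forests==1.10.0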
Exploring the dataset
train_file_path = "./house-prices-advanced-regression-techniques/train.csv"
dataset_df = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(dataset_df.shape))
dataset_df.head(3)
There are 79 feature columns. Using these features, your model has to predict the house sale price, indicated by the label column named SalePrice.
I will drop the Id column as it is not necessary for model training.
dataset_df = dataset_df.drop('Id', axis=1)
Inspecting the types of feature columns in the dataset:
dataset_df.info()
House price distribution
Let's take a look at how the house prices are distributed.
print(dataset_df['SalePrice'].describe())
Let's visualize this distribution for a better understanding.
plt.figure(figsize=(9, 8))
sns.histplot(dataset_df['SalePrice'], color='g', bins=100, kde=True, stat='density', alpha=0.4)
According to the chart, most of the house prices fall roughly in the $130k to $300k range.
Numeric data distribution
Now let's take a look at how the numerical features are distributed. In order to do this, let us first list all the types of data from our dataset and select only the numerical ones.
list(set(dataset_df.dtypes.tolist()))
df_num = dataset_df.select_dtypes(include = ['float64', 'int64'])
df_num.head(5)
Let's plot the distribution of all the numerical features.
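A simple way to do this is with pandas' built-in hist(), which draws one histogram per numerical column (the figure size and bin count below are just reasonable defaults):
# One histogram per numerical feature in df_num
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
plt.tight_layout()
plt.show()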
Preparing the dataset for training
This dataset contains a mix of numeric, categorical, and missing features. Decision forests are non-parametric models, meaning they often require less data preprocessing than neural networks. Since TF-DF supports all these feature types natively, no preprocessing is required. This is one advantage of tree-based models and makes them a great entry point to TensorFlow and ML. Let's split the dataset into training and testing datasets.
def split_dataset(dataset, test_ratio=0.30):
test_indices = np.random.rand(len(dataset)) < test_ratio
return dataset[~test_indices], dataset[test_indices]
train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
len(train_ds_pd), len(valid_ds_pd)))
1010 examples in training, 450 examples in testing.
Now we need to convert the Pandas DataFrames into TensorFlow datasets. By default, the Random Forest model is configured for classification tasks. Since this is a regression problem, we specify the task type (tfdf.keras.Task.REGRESSION) as a parameter here.
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
Select the model
There are several tree-based models for you to choose from.
To start, we'll work with a Random Forest. This is the most well-known of the Decision Forest training algorithms.
A Random Forest is a collection of decision trees, each trained independently on a random subset of the training dataset (sampled with replacement). The algorithm is unique in that it is robust to overfitting and easy to use.
tfdf.keras.get_all_models()
Configuring the model
In this case, we'll use the RandomForest model. Random forests are an ensemble learning method that builds multiple decision trees and merges their results to improve accuracy and avoid overfitting
rf = tfdf.keras.RandomForestModel( hyperparameter_template="benchmark_rank1", task=tfdf.keras.Task.REGRESSION)
benchmark_rank1 is one of the predefined hyperparameter templates optimized for good performance on a wide range of problems. It is typically tuned for models that aim to achieve high accuracy, or "rank 1", on certain benchmarks. These hyperparameters cover aspects such as the number of trees in the forest, the depth of the trees, and other parameters that influence how the random forest is built and how well it generalizes to unseen data. The benchmark_rank1 template is useful when we want to quickly create a model that performs well on a wide range of datasets without manual tuning.
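If you are curious what a template actually sets, TF-DF exposes the predefined templates on the model class; a quick way to inspect them (the exact output format may vary by version):
# List the predefined hyperparameter templates (e.g. benchmark_rank1) and the values each one bundles
for template in tfdf.keras.RandomForestModel.predefined_hyperparameters():
    print(template)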
Create a Random Forest Model
We will use the defaults to create the Random Forest model, specifying the task type as tfdf.keras.Task.REGRESSION since we are dealing with a regression problem to predict SalePrice.
rf = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
rf.compile(metrics=["mse"])
Train The Model
rf.fit(x=train_ds)
Visualize the model
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)
Evaluate the model on the Out-of-bag (OOB) data and the validation dataset
Before training, we manually set aside 30% of the dataset for validation, named valid_ds_pd.
We can also use the Out-of-bag (OOB) score to validate our RandomForestModel. When training a Random Forest, each tree is trained on a random sample of the training set drawn with replacement (a bootstrap sample). The examples not chosen for a given tree are its Out-of-bag (OOB) data, and the OOB score is computed by evaluating each tree on its own OOB examples.
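To make the bootstrap/OOB idea concrete, here is a tiny illustrative NumPy sketch of how one tree's training sample and OOB set could be drawn (TF-DF does this internally; this is not its actual code):
rng = np.random.default_rng(seed=42)
n = len(train_ds_pd)
# Bootstrap sample: n row indices drawn with replacement
bootstrap_idx = rng.integers(0, n, size=n)
# Rows never drawn are this tree's out-of-bag examples (~37% of rows on average)
oob_mask = ~np.isin(np.arange(n), bootstrap_idx)
tree_train = train_ds_pd.iloc[bootstrap_idx]  # data seen by this tree
tree_oob = train_ds_pd.iloc[oob_mask]         # data used for the OOB score
print(len(tree_train), oob_mask.sum())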
The training logs show the Root Mean Squared Error (RMSE) evaluated on the out-of-bag dataset according to the number of trees in the model. Let us plot this.
import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel('Number of trees')
plt.ylabel('RMSE (out-of-bag)')
plt.show()
Model Evaluation
inspector = rf.make_inspector()
inspector.evaluation()
Evaluation(num_examples=1019, accuracy=None, loss=None, rmse=27798.58368912402, ndcg=None, aucs=None, auuc=None, qini=None)
evaluation = rf.evaluate(valid_ds, return_dict=True)
print()
for name, value in evaluation.items():
print(f"{name}: {value:.4f}")
MSE is the average of the squared differences between the predicted and actual values: the larger the gap between predictions and true values, the higher the MSE. An MSE of about 915,694,208 corresponds to an RMSE (the square root of the MSE) of roughly $30,257, meaning the model is off by about $30k on average, which is around 17% of the average sale price of roughly $181k.
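To make that square-root step explicit in code (assuming the evaluation dict returned by rf.evaluate above exposes the compiled metric under the 'mse' key, as Keras normally does):
import math
mse = evaluation["mse"]
rmse = math.sqrt(mse)  # e.g. sqrt(915,694,208) ≈ 30,260
mean_price = dataset_df["SalePrice"].mean()
print(f"RMSE: {rmse:,.0f} ({rmse / mean_price:.1%} of the mean sale price)")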
Variable Importance
Variable importance generally indicates how much a feature contributes to the model predictions or quality. There are several ways to identify important features using TensorFlow Decision Forests. Let us list the available Variable Importances for Decision Trees
print(f"Available variable importances:")
for importance in inspector.variable_importances().keys():
print("\t", importance)
As an example, let us display the important features of the Variable Importance NUM_AS_ROOT.
The larger the importance score for NUM_AS_ROOT, the more impact it has on the outcome of the model.
inspector.variable_importances()["NUM_AS_ROOT"]
plt.figure(figsize=(12, 4))
# NUM_AS_ROOT: how many times each feature is used as the root node of a tree
variable_importance_metric = "NUM_AS_ROOT"
variable_importance = inspector.variable_importances()[variable_importance_metric]
# Extract the feature names and importance values
feature_names = [vi[0].name for vi in variable_importance]
feature_importances = [vi[1] for vi in variable_importance]
# Features are ordered by descending importance value
feature_ranks = range(len(feature_names))
bar = plt.barh(feature_ranks, feature_importances, label=[str(x) for x in feature_ranks])
plt.yticks(feature_ranks, feature_names)
plt.gca().invert_yaxis()
# Label each bar with its importance value
for importance, patch in zip(feature_importances, bar.patches):
    plt.text(patch.get_x() + patch.get_width(), patch.get_y(), f"{importance:.4f}", va="top")
plt.xlabel(variable_importance_metric)
plt.title("NUM_AS_ROOT variable importance")
plt.tight_layout()
plt.show()
Prediction of test data
Now it's time to predict the data with our test dataset.
test_file_path = "./house-prices-advanced-regression-techniques/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop("Id")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, label=None, task=tfdf.keras.Task.REGRESSION)
preds = rf.predict(test_ds)
output = pd.DataFrame({'Id': ids, 'SalePrice': preds.squeeze()})
output.head(5)
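If you want to submit these predictions to the Kaggle competition, writing them out in the submission format is one extra step (the file name is just an example):
# Write Id + SalePrice to a CSV in the competition's submission format
output.to_csv('submission.csv', index=False)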
Let's compare the predicted sale prices with the actual values from the training data to see how much they deviate.
# Create a figure
plt.figure(figsize=(9, 8))
# Plot the histogram for dataset_df['SalePrice']
sns.histplot(dataset_df['SalePrice'], color='g', bins=100, kde=False, stat="density", alpha=0.4, label='dataset_df')
# Plot the histogram for output['SalePrice']
sns.histplot(output['SalePrice'], color='b', bins=20, kde=False, stat="density", alpha=0.4, label='output')
# Customize the plot
plt.title('Comparison of Sale Prices from actual and predicted dataset', fontsize=16)
plt.xlabel('Sale Price', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend() # Show a legend to differentiate between the two datasets
# Remove top and right spines
plt.gca().spines[['top', 'right']].set_visible(False)
# Show the plot
plt.show()
The deviation isn't too large. TF-DF proved very handy here: it consumes the raw training data (numeric, categorical, and missing values) directly and removes many tedious preprocessing steps.
Way forward
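A natural next experiment is the Gradient Boosted Trees model mentioned at the start, which often edges out Random Forests on tabular problems. A minimal sketch, reusing the datasets built above:
# Sketch: swap in the Gradient Boosted Trees learner and compare validation MSE
gbt = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION)
gbt.compile(metrics=["mse"])
gbt.fit(x=train_ds)
print(gbt.evaluate(valid_ds, return_dict=True))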
You can explore the full notebook here.