Continuous value prediction with a decision forest algorithm

House price prediction is a classic example of a continuous value prediction problem in the regression problem family. I took this rich dataset to explore a decision forest algorithm I came across while searching for a suitable algorithm for a task at work. I found this Kaggle notebook by Gus Martins very informative; in it he tackles the problem by introducing TensorFlow Decision Forests (TF-DF). TF-DF is a library built on TensorFlow that implements decision forest algorithms. Decision forests are a family of machine learning algorithms for classification, regression, and ranking tasks. They include models like Random Forests and Gradient Boosted Trees (GBT), which are powerful for tabular data and can work with both numerical and categorical features.

In this article, I mimic some of the steps in the notebook and add a few of my own to understand how TF-DF works with continuous value prediction. So let's jump into it.

Importing library

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("Tensorflow v" + tf.__version__)
print("Tensorflow Decision Forests v" + tfdf.__version__)        
Tensorflow v2.17.0
Tensorflow Decision Forests v1.10.0

Version compatibility between TensorFlow and TF-DF is very important here. I spent a couple of hours looking for a compatible pair and finally found that TensorFlow v2.17.0 works with TF-DF v1.10.0.
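One way to catch a mismatch early is to compare the installed versions against the pair that worked here. The sketch below uses only the standard library; the helper name find_version_mismatches and the EXPECTED dictionary are my own illustration, not part of TF-DF, and the package names are assumed to match the names used with pip.

```python
from importlib import metadata

# Version pair reported as working in this article.
EXPECTED = {
    "tensorflow": "2.17.0",
    "tensorflow_decision_forests": "1.10.0",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in find_version_mismatches(EXPECTED).items():
    print(f"{package}: expected {wanted}, found {installed}")
```

If nothing is printed, the installed versions match the pair above.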


Exploring the dataset

train_file_path = "./house-prices-advanced-regression-techniques/train.csv"
dataset_df = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(dataset_df.shape))
dataset_df.head(3)
the output of the above code
There are 79 feature columns. Using these features, the model has to predict the house sale price given in the label column, SalePrice.

I will drop the Id column as it is not necessary for model training.

dataset_df = dataset_df.drop('Id', axis=1)        

Let's inspect the types of the feature columns in the dataset.

dataset_df.info()        

House price distribution

Let's take a look at how the house prices are distributed.

print(dataset_df['SalePrice'].describe())        
the output of the above code

Let's visualize this distribution for a better understanding.

plt.figure(figsize=(9, 8))
# histplot with kde=True and stat="density" replaces the deprecated sns.distplot
sns.histplot(dataset_df['SalePrice'], color='g', bins=100, kde=True, stat="density", alpha=0.4)
the output of the above code
According to the chart, most house prices fall between roughly 130k and 300k.

Numeric data distribution

Now let's take a look at how the numerical features are distributed. In order to do this, let us first list all the types of data from our dataset and select only the numerical ones.

list(set(dataset_df.dtypes.tolist()))
df_num = dataset_df.select_dtypes(include = ['float64', 'int64'])
df_num.head(5)        

Let's plot the distribution for all the numerical features.

the output of the above code
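The plotting code for this step is not shown above. A minimal sketch of how such a grid of histograms could be produced with pandas' DataFrame.hist follows. The toy df_num below is made-up stand-in data so the snippet runs on its own, and the figure size and bin count are arbitrary choices; in the notebook you would call .hist on the real df_num instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the numeric columns of the house price data.
rng = np.random.default_rng(0)
df_num = pd.DataFrame({
    "LotArea": rng.normal(10_000, 3_000, size=200),
    "YearBuilt": rng.integers(1900, 2010, size=200),
    "SalePrice": rng.lognormal(mean=12, sigma=0.4, size=200),
})

# One histogram per numeric column, laid out on a grid of subplots.
axes = df_num.hist(figsize=(12, 8), bins=50, xlabelsize=8, ylabelsize=8)
plt.tight_layout()
```

In a notebook the figure renders automatically; in a script, add plt.show() at the end.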

Preparing the dataset for training

This dataset contains a mix of numeric and categorical features, some of which have missing values. Decision forests generally require less data preprocessing than neural networks. As TF-DF supports all these feature types natively, no preprocessing is required. This is one advantage of tree-based models, making them a great entry point to TensorFlow and ML.

Let's split the dataset into training and testing datasets.

def split_dataset(dataset, test_ratio=0.30):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(valid_ds_pd)))        
1010 examples in training, 450 examples in testing.
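Because np.random.rand is not seeded here, the split (and therefore every metric that follows) changes from run to run. A hedged variant that makes the split reproducible is sketched below; the seed parameter and the toy DataFrame are my additions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def split_dataset(dataset, test_ratio=0.30, seed=42):
    """Randomly split a DataFrame; the same seed always gives the same split."""
    rng = np.random.default_rng(seed)
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]


# Toy data standing in for dataset_df.
toy_df = pd.DataFrame({"feature": range(1000), "SalePrice": range(1000)})
train_a, test_a = split_dataset(toy_df)
train_b, test_b = split_dataset(toy_df)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))  # True: identical split on every call
```

Passing a different seed gives a different, but still repeatable, split.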

Now we need to convert the Pandas dataset to a TensorFlow dataset. By default, the Random Forest model is configured for classification tasks. Since this is a regression problem, we will specify the type of the task (tfdf.keras.Task.REGRESSION) as a parameter here.

label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)
valid_ds_pd = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task=tfdf.keras.Task.REGRESSION)        

Select the model

There are several tree-based models for you to choose from.

  • RandomForestModel
  • GradientBoostedTreesModel
  • CartModel
  • DistributedGradientBoostedTreesModel

To start, we'll work with a Random Forest. This is the most well-known of the Decision Forest training algorithms.

A Random Forest is a collection of decision trees, each trained independently on a random subset of the training dataset (sampled with replacement). The algorithm is unique in that it is robust to overfitting and easy to use.

tfdf.keras.get_all_models()        
the output of the above code

Configuring the models

In this case, we'll use the RandomForest model. Random forests are an ensemble learning method that builds multiple decision trees and merges their results to improve accuracy and avoid overfitting.

rf = tfdf.keras.RandomForestModel(
    hyperparameter_template="benchmark_rank1",
    task=tfdf.keras.Task.REGRESSION)

benchmark_rank1 is one of the predefined hyperparameter templates shipped with TF-DF. The name refers to hyperparameters that ranked at the top of the library authors' benchmarks, so the template tends to perform well on a wide range of problems. It presets hyperparameters that influence how each tree in the forest is built and how well the model generalizes to unseen data. The benchmark_rank1 template is useful when we want to quickly create a model that performs well on a wide range of datasets without manual tuning.

Create a Random Forest Model

Here we create the Random Forest model with the default hyperparameters (this reassigns rf, replacing the templated model above), specifying the task type as tfdf.keras.Task.REGRESSION because we are dealing with a regression problem: predicting SalePrice.

rf = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION)
rf.compile(metrics=["mse"])        

Train The Model

rf.fit(x=train_ds)        

Visualize the model

tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)        
the output of the above code

Evaluate the model on the out-of-bag (OOB) data and the validation dataset

Before training the model, we manually set aside 30% of the dataset for validation, named valid_ds_pd.

We can also use the out-of-bag (OOB) score to validate our RandomForestModel. Each tree in a Random Forest is trained on a random sample of the training set drawn with replacement. The training examples that were not drawn for a given tree are that tree's out-of-bag (OOB) data, and they can be used to evaluate the model without touching the validation set. The OOB score is computed on this OOB data.
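To make the out-of-bag idea concrete, the sketch below draws one bootstrap sample (sampling with replacement) for a single hypothetical tree and counts how many training examples were never picked. It uses only the standard library and illustrates the sampling scheme, not TF-DF's internal code. When the sample is as large as the training set, roughly 1/e, or about 37%, of the examples end up out-of-bag for any given tree.

```python
import random


def bootstrap_split(num_examples, seed=0):
    """Return (in_bag, out_of_bag) index sets for one hypothetical tree."""
    rng = random.Random(seed)
    # Sample with replacement: some indices repeat, others are never drawn.
    drawn = [rng.randrange(num_examples) for _ in range(num_examples)]
    in_bag = set(drawn)
    out_of_bag = set(range(num_examples)) - in_bag
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(num_examples=1010)
print(f"in-bag: {len(in_bag)}, out-of-bag: {len(out_of_bag)}")
print(f"out-of-bag fraction: {len(out_of_bag) / 1010:.2f}")
```

Since every tree draws its own sample, each training example is out-of-bag for some of the trees, which is what allows the forest to be scored on data its trees did not see.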

The training logs show the Root Mean Squared Error (RMSE) evaluated on the out-of-bag dataset according to the number of trees in the model. Let us plot this.

import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel('Number of trees')
plt.ylabel('RMSE (out-of-bag)')
plt.show()        
the output of the above code

Model Evaluation

inspector = rf.make_inspector()
inspector.evaluation()        
Evaluation(num_examples=1019, accuracy=None, loss=None, rmse=27798.58368912402, ndcg=None, aucs=None, auuc=None, qini=None)

  • num_examples=1019 This indicates that 1,019 data points were used to evaluate the model. For a Random Forest, the inspector reports the out-of-bag evaluation, so these are training examples rather than the validation set.
  • accuracy=None Accuracy is a classification metric and is not applicable to regression tasks.
  • loss=None No loss value is reported here. Regression models are typically scored with losses like Mean Squared Error (MSE) or Mean Absolute Error (MAE). Since we compiled the model with mse, rf.evaluate reports that value.
  • rmse=27798.58 The root mean squared error measures the typical size of the model's prediction error, expressed in the same units as the target variable. With an average sale price of $180,921.20, an RMSE of about 27,800 is roughly 15% of the average value, which may be acceptable for a first model.
  • ndcg=None Normalized Discounted Cumulative Gain is a metric used for ranking problems, where the goal is to rank items in a particular order. Since this is a regression task, NDCG is not relevant here.
  • aucs=None Area Under the Curve is a classification metric, typically used with binary or multi-class classification tasks to measure the model's ability to distinguish between classes. It's not applicable to regression tasks.
  • auuc=None Area Under the Uplift Curve is a metric used in uplift modeling, a specialized area of predictive modeling focused on causal inference (e.g., estimating the effect of an action on a target). This is not relevant to a regression model.
  • qini=None The Qini coefficient is another uplift modeling metric, so it isn't applicable to a regression task either.

evaluation = rf.evaluate(valid_ds_pd, return_dict=True)
print()
for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")        
the output of the above code

MSE is the average of the squared differences between the predicted and actual values. The larger the difference between predictions and true values, the higher the MSE. An MSE of 915,694,208 corresponds to a root mean squared error of

RMSE = √915,694,208 ≈ 30,260

meaning that the model's typical error is about $30,260, which is roughly 17% of the average sale price.
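The arithmetic can be checked in a few lines. The numbers are copied from the outputs shown in this article (validation MSE, out-of-bag RMSE, and average sale price); the helper name relative_error is just for illustration.

```python
import math

validation_mse = 915_694_208   # MSE reported by rf.evaluate
oob_rmse = 27_798.58           # RMSE reported by inspector.evaluation()
mean_sale_price = 180_921.20   # average SalePrice quoted in this article


def relative_error(rmse, mean_value):
    """RMSE expressed as a fraction of the mean target value."""
    return rmse / mean_value


validation_rmse = math.sqrt(validation_mse)
print(f"validation RMSE: {validation_rmse:,.0f}")
print(f"validation RMSE / mean price: {relative_error(validation_rmse, mean_sale_price):.1%}")
print(f"OOB RMSE / mean price: {relative_error(oob_rmse, mean_sale_price):.1%}")
```

Expressing the error relative to the average price makes it easier to compare runs, since the random split changes the absolute numbers slightly each time.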

Variable Importance

Variable importance generally indicates how much a feature contributes to the model predictions or quality. There are several ways to identify important features using TensorFlow Decision Forests. Let us list the variable importances available for this model.

print("Available variable importances:")
for importance in inspector.variable_importances().keys():
  print("\t", importance)        


the output of the above code

As an example, let us display the feature ranking for the NUM_AS_ROOT variable importance.

NUM_AS_ROOT counts the number of trees in which a feature is used as the root split. The larger the score, the more impact the feature has on the outcome of the model.
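As I understand it, NUM_AS_ROOT is simply a count of how many trees in the forest split on a given feature at their root node. The toy sketch below mimics that count on a hand-written list of root features; the list is invented for illustration and does not come from the trained model.

```python
from collections import Counter

# Hypothetical root-split feature of each tree in a tiny 8-tree forest.
root_features = [
    "OverallQual", "GrLivArea", "OverallQual", "GarageCars",
    "OverallQual", "GrLivArea", "ExterQual", "OverallQual",
]

# NUM_AS_ROOT-style importance: number of trees rooted at each feature,
# sorted from most to least frequent.
num_as_root = Counter(root_features).most_common()

for feature, count in num_as_root:
    print(f"{feature}: {count}")
```

A feature that is chosen at the root of many trees separates the data well on its own, which is why this count works as a rough importance score.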

inspector.variable_importances()["NUM_AS_ROOT"]

plt.figure(figsize= (12,4))

# Number of trees in which each feature is used as the root split
variable_importance_metric = "NUM_AS_ROOT"
variable_importance = inspector.variable_importances()[variable_importance_metric]

# Extract the feature name and importance value
feature_names = [vi[0].name for vi in variable_importance]
feature_importances = [vi[1] for vi in variable_importance]

# Features are ordered in descending importance value
feature_ranks = range(len(feature_names))

bar = plt.barh(feature_ranks, feature_importances, label=[str(x) for x in feature_ranks])
plt.yticks(feature_ranks, feature_names)
plt.gca().invert_yaxis()

# TODO: Replace with "plt.bar_label()" when available.
# Label each bar with values
for importance, patch in zip(feature_importances, bar.patches):
  plt.text(patch.get_x() + patch.get_width(), patch.get_y(), f"{importance:.4f}", va="top")

plt.xlabel(variable_importance_metric)
plt.title("NUM_AS_ROOT variable importance")
plt.tight_layout()
plt.show()        


the output of the above code

Prediction of test data

Now it's time to predict sale prices for the test dataset.

test_file_path = "/content/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop("Id")

test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_data, label=None, task=tfdf.keras.Task.REGRESSION)

preds = rf.predict(test_ds)
output = pd.DataFrame({'Id': ids, 'SalePrice': preds.squeeze()})
output.head(5)        


the output of the above code

Let's compare the distribution of the predicted sale prices with the actual sale prices in the training dataset to see how far they deviate.

# Create a figure
plt.figure(figsize=(9, 8))

# Plot the histogram for dataset_df['SalePrice']
sns.histplot(dataset_df['SalePrice'], color='g', bins=100, kde=False, stat="density", alpha=0.4, label='dataset_df')

# Plot the histogram for output['SalePrice']
sns.histplot(output['SalePrice'], color='b', bins=20, kde=False, stat="density", alpha=0.4, label='output')

# Customize the plot
plt.title('Comparison of Sale Prices from actual and predicted dataset', fontsize=16)
plt.xlabel('Sale Price', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend()  # Show a legend to differentiate between the two datasets

# Remove top and right spines
plt.gca().spines[['top', 'right']].set_visible(False)

# Show the plot
plt.show()        


the output of the above code

The two distributions are broadly similar, so the deviation isn't too large. TF-DF is very handy for working with training data and removes many tedious preprocessing steps.

Way forward

  • Creating a web app with a user-friendly UI where users can predict house prices by entering feature values.
  • Using TF-DF for order demand prediction.

You can explore the notebook here.


