Using PyTorch to build a neural network for Housing Price Prediction
Introduction
In this article you will learn how to use PyTorch to build a feed-forward neural network (also called a multi-layer perceptron, MLP) and train it using the automatic differentiation (autograd) functionality that PyTorch provides.
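As a quick illustration of automatic differentiation (a minimal standalone sketch; the toy variables here are not part of the housing example), PyTorch tracks operations on tensors created with requires_grad=True and fills in their gradients when backward() is called:
import torch
# toy scalar example: y = x^2, so dy/dx = 2x
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()      # autograd computes the gradient of y with respect to x
print(x.grad)     # tensor(6.)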
The dataset for this article comes from the Kaggle competition: Housing Price Prediction. The data is divided into a training set and a test set. Both sets include features for each house, such as the year of construction and the condition of the basement. Among these features there are continuous numerical features and discrete categorical features, and some features have missing values ("NA").
The training set also includes the price of each house, which is the target value (label) to be predicted. We train a model on the training set, make predictions on the test set, and submit the results to Kaggle.
Data Exploration and Pre-processing
First, we download and load the dataset:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
train_data_path = './dataset/train.csv'
train = pd.read_csv(train_data_path)
num_of_train_data = train.shape[0]
test_data_path = './dataset/test.csv'
test = pd.read_csv(test_data_path)
The training set has a total of 1460 samples and 81 columns. Among them, Id is the unique number of each sample and SalePrice is the house price, the target value we want to fit. The remaining columns are either numerical features or non-numerical (categorical) features.
Now look at the dimensions of the training dataset:
train.shape
The output is:
(1460, 81)
Alternatively, use:
train.describe()
to see summary statistics for each feature of the dataset.
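It is also useful to see how many values are missing in each column before deciding how to handle them. A minimal sketch, not part of the original code:
# count missing values per column and show the worst offenders
missing_counts = train.isnull().sum().sort_values(ascending=False)
print(missing_counts[missing_counts > 0].head(10))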
Next, we merge the training and test datasets. The main purpose of merging them is to unify the feature-processing pipeline, so that the same feature-engineering steps are applied to both the training set and the test set.
# house price, the target value to fit
target = train.SalePrice
# input features: the SalePrice column can be dropped
train.drop(['SalePrice'], axis=1, inplace=True)
# merge train and test so that feature engineering is done once, which makes it
# convenient to predict house prices for the test set later
# (pd.concat replaces DataFrame.append, which is removed in newer pandas)
combined = pd.concat([train, test])
combined.reset_index(inplace=True)
combined.drop(['index', 'Id'], inplace=True, axis=1)
Then it is time to start feature engineering. This article does not do any complex feature engineering; it only does two things: 1. filter out the columns with missing values; 2. One-Hot encode the categorical features. Missing values affect the prediction quality of the algorithm to some extent; in general they can be filled with default values or values from nearby samples (a sketch of this alternative is shown below, although this article simply drops such columns). For MLP models, categorical features must be encoded into numerical values before training, and One-Hot encoding is one of the most common ways to do this.
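A minimal sketch of the imputation alternative mentioned above (hypothetical; it is not used in the rest of this article, which simply drops columns that contain missing values):
# fill missing values instead of dropping the columns
for col in combined.columns:
    if combined[col].dtype == object:
        # categorical column: fill with the most frequent value
        combined[col] = combined[col].fillna(combined[col].mode()[0])
    else:
        # numerical column: fill with the column median
        combined[col] = combined[col].fillna(combined[col].median())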
We select the columns without missing values using the following function:
# select columns with no missing values
def get_cols_with_no_nans(df, col_type):
    '''
    Arguments :
        df : The dataframe to process
        col_type :
            num : to only get numerical columns with no nans
            no_num : to only get non-numerical columns with no nans
            all : to get any columns with no nans
    '''
    if (col_type == 'num'):
        predictors = df.select_dtypes(exclude=['object'])
    elif (col_type == 'no_num'):
        predictors = df.select_dtypes(include=['object'])
    elif (col_type == 'all'):
        predictors = df
    else:
        print('Error : choose a type (num, no_num, all)')
        return 0
    cols_with_no_nans = []
    for col in predictors.columns:
        if not df[col].isnull().any():
            cols_with_no_nans.append(col)
    return cols_with_no_nans
Numerical features and categorical features are processed separately:
num_cols = get_cols_with_no_nans(combined, 'num')
cat_cols = get_cols_with_no_nans(combined, 'no_num')
# filter out features with missing values
combined = combined[num_cols + cat_cols]
print(num_cols[:5])
print('Number of numerical columns with no nan values: ', len(num_cols))
print(cat_cols[:5])
print('Number of non-numerical columns with no nan values: ', len(cat_cols))
After filtering, there are 25 columns for numerical features and 20 columns for categorical features, for a total of 45 columns.
# One-Hot encoding of categorical features
def oneHotEncode(df, colNames):
    for col in colNames:
        if (df[col].dtype == np.dtype('object')):
            # pandas.get_dummies performs One-Hot encoding of a categorical column
            dummies = pd.get_dummies(df[col], prefix=col)
            df = pd.concat([df, dummies], axis=1)
            # drop the original encoded column
            df.drop([col], axis=1, inplace=True)
    return df
Categorical features also need One-Hot encoding, and pandas.get_dummies can complete the One-Hot encoding process for us automatically. After One-Hot encoding, the number of columns grows considerably, to 149 columns in total.
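The article does not show the call explicitly; applying the function to the merged dataframe would look like this (a minimal sketch):
# One-Hot encode the categorical columns of the merged dataframe
combined = oneHotEncode(combined, cat_cols)
print('shape after One-Hot encoding: ', combined.shape)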
So far we have completed a very simple round of feature engineering; the data can now be converted into the Tensor form that a PyTorch model accepts:
# training data set
train_features = torch.tensor(combined[:num_of_train_data].values, dtype=torch.float)
# training dataset target
train_labels = torch.tensor(target.values, dtype=torch.float).view(-1, 1)
# Test dataset features
test_features = torch.tensor(combined[num_of_train_data:].values, dtype=torch.float)
print("train data size: ", train_features.shape)
print("label data size: ", train_labels.shape)
print("test data size: ", test_features.shape)
Building a Neural Network
Next, we start building the neural network.
There are two ways to build a neural network in PyTorch. A relatively simple feed-forward network can be built with nn.Sequential. nn.Sequential is a container for neural network layers: we can add the layers we need to it directly. The input of the whole model has the same size as the number of features, and the output is a scalar. The hidden layers use the ReLU activation function, and the last layer is a linear layer whose output is the predicted house price.
model_sequential = nn.Sequential(
    nn.Linear(train_features.shape[1], 128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 1)
)
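As a quick sanity check (hypothetical usage, not part of the original article), the container can be called directly on a few training samples to confirm the output shape:
# forward pass on the first five samples; the output should have shape (5, 1)
with torch.no_grad():
    sample_pred = model_sequential(train_features[:5])
print(sample_pred.shape)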
Another way to build a neural network is to subclass nn.Module; here we call the subclass Net. The __init__() method is the constructor of the Net class and is used to initialize the parameters of each layer of the network. forward() is the other method we must implement: it defines the forward-propagation process of the network.
class Net(nn.Module):
    def __init__(self, features):
        super(Net, self).__init__()
        self.linear_relu1 = nn.Linear(features, 128)
        self.linear_relu2 = nn.Linear(128, 256)
        self.linear_relu3 = nn.Linear(256, 256)
        self.linear_relu4 = nn.Linear(256, 256)
        self.linear5 = nn.Linear(256, 1)

    def forward(self, x):
        y_pred = self.linear_relu1(x)
        y_pred = nn.functional.relu(y_pred)
        y_pred = self.linear_relu2(y_pred)
        y_pred = nn.functional.relu(y_pred)
        y_pred = self.linear_relu3(y_pred)
        y_pred = nn.functional.relu(y_pred)
        y_pred = self.linear_relu4(y_pred)
        y_pred = nn.functional.relu(y_pred)
        y_pred = self.linear5(y_pred)
        return y_pred
Having defined the Net class, we initialize an instance of it called model, which represents a concrete model. We then define the loss function; here MSELoss is used, which measures the loss with the mean square error. To train the model, the Adam algorithm is used; Adam is an optimization algorithm that is more efficient than SGD in many scenarios.
model = Net(features=train_features.shape[1])
# Use mean squared error as loss function
criterion = nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
Training the Model
Next, we use the Adam algorithm to run multiple rounds of iterations that update the parameters of model. Here the model is trained for 500 iterations.
losses = []
# 500 rounds of training
for t in range(500):
    y_pred = model(train_features)
    loss = criterion(y_pred, train_labels)
    # print(t, loss.item())
    losses.append(loss.item())
    if torch.isnan(loss):
        break
    # Zero out the gradient of each parameter in the model.
    # PyTorch's backward() adds the newly computed gradients to the gradients
    # already stored, so they must be cleared before back-propagation.
    optimizer.zero_grad()
    # back-propagation: compute the gradient of the loss with respect to each parameter
    loss.backward()
    # update the model parameters based on the gradients just back-propagated
    optimizer.step()
Each iteration uses all samples in the training dataset, train_features. Calling model(train_features) actually executes model.forward(train_features), the forward-propagation logic defined in the forward() method: the input data is propagated forward through the network to produce the prediction y_pred. criterion(y_pred, train_labels) then computes the loss between the prediction y_pred and the target values train_labels.
At each iteration, we first zero out the gradients of every parameter in the model with optimizer.zero_grad(). By default, backward() in PyTorch adds the newly computed gradients to the gradients already stored, so the gradients must be cleared before back-propagation. We then call loss.backward() to run the back-propagation process, and PyTorch computes the gradient of the loss with respect to each parameter. Finally, optimizer.step() updates the model parameters based on the gradients just back-propagated.
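A small standalone sketch (with toy tensors, not from the housing example) shows why this zeroing is necessary: calling backward() twice without clearing accumulates the gradients instead of replacing them.
# gradients accumulate across backward() calls unless they are cleared
w = torch.tensor(2.0, requires_grad=True)
(w * 3).backward()
print(w.grad)      # tensor(3.)
(w * 3).backward()
print(w.grad)      # tensor(6.) -- accumulated, not replaced
w.grad.zero_()     # what optimizer.zero_grad() does for every model parameter
print(w.grad)      # tensor(0.)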
So far, a simple model for predicting house prices has been trained.
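The losses list collected during the loop can be used to check that training actually converged; a minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt
# plot the training loss recorded over the 500 iterations
plt.plot(losses)
plt.xlabel('iteration')
plt.ylabel('MSE loss')
plt.show()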
Testing the Model
We can use the model to make predictions on the test data set, save the obtained predictions as a file, and submit it to Kaggle.
predictions = model(test_features).detach().numpy()
my_submission = pd.DataFrame({'Id':pd.read_csv('./dataset/test.csv').Id,'SalePrice': predictions[:, 0]})
my_submission.to_csv('{}.csv'.format('./dataset/submission'), index=False)