Malicious Webpage Classifier using DNN [PyTorch]
Sumit Mishra
MSc. Data Science at University of Sheffield | Former Associate Data Scientist at Metyis | Kaggle Expert (3x)
Malicious webpages are pages that install malware on your system, disrupting its operation, harvesting your personal information and worse. Classifying these pages on the internet is an important step towards giving users a safe browsing experience.
The objective is to classify webpages into two categories: Malicious [Bad] and Benign [Good]. To do this, the dataset is first preprocessed and then a DNN model, implemented in PyTorch, is trained on it.
The dataset used in this project is taken from Mendeley Data. It contains features such as the raw webpage content, geographical location, JavaScript length and obfuscated JavaScript code of the webpage. The dataset contains around 1.5 million webpages, of which 1.2 million are for training and 300k for testing. A snippet of the dataset is given below.
The dataset is highly skewed: 97.73% of the webpages are Benign and only 2.27% are Malicious. Choosing the evaluation metrics carefully is therefore important; accuracy alone won't give a correct picture, so we'll use the f1_score, recall and the confusion matrix.
First, we'll import all the required libraries.
# Importing the Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, StandardScaler
After importing all the required libraries, we'll load the dataset, preprocess it and add some features. [The extensive exploratory data analysis was done before the preprocessing and feature engineering; it's given in the complete notebook on Kaggle.] The features we'll be adding are the type of network, the number of special characters in the raw content and the length of the content.
# Importing the dataset
df_train = pd.read_csv("../input/Webpages_Classification_train_data.csv")
df_test = pd.read_csv("../input/Webpages_Classification_test_data.csv")

# Dropping the redundant column
df_train.drop(columns = "Unnamed: 0", inplace = True)
df_test.drop(columns = "Unnamed: 0", inplace = True)
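Before any feature engineering, it's worth verifying the class imbalance quoted above. A quick check (this snippet isn't in the original notebook; it assumes the label column holds the strings 'good' and 'bad', as in this dataset):

# Roughly 97.73% of the pages should come out as 'good' and 2.27% as 'bad'
print(df_train.shape, df_test.shape)
print(df_train['label'].value_counts(normalize = True) * 100)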
Now we'll write the functions that add the features mentioned above. They are contained in a class named 'preproc': the first function counts the special characters in the raw content, and the second assigns the type of network ('A', 'B' or 'C') from the IP address; for more info visit Network Classes.
class preproc:
    # Counting the Special Characters in the content
    @staticmethod
    def count_special(string):
        count = 0
        for char in string:
            if not char.islower() and not char.isupper() and not char.isdigit():
                if char != ' ':
                    count += 1
        return count

    # Identifying the type of network [A, B, C] from the first octet
    @staticmethod
    def network_type(ip):
        ip_str = ip.split(".")
        ip = [int(x) for x in ip_str]
        if ip[0] >= 0 and ip[0] <= 127:
            return (ip_str[0], "A")
        elif ip[0] >= 128 and ip[0] <= 191:
            return (".".join(ip_str[0:2]), "B")
        else:
            return (".".join(ip_str[0:3]), "C")
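As a quick sanity check (these example calls are illustrative, not part of the original notebook), the first octet drives the network class:

print(preproc.network_type("10.0.0.1"))      # ('10', 'A') -- first octet in 0-127
print(preproc.network_type("192.168.0.1"))   # ('192.168.0', 'C') -- first octet above 191
print(preproc.count_special("<html> 50%!"))  # 4 -> the characters <, >, % and !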
Now we'll use these functions to generate the features, and also take the length of the raw content.
# Adding Feature that shows the Network type
df_train['Network'] = df_train['ip_add'].apply(lambda x: preproc.network_type(x))

# Getting the Network type as two separate columns
df_train['net_part'], df_train['net_type'] = zip(*df_train.Network)
df_train.drop(columns = ['Network'], inplace = True)

# Adding Feature that shows the Number of Special Characters in the Content
df_train['special_char'] = df_train['content'].apply(lambda x: preproc.count_special(x))

# Length of the Content
df_train['content_len'] = df_train['content'].apply(lambda x: len(x))
With the features added to the training dataset (before label encoding and normalization), the dataset looks like this:
Next we'll preprocess the dataset, dropping the columns that won't be needed: 'url', 'ip_add' and 'content'. We'll use scikit-learn's LabelEncoder for label encoding and StandardScaler for normalization. Here, le_dict and ss_dict hold the fitted LabelEncoder and StandardScaler instances (keyed by feature name, with ls holding the names of the categorical, string-typed columns) so that the same transformations can be reused on the testing dataset.
# Dropping the columns that won't be used for modelling
df_train.drop(columns = ['url', 'ip_add', 'content'], inplace = True)

# The categorical columns to label encode: every string-typed column except the label
ls = [col for col in df_train.columns if df_train[col].dtype == 'object' and col != 'label']

# le_dict saves each LabelEncoder instance so that the same encoder can be used for the test dataset
le_dict = {}
for feature in ls:
    le = LabelEncoder()
    le_dict[feature] = le
    df_train[feature] = le.fit_transform(df_train[feature])

# Encoding the Labels
df_train.label.replace({'bad' : 1, 'good' : 0}, inplace = True)

# Normalizing the 'content_len' and 'special_char' in training data
ss_dict = {}
for feature in ['content_len', 'special_char']:
    ss = StandardScaler()
    ss_fit = ss.fit(df_train[feature].values.reshape(-1, 1))
    ss_dict[feature] = ss_fit
    d = ss_fit.transform(df_train[feature].values.reshape(-1, 1))
    df_train[feature] = pd.DataFrame(d, index = df_train.index, columns = [feature])
The training data after preprocessing:
We’ll plot a feature correlation heatmap to see the correlation of the features with the label.
# Pearson Correlation Heatmap
plt.rcParams['figure.figsize'] = [18, 16]
sns.set(font_scale = 1)
sns.heatmap(df_train.corr(method = 'pearson'), annot = True, cmap = "YlGnBu");
There are some interesting things to notice: 'content_len', 'special_char', 'js_obf_len' and 'js_len' have a very high positive correlation with the label. There are more interesting differences between the malicious and benign webpages in the full EDA. (Please check the complete Kaggle notebook.)
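To read those correlations off as numbers instead of scanning the heatmap, a short companion snippet (same df_train as above):

# Pearson correlation of every feature with the label, strongest first
print(df_train.corr(method = 'pearson')['label'].sort_values(ascending = False))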
Continuing with the preprocessing, we apply the same functions and repeat the same process for the testing data. [I'll only show how to reuse le_dict and ss_dict for label encoding and normalization, as the feature-engineering steps are identical.]
# Reusing the label encoders fitted on the training dataset
# (transform, not fit_transform, so the test set gets the training encodings)
for feature in ls:
    le = le_dict[feature]
    df_test[feature] = le.transform(df_test[feature])

df_test.label.replace({'bad' : 1, 'good' : 0}, inplace = True)

# Normalizing the 'content_len' and 'special_char' in testing data
ss_fit = ss_dict['content_len']
d = ss_fit.transform(df_test['content_len'].values.reshape(-1, 1))
df_test['content_len'] = pd.DataFrame(d, index = df_test.index, columns = ['content_len'])

ss_fit = ss_dict['special_char']
d = ss_fit.transform(df_test['special_char'].values.reshape(-1, 1))
df_test['special_char'] = pd.DataFrame(d, index = df_test.index, columns = ['special_char'])
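As an optional sanity check (not in the original notebook), we can confirm that the train and test frames now line up column-for-column; note that le.transform above would already have raised a ValueError if the test set contained a category never seen during training.

# The model consumes the columns positionally, so train and test must match exactly
assert list(df_train.columns) == list(df_test.columns), "Train/test column mismatch"
print(df_test.dtypes)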
The testing data after preprocessing:
With all the preprocessing done, we'll move on to the modelling in PyTorch, and for that we first have to make a custom dataset and data loaders.
We'll set up a configuration class, which holds the batch size, device (CPU or GPU), learning rate and number of epochs.
# Configuration Class
class config:
    BATCH_SIZE = 128
    DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    LEARNING_RATE = 2e-5
    EPOCHS = 20
Now we'll make the custom dataset and data loaders using Dataset and DataLoader from PyTorch. For testing, we'll set the batch size to 1. df_train_loader and df_test_loader are the final DataLoaders we'll use for training and testing.
# Making the custom dataset for pytorch
class MaliciousBenignData(Dataset):
    def __init__(self, df):
        self.df = df
        self.input = self.df.drop(columns = ['label']).values
        self.target = self.df.label

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return (torch.tensor(self.input[idx]), torch.tensor(self.target[idx]))

# Creating the dataloader for pytorch
def create_dataloader(df, batch_size):
    cls = MaliciousBenignData(df)
    return DataLoader(
        cls,
        batch_size = batch_size,
        num_workers = 0
    )

df_train_loader = create_dataloader(df_train, batch_size = config.BATCH_SIZE)
df_test_loader = create_dataloader(df_test, batch_size = 1)  # Batch size of 1 for testing
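As a quick smoke test (not part of the original notebook), we can pull a single batch and confirm the shapes are what the model will expect:

# One training batch: inputs of shape [BATCH_SIZE, n_features], targets of shape [BATCH_SIZE]
X_batch, y_batch = next(iter(df_train_loader))
print(X_batch.shape, y_batch.shape)  # e.g. torch.Size([128, 10]) and torch.Size([128])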
The custom data loaders are done! Time to make the model; we'll keep the DNN simple for now.
# Making the DNN model
class dnn(nn.Module):
    def __init__(self):
        super(dnn, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 128)
        self.fc3 = nn.Linear(128, 128)
        self.out = nn.Linear(128, 1)

        self.dropout1 = nn.Dropout(p = 0.2)
        self.dropout2 = nn.Dropout(p = 0.3)

        self.batchn1 = nn.BatchNorm1d(num_features = 64)
        self.batchn2 = nn.BatchNorm1d(num_features = 128)

    def forward(self, inputs):
        t = self.fc1(inputs)
        t = F.relu(t)
        t = self.batchn1(t)
        t = self.dropout1(t)

        t = self.fc2(t)
        t = F.relu(t)
        t = self.batchn2(t)
        t = self.dropout2(t)

        t = self.fc3(t)
        t = F.relu(t)

        t = self.out(t)
        return t
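Before wiring in the loss and optimizer, a small smoke test (illustrative only) confirms the layer dimensions line up; the batch size must be greater than 1 here because BatchNorm1d needs more than one sample in training mode.

# A random batch of 4 samples with 10 features each should yield logits of shape [4, 1]
smoke_model = dnn()
print(smoke_model(torch.randn(4, 10)).shape)  # torch.Size([4, 1])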
Now we'll transfer the model to the appropriate device (given in the config class). The criterion and optimizer for the model will be BCEWithLogitsLoss and Adam respectively.
# Transfer the model to the device -- GPU if available, otherwise the default CPU
model = dnn()
model.to(config.DEVICE)

# Criterion and the Optimizer for the model
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr = config.LEARNING_RATE)
We'll write a very basic binary accuracy function, which calculates the binary accuracy for every epoch. Here, PyTorch's sigmoid and round functions are used to turn the raw logits into predictions.
# Binary accuracy for a batch of raw logits
def binary_acc(predictions, y_test):
    y_pred = torch.round(torch.sigmoid(predictions))
    correct = (y_pred == y_test).sum().float()
    acc = torch.round((correct / y_test.shape[0]) * 100)
    return acc
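A tiny worked example (illustrative values): logits of 2.0 and -1.0 pass through the sigmoid to roughly 0.88 and 0.27, which round to classes 1 and 0.

logits = torch.tensor([[2.0], [-1.0]])  # Raw model outputs, no sigmoid applied yet
labels = torch.tensor([[1.0], [0.0]])
print(binary_acc(logits, labels))       # tensor(100.) -- both predictions are correct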
Finally, we'll write the training function and the evaluation function (very common in PyTorch).
The training function accepts the model itself, 'device', 'data_loader', 'optimizer' and 'criterion' as parameters. We'll compute the epoch loss using BCEWithLogitsLoss and the accuracy using the binary accuracy function we wrote.
# Training function
def train_model(model, device, data_loader, optimizer, criterion):
    # Putting the model in training mode
    model.train()
    for epoch in range(1, config.EPOCHS + 1):
        epoch_loss = 0
        epoch_acc = 0
        for X, y in data_loader:
            X = X.to(device)
            y = y.unsqueeze(1).float().to(device)  # Shape [batch, 1] to match the model output

            # Zeroing the gradient
            optimizer.zero_grad()

            predictions = model(X.float())
            loss = criterion(predictions, y)
            acc = binary_acc(predictions, y)

            loss.backward()   # Calculate Gradient
            optimizer.step()  # Updating Weights

            epoch_loss += loss.item()
            epoch_acc += acc.item()

        print(f"Epoch -- {epoch} | Loss : {epoch_loss/len(data_loader): .5f} | Accuracy : {epoch_acc/len(data_loader): .5f}")
The evaluation function is very similar, but we put the model in evaluation mode first and use torch.no_grad() to reduce memory consumption. It returns 'y_test_al' and 'y_pred', which are the true labels and the predicted values respectively.
# Evaluation Function
def eval_model(model, device, data_loader):
    # Putting the model in evaluation mode
    model.eval()
    y_pred = []
    y_test_al = []

    # torch.no_grad() disables gradient tracking to save memory during inference
    with torch.no_grad():
        for X_test, y_test in data_loader:
            X_test = X_test.to(device)
            predictions = model(X_test.float())
            pred = torch.round(torch.sigmoid(predictions))
            y_test_al.append(y_test.tolist())
            y_pred.append(pred.tolist())

    # Changing the predictions into flat lists
    y_test_al = [ele[0] for ele in y_test_al]
    y_pred = [int(ele[0][0]) for ele in y_pred]  # Each prediction arrives as [[0.]] or [[1.]]

    return (y_test_al, y_pred)
Now everything is in place, and all that's left is to train the model, evaluate it and check its performance. For training, we just call the training function.
# Training the Model
train_model(model, config.DEVICE, df_train_loader, optimizer, criterion)
For evaluation, we call the evaluation function. We get the true labels and predictions back as lists, so the metrics from sklearn can be used directly.
# Evaluating the model and getting the predictions
y_test, preds = eval_model(model, config.DEVICE, df_test_loader)
We’ll use the classification_report from sklearn.metrics and plot the heatmap of the confusion matrix using seaborn.
# Classification Report
cls_report = metrics.classification_report(y_test, preds)

print("")
print(f"Accuracy : {metrics.accuracy_score(y_test, preds)*100 : .3f} %")
print("")
print("Classification Report : ")
print(cls_report)

# Setting the params for the plot
plt.rcParams['figure.figsize'] = [10, 7]
sns.set(font_scale = 1.2)

# Confusion Matrix
cm = metrics.confusion_matrix(y_test, preds)

# Plotting the Confusion Matrix
ax = sns.heatmap(cm, annot = True, cmap = 'YlGnBu')
ax.set(title = "Confusion Matrix", xlabel = 'Predicted Labels', ylabel = 'True Labels');
Here we can disregard the accuracy, since on such a skewed dataset it's biased, but the f1_score and recall are very good, and according to the confusion matrix our model classifies the webpages quite well.
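Since the evaluation leans on f1_score and recall, they can also be printed individually with the same sklearn metrics module imported earlier:

# Recall: the fraction of truly malicious pages we catch; F1: the precision-recall balance
print(f"F1 Score : {metrics.f1_score(y_test, preds) : .4f}")
print(f"Recall   : {metrics.recall_score(y_test, preds) : .4f}")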
The most interesting thing here is that 'special_char', 'content_len', 'js_len' and 'js_obf_len' have a very high positive correlation with the labels, making them the most important features driving the model's performance. There are many interesting differences between malicious and benign webpages that explain why the model performs this well. (I'd highly recommend checking out the EDA in the Kaggle notebook.)
I've also trained 3 more machine learning models on the dataset for comparison and deployed them using Flask and PyWebIO. The Kaggle notebook only covers the DNN model; the whole code and deployment with all the models are on my GitHub.
P.S. I’m still learning, so the feedback would be greatly appreciated :)
Links:
[1] Github — https://github.com/SumitM0432/Malicious-Webpage-Classifier
[2] Kaggle Notebook — https://www.kaggle.com/sumitm004/malicious-webpage-classifier-using-dnn-pytorch
[3] Dataset — https://data.mendeley.com/datasets/gdx3pkwp47/2