Malicious Webpage Classifier using DNN [PyTorch]
Sumit Mishra
MSc. Data Science at University of Sheffield | Former Associate Data Scientist at Metyis | Kaggle Expert (3x)
Malicious webpages are pages that install malware on your system, disrupting its operation, harvesting your personal information and worse. Classifying these pages on the internet is an important step towards giving users a safe browsing experience.
The objective is to classify webpages into two categories: Malicious [Bad] and Benign [Good]. To do this, the dataset is first preprocessed and then a DNN model, implemented in PyTorch, is trained on it.
The dataset used in this project is taken from Mendeley Data. It contains features such as the raw webpage content, geographical location, JavaScript length and obfuscated JavaScript code of the webpage. The dataset contains around 1.5 million webpages, of which 1.2 million are for training and 300k for testing. A snippet of the dataset is given below.
The dataset is highly skewed: 97.73% of the webpages are Benign and only 2.27% are Malicious. Choosing the evaluation metrics carefully is therefore important; accuracy alone won't give a correct picture, so we'll use the f1_score, recall and the confusion matrix.
First, we'll import all the required libraries.
# Importing the Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from sklearn import metrics
from sklearn.preprocessing import LabelEncoder, StandardScaler
After importing all the required libraries, we'll load the dataset, preprocess it and add some features. [The extensive exploratory data analysis was done before the preprocessing and feature engineering; it's given in the complete notebook on Kaggle.] The features we'll be adding are the type of network, the number of special characters in the raw content and the length of the content.
# Importing the dataset
df_train = pd.read_csv("../input/Webpages_Classification_train_data.csv")
df_test = pd.read_csv("../input/Webpages_Classification_test_data.csv")

# Dropping the redundant column
df_train.drop(columns = "Unnamed: 0", inplace = True)
df_test.drop(columns = "Unnamed: 0", inplace = True)
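Before any feature engineering, it's worth verifying the class imbalance quoted above. A quick check (this snippet isn't in the original notebook; it assumes the label column holds the strings 'good' and 'bad', as in this dataset):

# Roughly 97.73% of the pages should come out as 'good' and 2.27% as 'bad'
print(df_train.shape, df_test.shape)
print(df_train['label'].value_counts(normalize = True) * 100)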
Now we'll write the functions that add the features mentioned above. They are contained in a class named 'preproc': the first function counts the special characters in the raw content, and the second assigns the type of network ('A', 'B' or 'C') from the IP address; for more info visit Network Classes.
class preproc:
    # Counting the Special Characters in the content
    @staticmethod
    def count_special(string):
        count = 0
        for char in string:
            if not char.islower() and not char.isupper() and not char.isdigit():
                if char != ' ':
                    count += 1
        return count

    # Identifying the type of network [A, B, C] from the first octet
    @staticmethod
    def network_type(ip):
        ip_str = ip.split(".")
        ip = [int(x) for x in ip_str]
        if ip[0] >= 0 and ip[0] <= 127:
            return (ip_str[0], "A")
        elif ip[0] >= 128 and ip[0] <= 191:
            return (".".join(ip_str[0:2]), "B")
        else:
            return (".".join(ip_str[0:3]), "C")
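As a quick sanity check (these example calls are illustrative, not part of the original notebook), the first octet drives the network class:

print(preproc.network_type("10.0.0.1"))      # ('10', 'A') -- first octet in 0-127
print(preproc.network_type("192.168.0.1"))   # ('192.168.0', 'C') -- first octet above 191
print(preproc.count_special("<html> 50%!"))  # 4 -> the characters <, >, % and !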
Now we'll use these functions to generate the features, and also take the length of the raw content.
# Adding Feature that shows the Network type
df_train['Network'] = df_train['ip_add'].apply(lambda x: preproc.network_type(x))

# Getting the Network type as two separate columns
df_train['net_part'], df_train['net_type'] = zip(*df_train.Network)
df_train.drop(columns = ['Network'], inplace = True)

# Adding Feature that shows the Number of Special Characters in the Content
df_train['special_char'] = df_train['content'].apply(lambda x: preproc.count_special(x))

# Length of the Content
df_train['content_len'] = df_train['content'].apply(lambda x: len(x))
With the features added to the training dataset (before label encoding and normalization), the dataset looks like this:
Next we'll preprocess the dataset, dropping the columns that won't be needed: 'url', 'ip_add' and 'content'. We'll use scikit-learn's LabelEncoder for label encoding and StandardScaler for normalization. Here, le_dict and ss_dict hold the fitted LabelEncoder and StandardScaler instances (keyed by feature name, with ls holding the names of the categorical, string-typed columns) so that the same transformations can be reused on the testing dataset.
# Dropping the columns that won't be used for modelling
df_train.drop(columns = ['url', 'ip_add', 'content'], inplace = True)

# The categorical columns to label encode: every string-typed column except the label
ls = [col for col in df_train.columns if df_train[col].dtype == 'object' and col != 'label']

# le_dict saves each LabelEncoder instance so that the same encoder can be used for the test dataset
le_dict = {}
for feature in ls:
    le = LabelEncoder()
    le_dict[feature] = le
    df_train[feature] = le.fit_transform(df_train[feature])

# Encoding the Labels
df_train.label.replace({'bad' : 1, 'good' : 0}, inplace = True)

# Normalizing the 'content_len' and 'special_char' in training data
ss_dict = {}
for feature in ['content_len', 'special_char']:
    ss = StandardScaler()
    ss_fit = ss.fit(df_train[feature].values.reshape(-1, 1))
    ss_dict[feature] = ss_fit
    d = ss_fit.transform(df_train[feature].values.reshape(-1, 1))
    df_train[feature] = pd.DataFrame(d, index = df_train.index, columns = [feature])
The training data after preprocessing:
We’ll plot a feature correlation heatmap to see the correlation of the features with the label.
# Pearson Correlation Heatmap
plt.rcParams['figure.figsize'] = [18, 16]
sns.set(font_scale = 1)
sns.heatmap(df_train.corr(method = 'pearson'), annot = True, cmap = "YlGnBu");
There are some interesting things to notice: 'content_len', 'special_char', 'js_obf_len' and 'js_len' have a very high positive correlation with the label. There are more interesting differences between the malicious and benign webpages in the full EDA. (Please check the complete Kaggle notebook.)
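To read those correlations off as numbers instead of scanning the heatmap, a short companion snippet (same df_train as above):

# Pearson correlation of every feature with the label, strongest first
print(df_train.corr(method = 'pearson')['label'].sort_values(ascending = False))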
Continuing with the preprocessing, we apply the same functions and repeat the same process for the testing data. [I'll only show how to reuse le_dict and ss_dict for label encoding and normalization, as the feature-engineering steps are identical.]
# Reusing the label encoders fitted on the training dataset
# (transform, not fit_transform, so the test set gets the training encodings)
for feature in ls:
    le = le_dict[feature]
    df_test[feature] = le.transform(df_test[feature])

df_test.label.replace({'bad' : 1, 'good' : 0}, inplace = True)

# Normalizing the 'content_len' and 'special_char' in testing data
ss_fit = ss_dict['content_len']
d = ss_fit.transform(df_test['content_len'].values.reshape(-1, 1))
df_test['content_len'] = pd.DataFrame(d, index = df_test.index, columns = ['content_len'])

ss_fit = ss_dict['special_char']
d = ss_fit.transform(df_test['special_char'].values.reshape(-1, 1))
df_test['special_char'] = pd.DataFrame(d, index = df_test.index, columns = ['special_char'])
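As an optional sanity check (not in the original notebook), we can confirm that the train and test frames now line up column-for-column; note that le.transform above would already have raised a ValueError if the test set contained a category never seen during training.

# The model consumes the columns positionally, so train and test must match exactly
assert list(df_train.columns) == list(df_test.columns), "Train/test column mismatch"
print(df_test.dtypes)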
The testing data after preprocessing:
With all the preprocessing done, we'll move on to the modelling in PyTorch, and for that we first have to make a custom dataset and data loaders.
We'll set up a configuration class, which holds the batch size, device (CPU or GPU), learning rate and number of epochs.
# Configuration Class
class config:
    BATCH_SIZE = 128
    DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    LEARNING_RATE = 2e-5
    EPOCHS = 20
Now we'll make the custom dataset and data loaders using Dataset and DataLoader from PyTorch. For testing, we'll set the batch size to 1. df_train_loader and df_test_loader are the final DataLoaders we'll use for training and testing.
# Making the custom dataset for pytorch
class MaliciousBenignData(Dataset):
    def __init__(self, df):
        self.df = df
        self.input = self.df.drop(columns = ['label']).values
        self.target = self.df.label

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return (torch.tensor(self.input[idx]), torch.tensor(self.target[idx]))

# Creating the dataloader for pytorch
def create_dataloader(df, batch_size):
    cls = MaliciousBenignData(df)
    return DataLoader(
        cls,
        batch_size = batch_size,
        num_workers = 0
    )

df_train_loader = create_dataloader(df_train, batch_size = config.BATCH_SIZE)
df_test_loader = create_dataloader(df_test, batch_size = 1)  # Batch size of 1 for testing
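As a quick smoke test (not part of the original notebook), we can pull a single batch and confirm the shapes are what the model will expect:

# One training batch: inputs of shape [BATCH_SIZE, n_features], targets of shape [BATCH_SIZE]
X_batch, y_batch = next(iter(df_train_loader))
print(X_batch.shape, y_batch.shape)  # e.g. torch.Size([128, 10]) and torch.Size([128])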
The custom data loaders are done! Time to make the model; we'll keep the DNN simple for now.
# Making the DNN model
class dnn(nn.Module):
    def __init__(self):
        super(dnn, self).__init__()
        self.fc1 = nn.Linear(10, 64)
        self.fc2 = nn.Linear(64, 128)
        self.fc3 = nn.Linear(128, 128)
        self.out = nn.Linear(128, 1)

        self.dropout1 = nn.Dropout(p = 0.2)
        self.dropout2 = nn.Dropout(p = 0.3)

        self.batchn1 = nn.BatchNorm1d(num_features = 64)
        self.batchn2 = nn.BatchNorm1d(num_features = 128)

    def forward(self, inputs):
        t = self.fc1(inputs)
        t = F.relu(t)
        t = self.batchn1(t)
        t = self.dropout1(t)

        t = self.fc2(t)
        t = F.relu(t)
        t = self.batchn2(t)
        t = self.dropout2(t)

        t = self.fc3(t)
        t = F.relu(t)

        t = self.out(t)
        return t
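Before wiring in the loss and optimizer, a small smoke test (illustrative only) confirms the layer dimensions line up; the batch size must be greater than 1 here because BatchNorm1d needs more than one sample in training mode.

# A random batch of 4 samples with 10 features each should yield logits of shape [4, 1]
smoke_model = dnn()
print(smoke_model(torch.randn(4, 10)).shape)  # torch.Size([4, 1])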
Now we'll transfer the model to the appropriate device (given in the config class). The criterion and optimizer for the model will be BCEWithLogitsLoss and Adam respectively.
# Transfer the model to the device -- GPU if available, otherwise the default CPU
model = dnn()
model.to(config.DEVICE)

# Criterion and the Optimizer for the model
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr = config.LEARNING_RATE)
We'll write a very basic binary accuracy function, which calculates the binary accuracy for every epoch. Here, PyTorch's sigmoid and round functions are used to turn the raw logits into predictions.
# Binary accuracy for a batch of raw logits
def binary_acc(predictions, y_test):
    y_pred = torch.round(torch.sigmoid(predictions))
    correct = (y_pred == y_test).sum().float()
    acc = torch.round((correct / y_test.shape[0]) * 100)
    return acc
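A tiny worked example (illustrative values): logits of 2.0 and -1.0 pass through the sigmoid to roughly 0.88 and 0.27, which round to classes 1 and 0.

logits = torch.tensor([[2.0], [-1.0]])  # Raw model outputs, no sigmoid applied yet
labels = torch.tensor([[1.0], [0.0]])
print(binary_acc(logits, labels))       # tensor(100.) -- both predictions are correct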
Finally, we'll write the training function and the evaluation function (very common in PyTorch).
The training function accepts the model itself, 'device', 'data_loader', 'optimizer' and 'criterion' as parameters. We'll compute the epoch loss using BCEWithLogitsLoss and the accuracy using the binary accuracy function we wrote.
# Training function
def train_model(model, device, data_loader, optimizer, criterion):
    # Putting the model in training mode
    model.train()
    for epoch in range(1, config.EPOCHS + 1):
        epoch_loss = 0
        epoch_acc = 0
        for X, y in data_loader:
            X = X.to(device)
            y = y.unsqueeze(1).float().to(device)  # Shape [batch, 1] to match the model output

            # Zeroing the gradient
            optimizer.zero_grad()

            predictions = model(X.float())
            loss = criterion(predictions, y)
            acc = binary_acc(predictions, y)

            loss.backward()   # Calculate Gradient
            optimizer.step()  # Updating Weights

            epoch_loss += loss.item()
            epoch_acc += acc.item()

        print(f"Epoch -- {epoch} | Loss : {epoch_loss/len(data_loader): .5f} | Accuracy : {epoch_acc/len(data_loader): .5f}")
The evaluation function is very similar, but we put the model in evaluation mode first and use torch.no_grad() to reduce memory consumption. It returns 'y_test_al' and 'y_pred', which are the true labels and the predicted values respectively.
# Evaluation Function
def eval_model(model, device, data_loader):
    # Putting the model in evaluation mode
    model.eval()
    y_pred = []
    y_test_al = []

    # torch.no_grad() disables gradient tracking to save memory during inference
    with torch.no_grad():
        for X_test, y_test in data_loader:
            X_test = X_test.to(device)
            predictions = model(X_test.float())
            pred = torch.round(torch.sigmoid(predictions))
            y_test_al.append(y_test.tolist())
            y_pred.append(pred.tolist())

    # Changing the predictions into flat lists
    y_test_al = [ele[0] for ele in y_test_al]
    y_pred = [int(ele[0][0]) for ele in y_pred]  # Each prediction arrives as [[0.]] or [[1.]]

    return (y_test_al, y_pred)
Now everything is in place, and all that's left is to train the model, evaluate it and check its performance. For training, we just call the training function.
# Training the Model
train_model(model, config.DEVICE, df_train_loader, optimizer, criterion)
For evaluation, we call the evaluation function. We get the true labels and predictions back as lists, so the metrics from sklearn can be used directly.
# Evaluating the model and getting the predictions
y_test, preds = eval_model(model, config.DEVICE, df_test_loader)
We’ll use the classification_report from sklearn.metrics and plot the heatmap of the confusion matrix using seaborn.
# Classification Report
cls_report = metrics.classification_report(y_test, preds)

print("")
print(f"Accuracy : {metrics.accuracy_score(y_test, preds)*100 : .3f} %")
print("")
print("Classification Report : ")
print(cls_report)

# Setting the params for the plot
plt.rcParams['figure.figsize'] = [10, 7]
sns.set(font_scale = 1.2)

# Confusion Matrix
cm = metrics.confusion_matrix(y_test, preds)

# Plotting the Confusion Matrix
ax = sns.heatmap(cm, annot = True, cmap = 'YlGnBu')
ax.set(title = "Confusion Matrix", xlabel = 'Predicted Labels', ylabel = 'True Labels');
Here we can disregard the accuracy, since on such a skewed dataset it's biased, but the f1_score and recall are very good, and according to the confusion matrix our model classifies the webpages quite well.
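Since the evaluation leans on f1_score and recall, they can also be printed individually with the same sklearn metrics module imported earlier:

# Recall: the fraction of truly malicious pages we catch; F1: the precision-recall balance
print(f"F1 Score : {metrics.f1_score(y_test, preds) : .4f}")
print(f"Recall   : {metrics.recall_score(y_test, preds) : .4f}")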
The most interesting thing here is that 'special_char', 'content_len', 'js_len' and 'js_obf_len' have a very high positive correlation with the labels, making them the most important features driving the model's performance. There are many interesting differences between malicious and benign webpages that explain why the model performs this well. (I'd highly recommend checking out the EDA in the Kaggle notebook.)
I've also trained 3 more machine learning models on the dataset for comparison and deployed them using Flask and PyWebIO. The Kaggle notebook only covers the DNN model; the whole code and deployment with all the models are on my GitHub.
P.S. I’m still learning, so the feedback would be greatly appreciated :)
Links:
[1] Github — https://github.com/SumitM0432/Malicious-Webpage-Classifier
[2] Kaggle Notebook — https://www.kaggle.com/sumitm004/malicious-webpage-classifier-using-dnn-pytorch
[3] Dataset — https://data.mendeley.com/datasets/gdx3pkwp47/2