Confusion Matrix and Cyber Crime
Ajeenkya S.
Jr. Soft Engg @Cognizant, EDI-Maps Developer, 2X OCI, 1xAWS Certified, 1X Aviatrix Certified, AT&T Summer Learning Academy Extern, LW summer Research Intern, ARTH Learner, 1X Gitlab Certified Associate, ARTH 2.0 LW_TV
A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It is used to measure the performance of a classification model by supporting the calculation of performance metrics like accuracy and precision. Confusion matrices are widely used because they give a better idea of a model’s performance than classification accuracy alone does.
A confusion matrix for a binary classifier covers the following four cases:
- True Negative: The model predicted No, and the actual value was also No.
- True Positive: The model predicted Yes, and the actual value was also Yes.
- False Negative: The model predicted No, but the actual value was Yes. It is also called a Type II error.
- False Positive: The model predicted Yes, but the actual value was No. It is also called a Type I error.
- The target variable has two values: Positive or Negative
- The columns represent the actual values of the target variable
- The rows represent the predicted values of the target variable
Hence, a confusion matrix is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. This allows for more detailed analysis than the mere proportion of correct classifications (accuracy). Accuracy will yield misleading results if the data set is unbalanced, that is, when the numbers of observations in the different classes vary greatly.
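As a quick illustration (using made-up labels rather than any dataset from this article), here is a minimal scikit-learn sketch of how these four counts are obtained and turned into accuracy and precision:

```python
# Minimal sketch: compute a 2x2 confusion matrix and two metrics from toy labels.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # ground truth (1 = Yes, 0 = No)
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # classifier output

# For labels [0, 1], scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")

print("Accuracy :", accuracy_score(y_actual, y_predicted))   # (TP + TN) / total
print("Precision:", precision_score(y_actual, y_predicted))  # TP / (TP + FP)
```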
Types of errors in Confusion Matrix:
Confusion matrices have two types of errors: Type I and Type II.
Type I Error - A False Positive is called a Type I error. It is often the most dangerous kind of error, because the model has given a wrong answer in the positive sense, i.e. it predicts True when the reality is False. For example, in the security domain, suppose security engineers rely on an ML model whose positive prediction means that activity is safe. No model is 100% accurate, so a false positive here means the model reports activity as safe when intruders have actually penetrated the organization's environment. If the model fails to notify the security engineers about, say, 20 or 30% of the intruders, no timely action is taken, which may result in a huge loss to the organization.
Type II Error - A False Negative is called a Type II error. This error is also dangerous, as it means that our model has given a wrong answer in the negative sense, i.e. it predicts False when the reality is True. For example, suppose a model predicts that 50 students have failed an exam, but only 40 of them actually failed; the remaining 10 students, who actually passed yet were predicted as failed, fall under this kind of error, the False Negatives.
Crime Cases related to Confusion Matrix:
Vancouver is one of the most populated and most ethnically diverse cities in Canada. Crime is one of the biggest problems in our society and its prevention is an important task. Even though Vancouver is known to be a relatively safe city, vehicle break-ins and many other thefts are still a problem. The dataset used is the Crime dataset of the city of Vancouver available on Kaggle. It covers crimes in Vancouver from 2003 to 2017 and consists of 530,652 records, with features like type, year, month, day, hour, location, latitude, longitude, and many more.
> Data Preprocessing:
Initially, the data is preprocessed by removing all null values and removing all columns that are unnecessary.
The proposed work is divided into 4 parts:
1. Data preprocessing: After data cleaning, we apply preprocessing techniques for numerical and categorical data, such as normalization and one-hot encoding, and then sample the data into training and testing sets (a minimal sketch of this pipeline follows the list below):
- The training dataset consists of 70% or 80% of the data.
- The testing dataset consists of the remaining 30% or 20% of the data.
2. Data Analysis: Exploratory analysis was done to understand the dataset and the problem, for example the number of incidents that have happened previously for each type of crime.
3. Data Modelling: In this part, various classification models are compared to understand which one works best for our crime prediction. The approach used is to encode the categorical variables and then use them for training, with crime type as the output (target). Since we are going to classify types of crime, we implement the following machine learning models:
1. K-Nearest Neighbor: This is one of the simplest models; its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.
2. Logistic Regression: Logistic regression is a regression model where the dependent variable is binary or categorical; it cannot handle a continuous target.
3. Decision Trees: A decision tree is a tree-shaped graph of decisions, their possible outcomes, and utilities that helps in making a decision.
4. TensorFlow classification: A fully connected neural network model with 4 Dense layers of 64, 32, 16, and 10 neurons, where the last layer is the output layer. The optimizer used is Adam and the activation function used is ReLU (a minimal Keras sketch appears after this list).
5. Bayesian Methods: The implementation is based on the Naive Bayes algorithm, which builds a probabilistic classifier from the vectors of feature values, assuming the features are independent of one another.
6. Random Forest Classifier: A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
4. Evaluation of performance: In this research, the dataset for the city of Vancouver covers the years 2003 to 2017. After applying the machine learning models, the classification accuracy obtained is approximately 50%, and different algorithms require different training times. The accuracy could be improved by including more features in the dataset, such as weather conditions and street lighting. After studying the dataset and the results, it is clear that more features are needed to distinguish different crimes from each other: since the neighbourhood by itself says little about the characteristics of a crime, features that specifically support the crime types are required in order to study crimes, including cyber crimes, with machine learning.
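To make the pipeline above concrete, here is a minimal scikit-learn sketch of the preprocessing, train/test split, and model-comparison steps. The CSV path and the column names (TYPE, YEAR, MONTH, DAY, HOUR, NEIGHBOURHOOD, Latitude, Longitude) are assumptions about the Kaggle Vancouver crime file and may need adjusting for the actual data; the TensorFlow model is sketched separately below.

```python
# Sketch of the crime-type classification pipeline (column names are assumptions).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("crime.csv").dropna()            # placeholder path; drop null values

numeric_cols = ["YEAR", "MONTH", "DAY", "HOUR", "Latitude", "Longitude"]
categorical_cols = ["NEIGHBOURHOOD"]

X = df[numeric_cols + categorical_cols]
y = df["TYPE"]                                    # crime type is the target

# Normalize numeric features, one-hot encode categorical ones (dense output).
preprocess = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols),
     ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    sparse_threshold=0.0,
)

# 70/30 train-test split (an 80/20 split works the same way).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

# Fit each model on the same preprocessed features and compare test accuracy.
for name, model in models.items():
    clf = Pipeline([("prep", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name:20s} accuracy = {acc:.3f}")
```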
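And here is a minimal Keras sketch of the fully connected network described in model 4 of the list above (Dense layers of 64, 32, 16, and 10 units, ReLU activations, Adam optimizer). The number of input features and the assumption of 10 output classes are placeholders, not values taken from the article:

```python
# Sketch of the fully connected classifier; n_features and n_classes are placeholders.
import tensorflow as tf

n_features = 30   # number of columns after one-hot encoding (assumed)
n_classes = 10    # last layer is the output layer, one unit per crime type (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),  # output layer
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # integer-encoded crime types
              metrics=["accuracy"])
model.summary()

# Training would then look like:
# model.fit(X_train, y_train_encoded, epochs=10, validation_split=0.1)
```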
Case studies on Confusion Matrix:
It is quite common to receive mails that are categorized under “Promotion”, “Social”, and “Primary”. Now consider the situation where one of our mails has been wrongly placed in spam even though it is not actually spam, while a few mails that were not flagged are actually spam. To get clarity on this kind of classification problem, the confusion matrix gives the final picture of where these doubts lie.
There are 4 situations where this classification problem arises:
- Actual True and predicted True is called a True Positive.
- Actual False and predicted False is called a True Negative.
- Actual False but predicted True is called a False Positive.
- Actual True but predicted False is called a False Negative.
Now that we are familiar with TP, TN, FP, and FN, let's continue with the same spam case study with the help of the table below.
Note:- “Actual” and “Predicted” are the two axes of the table; the classes themselves can be denoted as “0” and “1”.
From the above table, let's understand what these counts mean:
- The mails which are actually positive (spam) and predicted positive are 952 (True Positives),
- The mails which are actually negative and predicted positive are 167 (False Positives),
- The mails which are actually positive but predicted negative are 526 (False Negatives),
- The mails which are actually negative and predicted negative are 3025 (True Negatives).
Once these counts are totalled, the next step is to understand the important metrics derived from the confusion matrix: Recall (also called sensitivity or True Positive Rate, maximized to minimize false negatives), Specificity (True Negative Rate, maximized to minimize false positives), Accuracy, F1 score, and Precision (a worked calculation from these counts follows the notes below).
Note:-
- Among all the metrics, Recall is especially important here, since it tells us how many of the mails that are actually positive (spam) the model catches. So in this case, we can probably tolerate False Positives but not False Negatives.
- Precision should ideally be 1 (high) for a good classifier. Precision becomes 1 only when the numerator and denominator are equal, i.e. TP = TP + FP, which means there are no False Positives.
- Specificity is defined as the proportion of actual negative values that are correctly predicted negative, TN / (TN + FP); it is the counterpart of Recall for the negative class. It addresses the situation where mails that are actually negative (not spam) might be marked or identified as positive, and it tells us how well the model handles the negative values overall.
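Using the counts quoted in the table above (TP = 952, FP = 167, FN = 526, TN = 3025), the metrics work out as follows; the short snippet below just restates the standard formulas with those numbers:

```python
# Worked calculation of the metrics from the spam counts quoted above.
tp, fp, fn, tn = 952, 167, 526, 3025
total = tp + fp + fn + tn                                     # 4670

accuracy    = (tp + tn) / total                               # ≈ 0.852
precision   = tp / (tp + fp)                                  # ≈ 0.851
recall      = tp / (tp + fn)                                  # sensitivity / TPR ≈ 0.644
specificity = tn / (tn + fp)                                  # TNR ≈ 0.948
f1          = 2 * precision * recall / (precision + recall)   # ≈ 0.733

print(f"Accuracy    = {accuracy:.3f}")
print(f"Precision   = {precision:.3f}")
print(f"Recall      = {recall:.3f}")
print(f"Specificity = {specificity:.3f}")
print(f"F1 score    = {f1:.3f}")
```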
Conclusion:
In this article, I tried to give an overview of the confusion matrix with the help of real examples, along with a case study based on an ML use-case, and to show how a classification model can be effectively evaluated, especially in situations where looking at standalone accuracy is not enough. I hope the concepts of TP, TN, FP, FN, Precision, Recall, the confusion matrix, and Type I and Type II errors are now clear, with examples and explanations.
Hope this made the concepts we covered today clear…
Keep Learning, Keep Sharing :)
Cloud & DevOps | RightEducation | Mandsaur University
3 年Good going ??