Hello Learners! Today we will expand our knowledge of how a confusion matrix can be used in dealing with cybercrimes.
What is a confusion matrix? Most of us have this as our first question, right? So let's get to know the confusion matrix first.
When we get the data, after data cleaning, pre-processing, and wrangling, the first step is to feed it to a model and, of course, get output as probabilities. But how can we measure the effectiveness of our model? The better the effectiveness, the better the performance, and that is what we want. This is where the confusion matrix comes into the limelight. The confusion matrix is a performance measurement for machine learning classification.
There are multiple ways of finding errors in a machine learning model. The Mean Absolute Error (MAE) cost function helps train the model in the correct direction by trying to drive the distance between the actual and predicted values to zero. For a single prediction, the error is simply “y − ŷ”, the difference between the actual value y and the predicted value ŷ.
Mean Squared Error (MSE): the errors for the points in the data set are squared first, and then the mean of these squared errors is taken.
In binary classification models, errors are analyzed with the help of the confusion matrix.
The confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. For binary classification, it is a table with 4 different combinations of predicted and actual values.
It is extremely useful for measuring Recall, Precision, Specificity, Accuracy, and, most importantly, the AUC-ROC curve.
Four outcomes of the confusion matrix
The confusion matrix visualizes the accuracy of a classifier by comparing the actual and predicted classes. The binary confusion matrix is composed of four cells:
- TP: True Positive: positive values correctly predicted as positive
- FP: False Positive: negative values incorrectly predicted as positive. Also known as a Type 1 error
- FN: False Negative: positive values incorrectly predicted as negative. Also known as a Type 2 error
- TN: True Negative: negative values correctly predicted as negative
The accuracy of a model (through a confusion matrix) is calculated using the formula below.
Accuracy = (TN + TP) / (TN + FP + FN + TP)
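The formula can be sketched directly in Python. The counts below are illustrative values only (they match the cat/dog example later in this article), not from any real model:

```python
# Accuracy from the four confusion-matrix cells (illustrative counts).
TP, TN, FP, FN = 6, 11, 2, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.85
```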
Accuracy can be misleading when used with imbalanced datasets, so there are other metrics based on the confusion matrix that can be useful for evaluating performance. In Python, the confusion matrix can be obtained using the confusion_matrix() function, which is part of the sklearn library. It can be imported with “from sklearn.metrics import confusion_matrix”. To obtain the confusion matrix, users need to provide the actual values and predicted values to the function.
Understanding Confusion Matrix in a simpler manner:
Let’s take an example:
We have a total of 20 cats and dogs, and our model predicts whether each animal is a cat or not.
- Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
- Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’, ‘cat’, ‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’]
- True Positive (TP) = 6
You predicted positive and it's true. You predicted that the animal is a cat, and it actually is.
- True Negative (TN) = 11
You predicted negative and it's true. You predicted that the animal is not a cat, and it actually is not (it's a dog).
- False Positive (Type 1 Error) (FP) = 2
You predicted positive and it's false. You predicted that the animal is a cat, but it actually is not (it's a dog).
- False Negative (Type 2 Error) (FN) = 1
You predicted negative and it's false. You predicted that the animal is not a cat, but it actually is.
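The four counts above can be reproduced with scikit-learn's confusion_matrix function (assuming scikit-learn is installed); a minimal sketch:

```python
from sklearn.metrics import confusion_matrix

actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
          'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
             'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

# With labels=['cat', 'dog'], rows are actual classes and columns are
# predicted classes, so the cells read [[TP, FN], [FP, TN]] when 'cat'
# is the positive class.
cm = confusion_matrix(actual, predicted, labels=['cat', 'dog'])
tp, fn = cm[0]
fp, tn = cm[1]
print(tp, tn, fp, fn)  # 6 11 2 1
```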
- Positive Predictive Value (PPV): This is very close to precision. One significant difference between the two terms is that PPV takes prevalence into account. In a situation where the classes are perfectly balanced, the positive predictive value is the same as precision.
- Null Error Rate: This term defines how often your prediction would be wrong if you always predicted the majority class. You can consider it a baseline metric against which to compare your classifier.
- F-measure/F1-Score: It is difficult to compare two models when one has low precision and high recall, or vice versa. To make them comparable, we use the F-score, which measures recall and precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing extreme values more.
- ROC Curve: The ROC curve plots the true positive rate against the false positive rate at various cut-off points. It demonstrates the trade-off between sensitivity (recall) and specificity (the true negative rate).
- Precision: The precision metric shows the accuracy of the positive class. It measures how likely a prediction of the positive class is to be correct.
The maximum score is 1, reached when the classifier perfectly classifies all positive predictions. Precision alone is not very helpful because it ignores the negative class, so it is usually paired with the recall metric. Recall is also called sensitivity or the true positive rate.
- Sensitivity: Sensitivity computes the ratio of positive classes correctly detected. This metric shows how good the model is at recognizing the positive class.
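Using the cat/dog counts from the example above, precision, recall, and F1 can be computed in a few lines. This is an illustrative sketch with those same counts, not output from a trained model:

```python
# Precision, recall (sensitivity), and F1 from the cat/dog example counts.
TP, TN, FP, FN = 6, 11, 2, 1

precision = TP / (TP + FP)   # of everything predicted 'cat', how much was right
recall = TP / (TP + FN)      # of all actual cats, how many were found
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.75 0.857 0.8
```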
Is it necessary to check recall (or precision) if you already have high accuracy?
We cannot rely on a single accuracy value in classification when the classes are imbalanced. For example, suppose we have a dataset of 100 patients in which 5 have diabetes and 95 are healthy. A model that simply predicts the majority class, declaring all 100 people healthy, achieves a classification accuracy of 95% while failing to detect a single diabetic patient.
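The patient example can be simulated in a few lines to show the gap between accuracy and recall. The data here is synthetic, constructed purely to match the 95/5 split described above:

```python
# A majority-class predictor on an imbalanced dataset: 95 healthy, 5 diabetic.
actual = ['healthy'] * 95 + ['diabetic'] * 5
predicted = ['healthy'] * 100  # the model always predicts the majority class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall for the 'diabetic' class: how many diabetic patients were caught.
tp = sum(a == 'diabetic' and p == 'diabetic' for a, p in zip(actual, predicted))
recall = tp / 5

print(accuracy, recall)  # 0.95 0.0
```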
When to use Accuracy / Precision / Recall / F1-Score?
- Accuracy is used when the True Positives and True Negatives matter most. Accuracy is a better metric for balanced data.
- Whenever the cost of a False Positive is much higher, use Precision.
- Whenever the cost of a False Negative is much higher, use Recall.
- F1-Score is used when both False Negatives and False Positives matter. F1-Score is a better metric for imbalanced data.
Why do we need a confusion matrix?
Here are the pros/benefits of using a confusion matrix.
- It shows how a classification model gets confused when it makes predictions.
- The confusion matrix gives you insight not only into the errors being made by your classifier but also into the types of errors being made.
- This breakdown helps you overcome the limitation of using classification accuracy alone.
- Every column of the confusion matrix represents the instances of that predicted class.
- Each row of the confusion matrix represents the instances of the actual class.
Now let's understand how the confusion matrix helps in detecting and combating cybercrime.
Cyberattacks are becoming a critical issue for organizational information systems. A number of cyber-attack detection and classification methods have been introduced, with different levels of success, as countermeasures to preserve data integrity and system availability. The classification of attacks against computer networks is becoming an ever harder problem in the field of network security.
The rapid increase in connectivity and accessibility of computer systems has resulted in frequent opportunities for cyber attacks. Attacks on computer infrastructure are becoming an increasingly serious problem. Fundamentally, cyber-attack detection is a classification problem, in which we separate the normal patterns of the system from the abnormal patterns (attacks). The subset selection decision fusion (SDF) method plays a key role in cyber-attack detection, since it has been shown that redundant and/or irrelevant features may severely affect the accuracy of learning algorithms. SDF is a very powerful and popular data mining approach for decision-making and classification problems. It has been used in many real-life applications such as medical diagnosis, radar signal classification, weather prediction, credit approval, and fraud detection.
KDD Cup '99 Data Set Description
To check the performance of the proposed algorithm for distributed cyber-attack detection and classification, we can evaluate it practically using the KDD'99 intrusion detection datasets. In the KDD'99 dataset, the four attack classes (DoS, U2R, R2L, and Probe) are divided into 22 different attack types, which are tabulated in Table I. The 1999 KDD datasets are divided into two parts: the training dataset and the testing dataset. The testing dataset contains not only known attacks from the training data but also unknown attacks. Since 1999, KDD'99 has been the most widely used data set for the evaluation of anomaly detection methods. This data set was prepared by Stolfo et al. and is built from the data captured in the DARPA'98 IDS evaluation program. DARPA'98 consists of about 4 gigabytes of compressed raw (binary) TCP dump data from 7 weeks of network traffic, which can be processed into about 5 million connection records, each with about 100 bytes. For each TCP/IP connection, 41 quantitative (continuous) and qualitative (discrete) features were extracted: among the 41 features, 34 are numeric and 7 are symbolic. To analyze the results, standard metrics have been developed for evaluating network intrusion detection. Detection Rate (DR) and false alarm rate are the two most widely used. DR is computed as the ratio between the number of correctly detected attacks and the total number of attacks, while the false alarm (false positive) rate is computed as the ratio between the number of normal connections incorrectly classified as attacks and the total number of normal connections.
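The two metrics just defined can be sketched in a few lines. The counts below are hypothetical, invented for illustration, and are not results from the actual KDD'99 evaluation:

```python
# Detection Rate (DR) and false alarm rate, as defined above.
# All counts are assumed values for a hypothetical IDS run.
detected_attacks = 940   # attacks correctly flagged as attacks (TP)
total_attacks = 1000
false_alarms = 30        # normal connections misclassified as attacks (FP)
total_normal = 5000

detection_rate = detected_attacks / total_attacks   # 0.94
false_alarm_rate = false_alarms / total_normal      # 0.006

print(detection_rate, false_alarm_rate)
```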
In the KDD Cup '99, the criterion used for evaluating the participant entries is the Cost Per Test (CPT), computed using the confusion matrix and a given cost matrix. A confusion matrix (CM) is a square matrix in which each column corresponds to a predicted class, while each row corresponds to an actual class. An entry at row i and column j, CM(i, j), represents the number of instances that originally belong to class i but were incorrectly identified as members of class j. The entries on the main diagonal, CM(i, i), stand for the number of properly classified instances. The cost matrix is defined similarly, and entry C(i, j) represents the cost penalty for misclassifying an instance belonging to class i as class j.
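The CPT definition above amounts to an element-wise product of the confusion matrix and the cost matrix, divided by the number of test instances. Here is a minimal sketch with small two-class matrices; both matrices are hypothetical examples, not the actual KDD'99 ones:

```python
# Cost Per Test: CPT = sum_ij CM[i][j] * C[i][j] / N,
# where N is the total number of test instances.
CM = [[90, 10],   # rows: actual classes, columns: predicted classes
      [5, 95]]
C = [[0, 2],      # C[i][j]: cost of classifying actual class i as class j
     [1, 0]]

n = sum(sum(row) for row in CM)  # total instances (200 here)
cpt = sum(CM[i][j] * C[i][j] for i in range(2) for j in range(2)) / n
print(cpt)  # (10*2 + 5*1) / 200 = 0.125
```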
- True Positive (TP): the number of attack records correctly detected as attacks.
- True Negative (TN): the number of normal records correctly detected as normal.
- False Positive (FP): the number of normal records incorrectly detected as attacks (false alarms).
- False Negative (FN): the number of attack records incorrectly detected as normal.
In this confusion matrix, rows correspond to actual categories, while columns correspond to predicted categories.
The confusion matrix contains information on the actual and predicted classifications made by a classifier, and the performance of a cyber-attack detection system is commonly evaluated using the data in this matrix.