K-Means Clustering and Its Real Use Case in the Security Domain

In this post, we’re going to dive deep into one of the most influential unsupervised learning algorithms: k-means clustering. It is one of the simplest and most popular unsupervised machine learning algorithms, and we’ll discuss how the algorithm works, distance and accuracy metrics, and a lot more.

What is Clustering?

Clustering is the process of dividing data points into a number of groups, such that the points in one group are more similar to each other than to the points in other groups.


Applications of Clustering in Real-World Problems

Vector quantization

K-means originates from signal processing, where it is used for vector quantization. For example, color quantization is the task of reducing the color palette of an image to a fixed number of colors, k. The k-means algorithm can easily be used for this task.

Psychology and Medicine

An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.

Recommender Systems

Clustering can also be used in recommendation engines. In the case of recommending movies to someone, you can look at the movies a user has enjoyed and then use clustering to find similar movies.

Document Clustering

This is another common application of clustering. Let’s say you have multiple documents and you need to cluster similar documents together. Clustering helps us group these documents such that similar documents are in the same clusters.

Image Segmentation

Image segmentation is a widespread application of clustering. Similar pixels in the image are grouped together, so each cluster corresponds to a region of similar pixels.


Importance of Clustering in Ecommerce

The clustering task is an instance of unsupervised learning that automatically forms clusters of similar things. The key difference from classification is that in classification, we know what we are looking for. That is not the case in clustering. Clustering is sometimes called unsupervised classification because it produces the same result as classification but without having predefined classes.

We can cluster almost anything, and the more similar the items are in the cluster, the better our clusters are. This notion of similarity depends on a similarity measurement. Because we don’t have a target variable as we did in classification, we call this unsupervised learning. Instead of telling the machine “Predict Y for our data X,” we are asking “What can you tell me about X?”. One widely used clustering algorithm is k-means, where k is a user-specified number of clusters to create. The k-means clustering algorithm starts with k random cluster centers known as centroids.

Next, the algorithm computes the distance from every point to the cluster centers. Each point is assigned to the closest cluster center. The cluster centers are then recalculated based on the new points in the cluster. This process is repeated until the cluster centers no longer move. This simple algorithm is quite effective but is sensitive to the initial cluster placement.
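The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the toy points, and the choice to pick the initial centroids from the data itself are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k data points picked at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Changing `seed` changes the starting centroids, which is a simple way to observe the sensitivity to initial placement on less well-separated data.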

Use Cases

Ecommerce Use Case

Amazon could improve its recommendations by using clustering to segment customers and determine whether they are likely to buy something similar again. Let’s see how that could work in a hypothetical scenario:

  1. Clustering: Let’s say Amazon has a dataset of all the purchase orders for 500,000 customers in the past week. The dataset has many features that can be broadly categorized into customer profile (gender, age, ZIP code, occupation) and item profile (types, brands, description, color). By applying the k-means clustering algorithm to this dataset, we end up with 10 different clusters. At this point, we do not know what each of these clusters represents, so we arbitrarily call them Cluster 1, 2, 3, and so on.
  2. Classification: Okay, it’s time to do supervised learning. We now look at cluster 1 and use a Naive Bayes algorithm to predict the probabilities of the ZIP code and item type features for all the data points. It turns out that 95% of the data in cluster 1 consists of customers who live in New York and frequently buy high-heel shoes. Awesome, let’s now look at cluster 2 and use logistic regression to do binary classification on the gender and color features for all the data points. As a result, we find that the data in cluster 2 consists of male customers who are obsessed with any items that are black. If we keep doing this for all the remaining clusters, we will end up with a very detailed description of each of them.
  3. Recommendation: Finally, we can recommend items to the customer, knowing that they are highly relevant according to our prior segmentation analysis. We can simply use the k-Nearest Neighbor algorithm to find the items to recommend. For example, customers in cluster 1 are recommended a pair of Marc New York high heels, customers in cluster 2 are recommended a black razor from Dollar Shave Club, and so on.


Clustering Analysis for Malware Behavior Detection in Cyber Crime

Cyber-attacks have become the biggest threat to computer and network systems around the world. It is therefore important to have an intrusion detection system (IDS) that can detect and analyze data with high accuracy (i.e., true positives and true negatives) and a low rate of false detections (i.e., false positives and false negatives) in minimal detection time. A K-Means clustering detection model, drawing on data mining and clustering methods in particular, is a notable approach that can be explored to address this problem. The need for continuous improvement of IDS in terms of malware analysis accuracy, detection time, and a suitable detection approach is the motivation for this research.

Malware Detection

Malware interferes with the registry when it enters a computer: it tends to create and modify files and Windows registry entries, in addition to affecting interprocess communication and basic network interaction. Intrusion attacks such as malware breach an organization’s network security policy and continuously try to undermine the core principles of cybersecurity: Confidentiality, Integrity, and Availability, known as CIA. Previous cybersecurity researchers have therefore proposed detection-based frameworks for malware intrusion that monitor the behavior of system activity. The framework analyzes this behavior and notifies users if there is a sign of intrusion.

Detecting malware is crucial, since more than half of malware attacks exploit the computer registry; such attacks can be detected by using an Intrusion Detection System as an early defense (Ransomware attacks: detection, prevention and cure). An Intrusion Detection System (IDS) is one solution for detecting intrusions and protecting network and computer systems from cyber attacks.

Malware Detection by using K-Means Clustering

K-Means clustering is a method of cluster analysis that partitions the data into a chosen number 'k' of clusters, each represented by a center value (centroid) of its grouped objects. From a data mining perspective, the implemented K-Means clustering algorithm separates the time interval between the normal and abnormal data in the same training dataset. In the database context, clustering refers to the ability of many servers or instances to connect to one database; in IDS, by contrast, clustering is usually used within anomaly detection to explore groups of malware data without prior knowledge of the relationships in the data. A clustering method groups objects according to the characteristics of their data points, so that every data point in a cluster is similar to the others in the same cluster but different from those in other clusters. Clustering is one of the most popular concepts in unsupervised learning, and anomaly detection is generally an unsupervised task. The idea is that similar data points tend to belong to the same group or cluster, as determined by their distance from the local centroids.

The graph shows only two centroids, each marked with an 'X'. The number of 'X' marks depends on the number of clusters defined in the first step of the process. The resulting cluster centroids are then used for fast anomaly detection when monitoring new incoming data [44]. The K-Means clustering algorithm, shown in Fig. 4, is one of the simplest unsupervised learning algorithms and solves the clustering problem [8] by:

  1. Collecting a dataset of malware.
  2. Identifying the number of clusters (k).
  3. Initializing the k centroids (k-means) for the data.
  4. Determining the distance of each malware sample from each centroid, then assigning each sample to the cluster whose centroid is closest to it.
  5. Recomputing the centroid of each cluster.
  6. Repeating steps 4 and 5 until there is no change in the cluster centroids.
  7. If the formed clusters do not look reasonable, repeating steps 1-6 for a different number of clusters.
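The steps above, including the retry with a different number of clusters, can be sketched as follows. Everything concrete here is an assumption for illustration: the two features (counts of registry keys created and modified), the sample values, and the use of the within-cluster sum of squared distances as a stand-in for judging whether the clusters "look reasonable". The source does not say how that judgment is made.

```python
import math
import random


def run_kmeans(samples, k, seed=0, max_iter=100):
    """Steps 3-6: initialize k centroids, then assign and update until stable."""
    centroids = random.Random(seed).sample(samples, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: math.dist(s, centroids[j]))
            clusters[nearest].append(s)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def within_cluster_ss(centroids, clusters):
    """Sum of squared distances from each sample to its cluster centroid."""
    return sum(
        math.dist(s, c) ** 2 for c, members in zip(centroids, clusters) for s in members
    )


# Step 1: made-up feature vectors (registry keys created, registry keys modified).
samples = [(2, 1), (3, 1), (2, 2), (40, 35), (42, 33), (41, 36), (90, 5), (88, 7), (91, 6)]

# Steps 2 and 7: try several values of k and compare the resulting scores.
scores = {}
for k in range(1, 5):
    centroids, clusters = run_kmeans(samples, k)
    scores[k] = within_cluster_ss(centroids, clusters)
    print(k, round(scores[k], 2))
```

A common heuristic is to pick the k after which the score stops dropping sharply; the result also depends on the random starting centroids, so in practice each k is usually run with several seeds.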


In the K-Means clustering method, the whole dataset is partitioned into Voronoi cells: the observations are divided into 'k' groups in which every observation belongs to the cluster with the nearest mean. It thus creates 'k' clusters of similar data points, and data instances that fall outside these groups can potentially be marked as anomalies. K-Means is widely used and can be regarded as the most popular clustering algorithm among the geometric procedures (Survey on anomaly detection using data mining techniques) because of its computational simplicity, efficiency, and ease of implementation. Because the algorithm is straightforward, its computation time is lower than that of many other algorithms, so the time spent on malware clustering can be minimized.
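One way to make "falling outside the groups" concrete is to measure how far a new data point lies from its nearest centroid and flag it when that distance exceeds a cutoff. The sketch below assumes this distance-threshold rule; the centroid values, the two-feature space, and the threshold of 10.0 are all hypothetical and not taken from the research described here.

```python
import math

# Hypothetical centroids from an earlier k-means run on known activity,
# in an invented two-feature space.
centroids = [(2.0, 1.5), (41.0, 34.5)]

# Hypothetical cutoff: the source does not specify how far is "too far".
THRESHOLD = 10.0


def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def is_anomaly(point, centroids, threshold=THRESHOLD):
    """Flag a point whose nearest centroid is farther away than the threshold."""
    return nearest_centroid_distance(point, centroids) > threshold


new_points = [(3.0, 2.0), (40.0, 36.0), (90.0, 5.0)]
flags = [is_anomaly(p, centroids) for p in new_points]
print(flags)  # [False, False, True]
```

Because only the stored centroids are needed, checking a new point costs one distance computation per cluster, which is what makes centroid-based monitoring fast.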

K-Means clustering combined with a Euclidean-distance-based classifier correctly classified more than 14 million DNS transactions from 42,143 malware samples with respect to DNS C&C usage, and uncovered another bot family that uses DNS C&C. In addition, this method correctly detected DNS C&C in mixed office workstation network traffic. It therefore provides a mechanism for detecting DNS C&C in network traffic.

The processes are grouped into four main phases. The phases of the detection model are described as follows:

Phase 1: Binary Execution Phase. In this phase, the binary file is run in a virtual machine, namely the Drakvuf environment. All activities are then captured in log format.

Phase 2: File Extraction Phase. All the data, that is, the malware activities, are extracted in this phase. Two types of data are extracted: first, the default file (normal activities), and second, the infected file (suspicious activities).

Phase 3: Registry Data Extraction Phase. All the collected registry data is then extracted and prepared in this phase, because the extracted data are imbalanced.

Phase 4: Clustering Phase. The last phase is the clustering phase, in which the balanced data is analyzed with the K-Means clustering algorithm to decide whether each item is malware or not. The Euclidean distance formula is used to measure the distance between a centroid and the data points. The formula is shown in Fig.
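For reference, the standard Euclidean distance between a data point p and a centroid c with n features is d(p, c) = sqrt((p_1 - c_1)^2 + ... + (p_n - c_n)^2). A minimal sketch in Python follows; the function name `euclidean_distance` and the sample values are illustrative only.

```python
import math


def euclidean_distance(point, centroid):
    """Square root of the summed squared coordinate differences."""
    if len(point) != len(centroid):
        raise ValueError("point and centroid must have the same number of features")
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

Python's standard library offers the same computation as `math.dist`, so the hand-written version is mainly useful for seeing the formula spelled out.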

CONCLUSION

Intrusion Detection Systems (IDS) are used as malware detectors globally, which has led many researchers to explore this field. In this research project, a clustering method is proposed for better malware detection. The lack of analysis of malware behavior leads to low malware detection rates, owing to limited sources of information, especially in the Windows registry, for identifying malware activities. Clustering techniques, which use unsupervised machine learning algorithms, play an important role in grouping similar malware characteristics, but this approach is absent from malware analysis environments, specifically for registry information. The purpose of this research project is therefore to study registry information and to propose clustering analysis of registry information to improve malware detection.
