K-Means Clustering and Its Real-World Use Cases
Source: Done Myself


Task Description

Create a blog/article/video explaining k-means clustering and its real-world use cases in the security domain.

Hello, Connections !!!

This article covers k-means clustering and its use cases in the security domain. You will learn about supervised and unsupervised learning, clustering, k-means clustering and how it works, its applications and limitations, and its real-world use cases.


Let's Dive in !!!

Supervised and unsupervised learning are the two main techniques of machine learning, but they are used in different scenarios and with different kinds of datasets.

Supervised Learning

Supervised learning is a machine learning method in which models are trained on labeled data. The model must learn a mapping function from the input variable (X) to the output variable (Y). The goal of supervised learning is to train the model so that it can predict the output when it is given new data.


Example: Suppose we have images of different types of fruit. The task of our supervised learning model is to identify the fruits and classify them accordingly. To identify the images, we give the model both the input data and the corresponding output; that is, we train the model on the shape, size, color, and taste of each fruit. Once training is complete, we test the model with a new set of fruit images, and it identifies each fruit and predicts the output using a suitable algorithm.
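As a minimal sketch of this idea (the fruit features, numbers, and labels below are invented purely for illustration), a labeled dataset can be handed to an off-the-shelf classifier such as scikit-learn's DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each fruit described by [weight_g, diameter_cm, color_code]
X_train = [[150, 7.0, 0],   # 0 = red
           [160, 7.5, 0],
           [120, 6.0, 1],   # 1 = yellow
           [118, 5.8, 1]]
y_train = ["apple", "apple", "banana", "banana"]  # the labels supervise the training

model = DecisionTreeClassifier()
model.fit(X_train, y_train)            # learn the mapping from X to y using labeled examples

# Predict the label of a new, unseen fruit
print(model.predict([[155, 7.2, 0]]))  # expected output: ['apple']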

Supervised Machine Learning Categorization

Supervised learning algorithms are broadly categorized as either classification or regression models.

1) Classification Models – Classification models are used for problems where the output variable is a category, such as “Yes” or “No”, or “Pass” or “Fail.” They predict the category of the data. Real-life examples include spam detection, sentiment analysis, exam scorecard prediction, etc.

2) Regression Models – Regression models are used for problems where the output variable is a real value, such as an amount in dollars, a salary, a weight, or a pressure. They are most often used to predict numerical values based on previous data observations. Some of the more familiar regression algorithms include linear regression, polynomial regression, and ridge regression.
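For the regression side, here is a similarly minimal sketch (again with made-up numbers) that fits scikit-learn's LinearRegression to predict a numeric value:

from sklearn.linear_model import LinearRegression

# Hypothetical data: house area in square metres vs. price in dollars
X = [[50], [80], [100], [120]]
y = [150_000, 240_000, 300_000, 360_000]

reg = LinearRegression()
reg.fit(X, y)               # fit a straight line through the data
print(reg.predict([[90]]))  # predicts a price of roughly 270,000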

There are some very practical applications of supervised learning algorithms in real life, including:

  • Text categorization
  • Face detection
  • Signature recognition
  • Customer discovery
  • Spam detection
  • Weather forecasting
  • Predicting housing prices based on the prevailing market price
  • Stock price prediction, among others

Unsupervised Learning

Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal is to find structure and patterns in the input data. Unsupervised learning needs no supervision; instead, the model finds patterns in the data on its own.


Example: To understand unsupervised learning, consider the fruit example above. Unlike in supervised learning, we provide no supervision; we simply give the model the input dataset and let it find patterns in the data. Using a suitable algorithm, the model trains itself and divides the fruits into groups according to their most similar features.
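A minimal sketch of the same fruit example without labels (features invented for illustration): we hand the model only the inputs and let k-means group them by similarity.

from sklearn.cluster import KMeans

# Hypothetical unlabeled fruit features: [weight_g, diameter_cm]
X = [[150, 7.0], [160, 7.5], [120, 6.0], [118, 5.8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # no target labels are given; groups are discovered
print(labels)                   # e.g. [0 0 1 1]: two groups found by similarity alone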

Unsupervised Machine Learning Categorization

1) Clustering is one of the most common unsupervised learning methods. Clustering organizes unlabelled data into groups of similar items called clusters; a cluster is therefore a collection of similar data items. The primary goal is to find similarities among the data points and group similar data points into clusters.

2) Anomaly detection is the method of identifying rare items, events, or observations that differ significantly from the majority of the data. We generally look for anomalies or outliers because they are suspicious. Anomaly detection is often used in bank fraud and medical error detection (a small code sketch appears after this list).

3) Association: Fill an online shopping cart with diapers, applesauce and sippy cups and the site just may recommend that you add a bib and a baby monitor to your order. This is an example of association, where certain features of a data sample correlate with other features. By looking at a couple of key attributes of a data point, an unsupervised learning model can predict the other attributes with which they’re commonly associated.

4) Autoencoders: Autoencoders take input data, compress it into a code, and then try to recreate the input data from that summarized code. It’s like starting with Moby Dick, creating a SparkNotes version, and then trying to rewrite the original story using only the SparkNotes for reference. While this is a neat deep learning trick, there are fewer real-world cases where a simple autoencoder is useful. But add a layer of complexity and the possibilities multiply: by using both noisy and clean versions of an image during training, autoencoders can remove noise from visual data like images, video, or medical scans to improve picture quality.
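To make the anomaly-detection idea from point 2 concrete, here is a small sketch (synthetic data, arbitrary threshold) that flags points lying unusually far from their nearest k-means centroid:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # bulk of "normal" observations
outliers = np.array([[8.0, 8.0], [-7.0, 9.0]])          # two obvious anomalies
X = np.vstack([normal, outliers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# Distance from each point to its assigned centroid
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag points whose distance is unusually large (the 95th percentile is an arbitrary cut-off)
threshold = np.percentile(dist, 95)
print(np.where(dist > threshold)[0])  # the injected outliers should appear in this list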

Applications of Unsupervised Learning Algorithms

Some practical applications of unsupervised learning algorithms include:

  • Fraud detection
  • Malware detection
  • Identification of human errors during data entry
  • Conducting accurate basket analysis, etc.

What is Clustering?


Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them to clusters.

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. Which similarity measure to use is application-specific.
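As a quick illustration of the two similarity measures just mentioned, SciPy provides both distances directly:

from scipy.spatial.distance import euclidean, correlation

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]

print(euclidean(a, b))    # Euclidean distance: sqrt(1 + 4 + 9 + 16), about 5.48
print(correlation(a, b))  # correlation distance: 1 - Pearson r, about 0 (perfectly correlated)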

Clustering analysis can be done on the basis of features, where we try to find subgroups of samples based on their features, or on the basis of samples, where we try to find subgroups of features based on the samples. Here we cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in behavior or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.

Unlike supervised learning, clustering is considered an unsupervised learning method because we have no ground-truth labels against which to compare the clustering output and evaluate its performance. We only want to investigate the structure of the data by grouping the data points into distinct subgroups.

What is K-Means Clustering?


K-means clustering is an unsupervised learning algorithm used to solve clustering problems in machine learning and data science; it groups an unlabeled dataset into different clusters. It can also be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.

Here, K defines the number of clusters to be created in the process: if K=2, there will be two clusters; for K=3, three clusters; and so on.

How the K-means algorithm works

To process the training data, the k-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It halts creating and optimizing clusters when either:

  • The centroids have stabilized: their values no longer change, which means the clustering has converged.
  • The defined number of iterations has been achieved.
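These stopping criteria map directly onto the hyperparameters of a typical implementation; for example, in scikit-learn's KMeans (the values below are just illustrative):

from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,      # K: the number of clusters to form
    init="k-means++",  # strategy for choosing the initial centroids
    max_iter=300,      # stop after this many iterations at the latest
    tol=1e-4,          # stop earlier once the centroids have (almost) stopped moving
    n_init=10,         # run 10 different initializations and keep the best result
    random_state=0,
)
# kmeans.fit(X) would then run the iterative procedure described above on a dataset X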

K-Means Clustering Algorithm:

Step 1. Choose a value of k, the number of clusters to be formed.

Step 2. Randomly select k data points from the dataset as the initial cluster centroids/centers.

Step 3. For each data point:

a. Compute the distance between the data point and each cluster centroid.

b. Assign the data point to the closest centroid.

Step 4. For each cluster, calculate the new mean based on the data points assigned to it.

Step 5. Repeat steps 3 and 4 until the cluster means stop changing or a maximum number of iterations is reached.
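The steps above translate almost line by line into code. Here is a minimal NumPy sketch (not an optimized or production-ready implementation, and it assumes no cluster ever ends up empty):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Tiny demo on two obvious blobs
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1]
print(centroids)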


Not knowing the value of K

There is no built-in way for k-means to know the right number of clusters. What you can do is start with K=1 and then increase K (up to a certain upper limit). Usually the variance (the sum of the squared distances from each point to its assigned center) decreases rapidly at first; after a certain point it decreases only slowly. When you see that behavior, you know you have gone past the best K value.
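This is usually visualized with the so-called elbow method: plot the within-cluster sum of squared distances (exposed by scikit-learn as inertia_) for increasing K and look for the point where the curve bends. A minimal sketch on synthetic data:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # synthetic data with 4 blobs

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.show()  # the curve should bend (the "elbow") around K = 4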

Bad initial guess

If your initial guess is bad, you cannot expect the algorithm to work well.

The best way is to run K-means on several random initial guesses. Then, pick the final centers which have the least variance.

Another trick is to pick centers in a certain manner:

  1. Place the first center on a data point
  2. Place the second center on a data point that is farthest from the first
  3. Place the third center on a data point that is farthest from the first and second
  4. And so on.
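Both tricks are cheap to implement. The sketch below implements the farthest-first placement described above; note that scikit-learn's default init="k-means++" is a randomized refinement of the same idea, and its n_init parameter handles the multiple-restart strategy.

import numpy as np

def farthest_first_centers(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Place the first center on a random data point
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center
        dists = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        # 2./3. Place the next center on the point farthest from all chosen centers
        centers.append(X[np.argmax(dists)])
    return np.array(centers)

The resulting array can then be handed to scikit-learn, for example as KMeans(n_clusters=k, init=farthest_first_centers(X, k), n_init=1).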

K-means clustering is used in a variety of real-life business cases, for example:

In Security Domains

  • Academic performance
  • Diagnostic systems
  • Search engines
  • Wireless sensor networks


Here are some real-world use cases of k-means clustering:

1. Wireless sensor networks: A wireless sensor network (WSN) consists of spatially distributed autonomous sensors that monitor physical or environmental conditions and cooperatively pass their data through the network to a base station. Clustering is a critical task in wireless sensor networks for energy efficiency and network stability. Clustering through a central processing unit is well known and has been in use for a long time; distributed clustering methods are now being developed to deal with issues such as network lifetime and energy. In our work, we implemented both centralized and distributed k-means clustering algorithms in a network simulator. k-means is a prototype-based algorithm that alternates between two major steps, assigning observations to clusters and computing cluster centers, until a stopping criterion is satisfied. The simulation results obtained show that distributed clustering is more efficient than centralized clustering.

2. Diagnostic systems: One research effort developed a decision support system for heart disease prediction (HDPS) using data mining techniques, namely naïve Bayes and k-means clustering; k-means is one of the most popular clustering techniques, but its final result depends strongly on the initial choice of centroids. Using medical data from different patients, such as age, sex, blood pressure, blood sugar levels, chest pain, and electrocardiogram results, the system predicts the likelihood of heart disease. The work examines how an unsupervised technique (k-means clustering) can improve a supervised one (naïve Bayes), exploring the integration of k-means clustering with naïve Bayes for diagnosing patients. It also investigates different methods of initial centroid selection for k-means, such as range, inlier, outlier, random attribute value, and random row methods, in the diagnosis of heart disease patients. The results indicate that integrating k-means clustering, with its different initial centroid selection methods, with naïve Bayes improves diagnostic accuracy.

3. Document classification: Cluster documents into multiple categories based on tags, topics, and content. This is a standard problem, and k-means is a highly suitable algorithm for it. The documents first need to be preprocessed so that each one is represented as a vector, using term frequency to identify the commonly used terms that characterize it. The document vectors are then clustered to identify groups of similar documents (a minimal code sketch appears after this list).

4. Delivery store optimization: Optimize the process of goods delivery using trucks and drones by combining k-means, to find the optimal number of launch locations, with a genetic algorithm that solves the truck route as a traveling salesman problem.

5. Identifying crime localities: With data on crimes available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or locality.

6. Customer/market segmentation: Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. The resulting segments help the company target specific clusters of customers with specific campaigns.

7. Fantasy league stat analysis: Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a crucial role to play here. As an interesting exercise, if you would like to create a fantasy draft team and identify similar players based on player stats, k-means can be a convenient option.

8. Insurance fraud detection: Machine learning has a critical role to play in fraud detection and has many applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can have a multi-million-dollar impact on a company, the ability to detect fraud is crucial.

9. Rideshare data analysis: The publicly available Uber ride dataset provides a large amount of valuable data about traffic, transit times, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also for gaining insight into urban traffic patterns and planning the cities of the future.

10. Cyber-profiling criminals: Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea is derived from criminal profiling, which provides the investigation division with information for classifying the types of criminals present at a crime scene.

11. Call detail record analysis: A call detail record is information captured by telecom companies about a customer's calls, SMS, and internet activity. Combined with customer demographics, this information provides greater insight into the customer's needs. We can cluster customers' 24-hour activity using the unsupervised k-means clustering algorithm to understand customer segments with respect to their usage by hour.

12. Automatic clustering of IT alerts: Large enterprise IT infrastructure components such as networks, storage, and databases generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be screened and prioritized for downstream processes; clustering the alerts automatically groups related messages and reduces the manual effort involved.
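As an illustration of the document classification use case (point 3 above), here is a minimal sketch with a handful of made-up snippets: the documents are turned into TF-IDF vectors and then grouped with k-means.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "malware detected on the corporate network",
    "phishing email reported by several employees",
    "quarterly revenue grew five percent",
    "profit margins improved this quarter",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # documents -> TF-IDF vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the security-related and finance-related snippets should land in different clusters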

K-means limitations and weaknesses

Unfortunately, k-means has limitations and does not work well with certain types of clusters. It does not do well when:

  1. The clusters are of unequal size or density.
  2. The clusters are non-spherical.
  3. There are outliers in the data.
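The non-spherical case is easy to demonstrate: on the classic two-moons dataset, k-means cuts straight across both moons, whereas a density-based method such as DBSCAN recovers them. A quick sketch (the eps value is hand-tuned for this toy data):

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3).fit_predict(X)

# k-means splits the data with a straight boundary and mixes the two moons,
# while DBSCAN separates them by density.
print(set(km_labels), set(db_labels))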

Implemented K-means clustering

Here is the link: https://github.com/pratikkprojecthub/Datasciencepractice.git

In this article, we have learned how k-means clustering works and how it helps solve real-world problems.

Thanks for reading the article.

Hope you find this article helpful !!!

You can appreciate the article by giving it a like and posting comments about your feedback.

Feel free to ask any queries, and don't forget to share this article with your colleagues.



Abdulrahman Alzahrani

Application Manager @ King Salman Hospital


Great post explaining k-means clustering. I would just add a comment on finding the best K value: the elbow method is used to determine the best k. We run the k-means implementation for multiple k values and plot them against our sum of squared distances from the centroids (the loss function). The elbow of the curve (where the curve visibly bends) is selected as the optimum k.

Andre Acierno, M.S. Organizational Management

Data Analyst | ETL/EDA Expert | Transforms details to insights


This was a good read for me. Great content and thank you.
