Cluster bugs using ML (K-Means Clustering Algorithm) – A step-by-step approach

There are a few areas in the quality engineering space where applying Machine Learning concepts can help us gain deep insights. It is not that only applied ML can bring an extensive understanding of those areas; solid, plain engineering work can achieve the same. But the libraries built for ML work make the job far easier.

One of those use cases is the “clustering” of software bugs found for a software product. It is about understanding the nature of the bugs and then grouping them into different “clusters” (i.e. groups) based on some common characteristics. The concept can be applied by starting with a few features of the bugs (as shown in this article) and then extended to include many more. The more bug features are included, the better the groupings will be. The efficacy also depends on the amount of data: the larger the dataset, the better. Collecting data for only a few test cycles/sessions will not be very effective. My suggestion is to keep collecting data for every cycle/session over a long period of time and to keep looking at what the algorithm’s results are telling you. The results can uncover very useful information that is otherwise difficult to perceive with the naked eye.

Before starting, let’s first understand what clustering in ML is. “Clustering” (or Cluster Analysis) is an unsupervised machine learning technique that groups a set of data points in such a way that the data points in the same group (cluster) are more similar to each other than to those in other groups (clusters). Here, the term “unsupervised” means the model is trained on unlabelled data, without human-labelled examples as the training set. This kind of model is helpful when we want to discover patterns, relationships or structures in the collected data. The model explores the collected data and learns its structure on its own before segregating the data points into clusters.

Let’s take the example of an imaginary bug dataset generated by the following code:

import numpy as np
bug_data = np.random.rand(100, 2) * 4

These two lines generate a 2-dimensional NumPy array called “bug_data” with a shape of 100 rows and 2 columns. The array is filled with random values between 0 and 1, and these values are then multiplied by 4, resulting in a range between 0 and 4. Consider the two columns to be representative of some features related to the bugs like “Occurrence frequency” (or it can be “Application Module/Component affected”) and “Bug Severity” respectively.

[Note: Instead of using this kind of random imaginary dataset (as a NumPy array), you can replace it with your own bug dataset. You can put the data into either a CSV or Excel file, create a pandas DataFrame out of it and then follow the remaining process; the basic concept remains the same.]
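
For instance, a minimal sketch of that replacement could look like the code below (the file name and column names are hypothetical placeholders, not from any real project):

import pandas as pd

# Hypothetical CSV with one row per bug and two numeric feature columns
bug_df = pd.read_csv("bugs.csv")

# Keep only the numeric feature columns to be used for clustering
bug_data = bug_df[["occurrence_frequency", "severity"]].to_numpy()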

Next, we need to determine the optimal number of clusters to use in the model. Here, the ML algorithm I am using is “K-Means clustering” and the method to determine the number of clusters is “the Elbow method”.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []  # within cluster sum of squares
for i in range(1, 16):
    kmeans = KMeans(n_clusters=i, init="k-means++", max_iter=300, n_init=10, random_state=0)
    kmeans.fit(bug_data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 16), wcss)
plt.title("The Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()        

I am importing two Python modules here (“matplotlib.pyplot” from the matplotlib library and “sklearn.cluster” from the scikit-learn library).

I created an empty Python list (wcss) which will hold the “within-cluster sum of squares” values. This metric is used to evaluate the clustering results in the K-Means algorithm. Mathematically, it is the sum of squared distances between each data point and the centroid of its assigned cluster.

Next, I initialised a K-Means object with some specific parameters like below:

n_clusters=i -> specifies the number of clusters to create. The value of i is determined by the loop I am using (range(1, 16))

init="k-means++" -> specifies the initialisation method for the centroids

max_iter=300 -> specifies the maximum number of iterations for the K-means algorithm to converge.

n_init=10 -> specifies the number of times the algorithm will be run with different centroid seeds. The final result will be the best output in terms of WCSS

random_state=0 -> sets the random centroid seed for reproducibility

Then, using the line below, I fit the bug dataset to the initialised K-Means model:

kmeans.fit(bug_data)        

This line performs the actual clustering by computing the cluster centroids and assigning each data point to one of them.

wcss.append(kmeans.inertia_)        

Next, this line appends the WCSS value of the fitted model to the previously declared “wcss” list. Here, “kmeans.inertia_” returns the WCSS for the fitted model, i.e. the sum of squared distances between each data point and its assigned centroid.
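
To make the metric concrete, here is a small sketch (my own illustration, not part of the original code) that recomputes WCSS by hand and compares it with kmeans.inertia_. It assumes kmeans has already been fitted on bug_data as shown above:

import numpy as np

# Cluster assignment and centroid for every point
labels_for_check = kmeans.labels_
centers_for_check = kmeans.cluster_centers_

# WCSS = sum over all points of the squared distance to their own cluster centroid
manual_wcss = np.sum((bug_data - centers_for_check[labels_for_check]) ** 2)

# The two values should match (up to floating-point error)
print(manual_wcss, kmeans.inertia_)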

The below lines will plot the WCSS values against the number of clusters and enable us to determine the point where the “elbow” seems to break.

plt.plot(range(1, 16), wcss)
plt.title("The Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()        

That value will be the “number of clusters” that we have to use. Looking at the plot below, that number is 3. (If you prefer not to eyeball the plot, a simple programmatic heuristic is sketched after it.)

[Figure: Elbow method plot of WCSS vs. number of clusters]
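
As a rough alternative to reading the elbow off the plot, the sketch below (my own addition, not from the original article) picks the k at which the WCSS curve bends most sharply. For real bug data with genuine structure this should agree with what you see in the plot; a purely random dataset may not show a clean elbow at all.

import numpy as np

# Assumes the "wcss" list from the loop above, computed for k = 1..15
wcss_arr = np.array(wcss)

# Heuristic: the elbow is where the curve flattens the most,
# i.e. where the second difference (the bend) is largest
second_diff = np.diff(wcss_arr, n=2)       # entry i corresponds to k = i + 2
elbow_k = int(np.argmax(second_diff)) + 2  # shift the index back to the k it refers to

print("Suggested number of clusters:", elbow_k)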

We need to use this cluster number (i.e. 3) for the subsequent actions of the model. For that, we declare a variable (k) and initialise it with the value 3.

k = 3
kmeans = KMeans(n_clusters=k, random_state=0) 
kmeans.fit(bug_data)
labels = kmeans.labels_
centers = kmeans.cluster_centers_        

The line "kmeans = KMeans(n_clusters=k, random_state=0)" will initialize the K-Means clustering model with the cluster number set to 3

The line "kmeans.fit(bug_data)" will fit the bug data to the K-means model

The line "labels = kmeans.labels_" will get the cluster labels assigned to each bug

The line "centers = kmeans.cluster_centers_" will help us to get the cluster centroids

To plot the formed clusters, we can use the below code (comments included):

plt.figure(figsize=(8, 6))

# Define marker styles and colors for each cluster
markers = ["o", "s", "^"]
colors = ["red", "blue", "green"]
for cluster_label in range(k):
    cluster_points = bug_data[labels == cluster_label]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], marker=markers[cluster_label],
                color=colors[cluster_label])

plt.scatter(centers[:, 0], centers[:, 1], marker="x", s=200, c="black")
plt.xlabel("Bug Severity")
plt.ylabel("Occurrence Frequency")
plt.title("Bug Clustering for Bug Analysis")

# Create legend
legend_elements = [plt.Line2D([0], [0], marker=markers[i], color='w', markerfacecolor=colors[i], markersize=10)
                   for i in range(k)]
plt.legend(legend_elements, ['Cluster {}'.format(i + 1) for i in range(k)])
plt.show()        
[Figure: Scatter plot of the three bug clusters, with centroids marked as black crosses]
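
Once the clusters look reasonable, the cluster labels can be attached back to the bug records for further triage. Below is a minimal sketch (my own addition) that assumes the bugs sit in a pandas DataFrame with the same row order as bug_data; the column names are hypothetical:

import pandas as pd

# Hypothetical DataFrame with the same rows, in the same order, as bug_data
bugs_df = pd.DataFrame(bug_data, columns=["occurrence_frequency", "severity"])
bugs_df["cluster"] = labels + 1  # 1-based cluster ids, to match the plot legend

# Average feature values and bug counts per cluster -- a quick way to characterise each group
print(bugs_df.groupby("cluster").mean())
print(bugs_df["cluster"].value_counts().sort_index())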

In another article, I will try to cluster the same dataset using other clustering techniques (Hierarchical clustering, Density-based clustering, Topic modelling) and see how the plots look and which of these gives the best results.

------------------------------------------------------------------------------------------------------------

Thank you so much for reading the 7th edition of the #AutomationKaksha newsletter. Every week, I will be publishing articles on Automation, Framework design, ML, System Design, Web Development and Data Science.

If you found this article interesting, you may also love my other blogs:

  1. Fundamental Understanding of Text Processing in NLP (Natural Language Processing)
  2. Understanding the capabilities of Polars Python implementation

Do subscribe to #AutomationKaksha and also share it with your colleagues, friends and connections who can benefit from it.

Keep Learning, and Keep Sharing.

