Cluster bugs using ML (K-Means Clustering Algorithm) – A step-by-step approach

There are a few areas in the quality engineering space where applying Machine Learning concepts can help us gain deep insights. It is not that only applied ML can bring an extensive understanding of those areas; solid, plain engineering work can achieve the same. But the libraries built for ML work make the job far easier.

One of those use cases is the “clustering” of software bugs found for a software product. It is about understanding the nature of the bugs and then grouping them into different “clusters” (i.e. groups) based on some common characteristics. The concept can be applied by starting with a few features of the bugs (as shown in this article) and then extended to include many more. The more bug features are included, the better the groupings will be. The efficacy also depends on the amount of data: the larger the dataset, the better. Collecting data for only a few test cycles/sessions will not be very effective. My suggestion is to keep collecting data for every cycle/session over a long period of time and to keep looking at what the algorithm’s results are telling you. The results can uncover very useful information that is otherwise difficult to perceive with the naked eye.

Before starting, let’s first understand what clustering in ML is. “Clustering” (or Cluster Analysis) is an unsupervised machine learning technique that groups a set of data points in such a way that the data points in the same group (cluster) are more similar to each other than to those in other groups (clusters). Here, the term “unsupervised” means the model is trained on unlabelled data, without human-labelled examples as the training set. This kind of model is helpful when we want to discover patterns, relationships or structures in the collected data. The model explores the collected data and learns its structure on its own before segregating the data points into clusters.

Let’s take the example of an imaginary bug dataset generated by the following code:

import numpy as np
bug_data = np.random.rand(100, 2) * 4

These two lines generate a 2-dimensional NumPy array called “bug_data” with a shape of 100 rows and 2 columns. The array is filled with random values between 0 and 1, and these values are then multiplied by 4, resulting in a range between 0 and 4. Consider the two columns to be representative of some features related to the bugs like “Occurrence frequency” (or it can be “Application Module/Component affected”) and “Bug Severity” respectively.

[Note: Instead of using this kind of random imaginary dataset (as a NumPy array), you can replace it with your own bug dataset. You can put the data into either a CSV or Excel file, create a pandas DataFrame out of it and then follow the remaining process; the basic concept remains the same.]
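
For instance, a minimal sketch of that replacement could look like the code below (the file name and column names are hypothetical placeholders, not from any real project):

import pandas as pd

# Hypothetical CSV with one row per bug and two numeric feature columns
bug_df = pd.read_csv("bugs.csv")

# Keep only the numeric feature columns to be used for clustering
bug_data = bug_df[["occurrence_frequency", "severity"]].to_numpy()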

Next, we need to determine the optimal number of clusters to use in the model. Here, the ML algorithm I am using is “K-Means clustering” and the method to determine the number of clusters is “the Elbow method”.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []  # within cluster sum of squares
for i in range(1, 16):
    kmeans = KMeans(n_clusters=i, init="k-means++", max_iter=300, n_init=10, random_state=0)
    kmeans.fit(bug_data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 16), wcss)
plt.title("The Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()        

I am importing two Python modules here (“matplotlib.pyplot” from the matplotlib library and “sklearn.cluster” from the scikit-learn library).

I created an empty Python list (wcss) which will hold the “within-cluster sum of squares” values. This metric is used to evaluate the clustering results in the K-Means algorithm. Mathematically, it is the sum of squared distances between each data point and the centroid of its assigned cluster.

Next, I initialised a K-Means object with some specific parameters like below:

n_clusters=i -> specifies the number of clusters to create. The value of i is determined by the loop I am using (range(1, 16))

init="k-means++" -> specifies the initialisation method for the centroids

max_iter=300 -> specifies the maximum number of iterations for the K-means algorithm to converge.

n_init=10 -> specifies the number of times the algorithm will be run with different centroid seeds. The final result will be the best output in terms of WCSS

random_state=0 -> sets the random centroid seed for reproducibility

Then, using the line below, I fit the bug dataset to the initialised K-Means model:

kmeans.fit(bug_data)        

This line performs the actual clustering by computing the cluster centroids and assigning each data point to one of them.

wcss.append(kmeans.inertia_)        

Next, this line appends the WCSS value of the fitted model to the previously declared “wcss” list. Here, “kmeans.inertia_” returns the WCSS for the fitted model, i.e. the sum of squared distances between each data point and its assigned centroid.
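
To make the metric concrete, here is a small sketch (my own illustration, not part of the original code) that recomputes WCSS by hand and compares it with kmeans.inertia_. It assumes kmeans has already been fitted on bug_data as shown above:

import numpy as np

# Cluster assignment and centroid for every point
labels_for_check = kmeans.labels_
centers_for_check = kmeans.cluster_centers_

# WCSS = sum over all points of the squared distance to their own cluster centroid
manual_wcss = np.sum((bug_data - centers_for_check[labels_for_check]) ** 2)

# The two values should match (up to floating-point error)
print(manual_wcss, kmeans.inertia_)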

The below lines will plot the WCSS values against the number of clusters and enable us to determine the point where the “elbow” seems to break.

plt.plot(range(1, 16), wcss)
plt.title("The Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()        

That value will be the “number of clusters” that we have to use. Looking at the plot below, that number is 3. (If you prefer not to eyeball the plot, a simple programmatic heuristic is sketched after it.)

[Figure: Elbow method plot of WCSS vs. number of clusters]
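
As a rough alternative to reading the elbow off the plot, the sketch below (my own addition, not from the original article) picks the k at which the WCSS curve bends most sharply. For real bug data with genuine structure this should agree with what you see in the plot; a purely random dataset may not show a clean elbow at all.

import numpy as np

# Assumes the "wcss" list from the loop above, computed for k = 1..15
wcss_arr = np.array(wcss)

# Heuristic: the elbow is where the curve flattens the most,
# i.e. where the second difference (the bend) is largest
second_diff = np.diff(wcss_arr, n=2)       # entry i corresponds to k = i + 2
elbow_k = int(np.argmax(second_diff)) + 2  # shift the index back to the k it refers to

print("Suggested number of clusters:", elbow_k)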

We need to use this cluster number (i.e. 3) for the subsequent actions of the model. For that, we declare a variable (k) and initialise it with the value 3.

k = 3
kmeans = KMeans(n_clusters=k, random_state=0) 
kmeans.fit(bug_data)
labels = kmeans.labels_
centers = kmeans.cluster_centers_        

The line "kmeans = KMeans(n_clusters=k, random_state=0)" will initialize the K-Means clustering model with the cluster number set to 3

The line "kmeans.fit(bug_data)" will fit the bug data to the K-means model

The line "labels = kmeans.labels_" will get the cluster labels assigned to each bug

The line "centers = kmeans.cluster_centers_" will help us to get the cluster centroids

To plot the formed clusters, we can use the below code (comments included):

plt.figure(figsize=(8, 6))

# Define marker styles and colors for each cluster
markers = ["o", "s", "^"]
colors = ["red", "blue", "green"]
for cluster_label in range(k):
    cluster_points = bug_data[labels == cluster_label]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], marker=markers[cluster_label],
                color=colors[cluster_label])

plt.scatter(centers[:, 0], centers[:, 1], marker="x", s=200, c="black")
plt.xlabel("Bug Severity")
plt.ylabel("Occurrence Frequency")
plt.title("Bug Clustering for Bug Analysis")

# Create legend
legend_elements = [plt.Line2D([0], [0], marker=markers[i], color='w', markerfacecolor=colors[i], markersize=10)
                   for i in range(k)]
plt.legend(legend_elements, ['Cluster {}'.format(i + 1) for i in range(k)])
plt.show()        
[Figure: Scatter plot of the three bug clusters, with centroids marked as black crosses]
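
Once the clusters look reasonable, the cluster labels can be attached back to the bug records for further triage. Below is a minimal sketch (my own addition) that assumes the bugs sit in a pandas DataFrame with the same row order as bug_data; the column names are hypothetical:

import pandas as pd

# Hypothetical DataFrame with the same rows, in the same order, as bug_data
bugs_df = pd.DataFrame(bug_data, columns=["occurrence_frequency", "severity"])
bugs_df["cluster"] = labels + 1  # 1-based cluster ids, to match the plot legend

# Average feature values and bug counts per cluster -- a quick way to characterise each group
print(bugs_df.groupby("cluster").mean())
print(bugs_df["cluster"].value_counts().sort_index())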

In another article, I will try to cluster the same dataset using other clustering techniques (Hierarchical clustering, Density-based clustering, Topic modelling) and see how the plots look and which of these gives the best results.

------------------------------------------------------------------------------------------------------------

Thank you so much for reading the 7th edition of the #AutomationKaksha newsletter. Every week, I will be publishing articles on Automation, Framework design, ML, System Design, Web Development and Data Science.

If you found this article interesting, you may also love my other blogs:

  1. Fundamental Understanding of Text Processing in NLP (Natural Language Processing)
  2. Understanding the capabilities of Polars Python implementation

Do subscribe to #AutomationKaksha and also share it with your colleagues, friends and connections who can benefit from it.

Keep Learning, and Keep Sharing.

