Breathing K-Means: Superior K-Means Solutions through Dynamic K-Values
Introduction
Running a k-means algorithm on your numerical data is a common first step to obtain a compact representation of the data through the resulting set of k centroids. The quality of a k-means solution is measured by the summed (or mean) squared distance between the data points and their nearest centroids.
Figure 1 shows an example of a k-means problem together with a solution for k=15 which is (at least to the trained eye) obviously sub-optimal. Figure 1b shows an animation of the local adaptation performed by k-means.
The most popular algorithm for finding a good k-means solution is the k-means++ algorithm, which is a combination of the k-means++ initialization method (David Arthur and Sergei Vassilvitskii, 2007), producing an initial configuration, and the Generalized Lloyd Algorithm (Linde, Y.; Buzo, A.; Gray, R., 1980), which improves this configuration by locally moving the centroids until it converges to a local minimum. In the following, the Generalized Lloyd Algorithm is called GLA for brevity.
In this article, I present "Breathing K-Means", a dynamic method with the ability to perform non-local changes. The new approach is initialized with k-means++ but is usually able to improve the starting solution significantly while even being faster than k-means++ in the common scenario that k-means++ is run ten times to select the best solution.
The algorithm has been compared to six other methods in numerous experiments (TR available: https://arxiv.org/abs/2006.15666) and came out as the overall best algorithm beating k-means++/GLA by a large margin. Thus it can be recommended as a replacement for the usual k-means++/GLA combination.
An open source (MIT License) implementation is available from the Python Package repository PyPi (https://pypi.org/project/bkmeans/). The implementation is API-compatible with the KMeans class from scikit-learn (from which it is derived).
The K-Means Problem
The k-means problem denotes the task of describing a given data set X, consisting of n numeric vectors in d-dimensional space, by a smaller set of k vectors, often called centroids or codebook vectors. The k-means problem typically occurs in the context of data analysis or data compression.
The mathematical goal of the k-means problem is to find a set C of k centroids such that the summed squared distance (often denoted as Summed Squared Error or SSE) of each vector x∈X to its nearest centroid c∈C is minimized. It is well known that this problem is NP-hard (Aloise et al., 2009), meaning that it belongs to a class of optimization problems for which no efficient algorithm is known that is guaranteed to find the optimal solution. Therefore, in practice, approximation algorithms are used in the attempt to find a solution with a low SSE.
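Written out, the quantity to be minimized is SSE(C) = sum over all x∈X of min c∈C ‖x − c‖². For illustration, here is a minimal NumPy sketch (the helper name sse is hypothetical, not part of any library) that computes this value for a data set X and a set of centroids C:

import numpy as np

def sse(X, C):
    # squared distance of every data point to every centroid, shape (n, k)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # for each point take its nearest centroid, then sum over all points
    return d2.min(axis=1).sum()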
Below is a sample data set consisting of 25 clusters, each assembled from 3 square regions, with each of these square regions containing 4x4=16 data points.
While optimal solutions for k-means problems are normally not known, this data set was specifically constructed such that for k=75 an obvious optimal solution exists: one centroid in the center of each square region, as shown below:
The knowledge of this optimum makes it possible to judge given approximate solutions very accurately by computing the relative deviation of their SSE from the SSE of the optimal solution.
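Concretely, the relative deviation used here is (SSE − SSE_opt) / SSE_opt, expressed as a percentage (a one-line sketch with hypothetical argument names):

def relative_error(sse_value, sse_opt):
    # percentage by which a solution's SSE exceeds the SSE of the optimal solution
    return 100.0 * (sse_value - sse_opt) / sse_opt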
If one generates a solution by simply selecting k centroids at random from the data set, the relative deviation is huge (325.83%), as shown in the example in Figure 4:
As mentioned above, the k-means problem is NP-hard, so no efficient algorithm is known for finding the optimal solution (the best way to position the centroids). Therefore, approximation algorithms are used to find good, though not necessarily optimal, solutions efficiently. Below we describe the classic Generalized Lloyd Algorithm (GLA) with random initialization, the much better GLA with k-means++ initialization, and our new approach, breathing k-means, which improves upon GLA/k-means++ considerably.
The Generalized Lloyd Algorithm With Random Initialization
This classic method -- sometimes somewhat imprecisely referred to as "the k-means algorithm" -- can be used to find a solution to the k-means problem which is reasonable but often not good:
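In outline, after choosing k initial centroids at random from the data, the GLA alternates an assignment step and an update step until the centroids no longer move. The following is a minimal NumPy sketch (the function name is hypothetical; empty clusters are simply left in place here, whereas practical implementations handle them more carefully):

def gla(X, C, max_iter=100):
    # refine a given codebook C (shape (k, d)) on data X (shape (n, d))
    C = C.copy()
    for _ in range(max_iter):
        # assignment step: assign each data point to its nearest centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        C_new = C.copy()
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) > 0:
                C_new[j] = members.mean(axis=0)
        if np.allclose(C_new, C):
            break  # converged: centroids no longer move
        C = C_new
    return C

def random_init(X, k, rng=np.random.default_rng(0)):
    # random initialization: pick k distinct data points as initial centroids
    return X[rng.choice(len(X), size=k, replace=False)].copy()

# usage (sketch): solution = gla(X, random_init(X, k))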
Lloyd's algorithm depends heavily on the initial selection of centroids; after the random initialization it is fully deterministic. If the initial centroids are not chosen well, the algorithm gets stuck in a poor configuration with a high SSE. To mitigate this, one can run the algorithm multiple times with different initializations or use a more sophisticated initialization technique such as k-means++ (see below) to improve the chances of finding a better solution.
In Figure 6 the solution to the test problem found by Lloyd's algorithm is shown. It suffers from the random initialization used (shown in Figure 4), which distributed the centroids quite unevenly over the 25 connected areas. The subsequent cycles of assignment and update are not able to correct this, leading to the huge remaining SSE.
The Generalized Lloyd Algorithm With K-Means++ Initialization
The strong dependence of Lloyd's algorithm on the initial configuration has led to a lot of research on how to find better initializations. The most widespread such initialization is K-Means++ [David Arthur and Sergei Vassilvitskii, 2007]. It sequentially adds centroids, taking into account at each step the positions of the already placed centroids:
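In a nutshell: the first centroid is a data point drawn uniformly at random; every further centroid is a data point drawn with probability proportional to its squared distance to the nearest already chosen centroid ("D² sampling"). A minimal sketch of this seeding step (vanilla variant, hypothetical function name):

def kmeanspp_init(X, k, rng=np.random.default_rng(0)):
    # first centroid: a data point chosen uniformly at random
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of each point to its nearest centroid chosen so far
        d2 = ((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # draw the next centroid with probability proportional to this distance
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C)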
Here is an intuitive explanation of why it works:
By selecting centroids that are far apart from existing ones, k-means++ ensures a better spread of initial centroids across the data. This helps capture the overall structure of the data more effectively than random initialization. With centroids spread out, the algorithm is more likely to find good positions, as it starts with a more representative distribution of centroids that avoids a concentration of many centroids in one area.
Applied to our example problem, k-means++ gives much better results than the GLA with random initialization. An example solution is depicted here:
K-Means++ is probably the most widely used algorithm for k-means problems. It is the default in the popular scikit-learn Python package, which implements an improved variant called "greedy k-means++": in every selection step it tries out several candidate centroids and chooses the one that reduces the SSE the most. Due to the huge user base of scikit-learn, greedy k-means++ is probably the variant most widely used in practice. The non-greedy algorithm is sometimes called "vanilla k-means++" to distinguish it from its improved variant.
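As a side note, recent scikit-learn versions expose the seeding step on its own as sklearn.cluster.kmeans_plusplus; its n_local_trials parameter sets the number of candidates tried per selection step (the assumption here is that n_local_trials=1 corresponds to the vanilla variant, while the larger default corresponds to the greedy variant). A short sketch:

import numpy as np
from sklearn.cluster import kmeans_plusplus

X = np.random.rand(1000, 2)  # placeholder data
# greedy seeding (scikit-learn default: several candidates per step)
centers_greedy, _ = kmeans_plusplus(X, n_clusters=50, random_state=0)
# vanilla seeding: a single candidate per step
centers_vanilla, _ = kmeans_plusplus(X, n_clusters=50, random_state=0, n_local_trials=1)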
Still, k-means++ (greedy or vanilla) is only an initialization method and is followed by Lloyd's algorithm, which only allows local changes before it converges. One can easily construct or observe cases where the results from k-means++ are quite sub-optimal. This was the motivation for a more dynamic algorithm that allows targeted non-local changes to a given approximate solution, thereby in many cases finding much better local minima than Lloyd's algorithm. The new algorithm is called "Breathing K-Means".
Breathing K-Means with K-Means++ Initialization
Here is a brief description of the new Breathing K-Means algorithm. It starts from a solution obtained by k-means++ and the GLA. The new approach then repeatedly inserts and removes codebook vectors ("breathing") and runs the GLA in between, until the solution does not improve anymore:
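The following deliberately naive sketch conveys only the control flow of such a breathing loop, reusing the sse, gla and kmeanspp_init helpers sketched above. It is not the published algorithm: the choice of insertion positions, the utility-based removal and the neighbor freezing of the actual method (see the technical report) are replaced here by crude placeholders.

def breathing_outline(X, k, m=5, seed=0):
    rng = np.random.default_rng(seed)
    # start from a k-means++ seeding refined by the GLA
    C = gla(X, kmeanspp_init(X, k))
    while m > 0:
        # "breathe in": add m centroids (here simply random data points)
        D = gla(X, np.vstack([C, X[rng.choice(len(X), size=m, replace=False)]]))
        # "breathe out": remove m centroids again (here greedily those whose
        # removal increases the SSE the least)
        for _ in range(m):
            errs = [sse(X, np.delete(D, j, axis=0)) for j in range(len(D))]
            D = np.delete(D, int(np.argmin(errs)), axis=0)
        D = gla(X, D)
        if sse(X, D) < sse(X, C):
            C = D       # improvement found: keep the new solution
        else:
            m -= 1      # no improvement: reduce the breathing depth
    return C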
Due to the requirement to lower the SSE with every breathing cycle and the finite initial value of m, convergence is guaranteed after finitely many steps.
In Figure 9 the algorithm is shown as a diagram, with the breathing part shown in detail. For convenience, the whole algorithm, including the k-means++ initialization and the subsequent run of the GLA, is also called "Breathing K-Means".
The principal effect of the breathing cycles on the error of the found solution is illustrated in Figure 10.
For full details and motivation please refer to the technical report (https://arxiv.org/abs/2006.15666), which also describes the principle of neighbor freezing, relevant in the "breathe out" phase to keep the SSE low when removing centroids.
The breathing cycles significantly improve solution quality compared to greedy k-means++ and also reduce run time in the common case that greedy k-means++ is repeated ten times.
Averaged over five different groups of problems, breathing k-means outperformed several other k-means variants as well (see simulation results further below).
For the test problem, after starting from the k-means++/GLA result of Figure 8, breathing k-means found the optimal solution (see Figure 11). Please note that there is no formal guarantee that breathing k-means finds the optimum. This should come as no surprise, considering the NP-hardness of the k-means problem.
Experimental Results
Excellent Solution Quality
Below is the central result from the technical report https://arxiv.org/abs/2006.15666 (see there for details on the algorithm and the experiments/data sets). The table shows the respective SSE improvements of seven algorithms over the baseline algorithm, greedy k-means++ (run once). All approaches were run 10 times and the best result was selected; only breathing k-means was run just once. This was repeated 100 times for each combination of algorithm and problem. Accordingly, the first column shows how much greedy k-means++ itself can be improved by running it ten times. Averaged over all problem groups, breathing k-means emerges as the clear winner with 8.1% mean improvement over the baseline (lower right corner).
Fast Execution
Here the execution times relative to greedy k-means++ (run 10 times) are shown (corresponding to Table 7 above). Breathing k-means is second-fastest or fastest for all problem groups and is the fastest when averaged over all problem groups, slightly ahead of runner-up Hartigan-Wong, which, however, produced very low quality solutions in our experiments (see Table 7). Breathing k-means was nearly twice as fast as greedy k-means++ (run 10 times).
A few percent better - Why is this even relevant?
First of all, for NP-hard problems like the k-means problem it is usually very difficult to improve solution quality once the quality is anywhere near the optimum. Finding improvements for existing algorithms is an interesting intellectual challenge in its own right. The Generalized Lloyd Algorithm was published in 1982 and even goes back to an unpublished technical report from 1957. It is still widely used today, so, without being able to scan the vast literature on the topic (often hidden behind paywalls), one can state that no other method has gained comparable traction in 67 years. Improved initializations are quite common (k-means++ from 2007 being the most well-known), but the GLA has withstood all attempts to improve it.
The new Breathing K-Means algorithm provides a way to improve an existing solution to which the GLA has converged by introducing non-local changes to the codebook, thereby creating a sequence of local minima with monotonically decreasing error until it converges to an often considerably improved solution. The experiments also indicate that the variance of the solution quality is very low for Breathing K-Means, likely caused by a large similarity of the found solutions. This makes these solutions particularly suited for data analysis.
Reference Implementation available from PyPi
An open source (MIT license) implementation of Breathing K-Means is available from PyPi (Python Package Index) at https://pypi.org/project/bkmeans/.
Install it via pip:
pip install bkmeans
Afterwards you can use the BKMeans class in your code. The API is compatible with scikit-learn's KMeans class from which BKMeans is sub-classed.
Usage Example:
import numpy as np
from bkmeans import BKMeans
# generate random data set (or insert your data set here)
X = np.random.rand(1000, 2)
# create BKMeans instance
bkm = BKMeans(n_clusters=100)
# run the algorithm
bkm.fit(X)
# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)
You can also experiment without any local installation by using this Google Colab Notebook.
Try it out on your data!
Who cares about performance on some test data set? Please try out Breathing K-Means on your own data and comment below on the results you get! You should observe considerable improvements over k-means++/Lloyd (e.g., over the KMeans class in scikit-learn).
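For a quick side-by-side comparison on your own data, something along the following lines can be used (the data loading line is just a placeholder; any numeric NumPy array works):

import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans

X = np.loadtxt("your_data.csv", delimiter=",")  # placeholder: load your own data
k = 100
km = KMeans(n_clusters=k, n_init=10).fit(X)     # greedy k-means++ / Lloyd, best of 10 runs
bkm = BKMeans(n_clusters=k).fit(X)              # breathing k-means, run once
print("k-means++/Lloyd SSE:  ", km.inertia_)
print("breathing k-means SSE:", bkm.inertia_)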