Breathing K-Means: Superior K-Means Solutions through Dynamic K-Values

Introduction

Running a k-means algorithm on your numerical data is a common first step to obtain a compact representation of the data through the resulting set of k centroids. The quality of a k-means solution is measured by the summed (equivalently, average) squared distance between each data point and its nearest centroid.

Figure 1 shows an example of a k-means problem and a solution for k=15 which is (at least to the trained eye) clearly sub-optimal. Figure 1b shows an animation of the local adaptation performed by k-means.

Figure 1: Example of a sub-optimal solution for a k-means problem with k=15. A better solution would position more than one centroid in the area where the "high error centroid" lies.
Figure 1b: K-Means finds suboptimal solution after random initialization

The most popular algorithm to find a good k-means solution is the k-means++ algorithm, a combination of the k-means++ initialization method (Arthur and Vassilvitskii, 2007), which produces an initial configuration, and the Generalized Lloyd Algorithm (Linde, Buzo, and Gray, 1980), which improves this configuration by locally moving the centroids until it converges to a local minimum. In the following, the Generalized Lloyd Algorithm is called GLA for brevity.

In this article, I present "Breathing K-Means", a dynamic method with the ability to perform non-local changes. The new approach is initialized with k-means++ but is usually able to improve the starting solution significantly while even being faster than k-means++ in the common scenario that k-means++ is run ten times to select the best solution.

The algorithm has been compared to six other methods in numerous experiments (TR available: https://arxiv.org/abs/2006.15666) and came out as the overall best algorithm beating k-means++/GLA by a large margin. Thus it can be recommended as a replacement for the usual k-means++/GLA combination.

An open source (MIT License) implementation is available from the Python Package Index, PyPI (https://pypi.org/project/bkmeans/). The implementation is API-compatible with the KMeans class from scikit-learn (from which it is derived).

The K-Means Problem

The k-means problem denotes the task of describing a given data set X, consisting of n numeric vectors in d-dimensional space, by a smaller set of k vectors, often called centroids or codebook vectors. The k-means problem typically occurs in the context of data analysis or data compression.

The mathematical goal of the k-means problem is to find a set C of k centroids such that the summed squared distance (often denoted as the Summed Squared Error or SSE) of each vector x∈X to its nearest centroid c∈C is minimized. It is well known that this problem is NP-hard (Aloise et al., 2009), meaning that it belongs to a class of optimization problems for which no efficient algorithm is known that is guaranteed to find the optimal solution. Therefore, in practice, approximation algorithms are used in the attempt to find a solution with a low SSE.
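As a quick illustration, the SSE can be computed with a few lines of NumPy (a minimal sketch, not taken from the bkmeans package; X is the data set and C the centroid set):

import numpy as np

def sse(X, C):
    # squared Euclidean distances from every data point to every centroid
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # each point contributes its squared distance to its nearest centroid
    return d2.min(axis=1).sum()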

Below is a sample data set consisting of 25 clusters, each assembled from 3 square areas, and each of these square areas contains 4x4=16 data points.

Figure 2: Data Set constructed such that an optimal solution exists for k=75

While optimal solutions for k-means problems are normally not known, this data set was specifically constructed such that for k=75 an obvious optimal solution exists: one centroid in the center of each square area, as shown below:

Figure 3: Optimal solution for k=75. The blue lines indicate the Voronoi Partition, a division of the plane into regions where each region contains all the points closest to a specific centroid.

The knowledge of this optimum makes it possible to judge approximate solutions very accurately by computing the relative deviation of their SSE from the SSE of the optimal solution, i.e., (SSE - SSE_opt) / SSE_opt.

If one generates a solution by simply selecting k centroids at random from the data set, the relative error deviation is huge (325.83%) as shown in the example in Figure 4:

Figure 4: The random choice of 75 centroids leads to a very irregular distribution of centroids and a huge error in areas with few centroids. The SSE is 325.83% larger than for the optimal solution.

As noted above, the k-means problem is NP-hard, so no efficient algorithm is known for finding the optimal solution (the best way to position the centroids). Therefore, approximation algorithms are used to find good, though not necessarily optimal, solutions efficiently. Below we describe the classic Generalized Lloyd Algorithm (GLA) with random initialization, the much better GLA with k-means++ initialization, and our new approach, breathing k-means, which improves upon GLA/k-means++ considerably.

The Generalized Lloyd Algorithm With Random Initialization

This classic method -- sometimes somewhat imprecisely denoted as "The K-Means Algorithm" -- can be used to find a solution to the k-means problem which is reasonable but often not good (a minimal code sketch follows Figure 5):

  • Initialization: Select k initial centroids randomly from the data set X.
  • Assignment Step: Assign each data point in X to its nearest centroid.
  • Update Step: Recalculate the centroids as the mean of the data points assigned to each centroid.
  • Repeat: Repeat the Assignment and Update steps until the centroids no longer change significantly.
  • Result: The algorithm outputs k centroids and the data points associated with each centroid.

Figure 5: The two steps to find a solution with GLA are the random initialization and the actual run of the iterative Generalized Lloyd Algorithm. This combination is often (and somewhat misleadingly) denoted as "The K-Means Algorithm".
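For readers who prefer code, here is a minimal NumPy sketch of the GLA. It is not the bkmeans implementation; it assumes squared Euclidean distances, samples the initial centroids from X, and handles the rare case of a centroid losing all its points by simply leaving it in place:

import numpy as np

def gla(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every data point
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        C_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        # Repeat until the centroids no longer change significantly
        if np.linalg.norm(C_new - C) < tol:
            C = C_new
            break
        C = C_new
    return C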

Lloyd's algorithm depends heavily on the initial selection of centroids; after the random initialization it is fully deterministic. If the initial centroids are not chosen well, the algorithm may get stuck in a poor configuration with a high SSE. To mitigate this, one can run the algorithm multiple times with different initializations or use a more sophisticated initialization technique such as k-means++ (see below) to improve the chances of finding a better solution.
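In scikit-learn, both mitigations are directly available via the parameters of the KMeans class (a small sketch, assuming X holds the data set; the numbers are arbitrary):

from sklearn.cluster import KMeans

# plain GLA with random initialization, restarted 10 times;
# the best run (lowest SSE) is kept automatically
km = KMeans(n_clusters=75, init="random", n_init=10).fit(X)
print(km.inertia_)  # SSE of the best of the 10 runs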

In Figure 6 the solution to the test problem found by Lloyd's algorithm is shown. It suffers from the random initialization (shown in Figure 4), which distributed the centroids quite unevenly over the 25 clusters. The subsequent cycles of assignment and update are not able to correct this, leading to the large remaining SSE.

Figure 6: The GLA reduces the SSE deviation from the optimum to 62.95%. Starting from the initialization in Figure 4, it converges to a state where each centroid is in the center of gravity of its associated data points.

The Generalized Lloyd Algorithm With K-Means++ Initialization

The strong dependence of Lloyd's algorithm on the initial configuration has led to a lot of research on how to find better initializations. The most widespread such initialization is k-means++ (Arthur and Vassilvitskii, 2007). It sequentially adds centroids, considering at each step the positions of the already placed centroids (a code sketch follows Figure 7):

  • First Centroid: Randomly select the first centroid from the data set X.
  • Distance Calculation: Calculate the distance from each data point in X to the nearest already chosen centroid.
  • Probability Selection: Select the next centroid with a probability proportional to the square of its distance to the nearest existing centroid.
  • Repeat: Repeat the Distance Calculation and Probability Selection steps until k centroids have been chosen.
  • Proceed with Lloyd’s Algorithm: Use these k centroids as the starting points for Lloyd's algorithm.

Figure 7: The K-Means++ Algorithm adds a specific initialization before GLA is run.
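The seeding procedure is compact enough to sketch directly in NumPy (vanilla variant, without the greedy candidate selection discussed below; an illustrative sketch, not the scikit-learn or bkmeans code):

import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a randomly selected data point
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance calculation: squared distance of each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Probability selection: draw the next centroid proportional to this squared distance
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C)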

Here is an intuitive explanation of why it works:

By selecting centroids that are far apart from existing ones, k-means++ ensures a better spread of initial centroids across the data. This helps capture the overall structure of the data more effectively than random initialization. With centroids spread out, the algorithm is more likely to find good positions, as it starts with a more representative distribution of centroids that avoids a concentration of many centroids in one area.

Applied to our example problem, k-means++ gives much better results than the GLA with random initialization. An example solution is depicted here:

Figure 8: The k-means++ initialization gave the GLA a much better starting configuration from where it could reach a solution only 13.05% worse than the optimum.

K-Means++ is probably the most widely used algorithm for k-means problems. It is the default algorithm in the popular scikit-learn Python package, which implements an improved variant called "Greedy K-Means++": in every selection step several candidate centroids are tried out and the one reducing the SSE most is chosen. Due to the huge user base of scikit-learn, this greedy variant is likely the most widely used algorithm for k-means problems in practice. The non-greedy algorithm is sometimes called "Vanilla K-Means++" to distinguish it from the improved variant.

Still, k-means++ (greedy or vanilla) is only an initialization method and is followed by Lloyd's algorithm, which only allows local changes before it converges. One can easily construct or observe cases where the results from k-means++ are quite sub-optimal. This was the motivation for a more dynamic algorithm that allows targeted non-local changes to a given approximate solution, thereby in many cases finding much better local minima than Lloyd's algorithm. The new algorithm is called "Breathing K-Means".

Breathing K-Means with K-Means++ Initialization

Here is a brief description of the new Breathing K-Means algorithm. It starts from a solution obtained by k-means++ and the GLA. The new approach then inserts and removes codebook vectors ("breathing") and runs the GLA in between, until the solution does not improve anymore (a structural code sketch follows the list):

  • Initialization: Initial centroids are determined by running k-means++ (followed by the GLA).
  • Set breathing depth: The parameter m (breathing depth) is set to a starting value (default: 5).
  • Breathe in: m new centroids are added near existing high-error centroids. Afterwards, the GLA is run to find a local minimum.
  • Breathe out: The m centroids with the lowest utility are removed to bring the total number of centroids back to k. Afterwards, the GLA is run to find a local minimum.
  • SSE check: If the overall SSE has decreased, another breathing cycle is performed. If the SSE has not decreased, m is reduced by one. If m > 0, more breathing cycles (with the new m) are performed; otherwise the next (and final) step is executed.
  • Result: The centroid set of size k with the smallest SSE is returned.
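The following Python-style sketch shows only the structure of the breathing phase. The exact insertion positions, the utility measure, and the neighbor freezing used in the real algorithm are described in the technical report; here they are replaced by hypothetical helper functions (add_near_high_error_centroids, remove_low_utility_centroids, run_gla), and sse is the error function sketched earlier:

def breathing_phase(X, C, m=5):
    # C: centroid set of size k obtained from k-means++ followed by the GLA
    best_C, best_sse = C, sse(X, C)
    while m > 0:
        # Breathe in: add m centroids near high-error centroids, then run the GLA (k+m centroids)
        C = run_gla(X, add_near_high_error_centroids(X, C, m))
        # Breathe out: remove the m lowest-utility centroids, then run the GLA (back to k centroids)
        C = run_gla(X, remove_low_utility_centroids(X, C, m))
        # SSE check: keep breathing while the error decreases, otherwise reduce the breathing depth
        if sse(X, C) < best_sse:
            best_C, best_sse = C, sse(X, C)
        else:
            m -= 1
    return best_C  # centroid set of size k with the smallest SSE found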

Due to the requirement to lower the SSE with every breathing cycle and the finite initial value of m, convergence is guaranteed in finitely many steps.

In Figure 9 the algorithm is shown as a diagram, with the breathing part shown in detail. For convenience, the whole algorithm, including the k-means++ initialization and the subsequent run of the GLA, is also called "Breathing K-Means".

Figure 9: Breathing K-Means with a drilldown diagram showing the structure of the Breathing Phase which is guaranteed to terminate when the SSE does not improve anymore.

The principal effect of the breathing cycles on the error of the found solution is illustrated in Figure 10.


Figure 10: Schematic illustration how breathing k-means can find a sequence of increasingly better local minima starting from a solution obtained by k-means++/GLA.

For full details and motivation please refer to the technical report (https://arxiv.org/abs/2006.15666), which also describes the principle of neighbor freezing, relevant in the "breathe out" phase to keep the SSE low when removing centroids.

The breathing cycles significantly improve solution quality compared to greedy k-means++, and also execution speed in the common case that greedy k-means++ is repeated ten times and the best result is kept.

Averaged over five different groups of problems, breathing k-means outperformed several other k-means variants as well (see simulation results further below).

For the test problem, after starting with the k-means++/GLA result of Figure 8, breathing k-means found the optimal solution (see Figure 11). Please note that there is no formal guarantee that breathing k-means finds the optimum. This should come as no surprise considering the NP-hardness of the k-means problem.

Figure 11: Breathing k-means found the optimal solution (0.00% SSE difference from the optimum). This is a very typical result for this problem, but optimality of breathing k-means is not guaranteed.

Experimental Results

Excellent Solution Quality

Below is the central result from the technical report (https://arxiv.org/abs/2006.15666; see there for details on the algorithms and the experiments/data sets). The table shows the respective SSE improvements of seven algorithms over the baseline algorithm: greedy k-means++ (run once). All approaches were run 10 times and the best result was selected; only breathing k-means was run just once. This was repeated 100 times for each combination of algorithm and problem. Accordingly, the first column shows how much greedy k-means++ itself can be improved by running it ten times. Averaged over all problem groups, breathing k-means emerges as the clear winner with 8.1% mean improvement over the baseline (lower right corner).

excerpt from technical report (table numbers refer to the TR)

Fast Execution

Here the execution times relative to greedy k-means++ (run 10 times) are shown (corresponding to Table 7 shown above). Breathing k-means is fastest or second-fastest for all problem groups and even the fastest when averaged over all problem groups, slightly ahead of runner-up Hartigan-Wong, which, however, produced solutions of very low quality in our experiments (see Table 7). Breathing k-means was nearly twice as fast as greedy k-means++ (run 10 times).

excerpt from technical report

A few percent better - Why is this even relevant?

First of all, for NP-hard problems like the k-means problem it is usually very difficult to improve solution quality once the quality is anywhere near the optimum. Finding improvements for existing algorithms is an interesting intellectual challenge on its own. The Generalized Lloyd Algorithm is based on Lloyd's algorithm, which was published in 1982 and even goes back to an unpublished technical report from 1957. It is widely used until today, so, without being able to scan the vast literature on the topic (often hidden behind paywalls), one can state that no other method has gained comparable traction for 67 years. Improved initializations are quite common (k-means++ from 2007 being the most well-known), but the GLA itself has withstood all attempts to improve it.

The new Breathing K-Means algorithm provides a way to improve an existing solution to which the GLA has converged by introducing non-local changes to the codebook, thereby creating a sequence of local minima of monotonically decreasing error until it converges to an often considerably improved solution. The experiments performed also indicate that the variance of the solution quality is very low for Breathing K-Means, likely caused by a large similarity of the found solutions. This makes these solutions particularly suited for data analysis.

Reference Implementation available from PyPi

An open source (MIT license) implementation of Breathing K-Means is available from PyPI (the Python Package Index) at https://pypi.org/project/bkmeans/

Install it via pip:

pip install bkmeans        

Afterwards you can use the BKMeans class in your code. The API is compatible with scikit-learn's KMeans class from which BKMeans is sub-classed.

Usage Example:

import numpy as np
from bkmeans import BKMeans

# generate a random data set (or insert your data set here)
X = np.random.rand(1000, 2)

# create a BKMeans instance
bkm = BKMeans(n_clusters=100)

# run the algorithm
bkm.fit(X)

# print the SSE (called inertia in scikit-learn terms)
print(bkm.inertia_)
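Since BKMeans is a drop-in replacement, comparing it against scikit-learn's standard KMeans on the same data takes only a few extra lines (continuing the example above; the size of the improvement will of course depend on your data):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=100).fit(X)    # greedy k-means++ followed by Lloyd's algorithm
bkm = BKMeans(n_clusters=100).fit(X)  # breathing k-means

print(f"KMeans SSE:  {km.inertia_:.4f}")
print(f"BKMeans SSE: {bkm.inertia_:.4f}")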

You can also experiment without any local installation by using this Google Colab Notebook.

Try it out on your data!

Who cares about performance on some test data set? Please try out Breathing K-Means on your own data and comment below on the results you get! You should observe considerable improvements over k-means++/Lloyd (e.g., over the KMeans class in scikit-learn).



