I recently came across a very elegant way of saying that in a certain sense the k-means clustering algorithm is guaranteed to converge. I thought this was something worthy of sharing. Before I do that, for the sake of completeness I will (like in other posts on the blog) just summarize k-means and not assume any background.
A Short Introduction:
k-means clustering in a way is the most commonly used machine learning algorithm in my opinion (not just clustering algorithm). Its popularity is perhaps a function of:
1. Importance of clustering in data analysis across domains.
2. Ease of implementation of k-means.
It is another story though that little is known about it theoretically. This is something I seriously intend to write about in future posts.
Clustering: The idea of clustering is to find structure in the data. Given a dataset, the task of clustering is to find groups in this set, such that “similar” points are in one group and “dissimilar” ones in others. In k-means clustering, this notion of similarity is usually the Euclidean Distance. The Algorithm roughly is like this.
1. Choose a value of “k” i.e number of clusters.
2. Initialize k cluster centroids in the data-set randomly.
3. For each point in the data, find the Euclidean Distance with each of the cluster centroids initialized in step – 2.
4. Assign each point in the data-set to a cluster centroid it is closest to.
5. Update each cluster centroid in step 2 with the average of all points assigned to it.
6. Repeat steps 2 – 5 till convergence.
Optimization Function: Informally, the objective of k-means can be said to be: Find a set of cluster centroids such that all points in that “cluster” are the closest to that centroid than to any other. i.e. Find cluster centroids such that the sum of distances of all points with their assigned cluster centroids is as low as possible.
This optimization objective is to minimize the distortion function:
Where, is a cluster centroid to which a point is assigned to. Note: k-means is simply co-ordinate descent on
Convergence: It is not hard to see that the distortion function is non-convex. And thus k-means is prone to finding local minimas. i.e One is not guaranteed to find the best clustering according to on a random initialization. Therefore, usually people initialize k-means randomly a few times and then pick the best solution.
Lyapunov Function: But we do know that k-means is quite “well behaved” usually. This is because k-means is guaranteed to converge in a certain sense. That is, in successive iterations of k-means decreases monotonically to a certain minimum value. It would never increase from one iteration to another.
A way to formalize this notion is using the idea of a Lyapunov Function as :
1. We can associate an “energy” with a given state of the kmeans algorithm by connecting a spring between each point and the centroid it is assigned to.
2. The energy of each spring is proportional to its squared length. i.e . Where x is the length of the spring. In this case the distance between the mean (centroid) and a point. This “k”, is the stiffness of the string and not the “k” of kmeans.
3. The total energy of all such springs is a Lyapunov Function for the algorithm BECAUSE:
(i) The assignment to a cluster centroid only happens if there is a decrease in energy. A point gets assigned to another cluster centroid only if its energy will decrease with such an assignment.
(ii) Each Iteration of updating centroids will only be such that the energy of the springs decreases. This is a method to minimize the energy of the springs.
(iii) This energy has a lower bound (the minima of )
This discussion shows that this problem has a Lyapunov Function. And since it has a Lyapunov function it converges.
 David MacKay: Information Theory, Inference and Learning Algorithms (University of Cambridge, Free PDF available)