Feeds:
Posts

Lyapunov Stability and k-means Clustering

I recently came across a very elegant way of saying that in a certain sense the k-means clustering algorithm is guaranteed to converge. I thought this was something worthy of sharing. Before I do that, for the sake of completeness I will (like in other posts on the blog) just summarize k-means and not assume any background.

stabilità

A Short Introduction:

k-means clustering in a way is the most commonly used machine learning algorithm in my opinion (not just clustering algorithm). Its popularity is perhaps a function of:

1. Importance of clustering in data analysis across domains.

2. Ease of implementation of k-means.

It is another story though that little is known about it theoretically. This is something I seriously intend to write about in future posts.

Clustering: The idea of clustering is to find structure in the data. Given a dataset, the task of clustering is to find groups in this set, such that “similar” points are in one group and “dissimilar” ones in others. In k-means clustering, this notion of similarity is usually the Euclidean Distance. The Algorithm roughly is like this.

1. Choose a value of “k” i.e number of clusters.

2. Initialize k cluster centroids in the data-set randomly.

3. For each point in the data, find the Euclidean Distance with each of the cluster centroids initialized in step – 2.

4. Assign each point in the data-set to a cluster centroid it is closest to.

5. Update each cluster centroid in step 2 with the average of all points assigned to it.

6. Repeat steps 2 – 5 till convergence.

Optimization Function: Informally, the objective of k-means can be said to be: Find a set of cluster centroids such that all points in that “cluster” are the closest to that centroid than to any other. i.e. Find cluster centroids such that the sum of distances of all points with their assigned cluster centroids is as low as possible.

This optimization objective is to minimize the distortion function:

$\displaystyle J(c, \mu) = \sum_{i = 1}^m \lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2$

Where, $\mu_c$ is a cluster centroid to which a point $x$ is assigned to. Note: k-means is simply co-ordinate descent on $J$

Convergence: It is not hard to see that the distortion function $J$ is non-convex. And thus k-means is prone to finding local minimas. i.e One is not guaranteed to find the best clustering according to $J$ on a random initialization. Therefore, usually people initialize k-means randomly a few times and then pick the best solution.

Lyapunov Function: But we do know that k-means is quite “well behaved” usually. This is because k-means is guaranteed to converge in a certain sense. That is, in successive iterations of k-means $J$ decreases monotonically to a certain minimum value. It would never increase from one iteration to another.

A way to formalize this notion is using the idea of a Lyapunov Function as [1]:

1. We can associate an “energy” with a given state of the kmeans algorithm by connecting a spring between each point and the centroid it is assigned to.

2. The energy of each spring is proportional to its squared length.  i.e $E = \frac{1}{2}kx^2$. Where x is the length of the spring. In this case the distance between the mean (centroid) and a point. This “k”, is the stiffness of the string and not the “k” of kmeans.

3. The total energy of all such springs is a Lyapunov Function for the algorithm BECAUSE:

(i) The assignment to a cluster centroid only happens if there is a decrease in energy. A point gets assigned to another cluster centroid only if its energy will decrease with such an assignment.

(ii) Each Iteration of updating centroids will only be such that the energy of the springs decreases. This is a method to minimize the energy of the springs.

(iii) This energy has a lower bound (the minima of $J$)

This discussion shows that this problem has a Lyapunov Function. And since it has a Lyapunov function it converges.

_________________

References:

[1] David MacKay: Information Theory, Inference and Learning Algorithms (University of Cambridge, Free PDF available)

_________________

Onionesque Reality Home >>

This question appears very trivial and might just be meaningless, so I put it up at the risk of embarrassing myself. However, since it is a genuine question. I still put it up. All thoughts are welcome.

Question: When trying to quantify the performance of a classifier. What advantages does RMSE offer over the Area under the Curve under the ROC metric? And what does the AUC offer that the RMSE does not? I find AUC very intuitive and prefer using it for classification tasks. But can I give a theoretical reason for using it above RMSE and the vice versa? Review committees have different preferences, some journals prefer reporting the RMSE while some prefer the AUC, some ask for both. Another example being – The 2010 KDD Cup used RMSE while the 2010 UCSD data mining competition used AUC.

To paraphrase my question – What can be instances in which a classifier is deemed as “good” by the AUC measure and “not so good” by the RMSE measure. What would be the exact reason for such a different “opinion”? And in what situations should I use AUC and in what situations should I use RMSE?

Some Background : If they are equivalent, then you would expect a strong linear relationship (with a negative correlation). That means that for a perfect classifier RMSE would be zero and the AUC 1.

I always use both for all purposes. Here is a sample graph.

RMSE versus AUC for a classifier on some Intelligent Tutoring Data

This is actually a very typical graph, and there are no surprises with it. If you leave out some “bad examples” such as those at (0.4, 0.65) and (0.38, 0.7), the graph has a good negative correlation (as measured by the line fit).

So, the question remains for me. What are the advantages and disadvantages of both?

Recommendations :

Taught Bio-Informatics (UG)

I am in the process of winding up taking a basic course on Bio-Informatics, it is offered as an elective subject for final year under-graduate Information Technology students.  I preferred taking this course as a visiting faculty on weekends as managing time in the week is hard (though i did take some classes on weekdays).

______

[Gene Clustering : (a) shows clusters (b) uses hierarchical clustering (c) uses k-means (d)  SOM finds clusters which are arranged in grids. Source : Nature Biotechnology 23, 1499 – 1501 (2005) by Patrick D’haeseleer]

Why Bio-Informatics?

The course (out of the ones offered in Fall) I would have preferred taking the most would have been a course on AI. There is no course on Machine Learning or Pattern Recognition at the UG level here, and the course on AI comes closest as it has sufficient weight given to Neural Nets and Bayesian Learning.

The only subject that comes nearest to my choice as AI was not available, was Bio-Informatics as about 60 percent of the syllabus was Machine Learning, Data Mining and Pattern Recognition. And it being a basic course gave me the liberty to take these parts in much more detail as compared to the other parts. And that’s exactly why taking up Bio-Informatics even though it’s not directly my area was not a bad bargain!

______

The Joys of Teaching:

This is the first time that I have formally taken a complete course, I have taken work-shops and given talks quite a few times before. But never taken a complete course.

I have always enjoyed teaching. When I say I enjoy teaching, I don’t necessarily mean something academic. I like discussing ideas in general

If I try to put down why I enjoy teaching, there might be some reasons:

• There is an obvious inherent joy in teaching that few activities have for me. When i say teaching here, like I said before I don’t just mean to talk about formal teaching, but rather the more general meaning of the term.
• It’s said that there is no better way to learn than to teach. Actually that was the single largest motivation that prompted me to take that offer.
• Teaching gives me a high! The time I get to discuss what I like (and teach), I forget things that might be pressing me at other times of the day. I tend to become a space-cadet when into teaching. It’s such a wonderful experience!
• One more reason that i think i like teaching is this : I have a wide range of reading (or atleast am interested in) and I have noticed that the best way it gets connected and in most unexpected ways is in discussions. You don’t get people who would be interested in involved discussions very often, also being an introvert means the problem is further compounded. Teaching gives me a platform to engage in such discussions. Some of the best ideas that I have got, borrowing from a number of almost unrelated areas is while discussing/teaching. And this course gave me a number of ideas that I would do something about if I get the chance and the resources.
• Teaching also gives you the limits of your own reading and can inspire you to plug the deficiencies in your knowledge.
• Other than that, I take teaching or explaining things as a challenge. I enjoy it when I find out that I can explain pion exchanges to friends who have not seen a science book after grade 10. Teaching is a challenge well worth taking for a number of reasons!

From this specific course the most rewarding moment was when a couple of groups approached me after the conclusion of classes to help them a little with their projects. Since their projects are of moderate difficulty and from pattern recognition, I did take that up as a compliment for sure! Though I can not say I can “help” them,  I don’t like using that word, it sounds pretentious, I would definitely like to work with them on their projects and hopefully would learn something new about the area.

______

Course:

I wouldn’t be putting up my notes for the course, but the topics I covered included:

1. Introduction to Bio-Informatics, Historical Overview, Applications, Major Databases, Data Management, Analysis and Molecular Biology.

2. Sequence Visualization, structure visualization, user interface, animation verses simulation, general purpose technologies, statistical concepts, microarrays, imperfect data, quantitative randomness, data analysis, tool selection, statistics of alignment, clustering and classification, regression analysis.

3. Data Mining Methods & Technology overview, infrastructure, pattern recognition & discovery, machine learning methods, text mining & tools, dot matrix analysis, substitution metrics, dynamic programming, word methods, Bayesian methods, multiple sequence alignment, tools for pattern matching.

4. Introduction, working with FASTA, working with BLAST, filtering and capped BLAST, FASTA & BLAST algorithms & comparison.

Like I said earlier, my focus was on dynamic programming, clustering, regression (linear, locally weighted), Logistic regression, support vector machines, Neural Nets, an overview of Bayesian Learning. And then introduced all the other aspects as applications subsequently and covered the necessary theory then!

______

Resources:

All my notes for the course were hand-made and not on $\LaTeX$, so it would be impossible to put them up now (they were basically made from a number of books and the MIT-OCW).

H0wever I would update this space soon enough linking to all the resources I would recommend.

______

I am looking forward to taking a course on Digital Image Processing and Labs the next semester, which begins December onwards (again as a visiting instructor)! Since Image Processing is closer to the area I am interested in deeply (Applied Machine Learning – Computer Vision), I am already very excited about the possibility!

______

Onionesque Reality Home >>

In the past month or so I have been looking at a series of lectures on Data Mining that I had long bookmarked. I’ve had a look at the lectures twice and I found them extremely useful, hence I thought it was not a bad idea to share them here, though I am aware that they are pretty old and rather well circulated.

These lectures delivered by Professor David Mease as Google Tech Talks/Stanford Stat202 course lectures, work equally well for beginners as for experts who need to brush up with basic ideas. The course uses R extensively.

Statistical Aspects of Data Mining

_____

I’d end with some Dilbert strips on Data-Mining that I have liked in the past!

_____

_____

_____

Onionesque Reality Home >>