Archive for the ‘Data Mining’ Category

I recently came across a very elegant way of saying that in a certain sense the k-means clustering algorithm is guaranteed to converge. I thought this was something worthy of sharing. Before I do that, for the sake of completeness I will (like in other posts on the blog) just summarize k-means and not assume any background.


A Short Introduction:

k-means clustering in a way is the most commonly used machine learning algorithm in my opinion (not just clustering algorithm). Its popularity is perhaps a function of:

1. Importance of clustering in data analysis across domains.

2. Ease of implementation of k-means.

It is another story though that little is known about it theoretically. This is something I seriously intend to write about in future posts.

Clustering: The idea of clustering is to find structure in the data. Given a dataset, the task of clustering is to find groups in this set, such that “similar” points are in one group and “dissimilar” ones in others. In k-means clustering, this notion of similarity is usually the Euclidean Distance. The Algorithm roughly is like this.

1. Choose a value of “k” i.e number of clusters.

2. Initialize k cluster centroids in the data-set randomly.

3. For each point in the data, find the Euclidean Distance with each of the cluster centroids initialized in step – 2.

4. Assign each point in the data-set to a cluster centroid it is closest to.

5. Update each cluster centroid in step 2 with the average of all points assigned to it.

6. Repeat steps 2 – 5 till convergence.

Optimization Function: Informally, the objective of k-means can be said to be: Find a set of cluster centroids such that all points in that “cluster” are the closest to that centroid than to any other. i.e. Find cluster centroids such that the sum of distances of all points with their assigned cluster centroids is as low as possible.

This optimization objective is to minimize the distortion function:

\displaystyle J(c, \mu) = \sum_{i = 1}^m \lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2

Where, \mu_c is a cluster centroid to which a point x is assigned to. Note: k-means is simply co-ordinate descent on J

Convergence: It is not hard to see that the distortion function J is non-convex. And thus k-means is prone to finding local minimas. i.e One is not guaranteed to find the best clustering according to J on a random initialization. Therefore, usually people initialize k-means randomly a few times and then pick the best solution.

Lyapunov Function: But we do know that k-means is quite “well behaved” usually. This is because k-means is guaranteed to converge in a certain sense. That is, in successive iterations of k-means J decreases monotonically to a certain minimum value. It would never increase from one iteration to another.

A way to formalize this notion is using the idea of a Lyapunov Function as [1]:

1. We can associate an “energy” with a given state of the kmeans algorithm by connecting a spring between each point and the centroid it is assigned to.

2. The energy of each spring is proportional to its squared length.  i.e E = \frac{1}{2}kx^2. Where x is the length of the spring. In this case the distance between the mean (centroid) and a point. This “k”, is the stiffness of the string and not the “k” of kmeans.

3. The total energy of all such springs is a Lyapunov Function for the algorithm BECAUSE:

(i) The assignment to a cluster centroid only happens if there is a decrease in energy. A point gets assigned to another cluster centroid only if its energy will decrease with such an assignment.

(ii) Each Iteration of updating centroids will only be such that the energy of the springs decreases. This is a method to minimize the energy of the springs.

(iii) This energy has a lower bound (the minima of J)

This discussion shows that this problem has a Lyapunov Function. And since it has a Lyapunov function it converges.



[1] David MacKay: Information Theory, Inference and Learning Algorithms (University of Cambridge, Free PDF available)


Onionesque Reality Home >>

Read Full Post »

This question appears very trivial and might just be meaningless, so I put it up at the risk of embarrassing myself. However, since it is a genuine question. I still put it up. All thoughts are welcome.

Question: When trying to quantify the performance of a classifier. What advantages does RMSE offer over the Area under the Curve under the ROC metric? And what does the AUC offer that the RMSE does not? I find AUC very intuitive and prefer using it for classification tasks. But can I give a theoretical reason for using it above RMSE and the vice versa? Review committees have different preferences, some journals prefer reporting the RMSE while some prefer the AUC, some ask for both. Another example being – The 2010 KDD Cup used RMSE while the 2010 UCSD data mining competition used AUC.

Or is this a bad question to ask?

To paraphrase my question – What can be instances in which a classifier is deemed as “good” by the AUC measure and “not so good” by the RMSE measure. What would be the exact reason for such a different “opinion”? And in what situations should I use AUC and in what situations should I use RMSE?

Some Background : If they are equivalent, then you would expect a strong linear relationship (with a negative correlation). That means that for a perfect classifier RMSE would be zero and the AUC 1.

I always use both for all purposes. Here is a sample graph.

RMSE versus AUC for a classifier on some Intelligent Tutoring Data

This is actually a very typical graph, and there are no surprises with it. If you leave out some “bad examples” such as those at (0.4, 0.65) and (0.38, 0.7), the graph has a good negative correlation (as measured by the line fit).

So, the question remains for me. What are the advantages and disadvantages of both?

Recommendations :

1. ROC Graphs : Notes and Practical Considerations for Researchers – Tom Fawcett

Read Full Post »

I am in the process of winding up taking a basic course on Bio-Informatics, it is offered as an elective subject for final year under-graduate Information Technology students.  I preferred taking this course as a visiting faculty on weekends as managing time in the week is hard (though i did take some classes on weekdays).


nbt1205-1499-F1[Gene Clustering : (a) shows clusters (b) uses hierarchical clustering (c) uses k-means (d)  SOM finds clusters which are arranged in grids. Source : Nature Biotechnology 23, 1499 – 1501 (2005) by Patrick D’haeseleer]

Why Bio-Informatics?

The course (out of the ones offered in Fall) I would have preferred taking the most would have been a course on AI. There is no course on Machine Learning or Pattern Recognition at the UG level here, and the course on AI comes closest as it has sufficient weight given to Neural Nets and Bayesian Learning.

The only subject that comes nearest to my choice as AI was not available, was Bio-Informatics as about 60 percent of the syllabus was Machine Learning, Data Mining and Pattern Recognition. And it being a basic course gave me the liberty to take these parts in much more detail as compared to the other parts. And that’s exactly why taking up Bio-Informatics even though it’s not directly my area was not a bad bargain!


The Joys of Teaching:

This is the first time that I have formally taken a complete course, I have taken work-shops and given talks quite a few times before. But never taken a complete course.

I have always enjoyed teaching. When I say I enjoy teaching, I don’t necessarily mean something academic. I like discussing ideas in general

If I try to put down why I enjoy teaching, there might be some reasons:

  • There is an obvious inherent joy in teaching that few activities have for me. When i say teaching here, like I said before I don’t just mean to talk about formal teaching, but rather the more general meaning of the term.
  • It’s said that there is no better way to learn than to teach. Actually that was the single largest motivation that prompted me to take that offer.
  • Teaching gives me a high! The time I get to discuss what I like (and teach), I forget things that might be pressing me at other times of the day. I tend to become a space-cadet when into teaching. It’s such a wonderful experience!
  • One more reason that i think i like teaching is this : I have a wide range of reading (or atleast am interested in) and I have noticed that the best way it gets connected and in most unexpected ways is in discussions. You don’t get people who would be interested in involved discussions very often, also being an introvert means the problem is further compounded. Teaching gives me a platform to engage in such discussions. Some of the best ideas that I have got, borrowing from a number of almost unrelated areas is while discussing/teaching. And this course gave me a number of ideas that I would do something about if I get the chance and the resources.
  • Teaching also gives you the limits of your own reading and can inspire you to plug the deficiencies in your knowledge.
  • Other than that, I take teaching or explaining things as a challenge. I enjoy it when I find out that I can explain pion exchanges to friends who have not seen a science book after grade 10. Teaching is a challenge well worth taking for a number of reasons!

From this specific course the most rewarding moment was when a couple of groups approached me after the conclusion of classes to help them a little with their projects. Since their projects are of moderate difficulty and from pattern recognition, I did take that up as a compliment for sure! Though I can not say I can “help” them,  I don’t like using that word, it sounds pretentious, I would definitely like to work with them on their projects and hopefully would learn something new about the area.



I wouldn’t be putting up my notes for the course, but the topics I covered included:

1. Introduction to Bio-Informatics, Historical Overview, Applications, Major Databases, Data Management, Analysis and Molecular Biology.

2. Sequence Visualization, structure visualization, user interface, animation verses simulation, general purpose technologies, statistical concepts, microarrays, imperfect data, quantitative randomness, data analysis, tool selection, statistics of alignment, clustering and classification, regression analysis.

3. Data Mining Methods & Technology overview, infrastructure, pattern recognition & discovery, machine learning methods, text mining & tools, dot matrix analysis, substitution metrics, dynamic programming, word methods, Bayesian methods, multiple sequence alignment, tools for pattern matching.

4. Introduction, working with FASTA, working with BLAST, filtering and capped BLAST, FASTA & BLAST algorithms & comparison.

Like I said earlier, my focus was on dynamic programming, clustering, regression (linear, locally weighted), Logistic regression, support vector machines, Neural Nets, an overview of Bayesian Learning. And then introduced all the other aspects as applications subsequently and covered the necessary theory then!



All my notes for the course were hand-made and not on \LaTeX, so it would be impossible to put them up now (they were basically made from a number of books and the MIT-OCW).

H0wever I would update this space soon enough linking to all the resources I would recommend.


I am looking forward to taking a course on Digital Image Processing and Labs the next semester, which begins December onwards (again as a visiting instructor)! Since Image Processing is closer to the area I am interested in deeply (Applied Machine Learning – Computer Vision), I am already very excited about the possibility!


Onionesque Reality Home >>

Read Full Post »

In the past month or so I have been looking at a series of lectures on Data Mining that I had long bookmarked. I’ve had a look at the lectures twice and I found them extremely useful, hence I thought it was not a bad idea to share them here, though I am aware that they are pretty old and rather well circulated.

These lectures delivered by Professor David Mease as Google Tech Talks/Stanford Stat202 course lectures, work equally well for beginners as for experts who need to brush up with basic ideas. The course uses R extensively.

data mining icon11Statistical Aspects of Data Mining


Course Video Lectures.

Course website.

Lecture Slides.


I’d end with some Dilbert strips on Data-Mining that I have liked in the past!

Data Mining






Onionesque Reality Home >>

Read Full Post »

I know this post is no rocket science and might just appear to be too silly! :)

5 years back i joined a social networking website (orkut), my first. It has a unique feature w.r.t such sites – it displays the number of visits to your profile on each day. Some related features of the web-site are as follows:

1. When you sign into your account, you appear on the home-page of your friends’ profiles (something like “recently online”). As other people sign-in after you did, you’ll gradually stop appearing on the home page of your friends’. This is because, the home page can only show 9 people at a time, that in turn means that if 9 people sign in after you, you would cease to be visible on the home page and would only be accessible on the friends list – that generally people don’t look at unless they are looking for someone specifically.

2. If anyone visits your page or refreshes it. It will be counted as one hit.

Now, it is obvious that if you sign-in very often, you are are more likely to be seen on the home-page of your friends’ pages who are then more likely to click through to your page (I don’t have any research to support this, but see it as common sense and as per experience. If you see someone’s profile on your page you would be more likely to visit that profile casually rather than search for that profile and visit it when it is out of sight from the home page. The latter you would do only if you have to communicate with the concerned person, or if you have some work, OR if you have to spy on the person under question ;-). Thus, in conclusion – when you sign-in regularly you are more likely to get more hits on your profile.

Now there are tricks to avoid appearing online on such websites. What they’ll do is that even when you sign-in you would not appear online on the home pages of your friends. Since you do not appear online on the page of your friends your profile is also less likely to be visited by those who did after seeing you online. In short your profile would be visited by such people only:

1. People who wanted to talk to you for something.

2. People who randomly searched for your name/somebody else sharing your name (this gets removed if you don’t keep a name) and landed up on your profile.

3. People who saw your posts/messages on some community or group and then got curious and visited your page.

4. Somebody searched for his/her favorite cult movie and you have that movie on your profile and thus it shows on the search results and that somebody then checks out your page (extends to artists, music etc too).

5. Somebody randomly remembers you and checks your page out to see what’s up with you.

Thus we can say that if certain conditions are satisfied, the number of hits to your page would be a random number, at least approximately.

I generally satisfied the condition of not appearing online, and I noticed over the years that other than the occasional spike, the number of hits was in a way distributed around a central value. I however never paid any further attention.

Some weeks ago, while writing somewhere.,  I thought it was time to try and model the same for a blog or a website and see for myself if that number could indeed be considered as random. Please note that if that number would be random then the distribution of page visits over a period of time would be a Gaussian curve (I’ll come back at the end of the post to this for those who wouldn’t be sure).

Now it is difficult to satisfy those conditions that I mentioned for social networking website for some blog or website on the web. I looked for the following and made the following assumptions (please question their wisdom in case you don’t agree and give new suggestions):

1. Suppose you start writing a blog and it starts off rather well. You are enthusiastic and advertise your page and ask people to pay a visit. Such hits can’t be considered random hits. The total number of visits to the page would be non-random plus random hits (from search engines, random visitors etc).

2. You are active on your blog in a big way for the first year say. And suppose the feed/email subscribers keep visiting your blog frequently as and when they get notified of a new post by you. This number too wouldn’t be a random one as the number of hits would be basically a function which has a dependent variable in the number and frequency of posts as well. In short the more you post the more page visits you are likely to get.

3. After the one year you decide to quit your page. For some time the subscribers would keep visiting your page. And since you have stopped blogging as such, you would stop advertising too, you would stop giving its link to people/friends and asking them to pay a visit.

4. After a sufficient period of time, say another year. The “excitement” about the blog has died down and there are no new posts at all. The number of hits that you obtain can be:

(a) Random hits from people searching randomly some stuff on search engines.

(b) Randomly people (mostly friends, former stalkers etc ;-) think of paying your blog a visit just hoping there might be something new.

(c) People keep looking for some tutorials (or similar post) on your page. I have noticed that no matter how old a tutorial on your page gets the old crowd is mostly replaced by new people and the overall number of visits to that page remains roughly around a mean value.

I believe that this sum total (unless bot attacks, or similar events which would result in spikes in the number of visits for a day occur) would roughly be a random number. And also that this above scenario for a website is equivalent to that of a profile I mentioned earlier.

I actually thought it was pretty straight forward. I asked some people about what their opinion on it was. And it appeared to me that either I lacked the communication skills to convey what i meant or maybe it was not so straight forward.

I decided to take it up. I collected the webstats for two websites.

I would like to thank Dr Jonathan Yedidia of the MERL for providing me with the stats of his website over the past three-four months or so. This website has been inactive for over a year and satisfies the other conditions that I spoke about, so it was an ideal candidate.

One more observation was this : The number of hits on weekends is visibly less than on weekdays. So it is not a good idea to use both together. It would be a better idea to use two classes:

1. Only weekdays

2. Only weekends or other holidays.

That is, it is a good idea to only models weekends, or only weekdays. Both together would not be a good idea as they seem to have different distributions.


Let’s only consider the weekdays class. Like I mentioned I collected data for some months for Dr Jonathan Yedidia’s website. And the data plots actually turned out to be Gaussian. That’s interesting.

The staircase plot for the number of visitors to the website is given below: The X Axis represents the days and the Y Axis represents the number of hits.

Stairs[Staircase Plot for the number of visitors]

I plotted a simple historgram of this data for 40 bins and for 80 bins and the plot is what I expected it to be! Roughly normal. There is an outlier though (a day when number of hits was 265, which I believe was due to a bot attack or something similar).

Histogram-40[Histogram Plot for the data with 40 bins]

Just for the sake of visual convenience let’s consider the same data for 80 bins.

Histogram-80[Histogram Plot for the data with 40 bins]

The normal fit for the above plots (both for 40 and 80 bins) are given below:

Normal-Fit[Normal fit on the data with 40 bins]

Normal-Fit 2[Normal fit on the data with 80 bins]

The estimated values for the mean and the standard deviation are as follows:



Meaning: The average number of hits on each day is about 97 and and on any given day the number of hits would most likely lie in the bracket 97.5 +/- 28.


I found the above confirmation of what i had thought somewhat interesting. This does confirm that when certain things that I mentioned are taken care of, then the number of hits on a particular day can be considered to be a random number. I now plan to collect data for a longer period of about one year and repeat the same for four websites (2 that satisfy those assumptions and 2 that do not, these two would be the control).

The normal distribution never ceases to amaze me. It is one of those things that signify an underlying order in chaos. Which is fundamental. You take a set of random people, take measurements of their foreheads, waists, heights and it would lie along a normal distribution. You take the various shots taken by an archer towards an aim and the distribution of arrows about the center is along a normal distribution.


Infact this can be used to detect irregularities in data. Such patterns of randomness are pretty reliable to make a guess if there is any foul play with the data. For example if you were a recruiter in the the military and had data on the heights of thousands of men. If the data does not lie along a normal distribution would indicate some fudging or faking of heights by individuals. Such statistical analysis has given to a new area called Forensic Economics, which has some extremely enjoyable aspects to it such as Benford’s law, which too deals with checking the statistical fingerprint that data leaves to detect foul play.

Such interesting patterns underlying the chaos of social data led to many interesting philosophical questions in the 19th century. Many of which were simply whimsical, however exploration of which provided much progress in understanding randomness.

Read Full Post »