
Synopsis of a recent direction of work with Gábor Sárközy, Endre Szemerédi and Fei Song — “The Regularity Lemma is a deep result from extremal graph theory with great utility as a fundamental tool for proving theoretical results, but can it be employed in more “practical” settings?”

More specifically, we are interested in the problem of harnessing the power of the regularity lemma to do clustering. This blog post is organized as follows: we first sketch the regularity lemma, then note that it is an existential statement and state an algorithmic version, and then look at how this constructive version may be used for clustering/segmentation. It must be noted that the idea seems to have originated in earlier interesting work by Sperotto and Pelillo.

Before a brief introduction to the regularity lemma, I’d like to quote Luca Trevisan on (the related) Szemeredi’s Theorem from his blog:

Szemeredi’s theorem on arithmetic progressions is one of the great triumphs of the “Hungarian approach” to mathematics: pose very difficult problems, and let deep results, connections between different areas of math, and applications, come out as byproducts of the search for a solution.

Even though I am nothing close to being a mathematician, we’ll see that even an Electrical Engineer can somewhat understand and appreciate what Prof. Trevisan means in this specific context! ;-)

I will attempt to sketch a few things below with more emphasis on intuition than rigour! To be specific – intuition and then rigour.

_________________________

Introduction: One of the most fundamental and ingenious results in graph theory and discrete mathematics is the Szemeredi Regularity Lemma. It was originally advanced by Endre Szemeredi as an auxiliary result to prove a long-standing conjecture of Erdős and Turán from 1936 (on the Ramsey properties of arithmetic progressions), which stated that sequences of integers with positive upper density must contain arbitrarily long arithmetic progressions (now called Szemeredi’s theorem, 1975). The regularity lemma by itself is now considered one of the most important tools in graph theory.

Prof. Endre Szemeredi: Originator of the "Regularity Method"

A very rough statement of the regularity lemma could be made as follows:

Every graph can be approximated by random graphs. This is in the sense that every graph can be partitioned into a bounded number of equal parts such that:
1. Most edges run between different parts
2. These edges behave as if they were generated at random.

The fundamental nature of the Regularity Lemma as a tool in Graph Theory (and beyond) can be understood from the fact that two recent Fields Medals (Timothy Gowers, 1998 and Terence Tao, 2006) were awarded at least in part for breakthrough results on the regularity lemma and for using them to prove fundamental results on arithmetic progressions, amongst other results (such as the Green–Tao Theorem).

_________________________

A Temporary Digression and Prof. Trevisan’s Statement: A first look at the statement of the regularity lemma above reminds one (especially those trained in signal processing, I suppose) of the Fourier series. While this is extremely simplified, it’s not entirely incorrect to say that it points in part to a deeper relationship between combinatorics and analysis. Let’s look at it this way – Szemeredi employed the regularity lemma (a graph theoretic method) to prove what is now called Szemeredi’s theorem. The Szemeredi Theorem is a difficult one and it has four different proofs (graph-theoretic, ergodic, hypergraph-theoretic and one based on Fourier analysis)*. In fact, work on the Szemeredi theorem has contributed to a better understanding of the connections between these areas (as an example, Timothy Gowers was awarded the Fields Medal for proving very deep results on the connections between analysis and combinatorics).

Terence Tao and Timothy Gowers have made breakthrough contributions to the regularity lemma and Szemeredi's Theorem amongst numerous others (Image Source: Republic of Math)

So it could be said that it all really started with Erdős and Turán posing a hard problem – The search for its solution has led to not only some good theorems but also some interesting connections between different areas of mathematics and thus a better understanding of all of them as a result. So we see what Prof. Trevisan means!

*While the connection as sketched here is rough though intuitive, the strong relationship between the Regularity Lemma and analysis was shown in a breakthrough paper by László Lovász and Balázs Szegedy, in which they showed that the regularity lemma could be thought of as a result in analysis:

Szemerédi’s Lemma for the Analyst (László Lovász and Balázs Szegedy)

_________________________

Given the background now we get to stating Szemeredi’s Regularity Lemma:

The Szemeredi Lemma: Expanding upon the rough statement mentioned a couple of paragraphs above, we could be a little more specific and say that

The Lemma states that every graph can be partitioned into a bounded number of quasi-random bipartite graphs, called regular pairs, and a few leftover edges.

Now for a more precise statement, we introduce some definitions and notation.

Definition 1: Let G = (V,E) be a graph and let A, B be disjoint subsets of V . We denote the number of edges with one endpoint in A and the other in B by e(A,B) . The density of edges between A and B can then be defined as:

\displaystyle d(A,B) = \frac{e(A,B)}{|A||B|}

This just defines the edge density in a bipartite graph.
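As a toy illustration (a small sketch of my own; the adjacency-matrix representation and NumPy are just convenient choices, nothing prescribed by the lemma), the density of a pair of vertex sets can be computed directly:

```python
import numpy as np

def pair_density(adj, A, B):
    """Edge density d(A, B) between two disjoint vertex sets.

    adj: symmetric 0/1 adjacency matrix of the graph (NumPy array).
    A, B: lists (or arrays) of vertex indices, assumed disjoint.
    """
    e_AB = adj[np.ix_(A, B)].sum()    # e(A, B): edges with one end in A, the other in B
    return e_AB / (len(A) * len(B))   # d(A, B) = e(A, B) / (|A| |B|)
```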

Definition 2: The bipartite graph G = (A, B, E) is called \varepsilon -regular if for every X \subset A and Y \subset B satisfying

|X| > \varepsilon |A|, |Y| > \varepsilon |B|

we would have |d(X,Y) - d(A,B)| < \varepsilon

This means that a bipartite graph is epsilon-regular if, no matter which subsets (of some minimum size) X, Y of A, B respectively we pick, the edge density between these subsets is almost the same as the edge density of the original bipartite graph. In effect, this implies that if a bipartite graph is epsilon-regular then the edges between the two disjoint sets are distributed roughly uniformly.

Definition 3: An equitable partition of the vertex set V of a graph G = (V,E) is a partition of V into pairwise disjoint classes C_0, C_1, \dots, C_k such that all the classes C_i with 1 \leq i \leq k have the same cardinality. It is noteworthy that oftentimes the cardinality of the vertex set cannot be divided equally into the said number of classes; the exceptional class C_0 is present for precisely this technical reason: to absorb the leftover vertices so that the remaining classes can have the same cardinality.

Definition 4: For every equitable partition P of the vertex set V of G = (V,E) into classes C_0, C_1, \dots, C_k we associate a measure called the potential or the index of the partition P , which is defined as:

\displaystyle ind(P) = \frac{1}{k^2} \sum_{1 \leq r < s \leq k} d(C_r, C_s)^2

This measure quantifies how close a partition is to a regular one.
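To make the definition concrete, here is a small sketch (hypothetical code of my own, written for an adjacency-matrix representation) that computes the index of a partition, summing over the pairs 1 \leq r < s \leq k and excluding the exceptional class:

```python
import numpy as np
from itertools import combinations

def index_of_partition(adj, classes):
    """Potential/index ind(P) of an equitable partition.

    adj: symmetric 0/1 adjacency matrix.
    classes: list of the k non-exceptional classes C_1, ..., C_k
             (each a list of vertex indices); C_0 is excluded.
    """
    k = len(classes)
    total = 0.0
    for Cr, Cs in combinations(classes, 2):   # all pairs with 1 <= r < s <= k
        d = adj[np.ix_(Cr, Cs)].sum() / (len(Cr) * len(Cs))
        total += d ** 2                       # square of the pair density
    return total / k ** 2
```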

Definition 5: An equitable partition C_0, C_1, \dots, C_k of the vertex set V of a graph G , where C_0 is the exceptional set, is called \varepsilon -regular if |C_0| < \varepsilon |V| and all but at most \varepsilon k^2 of the pairs (C_i,C_j) with 1 \leq i < j \leq k are \varepsilon -regular.

We are now in a position to state the regularity lemma:

Theorem 1: [Regularity Lemma]
For every positive \varepsilon and positive integer m, there are positive integers N = N(\varepsilon, m) and M = M(\varepsilon, m) such that the following property holds: For all graphs G with |V|\geq N , there is an \varepsilon -regular partition of G(V,E) into k+1 classes such that m\leq k \leq M .

The beauty of the regularity lemma lies in the fact that the number of classes needed to approximate any (sufficiently large) graph does not depend on the number of vertices it has, but only on the error in approximation (represented by \varepsilon ).

In the proof of the regularity lemma we start with an initial partition (with low potential) and then iteratively refine the partition in such a way that the potential increases until the partition is epsilon-regular. However, this leads to a major problem (the first of our concerns): in refining the partition iteratively, the number of classes increases exponentially in each iteration. We then end up with a partition in which the number of classes is bounded by a tower function – usually an astronomical figure. Clearly, this does not work very well for graphs arising in practice. This same point is made by Prof. Trevisan:

[…] one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.

As an aside: For a long while it was thought that maybe a tower function is not necessary. However in a remarkable paper Timothy Gowers constructed graphs that demonstrated that functions of the tower type were indeed necessary.

In any case, how can we get around this problem so that the lemma can be useful in applications such as clustering, at least for most graphs? A possible solution to this problem in the context of clustering was proposed by Sperotto and Pelillo. They make some interesting points, however they do not provide many details on their approach. We will get back to this problem in a short while.

But the first problem that we face is the following: the Szemeredi Lemma as originally proposed is an existential statement. It does not give a method to obtain the regular partition for a given graph, but only says that one must exist! So if we are to use the Lemma in any practical setting, we need an algorithmic version. There now exist two algorithmic versions:

1. One proposed in a paper by Alon et al. in 1992.

2. Another proposed by Frieze and Kannan, based on the very intriguing relationship between singular values and regularity!

For now, we focus on the Alon et al. version.

_________________________

Algorithmic Version of the Regularity Lemma:

Theorem 2: [A Constructive Version of the Regularity Lemma – Alon et al.]

For every \varepsilon > 0 and every positive integer t there is an integer Q = Q(\varepsilon, t) such that every graph with n > Q vertices has an \varepsilon -regular partition into k + 1 classes, where t \le k \le Q . For every fixed \varepsilon > 0 and t \ge 1 such a partition can be found in O(M(n)) sequential time, where M(n) is the time for multiplying two n by n matrices with 0,1 entries over the integers. It can also be found in time O(\log n) on an EREW PRAM with a polynomial number of parallel processors.

Really understanding the above theorem would require the introduction of several supporting definitions and lemmas (including one from Szemeredi’s original paper). Since our focus is on the application, we’ll just state one lemma that reveals the idea behind the constructive version in a more concrete sense.

Lemma 1: [Alon et al.] Let H be a bipartite graph with equally sized classes |A| = |B| = n . Let \displaystyle 2n^{-\frac{1}{4}} < \varepsilon < \frac{1}{16} . There is an O(M(n)) algorithm that verifies that H is \varepsilon -regular or finds two subsets A' \subset A , B' \subset B , \displaystyle |A'| \ge \frac{{\varepsilon}^4}{16}n , \displaystyle |B'| \ge \frac{{\varepsilon}^4}{16}n , such that \displaystyle |d(A, B) - d(A', B')| \ge \varepsilon^4 . The algorithm can be parallelized and implemented in NC.

This lemma basically says that there is a fast algorithm that either verifies that a bipartite graph is epsilon-regular or returns a certificate (a proof) that it is not. This certificate is nothing but a pair of subsets of the two classes of the original bipartite graph whose density deviates substantially from the overall density. The idea of the certificate is to guide the refinement in the next step.
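To illustrate the verify-or-certify interface (and only the interface: the sketch below is a naive randomized check of my own, not the O(M(n)) procedure of Alon et al.), one could sample subset pairs of the minimum allowed size and look for a density deviation:

```python
import numpy as np

def naive_regularity_check(adj, A, B, eps, trials=200, rng=None):
    """Illustration only: either report a bipartite pair (A, B) as (apparently)
    eps-regular, or return a witnessing pair of subsets as a certificate.
    This is a sampled check, not the O(M(n)) algorithm of Lemma 1."""
    rng = rng or np.random.default_rng()
    A, B = np.asarray(A), np.asarray(B)
    d_AB = adj[np.ix_(A, B)].sum() / (len(A) * len(B))
    nx = min(len(A), int(np.ceil(eps * len(A))) + 1)   # |X| > eps|A|
    ny = min(len(B), int(np.ceil(eps * len(B))) + 1)   # |Y| > eps|B|
    for _ in range(trials):
        X = rng.choice(A, size=nx, replace=False)
        Y = rng.choice(B, size=ny, replace=False)
        d_XY = adj[np.ix_(X, Y)].sum() / (nx * ny)
        if abs(d_XY - d_AB) >= eps:
            return "irregular", (X, Y)   # certificate: subsets witnessing the deviation
    return "regular", None               # no violation found among the sampled subsets
```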

The general idea in the Alon algorithm is:

Start with a random equitable partition of the graph into classes C_0, C_1, \dots, C_b , where \displaystyle |C_i| = \left\lfloor \frac{n}{b} \right\rfloor for i = 1, 2, \dots, b and |C_0| < b . Let k_1 = b .

Then, for each pair C_r, C_s in the partition P_i , check for regularity. If the pair is regular, report so; if not, find a certificate for the pair. If, out of all the pairs, at most \varepsilon k_i^2 are not epsilon-regular, then halt: the partition P_i is epsilon-regular. If not, we refine the partition using the information gained from the certificates of the pairs that are not epsilon-regular. On refining this partition we obtain a new partition P' with 1 + k_i 4^{k_i} classes. We repeat this process till we hit a partition which is epsilon-regular.
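A high-level sketch of this outer loop might look as follows (hypothetical code of my own: check_pair and refine are placeholders for the regularity test and the certificate-driven refinement, and their interfaces are assumptions made purely for illustration):

```python
import random

def regular_partition(adj, vertices, b, eps, check_pair, refine, max_iters=10):
    """Sketch of the iterative scheme described above.

    check_pair(adj, Cr, Cs, eps) -> (is_regular, certificate_or_None)
    refine(classes, exceptional, certificates) -> (new_classes, new_exceptional)
    """
    vertices = list(vertices)
    random.shuffle(vertices)                     # random equitable starting partition
    size = len(vertices) // b
    classes = [vertices[i * size:(i + 1) * size] for i in range(b)]
    exceptional = vertices[b * size:]            # C_0, with |C_0| < b

    for _ in range(max_iters):
        k = len(classes)
        certificates, irregular = {}, 0
        for r in range(k):
            for s in range(r + 1, k):
                ok, cert = check_pair(adj, classes[r], classes[s], eps)
                if not ok:
                    irregular += 1
                    certificates[(r, s)] = cert
        if irregular <= eps * k * k:             # at most eps*k^2 irregular pairs: done
            return classes, exceptional
        # otherwise refine using the certificates (this is where the number of
        # classes grows, e.g. to 1 + k*4^k in the original scheme)
        classes, exceptional = refine(classes, exceptional, certificates)
    return classes, exceptional
```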

_________________________

Using the Constructive Version for Clustering:

A little earlier it was pointed out that there are two main difficulties in using the Szemeredi Lemma in practical settings. At one level, to use the lemma in a practical setting we need a constructive version. However the constructive versions are so designed that they work for all graphs. To illustrate how this is a problem, we quote a line from the algorithmic version based on singular values (Frieze and Kannan):

The Algorithm finishes in at most O(\varepsilon^{-45}) steps with an \varepsilon -regular partition

Now, consider what this implies: if \varepsilon is \frac{1}{16} (a typical value), then the number of steps in which we are guaranteed to find a regular partition is 1.53249554 × 10^54.

This is an astonishingly large number. To make the lemma truly practical, we would have to give up the aim of making it work for all graphs. Instead, we should be content with it working on most graphs, which would largely cover the graphs that appear in practice. As was mentioned earlier, some directions in this regard were provided by Sperotto and Pelillo.

Another problem (as was mentioned earlier too) is that the number of classes grows exponentially with each refinement step. We cannot allow the number of classes to grow astronomically either, because on graphs of practical size we would then never be able to refine the partition far enough. Here we would have to compromise on generality again: instead of allowing the number of classes to grow exponentially, we could allow it to grow by a fixed amount in each iteration. This amount could be chosen depending on a number of parameters, including the size of the data set.

Even with such an approximation, the potential still increases with each iteration, albeit more slowly than with the methodology as originally described. In fact, we would say that it works for most graphs; in our work this approach seems to work quite well.
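As a rough illustration of the capped growth just described (again a simplification of my own, not necessarily the exact refinement used in the paper under discussion), one could split only a bounded number of classes per iteration, using the irregularity certificates to decide where to split:

```python
def capped_refine(classes, certificates, max_splits=4):
    """Illustration only: split at most `max_splits` classes per iteration
    instead of letting the number of classes grow exponentially.

    classes: list of sets of vertices.
    certificates: dict mapping a class index to a witnessing subset of that
                  class (extracted from the pair certificates).
    Note: re-balancing to keep the partition equitable is omitted here.
    """
    new_classes, budget = [], max_splits
    for i, C in enumerate(classes):
        cert = set(certificates.get(i, ()))
        inside, outside = C & cert, C - cert
        if budget > 0 and inside and outside:
            new_classes.extend([inside, outside])   # split along the certificate
            budget -= 1
        else:
            new_classes.append(C)                   # leave the class untouched
    return new_classes
```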

Given these two changes, what still remains is the question of how the lemma could be used for clustering datasets. One possible way is suggested by another result, called the Key Lemma, and the implications it might have. This is stated below:

Lemma 2: Given an arbitrary graph G=(V,E) , a partition of V into k clusters as in the regularity lemma described above, and two parameters \varepsilon, d , we define the reduced graph G^R as the graph whose vertices correspond to the clusters and whose edges correspond to the \varepsilon -regular pairs with density above d . If we have a coloring on the edges of G , then each edge of the reduced graph is colored with a color that appears on most of the edges between the corresponding two clusters.

Using the properties of what is called the Key Lemma/Blow-Up Lemma, it is understood that this reduced graph retains many properties of the original graph, and thus results obtained on the reduced graph can be carried back to the original graph.

Thus, one possible way to cluster is using a two-phase clustering strategy: First, we use the modified regularity lemma to construct a reduced graph using the method above. In the second phase, we use a pairwise clustering method such as spectral clustering to cluster the reduced graph. The results obtained on this reduced graph are then projected back onto the original graph using Lemma 2.

Two Phase Strategy to use the Regularity Lemma
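A sketch of this two-phase pipeline could look like the following (names and parameters are mine; using scikit-learn’s SpectralClustering for the second phase is just one convenient choice):

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import SpectralClustering

def two_phase_cluster(adj, classes, regular_pairs, densities, d_min, n_clusters):
    """Sketch of the two-phase strategy.

    classes:       vertex-index lists from the (modified) regular partition.
    regular_pairs: set of (r, s) index pairs found to be eps-regular.
    densities:     dict mapping (r, s) to the pair density d(C_r, C_s).
    """
    k = len(classes)
    # Phase 1: build the reduced graph -- one node per class, with a weighted
    # edge between eps-regular pairs whose density exceeds d_min (Lemma 2).
    W = np.zeros((k, k))
    for r, s in combinations(range(k), 2):
        if (r, s) in regular_pairs and densities[(r, s)] > d_min:
            W[r, s] = W[s, r] = densities[(r, s)]

    # Phase 2: cluster the (much smaller) reduced graph.
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)

    # Project the reduced-graph labels back onto the original vertices.
    # Vertices in the exceptional class C_0, if any, are left unlabelled (-1).
    vertex_labels = np.full(adj.shape[0], -1, dtype=int)
    for r, C in enumerate(classes):
        vertex_labels[C] = labels[r]
    return vertex_labels
```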

In fact, such a methodology gives quite encouraging results as compared to other standard clustering methods. These results will be reported here once the paper under consideration is published. :) Another thing to be noted is that the reduced graph is usually quite small compared to the original graph, and working on this smaller, reduced graph is much faster.

_________________________

Other Possible Changes:

1. In the above, the Alon et al. algorithm is modified to use the regularity lemma for clustering. It would be interesting to explore the results given by the Frieze & Kannan method, as it has a different way of finding the certificate.

2. There has been some work on the sparse regularity lemma. This would have a lot of practical value, as in the above we construct a dense graph. Using the sparse version would allow us to work with nearest-neighbour graphs instead of dense graphs, which would reduce the computational burden significantly.

3. A recent paper by Fischer et al., “Approximate Hypergraph Partitioning and Applications”, has received a lot of attention. In it they give a new approach for finding regular partitions. All the previous approaches find partitions of tower-type size, while this paper gives a method to find a smaller regular partition if one exists in the graph. Employing this methodology for refinement, instead of using an approximate version of the algorithmic regularity lemma, could be a fruitful direction of work.

_________________________


I recently came across a very elegant way of saying that in a certain sense the k-means clustering algorithm is guaranteed to converge. I thought this was something worthy of sharing. Before I do that, for the sake of completeness I will (like in other posts on the blog) just summarize k-means and not assume any background.


A Short Introduction:

In my opinion, k-means is in a way the most commonly used machine learning algorithm (not just the most commonly used clustering algorithm). Its popularity is perhaps a function of:

1. Importance of clustering in data analysis across domains.

2. Ease of implementation of k-means.

It is another story though that little is known about it theoretically. This is something I seriously intend to write about in future posts.

Clustering: The idea of clustering is to find structure in the data. Given a dataset, the task of clustering is to find groups in this set, such that “similar” points are in one group and “dissimilar” ones in others. In k-means clustering, this notion of similarity is usually based on the Euclidean distance. The algorithm roughly goes as follows (a small code sketch is given after the list).

1. Choose a value of “k” i.e number of clusters.

2. Initialize k cluster centroids in the data-set randomly.

3. For each point in the data, find the Euclidean distance to each of the cluster centroids initialized in step 2.

4. Assign each point in the data-set to the cluster centroid it is closest to.

5. Update each cluster centroid to the average of all points assigned to it.

6. Repeat steps 3 – 5 till convergence.
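For concreteness, here is a minimal NumPy sketch of these steps (the interface and the handling of empty clusters are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """Minimal k-means sketch. X: (m, d) data matrix. Returns the centroids,
    the assignments and the distortion after each iteration."""
    rng = rng or np.random.default_rng()
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random init
    history = []
    for _ in range(n_iters):
        # steps 3-4: assign each point to the nearest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        history.append(dists[np.arange(len(X)), labels].sum())  # distortion J (see below)
        # step 5: move each centroid to the mean of the points assigned to it
        # (an empty cluster simply keeps its old centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                # step 6: converged
            break
        centroids = new_centroids
    return centroids, labels, history
```

On, say, kmeans(np.random.randn(500, 2), 3) the recorded history is non-increasing, which is exactly the convergence behaviour discussed below.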

Optimization Function: Informally, the objective of k-means can be stated as: find a set of cluster centroids such that all points in a “cluster” are closer to that centroid than to any other, i.e. find cluster centroids such that the sum of squared distances of all points to their assigned cluster centroids is as low as possible.

This optimization objective is to minimize the distortion function:

\displaystyle J(c, \mu) = \sum_{i = 1}^m \lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2

where \mu_{c^{(i)}} is the cluster centroid to which the point x^{(i)} is assigned. Note: k-means is simply coordinate descent on J .
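To spell the coordinate descent view out (this is just the standard restatement, nothing new): holding the centroids fixed, J is minimized over the assignments by

\displaystyle c^{(i)} := \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2

and holding the assignments fixed, J is minimized over the centroids by

\displaystyle \mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}

Each of these two steps can only decrease J , which is precisely the behaviour formalized in the convergence discussion below.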

Convergence: It is not hard to see that the distortion function J is non-convex, and thus k-means is prone to getting stuck in local minima, i.e. one is not guaranteed to find the best clustering according to J from a random initialization. Therefore, people usually initialize k-means randomly a few times and then pick the best solution.

Lyapunov Function: But we do know that k-means is quite “well behaved” usually. This is because k-means is guaranteed to converge in a certain sense. That is, in successive iterations of k-means J decreases monotonically to a certain minimum value. It would never increase from one iteration to another.

A way to formalize this notion is using the idea of a Lyapunov Function as [1]:

1. We can associate an “energy” with a given state of the k-means algorithm by connecting a spring between each point and the centroid it is assigned to.

2. The energy of each spring is proportional to its squared length, i.e. E = \frac{1}{2}kx^2 , where x is the length of the spring, in this case the distance between the mean (centroid) and a point. This “k” is the stiffness of the spring and not the “k” of k-means.

3. The total energy of all such springs is a Lyapunov Function for the algorithm BECAUSE:

(i) The assignment to a cluster centroid only happens if there is a decrease in energy. A point gets assigned to another cluster centroid only if its energy will decrease with such an assignment.

(ii) Each iteration of updating the centroids can only decrease the energy of the springs, since moving a centroid to the mean of its assigned points minimizes the total energy of the springs attached to it.

(iii) This energy is bounded below (by the minimum of J ).

This discussion shows that the algorithm admits a Lyapunov function, and since it has a Lyapunov function, it converges.

_________________

References:

[1] David MacKay: Information Theory, Inference and Learning Algorithms (University of Cambridge, Free PDF available)

_________________
