Feeds:
Posts
Comments

Posts Tagged ‘Unsupervised Learning’

Deep Learning reads Wikipedia and discovers the meaning of life – Geoff Hinton.

The above quote is from a very interesting talk by Geoffrey Hinton I had the chance to attend recently.

I have been at a summer school on Deep Neural Nets and Unsupervised Featured Learning at the Institute for Pure and Applied Mathematics at UCLA since July 9 (till July 27). It has been organized by Geoff Hinton, Yoshua Bengio, Stan Osher, Andrew Ng and Yann LeCun.

I have always been a “fan” of Neural Nets and the recent spike in interest in them has made me excited, thus the school happened at just the right time. The objective of the summer school is to give a broad overview of some of the recent work in Deep Learning and Unsupervised Feature Learning with emphasis on optimization, deep architectures and sparse representations. I must add that after getting here and looking at the peer group I would consider myself lucky to have obtained funding for the event!

[Click on the above image to see slides for the talks. Videos will be added at this location after July 27 Videos are now available]

That aside, if you are interested in Deep Learning or Neural Networks in general, the slides for the talks are being uploaded over here (or click on the image above), videos will be added at the same location some time after the summer school ends so you might like to bookmark this link.

The school has been interesting given the wide range of people who are here. The diversity of opinions about Deep Learning itself has given a good perspective on the subject and the issues and strengths of it. There are quite a few here who are somewhat skeptical of deep learning but are curious, while there are some who have been actively working on the same for a while. Also, it has been enlightening to see completely divergent views between some of the speakers on key ideas such as sparsity. For example Geoff Hinton had a completely different view of why sparsity was useful in classification tasks than compared to Stéphane Mallat, who gave a very interesting talk today even joking that “Hinton and Yann LeCun told you why sparsity is useful, I’ll tell you why sparsity is useless. “. See the above link for more details.

Indeed, such opinions do tell you that there is a lot of fecund ground for research in these areas.

I have been compiling a reading list on some of this stuff and will make a blog-post on the same soon.

________________

Onionesque Reality Home >>

Read Full Post »

Synopsis of a recent direction of work with Gábor Sárközy, Endre Szemerédi and Fei Song — “The Regularity Lemma is a deep result from extremal graph theory having great utility as a fundamental tool to prove theoretical results, but can it be employed in more “practical” settings?”

More specifically we are interested in the problem of harnessing the power of the regularity lemma to do clustering. This blog post is organized as follows: We first sketch the regularity lemma, we then see that it is an existential predicate and state an algorithmic version, we then look at how this constructive version may be used for clustering/segmentation. It must be noted that the idea seems to have originated from an earlier interesting work by Sperotto and Pellilio.

Before a brief introduction to the regularity lemma, I’d quote Luca Trevisan on (the related) Szemeredi’s Theorem from his blog

Szemeredi’s theorem on arithmetic progressions is one of the great triumphs of the “Hungarian approach” to mathematics: pose very difficult problems, and let deep results, connections between different areas of math, and applications, come out as byproducts of the search for a solution.

Even though I am nothing close to being a mathematician, but we’ll see that even an Electrical Engineer can somewhat understand and appreciate what Prof. Trevisan means in this specific context! ;-)

I will attempt to sketch a few things below with more emphasis on intuition than rigour! To be specific – intuition and then rigour.

_________________________

Introduction: One of the most fundamental and ingenious results in graph theory and discrete mathematics is the Szemeredi Regularity Lemma. It was originally advanced by Endre Szemeredi as an auxiliary result to prove a long standing conjecture of Erdős and Turán from 1936 (on the Ramsey properties of arithmetic progressions), which stated that sequences of integers with postive upper density must contain arbitrarily long arithmetic progressions (now called as Szemeredi’s theorem (1978)). Now the regularity lemma by itself is considered as one of the most important tools in graph theory.

Prof. Endre Szemeredi: Originator of the "Regularity Method"

A very rough statement of the regularity lemma could be made as follows:

Every graph can be approximated by random graphs. This is in the sense that every graph can be partitioned into a bounded number of equal parts such that:
1. Most edges run between different parts
2. And that these edges behave as if generated at random.

The fundamental nature of the Regularity Lemma as a tool in Graph Theory (and beyond) could be understood by the fact that two recent Fields Medals (Timothy Gowers, 1998 and Terence Tao, 2006) were awarded atleast in part for achieving breakthrough results on the regularity lemma and using those to prove fundamental results on arithmetic progressions amongst other results (Such as the Green-Tao Theorem).

_________________________

A Temporary Digression and Prof. Trevisan’s Statement: A first look at the statement of the regularity lemma above reminds one (especially those trained in signal processing I suppose) of the Fourier series. While this is extremely simplified – It’s not entirely incorrect to say that it points in part to a deeper relationship between combinatorics and analysis. Let’s look at it this way – Szemeredi employed the regularity lemma (a graph theoretic method) to prove what is now called Szemeredi’s theorem. The Szemeredi Theorem is a difficult one and it has four different proofs (graph-theoretic, ergodic, hypergraph-theoretic and lastly based on Fourier analysis)*.  Infact work on the Szemeredi theorem has contributed in better understanding connections between these areas (as an example – Timothy Gowers was awarded the Fields Medal for proving very deep results on the connections between analysis and combinatorics).

Terence Tao and Timothy Gowers have made breakthrough contributions to the regularity lemma and Szemeredi's Theorem amongst numerous others (Image Source: Republic of Math)

So it could be said that it all really started with Erdős and Turán posing a hard problem – The search for it’s solution has led to not only some good theorems but also some interesting connections between different areas of mathematics and thus a better understanding of all of them as a result. So we see what Prof. Trevisan means!

*While here the connection appears very sketchy though intuitive. The strong relationship between the Regularity Lemma and Analysis was shown in a breakthrough paper by László Lovász and Balázs Szegedy. In this paper they showed that the regularity lemma could be thought of as a result in analysis

Szemeredi Regularity Lemma for the Analyst (László Lovász and Balázs Szegedy)

_________________________

Given the background now we get to stating Szemeredi’s Regularity Lemma:

The Szemeredi Lemma: Expanding upon the rough statement mentioned a couple of paragraphs above, we could be a little more specific and say that

The Lemma states that every graph could be partitioned into a bounded number of quasi-random bi-partite graphs, called regular pairs and a few left over edges.

Now for a more precise statement, we introduce some definitions and notation.

Definition 1: If G = (V,E) is a graph  and A, B are disjoint subsets of V . Then, we denote the count of number edges with one endpoint in A and another in B by e(A,B) . The density of edges between A and B can then be defined as:

\displaystyle d(A,B) = \frac{e(A,B)}{|A||B|}

This just defines the edge density in a bipartite graph.

Definition 2: The bipartite graph G = (A, B, E) is called \varepsilon -regular if for every X \subset A and Y \subset B satisfying

|X| > \varepsilon |A|, |Y| > \varepsilon |B|

we would have |d(X,Y) - d(A,B)| < \varepsilon

This means that a bipartite graph is epsilon regular if we were to take any random subsets (of some minimum size) X, Y of A,B respectively and even then find that the edge density between these subsets is almost the same as the edge density in the original bipartite graph. In effect this implies that if a bipartite graph is epsilon regular then all the edges between the the two disjoint sets are distributed uniformly.

Definition 3: An equitable partition of the vertex set V of a graph G = (V,E) is a partition of V such that all the classes C_0, C_1, \dots, C_k are pairwise disjoint. And that all classes C_i with 1 \leq i \leq k have the same cardinality. It is noteworthy that oftentimes the vertex set might not have a cardinality that could be equally divided into the said number of classes, thus C_0 is present for a technical reason: To ensure that all the classes have the same cardinality.

Definition 4: For every equitable partition P of the vertex set V of G = (V,E) into classes C_0, C_1, \dots, C_k we associate a measure called the potential or the index of the partition P , which is defined as:

\displaystyle ind(P) = \frac{1}{k^2} \sum_{1 \leq r \leq s \leq k} d(C_r, C_s)^2

This measure just defines how close a partition is close to a regular one.

Definition 5: An equitable partition C_0, C_1, \dots, C_k of the vertex set of graph G given by V , where C_0 is the exceptional set, is called \varepsilon -regular if C_0 < \varepsilon |V| and all but \varepsilon k^2 of the pairs (C_i,C_j) are \varepsilon -regular such that 1 \leq i < j \leq k .

We are now in a position to state the regularity lemma:

Theorem 1: [Regularity Lemma]
For every positive \varepsilon and positive integer m, there are positive integers N = N(\varepsilon, m) and M = M(\varepsilon, m) such that the following property holds: For all graphs G with |V|\geq N , there is an \varepsilon -regular partition of G(V,E) into k+1 classes such that m\leq k \leq M .

The beauty of the regularity lemma is in the point that that the approximation for any graph does not depend on the number of vertices it has, but only on the error in approximation (represented by \varepsilon ).

In the proof of the regularity lemma we start with an initial partition (with low potential) and then iteratively refine the partition in such a way that the potential increases to a point such that the partition is epsilon regular. However, this leads to a major problem (the first of our concerns) – In refining the partitions iteratively the number of classes increases exponentially in each iteration. We then end up with a partition in which the number of classes is a tower function – usually an astronomical figure. Clearly, this does not work very well for graphs arising in practice. This same point is made by Prof. Trevisan:

[…] one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.

As an aside: For a long while it was thought that maybe a tower function is not necessary. However in a remarkable paper Timothy Gowers constructed graphs that demonstrated that functions of the tower type were indeed necessary.

In any case, how can we get around this problem for approximating most graphs so that it could be useful in applications such as clustering? A possible solution to this problem in the context of clustering was proposed by Sperotto and Pelillo. They make some interesting points, however do not provide many details on their approach. We will get back to this problem in a short while.

But the first problem that we face is the following: The Szemeredi Lemma as originally proposed is an existential predicate. It does not give a method to obtain the regular partition for a given graph, but only says that one must exist! So if we are to use the Lemma in any practical setting, we need an algorithmic version. There now exist two algorithmic versions:

1. One proposed in a paper by Alon et al in 1992.

2. Another proposed by Frieze and Kannan and is based on the very intriguing relationship between singular values and regularity!

We for now, focus on the Alon et al. version.

_________________________

Algorithmic Version of the Regularity Lemma:

Theorem 2: [A Constructive Version of the Regularity Lemma – Alon et al.]

For every \varepsilon > 0 and every positive integer t there is an integer Q = Q(\varepsilon, t) such that every graph with n > Q vertices has an \varepsilon -regular partition into k + 1 classes, where t \le k \le Q . For every fixed \varepsilon > 0 and t \ge 1 such a partition can be found in O(M(n)) sequential time, where M(n) is the time for multiplying two n by n matrices with 0,1 entries over the integers. It can also be found in time O(logn) on an ERPW PRAM with a polynomial number of parallel processors.

To really understand how the above theorem would require the introduction of several supporting definitions and lemmas (including one from Szemeredi’s original paper). Since our focus is on the application, we’d just state one lemma using which the idea behind the constructive version could be revealed in a more concrete sense.

Lemma 1: [Alon et al.] Let H be a bipartite graph with equally sized classes |A| = |B| = n . Let \displaystyle 2n^{\frac{-1}{4}} < \varepsilon <\frac{1}{16} . There is an O(M(n)) algorithm that verifies that H is \varepsilon -regular or finds two subset A' \subset A , B' \subset B , \displaystyle |A'| \ge \frac{{\varepsilon}^4}{16}n , \displaystyle |B'| \ge \frac{{\varepsilon}^4}{16}n , such that \displaystyle |d(A, B) - d(A', B')| \ge \varepsilon^4 . The algorithm can be parallelized and implemented in NC.

This lemma basically says that whether or not a bipartite graph is epsilon regular is a question that can be answered very quickly. If it is – then we can have an algorithm to say so and if it is not, it should return the answer with a certificate or proof that it is not so. This certificate is nothing but subsets of the classes from the original bipartite graph. The idea of the certificate is to help to proceed to the next step.

The general idea in the Alon algorithm is:

Start with a random equitable partition of the graph C_0, C_1, \dots, C_b , where \displaystyle |C_i| = \lfloor\ \frac{n}{b} ;\rfloor and i = 1, 2, \dots, b . Also |C_0| < b . Also let k_1 = b .

Then, for each pair C_r, C_s in the partition P_i check for regularity. If it is regular report so, if not find the certificate for the pair. If out of all the pairs, only \varepsilon k_i^2 are not epsilon regular – then halt, the partition P_i is epsilon regular. If not then we refine this partition using the information gained from the certificates in the cases when the pairs not epsilon regular. On refining this partition we obtain a new partition P' with 1 + k_i4^{k_i} classes. We repeat this process till we hit a partition which is epsilon regular.

_________________________

Using the Constructive Version for Clustering:

A little earlier it was pointed out that there are two main difficulties in using the Szemeredi Lemma in practical settings. At one level, to use the lemma in a practical setting we need a constructive version. However the constructive versions are so designed that they work for all graphs. To illustrate how this is a problem, we quote a line from the algorithmic version based on singular values (Freize and Kannan):

The Algorithm finishes in atmost O(\varepsilon^{-45}) steps with an \varepsilon -regular partition

Now, consider what this implies – If \varepsilon is \frac{1}{16} (a typical value). Then the number of steps in which we are guaranteed to find a regular partition is 1.53249554 × 1054.

This is astonishingly large number. To make the lemma truly practical, we would have to give up the aim of making it work for all graphs. Instead we should be content that it works on most graphs. Most graphs would largely be graphs that appear in practice. Like was mentioned earlier some directions in this regard were provided by Sperotto and Pellilio.

Another problem (as was mentioned earlier too) was that the number of classes grows exponentially with each refinement step. We can not allow the number of classes to grow astronomically either because if we do so then consequently we would never be able to refine the partitions far enough. Here we would have to compromise on the generality again. Instead of allowing the number of classes to grow exponentially we could allow it to grow by a defined value in each iteration. This value could be chosen depending on a number of parameters, including the data set size.

Even in making such an approximation – the potential would always increase with each iteration albeit much slowly as compared to the methodology as originally described. Infact we would say that it would work for most graphs. In our work this approach seems to work quite well.

Given these two changes, what still remains is how could the lemma be used for clustering datasets? One possible way is suggested by another result called the Key Lemma and the implication that it might have. This is stated below:

Lemma 2: Given an arbitrary graph G=(V,E) , a partition of V in k clusters as in regularity lemma described above and two parameters \varepsilon, d , we describe the reduced graph G^R as the graph whose vertices are associated to the clusters and whose edges are associated to the \varepsilon -regular pairs with density above d . If we have a coloring on the edges of G, then the edges of the reduced graph will be colored with a color that appears on most of the edges between the two clusters.

Using the properties of what is called the Key-Lemma/Blow-Up Lemma, it is understood that this reduced graph should retain properties of the original graph. And thus any changes made on this reduced graph would reflect on the original graph.

Thus, one possible way to cluster is using a two-phase clustering strategy: First, we use the modified regularity lemma to construct a reduced graph using the method above. In the second phase, we use a pairwise clustering method such as spectral clustering to do cluster the the reduced graph. The results obtained on this reduced graph are then projected back onto the original graph using Lemma 2.

Two Phase Strategy to use the Regularity Lemma

Infact such a methodology gives results quite encouraging as compared to other standard clustering methods. These results would be reported here once the paper under consideration is published. :) Another thing to be noted is that the reduced graph is usually quite small as compared to the original graph and working on this smaller – reduced graph is much faster.

_________________________

Other Possible Changes:

1. The Alon et al. algorithm is used and modified to use the regularity lemma for clustering. It would be interesting to explore the results given by the Frieze & Kannan method as it has a different way to find the certificate.

2. There has been some work on the sparse regularity lemma. This is something that would have a lot of practical value as in the above we construct a dense graph. Using the sparse version would allow us to use nearest neighbor graphs instead of dense graphs. This would reduce the computational burden significantly.

3. A recent paper by Fischer et al.Approximate Hypergraph Partitioning and Applications” has received a lot of attention. In this paper they give a new approach for finding regular partitions. All the previous ones work to find partitions of the tower type, while this paper gives a method to find a smaller regular partition if one exists in the graph. Employing this methodology for refinement instead of using an approximate version of the algorithmic regularity lemma could be a fruitful direction of work.

_________________________

Onionesque Reality Home >>

Read Full Post »