Archive for the ‘Machine Learning’ Category

A semi-useful fact for people working in Machine Learning.

Blurb: Recently, my supervisor was kind enough to institute a small reading group on Deep Neural Networks to help us understand them better. A side product of this would hopefully be that I get to explore this old interest of mine (I have an old interest in Neural Nets and have always wanted to study them in detail, especially work from the mid 90s when the area matured and connections to many others areas such as statistical physics became apparent. But before I got to do that, they seem to be back in vogue! I dub this resurgence as Connectionism 2.0 Although the hipster in me dislikes hype, it is always lends a good excuse to explore an interest). I also attended a Summer School on the topic at UCLA, find slides and talks over here.

Coming back: we have been making our way through this Foundations and Trends volume by Yoshua Bengio (which I quite highly recommend). Bengio spends a lot of time giving intuitions and arguments on why and when Deep Architectures are useful. One particular argument (which is actually quite general and not specific to trees as it might appear) went like this:

Ensembles of trees (like boosted trees, and forests) are more powerful than a single tree. They add a third level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters. As illustrated in […], they implicitly form a distributed representation […]

This is followed by an illustration with the following figure:

[Image Source: Learning Deep Architectures for AI]

and accompanying text to the above figure:

Whereas a single decision tree (here just a two-way partition) can discriminate among a number of regions linear in the number of parameters (leaves), an ensemble of trees can discriminate among a number of regions exponential in the number of trees, i.e., exponential in the total number of parameters (at least as long as the number of trees does not exceed the number of inputs, which is not quite the case here). Each distinguishable region is associated with one of the leaves of each tree (here there are three 2-way trees, each defining two regions, for a total of seven regions). This is equivalent to a multi-clustering, here three clusterings each associated with two regions.

Now, the following question was considered: Referring to the text in boldface above, is the number of regions obtained exponential in the general case? It is easy to see that there would be cases where it is not exponential. For example: the number of regions obtained by the three trees would be the same  as those obtained by one tree if all three trees overlap, hence giving no benefit. The above claim refers to a paper (also by Yoshua Bengio) where he constructs examples to show the number of regions could be exponential in some cases.

But suppose for our satisfaction we are interested in actual numbers and the following general question, an answer to which should also answer the question raised above:

What is the maximum possible number of regions that can be obtained in {\bf R}^n by the intersection of m hyperplanes?

Let’s consider some examples in the {\bf R}^2 case.

Where there is one hyperplace i.e. m = 1, then the maximum number of possible regions is obviously two.

[1 Hyperplane: 2 Regions]

Clearly, for two hyerplanes, the maximum number of possible regions is 4.

[2 Hyperplanes: 4 Regions]

In case there are three hyperplanes, the maximum number of regions that might be possible would be 7 as illustrated in the first figure. For m =4 i.e. 4 hyerplanes, this number is 11 as shown below:

[4 Hyperplanes: 11 Regions]

On inspection we can see the number of regions with m hyperplanes in ${\bf R}^2$ is given by:

Number of Regions = #lines + #intersections + 1

Now let’s consider the general case: What is the expression that would give us the maximum number of regions possibles with m hyperplanes in {\bf R}^n? Turns out that there is a definite expression for the same:

Let \mathcal{A} be an arrangement of m hyperplanes in {\bf R}^n, then the maximum number of regions possible is given by

\displaystyle r(\mathcal{A}) = 1 + m + \dbinom{m}{2} \ldots + \dbinom{m}{n}

the number of bounded regions (closed from all sides) is given by:

\displaystyle b(\mathcal{A}) = \dbinom{m - 1}{n}

The above expressions actually derive from a more general expression when \mathcal{A} is specifically a real arrangement.

I am not familiar with some of the techniques that are required to prove this in the general case (\mathcal{A} is a set of affine hyperplanes in a vector space defined over some Field).  The above might be proven by induction. However, it turns out that in the {\bf R}^2 case the result is related to the crossing lemma (see discussion over here), I think it was one of the most favourite results of one of my previous advisors and he often talked about it. This connection makes me want to study the details [1],[2], which I haven’t had the time to look at and I will post the proof for the general version in a future post.



1. An Introduction to Hyperplane Arrangements: Richard P. Stanley (PDF)


(Helpful for the proof)

2. Partially Ordered Sets: William Trotter (Handbook of Combinatorics) (PDF)


Onionesque Reality Home >>


Read Full Post »

Deep Learning reads Wikipedia and discovers the meaning of life – Geoff Hinton.

The above quote is from a very interesting talk by Geoffrey Hinton I had the chance to attend recently.

I have been at a summer school on Deep Neural Nets and Unsupervised Featured Learning at the Institute for Pure and Applied Mathematics at UCLA since July 9 (till July 27). It has been organized by Geoff Hinton, Yoshua Bengio, Stan Osher, Andrew Ng and Yann LeCun.

I have always been a “fan” of Neural Nets and the recent spike in interest in them has made me excited, thus the school happened at just the right time. The objective of the summer school is to give a broad overview of some of the recent work in Deep Learning and Unsupervised Feature Learning with emphasis on optimization, deep architectures and sparse representations. I must add that after getting here and looking at the peer group I would consider myself lucky to have obtained funding for the event!

[Click on the above image to see slides for the talks. Videos will be added at this location after July 27 Videos are now available]

That aside, if you are interested in Deep Learning or Neural Networks in general, the slides for the talks are being uploaded over here (or click on the image above), videos will be added at the same location some time after the summer school ends so you might like to bookmark this link.

The school has been interesting given the wide range of people who are here. The diversity of opinions about Deep Learning itself has given a good perspective on the subject and the issues and strengths of it. There are quite a few here who are somewhat skeptical of deep learning but are curious, while there are some who have been actively working on the same for a while. Also, it has been enlightening to see completely divergent views between some of the speakers on key ideas such as sparsity. For example Geoff Hinton had a completely different view of why sparsity was useful in classification tasks than compared to Stéphane Mallat, who gave a very interesting talk today even joking that “Hinton and Yann LeCun told you why sparsity is useful, I’ll tell you why sparsity is useless. “. See the above link for more details.

Indeed, such opinions do tell you that there is a lot of fecund ground for research in these areas.

I have been compiling a reading list on some of this stuff and will make a blog-post on the same soon.


Onionesque Reality Home >>

Read Full Post »

I thought I understood Spectral Clustering well enough till I came across these two paragraphs:

Graph Laplacians are interesting linear operators attached to graphs. They are the discrete analogues of the Laplace-Beltrami operators that appear in the Hodge theory of Riemannian manifolds, whose null spaces provide particularly nice representative forms for de Rham cohomology. In particular, their eigenfunctions produce functions on the vertex set of the graph. They can be used, for example, to produce cluster decompositions of data sets when the graph is the 1-skeleton of a Vietoris-Rips complex. We find that these eigenfunctions (again applied to the 1-skeleton of the Vietoris-Rips complex of a point cloud) also can produce useful filters in the Mapper analysis of data sets

– From Prof. Gunnar Carlsson’s survey Topology and Data. (More on this survey as a manifesto for “Topology and Data” in a later post). That aside, I do like how the image on the wiki entry for Vietoris-Rips complex looks like:

A little less intimidating ( now this almost borders on “ofcourse that’s how it is”. I am interested in the same reaction for the paragraph above some months later):

A related application [of the graph laplacian] is “Spectral Clustering”, which is based on the observation that nodal domains of the first eigenvectors of the graph laplacian can be used as indicators for suitably size-balanced minimum cuts.

– From Laplacian Eigenvectors of Graphs linked in the previous post. While this isn’t really as compressed as the lines above, they made me think since I did not know about Courant’s Nodal domain theorem. Like I did in the previous blog post, I would highly recommend this (about 120 page) book. It soon covers the Nodal Domain theorem and things make sense (even in context of links between PCA and k-means and Non-Negative Matrix Factorization and Spectral Clustering, at least in an abstract sense).


Onionesque Reality Home >>

Read Full Post »

Perhaps the most fundamental differential operator on Euclidean space R^d is the Laplacian – Terence Tao.

One can only agree with Prof. Tao. This agreement only intensifies when one considers the generalizations of the Laplacian such as the Laplace-Beltrami operators that appear in the Hodge theory of Riemannian Manifolds. Or when one considers the discrete analogues of the Laplace Beltrami operator such as the Graph Laplacian which I dare say have changed the landscape of research in unsupervised machine learning in the past decade or so. For a sample consider (the related) laplacian eigenmaps, spectral clustering and diffusion maps for just three examples. I have had the chance to work on Manifold learning for a while and have been very fascinated by the Graph Laplacian and its uncanny prowess. I was thus looking for material that actually relates and talks about the laplacian from the point of view of “flow”, diffusion and the heat equation beyond a superficial sense. The idea of diffusion or flow is a very interesting way of looking at distance, which is also partly the reason why the said machine learning techniques are so successful.  This paper (Semi-Supervised Learning on Riemannian Manifolds) by Belkin and Niyogi is quite beautiful from a machine learning point of view, but I was recently pointed out to this short volume by my supervisor (with the gentle warning that it might take a lot of work):

The Laplacian on a Riemannian Manifold – S. Rosenberg

Click on Image to view on Amazon

Thus, the title of the post is a little misleading, as this monograph has almost nothing to do (directly) with graph laplacians. But then I am only interested in this from the point of approach wherein I can get a better understanding of why they are so powerful. This seems like a slow read but is not inaccessible and is very well written. But the best thing about this book is that it is pointed. Some sections in between can be skipped without any problem too. In the worst case it could be considered a roadmap to what needs to be known to really understand the power of the graph laplacian.

Another book that I have been reading slowly these days (it’s actually almost like a long paper than a book and thinking of it that way has a big psychological impact), and quite enjoying is this one:

Laplacian Eigenvectors of Graphs

Click on Image to view on Amazon

Unlike the book above, this is far more accessible and I believe would be to somebody who has an interest in the graph laplacian or even spectral clustering. It mostly deals with some interesting issues related to the graph laplacian that I had never even heard of. While these books are not new, I just discovered them a month back and I think they should fascinate anybody who is fascinated with the Laplacian.


Onionesque Reality Home >>

Read Full Post »

Synopsis of a recent direction of work with Gábor Sárközy, Endre Szemerédi and Fei Song — “The Regularity Lemma is a deep result from extremal graph theory having great utility as a fundamental tool to prove theoretical results, but can it be employed in more “practical” settings?”

More specifically we are interested in the problem of harnessing the power of the regularity lemma to do clustering. This blog post is organized as follows: We first sketch the regularity lemma, we then see that it is an existential predicate and state an algorithmic version, we then look at how this constructive version may be used for clustering/segmentation. It must be noted that the idea seems to have originated from an earlier interesting work by Sperotto and Pellilio.

Before a brief introduction to the regularity lemma, I’d quote Luca Trevisan on (the related) Szemeredi’s Theorem from his blog

Szemeredi’s theorem on arithmetic progressions is one of the great triumphs of the “Hungarian approach” to mathematics: pose very difficult problems, and let deep results, connections between different areas of math, and applications, come out as byproducts of the search for a solution.

Even though I am nothing close to being a mathematician, but we’ll see that even an Electrical Engineer can somewhat understand and appreciate what Prof. Trevisan means in this specific context! ;-)

I will attempt to sketch a few things below with more emphasis on intuition than rigour! To be specific – intuition and then rigour.


Introduction: One of the most fundamental and ingenious results in graph theory and discrete mathematics is the Szemeredi Regularity Lemma. It was originally advanced by Endre Szemeredi as an auxiliary result to prove a long standing conjecture of Erdős and Turán from 1936 (on the Ramsey properties of arithmetic progressions), which stated that sequences of integers with postive upper density must contain arbitrarily long arithmetic progressions (now called as Szemeredi’s theorem (1978)). Now the regularity lemma by itself is considered as one of the most important tools in graph theory.

Prof. Endre Szemeredi: Originator of the "Regularity Method"

A very rough statement of the regularity lemma could be made as follows:

Every graph can be approximated by random graphs. This is in the sense that every graph can be partitioned into a bounded number of equal parts such that:
1. Most edges run between different parts
2. And that these edges behave as if generated at random.

The fundamental nature of the Regularity Lemma as a tool in Graph Theory (and beyond) could be understood by the fact that two recent Fields Medals (Timothy Gowers, 1998 and Terence Tao, 2006) were awarded atleast in part for achieving breakthrough results on the regularity lemma and using those to prove fundamental results on arithmetic progressions amongst other results (Such as the Green-Tao Theorem).


A Temporary Digression and Prof. Trevisan’s Statement: A first look at the statement of the regularity lemma above reminds one (especially those trained in signal processing I suppose) of the Fourier series. While this is extremely simplified – It’s not entirely incorrect to say that it points in part to a deeper relationship between combinatorics and analysis. Let’s look at it this way – Szemeredi employed the regularity lemma (a graph theoretic method) to prove what is now called Szemeredi’s theorem. The Szemeredi Theorem is a difficult one and it has four different proofs (graph-theoretic, ergodic, hypergraph-theoretic and lastly based on Fourier analysis)*.  Infact work on the Szemeredi theorem has contributed in better understanding connections between these areas (as an example – Timothy Gowers was awarded the Fields Medal for proving very deep results on the connections between analysis and combinatorics).

Terence Tao and Timothy Gowers have made breakthrough contributions to the regularity lemma and Szemeredi's Theorem amongst numerous others (Image Source: Republic of Math)

So it could be said that it all really started with Erdős and Turán posing a hard problem – The search for it’s solution has led to not only some good theorems but also some interesting connections between different areas of mathematics and thus a better understanding of all of them as a result. So we see what Prof. Trevisan means!

*While here the connection appears very sketchy though intuitive. The strong relationship between the Regularity Lemma and Analysis was shown in a breakthrough paper by László Lovász and Balázs Szegedy. In this paper they showed that the regularity lemma could be thought of as a result in analysis

Szemeredi Regularity Lemma for the Analyst (László Lovász and Balázs Szegedy)


Given the background now we get to stating Szemeredi’s Regularity Lemma:

The Szemeredi Lemma: Expanding upon the rough statement mentioned a couple of paragraphs above, we could be a little more specific and say that

The Lemma states that every graph could be partitioned into a bounded number of quasi-random bi-partite graphs, called regular pairs and a few left over edges.

Now for a more precise statement, we introduce some definitions and notation.

Definition 1: If G = (V,E) is a graph  and A, B are disjoint subsets of V . Then, we denote the count of number edges with one endpoint in A and another in B by e(A,B) . The density of edges between A and B can then be defined as:

\displaystyle d(A,B) = \frac{e(A,B)}{|A||B|}

This just defines the edge density in a bipartite graph.

Definition 2: The bipartite graph G = (A, B, E) is called \varepsilon -regular if for every X \subset A and Y \subset B satisfying

|X| > \varepsilon |A|, |Y| > \varepsilon |B|

we would have |d(X,Y) - d(A,B)| < \varepsilon

This means that a bipartite graph is epsilon regular if we were to take any random subsets (of some minimum size) X, Y of A,B respectively and even then find that the edge density between these subsets is almost the same as the edge density in the original bipartite graph. In effect this implies that if a bipartite graph is epsilon regular then all the edges between the the two disjoint sets are distributed uniformly.

Definition 3: An equitable partition of the vertex set V of a graph G = (V,E) is a partition of V such that all the classes C_0, C_1, \dots, C_k are pairwise disjoint. And that all classes C_i with 1 \leq i \leq k have the same cardinality. It is noteworthy that oftentimes the vertex set might not have a cardinality that could be equally divided into the said number of classes, thus C_0 is present for a technical reason: To ensure that all the classes have the same cardinality.

Definition 4: For every equitable partition P of the vertex set V of G = (V,E) into classes C_0, C_1, \dots, C_k we associate a measure called the potential or the index of the partition P , which is defined as:

\displaystyle ind(P) = \frac{1}{k^2} \sum_{1 \leq r \leq s \leq k} d(C_r, C_s)^2

This measure just defines how close a partition is close to a regular one.

Definition 5: An equitable partition C_0, C_1, \dots, C_k of the vertex set of graph G given by V , where C_0 is the exceptional set, is called \varepsilon -regular if C_0 < \varepsilon |V| and all but \varepsilon k^2 of the pairs (C_i,C_j) are \varepsilon -regular such that 1 \leq i < j \leq k .

We are now in a position to state the regularity lemma:

Theorem 1: [Regularity Lemma]
For every positive \varepsilon and positive integer m, there are positive integers N = N(\varepsilon, m) and M = M(\varepsilon, m) such that the following property holds: For all graphs G with |V|\geq N , there is an \varepsilon -regular partition of G(V,E) into k+1 classes such that m\leq k \leq M .

The beauty of the regularity lemma is in the point that that the approximation for any graph does not depend on the number of vertices it has, but only on the error in approximation (represented by \varepsilon ).

In the proof of the regularity lemma we start with an initial partition (with low potential) and then iteratively refine the partition in such a way that the potential increases to a point such that the partition is epsilon regular. However, this leads to a major problem (the first of our concerns) – In refining the partitions iteratively the number of classes increases exponentially in each iteration. We then end up with a partition in which the number of classes is a tower function – usually an astronomical figure. Clearly, this does not work very well for graphs arising in practice. This same point is made by Prof. Trevisan:

[…] one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.

As an aside: For a long while it was thought that maybe a tower function is not necessary. However in a remarkable paper Timothy Gowers constructed graphs that demonstrated that functions of the tower type were indeed necessary.

In any case, how can we get around this problem for approximating most graphs so that it could be useful in applications such as clustering? A possible solution to this problem in the context of clustering was proposed by Sperotto and Pelillo. They make some interesting points, however do not provide many details on their approach. We will get back to this problem in a short while.

But the first problem that we face is the following: The Szemeredi Lemma as originally proposed is an existential predicate. It does not give a method to obtain the regular partition for a given graph, but only says that one must exist! So if we are to use the Lemma in any practical setting, we need an algorithmic version. There now exist two algorithmic versions:

1. One proposed in a paper by Alon et al in 1992.

2. Another proposed by Frieze and Kannan and is based on the very intriguing relationship between singular values and regularity!

We for now, focus on the Alon et al. version.


Algorithmic Version of the Regularity Lemma:

Theorem 2: [A Constructive Version of the Regularity Lemma – Alon et al.]

For every \varepsilon > 0 and every positive integer t there is an integer Q = Q(\varepsilon, t) such that every graph with n > Q vertices has an \varepsilon -regular partition into k + 1 classes, where t \le k \le Q . For every fixed \varepsilon > 0 and t \ge 1 such a partition can be found in O(M(n)) sequential time, where M(n) is the time for multiplying two n by n matrices with 0,1 entries over the integers. It can also be found in time O(logn) on an ERPW PRAM with a polynomial number of parallel processors.

To really understand how the above theorem would require the introduction of several supporting definitions and lemmas (including one from Szemeredi’s original paper). Since our focus is on the application, we’d just state one lemma using which the idea behind the constructive version could be revealed in a more concrete sense.

Lemma 1: [Alon et al.] Let H be a bipartite graph with equally sized classes |A| = |B| = n . Let \displaystyle 2n^{\frac{-1}{4}} < \varepsilon <\frac{1}{16} . There is an O(M(n)) algorithm that verifies that H is \varepsilon -regular or finds two subset A' \subset A , B' \subset B , \displaystyle |A'| \ge \frac{{\varepsilon}^4}{16}n , \displaystyle |B'| \ge \frac{{\varepsilon}^4}{16}n , such that \displaystyle |d(A, B) - d(A', B')| \ge \varepsilon^4 . The algorithm can be parallelized and implemented in NC.

This lemma basically says that whether or not a bipartite graph is epsilon regular is a question that can be answered very quickly. If it is – then we can have an algorithm to say so and if it is not, it should return the answer with a certificate or proof that it is not so. This certificate is nothing but subsets of the classes from the original bipartite graph. The idea of the certificate is to help to proceed to the next step.

The general idea in the Alon algorithm is:

Start with a random equitable partition of the graph C_0, C_1, \dots, C_b , where \displaystyle |C_i| = \lfloor\ \frac{n}{b} ;\rfloor and i = 1, 2, \dots, b . Also |C_0| < b . Also let k_1 = b .

Then, for each pair C_r, C_s in the partition P_i check for regularity. If it is regular report so, if not find the certificate for the pair. If out of all the pairs, only \varepsilon k_i^2 are not epsilon regular – then halt, the partition P_i is epsilon regular. If not then we refine this partition using the information gained from the certificates in the cases when the pairs not epsilon regular. On refining this partition we obtain a new partition P' with 1 + k_i4^{k_i} classes. We repeat this process till we hit a partition which is epsilon regular.


Using the Constructive Version for Clustering:

A little earlier it was pointed out that there are two main difficulties in using the Szemeredi Lemma in practical settings. At one level, to use the lemma in a practical setting we need a constructive version. However the constructive versions are so designed that they work for all graphs. To illustrate how this is a problem, we quote a line from the algorithmic version based on singular values (Freize and Kannan):

The Algorithm finishes in atmost O(\varepsilon^{-45}) steps with an \varepsilon -regular partition

Now, consider what this implies – If \varepsilon is \frac{1}{16} (a typical value). Then the number of steps in which we are guaranteed to find a regular partition is 1.53249554 × 1054.

This is astonishingly large number. To make the lemma truly practical, we would have to give up the aim of making it work for all graphs. Instead we should be content that it works on most graphs. Most graphs would largely be graphs that appear in practice. Like was mentioned earlier some directions in this regard were provided by Sperotto and Pellilio.

Another problem (as was mentioned earlier too) was that the number of classes grows exponentially with each refinement step. We can not allow the number of classes to grow astronomically either because if we do so then consequently we would never be able to refine the partitions far enough. Here we would have to compromise on the generality again. Instead of allowing the number of classes to grow exponentially we could allow it to grow by a defined value in each iteration. This value could be chosen depending on a number of parameters, including the data set size.

Even in making such an approximation – the potential would always increase with each iteration albeit much slowly as compared to the methodology as originally described. Infact we would say that it would work for most graphs. In our work this approach seems to work quite well.

Given these two changes, what still remains is how could the lemma be used for clustering datasets? One possible way is suggested by another result called the Key Lemma and the implication that it might have. This is stated below:

Lemma 2: Given an arbitrary graph G=(V,E) , a partition of V in k clusters as in regularity lemma described above and two parameters \varepsilon, d , we describe the reduced graph G^R as the graph whose vertices are associated to the clusters and whose edges are associated to the \varepsilon -regular pairs with density above d . If we have a coloring on the edges of G, then the edges of the reduced graph will be colored with a color that appears on most of the edges between the two clusters.

Using the properties of what is called the Key-Lemma/Blow-Up Lemma, it is understood that this reduced graph should retain properties of the original graph. And thus any changes made on this reduced graph would reflect on the original graph.

Thus, one possible way to cluster is using a two-phase clustering strategy: First, we use the modified regularity lemma to construct a reduced graph using the method above. In the second phase, we use a pairwise clustering method such as spectral clustering to do cluster the the reduced graph. The results obtained on this reduced graph are then projected back onto the original graph using Lemma 2.

Two Phase Strategy to use the Regularity Lemma

Infact such a methodology gives results quite encouraging as compared to other standard clustering methods. These results would be reported here once the paper under consideration is published. :) Another thing to be noted is that the reduced graph is usually quite small as compared to the original graph and working on this smaller – reduced graph is much faster.


Other Possible Changes:

1. The Alon et al. algorithm is used and modified to use the regularity lemma for clustering. It would be interesting to explore the results given by the Frieze & Kannan method as it has a different way to find the certificate.

2. There has been some work on the sparse regularity lemma. This is something that would have a lot of practical value as in the above we construct a dense graph. Using the sparse version would allow us to use nearest neighbor graphs instead of dense graphs. This would reduce the computational burden significantly.

3. A recent paper by Fischer et al.Approximate Hypergraph Partitioning and Applications” has received a lot of attention. In this paper they give a new approach for finding regular partitions. All the previous ones work to find partitions of the tower type, while this paper gives a method to find a smaller regular partition if one exists in the graph. Employing this methodology for refinement instead of using an approximate version of the algorithmic regularity lemma could be a fruitful direction of work.


Onionesque Reality Home >>

Read Full Post »

Changing or increasing functionality of circuits in biological evolution is a form of computational learning. – Leslie Valiant

The title of this post comes from Prof. Leslie Valiant‘s The ACM Alan M. Turing award lecture titled “The Extent and Limitations of Mechanistic Explanations of Nature”.

Prof. Leslie G. Valiant

Click on the image above to watch the lecture

[Image Source: CACM “Beauty and Elegance”]

Short blurb: Though the lecture came out sometime in June-July 2011, and I have shared it (and a paper that it quotes) on every online social network I have presence on, I have no idea why I never blogged about it.

The fact that I have zero training (and epsilon knowledge of) in biology that has not stopped me from being completely fascinated by the contents of the talk and a few papers that he cites in it. I have tried to see the lecture a few times and have also started to read and understand some of the papers he mentions. Infact, the talk has inspired me enough to know more about PAC Learning than the usual Machine Learning graduate course might cover. Knowing more about it is now my “full time side-project” and it is a very exciting side-project to say the least!


Getting back to the title: One of the motivating questions about this work is the following:

It is widely accepted that Darwinian Evolution has been the driving force for the immense complexity observed in life or how life evolved. In this beautiful 10 minute video Carl Sagan sums up the timeline and the progression:

There is however one problem: While evolution is considered the driving force for such complexity, there isn’t a satisfactory explanation of how 13.75 billion years of it could have been enough. Many have often complained that this reduces it to a little more than an intuitive explanation. Can we understand the underlying mechanism of Evolution (that can in turn give reasonable time bounds)? Valiant makes the case that this underlying mechanism is of computational learning.

There have been a number of computational models that have been based on the general intuitive idea of Darwinian Evolution. Some of these include: Genetic Algorithms/Programming etc. However, people like Valiant amongst others find such methods useful in an engineering sense but unsatisfying w.r.t the question.

In the talk Valiant mentions that this question was asked in Darwin’s day as well. To which Darwin proposed a bound of 300 million years for such evolution to occur. This immediately fell into a problem as Lord Kelvin, one of the leading physicists of the time put the figure of the age of Earth to be 24 million years. Now obviously this was a problem as evolution could not have happened for more than 24 million years according to Kelvin’s estimate. The estimate of the age of the Earth is now much higher. ;-)

The question can be rehashed as: How much time is enough? Can biological circuits evolve in sub-exponential time?

For more I would point out to his paper:

Evolvability: Leslie Valiant (Journal of the ACM – PDF)

Towards the end of the talk he shows a Venn diagram of the type usually seen in complexity theory text books for classes P, NP, BQP etc but with one major difference: These subsets are fact and not unproven:

Fact: Evolvability \subseteq SQ Learnable \subseteq PAC Learnable

*SQ or Statistical Query Learning is due to Michael Kearns (1993)

Coda: Valiant claims that the problem of evolution is no more mysterious than the problem of learning. The mechanism that underlies biological evolution is “evolvable target pursuit”, which in turn is the same as “learnable target pursuit”.


Onionesque Reality Home >>

Read Full Post »

One interesting project that I am involved in these days involves certain problems in Intelligent Tutors. It turns out that perhaps one of the best ways to tackle them is by using Conditional Random Fields (CRFs). Many attempts to solving these problems still involve Hidden Markov Models (HMMs). Since I have never really been a Graphical Models guy (though I am always fascinated) so I found the going on studying CRFs quite difficult. Now that the survey is more or less over, here are my suggestions for beginners to go about learning them.

Tutorials and Theory

1. Log-Linear Models and Conditional Random Fields (Tutorial by Charles Elkan)

Log-linear Models and Conditional Random Fields
Charles Elkan

6 videos: Click on Image above to view

Two directions of approaching CRFs are especially useful to get a good perspective on their use. One of these is considering CRFs as an alternate to Hidden Markov Models (HMMs) while another is to think of CRFs building over Logistic Regression.

This tutorial makes an approach from the second direction and is easily one of the most basic around. Most people interested in CRFs would ofcourse be familiar with ideas of maximum likelihood, logistic regression etc. This tutorial does a good job, starting with the absolute basics – talking about logistic regression (for a two class problem) to a more general multi-label machine learning problem with a structured output (outputs having a structure). I tried reading a few tutorials before this one, but found this to be the most comprehensive and the best place to start. It however seems that there is one lecture missing in this series which (going by the notes) covered more training algorithms.

2. Survey Papers on Relational Learning

These are not really tutorials on CRFs, but talk of sequential learning in general. For beginners, these surveys are useful to clarify the range of problems in which CRFs might be useful while also discussing other methods for the same briefly. I would recommend these two tutorials to help put CRFs in perspective in the broader machine learning sub-area of Relational Learning.

— Machine Learning for Sequential Learning: A Survey (Thomas Dietterich)


This is a very broad survey that talks of sequential learning, defines the problem and some of the most used methods.

— An Introduction to Structured Discriminative Learning (R Memisevic)


This tutorial is like the above, however focuses more on comparing CRFs with large margin methods such as SVM. Giving yet another interesting perspective in placing CRFs.

3. Comprehensive CRF Tutorial (Andrew McCallum and Charles Sutton)


This tutorial is the most compendious tutorial available for CRF. While it claims to start from the bare bone basics, I found it hard for a start and took it on third (after the above two). It is potentially the starting and ending point for a more advanced Graphical Models student. It is extensive (90 pages) and gives a feeling of comfort with CRFs when done. It is definitely the best tutorial available though by no means the most easiest point to start if you have never done any sequential learning before.

This might be considered an extension to this tutorial by McCallum et al : CRFs for Relational Learning (PDF)

4. Original CRF Paper (John Lafferty et al.)


Though not necessary to learn CRFs given many better tutorials, this paper is still recommended, being the first on CRFs.

5. Training/Derivations (Rahul Gupta)


This report is good for the various training methods and for one to go through the derivations associated.

6. Applications to Vision (Nowozin/Lampert)

If your primary focus is using structured prediction in Computer Vision/Image Analysis then a good tutorial (with a large section on CRFs) can be found over here:

Structured prediction and learning in Computer Vision (Foundations and Trends Volume).



Extensions to the CRF concept

There are a number of extensions to CRFs. The two that I have found most helpful in my work are (these are easy to follow given the above):

1. Hidden State Conditional Random Fields (H CRF)

2. Latent Dynamic Conditional Random Fields (LDCRF)

Both of these extensions work to include hidden variables in the CRF framework.


Software Packages

1. Kevin Murphy’s CRF toolbox (MATLAB)

2. MALLET (I haven’t used MALLET, it is Java based)

3. HCRF – LDCRF Library (MATLAB, C++, Python). As as the name suggests, this package is for HCRF and LDCRF, though can be used as a standalone package for CRF as well.

Read Full Post »

« Newer Posts - Older Posts »