Feeds:
Posts

Where does the Sigmoid in Logistic Regression come from?

Note: The title of this post is circular. But I use/abuse it because of the post linked below.

I noticed on the Hacker News front page (and via multiple reshares on twitter), a discussion on why logistic regression uses a sigmoid. The article linked in the story talks about the log-odds ratio, and how it leads to the sigmoid (and gives a good intuitive plug on it).

However, I think that the more important question is – Why do you care about log-odds? Why do you use log-odds and not something else? The point of this quick post is to write out why using the log-odds is infact very well motivated in the first place, and once it is modeled by a linear function, what you get is the logistic function.

Beginning with log-odds would infact be begging the question, so let us try to understand.

____________________________

To motivate and in order to define the loss etc, suppose we had a linear classifier: $\hat{y} = h(\mathbf{x}) = sign(w_0 + \mathbf{w}^T\mathbf{x})$. This just means that for a vector input, we take the dot product with the parameters $\mathbf{w}$ and take the sign.

The learning problem would be to figure out a good direction $\mathbf{w}$ and a good location of the decision boundary $w_0$.

____________________________

We want to figure these out so as to minimize the expected 0-1 loss (or expected number of mistakes) for the classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$. The 0-1 loss for a datapoint-label pair $\mathbf{x},y$ is simply:

$\displaystyle L(h(\mathbf{x}),y) = \begin{cases} 0, & \text{if } h(\mathbf{x}) = y \\ 1, & \text{if } h(\mathbf{x}) \neq y \end{cases}$

Now, the next question we would like to ask. What is the risk of this classifier that we want to minimize? The risk is the expected loss. That is, if we draw a random sample from the (unknown) distribution $p(\mathbf{x},y)$, what would be the expected error? More concretely:

$R(h) = \mathbb{E}_{\mathbf{x},y} [L(h(\mathbf{x}),y)]$

Writing out the expectation:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(\mathbf{x}, y=c) d\mathbf{x}$

Using the chain rule this becomes:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x}) p(\mathbf{x})d\mathbf{x}$

It is important to understand this expression. This is not assuming anything about the data. However, it is this expression that we want to minimize if we want to get a good classifier. To minimize this expression, it suffices to simply minimize for the conditional risk for any point $\mathbf{x}$ (i.e. the middle part of the above expression):

$\displaystyle R(h|\mathbf{x}) =\sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x})$

But this conditional risk can be written as:

$\displaystyle R(h|\mathbf{x}) =0 \times p(y=h(\mathbf{x})|\mathbf{x}) + 1 \times \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x})$

Note that, $\displaystyle \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Therefore, the conditional risk is simply:

$\displaystyle R(h|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Now, it is this conditional risk that we want to minimize given a point $\mathbf{x}$. And in order to do so, looking at the expression above, the classifier must make the following decision:

$\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$

It is again important to note that so far we have made absolutely no assumptions about the data. So the above classifier is the best classifier that we can have in terms of generalization, in the sense of what might be the expected loss on a new sample point. Such a classifier is called the Bayes Classifier or sometimes called the Plug-in classifier.

But the optimal decision rule mentioned above i.e. $\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$ is equivalent to saying that:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 1 \text{ }\forall c$

by taking log, this would be:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \log \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 0 \text{ }\forall c$

If, we were only dealing with binary classification, this would imply:

$\displaystyle h(\mathbf{x}) = 1 \iff \log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} \geq 0$

Notice that by making no assumptions about the data, simply by writing out the conditional risk, the log-odds ratio has fallen out directly. This is not an accident, because the optimal bayes classifier has this form for binary classification. But the question still remains, how do we model this log-odds ratio? The simplest option is to consider a linear model (there is no reason to stick to a linear model, but due to some reasons, one being convexity, we stick to a linear model):

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} = w_0 + \mathbf{w}^T \mathbf{x} = 0$

Now, we know that $p(y=1|\mathbf{x}) = 1 - p(y=0|\mathbf{x})$, plugging this in the above, and exponentiating, we have:

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{1 - p(y=0|\mathbf{x})} = \exp(w_0 + \mathbf{w}^T \mathbf{x}) = 1$

Rearranging, yields the familiar logistic model (and the sigmoid):

$\displaystyle p(y = 1|\mathbf{x}) = \frac{1}{1+ \exp(- w_0 - \mathbf{w}^T\mathbf{x})} = \frac{1}{2}$.

As noted in the post linked in the beginning, the logistic model, $\sigma(x) = \frac{1}{1+ e^{-x}}$, which for any $x$ is, $0 \leq \sigma(x) \leq 1$, and is monotonic $\sigma(-\inf) = 0, \sigma(+\inf) = 1$.

____________________________

This derivation shows that the log-odds is not an arbitrary choice, infact a very natural choice. The sigmoid is simply a consequence of modeling the log-odds with a linear function (infact logistic regression is arguably the simplest example of a log-linear model, if we had structured outputs, a natural extension of such a model would be the Conditional Random Field. The choice of using a linear function is simply to make the optimization convex, amongst some other favourable properties).

____________________________

Note: This post was inspired by some very succinct notes by Gregory Shakhnarovich from his Machine Learning class, that I both took and served as a TA for.

Neighbourhood Gerrymandering: An Approach to Discriminative Metric Learning via Latent Structured Prediction

An informal summary of a recent project I had some involvement in.

Motivation: Why care about Metric Learning?

In many machine learning algorithms, such as k-means, Support Vector Machines, k-Nearest Neighbour based classification, kernel regression, methods based on Gaussian Processes etc etc – there is a fundamental reliance, that is to be able to measure dissimilarity between two examples. Usually this is done by using the Euclidean distance between points (i.e. points that are closer in this sense are considered more similar), which is usually suboptimal in the sense that will be explained below. Being able to compare examples and decide if they are similar or dissimilar or return a measure of similarity is one of the most fundamental problems in machine learning. Ofcourse a related question is: What does mean by “similar” afterall?

To illustrate the above let us work with k-Nearest Neighbour classification. Before starting, let us just illustrate the really simple idea (of kNN classification) by an example: Consider the following points in $\mathbb{R}^2$, with the classes marked by different colours.

Now suppose we have a new point – marked with black – whose class is unknown. We assign it a class by looking at the nearest neighbors and taking the majority vote:

Some notes on kNN:

A brief digression first before moving on the problem in the above (what is nearest?). kNN classifiers are very simple and yet in many cases they can give excellent performance. For example, consider the performance on the MNIST dataset, it is clear that kNN can give competitive performance as compared to other more complicated models.

Moreover, they are simple to implement, use local information and hence are inherently nonlinear. The biggest advantage in my opinion is that it is easy to add new classes (since no retraining from scratch is required) and since we average across points, kNN is also relatively robust to label noise. It also has some attractive theoretical properties: for example kNN is universally consistent (as the number of points approaches infinity, with appropriate choice of k, the kNN error will approach the Bayes Risk).

Notion of “Nearest”:

At the same time, kNN classifiers also have their disadvantages. One is related to the notion of “nearest” (which falls back on what was talked about at the start) i.e. how does one decide what points are “nearest”. Usually such points are decided on the basis of the Euclidean distance on the native feature space which usually has shortfalls. Why? Because nearness in the Euclidean space may not correspond to nearness in the label space i.e. points that might be far off in the Euclidean space may have similar labels. In such cases, clearly the notion of “near” using the euclidean distance is suboptimal. This is illustrated by a set of figures below (adapted from slides by Kilian Weinberger):

An Illustration:

Consider the image of this lady – now how do we decide what is more similar to it?

Someone might mean similar on the basis of the gender:

Or on the basis of age:

Or on the basis of the hairstyle!

Similarity depends on the context! Something that the euclidean distance in the native feature space would fail to capture. This context is provided by labels.

Distance Metric Learning:

The goal of Metric Learning is to learn a distance metric, so that the above label information is incorporated in the notion of distance i.e. points that are semantically similar are now closer in the new space. The idea is to take the original or native feature space, use the label information and then amplify directions that are more informative and squish directions that are not. This is illustrated in this figure – notice that the point marked in black would be incorrectly classified in the native feature space, however under the learnt metric it would be correctly classified.

It is worthwhile to have a brief look at what this means. The Euclidean distance (with $x_i \in \mathbb{R}^d$) is defined by

$\sqrt{(x_i - x_j)^T (x_i - x_j)}$

as also was evident in the above figure, this corresponds to the following euclidean ball in 2-D

A family of distance measure may be defined using an inner product matrix. These are called the Mahalanobis metrics.

$\sqrt{(x_i - x_j)^T \mathbf{W}(x_i - x_j)}$

The learnt metric affects a rescaling and rotation of the original space. The goal is now to learn this $\mathbf{W} \succeq 0$ using the label information so that the new distances correspond better to the semantic context. It is easy to see that when $\mathbf{W} \succeq 0$, the above is still a distance metric.

Learning $\mathbf{W}$:

Usually the real motivation for metric learning is to optimize for the kNN objective i.e. learn the matrix $\mathbf{W} \succeq 0$ so that the kNN error is reduced. But note that directly optimizing for the kNN loss is intractable because of the combinatorial nature of the optimization (we’ll see this in a bit), so instead, $\mathbf{W}$ is learnt as follows:

1. Define a set of “good” neighbors for each point. The definition of “good” is usually some combination of proximity to the query point and label agreement between the points.

2. Define a set of “bad” neighbours for each point. This might be a set of points that are “close” to the query point but disagree on the label (i.e. inspite of being close to the query point they might give a wrong classification if they were chosen to classify the query point).

3. Set up the optimization problem for $\mathbf{W}$ such that for each query point, “good” neighbours are pulled closer to it while “bad” neighbours are pushed farther away, and thus learn $\mathbf{W}$ so as to minimize the leave one out kNN error.

The exact formulation of “good” and “bad” varies from method to method. Here are some examples:

In one of the earliest papers on distance metric learning by Xing, Ng, Jordan and Russell (2002) – good neighbors are similarly labeled k points. The optimization is done so that each class is mapped into a ball of fixed radius. However no separation is enforced between the classes. This is illustrated in the following figure (the query point is marked with an X, similarly labeled k points are moved into a ball of a fixed radius):

One problem with the above is that the kNN objective does not really require that similarly labeled points are clustered together, hence in a way it optimizes for a harder objective. This is remedied by the LMNN described briefly below.

One of the more famous Metric Learning papers is the Large Margin Nearest Neighbors by Weinberger and Saul (2006). Here good neighbors are similarly labeled k points (and the circle around x is the distance of the farthest of the good neighbours) and “worst offenders” or “bad” neighbours are points that are of a different class but still in the nearest neighbors of the query point. The optimization is basically a semidefinite program that works to pull the good neighbours towards the query point and a margin is enforced by pushing the offending points out of this circle. Thus in a way, the goal in LMNN is to deform the metric in such a way that the neighbourhood for each point is “pure.

There are many approaches to the metric learning problem, however a few more notable ones are:

1. Neighbourhood Components Analysis (Goldberger, Roweis, Hinton and Salakhutdinov, 2004): Here the piecewise constant error of the kNN rule is replaced by a soft version. This leads to a non-convex objective that can be optimized by gradient descent. Basically, NCA tries to optimize for the choice  of neighbour at the price of losing convexity.

2. Collapsing Classes (Globerson and Roweis, 2006): This method attempts to remedy the non-convexity above by optimizing a similar stochastic rule while attempting to collapse each class to one point, making the problem convex.

3. Metric Learning to Rank (McFee and Lankriet, 2010): This paper takes a different take on metric learning, treating it as a ranking problem. Note that given a fixed p.s.d matrix $\mathbf{W}$ a query point induces a permutation on the training set (in order of increasing distance). The idea thus is to optimize the metric for some ranking measure (such as precision@k). But note that this is not necessarily the same as requiring correct classification.

Neighbourhood Gerrymandering:

As a motivation we can look at the cartoon above for LMNN. Since we are looking to optimize for the kNN objective, the requirement to learn the metric should just be correct classification. Thus, we should need to push the points to ensure the same. Thus we can have the circle around x as simply the distance of the farthest point in the k nearest neighbours (irrespective of class). Now, we would like to deform the metric such that enough points are pulled in and pushed out of this circle so as to ensure correct classification. This is illustrated below.

This method is akin to the common practice of Gerrymandering, in drawing up borders of election districts so as to provide advantages to desired political parties. This is done by concentrating voters from a particular party and/or by spreading out voters from other parties. In the above, the “districts” are cells in the Voronoi diagram defined by the Mahalanobis metric and “parties” are class labels voted for by each neighbour.

Motivations and Intuition:

Now we can step back a little from the survey above, and think a bit about the kNN problem in somewhat more precise terms so that the above approach can be motivated better.

For kNN, given a query point and a fixed metric, there is an implicit latent variable: The choice of the k “neighbours”.

Given this latent variable – inference of the label for the query point is trivial – since it is just the majority vote. But notice that for any given query point, there can exist a very large number of  choices of k points that may correspond to correct classification (basically any set of points with majority of correct class will work). Now we basically want to learn a metric so that we prefer one of these sets over any set of k neighbours which would vote for a wrong class. In particular, from the sets that affects correct classification we would like to pick the set that is on average most similar to the query point.

We can write kNN prediction as an inference problem with a structured latent variable being the choice of k neighbours.

The learning then corresponds to  minimizing a sum of structured latent hinge loss and a regularizer. Computing the latent hinge loss involves loss-augmented inference which is basically looking for the worst offending k points (points that have high average similarity with the query point, yet correspond to a high loss). Given the combinatorial nature of the problem, efficient inference and loss-augmented inference is key. Optimization can basically be just gradient descent on the surrograte loss. To make this a bit more clear, the setup is described below:

Problem Setup:

Suppose we are given $N$ training examples that are represented by a “native” feature map, $\mathbf{X} = \{x_1, \dots, x_N\}$ with $x_i \in \mathbb{R}^d$ with class labels $\mathbf{y} = [y_1, \dots, y_N]^T$ with $y_i \in [\mathbf{R}]$, where $[\mathbf{R}]$ stands for the set $\{1, \dots, \mathbf{R}\}$.

Suppose are also provided with a loss matrix $\Lambda$ with $\Lambda(r,r')$ being the loss incurred by predicting $r'$ when the correct class is $r$. We assume that $\Lambda(r,r) = 0$ and $\forall (r,r'), \Lambda(r,r') \geq 0$.

Now let $h \subset \mathbf{X}$ be a set of examples in $\mathbf{X}$.

As stated earlier, we are interested in the Mahalanobis metrics:

$D_W(x,x_i) = (x-x_i)^T W (x-x_i)$

For a fixed $W$ we may define the distance of $h$ with respect to a point $x$ as:

$\displaystyle S_W(x,h) - \sum_{x_j \in h} D_W(x, x_j)$

Therefore, the set of k-Nearest Neighbours of $x$ in $\mathbf{X}$ is:

$h_W(x ) = \arg\max_{|h|=k} S_W(x,h)$

For any set $h$ of $k$ examples from $\mathbf{X}$ we can predict the label of $x$ by a simple majority vote.

$\hat{y}(h) = majority\{y_j: x_j \in h\}$

The kNN classifier therefore predicts $\hat{y}(h_W(x))$.

Thus, the classification loss incurred using the set $h$ can be defined as:

$\Delta(y,h) = \Lambda(y,\hat{y}(h))$

Learning and Inference:

One might want to learn $W$ so as to minimize the training loss:

$\displaystyle \sum_i \Delta(y_i, h_W(x_i))$

However as mentioned in passing above, this fails because of the intractable nature of  the classification loss $\Delta$. Thus we’d have to resort to the usual remedy: define a tractable surrograte loss.

It must be stressed again that the output of prediction is a structured object $h_W$. The loss in structured prediction penalizes the gap between score of the correct structured output and the score of the “worst offending” incorrect output. This leads to the following definition of the surrogate:

$L(x,y,W) = \max_h [S_W(x,h) + \Delta(y,h)] - \max_{h: \Delta(y,h) = 0} S_W(x,h)$

This corresponds to our earlier intuition on wanting to learn $W$ such that the gap between the “good neighbours” and “worst offenders” is increased.

So, although the loss above was arrived at by intuitive arguments, it turns out that our problem is an instance of a familiar type of problem: Latent Structured Prediction and hence the machinery for optimization there can be used here as well. The objective for us corresponds to:

$\displaystyle \min_W \| W\|^2_{F} + C \sum_i (L(x_i, y_i,W))$

Where $\| \cdot \|_F$ is the Frobenius norm.

Note that the regularizer is convex, but the loss is not convex to the subtraction of the max term i.e. now it is a difference of convex functions which means the concave convex procedure may be used for optimization (although we just use stochastic gradient descent). Also note that the optimization at each step needs an efficient subroutine to determine the correct structured output (inference of the best set of neighbours) and the worst offending incorrect structured output (loss augmented inference i.e. finding the worst set of neighbors). Turns out that for this problem this is possible (although not presented here).

It is interesting to think about how this approach extends to regression and to see how it works when the embeddings learnt are not linear.

This post may be considered an extension of the previous post.

The setup and notation is the same as in the previous post (linked above). But to summarize: Earlier we had an unknown smooth regression function $f: \mathbb{R}^d \to \mathbb{R}$. The idea was to estimate at each training point, the gradient of this unknown function $f$, and then taking the sample expectation of the outerproduct of the gradient. This quantity has some interesting properties and applications.

However it has its limitations, for one, the mapping $f: \mathbb{R}^d \to \mathbb{R}$ restricts the Gradient Outer Product being helpful for only regression and binary classification (since for binary classification the problem can be thought of as regression). It is not clear if a similar operator can be constructed when one is dealing with classification, that is the unknown smooth function is a vector valued function $f: \mathbb{R}^d \to \mathbb{R}^c$ where $c$ is the number of classes (let us say for the purpose of this discussion, that for each data point we have a probability distribution over the classes, a $c$ dimensional vector).

In the case of the gradient outer product since we were working with a real valued function, it was possible to define the gradient at each point, which is simply:

$\displaystyle \Bigg[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \Bigg]$

For a vector valued function $f: \mathbb{R}^d \to \mathbb{R}^c$, we can’t have the gradient, but instead can define the Jacobian at each point:

$\displaystyle \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_c}{\partial x_1} & \frac{\partial f_c}{\partial x_2} & \dots & \frac{\partial f_c}{\partial x_d}\end{bmatrix}$

Note that $\mathbf{J}$ may be estimated in a similar manner as estimating gradients as in the previous posts. Which leads us to define the quantity $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$.

The first thing to note is that $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$ defined in the previous post is simply the quantity for the special case when $f: \mathbb{R}^d \to \mathbb{R}$. Another note is also in order: The reason why we suffixed that quantity with “outer product” (as opposed to “inner product” here) is simply because we considered the gradient to be a column vector, otherwise they are similar in spirit.

Another thing to note is that it is easy to see that the quantity $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is a positive semi-definite matrix and hence is a Reimannian Metric, which is defined below:

Definition: A Reimannian Metric $G$ on a manifold $\mathcal{M}$ is a symmetric and positive semi-definite matrix, which defines a smoothly varying inner product in the tangent space $\mathbf{T}_x \mathcal{M}$, for each point $x \in \mathcal{M}$ and $a, b \in \mathbf{T}_x \mathcal{M}$. This associated p.s.d matrix is called the metric tensor. In the above case, since $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is p.s.d it defines a Reimannian metric:

$\langle a, b \rangle_x = a^T \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) b$

Thus, $\mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is a specific metric (more general metrics are dealt with in areas such as metric learning).

Properties: We saw some properties of $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$ in the previous post. In the same vein, does $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ have similar properties? i.e. does the first eigenvector also correspond to the direction of highest average variation? What about the $k$-dimensional subspace? What difference does it make that we are looking at a vector valued function? Also what about the cases when $d > c$ and otherwise?

These are questions that I need to think about and should be the topic for a future post to be made soon, hopefully.

Recently, in course of a project that I had some involvement in, I came across an interesting quadratic form. It is called in the literature as the Gradient Outer Product. This operator, which has applications in supervised dimensionality reduction, inverse regression and metric learning can be motivated in two (related) ways, but before doing so, the following is the set up:

Setup: Suppose we have the usual set up as for nonparametric regression and (binary) classification i.e. let $Y \approx f(X)$ for some unknown smooth $f$, the input $X$ is $d$ dimensional $X = (X^i)_{i=1}^d$

1. Supervised Dimensionality Reduction: It is often the case that $f$ varies most along only some relevant coordinates. This is the main motivation behind variable selection.

The idea in variable selection is the following: That $f(X)$ may be written as $f(PX)$ where $P \in \{0,1\}^{k \times d}$. $P$ projects down the data to only $k$ relevant coordinates (i.e. some features are selected by $P$ while others are discarded).

This idea is generalized in Multi-Index Regression, where the goal is to recover a subspace most relevant to prediction. That is, now suppose the data varies significantly along all coordinates but it still depends on some subspace of smaller dimensionality. This might be achieved by letting $P$ from the above to be $P \in \mathbb{R}^{k \times d}$. It is important to note that $P$ is not any subspace, but rather the $k$-dimensional subspace to which if the data is projected, the regression error would be the least. This idea might be further generalized by means of mapping $X$ to some $P$ non-linearly, but for now we only stick to the relevant subspace.

How can we recover such a subspace?

________________

2. Average Variation of $f$: Another way to motivate this quantity is the following: Suppose we want to find the direction in which $f$ varies the most on average, or the direction in which $f$ varies the second fastest on average and so on. Or more generally, given any direction, we want to find the variation of $f$ along it. How can we recover these?

________________

The Expected Gradient Outer Product:  The expected gradient outer product of the unknown classification or regression function is the quantity: $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$

The expected gradient outer product recovers the average variation of $f$ in all directions. This can be seen as follows: The directional derivative at $x$ along $v \in \mathbb{R}^d$ is given by $\displaystyle {f'}_v(x) = \nabla f(x)^T v$ or $\mathbb{E}_X |{f'}_v(X)|^2 = \mathbb{E}_X (v^T G(X) v) = v^T (\mathbb{E}_X G(X))v$.

From the above it follows that if $f$ does not vary along $v$ then $v$ must be in the null space of $\mathbb{E}_X (G(X))$.

Infact it is not hard to show that the relevant subspace $P$ as defined earlier can also be recovered from $\mathbb{E}_X (G(X))$. This fact is given in the following lemma.

Lemma: Under the assumed model i.e. $Y \approx f(PX)$, the gradient outer product matrix $\mathbb{E}_X (G(X))$ is of rank at most $k$. Let $\{v_1, v_2, \dots, v_k \}$ be the eigenvectors of $\mathbb{E}_X (G(X))$ corresponding to the top $k$ eigenvalues of $\mathbb{E}_X (G(X))$. Then the following is true:

$span(P) = span(v_1, v_2, \dots, v_k)$

This means that a spectral decomposition of $\mathbb{E}_X (G(X))$ recovers the relevant subspace. Also note that the Gradient Outer Product corresponds to a kind of a supervised version of Principal Component Analysis.

________________

Estimation: Ofcourse in real settings the function is unknown and we are only given points sampled from it. There are various estimators for $\mathbb{E}_X (G(X))$, which usually involve estimation of the derivatives. In one of them the idea is to estimate, at each point $x$ a linear approximation to $f$. The slope of this approximation approximates the gradient at that point. Repeating this at the $n$ sample points, gives a sample gradient outer product. There is some work that shows that some of these estimators are statistically consistent.

________________

Related: Gradient Based Diffusion Maps: The gradient outer product can not isolate local information or geometry and its spectral decomposition, as seen above, gives only a linear embedding. One way to obtain a non-linear dimensionality reduction would be to borrow from and extend the idea of diffusion maps, which are well established tools in semi supervised learning. The central quantity of interest for diffusion maps is the graph laplacian $L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, where $D$ is the degree matrix and $W$ the adjacency matrix of the nearest neighbor graph constructed on the data points. The non linear embedding is obtained by a spectral decomposition of the operator $L$ or its powers $L^t$.

As above, a similar diffusion operator may be constructed by using local gradient information. One such possible operator could be:

$\displaystyle W_{ij} = W_{f}(x_i, x_j) = exp \Big( - \frac{ \| x_i - x_j \| ^2}{\sigma_1} - \frac{ | \frac{1}{2} (\nabla f(x_i) + \nabla f(x_j)) (x_i - x_j) |^2 }{\sigma_2}\Big)$

Note that the first term is the same that is used in unsupervised dimension reduction techniques such as laplacian eigenmaps and diffusion maps. The second term can be interpreted as a diffusion on function values. This operator gives a way for non linear supervised dimension reduction using gradient information.

The above operator was defined here, however no consistency results for the same are provided.

Also see: The Jacobian Inner Product.

________________

(Very) Informal thoughts on the No-Free-Lunch Theorem

The No-Free-Lunch theorems (Wolpert and Macready) are one of the mainstays in Machine Learning and optimization. Crudely, the NFL says that “for every learner there exists a distribution in which it fails”, giving a precise handle to reason about why a given learning algorithm might fail on certain tasks (or, “specialization” of a learning algorithm might just be accidental). It gives a good justification why it is useful to incorporate prior informaion about the problem at hand to perform better. Oftentimes the NFL is used to argue that a general purpose learning algorithm is theoretically impossible. This is incorrect. Indeed, universal learning algorithms exist and there is a (possibly very small) free lunch, at least for the case of “interesting problems” (thanks to Vick for pointing out these papers).

The view “for every learner there exists a distribution in which it fails” is very useful. However I had recently been thinking that is particularly instructive to think of No Free Lunch as a kind of diagnolization argument (this is just a different way of thinking about the prior statement). I am not completely sure if this works but it sounds plausible and makes me see NFL in a different light. It might also be that there is some subtle absrudity or flaw in this kind of thinking. However, I think the NFL theorem can be seen as a simple corollary of the following incompleteness theorem in Kolmogorov Complexity.

Theorem (Chaitin): There exists a constant L (which only depends on the particular axiomatic system and the choice of description language) such that there does not exist a string s for which the statement the “The Kolmogorov Complexity of s is greater than L” (as formalized in S) can be proven within the axiomatic system S.

Stated informally: “There’s a number L such that we can’t prove the Kolmogorov complexity of any specific string of bits is more than L”. In particular, you can think of any learning machine as a compression scheme (since compression implies learning, we can think of any learning algorithm as a way to compress the data) of some fixed Kolmogorov Complexity (consider the most optimal program for that machine, the programming language is irrelevant as long as it is Turing Complete. Consider the description length of it in bits. That is its Kolmogorov Complexity). Now for any such learning machine (or class of learning machines w.l.o.g) you can construct a data string which has Kolmogorov Complexity greater than what this learning machine has. i.e. this data string is random from the point of view of this learning machine as it is incompressible for it. In short IF the data string has some Kolmogorov Complexity K and your learning machine has Kolmogorov Complexity k and k < K then you will never be able to find any structure in your data string i.e. can never show that it has a KC greater than k. Thus the data string given to our learner would appear to be structureless. Thus no free lunch is just a simple diagonalization argument in the spirit of incompleteness results. One can always construct a string that is Kolmogorov Random or structureless w.r.t our encoding scheme. Ofcourse the Kolmogorov Complexity is uncomputable. However I think the above argument works even when one is talking of upper bounds on the actual Kolmogorov Complexity, but I am not very sure about it or if there are any problems with it.

Also, if one assumes that that there is no reason to believe that structureless functions exist (if we assume that the data is generated by a computational process then there is probably no reason to believe that it is structureless i.e. the problem is “interesting”). And if this is the case, then you would always have a machine in your library of learning algorithms that will be able to find some upper bound on the Kolmogorov Complexity of the data string (much less than the machine itself). This search might be done using Levin’s universal search, for example. I think this line of reasoning leads naturally, to thinking why NFL doesn’t say anything about whether universal learning algorithms can exist.

PS: This thought train about NFL as diagonalization was inspired by a post by David McAllester on what he calls the “Free Lunch Theorem” over here. Some of this appeared as a comment there. However, now I have been inspired enough to study some of the papers concerned including his. (Unrelated to all this. I particularly found likening Chomsky’s Universal Grammar as simply an invocation of NFL on information theoretic grounds in his post very enlightening).

Book Review: “Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World” by Leslie Valiant

Darwinian Evolution is a form of Computational Learning

The punchline of this book is perhaps: “Changing or increasing functionality of circuits in biological evolution is a form of computational learning“; although it also speaks of topics other than evolution, the underlying framework is of the Probably Approximately Correct model [1] from the theory of Machine Learning, from which the book gets its name.

“Probably Approximately Correct” by Leslie Valiant

[Clicking on the image above will direct you to the amazon page for the book]

I had first heard of this explicit connection between Machine Learning and Evolution in 2010 and have been quite fascinated by it since. It must be noted, however, that similar ideas have appeared in the past. It won’t be incorrect to say that usually they have been in the form of metaphor. It is another matter that this metaphor has generally been avoided for reasons I underline towards the end of this review. When I first heard about the idea it immediately made sense and like all great ideas, in hindsight looks completely obvious. Ergo, I was quite excited to see this book and preordered it months in advance.

Before attempting to sketch a skiagram of the main content of the book: One of the main subthemes of the book, constantly emphasized is to look at computer science as a kind of an enabling tool to study natural science. This is oft ignored, perhaps because of the reason that CS curricula are rarely designed with any natural science component in them and hence there is no impetus for aspiring computer scientists to view them from the computational lens. On the other hand,  the relation of computer science with mathematics has become quite well established. As a technology the impact of Computer Science has been tremendous. All this is quite remarkable given the fact that just about a century ago the notion of a computation was not even well defined. Unrelated to the book: More recently people have started taking the idea of digital physics (i.e. physics from a solely computable/digital perspective) seriously. But in the other natural sciences its usage is still woeful. Valiant draws upon the example of Alan Turing as a natural scientist and not just as a computer scientist to make this point. Alan Turing was more interested in natural phenomenon (intelligence, limits of mechanical calculation, pattern formation etc) and used tools from Computer Science to study them, a fact that is evident from his publications. That Turing was trying to understand natural phenomenon was remarked in his obituary by Max Neumann by summarizing the body of his work as: “the extent and the limitations of mechanistic explanations of nature”.

________________

The book begins with a delightful and quite famous quote by John von Neumann (through this paragraph I also discovered the context to the quote). This paragraph also adequately summarizes the motivation for the book very well:

“In 1947 John von Neumann, the famously gifted mathematician, was keynote speaker at the first annual meeting of the Association for Computng Machinery. In his address he said that future computers would get along with just a dozen instruction types, a number known to be adequate for expressing all of mathematics. He went on to say that one need not be surprised at this small number, since 1000 words were known to be adequate for most situations in real life, and mathematics was only a small part of life, and a very simple part at that. The audience responded with hilarity. This provoked von Neumann to respond: “If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is”

Though counterintuitive, von Neumann’s quip contains an obvious truth. Einstein’s theory of general relativity is simple in the sense that one can write the essential content on one line as a single equation. Understanding its meaning, derivation, and consequences requires more extensive study and effort. However, this formal simplicity is striking and powerful. The power comes from the implied generality, that knowledge of one equation alone will allow one to make accurate predictions about a host of situations not even connected when the equation was first written down.

Most aspects of life are not so simple. If you want to succeed in a job interview, or in making an investment, or in choosing a life partner, you can be quite sure that there is no equation that will guarantee you success. In these endeavors it will not be possible to limit the pieces of knowledge that might be relevant to any one definable source. And even if you had all the relevant knowledge, there may be no surefire way of combining it to yield the best decision.

This book is predicated on taking this distinction seriously […]”

In a way, aspects of life as mentioned above appear theoryless, in the sense that there seems to be no mathematical or scientific theory like relativity for them. Something which is quantitative, definitive and short. Note that these are generally not “theoryless” in the sense that there exists no theory at all since obviously people can do all the tasks mentioned in a complex world quite effectively. A specific example is of how organisms adapt and species evolve without having a theory of the environment as such. How can such coping mechanisms come into being in the first place is the main question asked in the book.

Let’s stick to the specific example of biological evolution. Clearly, it is one of the central ideas in biology and perhaps one of the most important theories in science that changed the way we look at the world. But inspite of its importance, Valiant claims (and correctly in my opinion) that evolution is not understood well in a quantitative sense. Evidence that convinces us of its correctness is of the following sort: Biologists usually show a sequence of evolving objects; stages, where the one coming later is more complex than the previous. Since this is studied mostly via the fossil record there is always a lookout for missing links between successive stages (As a side note: Animal eyes is an extremely fascinating and very readable book that delves with this question but specifically dealing with the evolution of the eye. This is particularly interesting given that due to the very nature of the eye, there can be no fossil records for them). Darwin had remarked that numerous successive paths are necessary – that is, if it was not possible to trace a path from an earlier form to a more complicated form then it was hard to explain how it came about. But again, as Valiant stresses, this is not really an explanation of evolution. It is more of an “existence proof” and not a quantitative explanation. That is, even if there is evidence for the existence of a path, one can’t really say that a certain path is likely to be followed just because it exists. As another side note: Watch this video on evolution by Carl Sagan from the beautiful COSMOS

Related to this, one of the first questions that one might ask, and indeed was asked by Darwin himself: Why has evolution taken as much time as it has? How much time would have sufficed for all the complexity that we see around us to evolve? This question infact bothered Darwin a lot in his day. In On the Origins of Species he originally gave an estimate of the Earth’s age to be at least 300 million years, implying indirectly, that there was enough time. This estimate was immediately thrashed by Lord Kelvin, one of the pre-eminent British physicists of the time, who estimated the age of the Earth to be only about 24 million years. This caused Darwin to withdraw the estimate from his book. However, this estimate greatly worried Darwin as he thought 24 million years just wasn’t enough time. To motivate on the same theme Valiant writes the following:

“[…] William Paley, in a highly influential book, Natural Theology (1802) , argued that life, as complex as it is, could not have come into being without the help of a designer. Numerous lines of evidence have become available in the two centuries since, through genetics and the fossil record, that persuade professional biologists that existing life forms on Earth are indeed related and have indeed evolved. This evidence contradicts Paley’s conclusion, but it does not directly address his argument. A convincing direct counterargument to Paley’s would need a specific evolution mechanism to be demonstrated capable of giving rise to the quantity and quality of the complexity now found in biology, within the time and resources believed to have been available. […]”

A specific, albeit more limited version of this question might be: Consider the human genome, which has about 3 billion base pairs. Now, if evolution is a search problem, as it naturally appears to be, then why did the evolution of this genome not take exponential time? If it would have taken exponential time then clearly such evolution could not have happened in any time scale since the origin of the earth. Thus, a more pointed question to ask would be: What circuits could evolve in sub-exponential time (and on a related note, what circuits are evolvable only in exponential time?). Given the context, the idea of thinking about this in circuits might seem a little foreign. But on some more thought it is quite clear that the idea of a circuit is natural when it comes to modeling such systems (at least in principle). For example: One might think of the genome as a circuit, just as how one might think of the complex protein interaction networks and networks of neurons in the brain as circuits that update themselves in response to the environment.

The last line is essentially the key idea of adaptation, that entities interact with the environment and update themselves (hopefully to cope better) in response. But the catch is that the world/environment is far too complex for simple entities (relatively speaking), with limited computational power, to have a theory for. Hence, somehow the entity will have to cope without really “understanding” the environment (it can only be partially modeled) and improve their functionality. The key thing to pay attention here is the interaction or the interface between the limited computational entity in question and the environment. The central idea in Valiant’s thesis is to think of and model this interaction as a form of computational learning. The entity will absorb information about the world and “learn” so that in the future it “can do better”. A lot of Biology can be thought of as the above: Complex behaviours in environments. Wherein by complex behaviours we mean that in different circumstances, we (or alternately our limited computational entity) might behave completely differently. Complicated in the sense that there can’t possibly be a look up table for modeling such complex interactions. Such interactions-responses can just be thought of as complicated functions e.g. how would a deer react could be seen as a function of some sensory inputs. Or for another example: Protein circuits. The human body has about 25000 proteins. How much of a certain protein is expressed is a function depending on the quantities of other proteins in the body.  This function is quite complicated, certainly not something like a simple linear regression. Thus there are two main questions to be asked: One, how did these circuits come into being without there being a designer to actually design them? Two, How do such circuits maintain themselves? That is, each “node” in protein circuits is a function and as circumstances change it might be best to change the way they work. How could have such a thing evolved?

Given the above, one might ask another question: At what rate can functionality increase or adapt under the Darwinian process? Valiant (like many others, such as Chaitin. See this older blog post for a short discussion) comes to the conclusion that the Darwinian evolution is a very elegant computational process. And since it is so, with the change in the environment there has to be a quantitative way of saying how much rate of change can be kept up with and what environments are unevolvable for the entity. It is not hard to see that this is essentially a question in computer science and no other discipline has the tools to deal with it.

In so far that (biological) interactions-responses might be thought of as complicated functions and that the limited computational entity that is trying to cope has to do better in the future, this is just machine learning! This idea, that changing or increasing functionality of circuits in biological evoution is a form of computational learning, is perhaps very obvious in hindsight. This (changing functionality) is done in Machine Learning in the following sense: We want to acquire complicated functions without explicitly programming for them, from examples and labels (or “correct” answers). This looks at exactly at the question at how complex mechanisms can evolve without someone designing it (consider a simple perceptron learning algorithm for a toy example to illustrate this). In short: We generate a hypothesis and if it doesn’t match our expectations (in performance) then we update the hypothesis by a computational procedure. Just based on a single example one might be able to change the hypothesis. One could draw an analogy to evolution where “examples” could be experiences, and the genome is the circuit that is modified over a period of time. But note that this is not how it looks like in evolution because the above (especially drawing to the perceptron example) sounds more Lamarckian. What the Darwinian process says is that we don’t change the genome directly based on experiences. What instead happens is that we make a lot of copies of the genome which are then tested in the environment with the better one having a higher fitness. Supervised Machine Learning as drawn above is very lamarckian and not exactly Darwinian.

Thus, there is something unsatisfying in the analogy to supervised learning. There is a clear notion of a target in the same. Then one might ask, what is the target of evolution? Evolution is thought of as an undirected process. Without a goal. This is true in a very important sense however this is incorrect. Evolution does not have a goal in the sense that it wants to evolve humans or elephants etc. But it certainly does have a target. This target is “good behaviour” (where behaviour is used very loosely) that would aid in the survival of the entity in an environment. Thus, this target is the “ideal” function (which might be quite complicated) that tells us how to behave in what situation. This is already incorporated in the study of evolution by notions such as fitness that encode that various functions are more beneficial. Thus evolution can be framed as a kind of machine learning problem based on the notion of Darwinian feedback. That is, make many copies of the “current genome” and let the one with good performance in the real world win. More specifically, this is a limited kind of PAC Learning.  If you call your current hypothesis your genome, then your genome does not depend on your experiences. Variants to this genome are generated by a polynomial time randomized Turing Machine. To illuminate the difference with supervised learning, we come back to a point made earlier that PAC Learning is essentially Lamarckian i.e. we have a hidden function that we want to learn. One however has examples and labels corresponding to this hidden function, these could be considered “queries” to this function. We then process these example/label pairs and learn a “good estimate” of this hidden function in polynomial time. It is slightly different in Evolvability. Again, we have a hidden “ideal” function f(x). The examples are genomes. However, how we can find out more about the target is very limited, since one can empirically only observe the aggregate goodness of the genome (a performance function). The task then is to mutate the genome so that it’s functionality improves. So the main difference with the usual supervised learning is that one could query the hidden function in a very limited way: That is we can’t act on a single example and have to take aggregate statistics into account.

Then one might ask what can one prove using this model? Valiant demonstrates some examples. For example, Parity functions are not evolvable for uniform distribution while monotone disjunctions are actually evolvable. This function is ofcourse very non biological but it does illustrate the main idea: That the ideal function together with the real world distribution imposes a fitness landscape and that in some settings we can show that some functions can evolve and some can not. This in turn illustrates that evolution is a computational process independent of substrate.

In the same sense as above it can be shown that Boolean Conjunctions are not evolvable for all distributions. There has also been similar work on real valued functions off late which is not reported in detail in the book. Another recent work that is only mentioned in passing towards the end is the study of multidimensional space of actions that deal with sets of functions (not just one function) that might be evolvable together. This is an interesting line of work as it is pretty clear that biological evolution deals with the evolution of a set of functions together and not functions in isolation.

________________

Overall I think the book is quite good. Although I would rate it 3/5. Let me explain. Clearly this book is aimed at the non expert. But this might be disappointing to those who bought the book because of the fact that this recent area of work, of studying evolution through the lens of computational learning, is very exciting and intellectually interesting. The book is also aimed at biologists, and considering this, the learning sections of the book are quite dumbed down. But at the same time, I think the book might fail to impress most of them any way. I think this is because generally biologists (barring a small subset) have a very different way of thinking (say as compared to the mathematicians or computer scientists) especially through the computational lens. I have had some arguments about the main ideas in the book over the past couple of years with some biologist friends who take the usage of “learning” to mean that what is implied is that evolution is a directed process. It would have been great if the book would have spent more time on this particular aspect. Also, the book explicitly states that it is about quantitative aspects of evolution and has nothing to do with speciation, population dynamics and other rich areas of study. However, I have already seen some criticism of the book by biologists on this premise.

As far as I am concerned, as an admirer of Prof. Leslie Valiant’s range and quality of contributions, I would have preferred if the book went into more depth. Just to have a semi-formal monograph on the study of evolution using the tools of PAC Learning right from the person who initiated this area of study. However this is just a selfish consideration.

________________

References:

[1] A Theory of the Learnable, Leslie Valiant, Communications of the ACM, 1984 (PDF).

[2] Evolvability, Leslie Valiant, Journal of the ACM, 2007 (PDF).

________________

Maximum Number of Regions in Arrangement of Hyperplanes

A semi-useful fact for people working in Machine Learning.

Blurb: Recently, my supervisor was kind enough to institute a small reading group on Deep Neural Networks to help us understand them better. A side product of this would hopefully be that I get to explore this old interest of mine (I have an old interest in Neural Nets and have always wanted to study them in detail, especially work from the mid 90s when the area matured and connections to many others areas such as statistical physics became apparent. But before I got to do that, they seem to be back in vogue! I dub this resurgence as Connectionism 2.0 Although the hipster in me dislikes hype, it is always lends a good excuse to explore an interest). I also attended a Summer School on the topic at UCLA, find slides and talks over here.

Coming back: we have been making our way through this Foundations and Trends volume by Yoshua Bengio (which I quite highly recommend). Bengio spends a lot of time giving intuitions and arguments on why and when Deep Architectures are useful. One particular argument (which is actually quite general and not specific to trees as it might appear) went like this:

Ensembles of trees (like boosted trees, and forests) are more powerful than a single tree. They add a third level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters. As illustrated in […], they implicitly form a distributed representation […]

This is followed by an illustration with the following figure:

[Image Source: Learning Deep Architectures for AI]

and accompanying text to the above figure:

Whereas a single decision tree (here just a two-way partition) can discriminate among a number of regions linear in the number of parameters (leaves), an ensemble of trees can discriminate among a number of regions exponential in the number of trees, i.e., exponential in the total number of parameters (at least as long as the number of trees does not exceed the number of inputs, which is not quite the case here). Each distinguishable region is associated with one of the leaves of each tree (here there are three 2-way trees, each defining two regions, for a total of seven regions). This is equivalent to a multi-clustering, here three clusterings each associated with two regions.

Now, the following question was considered: Referring to the text in boldface above, is the number of regions obtained exponential in the general case? It is easy to see that there would be cases where it is not exponential. For example: the number of regions obtained by the three trees would be the same  as those obtained by one tree if all three trees overlap, hence giving no benefit. The above claim refers to a paper (also by Yoshua Bengio) where he constructs examples to show the number of regions could be exponential in some cases.

But suppose for our satisfaction we are interested in actual numbers and the following general question, an answer to which should also answer the question raised above:

What is the maximum possible number of regions that can be obtained in ${\bf R}^n$ by the intersection of $m$ hyperplanes?

Let’s consider some examples in the ${\bf R}^2$ case.

Where there is one hyperplace i.e. $m = 1$, then the maximum number of possible regions is obviously two.

[1 Hyperplane: 2 Regions]

Clearly, for two hyerplanes, the maximum number of possible regions is 4.

[2 Hyperplanes: 4 Regions]

In case there are three hyperplanes, the maximum number of regions that might be possible would be 7 as illustrated in the first figure. For $m =4$ i.e. 4 hyerplanes, this number is 11 as shown below:

[4 Hyperplanes: 11 Regions]

On inspection we can see the number of regions with $m$ hyperplanes in ${\bf R}^2$ is given by:

Number of Regions = #lines + #intersections + 1

Now let’s consider the general case: What is the expression that would give us the maximum number of regions possibles with $m$ hyperplanes in ${\bf R}^n$? Turns out that there is a definite expression for the same:

Let $\mathcal{A}$ be an arrangement of $m$ hyperplanes in ${\bf R}^n$, then the maximum number of regions possible is given by

$\displaystyle r(\mathcal{A}) = 1 + m + \dbinom{m}{2} \ldots + \dbinom{m}{n}$

the number of bounded regions (closed from all sides) is given by:

$\displaystyle b(\mathcal{A}) = \dbinom{m - 1}{n}$

The above expressions actually derive from a more general expression when $\mathcal{A}$ is specifically a real arrangement.

I am not familiar with some of the techniques that are required to prove this in the general case ($\mathcal{A}$ is a set of affine hyperplanes in a vector space defined over some Field).  The above might be proven by induction. However, it turns out that in the ${\bf R}^2$ case the result is related to the crossing lemma (see discussion over here), I think it was one of the most favourite results of one of my previous advisors and he often talked about it. This connection makes me want to study the details [1],[2], which I haven’t had the time to look at and I will post the proof for the general version in a future post.

________________

References:

1. An Introduction to Hyperplane Arrangements: Richard P. Stanley (PDF)

Related:

2. Partially Ordered Sets: William Trotter (Handbook of Combinatorics) (PDF)

________________

Onionesque Reality Home >>

IPAM-UCLA Summer School on Deep Learning and Feature Learning

Deep Learning reads Wikipedia and discovers the meaning of life – Geoff Hinton.

The above quote is from a very interesting talk by Geoffrey Hinton I had the chance to attend recently.

I have been at a summer school on Deep Neural Nets and Unsupervised Featured Learning at the Institute for Pure and Applied Mathematics at UCLA since July 9 (till July 27). It has been organized by Geoff Hinton, Yoshua Bengio, Stan Osher, Andrew Ng and Yann LeCun.

I have always been a “fan” of Neural Nets and the recent spike in interest in them has made me excited, thus the school happened at just the right time. The objective of the summer school is to give a broad overview of some of the recent work in Deep Learning and Unsupervised Feature Learning with emphasis on optimization, deep architectures and sparse representations. I must add that after getting here and looking at the peer group I would consider myself lucky to have obtained funding for the event!

[Click on the above image to see slides for the talks. Videos will be added at this location after July 27 Videos are now available]

That aside, if you are interested in Deep Learning or Neural Networks in general, the slides for the talks are being uploaded over here (or click on the image above), videos will be added at the same location some time after the summer school ends so you might like to bookmark this link.

The school has been interesting given the wide range of people who are here. The diversity of opinions about Deep Learning itself has given a good perspective on the subject and the issues and strengths of it. There are quite a few here who are somewhat skeptical of deep learning but are curious, while there are some who have been actively working on the same for a while. Also, it has been enlightening to see completely divergent views between some of the speakers on key ideas such as sparsity. For example Geoff Hinton had a completely different view of why sparsity was useful in classification tasks than compared to Stéphane Mallat, who gave a very interesting talk today even joking that “Hinton and Yann LeCun told you why sparsity is useful, I’ll tell you why sparsity is useless. “. See the above link for more details.

Indeed, such opinions do tell you that there is a lot of fecund ground for research in these areas.

I have been compiling a reading list on some of this stuff and will make a blog-post on the same soon.

________________

Onionesque Reality Home >>

Importing the Szemerédi Regularity Lemma into Machine Learning

Synopsis of a recent direction of work with Gábor Sárközy, Endre Szemerédi and Fei Song — “The Regularity Lemma is a deep result from extremal graph theory having great utility as a fundamental tool to prove theoretical results, but can it be employed in more “practical” settings?”

More specifically we are interested in the problem of harnessing the power of the regularity lemma to do clustering. This blog post is organized as follows: We first sketch the regularity lemma, we then see that it is an existential predicate and state an algorithmic version, we then look at how this constructive version may be used for clustering/segmentation. It must be noted that the idea seems to have originated from an earlier interesting work by Sperotto and Pellilio.

Before a brief introduction to the regularity lemma, I’d quote Luca Trevisan on (the related) Szemeredi’s Theorem from his blog

Szemeredi’s theorem on arithmetic progressions is one of the great triumphs of the “Hungarian approach” to mathematics: pose very difficult problems, and let deep results, connections between different areas of math, and applications, come out as byproducts of the search for a solution.

Even though I am nothing close to being a mathematician, but we’ll see that even an Electrical Engineer can somewhat understand and appreciate what Prof. Trevisan means in this specific context! ;-)

I will attempt to sketch a few things below with more emphasis on intuition than rigour! To be specific – intuition and then rigour.

_________________________

Introduction: One of the most fundamental and ingenious results in graph theory and discrete mathematics is the Szemeredi Regularity Lemma. It was originally advanced by Endre Szemeredi as an auxiliary result to prove a long standing conjecture of Erdős and Turán from 1936 (on the Ramsey properties of arithmetic progressions), which stated that sequences of integers with postive upper density must contain arbitrarily long arithmetic progressions (now called as Szemeredi’s theorem (1978)). Now the regularity lemma by itself is considered as one of the most important tools in graph theory.

Prof. Endre Szemeredi: Originator of the "Regularity Method"

A very rough statement of the regularity lemma could be made as follows:

Every graph can be approximated by random graphs. This is in the sense that every graph can be partitioned into a bounded number of equal parts such that:
1. Most edges run between different parts
2. And that these edges behave as if generated at random.

The fundamental nature of the Regularity Lemma as a tool in Graph Theory (and beyond) could be understood by the fact that two recent Fields Medals (Timothy Gowers, 1998 and Terence Tao, 2006) were awarded atleast in part for achieving breakthrough results on the regularity lemma and using those to prove fundamental results on arithmetic progressions amongst other results (Such as the Green-Tao Theorem).

_________________________

A Temporary Digression and Prof. Trevisan’s Statement: A first look at the statement of the regularity lemma above reminds one (especially those trained in signal processing I suppose) of the Fourier series. While this is extremely simplified – It’s not entirely incorrect to say that it points in part to a deeper relationship between combinatorics and analysis. Let’s look at it this way – Szemeredi employed the regularity lemma (a graph theoretic method) to prove what is now called Szemeredi’s theorem. The Szemeredi Theorem is a difficult one and it has four different proofs (graph-theoretic, ergodic, hypergraph-theoretic and lastly based on Fourier analysis)*.  Infact work on the Szemeredi theorem has contributed in better understanding connections between these areas (as an example – Timothy Gowers was awarded the Fields Medal for proving very deep results on the connections between analysis and combinatorics).

Terence Tao and Timothy Gowers have made breakthrough contributions to the regularity lemma and Szemeredi's Theorem amongst numerous others (Image Source: Republic of Math)

So it could be said that it all really started with Erdős and Turán posing a hard problem – The search for it’s solution has led to not only some good theorems but also some interesting connections between different areas of mathematics and thus a better understanding of all of them as a result. So we see what Prof. Trevisan means!

*While here the connection appears very sketchy though intuitive. The strong relationship between the Regularity Lemma and Analysis was shown in a breakthrough paper by László Lovász and Balázs Szegedy. In this paper they showed that the regularity lemma could be thought of as a result in analysis

Szemeredi Regularity Lemma for the Analyst (László Lovász and Balázs Szegedy)

_________________________

Given the background now we get to stating Szemeredi’s Regularity Lemma:

The Szemeredi Lemma: Expanding upon the rough statement mentioned a couple of paragraphs above, we could be a little more specific and say that

The Lemma states that every graph could be partitioned into a bounded number of quasi-random bi-partite graphs, called regular pairs and a few left over edges.

Now for a more precise statement, we introduce some definitions and notation.

Definition 1: If $G = (V,E)$ is a graph  and $A, B$ are disjoint subsets of $V$. Then, we denote the count of number edges with one endpoint in $A$ and another in $B$ by $e(A,B)$. The density of edges between $A$ and $B$ can then be defined as:

$\displaystyle d(A,B) = \frac{e(A,B)}{|A||B|}$

This just defines the edge density in a bipartite graph.

Definition 2: The bipartite graph $G = (A, B, E)$ is called $\varepsilon$-regular if for every $X \subset A$ and $Y \subset B$ satisfying

$|X| > \varepsilon |A|, |Y| > \varepsilon |B|$

we would have $|d(X,Y) - d(A,B)| < \varepsilon$

This means that a bipartite graph is epsilon regular if we were to take any random subsets (of some minimum size) $X, Y$ of $A,B$ respectively and even then find that the edge density between these subsets is almost the same as the edge density in the original bipartite graph. In effect this implies that if a bipartite graph is epsilon regular then all the edges between the the two disjoint sets are distributed uniformly.

Definition 3: An equitable partition of the vertex set $V$ of a graph $G = (V,E)$ is a partition of $V$ such that all the classes $C_0, C_1, \dots, C_k$ are pairwise disjoint. And that all classes $C_i$ with $1 \leq i \leq k$ have the same cardinality. It is noteworthy that oftentimes the vertex set might not have a cardinality that could be equally divided into the said number of classes, thus $C_0$ is present for a technical reason: To ensure that all the classes have the same cardinality.

Definition 4: For every equitable partition $P$ of the vertex set $V$ of $G = (V,E)$ into classes $C_0, C_1, \dots, C_k$ we associate a measure called the potential or the index of the partition $P$, which is defined as:

$\displaystyle ind(P) = \frac{1}{k^2} \sum_{1 \leq r \leq s \leq k} d(C_r, C_s)^2$

This measure just defines how close a partition is close to a regular one.

Definition 5: An equitable partition $C_0, C_1, \dots, C_k$ of the vertex set of graph $G$ given by $V$, where $C_0$ is the exceptional set, is called $\varepsilon$-regular if $C_0 < \varepsilon |V|$ and all but $\varepsilon k^2$ of the pairs $(C_i,C_j)$ are $\varepsilon$-regular such that $1 \leq i < j \leq k$.

We are now in a position to state the regularity lemma:

Theorem 1: [Regularity Lemma]
For every positive $\varepsilon$ and positive integer m, there are positive integers $N = N(\varepsilon, m)$ and $M = M(\varepsilon, m)$ such that the following property holds: For all graphs $G$ with $|V|\geq N$, there is an $\varepsilon$-regular partition of $G(V,E)$ into $k+1$ classes such that $m\leq k \leq M$.

The beauty of the regularity lemma is in the point that that the approximation for any graph does not depend on the number of vertices it has, but only on the error in approximation (represented by $\varepsilon$).

In the proof of the regularity lemma we start with an initial partition (with low potential) and then iteratively refine the partition in such a way that the potential increases to a point such that the partition is epsilon regular. However, this leads to a major problem (the first of our concerns) – In refining the partitions iteratively the number of classes increases exponentially in each iteration. We then end up with a partition in which the number of classes is a tower function – usually an astronomical figure. Clearly, this does not work very well for graphs arising in practice. This same point is made by Prof. Trevisan:

[…] one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.

As an aside: For a long while it was thought that maybe a tower function is not necessary. However in a remarkable paper Timothy Gowers constructed graphs that demonstrated that functions of the tower type were indeed necessary.

In any case, how can we get around this problem for approximating most graphs so that it could be useful in applications such as clustering? A possible solution to this problem in the context of clustering was proposed by Sperotto and Pelillo. They make some interesting points, however do not provide many details on their approach. We will get back to this problem in a short while.

But the first problem that we face is the following: The Szemeredi Lemma as originally proposed is an existential predicate. It does not give a method to obtain the regular partition for a given graph, but only says that one must exist! So if we are to use the Lemma in any practical setting, we need an algorithmic version. There now exist two algorithmic versions:

1. One proposed in a paper by Alon et al in 1992.

2. Another proposed by Frieze and Kannan and is based on the very intriguing relationship between singular values and regularity!

We for now, focus on the Alon et al. version.

_________________________

Algorithmic Version of the Regularity Lemma:

Theorem 2: [A Constructive Version of the Regularity Lemma – Alon et al.]

For every $\varepsilon > 0$ and every positive integer $t$ there is an integer $Q = Q(\varepsilon, t)$ such that every graph with $n > Q$ vertices has an $\varepsilon$-regular partition into $k + 1$ classes, where $t \le k \le Q$. For every fixed $\varepsilon > 0$ and $t \ge 1$ such a partition can be found in $O(M(n))$ sequential time, where $M(n)$ is the time for multiplying two $n$ by $n$ matrices with 0,1 entries over the integers. It can also be found in time $O(logn)$ on an ERPW PRAM with a polynomial number of parallel processors.

To really understand how the above theorem would require the introduction of several supporting definitions and lemmas (including one from Szemeredi’s original paper). Since our focus is on the application, we’d just state one lemma using which the idea behind the constructive version could be revealed in a more concrete sense.

Lemma 1: [Alon et al.] Let $H$ be a bipartite graph with equally sized classes $|A| = |B| = n$. Let $\displaystyle 2n^{\frac{-1}{4}} < \varepsilon <\frac{1}{16}$. There is an $O(M(n))$ algorithm that verifies that $H$ is $\varepsilon$-regular or finds two subset $A' \subset A$, $B' \subset B$, $\displaystyle |A'| \ge \frac{{\varepsilon}^4}{16}n$, $\displaystyle |B'| \ge \frac{{\varepsilon}^4}{16}n$, such that $\displaystyle |d(A, B) - d(A', B')| \ge \varepsilon^4$. The algorithm can be parallelized and implemented in $NC$.

This lemma basically says that whether or not a bipartite graph is epsilon regular is a question that can be answered very quickly. If it is – then we can have an algorithm to say so and if it is not, it should return the answer with a certificate or proof that it is not so. This certificate is nothing but subsets of the classes from the original bipartite graph. The idea of the certificate is to help to proceed to the next step.

The general idea in the Alon algorithm is:

Start with a random equitable partition of the graph $C_0, C_1, \dots, C_b$, where $\displaystyle |C_i| = \lfloor\ \frac{n}{b} ;\rfloor$ and $i = 1, 2, \dots, b$. Also $|C_0| < b$. Also let $k_1 = b$.

Then, for each pair $C_r, C_s$ in the partition $P_i$ check for regularity. If it is regular report so, if not find the certificate for the pair. If out of all the pairs, only $\varepsilon k_i^2$ are not epsilon regular – then halt, the partition $P_i$ is epsilon regular. If not then we refine this partition using the information gained from the certificates in the cases when the pairs not epsilon regular. On refining this partition we obtain a new partition $P'$ with $1 + k_i4^{k_i}$ classes. We repeat this process till we hit a partition which is epsilon regular.

_________________________

Using the Constructive Version for Clustering:

A little earlier it was pointed out that there are two main difficulties in using the Szemeredi Lemma in practical settings. At one level, to use the lemma in a practical setting we need a constructive version. However the constructive versions are so designed that they work for all graphs. To illustrate how this is a problem, we quote a line from the algorithmic version based on singular values (Freize and Kannan):

The Algorithm finishes in atmost $O(\varepsilon^{-45})$ steps with an $\varepsilon$-regular partition

Now, consider what this implies – If $\varepsilon$ is $\frac{1}{16}$ (a typical value). Then the number of steps in which we are guaranteed to find a regular partition is 1.53249554 × 1054.

This is astonishingly large number. To make the lemma truly practical, we would have to give up the aim of making it work for all graphs. Instead we should be content that it works on most graphs. Most graphs would largely be graphs that appear in practice. Like was mentioned earlier some directions in this regard were provided by Sperotto and Pellilio.

Another problem (as was mentioned earlier too) was that the number of classes grows exponentially with each refinement step. We can not allow the number of classes to grow astronomically either because if we do so then consequently we would never be able to refine the partitions far enough. Here we would have to compromise on the generality again. Instead of allowing the number of classes to grow exponentially we could allow it to grow by a defined value in each iteration. This value could be chosen depending on a number of parameters, including the data set size.

Even in making such an approximation – the potential would always increase with each iteration albeit much slowly as compared to the methodology as originally described. Infact we would say that it would work for most graphs. In our work this approach seems to work quite well.

Given these two changes, what still remains is how could the lemma be used for clustering datasets? One possible way is suggested by another result called the Key Lemma and the implication that it might have. This is stated below:

Lemma 2: Given an arbitrary graph $G=(V,E)$, a partition of $V$ in $k$ clusters as in regularity lemma described above and two parameters $\varepsilon, d$, we describe the reduced graph $G^R$ as the graph whose vertices are associated to the clusters and whose edges are associated to the $\varepsilon$-regular pairs with density above $d$. If we have a coloring on the edges of $G$, then the edges of the reduced graph will be colored with a color that appears on most of the edges between the two clusters.

Using the properties of what is called the Key-Lemma/Blow-Up Lemma, it is understood that this reduced graph should retain properties of the original graph. And thus any changes made on this reduced graph would reflect on the original graph.

Thus, one possible way to cluster is using a two-phase clustering strategy: First, we use the modified regularity lemma to construct a reduced graph using the method above. In the second phase, we use a pairwise clustering method such as spectral clustering to do cluster the the reduced graph. The results obtained on this reduced graph are then projected back onto the original graph using Lemma 2.

Two Phase Strategy to use the Regularity Lemma

Infact such a methodology gives results quite encouraging as compared to other standard clustering methods. These results would be reported here once the paper under consideration is published. :) Another thing to be noted is that the reduced graph is usually quite small as compared to the original graph and working on this smaller – reduced graph is much faster.

_________________________

Other Possible Changes:

1. The Alon et al. algorithm is used and modified to use the regularity lemma for clustering. It would be interesting to explore the results given by the Frieze & Kannan method as it has a different way to find the certificate.

2. There has been some work on the sparse regularity lemma. This is something that would have a lot of practical value as in the above we construct a dense graph. Using the sparse version would allow us to use nearest neighbor graphs instead of dense graphs. This would reduce the computational burden significantly.

3. A recent paper by Fischer et al.Approximate Hypergraph Partitioning and Applications” has received a lot of attention. In this paper they give a new approach for finding regular partitions. All the previous ones work to find partitions of the tower type, while this paper gives a method to find a smaller regular partition if one exists in the graph. Employing this methodology for refinement instead of using an approximate version of the algorithmic regularity lemma could be a fruitful direction of work.

_________________________

Onionesque Reality Home >>

“Darwinian Evolution is a form of PAC (Machine) Learning”

Changing or increasing functionality of circuits in biological evolution is a form of computational learning. – Leslie Valiant

The title of this post comes from Prof. Leslie Valiant‘s The ACM Alan M. Turing award lecture titled “The Extent and Limitations of Mechanistic Explanations of Nature”.

Prof. Leslie G. Valiant

Click on the image above to watch the lecture

[Image Source: CACM “Beauty and Elegance”]

Short blurb: Though the lecture came out sometime in June-July 2011, and I have shared it (and a paper that it quotes) on every online social network I have presence on, I have no idea why I never blogged about it.

The fact that I have zero training (and epsilon knowledge of) in biology that has not stopped me from being completely fascinated by the contents of the talk and a few papers that he cites in it. I have tried to see the lecture a few times and have also started to read and understand some of the papers he mentions. Infact, the talk has inspired me enough to know more about PAC Learning than the usual Machine Learning graduate course might cover. Knowing more about it is now my “full time side-project” and it is a very exciting side-project to say the least!

_________________________

It is widely accepted that Darwinian Evolution has been the driving force for the immense complexity observed in life or how life evolved. In this beautiful 10 minute video Carl Sagan sums up the timeline and the progression:

There is however one problem: While evolution is considered the driving force for such complexity, there isn’t a satisfactory explanation of how 13.75 billion years of it could have been enough. Many have often complained that this reduces it to a little more than an intuitive explanation. Can we understand the underlying mechanism of Evolution (that can in turn give reasonable time bounds)? Valiant makes the case that this underlying mechanism is of computational learning.

There have been a number of computational models that have been based on the general intuitive idea of Darwinian Evolution. Some of these include: Genetic Algorithms/Programming etc. However, people like Valiant amongst others find such methods useful in an engineering sense but unsatisfying w.r.t the question.

In the talk Valiant mentions that this question was asked in Darwin’s day as well. To which Darwin proposed a bound of 300 million years for such evolution to occur. This immediately fell into a problem as Lord Kelvin, one of the leading physicists of the time put the figure of the age of Earth to be 24 million years. Now obviously this was a problem as evolution could not have happened for more than 24 million years according to Kelvin’s estimate. The estimate of the age of the Earth is now much higher. ;-)

The question can be rehashed as: How much time is enough? Can biological circuits evolve in sub-exponential time?

For more I would point out to his paper:

Evolvability: Leslie Valiant (Journal of the ACM – PDF)

Towards the end of the talk he shows a Venn diagram of the type usually seen in complexity theory text books for classes P, NP, BQP etc but with one major difference: These subsets are fact and not unproven:

$Fact: Evolvability \subseteq SQ Learnable \subseteq PAC Learnable$

*SQ or Statistical Query Learning is due to Michael Kearns (1993)

Coda: Valiant claims that the problem of evolution is no more mysterious than the problem of learning. The mechanism that underlies biological evolution is “evolvable target pursuit”, which in turn is the same as “learnable target pursuit”.

_________________________

Onionesque Reality Home >>