
This post may be considered an extension of the previous post.

The setup and notation are the same as in the previous post (linked above). To summarize: earlier we had an unknown smooth regression function f: \mathbb{R}^d \to \mathbb{R}. The idea was to estimate, at each training point, the gradient of this unknown function f, and then take the sample expectation of the outer product of the gradient. This quantity has some interesting properties and applications.

However, it has its limitations. For one, the mapping f: \mathbb{R}^d \to \mathbb{R} restricts the Gradient Outer Product to being useful only for regression and binary classification (since binary classification can be thought of as regression). It is not clear whether a similar operator can be constructed when one is dealing with multi-class classification, that is, when the unknown smooth function is a vector valued function f: \mathbb{R}^d \to \mathbb{R}^c, where c is the number of classes (let us say, for the purpose of this discussion, that for each data point we have a probability distribution over the classes, i.e. a c dimensional vector).

In the case of the gradient outer product, since we were working with a real valued function, it was possible to define the gradient at each point, which is simply:

\displaystyle \Bigg[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \Bigg]

For a vector valued function f: \mathbb{R}^d \to \mathbb{R}^c we cannot define a gradient, but we can instead define the Jacobian at each point:

\displaystyle \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_c}{\partial x_1} & \frac{\partial f_c}{\partial x_2} & \dots & \frac{\partial f_c}{\partial x_d}\end{bmatrix}

Note that \mathbf{J} may be estimated in a similar manner as the gradients were estimated in the previous post, which leads us to define the quantity \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}).
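As a quick illustration (not the estimator used in the literature, which works from samples alone; here the toy function is taken as known and its Jacobian is approximated by finite differences, with all names my own), the sample version of \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) could be sketched as:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-5):
    """Finite-difference Jacobian of f: R^d -> R^c at a point x (a c x d matrix)."""
    d = x.shape[0]
    fx = f(x)
    J = np.zeros((fx.shape[0], d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def expected_JtJ(f, X):
    """Sample estimate of E_X[J^T J] over the rows of X (an n x d array)."""
    G = np.zeros((X.shape[1], X.shape[1]))
    for x in X:
        J = jacobian_fd(f, x)
        G += J.T @ J
    return G / X.shape[0]

# Toy vector-valued function f: R^3 -> R^2 that only depends on x_0 and x_1.
f = lambda x: np.array([np.sin(x[0]), x[0] * x[1]])
X = np.random.randn(200, 3)
print(expected_JtJ(f, X))   # third row/column is numerically zero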

The first thing to note is that \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T), defined in the previous post, is simply this quantity for the special case f: \mathbb{R}^d \to \mathbb{R}. Another note is in order: the reason we suffixed that quantity with “outer product” (as opposed to the “inner product” here) is simply that we considered the gradient to be a column vector; otherwise the two are similar in spirit.

It is also easy to see that \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is a symmetric, positive semi-definite matrix and hence defines a Riemannian metric, in the following sense:

Definition: A Riemannian metric G on a manifold \mathcal{M} assigns to each point x \in \mathcal{M} a symmetric, positive semi-definite matrix that defines a smoothly varying inner product \langle a, b \rangle_x on the tangent space \mathbf{T}_x \mathcal{M}, for a, b \in \mathbf{T}_x \mathcal{M}. This associated p.s.d. matrix is called the metric tensor. In the above case, since \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is p.s.d., it defines a Riemannian metric:

\langle a, b \rangle_x = a^T \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) b

Thus, \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is a specific metric (more general metrics are dealt with in areas such as metric learning).
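As a small illustration of the induced inner product (the Jacobian values below are made up purely for the example):

```python
import numpy as np

# Inner product induced by a metric tensor M = E_X[J^T J] (illustrative values only).
def metric_inner(M, a, b):
    return a @ M @ b

J = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 0.0]])    # a made-up Jacobian with c = 2, d = 3
M = J.T @ J                        # symmetric and p.s.d. by construction
a = np.array([1.0, 0.0, 1.0])
b = np.array([0.0, 1.0, -1.0])
print(metric_inner(M, a, b), metric_inner(M, b, a))   # equal: the form is symmetric
print(metric_inner(M, a, a))                          # non-negative: the form is p.s.d.
```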

Properties: We saw some properties of \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T) in the previous post. In the same vein, does \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) have similar properties? That is, does the first eigenvector still correspond to the direction of highest average variation? What about the k-dimensional subspace? What difference does it make that we are now looking at a vector valued function? And what changes between the cases d > c and d \leq c?

These are questions that I still need to think about, and they should be the topic of a future post, hopefully soon.

________________

Recently, in the course of a project that I had some involvement in, I came across an interesting quadratic form, known in the literature as the Gradient Outer Product. This operator, which has applications in supervised dimensionality reduction, inverse regression and metric learning, can be motivated in two (related) ways, but before doing so, the following is the setup:

Setup: Suppose we have the usual setup for nonparametric regression and (binary) classification, i.e. let Y \approx f(X) for some unknown smooth f, where the input X is d dimensional, X = (X^i)_{i=1}^d.

1. Supervised Dimensionality Reduction: It is often the case that f varies most along only some relevant coordinates. This is the main motivation behind variable selection.

The idea in variable selection is the following: f(X) may be written as f(PX), where P \in \{0,1\}^{k \times d}. P projects the data down to only k relevant coordinates (i.e. some features are selected by P while others are discarded).
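For instance, such a selection matrix might look as follows (a toy sketch with d = 5 and k = 2; the chosen coordinates are arbitrary):

```python
import numpy as np

# A hypothetical selection matrix P in {0,1}^{k x d} keeping coordinates 0 and 3 of d = 5.
P = np.array([[1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0]])
x = np.array([3.1, -0.2, 7.0, 0.5, 2.2])
print(P @ x)   # [3.1, 0.5]: under the model, f(x) = f(Px) only "sees" these features
```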

This idea is generalized in Multi-Index Regression, where the goal is to recover a subspace most relevant to prediction. That is, now suppose the data varies significantly along all coordinates but f still depends only on a subspace of smaller dimensionality. This may be modeled by letting the P above be a real matrix P \in \mathbb{R}^{k \times d}. It is important to note that P is not just any subspace, but rather the k-dimensional subspace such that projecting the data onto it gives the least regression error. This idea might be generalized further by mapping X non-linearly, but for now we stick to the relevant (linear) subspace.

How can we recover such a subspace?

________________

2. Average Variation of f: Another way to motivate this quantity is the following: suppose we want to find the direction in which f varies the fastest on average, or the direction in which f varies the second fastest on average, and so on. More generally, given any direction, we want to find the variation of f along it. How can we recover these?

________________

The Expected Gradient Outer Product:  The expected gradient outer product of the unknown classification or regression function is the quantity: \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)

The expected gradient outer product recovers the average variation of f in all directions. This can be seen as follows: the directional derivative at x along v \in \mathbb{R}^d is given by \displaystyle {f'}_v(x) = \nabla f(x)^T v, so that \mathbb{E}_X |{f'}_v(X)|^2 = \mathbb{E}_X (v^T G(X) v) = v^T (\mathbb{E}_X G(X))v.

From the above it follows that if f does not vary along v, then v must be in the null space of \mathbb{E}_X (G(X)).
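A minimal numerical check of this identity, assuming the gradient is available (here via finite differences on a toy f that is constant along one direction; all names are mine):

```python
import numpy as np

def grad_fd(f, x, eps=1e-5):
    """Finite-difference gradient of a scalar-valued f at the point x."""
    d = x.shape[0]
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda x: np.sin(x[0] + x[1])          # constant along v = (1, -1)/sqrt(2)
X = np.random.randn(500, 2)
EG = np.mean([np.outer(grad_fd(f, x), grad_fd(f, x)) for x in X], axis=0)

v = np.array([1.0, -1.0]) / np.sqrt(2)
print(v @ EG @ v)                          # ~ 0: v is numerically in the null space
```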

In fact, it is not hard to show that the relevant subspace P defined earlier can also be recovered from \mathbb{E}_X (G(X)). This fact is given in the following lemma.

Lemma: Under the assumed model i.e. Y \approx f(PX), the gradient outer product matrix \mathbb{E}_X (G(X)) is of rank at most k. Let \{v_1, v_2, \dots, v_k \} be the eigenvectors of \mathbb{E}_X (G(X)) corresponding to the top k eigenvalues of \mathbb{E}_X (G(X)). Then the following is true:

span(P) = span(v_1, v_2, \dots, v_k)

This means that a spectral decomposition of \mathbb{E}_X (G(X)) recovers the relevant subspace. Note also that the Gradient Outer Product gives a kind of supervised version of Principal Component Analysis.
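A sketch of this recovery, assuming gradient estimates are already available (here they are taken from a toy model with an analytic gradient; the function and variable names are mine):

```python
import numpy as np

def relevant_subspace(grads, k):
    """grads: n x d array of (estimated) gradients; returns a d x k basis of the top-k eigenspace."""
    EG = grads.T @ grads / grads.shape[0]        # sample E_X[grad f(X) grad f(X)^T]
    eigvals, eigvecs = np.linalg.eigh(EG)        # eigenvalues in ascending order
    return eigvecs[:, -k:]                       # top-k eigenvectors: estimate of span(P)

# Toy model Y ~ f(PX) with P projecting onto the first two of d = 4 coordinates,
# e.g. f(x) = x_0^2 + sin(x_1), whose analytic gradient is used directly below.
f_grad = lambda x: np.array([2 * x[0], np.cos(x[1]), 0.0, 0.0])
X = np.random.randn(1000, 4)
V = relevant_subspace(np.array([f_grad(x) for x in X]), k=2)
print(np.round(V, 3))    # columns lie (numerically) in the span of e_1, e_2
```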

________________

Estimation: Of course, in real settings the function is unknown and we are only given points sampled from it. There are various estimators for \mathbb{E}_X (G(X)), which usually involve estimating the derivatives. In one of them the idea is to estimate, at each point x, a linear approximation to f; the slope of this approximation approximates the gradient at that point. Repeating this at the n sample points gives a sample gradient outer product. There is some work showing that some of these estimators are statistically consistent.
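The following is a rough sketch of such a local-linear estimator, with a hand-picked Gaussian weighting and bandwidth; it is meant only to illustrate the idea, not to reproduce any particular estimator from the literature:

```python
import numpy as np

def local_linear_gop(X, y, bandwidth=0.5):
    """At each sample point, fit a weighted linear model y ~ b0 + b^T (x - x_i);
    the slope b estimates grad f(x_i). Average b b^T over the sample."""
    n, d = X.shape
    G = np.zeros((d, d))
    for i in range(n):
        diff = X - X[i]                                 # n x d offsets from x_i
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * bandwidth**2))
        A = np.hstack([np.ones((n, 1)), diff])          # intercept + slope design
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        b = coef[1:]                                    # estimated gradient at x_i
        G += np.outer(b, b)
    return G / n

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(300)
print(np.round(local_linear_gop(X, y), 2))              # third row/column close to 0
```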

________________

Related: Gradient Based Diffusion Maps: The gradient outer product cannot isolate local information or geometry, and its spectral decomposition, as seen above, gives only a linear embedding. One way to obtain a non-linear dimensionality reduction would be to borrow from and extend the idea of diffusion maps, which are well established tools in semi-supervised learning. The central quantity of interest for diffusion maps is the graph Laplacian L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, where D is the degree matrix and W the adjacency matrix of the nearest neighbor graph constructed on the data points. The non-linear embedding is obtained by a spectral decomposition of the operator L or its powers L^t.

As above, a similar diffusion operator may be constructed by using local gradient information. One such possible operator could be:

\displaystyle W_{ij} = W_{f}(x_i, x_j) = \exp \Big( - \frac{ \| x_i - x_j \| ^2}{\sigma_1} - \frac{ | \frac{1}{2} (\nabla f(x_i) + \nabla f(x_j))^T (x_i - x_j) |^2 }{\sigma_2}\Big)

Note that the first term is the same as that used in unsupervised dimensionality reduction techniques such as Laplacian eigenmaps and diffusion maps. The second term can be interpreted as a diffusion on function values. This operator gives a way to do non-linear supervised dimensionality reduction using gradient information.

The above operator was defined here; however, no consistency results for it are provided.
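A minimal sketch of how such an operator might be assembled (the bandwidths, the dense affinity used in place of a nearest-neighbor graph, and the toy gradients are choices made purely for illustration):

```python
import numpy as np

def gradient_diffusion_laplacian(X, grads, sigma1=1.0, sigma2=1.0):
    """Affinity W_f built from point distances and first-order changes in f,
    followed by the normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dx = X[i] - X[j]
            df = 0.5 * (grads[i] + grads[j]) @ dx   # averaged-gradient estimate of f(x_i) - f(x_j)
            W[i, j] = np.exp(-dx @ dx / sigma1 - df**2 / sigma2)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(n) - W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return L

X = np.random.randn(60, 2)
grads = np.array([[np.cos(x[0]), 0.0] for x in X])  # gradients of a toy f(x) = sin(x_0)
L = gradient_diffusion_laplacian(X, grads)
vals, vecs = np.linalg.eigh(L)                      # embedding from the smallest eigenvectors
print(np.round(vals[:3], 3))                        # smallest eigenvalue ~ 0
```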

Also see: The Jacobian Inner Product.

________________
