
Recently, in the course of a project I was involved in, I came across an interesting quadratic form, known in the literature as the Gradient Outer Product. This operator, which has applications in supervised dimensionality reduction, inverse regression, and metric learning, can be motivated in two (related) ways. Before doing so, here is the setup:

Setup: Suppose we have the usual setup for nonparametric regression and (binary) classification, i.e. let $Y \approx f(X)$ for some unknown smooth $f$, where the input $X$ is $d$-dimensional: $X = (X^i)_{i=1}^d$.

1. Supervised Dimensionality Reduction: It is often the case that $f$ varies most along only some relevant coordinates. This is the main motivation behind variable selection.

The idea in variable selection is the following: $f(X)$ may be written as $f(PX)$, where $P \in \{0,1\}^{k \times d}$ projects the data down to only $k$ relevant coordinates (i.e. some features are selected by $P$ while the others are discarded).
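As a concrete illustration of this model (the dimensions and selected coordinates below are made up for the example), a selection matrix $P$ simply picks out the relevant features:

```python
import numpy as np

# Hypothetical example: d = 5 features, of which only coordinates 0 and 3
# are relevant (k = 2). P selects them and discards the rest.
d, k = 5, 2
P = np.zeros((k, d))
P[0, 0] = 1.0  # keep feature 0
P[1, 3] = 1.0  # keep feature 3

x = np.array([2.0, -1.0, 0.5, 3.0, 7.0])
print(P @ x)  # -> [2. 3.]: f(x) = f(Px) only sees the selected coordinates
```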

This idea is generalized in Multi-Index Regression, where the goal is to recover the subspace most relevant to prediction. That is, suppose the data varies significantly along all coordinates, but $f$ still depends only on a subspace of smaller dimensionality. This can be modeled by letting $P$ above be a general matrix $P \in \mathbb{R}^{k \times d}$. It is important to note that $P$ is not an arbitrary subspace, but the $k$-dimensional subspace such that projecting the data onto it incurs the least regression error. This idea can be generalized further by mapping $X$ to $P$ non-linearly, but for now we stick to a linear relevant subspace.

How can we recover such a subspace?

________________

2. Average Variation of $f$: Another way to motivate this quantity is the following: Suppose we want to find the direction in which $f$ varies the most on average, the direction in which it varies second most on average, and so on. More generally, given any direction, we want to find the average variation of $f$ along it. How can we recover these?

________________

The Expected Gradient Outer Product:  The expected gradient outer product of the unknown classification or regression function is the quantity: $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$

The expected gradient outer product captures the average variation of $f$ in all directions. This can be seen as follows: the directional derivative at $x$ along $v \in \mathbb{R}^d$ is given by $\displaystyle {f'}_v(x) = \nabla f(x)^T v$, so that $\mathbb{E}_X |{f'}_v(X)|^2 = \mathbb{E}_X (v^T G(X) v) = v^T (\mathbb{E}_X G(X))v$.

From the above it follows that if $f$ does not vary along $v$ then $v$ must be in the null space of $\mathbb{E}_X (G(X))$.
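This identity, and the null-space consequence, can be checked numerically on a toy function with a known gradient (the function and sample sizes below are illustrative):

```python
import numpy as np

# Check E|f'_v(X)|^2 = v^T (E G(X)) v for f(x) = sin(x0) + x1^2,
# which does not depend on x2 at all.
rng = np.random.default_rng(0)
d, n = 3, 2000
X = rng.standard_normal((n, d))

def grad_f(x):
    # f(x) = sin(x0) + x1**2, so df/dx2 = 0 everywhere
    return np.array([np.cos(x[0]), 2.0 * x[1], 0.0])

# sample version of E_X [grad f(X) grad f(X)^T]
G = np.mean([np.outer(grad_f(x), grad_f(x)) for x in X], axis=0)

v = np.array([0.0, 0.0, 1.0])  # f does not vary along e_3 ...
print(v @ G @ v)               # ... so v lies in the null space: this is 0.0

# left-hand side of the identity, computed directly
lhs = np.mean([(grad_f(x) @ v) ** 2 for x in X])
print(np.isclose(lhs, v @ G @ v))  # -> True
```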

In fact, it is not hard to show that the relevant subspace $P$ defined earlier can also be recovered from $\mathbb{E}_X (G(X))$. This fact is given in the following lemma.

Lemma: Under the assumed model i.e. $Y \approx f(PX)$, the gradient outer product matrix $\mathbb{E}_X (G(X))$ is of rank at most $k$. Let $\{v_1, v_2, \dots, v_k \}$ be the eigenvectors of $\mathbb{E}_X (G(X))$ corresponding to the top $k$ eigenvalues of $\mathbb{E}_X (G(X))$. Then the following is true:

$span(P) = span(v_1, v_2, \dots, v_k)$

This means that a spectral decomposition of $\mathbb{E}_X (G(X))$ recovers the relevant subspace. Note also that the Gradient Outer Product can be seen as a kind of supervised analogue of Principal Component Analysis.
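The lemma can be verified on a synthetic multi-index model where the gradient is available in closed form; here I take $f(x) = \|Px\|^2$, so $\nabla f(x) = 2 P^T P x$ lies in the row space of $P$ (the specific $f$ and dimensions are my choice, not from the post):

```python
import numpy as np

# Recover span(P) from the top-k eigenvectors of the sample E G(X).
rng = np.random.default_rng(1)
d, k, n = 6, 2, 20000
P = rng.standard_normal((k, d))
X = rng.standard_normal((n, d))

grads = 2.0 * X @ P.T @ P          # row i is grad f(x_i) = 2 P^T P x_i
EG = grads.T @ grads / n           # sample gradient outer product, rank <= k

eigvals, eigvecs = np.linalg.eigh(EG)  # eigenvalues in ascending order
V = eigvecs[:, -k:]                    # eigenvectors of the top k eigenvalues

# span check: projecting the rows of P onto span(V) should change nothing
proj = V @ V.T
print(np.allclose(P @ proj, P))    # -> True: span(P) = span(v_1, ..., v_k)
```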

________________

Estimation: Of course, in real settings the function is unknown and we are only given samples from it. There are various estimators for $\mathbb{E}_X (G(X))$, which usually involve estimating the derivatives. In one of them, the idea is to estimate at each point $x$ a linear approximation to $f$; the slope of this approximation estimates the gradient at that point. Repeating this at the $n$ sample points gives a sample gradient outer product. Some of these estimators have been shown to be statistically consistent.
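A minimal sketch of the local-linear idea follows; the kernel weighting and bandwidth below are illustrative choices, not the specific estimator from the literature:

```python
import numpy as np

# Estimate grad f at each sample point by a locally weighted linear fit;
# the slope of the fit is the gradient estimate.
rng = np.random.default_rng(2)
d, n, h = 2, 500, 0.5
X = rng.uniform(-1, 1, (n, d))
y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(n)  # depends on x0 only

def local_gradient(x0):
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / h**2)  # Gaussian weights
    sw = np.sqrt(w)
    A = np.hstack([np.ones((n, 1)), X - x0])           # intercept + slopes
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[1:]                                    # slope ~ grad f(x0)

# sample gradient outer product from the n estimated gradients
G_hat = np.mean([np.outer(g, g) for g in map(local_gradient, X)], axis=0)
print(G_hat)  # G_hat[0, 0] dominates, since f varies only along x0
```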

________________

Related: Gradient Based Diffusion Maps: The gradient outer product cannot isolate local information or geometry, and its spectral decomposition, as seen above, gives only a linear embedding. One way to obtain a non-linear dimensionality reduction is to borrow from and extend the idea of diffusion maps, which are well-established tools in semi-supervised learning. The central quantity of interest for diffusion maps is the graph Laplacian $L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, where $D$ is the degree matrix and $W$ the adjacency matrix of the nearest-neighbor graph constructed on the data points. The non-linear embedding is obtained by a spectral decomposition of the operator $L$ or its powers $L^t$.

As above, a similar diffusion operator may be constructed by using local gradient information. One such possible operator could be:

$\displaystyle W_{ij} = W_{f}(x_i, x_j) = \exp \Big( - \frac{ \| x_i - x_j \| ^2}{\sigma_1} - \frac{ \big| \frac{1}{2} (\nabla f(x_i) + \nabla f(x_j))^T (x_i - x_j) \big|^2 }{\sigma_2}\Big)$

Note that the first term is the same as the one used in unsupervised dimension reduction techniques such as Laplacian eigenmaps and diffusion maps. The second term can be interpreted as a diffusion on function values. This operator gives a way to do non-linear supervised dimension reduction using gradient information.
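A direct implementation of this affinity and the resulting normalized Laplacian might look as follows; the gradients are taken as known here (for $f(x) = \sin(x_0)$), and the bandwidths $\sigma_1, \sigma_2$ are illustrative:

```python
import numpy as np

# Build W_f and L = I - D^{-1/2} W D^{-1/2} from pairwise distances and
# first-order (gradient-based) differences in function values.
rng = np.random.default_rng(3)
n, d = 100, 2
X = rng.standard_normal((n, d))
grads = np.column_stack([np.cos(X[:, 0]), np.zeros(n)])  # grad of sin(x0)

sigma1, sigma2 = 1.0, 1.0
diff = X[:, None, :] - X[None, :, :]                  # pairwise x_i - x_j
dist2 = np.sum(diff ** 2, axis=-1)
avg_grad = 0.5 * (grads[:, None, :] + grads[None, :, :])
fdiff2 = np.sum(avg_grad * diff, axis=-1) ** 2        # squared f-difference term
W = np.exp(-dist2 / sigma1 - fdiff2 / sigma2)

Dm12 = np.diag(1.0 / np.sqrt(W.sum(axis=1)))          # D^{-1/2}
L = np.eye(n) - Dm12 @ W @ Dm12

# the non-linear embedding comes from the bottom eigenvectors of L
vals, vecs = np.linalg.eigh(L)
print(vals[0])  # smallest eigenvalue of a normalized Laplacian is ~0
```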

The above operator was defined here; however, no consistency results for it are provided.

Also see: The Jacobian Inner Product.

________________