The Jacobian Inner Product

Neighbourhood Gerrymandering: An Approach to Discriminative Metric Learning via Latent Structured Prediction »

The Jacobian Inner Product

August 20, 2014 by Shubhendu Trivedi

This post may be considered an extension of the previous post.

The setup and notation is the same as in the previous post (linked above). But to summarize: Earlier we had an unknown smooth regression function $f: \mathbb{R}^d \to \mathbb{R}$ . The idea was to estimate at each training point, the gradient of this unknown function $f$ , and then taking the sample expectation of the outerproduct of the gradient. This quantity has some interesting properties and applications.

However it has its limitations, for one, the mapping $f: \mathbb{R}^d \to \mathbb{R}$ restricts the Gradient Outer Product being helpful for only regression and binary classification (since for binary classification the problem can be thought of as regression). It is not clear if a similar operator can be constructed when one is dealing with classification, that is the unknown smooth function is a vector valued function $f: \mathbb{R}^d \to \mathbb{R}^c$ where $c$ is the number of classes (let us say for the purpose of this discussion, that for each data point we have a probability distribution over the classes, a $c$ dimensional vector).

In the case of the gradient outer product since we were working with a real valued function, it was possible to define the gradient at each point, which is simply:

$\displaystyle \Bigg[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \Bigg]$

For a vector valued function $f: \mathbb{R}^d \to \mathbb{R}^c$ , we can’t have the gradient, but instead can define the Jacobian at each point:

$\displaystyle \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_c}{\partial x_1} & \frac{\partial f_c}{\partial x_2} & \dots & \frac{\partial f_c}{\partial x_d}\end{bmatrix}$

Note that $\mathbf{J}$ may be estimated in a similar manner as estimating gradients as in the previous posts. Which leads us to define the quantity $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ .

The first thing to note is that $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$ defined in the previous post is simply the quantity for the special case when $f: \mathbb{R}^d \to \mathbb{R}$ . Another note is also in order: The reason why we suffixed that quantity with “outer product” (as opposed to “inner product” here) is simply because we considered the gradient to be a column vector, otherwise they are similar in spirit.

Another thing to note is that it is easy to see that the quantity $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is a positive semi-definite matrix and hence is a Reimannian Metric, which is defined below:

Definition: A Reimannian Metric $G$ on a manifold $\mathcal{M}$ is a symmetric and positive semi-definite matrix, which defines a smoothly varying inner product in the tangent space $\mathbf{T}_x \mathcal{M}$ , for each point $x \in \mathcal{M}$ and $a, b \in \mathbf{T}_x \mathcal{M}$ . This associated p.s.d matrix is called the metric tensor. In the above case, since $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is p.s.d it defines a Reimannian metric:

$\langle a, b \rangle_x = a^T \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) b$

Thus, $\mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ is a specific metric (more general metrics are dealt with in areas such as metric learning).

Properties: We saw some properties of $\mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)$ in the previous post. In the same vein, does $\mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J})$ have similar properties? i.e. does the first eigenvector also correspond to the direction of highest average variation? What about the $k$ -dimensional subspace? What difference does it make that we are looking at a vector valued function? Also what about the cases when $d > c$ and otherwise?

These are questions that I need to think about and should be the topic for a future post to be made soon, hopefully.

Posted in Machine Learning | Tagged Classification, Dimensionality Reduction, Jacobian, Machine Learning, Manifold Learning, Regression, Statistics, Supervised Learning | 1 Comment

One Response

on August 20, 2014 at 3:28 am | Reply The Gradient Outer Product | Onionesque Reality

[…] Implementation and Abstraction in Mathematics The Jacobian Inner Product […]

Comments RSS

	Vihan on 1966 Film on John von Neu…
	Shubhendu Trivedi on Strangeness Minus Three (Richa…
	Shubhendu Trivedi on “Darwinian Evolution is…
	Deep on “Darwinian Evolution is…
	Deep on Strangeness Minus Three (Richa…
	Ofir Gagulashvily on Maximum Number of Regions in A…
	Perception and Visio… on Blind Geometers
	Andrea on A Note on the Graph Lapla…
	eJournal (Final Entr… on A Fractal with Zero Area and a…
	Deep on Where does the Sigmoid in Logi…

Onionesque Reality

..::A Random Walk::..