Posts Tagged ‘Classification’

This post may be considered an extension of the previous post.

The setup and notation is the same as in the previous post (linked above). But to summarize: Earlier we had an unknown smooth regression function f: \mathbb{R}^d \to \mathbb{R}. The idea was to estimate at each training point, the gradient of this unknown function f, and then taking the sample expectation of the outerproduct of the gradient. This quantity has some interesting properties and applications.

However it has its limitations, for one, the mapping f: \mathbb{R}^d \to \mathbb{R} restricts the Gradient Outer Product being helpful for only regression and binary classification (since for binary classification the problem can be thought of as regression). It is not clear if a similar operator can be constructed when one is dealing with classification, that is the unknown smooth function is a vector valued function f: \mathbb{R}^d \to \mathbb{R}^c where c is the number of classes (let us say for the purpose of this discussion, that for each data point we have a probability distribution over the classes, a c dimensional vector).

In the case of the gradient outer product since we were working with a real valued function, it was possible to define the gradient at each point, which is simply:

\displaystyle \Bigg[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \Bigg]

For a vector valued function f: \mathbb{R}^d \to \mathbb{R}^c, we can’t have the gradient, but instead can define the Jacobian at each point:

\displaystyle \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_c}{\partial x_1} & \frac{\partial f_c}{\partial x_2} & \dots & \frac{\partial f_c}{\partial x_d}\end{bmatrix}

Note that \mathbf{J} may be estimated in a similar manner as estimating gradients as in the previous posts. Which leads us to define the quantity \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}).

The first thing to note is that \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T) defined in the previous post is simply the quantity for the special case when f: \mathbb{R}^d \to \mathbb{R}. Another note is also in order: The reason why we suffixed that quantity with “outer product” (as opposed to “inner product” here) is simply because we considered the gradient to be a column vector, otherwise they are similar in spirit.

Another thing to note is that it is easy to see that the quantity \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is a positive semi-definite matrix and hence is a Reimannian Metric, which is defined below:

Definition: A Reimannian Metric G on a manifold \mathcal{M} is a symmetric and positive semi-definite matrix, which defines a smoothly varying inner product in the tangent space \mathbf{T}_x \mathcal{M}, for each point x \in \mathcal{M} and a, b \in \mathbf{T}_x \mathcal{M}. This associated p.s.d matrix is called the metric tensor. In the above case, since \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is p.s.d it defines a Reimannian metric:

\langle a, b \rangle_x = a^T \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) b

Thus, \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is a specific metric (more general metrics are dealt with in areas such as metric learning).

Properties: We saw some properties of \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T) in the previous post. In the same vein, does \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) have similar properties? i.e. does the first eigenvector also correspond to the direction of highest average variation? What about the k-dimensional subspace? What difference does it make that we are looking at a vector valued function? Also what about the cases when d > c and otherwise?

These are questions that I need to think about and should be the topic for a future post to be made soon, hopefully.

Read Full Post »

Recently, in course of a project that I had some involvement in, I came across an interesting quadratic form. It is called in the literature as the Gradient Outer Product. This operator, which has applications in supervised dimensionality reduction, inverse regression and metric learning can be motivated in two (related) ways, but before doing so, the following is the set up:

Setup: Suppose we have the usual set up as for nonparametric regression and (binary) classification i.e. let Y \approx f(X) for some unknown smooth f, the input X is d dimensional X = (X^i)_{i=1}^d

1. Supervised Dimensionality Reduction: It is often the case that f varies most along only some relevant coordinates. This is the main motivation behind variable selection.

The idea in variable selection is the following: That f(X) may be written as f(PX) where P \in \{0,1\}^{k \times d}. P projects down the data to only k relevant coordinates (i.e. some features are selected by P while others are discarded).

This idea is generalized in Multi-Index Regression, where the goal is to recover a subspace most relevant to prediction. That is, now suppose the data varies significantly along all coordinates but it still depends on some subspace of smaller dimensionality. This might be achieved by letting P from the above to be P \in \mathbb{R}^{k \times d}. It is important to note that P is not any subspace, but rather the k-dimensional subspace to which if the data is projected, the regression error would be the least. This idea might be further generalized by means of mapping X to some P non-linearly, but for now we only stick to the relevant subspace.

How can we recover such a subspace?


2. Average Variation of f: Another way to motivate this quantity is the following: Suppose we want to find the direction in which f varies the most on average, or the direction in which f varies the second fastest on average and so on. Or more generally, given any direction, we want to find the variation of f along it. How can we recover these?


The Expected Gradient Outer Product:  The expected gradient outer product of the unknown classification or regression function is the quantity: \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)

The expected gradient outer product recovers the average variation of f in all directions. This can be seen as follows: The directional derivative at x along v \in \mathbb{R}^d is given by \displaystyle {f'}_v(x) = \nabla f(x)^T v or \mathbb{E}_X |{f'}_v(X)|^2 = \mathbb{E}_X (v^T G(X) v) = v^T (\mathbb{E}_X G(X))v.

From the above it follows that if f does not vary along v then v must be in the null space of \mathbb{E}_X (G(X)).

Infact it is not hard to show that the relevant subspace P as defined earlier can also be recovered from \mathbb{E}_X (G(X)). This fact is given in the following lemma.

Lemma: Under the assumed model i.e. Y \approx f(PX), the gradient outer product matrix \mathbb{E}_X (G(X)) is of rank at most k. Let \{v_1, v_2, \dots, v_k \} be the eigenvectors of \mathbb{E}_X (G(X)) corresponding to the top k eigenvalues of \mathbb{E}_X (G(X)). Then the following is true:

span(P) = span(v_1, v_2, \dots, v_k)

This means that a spectral decomposition of \mathbb{E}_X (G(X)) recovers the relevant subspace. Also note that the Gradient Outer Product corresponds to a kind of a supervised version of Principal Component Analysis.


Estimation: Ofcourse in real settings the function is unknown and we are only given points sampled from it. There are various estimators for \mathbb{E}_X (G(X)), which usually involve estimation of the derivatives. In one of them the idea is to estimate, at each point x a linear approximation to f. The slope of this approximation approximates the gradient at that point. Repeating this at the n sample points, gives a sample gradient outer product. There is some work that shows that some of these estimators are statistically consistent.


Related: Gradient Based Diffusion Maps: The gradient outer product can not isolate local information or geometry and its spectral decomposition, as seen above, gives only a linear embedding. One way to obtain a non-linear dimensionality reduction would be to borrow from and extend the idea of diffusion maps, which are well established tools in semi supervised learning. The central quantity of interest for diffusion maps is the graph laplacian L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, where D is the degree matrix and W the adjacency matrix of the nearest neighbor graph constructed on the data points. The non linear embedding is obtained by a spectral decomposition of the operator L or its powers L^t.

As above, a similar diffusion operator may be constructed by using local gradient information. One such possible operator could be:

\displaystyle W_{ij} = W_{f}(x_i, x_j) = exp \Big( - \frac{ \| x_i - x_j \| ^2}{\sigma_1} - \frac{ | \frac{1}{2} (\nabla f(x_i) + \nabla f(x_j)) (x_i - x_j) |^2 }{\sigma_2}\Big)

Note that the first term is the same that is used in unsupervised dimension reduction techniques such as laplacian eigenmaps and diffusion maps. The second term can be interpreted as a diffusion on function values. This operator gives a way for non linear supervised dimension reduction using gradient information.

The above operator was defined here, however no consistency results for the same are provided.

Also see: The Jacobian Inner Product.


Read Full Post »

This question appears very trivial and might just be meaningless, so I put it up at the risk of embarrassing myself. However, since it is a genuine question. I still put it up. All thoughts are welcome.

Question: When trying to quantify the performance of a classifier. What advantages does RMSE offer over the Area under the Curve under the ROC metric? And what does the AUC offer that the RMSE does not? I find AUC very intuitive and prefer using it for classification tasks. But can I give a theoretical reason for using it above RMSE and the vice versa? Review committees have different preferences, some journals prefer reporting the RMSE while some prefer the AUC, some ask for both. Another example being – The 2010 KDD Cup used RMSE while the 2010 UCSD data mining competition used AUC.

Or is this a bad question to ask?

To paraphrase my question – What can be instances in which a classifier is deemed as “good” by the AUC measure and “not so good” by the RMSE measure. What would be the exact reason for such a different “opinion”? And in what situations should I use AUC and in what situations should I use RMSE?

Some Background : If they are equivalent, then you would expect a strong linear relationship (with a negative correlation). That means that for a perfect classifier RMSE would be zero and the AUC 1.

I always use both for all purposes. Here is a sample graph.

RMSE versus AUC for a classifier on some Intelligent Tutoring Data

This is actually a very typical graph, and there are no surprises with it. If you leave out some “bad examples” such as those at (0.4, 0.65) and (0.38, 0.7), the graph has a good negative correlation (as measured by the line fit).

So, the question remains for me. What are the advantages and disadvantages of both?

Recommendations :

1. ROC Graphs : Notes and Practical Considerations for Researchers – Tom Fawcett

Read Full Post »