This post may be considered an extension of the previous post.

The setup and notation is the same as in the previous post (linked above). But to summarize: Earlier we had an unknown smooth regression function f: \mathbb{R}^d \to \mathbb{R}. The idea was to estimate at each training point, the gradient of this unknown function f, and then taking the sample expectation of the outerproduct of the gradient. This quantity has some interesting properties and applications.

However it has its limitations, for one, the mapping f: \mathbb{R}^d \to \mathbb{R} restricts the Gradient Outer Product being helpful for only regression and binary classification (since for binary classification the problem can be thought of as regression). It is not clear if a similar operator can be constructed when one is dealing with classification, that is the unknown smooth function is a vector valued function f: \mathbb{R}^d \to \mathbb{R}^c where c is the number of classes (let us say for the purpose of this discussion, that for each data point we have a probability distribution over the classes, a c dimensional vector).

In the case of the gradient outer product since we were working with a real valued function, it was possible to define the gradient at each point, which is simply:

\displaystyle \Bigg[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_d} \Bigg]

For a vector valued function f: \mathbb{R}^d \to \mathbb{R}^c, we can’t have the gradient, but instead can define the Jacobian at each point:

\displaystyle \mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_c}{\partial x_1} & \frac{\partial f_c}{\partial x_2} & \dots & \frac{\partial f_c}{\partial x_d}\end{bmatrix}

Note that \mathbf{J} may be estimated in a similar manner as estimating gradients as in the previous posts. Which leads us to define the quantity \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}).

The first thing to note is that \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T) defined in the previous post is simply the quantity for the special case when f: \mathbb{R}^d \to \mathbb{R}. Another note is also in order: The reason why we suffixed that quantity with “outer product” (as opposed to “inner product” here) is simply because we considered the gradient to be a column vector, otherwise they are similar in spirit.

Another thing to note is that it is easy to see that the quantity \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is a positive semi-definite matrix and hence is a Reimannian Metric, which is defined below:

Definition: A Reimannian Metric G on a manifold \mathcal{M} is a symmetric and positive semi-definite matrix, which defines a smoothly varying inner product in the tangent space \mathbf{T}_x \mathcal{M}, for each point x \in \mathcal{M} and a, b \in \mathbf{T}_x \mathcal{M}. This associated p.s.d matrix is called the metric tensor. In the above case, since \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is p.s.d it defines a Reimannian metric:

\langle a, b \rangle_x = a^T \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) b

Thus, \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) is a specific metric (more general metrics are dealt with in areas such as metric learning).

Properties: We saw some properties of \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T) in the previous post. In the same vein, does \mathbb{E}_X G(X) = \mathbb{E}_X ( \mathbf{J}^T \mathbf{J}) have similar properties? i.e. does the first eigenvector also correspond to the direction of highest average variation? What about the k-dimensional subspace? What difference does it make that we are looking at a vector valued function? Also what about the cases when d > c and otherwise?

These are questions that I need to think about and should be the topic for a future post to be made soon, hopefully.

Recently, in course of a project that I had some involvement in, I came across an interesting quadratic form. It is called in the literature as the Gradient Outer Product. This operator, which has applications in supervised dimensionality reduction, inverse regression and metric learning can be motivated in two (related) ways, but before doing so, the following is the set up:

Setup: Suppose we have the usual set up as for nonparametric regression and (binary) classification i.e. let Y \approx f(X) for some unknown smooth f, the input X is d dimensional X = (X^i)_{i=1}^d

1. Supervised Dimensionality Reduction: It is often the case that f varies most along only some relevant coordinates. This is the main motivation behind variable selection.

The idea in variable selection is the following: That f(X) may be written as f(PX) where P \in \{0,1\}^{k \times d}. P projects down the data to only k relevant coordinates (i.e. some features are selected by P while others are discarded).

This idea is generalized in Multi-Index Regression, where the goal is to recover a subspace most relevant to prediction. That is, now suppose the data varies significantly along all coordinates but it still depends on some subspace of smaller dimensionality. This might be achieved by letting P from the above to be P \in \mathbb{R}^{k \times d}. It is important to note that P is not any subspace, but rather the k-dimensional subspace to which if the data is projected, the regression error would be the least. This idea might be further generalized by means of mapping X to some P non-linearly, but for now we only stick to the relevant subspace.

How can we recover such a subspace?


2. Average Variation of f: Another way to motivate this quantity is the following: Suppose we want to find the direction in which f varies the most on average, or the direction in which f varies the second fastest on average and so on. Or more generally, given any direction, we want to find the variation of f along it. How can we recover these?


The Expected Gradient Outer Product:  The expected gradient outer product of the unknown classification or regression function is the quantity: \mathbb{E}_X G(X) = \mathbb{E}_X ( \nabla f(X)\nabla f(X)^T)

The expected gradient outer product recovers the average variation of f in all directions. This can be seen as follows: The directional derivative at x along v \in \mathbb{R}^d is given by \displaystyle {f'}_v(x) = \nabla f(x)^T v or \mathbb{E}_X |{f'}_v(X)|^2 = \mathbb{E}_X (v^T G(X) v) = v^T (\mathbb{E}_X G(X))v.

From the above it follows that if f does not vary along v then v must be in the null space of \mathbb{E}_X (G(X)).

Infact it is not hard to show that the relevant subspace P as defined earlier can also be recovered from \mathbb{E}_X (G(X)). This fact is given in the following lemma.

Lemma: Under the assumed model i.e. Y \approx f(PX), the gradient outer product matrix \mathbb{E}_X (G(X)) is of rank at most k. Let \{v_1, v_2, \dots, v_k \} be the eigenvectors of \mathbb{E}_X (G(X)) corresponding to the top k eigenvalues of \mathbb{E}_X (G(X)). Then the following is true:

span(P) = span(v_1, v_2, \dots, v_k)

This means that a spectral decomposition of \mathbb{E}_X (G(X)) recovers the relevant subspace. Also note that the Gradient Outer Product corresponds to a kind of a supervised version of Principal Component Analysis.


Estimation: Ofcourse in real settings the function is unknown and we are only given points sampled from it. There are various estimators for \mathbb{E}_X (G(X)), which usually involve estimation of the derivatives. In one of them the idea is to estimate, at each point x a linear approximation to f. The slope of this approximation approximates the gradient at that point. Repeating this at the n sample points, gives a sample gradient outer product. There is some work that shows that some of these estimators are statistically consistent.


Related: Gradient Based Diffusion Maps: The gradient outer product can not isolate local information or geometry and its spectral decomposition, as seen above, gives only a linear embedding. One way to obtain a non-linear dimensionality reduction would be to borrow from and extend the idea of diffusion maps, which are well established tools in semi supervised learning. The central quantity of interest for diffusion maps is the graph laplacian L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, where D is the degree matrix and W the adjacency matrix of the nearest neighbor graph constructed on the data points. The non linear embedding is obtained by a spectral decomposition of the operator L or its powers L^t.

As above, a similar diffusion operator may be constructed by using local gradient information. One such possible operator could be:

\displaystyle W_{ij} = W_{f}(x_i, x_j) = exp \Big( - \frac{ \| x_i - x_j \| ^2}{\sigma_1} - \frac{ | \frac{1}{2} (\nabla f(x_i) + \nabla f(x_j)) (x_i - x_j) |^2 }{\sigma_2}\Big)

Note that the first term is the same that is used in unsupervised dimension reduction techniques such as laplacian eigenmaps and diffusion maps. The second term can be interpreted as a diffusion on function values. This operator gives a way for non linear supervised dimension reduction using gradient information.

The above operator was defined here, however no consistency results for the same are provided.

Also see: The Jacobian Inner Product.


I recently noticed on arXiv that the following manuscript “Implementation and Abstraction in Mathematics” by David McAllester. A couple of years ago, I had taken a graduate course taught by David that had a similar flavour (the material in the manuscript is more advanced, in particular the main results, not to mention it is better organized and the presentation more polished), presenting a type theoretic foundation of mathematics. Although I can’t say I did very well in the course, I certainly enjoyed the ideas in it very much, and thus the above manuscript might be worth a look. Perhaps it might be a good idea to audit that course again, just to make sure I understand the main ideas better this time. :)

I found these images on twitter via (Paige Bailey) @DynamicWebPaige  Found them so hilarious that I thought they deserved a blog post (Needless to say, click on each image for a higher resolution version).

PS: A quick search indicates that these are from “Land of Lisp: Learn to Program in List, One Game at a time” by M. D. Conrad Barski.

The 40s and 50s

40s60s and 70s


80s and 90s




Recently I was going through Shannon’s original 1948 Information Theory paper and a paper by Kolmogorov from 1983 that places the differences between “Shannon Information” and “Algorithmic Information” in sharp relief. After much information diffusion over the past decades the difference is quite obvious and particularly interesting to contrast nonetheless. Nevertheless I found these two paragraphs from these two papers interesting, if for nothing then for historical reasons.

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.” (C. E. Shannon, A Mathematical Theory of Communication, 1948).
“Our definition of the quantity of information has the advantage that it refers to individual objects and not to objects treated as members of a set of objects with a probability distribution given on it. The probabilistic definition can be convincingly applied to the information contained, for example, in a stream of congratulatory telegrams. But it would not be clear how to apply it, for example, to an estimate of the quantity of information contained in a novel or in the translation of a novel into another language relative to the original. I think that the new definition is capable of introducing in similar applications of the theory at least clarity of principle.” (A. N. Kolmogorov, Combinatorial Foundations of Information Theory and the Calculus of Probabilities, 1983).
Note the reference that Shannon makes is to a message selected from a set of possible messages (and hence the use of probability theory), his problem was mainly concerned with the engineering application of communication. While Kolmogorov talks of individual objects and is not directly concerned with communication and hence the usage of computability (the two notions are related ofcourse, for example the expected Kolmogorov Complexity is equal to Shannon entropy).

The No-Free-Lunch theorems (Wolpert and Macready) are one of the mainstays in Machine Learning and optimization. Crudely, the NFL says that “for every learner there exists a distribution in which it fails”, giving a precise handle to reason about why a given learning algorithm might fail on certain tasks (or, “specialization” of a learning algorithm might just be accidental). It gives a good justification why it is useful to incorporate prior informaion about the problem at hand to perform better. Oftentimes the NFL is used to argue that a general purpose learning algorithm is theoretically impossible. This is incorrect. Indeed, universal learning algorithms exist and there is a (possibly very small) free lunch, at least for the case of “interesting problems” (thanks to Vick for pointing out these papers).

The view “for every learner there exists a distribution in which it fails” is very useful. However I had recently been thinking that is particularly instructive to think of No Free Lunch as a kind of diagnolization argument (this is just a different way of thinking about the prior statement). I am not completely sure if this works but it sounds plausible and makes me see NFL in a different light. It might also be that there is some subtle absrudity or flaw in this kind of thinking. However, I think the NFL theorem can be seen as a simple corollary of the following incompleteness theorem in Kolmogorov Complexity.

Theorem (Chaitin): There exists a constant L (which only depends on the particular axiomatic system and the choice of description language) such that there does not exist a string s for which the statement the “The Kolmogorov Complexity of s is greater than L” (as formalized in S) can be proven within the axiomatic system S. 

Stated informally: “There’s a number L such that we can’t prove the Kolmogorov complexity of any specific string of bits is more than L”. In particular, you can think of any learning machine as a compression scheme (since compression implies learning, we can think of any learning algorithm as a way to compress the data) of some fixed Kolmogorov Complexity (consider the most optimal program for that machine, the programming language is irrelevant as long as it is Turing Complete. Consider the description length of it in bits. That is its Kolmogorov Complexity). Now for any such learning machine (or class of learning machines w.l.o.g) you can construct a data string which has Kolmogorov Complexity greater than what this learning machine has. i.e. this data string is random from the point of view of this learning machine as it is incompressible for it. In short IF the data string has some Kolmogorov Complexity K and your learning machine has Kolmogorov Complexity k and k < K then you will never be able to find any structure in your data string i.e. can never show that it has a KC greater than k. Thus the data string given to our learner would appear to be structureless. Thus no free lunch is just a simple diagonalization argument in the spirit of incompleteness results. One can always construct a string that is Kolmogorov Random or structureless w.r.t our encoding scheme. Ofcourse the Kolmogorov Complexity is uncomputable. However I think the above argument works even when one is talking of upper bounds on the actual Kolmogorov Complexity, but I am not very sure about it or if there are any problems with it.

Also, if one assumes that that there is no reason to believe that structureless functions exist (if we assume that the data is generated by a computational process then there is probably no reason to believe that it is structureless i.e. the problem is “interesting”). And if this is the case, then you would always have a machine in your library of learning algorithms that will be able to find some upper bound on the Kolmogorov Complexity of the data string (much less than the machine itself). This search might be done using Levin’s universal search, for example. I think this line of reasoning leads naturally, to thinking why NFL doesn’t say anything about whether universal learning algorithms can exist.

PS: This thought train about NFL as diagonalization was inspired by a post by David McAllester on what he calls the “Free Lunch Theorem” over here. Some of this appeared as a comment there. However, now I have been inspired enough to study some of the papers concerned including his. (Unrelated to all this. I particularly found likening Chomsky’s Universal Grammar as simply an invocation of NFL on information theoretic grounds in his post very enlightening).

Over the past 4-5 months whenever there is some time to spare, I have been working through The Cauchy-Schwarz Master Class by J. Michael Steele. And, although I am still left with the last two chapters, I have been reflecting on the material already covered in order to get a better perspective on what I have been slowly learning over the months. This blog post is a small exercise in this direction.

Ofcourse, there is nothing mysterious about proving the Cauchy-Schwarz inequality; it is fairly obvious and basic. But I thought it still might be instructive (mostly to myself) to reproduce some proofs that I know out of memory (following a maxim of my supervisor on a white/blackboard blogpost). Although, why Cauchy-Schwarz keeps appearing all the time and what makes it so useful and fundamental is indeed quite interesting and non-obvious. And like Gil Kalai notes, it is also unclear why is it that it is Cauchy-Schwarz which is mainly useful. I feel that Steele’s book has made me appreciate this importance somewhat more (compared to 4-5 months ago) by drawing to many concepts that link back to Cauchy-Schwarz.

Before getting to the post, a word on the book: This book is perhaps amongst the best mathematics book that I have seen in many years. True to its name, it is indeed a Master Class and also truly addictive. I could not put aside the book completely once I had picked it up and eventually decided to go all the way. Like most great books, the way it is organized makes it “very natural” to rediscover many susbtantial results (some of them named) appearing much later by yourself, provided you happen to just ask the right questions. The emphasis on problem solving makes sure you make very good friends with some of the most interesting inequalities. The number of inequalities featured is also extensive. It starts off with the inequalities dealing with “natural” notions such as monotonicity and positivity and later moves onto somewhat less natural notions such as convexity. I can’t recommend this book enough!

Now getting to the proofs: Some of these proofs appear in Steele’s book, mostly as either challenge problems or as exercises. All of them were solvable after some hints.


Proof 1: A Self-Generalizing proof

This proof is essentially due to Titu Andreescu and Bogdan Enescu and has now grown to be my favourite Cauchy-Schwarz proof.

We start with the trivial identity (for a, b \in \mathbb{R}, x >0, y>0 ):

Identity 1: (ay - bx)^2 \geq 0

Expanding we have

a^2y^2 + b^2x^2 - 2abxy \geq 0

Rearranging this we get:

\displaystyle \frac{a^2y}{x} + \frac{b^2x}{y} \geq 2ab

Further: \displaystyle a^2 + b^2 + \frac{a^2y}{x} + \frac{b^2x}{y} \geq (a+b)^2;

Rearranging this we get the following trivial Lemma:

Lemma 1: \displaystyle \frac{(a+b)^2}{(x+y)} \leq \frac{a^2}{x} + \frac{b^2}{y}

Notice that this Lemma is self generalizing in the following sense. Suppose we replace b with b + c and y with y + z, then we have:

\displaystyle \frac{(a+b+c)^2}{(x+y+z)} \leq \frac{a^2}{x} + \frac{(b+c)^2}{y+z}

But we can apply Lemma 1 to the second term of the right hand side one more time. So we would get the following inequality:

\displaystyle \frac{(a+b+c)^2}{(x+y+z)} \leq \frac{a^2}{x} + \frac{b^2}{y} + \frac{c^2}{z}

Using the same principle n times we get the following:

\displaystyle \frac{(a_1+a_2+ \dots a_n)^2}{(x_1+x_2+ \dots + x_n)} \leq \frac{a_1^2}{x_1} + \frac{a_2^2}{x_2} + \dots + \frac{a_n^2}{x_n}

Now substitute a_i = \alpha_i \beta_i and x_i = \beta_i^2 to get:

\displaystyle ( \sum_{i=1}^n \alpha_i \beta_i )^2 \leq \sum_{i=1}^n ( \alpha_i )^2 \sum_{i=1}^n ( \beta_i )^2

This is just the Cauchy-Schwarz Inequality, thus completing the proof.


Proof 2: By Induction

Again, the Cauchy-Schwarz Inequality is the following: for a, b \in \mathbb{R}

\displaystyle ( \sum_{i=1}^n a_i b_i )^2 \leq \sum_{i=1}^n ( a_i )^2 \sum_{i=1}^n ( b_i )^2

For proof of the inequality by induction, the most important thing is starting with the right base case. Clearly n = 1 is trivially true, suggesting that it is perhaps not of much import. So we consider the case for n = 2. Which is:

\displaystyle ( a_1 b_1 + a_2 b_2 )^2 \leq ( a_1^2 + a_2^2 ) (b_1^2 + b_2^2)

To prove the base case, we simply expand the expressions. To get:

\displaystyle a_1^2 b_1^2 + a_2^2 b_2^2 + 2 a_1 b_1 a_2 b_2 \leq a_1^2 b_1^2 + a_1^2 b_2^2 + a_2^2 b_1^2 + a_2^2 b_2^2

Which is just:

\displaystyle a_1^2 b_2^2 + a_2^2 b_1^2 - 2 a_1 b_1 a_2 b_2 \geq 0


\displaystyle (a_1 b_2 - a_2 b_1 )^2 \geq 0

Which proves the base case.

Moving ahead, we assume the following inequality to be true:

\displaystyle ( \sum_{i=1}^k a_i b_i )^2 \leq \sum_{i=1}^k ( a_i )^2 \sum_{i=1}^k ( b_i )^2

To establish Cauchy-Schwarz, we have to demonstrate, assuming the above, that

\displaystyle ( \sum_{i=1}^{k+1} a_i b_i )^2 \leq \sum_{i=1}^{k+1} ( a_i )^2 \sum_{i=1}^{k+1} ( b_i )^2

So, we start from H(k):

\displaystyle ( \sum_{i=1}^k a_i b_i )^2 \leq \sum_{i=1}^k ( a_i )^2 \sum_{i=1}^k ( b_i )^2

we further have,

\displaystyle \Big(\sum_{i=1}^k a_i b_i\Big) + a_{k+1}b_{k+1} \leq \Big(\sum_{i=1}^k ( a_i )^2\Big)^{1/2} \Big(\sum_{i=1}^k ( b_i )^2\Big)^{1/2} + a_{k+1}b_{k+1} \ \ \ \ (1)

Now, we can apply the case for n = 2. Recall that: \displaystyle a_1 b_1 + a_2 b_2 \leq (a_1^2 + a_2^2)^{1/2} (b_1^2 + b_2^2)^{1/2}

Thus, using this in the R. H. S of (1) , we would now have:

\displaystyle \Big(\sum_{i=1}^{k+1} a_i b_i\Big) \leq \Big(\sum_{i=1}^k ( a_i )^2 + a_{k+1}^2\Big)^{1/2} \Big(\sum_{i=1}^k ( b_i )^2 + b_{k+1}^2\Big)^{1/2}


\displaystyle \Big(\sum_{i=1}^{k+1} a_i b_i\Big) \leq \Big(\sum_{i=1}^{k+1} ( a_i )^2 \Big)^{1/2} \Big(\sum_{i=1}^{k+1} ( b_i )^2 \Big)^{1/2}

This proves the case H(k+1) on assuming H(k). Thus also proving the Cauchy-Schwarz inequality by the principle of mathematical induction.


Proof 3: For Infinite Sequences using the Normalization Trick:

We start with the following question.

Problem: For a, b \in \mathbb{R} If \displaystyle \Big(\sum_{i=1}^{\infty} a_i^2 \Big) < \infty and \displaystyle \Big(\sum_{i=1}^{\infty} b_i^2 \Big) < \infty then is \displaystyle \Big(\sum_{i=1}^{\infty} |a_i||b_i|\Big) < \infty ?

Note that this is easy to establish. We simply start with the trivial identity (x-y)^2 \geq 0 which in turn gives us \displaystyle xy \leq \frac{x^2}{2} + \frac{y^2}{2}

Next, take x = |a_i| and y = |b_i| on summing up to infinity on both sides, we get the following:

\displaystyle \Big( \sum_{i=1}^{\infty} |a_i||b_i| \Big)^2 \leq \frac{1}{2}\Big(\sum_{i=1}^{\infty} a_i^2\Big) + \frac{1}{2} \Big(\sum_{i=1}^{\infty} b_i^2\Big)\ \ \ \ \ \ \ \ (2)

From this it immediately follows that

\displaystyle \Big(\sum_{i=1}^{\infty} |a_i||b_i|\Big) < \infty

Now let

\displaystyle \hat{a}_i = \frac{a_i}{\Big(\sum_j a_i^2\Big)^{1/2}} and

\displaystyle \hat{b}_i = \frac{b_i}{\Big(\sum_j b_i^2\Big)^{1/2}}; substituting in (2), we get:

\displaystyle \Big( \sum_{i=1}^{\infty} |\hat{a}_i||\hat{b}_i| \Big)^2 \leq \frac{1}{2}\Big(\sum_{i=1}^{\infty} \hat{a}_i^2\Big) + \frac{1}{2} \Big(\sum_{i=1}^{\infty} \hat{b}_i^2\Big) or,

\displaystyle \Bigg(\sum_{i = 1}^{\infty}\frac{a_i}{\Big(\sum_j a_j^2\Big)^{1/2}}\frac{b_i}{\Big(\sum_j b_j^2\Big)^{1/2}}\Bigg)^2 \leq \frac{1}{2} + \frac{1}{2}

Which simply gives back Cauchy’s inequality for infinite sequences thus completing the proof:

\displaystyle \Big(\sum_{i=1}^{\infty} a_i b_i\Big)^2 \leq \Big(\sum_{i=1}^{\infty}a_i^2\Big) \Big(\sum_{i=1}^{\infty}b_i^2\Big)


Proof 4: Using Lagrange’s Identity

We first start with a polynomial which we denote by \mathbf{Q}_n:

\mathbf{Q}_n = \big(a_1^2 + a_2^2 + \dots + a_n^2 \big) \big(b_1^2 + b_2^2 + \dots + b_n^2 \big) - \big(a_1b_1 + a_2b_2 + \dots + a_nb_n\big)^2

The question to now ask, is \mathbf{Q}_n \geq 0? To answer this question, we start of by re-writing \mathbf{Q}_n in a “better” form.

\displaystyle \mathbf{Q}_n = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i^2 b_j^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} a_ib_i a_jb_j

Next, as J. Michael Steele puts, we pursue symmetry and rewrite the above so as to make it apparent.

\displaystyle \mathbf{Q}_n = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_i^2 b_j^2 + a_j^2 b_i^2\big) - \frac{2}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_ib_i a_jb_j

Thus, we now have:

\displaystyle \mathbf{Q}_n = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_i b_j - a_j b_i\big)^2

This makes it clear that \mathbf{Q}_n can be written as a sum of squares and hence is always postive. Let us write out the above completely:

\displaystyle \sum_{i=1}^{n}\sum_{j=1}^{n} a_i^2 b_j^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} a_ib_ib_jb_j = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_ib_j -a_jb_i\big)^2

Now, reversing the step we took at the onset to write the L.H.S better, we simply have:

\displaystyle \sum_{i}^{n} a_i^2 \sum_{}^{n} b_i^2 - \big(\sum_{i=1}^n a_ib_i\big)^2 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_ib_j -a_jb_i\big)^2

This is called Lagrange’s Identity. Now since the R.H.S. is always greater than or equal to zero. We get the following inequality as a corrollary:

\displaystyle \big(\sum_{i=1}^n a_ib_i\big)^2 \leq \sum_{i}^{n} a_i^2 \sum_{}^{n} b_i^2

This is just the Cauchy-Schwarz inequality, completing the proof.


Proof 5: Gram-Schmidt Process gives an automatic proof of Cauchy-Schwarz

First we quickly review the Gram-Schmidt Process: Given a set of linearly independent elements of a real or complex inner product space \big(V,\langle\cdot, \cdot\rangle\big), \mathbf{x_1}, \mathbf{x_2}, \dots, \mathbf{x_n}. We can get an orthonormal set of n elemets \mathbf{e_1}, \mathbf{e_2}, \dots, \mathbf{e_n} by the simple recursion (after setting \displaystyle \mathbf{e_1 = \frac{\mathbf{x_1}}{\|x_1 \|}}).

\displaystyle \mathbf{z_k} = \mathbf{x_k} - \sum_{j=1}^{k-1} \langle\mathbf{x_k},\mathbf{e_j}\rangle\mathbf{e_j} and then

\displaystyle \mathbf{e_k} = \frac{\mathbf{z_k}}{\|\mathbf{z_k}\|}

for k = 2, 3, \dots, n.

Keeping the above in mind, assume that \| x\| = 1. Now let x = e_1. Thus, we have:

\mathbf{z} = \mathbf{y} - \langle \mathbf{y}, \mathbf{e_1}\rangle\mathbf{e_1}

Giving: \displaystyle \mathbf{e_2} = \frac{\mathbf{z}}{\|\mathbf{z}\|}. Rearranging we have:

\displaystyle \|\mathbf{z}\|\mathbf{e_2} = \mathbf{y} - \langle \mathbf{y}, \mathbf{e_1}\rangle\mathbf{e_1} or

\displaystyle \mathbf{y} = \langle \mathbf{y},\mathbf{e_1}\rangle \mathbf{e_1} + \|\mathbf{z}\|\mathbf{e_2} or

\displaystyle \mathbf{y} = \mathbf{c_1} \mathbf{e_1} + \mathbf{c_2}\mathbf{e_2} where c_1, c_2 are constants.

Now note that: \displaystyle \langle\mathbf{x},\mathbf{y}\rangle = c_1 and

\displaystyle \langle \mathbf{y},\mathbf{y}\rangle = |c_1|^2 + |c_2|^2. The following bound is trivial:

\displaystyle |c_1| \leq (|c_1|^2 + |c_2|^2)^{1/2}. But note that this is simply \langle x,y \rangle \leq \langle y,y \rangle^{1/2}

Which is just the Cauchy-Schwarz inequality when \|x\| = 1.


Proof 6: Proof of the Continuous version for d =2; Schwarz’s Proof

For this case, the inequality may be stated as:

Suppose we have S \subset \mathbb{R}^2 and that f: S \to \mathbb{R} and g: S \to \mathbb{R}. Then consider the double integrals:

\displaystyle A = \iint_S f^2 dx dy, \displaystyle B = \iint_S fg dx dy and \displaystyle C = \iint_S g^2 dx dy. These double integrals must satisfy the following inequality:

|B| \leq \sqrt{A} . \sqrt{C}.

The proof given by Schwarz as is reported in Steele’s book (and indeed in standard textbooks) is based on the following observation:

The real polynomial below is always non-negative:

\displaystyle p(t) = \iint_S \Big( t f(x,y) + g(x,y) \Big)^2 dx dy = At^2 + 2Bt + C

p(t) > 0 unless f and g are proportional. Thus from the binomial formula we have that B^2 \leq AC, moreover the inequality is strict unless f and g are proportional.


Proof 7: Proof using the Projection formula

Problem: Consider any point x \neq 0 in \mathbb{R}^d. Now consider the line that passes through this point and origin. Let us call this line \mathcal{L} = \{ tx: t \in \mathbb{R}\}. Find the point on the line closest to any point v \in \mathbb{R}^d.

If P(v) is the point on the line that is closest to v, then it is given by the projection formula: \displaystyle P(v) = x \frac{\langle x, v \rangle }{\langle x, x \rangle}

This is fairly elementary to establish. To find the value of t, such that distance \rho(v,tx) is minimized, we can simply consider the squared distance \rho^2(v,tx) since it is easier to work with. Which by definition is:

\displaystyle \rho^2(v,tx) = \langle v - x, v - tx \rangle

which is simply:

\displaystyle \rho^2(v,tx) = \langle v, v \rangle - 2t \langle v, x \rangle + t^2 \langle x, x \rangle

\displaystyle = \langle x, x \rangle \bigg( t^2 -2t \frac{\langle v, x \rangle}{\langle x, x \rangle} + \frac{\langle v, v \rangle}{\langle x, x \rangle}\bigg)

\displaystyle = \langle x, x \rangle \bigg\{ \bigg(t - \frac{\langle v,x \rangle}{\langle x,x \rangle}\bigg)^2 - \frac{\langle v,x \rangle^2}{\langle x,x \rangle^2}\bigg\} + \frac{\langle v, v \rangle}{\langle x, x \rangle}\bigg)

\displaystyle = \langle x, x \rangle \bigg\{ \bigg(t - \frac{\langle v,x \rangle}{\langle x,x \rangle}\bigg)^2 - \frac{\langle v,x \rangle^2}{\langle x,x \rangle^2} + \frac{\langle v, v \rangle}{\langle x, x \rangle} \bigg\}

\displaystyle = \langle x, x \rangle \bigg\{ \bigg(t - \frac{\langle v,x \rangle}{\langle x,x \rangle}\bigg)^2 - \frac{\langle v,x \rangle^2}{\langle x,x \rangle^2} + \frac{\langle v, v \rangle \langle x, x \rangle}{\langle x, x \rangle^2} \bigg\}

So, the value of t for which the above is minimized is \displaystyle \frac{\langle v,x \rangle}{\langle x,x \rangle}. Note that this simply reproduces the projection formula.

Therefore, the minimum squared distance is given by the expression below:

\displaystyle \min_{t \in \mathbb{R}} \rho^2(v, tx) = \frac{\langle v,v\rangle \langle x,x \rangle - \langle v,x \rangle^2}{\langle x,x \rangle}

Note that the L. H. S is always positive. Therefore we have:

\displaystyle \frac{\langle v,v\rangle \langle x,x \rangle - \langle v,x \rangle^2}{\langle x,x \rangle} \geq 0

Rearranging, we have:

\displaystyle \langle v,x \rangle^2 \leq \langle v,v\rangle \langle x,x \rangle

Which is just Cauchy-Schwarz, thus proving the inequality.


Proof 8: Proof using an identity

A variant of this proof is amongst the most common Cauchy-Schwarz proofs that are given in textbooks. Also, this is related to proof (6) above. However, it still has some value in its own right. While also giving an useful expression for the “defect” for Cauchy-Schwarz like the Lagrange Identity above.

We start with the following polynomial:

P(t) = \langle v - tw, v - tw \rangle. Clearly P(t) \geq 0.

To find the minimum of this polynomial we find its derivative w.r.t t and setting to zero:

P'(t) = 2t \langle w, w \rangle - 2 \langle v, w \rangle = 0 giving:

\displaystyle t_0 = \frac{\langle v, w \rangle}{\langle w, w \rangle}

Clearly we have P(t) \geq P(t_0) \geq 0. We consider:

P(t_0) \geq 0, substituting \displaystyle t_0 = \frac{\langle v, w \rangle}{\langle w, w \rangle} we have:

\displaystyle \langle v,v \rangle - \frac{\langle v, w \rangle}{\langle w, w \rangle} \langle v,w \rangle - \frac{\langle v, w \rangle}{\langle w, w \rangle} \langle w,v \rangle + \frac{\langle v, w \rangle^2}{\langle w, w \rangle^2}\langle w,w \rangle \geq 0 \ \ \ \ \ \ \ \ (A)

Just rearrangine and simplifying:

\displaystyle \langle v,v \rangle \langle w,w \rangle - \langle v, w \rangle^2 \geq 0

This proves Cauchy-Schwarz inequality.

Now suppose we are interested in an expression for the defect in Cauchy-Schwarz i.e. the difference \displaystyle \langle v,v \rangle \langle w,w \rangle - \langle v, w \rangle^2. For this we can just consider the L.H.S of equation (A) since it is just \displaystyle \langle w,w \rangle \Big(\langle v,v \rangle \langle w,w \rangle - \langle v, w \rangle^2\Big).

i.e. Defect =

\displaystyle \langle w,w \rangle \bigg(\langle v,v \rangle - 2 \frac{\langle v,w \rangle^2}{\langle w,w \rangle} + \frac{\langle v,w \rangle^2}{\langle w,w \rangle}\bigg)

Which is just:

\displaystyle \langle w,w \rangle\bigg(\Big\langle v - \frac{\langle w,v \rangle}{\langle w,w \rangle}w,v - \frac{\langle w,v \rangle}{\langle w,w \rangle}w \Big\rangle\bigg)

This defect term is much in the spirit of the defect term that we saw in Lagrange’s identity above, and it is instructive to compare them.


Proof 9: Proof using the AM-GM inequality

Let us first recall the AM-GM inequality:

For non-negative reals x_1, x_2, \dots x_n we have the following basic inequality:

\displaystyle \sqrt[n]{x_1 x_2 \dots x_n} \leq \Big(\frac{x_1 + x_2 + \dots x_n}{n}\Big).

Now let us define \displaystyle A = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2} and \displaystyle B = \sqrt{b_1^2 + b_2^2 + \dots + b_n^2}

Now consider the trivial bound (which gives us the AM-GM): (x-y)^2 \geq 0, which is just \displaystyle \frac{1}{2}\Big(x^2 + y^2\Big) \geq xy. Note that AM-GM as stated above for n = 2 is immediate when we consider x \to \sqrt{x} and y \to \sqrt{y}

Using the above, we have:

\displaystyle \frac{1}{2} \Big(\frac{a_i^2}{A^2} + \frac{b_i^2}{B^2}\Big) \geq \frac{a_ib_i}{AB}

Summing over n, we have:

\displaystyle \sum_{i=1}^n\frac{1}{2} \Big(\frac{a_i^2}{A^2} + \frac{b_i^2}{B^2}\Big) \geq \sum_{i=1}^n \frac{a_ib_i}{AB}

But note that the L.H.S equals 1, therefore:

\displaystyle \sum_{i=1}^n \frac{a_ib_i}{AB} \leq 1 or \displaystyle \sum_{i=1}^n a_ib_i \leq AB

Writing out A and B as defined above, we have:

\displaystyle \sum_{i=1}^n a_ib_i \leq \sqrt{\sum_{i=1}^na_i^2}\sqrt{\sum_{i=1}^nb_i^2}.

Thus proving the Cauchy-Schwarz inequality.


Proof 10: Using Jensen’s Inequality

We begin by recalling Jensen’s Inequality:

Suppose that f: [p, q] \to \mathbb{R} is a convex function. Also suppose that there are non-negative numbers p_1, p_2, \dots, p_n such that \displaystyle \sum_{i=1}^{n} p_i = 1. Then for all x_i \in [p, q] for i = 1, 2, \dots, n one has:

\displaystyle f\Big(\sum_{i=1}^{n}p_ix_i\Big) \leq \sum_{i=1}^{n}p_if(x_i).

Now we know that f(x) = x^2 is convex. Applying Jensen’s Inequality, we have:

\displaystyle \Big(\sum_{i=1}^{n} p_i x_i \Big)^2 \leq \sum_{i=1}^{n} p_i x_i^2

Now, for b_i \neq 0 for all i = 1, 2, \dots, n, let \displaystyle x_i = \frac{a_i}{b_i} and let \displaystyle p_i = \frac{b_i^2}{\sum_{i=1}^{n}b_i^2}.

Which gives:

\displaystyle \Big(\sum_{i=1}^n \frac{a_ib_i}{\sum_{i=1}^{n}b_i^2}\Big)^2 \leq \Big(\sum_{i=1}^{n}\frac{a_i^2}{\sum_{i=1}^{n}b_i^2}\Big)

Rearranging this just gives the familiar form of Cauchy-Schwarz at once:

\displaystyle \Big(\sum_{i=1}^{n} a_ib_i\Big)^2 \leq \Big(\sum_{i=1}^{n} a_i^2\Big)\Big(\sum_{i=1}^{n} b_i^2\Big)


Proof 11: Pictorial Proof for d = 2

Here (page 4) is an attractive pictorial proof by means of tilings for the case d = 2 by Roger Nelson.



Get every new post delivered to your Inbox.

Join 98 other followers