
Archive for the ‘Mathematics’ Category

A short note on how the Graph Laplacian is a natural analogue of the Laplacian in vector analysis.

For a twice-differentiable function f on Euclidean space, the Laplacian of f, \Delta f, is \text{div}(\text{grad}(f)), the divergence of the gradient of f. Let’s consider the Laplacian in two dimensions:

\displaystyle \Delta f(x,y) = \nabla \cdot \nabla f(x,y) = \frac{\partial^2 f(x,y)}{\partial x^2} + \frac{\partial^2 f(x,y)}{\partial y^2}

To get an intuition about what it does, it is useful to consider a finite difference approximation to the above:

\displaystyle \Delta f(x,y) \approx \frac{f(x+h,y) + f(x-h,y) + f(x,y+h) + f(x,y-h) - 4f(x,y)}{h^2}
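
One way to see where this approximation comes from is the standard Taylor expansion of f about (x, y):

\displaystyle f(x \pm h, y) = f(x,y) \pm h \frac{\partial f}{\partial x} + \frac{h^2}{2}\frac{\partial^2 f}{\partial x^2} \pm \frac{h^3}{6}\frac{\partial^3 f}{\partial x^3} + O(h^4)

and similarly for f(x, y \pm h). Adding the four shifted values, the odd-order terms cancel and we are left with 4f(x,y) + h^2 \Delta f(x,y) + O(h^4); rearranging and dividing by h^2 gives the approximation above.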

This approximation makes it clear that the Laplacian measures how much the value of f at a point differs from the average of its values at the four neighbouring points. In fact, this finite difference approximation is used in image processing (say, with h set to one pixel) to detect sharp changes such as edges, since the expression is close to zero in smooth image regions.
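
As a quick computational illustration (a minimal NumPy sketch of my own; the toy image and the function name are made up for this example), the same stencil acts as a crude edge detector:

```python
import numpy as np

def discrete_laplacian(img):
    """5-point stencil approximation of the Laplacian with h = 1 pixel.

    Border pixels are left at zero for simplicity.
    """
    lap = np.zeros_like(img, dtype=float)
    lap[1:-1, 1:-1] = (
        img[2:, 1:-1] + img[:-2, 1:-1]      # f(x+h, y) + f(x-h, y)
        + img[1:-1, 2:] + img[1:-1, :-2]    # f(x, y+h) + f(x, y-h)
        - 4.0 * img[1:-1, 1:-1]             # - 4 f(x, y)
    )
    return lap

# A toy "image": dark left half, bright right half, i.e. a vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
print(discrete_laplacian(img))  # non-zero only in the two columns straddling the edge
```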

____________________________

The finite difference approximation intuitively suggests that a discrete Laplace operator should be natural. At the same time, the graph theory literature defines the Graph Laplacian as the matrix L = D - A (where D is the degree matrix and A the adjacency matrix of the graph G, both defined below). It is not obvious that this matrix L follows from the div(grad(f)) definition mentioned above. The purpose of this note is to explain briefly how this is natural, once the notions of gradient and divergence operators have been defined over a graph.

First of all, the two most natural matrices that can be associated with a graph are defined below:

____________________________

There are two natural matrices that are used to specify a simple graph G = (V,E) with vertex set V =\{1, \dots, n \} and edge set E = \{1, \dots, m \}: The Adjacency matrix A and the Incidence matrix \nabla.

The Adjacency matrix is defined as an n \times n matrix such that:

\displaystyle A_{u,v} = \begin{cases} 1 & \mbox{if } uv \in E \\ 0 & \mbox{otherwise} \end{cases}

To define the m \times n incidence matrix \nabla, we need to fix an orientation of the edges (for our purpose an arbitrary orientation suffices), then:

\displaystyle \nabla_{e,v} = \begin{cases} -1 & \text{if } v \text{ is initial vertex of edge } e \\ +1 & \text{if } v \text{ is terminal vertex of edge } e \\ 0 & \text{if } e \text{ is not incident on vertex } v \end{cases}

Now suppose we have a real valued function f on V, f: V \to \mathbb{R}. We view this as a column vector indexed by the vertices.

Similarly, let g be a real valued function on the edges E, g: E \to \mathbb{R} (again viewed as a column vector, this time indexed by the edges).

Now, consider the following map: f \mapsto \nabla f (which is a vector indexed by E). The value of this map at an edge e oriented from u to v, (\nabla f)(e), is simply the difference of the values of f at the two end points of the edge, i.e. f_v - f_u. This makes it somewhat intuitive that the matrix \nabla is a difference operator on G. Given this “difference operator” or “discrete differential” view, \nabla f can be viewed as the gradient, measuring the change of the function f along the edges of G. Thus the map f \mapsto \nabla f is just the gradient operator.

Similarly, consider the map g \mapsto \nabla^{T} g (which is a vector indexed by V). Recall that g was defined on the edges. The value of this map at any vertex v is (\nabla^{T} g)(v) = \displaystyle \sum_{e \text{ enters } v} g_e - \sum_{e \text{ exits } v} g_e. If we view g as describing some kind of a flow on the edges, then (\nabla^{T} g)(v) is the net flow into the vertex v, which is, up to sign, just the divergence. Thus the map g \mapsto \nabla^{T} g is the divergence operator on G.

Now consider f \mapsto \nabla^{T} \nabla f (note that this makes sense since \nabla f is a column vector indexed by the edges); going by the above, this is the divergence of the gradient, up to an overall sign (L_G is really the analogue of -\Delta, which is why it turns out to be positive semidefinite). Thus the analogy with the “real” Laplacian makes sense and the matrix L_G = \nabla^{T} \nabla is appropriately called the Graph Laplacian.

Using the definition of \nabla, a simple computation yields the following for L_G:

\displaystyle (L_{G})_{u,v} = \begin{cases} \deg(u) & \mbox{if } u = v \\ -1 & \mbox{if } uv \in E \\ 0 & \mbox{otherwise} \end{cases}

Looking at the above and recalling the definition of the adjacency matrix A, it is easy to see that L_G = D - A, where D is the diagonal matrix with D_{u,u} = \deg(u). This is just the familiar definition that I mentioned at the start.

PS: From the above it is also clear that the Graph Laplacian is positive semidefinite. Indeed, consider f^T L_G f, which can be written as f^T \nabla^T \nabla f = \|\nabla f\|^2 = \sum_{(u,v) \in E} (f_u - f_v)^2 \geq 0.
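
To make this concrete, here is a minimal NumPy sketch of my own (the small example graph is arbitrary) that builds \nabla, A and D and checks both L_G = \nabla^T \nabla = D - A and the quadratic form identity above:

```python
import numpy as np

# A small example graph on vertices {0, 1, 2, 3} with an arbitrary edge orientation.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
n, m = 4, len(edges)

# Incidence matrix: -1 at the initial vertex of an edge, +1 at the terminal vertex.
inc = np.zeros((m, n))
for e, (u, v) in enumerate(edges):
    inc[e, u], inc[e, v] = -1.0, 1.0

# Adjacency and degree matrices.
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
D = np.diag(A.sum(axis=1))

L = inc.T @ inc
assert np.allclose(L, D - A)  # L_G = D - A

# Quadratic form: f^T L f equals the sum over edges of (f_u - f_v)^2, hence is >= 0.
f = np.array([1.0, -2.0, 0.5, 3.0])
assert np.isclose(f @ L @ f, sum((f[u] - f[v]) ** 2 for u, v in edges))
```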

Read Full Post »

I recently noticed on arXiv the following manuscript: “Implementation and Abstraction in Mathematics” by David McAllester. A couple of years ago, I had taken a graduate course taught by David that had a similar flavour (the material in the manuscript is more advanced, in particular the main results, not to mention that it is better organized and the presentation more polished), presenting a type theoretic foundation of mathematics. Although I can’t say I did very well in the course, I certainly enjoyed the ideas in it very much, and thus the above manuscript might be worth a look. Perhaps it might be a good idea to audit that course again, just to make sure I understand the main ideas better this time. :)

Read Full Post »

Over the past 4-5 months, whenever there has been some time to spare, I have been working through The Cauchy-Schwarz Master Class by J. Michael Steele. Although I am still left with the last two chapters, I have been reflecting on the material already covered in order to get a better perspective on what I have been slowly learning over the months. This blog post is a small exercise in that direction.

Of course, there is nothing mysterious about proving the Cauchy-Schwarz inequality; it is fairly obvious and basic. But I thought it still might be instructive (mostly to myself) to reproduce some proofs that I know from memory (following a maxim of my supervisor on a white/blackboard blogpost). That said, why Cauchy-Schwarz keeps appearing all the time, and what makes it so useful and fundamental, is indeed quite interesting and non-obvious. And as Gil Kalai notes, it is also unclear why it is Cauchy-Schwarz in particular that turns out to be so useful. I feel that Steele’s book has made me appreciate this importance somewhat more (compared to 4-5 months ago) by drawing attention to many concepts that link back to Cauchy-Schwarz.

Before getting to the post, a word on the book: this is perhaps amongst the best mathematics books that I have seen in many years. True to its name, it is indeed a Master Class and also truly addictive. I could not put the book aside completely once I had picked it up and eventually decided to go all the way. Like most great books, the way it is organized makes it “very natural” to rediscover by yourself many substantial results (some of them named results) that appear much later, provided you happen to ask the right questions. The emphasis on problem solving makes sure you become very good friends with some of the most interesting inequalities, and the number of inequalities featured is extensive. It starts off with inequalities dealing with “natural” notions such as monotonicity and positivity and later moves on to somewhat less natural notions such as convexity. I can’t recommend this book enough!

Now, getting to the proofs: some of these appear in Steele’s book, mostly as either challenge problems or exercises. All of them were solvable after some hints.

________________

Proof 1: A Self-Generalizing proof

This proof is essentially due to Titu Andreescu and Bogdan Enescu and has now grown to be my favourite Cauchy-Schwarz proof.

We start with the following trivial inequality (for a, b \in \mathbb{R}, x > 0, y > 0):

(ay - bx)^2 \geq 0

Expanding we have

a^2y^2 + b^2x^2 - 2abxy \geq 0

Rearranging this we get:

\displaystyle \frac{a^2y}{x} + \frac{b^2x}{y} \geq 2ab

Adding a^2 + b^2 to both sides and using (a+b)^2 = a^2 + 2ab + b^2, we further have: \displaystyle a^2 + b^2 + \frac{a^2y}{x} + \frac{b^2x}{y} \geq (a+b)^2

Rearranging (note that a^2 + \frac{a^2y}{x} = \frac{a^2(x+y)}{x}, similarly for the b terms, and then divide through by x + y) we get the following trivial Lemma:

Lemma 1: \displaystyle \frac{(a+b)^2}{(x+y)} \leq \frac{a^2}{x} + \frac{b^2}{y}

Notice that this Lemma is self generalizing in the following sense. Suppose we replace b with b + c and y with y + z, then we have:

\displaystyle \frac{(a+b+c)^2}{(x+y+z)} \leq \frac{a^2}{x} + \frac{(b+c)^2}{y+z}

But we can apply Lemma 1 to the second term of the right hand side one more time. So we would get the following inequality:

\displaystyle \frac{(a+b+c)^2}{(x+y+z)} \leq \frac{a^2}{x} + \frac{b^2}{y} + \frac{c^2}{z}

Applying the same principle repeatedly (n-1 times in all) we get the following:

\displaystyle \frac{(a_1+a_2+ \dots a_n)^2}{(x_1+x_2+ \dots + x_n)} \leq \frac{a_1^2}{x_1} + \frac{a_2^2}{x_2} + \dots + \frac{a_n^2}{x_n}

Now substitute a_i = \alpha_i \beta_i and x_i = \beta_i^2 to get:

\displaystyle ( \sum_{i=1}^n \alpha_i \beta_i )^2 \leq \sum_{i=1}^n ( \alpha_i )^2 \sum_{i=1}^n ( \beta_i )^2

This is just the Cauchy-Schwarz Inequality, thus completing the proof.
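
As a quick sanity check (a small NumPy sketch of my own, with arbitrary random data), one can verify the n-term form of Lemma 1 and the substitution a_i = \alpha_i \beta_i, x_i = \beta_i^2 numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
alpha = rng.normal(size=n)
beta = rng.uniform(0.5, 2.0, size=n)   # kept away from zero so that x_i > 0

# The n-term version of Lemma 1 with a_i = alpha_i * beta_i and x_i = beta_i^2.
a, x = alpha * beta, beta ** 2
assert a.sum() ** 2 / x.sum() <= (a ** 2 / x).sum() + 1e-12

# After the substitution this is exactly the Cauchy-Schwarz inequality.
assert (alpha @ beta) ** 2 <= (alpha @ alpha) * (beta @ beta) + 1e-12
```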

________________

Proof 2: By Induction

Again, the Cauchy-Schwarz inequality is the following: for reals a_1, \dots, a_n and b_1, \dots, b_n,

\displaystyle ( \sum_{i=1}^n a_i b_i )^2 \leq \sum_{i=1}^n ( a_i )^2 \sum_{i=1}^n ( b_i )^2

For a proof of the inequality by induction, the most important thing is starting with the right base case. The case n = 1 is trivially true (with equality), suggesting that it is perhaps not of much use. So we consider the case n = 2, which is:

\displaystyle ( a_1 b_1 + a_2 b_2 )^2 \leq ( a_1^2 + a_2^2 ) (b_1^2 + b_2^2)

To prove the base case, we simply expand the expressions. To get:

\displaystyle a_1^2 b_1^2 + a_2^2 b_2^2 + 2 a_1 b_1 a_2 b_2 \leq a_1^2 b_1^2 + a_1^2 b_2^2 + a_2^2 b_1^2 + a_2^2 b_2^2

Which is just:

\displaystyle a_1^2 b_2^2 + a_2^2 b_1^2 - 2 a_1 b_1 a_2 b_2 \geq 0

Or:

\displaystyle (a_1 b_2 - a_2 b_1 )^2 \geq 0

Which proves the base case.

Moving ahead, we assume the following inequality, the induction hypothesis H(k), to be true:

\displaystyle ( \sum_{i=1}^k a_i b_i )^2 \leq \sum_{i=1}^k ( a_i )^2 \sum_{i=1}^k ( b_i )^2

To establish Cauchy-Schwarz, we have to demonstrate, assuming the above, that

\displaystyle ( \sum_{i=1}^{k+1} a_i b_i )^2 \leq \sum_{i=1}^{k+1} ( a_i )^2 \sum_{i=1}^{k+1} ( b_i )^2

So, we start from H(k):

\displaystyle ( \sum_{i=1}^k a_i b_i )^2 \leq \sum_{i=1}^k ( a_i )^2 \sum_{i=1}^k ( b_i )^2

we further have,

\displaystyle \Big(\sum_{i=1}^k a_i b_i\Big) + a_{k+1}b_{k+1} \leq \Big(\sum_{i=1}^k ( a_i )^2\Big)^{1/2} \Big(\sum_{i=1}^k ( b_i )^2\Big)^{1/2} + a_{k+1}b_{k+1} \ \ \ \ (1)

Now, we can apply the case for n = 2. Recall that: \displaystyle a_1 b_1 + a_2 b_2 \leq (a_1^2 + a_2^2)^{1/2} (b_1^2 + b_2^2)^{1/2}

Thus, using this on the R.H.S. of (1), with the two pairs taken to be \big((\sum_{i=1}^k a_i^2)^{1/2}, a_{k+1}\big) and \big((\sum_{i=1}^k b_i^2)^{1/2}, b_{k+1}\big), we now have:

\displaystyle \Big(\sum_{i=1}^{k+1} a_i b_i\Big) \leq \Big(\sum_{i=1}^k ( a_i )^2 + a_{k+1}^2\Big)^{1/2} \Big(\sum_{i=1}^k ( b_i )^2 + b_{k+1}^2\Big)^{1/2}

Or,

\displaystyle \Big(\sum_{i=1}^{k+1} a_i b_i\Big) \leq \Big(\sum_{i=1}^{k+1} ( a_i )^2 \Big)^{1/2} \Big(\sum_{i=1}^{k+1} ( b_i )^2 \Big)^{1/2}

This proves the case H(k+1) assuming H(k) (strictly speaking, the argument bounds \sum_{i=1}^{k+1} a_i b_i; applying it with each a_i replaced by -a_i bounds the absolute value as well, which is what squaring requires), thus also proving the Cauchy-Schwarz inequality by the principle of mathematical induction.

________________

Proof 3: For Infinite Sequences using the Normalization Trick:

We start with the following question.

Problem: For real sequences (a_i) and (b_i), if \displaystyle \Big(\sum_{i=1}^{\infty} a_i^2 \Big) < \infty and \displaystyle \Big(\sum_{i=1}^{\infty} b_i^2 \Big) < \infty, then is \displaystyle \Big(\sum_{i=1}^{\infty} |a_i||b_i|\Big) < \infty ?

Note that this is easy to establish. We simply start with the trivial inequality (x-y)^2 \geq 0, which in turn gives us \displaystyle xy \leq \frac{x^2}{2} + \frac{y^2}{2}

Next, take x = |a_i| and y = |b_i|; summing over all i on both sides, we get the following:

\displaystyle \sum_{i=1}^{\infty} |a_i||b_i| \leq \frac{1}{2}\Big(\sum_{i=1}^{\infty} a_i^2\Big) + \frac{1}{2} \Big(\sum_{i=1}^{\infty} b_i^2\Big)\ \ \ \ \ \ \ \ (2)

From this it immediately follows that

\displaystyle \Big(\sum_{i=1}^{\infty} |a_i||b_i|\Big) < \infty

Now let

\displaystyle \hat{a}_i = \frac{a_i}{\Big(\sum_j a_j^2\Big)^{1/2}} and

\displaystyle \hat{b}_i = \frac{b_i}{\Big(\sum_j b_j^2\Big)^{1/2}}; substituting in (2), we get:

\displaystyle \sum_{i=1}^{\infty} |\hat{a}_i||\hat{b}_i| \leq \frac{1}{2}\Big(\sum_{i=1}^{\infty} \hat{a}_i^2\Big) + \frac{1}{2} \Big(\sum_{i=1}^{\infty} \hat{b}_i^2\Big) = 1, or,

\displaystyle \sum_{i = 1}^{\infty}\frac{|a_i|}{\Big(\sum_j a_j^2\Big)^{1/2}}\frac{|b_i|}{\Big(\sum_j b_j^2\Big)^{1/2}} \leq 1

Which simply gives back Cauchy’s inequality for infinite sequences (since |\sum_i a_i b_i| \leq \sum_i |a_i||b_i|), thus completing the proof:

\displaystyle \Big(\sum_{i=1}^{\infty} a_i b_i\Big)^2 \leq \Big(\sum_{i=1}^{\infty}a_i^2\Big) \Big(\sum_{i=1}^{\infty}b_i^2\Big)

________________

Proof 4: Using Lagrange’s Identity

We first start with a polynomial which we denote by \mathbf{Q}_n:

\mathbf{Q}_n = \big(a_1^2 + a_2^2 + \dots + a_n^2 \big) \big(b_1^2 + b_2^2 + \dots + b_n^2 \big) - \big(a_1b_1 + a_2b_2 + \dots + a_nb_n\big)^2

The question to ask now is: is \mathbf{Q}_n \geq 0? To answer this question, we start off by re-writing \mathbf{Q}_n in a “better” form.

\displaystyle \mathbf{Q}_n = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i^2 b_j^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} a_ib_i a_jb_j

Next, as J. Michael Steele puts it, we pursue symmetry and rewrite the above so as to make it apparent.

\displaystyle \mathbf{Q}_n = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_i^2 b_j^2 + a_j^2 b_i^2\big) - \frac{2}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_ib_i a_jb_j

Thus, we now have:

\displaystyle \mathbf{Q}_n = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_i b_j - a_j b_i\big)^2

This makes it clear that \mathbf{Q}_n can be written as a sum of squares and hence is always non-negative. Let us write out the above completely:

\displaystyle \sum_{i=1}^{n}\sum_{j=1}^{n} a_i^2 b_j^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} a_ib_i a_jb_j = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_ib_j -a_jb_i\big)^2

Now, reversing the step we took at the outset to write the L.H.S. more compactly, we simply have:

\displaystyle \sum_{i=1}^{n} a_i^2 \sum_{i=1}^{n} b_i^2 - \big(\sum_{i=1}^n a_ib_i\big)^2 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \big(a_ib_j -a_jb_i\big)^2

This is called Lagrange’s Identity. Now, since the R.H.S. is always greater than or equal to zero, we get the following inequality as a corollary:

\displaystyle \big(\sum_{i=1}^n a_ib_i\big)^2 \leq \sum_{i=1}^{n} a_i^2 \sum_{i=1}^{n} b_i^2

This is just the Cauchy-Schwarz inequality, completing the proof.
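
As a quick numerical illustration (a small NumPy sketch of my own, on arbitrary random vectors), the Cauchy-Schwarz “defect” indeed matches the double sum of squares in Lagrange’s identity:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=6)
b = rng.normal(size=6)

defect = (a @ a) * (b @ b) - (a @ b) ** 2

# 0.5 * sum_{i,j} (a_i b_j - a_j b_i)^2, computed via an outer product.
cross = np.outer(a, b) - np.outer(a, b).T   # entry (i, j) is a_i b_j - a_j b_i
assert np.isclose(defect, 0.5 * (cross ** 2).sum())
```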

________________

Proof 5: Gram-Schmidt Process gives an automatic proof of Cauchy-Schwarz

First we quickly review the Gram-Schmidt process: given linearly independent elements \mathbf{x_1}, \mathbf{x_2}, \dots, \mathbf{x_n} of a real or complex inner product space \big(V,\langle\cdot, \cdot\rangle\big), we can obtain an orthonormal set of n elements \mathbf{e_1}, \mathbf{e_2}, \dots, \mathbf{e_n} by the simple recursion (after setting \displaystyle \mathbf{e_1} = \frac{\mathbf{x_1}}{\|\mathbf{x_1}\|}).

\displaystyle \mathbf{z_k} = \mathbf{x_k} - \sum_{j=1}^{k-1} \langle\mathbf{x_k},\mathbf{e_j}\rangle\mathbf{e_j} and then

\displaystyle \mathbf{e_k} = \frac{\mathbf{z_k}}{\|\mathbf{z_k}\|}

for k = 2, 3, \dots, n.

Keeping the above in mind, take two vectors \mathbf{x} and \mathbf{y} and assume that \|\mathbf{x}\| = 1. Now let \mathbf{e_1} = \mathbf{x} and apply one step of the recursion to \mathbf{y}. Thus, we have:

\mathbf{z} = \mathbf{y} - \langle \mathbf{y}, \mathbf{e_1}\rangle\mathbf{e_1}

Giving: \displaystyle \mathbf{e_2} = \frac{\mathbf{z}}{\|\mathbf{z}\|}. Rearranging we have:

\displaystyle \|\mathbf{z}\|\mathbf{e_2} = \mathbf{y} - \langle \mathbf{y}, \mathbf{e_1}\rangle\mathbf{e_1} or

\displaystyle \mathbf{y} = \langle \mathbf{y},\mathbf{e_1}\rangle \mathbf{e_1} + \|\mathbf{z}\|\mathbf{e_2} or

\displaystyle \mathbf{y} = \mathbf{c_1} \mathbf{e_1} + \mathbf{c_2}\mathbf{e_2} where c_1, c_2 are constants.

Now note that: \displaystyle \langle\mathbf{x},\mathbf{y}\rangle = c_1 and

\displaystyle \langle \mathbf{y},\mathbf{y}\rangle = |c_1|^2 + |c_2|^2. The following bound is trivial:

\displaystyle |c_1| \leq (|c_1|^2 + |c_2|^2)^{1/2}. But note that this is simply |\langle \mathbf{x},\mathbf{y} \rangle| \leq \langle \mathbf{y},\mathbf{y} \rangle^{1/2}

Which is just the Cauchy-Schwarz inequality when \|\mathbf{x}\| = 1; the general case follows by rescaling \mathbf{x}.
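
A minimal NumPy sketch of this argument (my own; the two vectors are arbitrary and assumed not proportional, so that \mathbf{z} \neq 0):

```python
import numpy as np

x = np.array([3.0, 0.0, 4.0])
x = x / np.linalg.norm(x)              # enforce ||x|| = 1, so e1 = x
y = np.array([1.0, 2.0, -1.0])         # assumed not proportional to x

e1 = x
z = y - (y @ e1) * e1                  # one Gram-Schmidt step
e2 = z / np.linalg.norm(z)

c1, c2 = y @ e1, np.linalg.norm(z)     # y = c1 e1 + c2 e2
assert np.allclose(y, c1 * e1 + c2 * e2)

# |<x, y>| = |c1| <= sqrt(c1^2 + c2^2) = ||y||, i.e. Cauchy-Schwarz with ||x|| = 1.
assert abs(x @ y) <= np.sqrt(c1 ** 2 + c2 ** 2) + 1e-12
assert np.isclose(np.sqrt(c1 ** 2 + c2 ** 2), np.linalg.norm(y))
```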

________________

Proof 6: Proof of the Continuous version for d =2; Schwarz’s Proof

For this case, the inequality may be stated as:

Suppose we have S \subset \mathbb{R}^2 and that f: S \to \mathbb{R} and g: S \to \mathbb{R}. Then consider the double integrals:

\displaystyle A = \iint_S f^2 dx dy, \displaystyle B = \iint_S fg dx dy and \displaystyle C = \iint_S g^2 dx dy. These double integrals must satisfy the following inequality:

|B| \leq \sqrt{A} \cdot \sqrt{C}.

The proof given by Schwarz as is reported in Steele’s book (and indeed in standard textbooks) is based on the following observation:

The real polynomial below is always non-negative:

\displaystyle p(t) = \iint_S \Big( t f(x,y) + g(x,y) \Big)^2 dx dy = At^2 + 2Bt + C

Moreover, p(t) > 0 for all t unless f and g are proportional. Since the non-negative quadratic At^2 + 2Bt + C can have at most one real root, its discriminant satisfies 4B^2 - 4AC \leq 0, i.e. B^2 \leq AC; moreover the inequality is strict unless f and g are proportional.

________________

Proof 7: Proof using the Projection formula

Problem: Consider any point x \neq 0 in \mathbb{R}^d and the line that passes through this point and the origin; call this line \mathcal{L} = \{ tx: t \in \mathbb{R}\}. Find the point on the line closest to a given point v \in \mathbb{R}^d.

If P(v) is the point on the line that is closest to v, then it is given by the projection formula: \displaystyle P(v) = x \frac{\langle x, v \rangle }{\langle x, x \rangle}

This is fairly elementary to establish. To find the value of t that minimizes the distance \rho(v,tx), we can work with the squared distance \rho^2(v,tx), since it is easier to handle. By definition:

\displaystyle \rho^2(v,tx) = \langle v - tx, v - tx \rangle

which is simply:

\displaystyle \rho^2(v,tx) = \langle v, v \rangle - 2t \langle v, x \rangle + t^2 \langle x, x \rangle

\displaystyle = \langle x, x \rangle \bigg( t^2 -2t \frac{\langle v, x \rangle}{\langle x, x \rangle} + \frac{\langle v, v \rangle}{\langle x, x \rangle}\bigg)

\displaystyle = \langle x, x \rangle \bigg\{ \bigg(t - \frac{\langle v,x \rangle}{\langle x,x \rangle}\bigg)^2 - \frac{\langle v,x \rangle^2}{\langle x,x \rangle^2} + \frac{\langle v, v \rangle}{\langle x, x \rangle} \bigg\}

\displaystyle = \langle x, x \rangle \bigg\{ \bigg(t - \frac{\langle v,x \rangle}{\langle x,x \rangle}\bigg)^2 - \frac{\langle v,x \rangle^2}{\langle x,x \rangle^2} + \frac{\langle v, v \rangle \langle x, x \rangle}{\langle x, x \rangle^2} \bigg\}

So, the value of t for which the above is minimized is \displaystyle \frac{\langle v,x \rangle}{\langle x,x \rangle}. Note that this simply reproduces the projection formula.

Therefore, the minimum squared distance is given by the expression below:

\displaystyle \min_{t \in \mathbb{R}} \rho^2(v, tx) = \frac{\langle v,v\rangle \langle x,x \rangle - \langle v,x \rangle^2}{\langle x,x \rangle}

Note that the L.H.S. is always non-negative, being a squared distance. Therefore we have:

\displaystyle \frac{\langle v,v\rangle \langle x,x \rangle - \langle v,x \rangle^2}{\langle x,x \rangle} \geq 0

Rearranging, we have:

\displaystyle \langle v,x \rangle^2 \leq \langle v,v\rangle \langle x,x \rangle

Which is just Cauchy-Schwarz, thus proving the inequality.
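
A small NumPy sketch of my own (with arbitrarily chosen x and v) that checks the projection formula and the resulting minimum squared distance:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])     # a nonzero point defining the line L = {t x}
v = np.array([4.0, -1.0, 0.5])    # an arbitrary point to project onto L

t_star = (v @ x) / (x @ x)        # the projection formula
P_v = t_star * x

# t_star really minimises the squared distance (checked on a grid containing t_star).
ts = np.linspace(t_star - 2, t_star + 2, 1001)
assert np.isclose(min(np.sum((v - t * x) ** 2) for t in ts), np.sum((v - P_v) ** 2))

# Minimum squared distance = (<v,v><x,x> - <v,x>^2) / <x,x>, which is >= 0: Cauchy-Schwarz.
min_sq = ((v @ v) * (x @ x) - (v @ x) ** 2) / (x @ x)
assert np.isclose(np.sum((v - P_v) ** 2), min_sq)
assert (v @ x) ** 2 <= (v @ v) * (x @ x)
```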

________________

Proof 8: Proof using an identity

A variant of this proof is amongst the most common Cauchy-Schwarz proofs given in textbooks, and it is related to proof (6) above. However, it still has some value in its own right, since it also gives a useful expression for the “defect” in Cauchy-Schwarz, like the Lagrange identity above.

We start with the following polynomial:

P(t) = \langle v - tw, v - tw \rangle. Clearly P(t) \geq 0.

To find the minimum of this polynomial we take its derivative w.r.t. t and set it to zero:

P'(t) = 2t \langle w, w \rangle - 2 \langle v, w \rangle = 0 giving:

\displaystyle t_0 = \frac{\langle v, w \rangle}{\langle w, w \rangle}

Clearly we have P(t) \geq P(t_0) \geq 0. We consider:

P(t_0) \geq 0, substituting \displaystyle t_0 = \frac{\langle v, w \rangle}{\langle w, w \rangle} we have:

\displaystyle \langle v,v \rangle - \frac{\langle v, w \rangle}{\langle w, w \rangle} \langle v,w \rangle - \frac{\langle v, w \rangle}{\langle w, w \rangle} \langle w,v \rangle + \frac{\langle v, w \rangle^2}{\langle w, w \rangle^2}\langle w,w \rangle \geq 0 \ \ \ \ \ \ \ \ (A)

Rearranging and simplifying (and multiplying through by \langle w,w \rangle):

\displaystyle \langle v,v \rangle \langle w,w \rangle - \langle v, w \rangle^2 \geq 0

This proves Cauchy-Schwarz inequality.

Now suppose we are interested in an expression for the defect in Cauchy-Schwarz, i.e. the difference \displaystyle \langle v,v \rangle \langle w,w \rangle - \langle v, w \rangle^2. For this we can take the L.H.S. of equation (A) and multiply it by \langle w,w \rangle, since the defect is exactly \langle w,w \rangle times that expression.

i.e. Defect =

\displaystyle \langle w,w \rangle \bigg(\langle v,v \rangle - 2 \frac{\langle v,w \rangle^2}{\langle w,w \rangle} + \frac{\langle v,w \rangle^2}{\langle w,w \rangle}\bigg)

Which is just:

\displaystyle \langle w,w \rangle\bigg(\Big\langle v - \frac{\langle w,v \rangle}{\langle w,w \rangle}w,v - \frac{\langle w,v \rangle}{\langle w,w \rangle}w \Big\rangle\bigg)

This defect term is much in the spirit of the defect term that we saw in Lagrange’s identity above, and it is instructive to compare them.

________________

Proof 9: Proof using the AM-GM inequality

Let us first recall the AM-GM inequality:

For non-negative reals x_1, x_2, \dots x_n we have the following basic inequality:

\displaystyle \sqrt[n]{x_1 x_2 \dots x_n} \leq \Big(\frac{x_1 + x_2 + \dots x_n}{n}\Big).

Now let us define \displaystyle A = \sqrt{a_1^2 + a_2^2 + \dots + a_n^2} and \displaystyle B = \sqrt{b_1^2 + b_2^2 + \dots + b_n^2}

Now consider the trivial bound (which gives us the AM-GM): (x-y)^2 \geq 0, which is just \displaystyle \frac{1}{2}\Big(x^2 + y^2\Big) \geq xy. Note that AM-GM as stated above for n = 2 is immediate when we consider x \to \sqrt{x} and y \to \sqrt{y}

Using the above, we have:

\displaystyle \frac{1}{2} \Big(\frac{a_i^2}{A^2} + \frac{b_i^2}{B^2}\Big) \geq \frac{a_ib_i}{AB}

Summing over n, we have:

\displaystyle \sum_{i=1}^n\frac{1}{2} \Big(\frac{a_i^2}{A^2} + \frac{b_i^2}{B^2}\Big) \geq \sum_{i=1}^n \frac{a_ib_i}{AB}

But note that the L.H.S equals 1, therefore:

\displaystyle \sum_{i=1}^n \frac{a_ib_i}{AB} \leq 1 or \displaystyle \sum_{i=1}^n a_ib_i \leq AB

Writing out A and B as defined above, we have:

\displaystyle \sum_{i=1}^n a_ib_i \leq \sqrt{\sum_{i=1}^na_i^2}\sqrt{\sum_{i=1}^nb_i^2}.

Thus proving the Cauchy-Schwarz inequality.

________________

Proof 10: Using Jensen’s Inequality

We begin by recalling Jensen’s Inequality:

Suppose that f: [p, q] \to \mathbb{R} is a convex function. Also suppose that there are non-negative numbers p_1, p_2, \dots, p_n such that \displaystyle \sum_{i=1}^{n} p_i = 1. Then for all x_i \in [p, q] for i = 1, 2, \dots, n one has:

\displaystyle f\Big(\sum_{i=1}^{n}p_ix_i\Big) \leq \sum_{i=1}^{n}p_if(x_i).

Now we know that f(x) = x^2 is convex. Applying Jensen’s Inequality, we have:

\displaystyle \Big(\sum_{i=1}^{n} p_i x_i \Big)^2 \leq \sum_{i=1}^{n} p_i x_i^2

Now, for b_i \neq 0 for all i = 1, 2, \dots, n, let \displaystyle x_i = \frac{a_i}{b_i} and let \displaystyle p_i = \frac{b_i^2}{\sum_{j=1}^{n}b_j^2}.

Which gives:

\displaystyle \Big(\sum_{i=1}^n \frac{a_ib_i}{\sum_{j=1}^{n}b_j^2}\Big)^2 \leq \Big(\sum_{i=1}^{n}\frac{a_i^2}{\sum_{j=1}^{n}b_j^2}\Big)

Rearranging this just gives the familiar form of Cauchy-Schwarz at once:

\displaystyle \Big(\sum_{i=1}^{n} a_ib_i\Big)^2 \leq \Big(\sum_{i=1}^{n} a_i^2\Big)\Big(\sum_{i=1}^{n} b_i^2\Big)

________________

Proof 11: Pictorial Proof for d = 2

Here (page 4) is an attractive pictorial proof by means of tilings for the case d = 2 by Roger Nelson.

________________

Read Full Post »

The writings (and even the papers and technical books) of Gian-Carlo Rota are perhaps amongst the most insightful that I have encountered in the past 3-4 years (maybe even more). Rota wrote to provoke, never resisting the urge to finish a piece of writing with a rhetorical flourish, even at the cost of injecting seeming inconsistency into his stance. I guess this is what you get when you have a first-rate mathematician and philosopher endowed with an elegant, at times even devastating, turn of phrase and a huge axe to grind*.

The wisdom of G. C. Rota is best distilled in his book of essays, reviews and other thoughts, Indiscrete Thoughts, and to some extent Discrete Thoughts. Perhaps I should review Indiscrete Thoughts in the next post, just to revisit some of those writings and my notes on them.

Rota in 1962

This post, however, is not about his writing in general, as the title indicates. I recently discovered this excellent 1985 dialogue between Rota and David Sharp. I found it via a lead from this László Lovász interview, where Lovász mentions that Rota’s combinatorics papers were an inspiration for his own work on finding more structure in combinatorics. From the dialogue with David Sharp, here are two relevant excerpts (here too the above-mentioned flourish is evident):
“Combinatorics is an honest subject. No adèles, no sigma-algebras. You count balls in a box, and you either have the right number or you haven’t. You get the feeling that the result you have discovered is forever, because it’s concrete. Other branches of mathematics are not so clear-cut. Functional analysis of infinite-dimensional spaces is never fully convincing; you don’t get a feeling of having done an honest day’s work. Don’t get the wrong idea – combinatorics is not just putting balls into boxes. Counting finite sets can be a highbrow undertaking, with sophisticated techniques.
[…]
Much combinatorics of our day came out of an extraordinary coincidence. Disparate problems in combinatorics, ranging from problems in statistical mechanics to the problem of coloring a map, seem to bear no common features. However, they do have at least one common feature: their solution can be reduced to the problem of finding the roots of some polynomial or analytic function. The minimum number of colors required to properly color a map is given by the roots of a polynomial, called the chromatic polynomial; its value at N tells you in how many ways you can color the map with N colors. Similarly, the singularities of some complicated analytic function tell you the temperature at which a phase transition occurs in matter. The great insight, which is a long way from being understood, was to realize that the roots of the polynomials and analytic functions arising in a lot of combinatorial problems are the Betti numbers of certain surfaces related to the problem. Roughly speaking, the Betti numbers of a surface describe the number of different ways you can go around it. We are now trying to understand how this extraordinary coincidence comes about. If we do, we will have found a notable unification in mathematics.”
________________
*While “having an axe to grind” is a fairly common phrase, I got the idea of using it here from an Amazon review of Indiscrete Thoughts (check the link above), because I really do think that it is the best description for a lot of his writings!

Read Full Post »

John von Neumann made so many fundamental contributions that Paul Halmos remarked that it was almost as if von Neumann maintained a list of the various subjects he wanted to touch and develop, and systematically kept ticking items off. This rings remarkably true if one just glances at the dizzyingly long “known for” column below his photograph on his Wikipedia entry.

John von Neumann with one of his computers.

Since von Neumann died young, in 1957, there unfortunately aren’t very many audio/video recordings of him (if I am correct, just one two-minute video recording exists in the public domain so far).

I recently came across a fantastic film on him that I would very highly recommend. Although it is old and the audio quality is not the best, it is certainly worth spending an hour on. The fact that this film features Eugene Wigner, Stanislaw Ulam, Oskar Morgenstern, Paul Halmos (whose little presentation I really enjoyed), Herman Goldstine, Hans Bethe and Edward Teller (whom I heard for the first time; he spoke quite interestingly) alone makes it worthwhile.

Update: The following YouTube links have been removed for breach of copyright. The producer of the film, David Hoffman, tells us that the movie should be available as a DVD for purchase soon. Please check the comments on this post for more information.

Part 1

Find Part 2 here.

________________

Onionesque Reality Home >>

Read Full Post »

A semi-useful fact for people working in Machine Learning.

Blurb: Recently, my supervisor was kind enough to institute a small reading group on Deep Neural Networks to help us understand them better. A side product of this would hopefully be that I get to explore an old interest of mine: I have always wanted to study Neural Nets in detail, especially the work from the mid-90s when the area matured and connections to many other areas, such as statistical physics, became apparent. But before I got around to that, they seem to be back in vogue! I dub this resurgence Connectionism 2.0. (Although the hipster in me dislikes hype, it always lends a good excuse to explore an interest.) I also attended a Summer School on the topic at UCLA; find slides and talks over here.

Coming back: we have been making our way through this Foundations and Trends volume by Yoshua Bengio (which I quite highly recommend). Bengio spends a lot of time giving intuitions and arguments on why and when Deep Architectures are useful. One particular argument (which is actually quite general, and not specific to trees as it might appear) goes like this:

Ensembles of trees (like boosted trees, and forests) are more powerful than a single tree. They add a third level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters. As illustrated in […], they implicitly form a distributed representation […]

This is followed by an illustration with the following figure:

[Image Source: Learning Deep Architectures for AI]

and accompanying text to the above figure:

Whereas a single decision tree (here just a two-way partition) can discriminate among a number of regions linear in the number of parameters (leaves), an ensemble of trees can discriminate among a number of regions exponential in the number of trees, i.e., exponential in the total number of parameters (at least as long as the number of trees does not exceed the number of inputs, which is not quite the case here). Each distinguishable region is associated with one of the leaves of each tree (here there are three 2-way trees, each defining two regions, for a total of seven regions). This is equivalent to a multi-clustering, here three clusterings each associated with two regions.

Now, the following question was considered: referring to the text in boldface above, is the number of regions obtained exponential in the general case? It is easy to see that there are cases where it is not. For example, the number of regions obtained by three trees would be the same as that obtained by one tree if all three trees overlapped exactly, giving no benefit. The claim above refers to a paper (also by Yoshua Bengio) in which he constructs examples showing that the number of regions can indeed be exponential in some cases.

But suppose for our satisfaction we are interested in actual numbers and the following general question, an answer to which should also answer the question raised above:

What is the maximum possible number of regions into which {\bf R}^n can be divided by an arrangement of m hyperplanes?

Let’s consider some examples in the {\bf R}^2 case.

When there is one hyperplane, i.e. m = 1, the maximum number of possible regions is obviously two.

[1 Hyperplane: 2 Regions]

Clearly, for two hyperplanes, the maximum number of possible regions is 4.

[2 Hyperplanes: 4 Regions]

With three hyperplanes, the maximum number of regions possible is 7, as illustrated in the first figure. For m = 4, i.e. four hyperplanes, this number is 11, as shown below:

[4 Hyperplanes: 11 Regions]

On inspection we can see that the maximum number of regions with m hyperplanes (lines) in {\bf R}^2 is given by:

Number of Regions = #lines + #intersections + 1 (for m lines in general position, #intersections = \dbinom{m}{2}, giving 1 + m + \dbinom{m}{2})

Now let’s consider the general case: what is the expression that gives the maximum number of regions possible with m hyperplanes in {\bf R}^n? It turns out that there is a definite expression for this:

Let \mathcal{A} be an arrangement of m hyperplanes in {\bf R}^n, then the maximum number of regions possible is given by

\displaystyle r(\mathcal{A}) = 1 + m + \dbinom{m}{2} + \ldots + \dbinom{m}{n}

and the maximum number of bounded regions (closed from all sides) is given by:

\displaystyle b(\mathcal{A}) = \dbinom{m - 1}{n}

The above expressions actually derive from a more general expression when \mathcal{A} is specifically a real arrangement.
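
Here is a small sketch of my own of the counting formulas above; for m = 1, 2, 3, 4 lines in the plane it reproduces the 2, 4, 7 and 11 regions from the figures:

```python
from math import comb

def max_regions(m, n):
    """Maximum number of regions cut out by m hyperplanes in R^n."""
    return sum(comb(m, k) for k in range(n + 1))

def max_bounded_regions(m, n):
    """Maximum number of bounded regions for m hyperplanes in R^n."""
    return comb(m - 1, n)

print([max_regions(m, 2) for m in (1, 2, 3, 4)])   # [2, 4, 7, 11]
print(max_regions(3, 3))                           # 8 regions for 3 planes in R^3
print(max_bounded_regions(4, 2))                   # 3 bounded regions for 4 lines
```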

I am not familiar with some of the techniques required to prove this in the general case (where \mathcal{A} is a set of affine hyperplanes in a vector space defined over some field); the above might be proven by induction. However, it turns out that in the {\bf R}^2 case the result is related to the crossing lemma (see the discussion over here), which was one of the favourite results of one of my previous advisors, and he often talked about it. This connection makes me want to study the details [1], [2], which I haven’t had the time to look at; I will post the proof for the general version in a future post.

________________

References:

1. An Introduction to Hyperplane Arrangements: Richard P. Stanley (PDF)

Related:

(Helpful for the proof)

2. Partially Ordered Sets: William Trotter (Handbook of Combinatorics) (PDF)

________________

Onionesque Reality Home >>

Read Full Post »

Deep Learning reads Wikipedia and discovers the meaning of life – Geoff Hinton.

The above quote is from a very interesting talk by Geoffrey Hinton I had the chance to attend recently.

I have been at a summer school on Deep Neural Nets and Unsupervised Feature Learning at the Institute for Pure and Applied Mathematics at UCLA since July 9 (it runs till July 27). It has been organized by Geoff Hinton, Yoshua Bengio, Stan Osher, Andrew Ng and Yann LeCun.

I have always been a “fan” of Neural Nets, and the recent spike in interest in them has made me excited, so the school happened at just the right time. The objective of the summer school is to give a broad overview of some of the recent work in Deep Learning and Unsupervised Feature Learning, with emphasis on optimization, deep architectures and sparse representations. I must add that, after getting here and looking at the peer group, I would consider myself lucky to have obtained funding for the event!

[Click on the above image to see slides for the talks. Videos are now available at the same location.]

That aside, if you are interested in Deep Learning or Neural Networks in general, the slides for the talks are being uploaded over here (or click on the image above), videos will be added at the same location some time after the summer school ends so you might like to bookmark this link.

The school has been interesting given the wide range of people who are here. The diversity of opinions about Deep Learning itself has given a good perspective on the subject and on its issues and strengths. There are quite a few attendees who are somewhat skeptical of deep learning but curious, while some have been actively working on it for a while. Also, it has been enlightening to see completely divergent views between some of the speakers on key ideas such as sparsity. For example, Geoff Hinton had a completely different view of why sparsity is useful in classification tasks compared to Stéphane Mallat, who gave a very interesting talk today, even joking that “Hinton and Yann LeCun told you why sparsity is useful, I’ll tell you why sparsity is useless.” See the above link for more details.

Indeed, such opinions do tell you that there is a lot of fecund ground for research in these areas.

I have been compiling a reading list on some of this stuff and will make a blog-post on the same soon.

________________

Onionesque Reality Home >>

Read Full Post »
