Feeds:
Posts

## IPAM-UCLA Summer School on Deep Learning and Feature Learning

Deep Learning reads Wikipedia and discovers the meaning of life – Geoff Hinton.

The above quote is from a very interesting talk by Geoffrey Hinton I had the chance to attend recently.

I have been at a summer school on Deep Neural Nets and Unsupervised Featured Learning at the Institute for Pure and Applied Mathematics at UCLA since July 9 (till July 27). It has been organized by Geoff Hinton, Yoshua Bengio, Stan Osher, Andrew Ng and Yann LeCun.

I have always been a “fan” of Neural Nets and the recent spike in interest in them has made me excited, thus the school happened at just the right time. The objective of the summer school is to give a broad overview of some of the recent work in Deep Learning and Unsupervised Feature Learning with emphasis on optimization, deep architectures and sparse representations. I must add that after getting here and looking at the peer group I would consider myself lucky to have obtained funding for the event!

[Click on the above image to see slides for the talks. Videos will be added at this location after July 27 Videos are now available]

That aside, if you are interested in Deep Learning or Neural Networks in general, the slides for the talks are being uploaded over here (or click on the image above), videos will be added at the same location some time after the summer school ends so you might like to bookmark this link.

The school has been interesting given the wide range of people who are here. The diversity of opinions about Deep Learning itself has given a good perspective on the subject and the issues and strengths of it. There are quite a few here who are somewhat skeptical of deep learning but are curious, while there are some who have been actively working on the same for a while. Also, it has been enlightening to see completely divergent views between some of the speakers on key ideas such as sparsity. For example Geoff Hinton had a completely different view of why sparsity was useful in classification tasks than compared to Stéphane Mallat, who gave a very interesting talk today even joking that “Hinton and Yann LeCun told you why sparsity is useful, I’ll tell you why sparsity is useless. “. See the above link for more details.

Indeed, such opinions do tell you that there is a lot of fecund ground for research in these areas.

I have been compiling a reading list on some of this stuff and will make a blog-post on the same soon.

________________

Onionesque Reality Home >>

## Importing the Szemerédi Regularity Lemma into Machine Learning

Synopsis of a recent direction of work with Gábor Sárközy, Endre Szemerédi and Fei Song — “The Regularity Lemma is a deep result from extremal graph theory having great utility as a fundamental tool to prove theoretical results, but can it be employed in more “practical” settings?”

More specifically we are interested in the problem of harnessing the power of the regularity lemma to do clustering. This blog post is organized as follows: We first sketch the regularity lemma, we then see that it is an existential predicate and state an algorithmic version, we then look at how this constructive version may be used for clustering/segmentation. It must be noted that the idea seems to have originated from an earlier interesting work by Sperotto and Pellilio.

Before a brief introduction to the regularity lemma, I’d quote Luca Trevisan on (the related) Szemeredi’s Theorem from his blog

Szemeredi’s theorem on arithmetic progressions is one of the great triumphs of the “Hungarian approach” to mathematics: pose very difficult problems, and let deep results, connections between different areas of math, and applications, come out as byproducts of the search for a solution.

Even though I am nothing close to being a mathematician, but we’ll see that even an Electrical Engineer can somewhat understand and appreciate what Prof. Trevisan means in this specific context! ;-)

I will attempt to sketch a few things below with more emphasis on intuition than rigour! To be specific – intuition and then rigour.

_________________________

Introduction: One of the most fundamental and ingenious results in graph theory and discrete mathematics is the Szemeredi Regularity Lemma. It was originally advanced by Endre Szemeredi as an auxiliary result to prove a long standing conjecture of Erdős and Turán from 1936 (on the Ramsey properties of arithmetic progressions), which stated that sequences of integers with postive upper density must contain arbitrarily long arithmetic progressions (now called as Szemeredi’s theorem (1978)). Now the regularity lemma by itself is considered as one of the most important tools in graph theory.

Prof. Endre Szemeredi: Originator of the "Regularity Method"

A very rough statement of the regularity lemma could be made as follows:

Every graph can be approximated by random graphs. This is in the sense that every graph can be partitioned into a bounded number of equal parts such that:
1. Most edges run between different parts
2. And that these edges behave as if generated at random.

The fundamental nature of the Regularity Lemma as a tool in Graph Theory (and beyond) could be understood by the fact that two recent Fields Medals (Timothy Gowers, 1998 and Terence Tao, 2006) were awarded atleast in part for achieving breakthrough results on the regularity lemma and using those to prove fundamental results on arithmetic progressions amongst other results (Such as the Green-Tao Theorem).

_________________________

A Temporary Digression and Prof. Trevisan’s Statement: A first look at the statement of the regularity lemma above reminds one (especially those trained in signal processing I suppose) of the Fourier series. While this is extremely simplified – It’s not entirely incorrect to say that it points in part to a deeper relationship between combinatorics and analysis. Let’s look at it this way – Szemeredi employed the regularity lemma (a graph theoretic method) to prove what is now called Szemeredi’s theorem. The Szemeredi Theorem is a difficult one and it has four different proofs (graph-theoretic, ergodic, hypergraph-theoretic and lastly based on Fourier analysis)*.  Infact work on the Szemeredi theorem has contributed in better understanding connections between these areas (as an example – Timothy Gowers was awarded the Fields Medal for proving very deep results on the connections between analysis and combinatorics).

Terence Tao and Timothy Gowers have made breakthrough contributions to the regularity lemma and Szemeredi's Theorem amongst numerous others (Image Source: Republic of Math)

So it could be said that it all really started with Erdős and Turán posing a hard problem – The search for it’s solution has led to not only some good theorems but also some interesting connections between different areas of mathematics and thus a better understanding of all of them as a result. So we see what Prof. Trevisan means!

*While here the connection appears very sketchy though intuitive. The strong relationship between the Regularity Lemma and Analysis was shown in a breakthrough paper by László Lovász and Balázs Szegedy. In this paper they showed that the regularity lemma could be thought of as a result in analysis

Szemeredi Regularity Lemma for the Analyst (László Lovász and Balázs Szegedy)

_________________________

Given the background now we get to stating Szemeredi’s Regularity Lemma:

The Szemeredi Lemma: Expanding upon the rough statement mentioned a couple of paragraphs above, we could be a little more specific and say that

The Lemma states that every graph could be partitioned into a bounded number of quasi-random bi-partite graphs, called regular pairs and a few left over edges.

Now for a more precise statement, we introduce some definitions and notation.

Definition 1: If $G = (V,E)$ is a graph  and $A, B$ are disjoint subsets of $V$. Then, we denote the count of number edges with one endpoint in $A$ and another in $B$ by $e(A,B)$. The density of edges between $A$ and $B$ can then be defined as:

$\displaystyle d(A,B) = \frac{e(A,B)}{|A||B|}$

This just defines the edge density in a bipartite graph.

Definition 2: The bipartite graph $G = (A, B, E)$ is called $\varepsilon$-regular if for every $X \subset A$ and $Y \subset B$ satisfying

$|X| > \varepsilon |A|, |Y| > \varepsilon |B|$

we would have $|d(X,Y) - d(A,B)| < \varepsilon$

This means that a bipartite graph is epsilon regular if we were to take any random subsets (of some minimum size) $X, Y$ of $A,B$ respectively and even then find that the edge density between these subsets is almost the same as the edge density in the original bipartite graph. In effect this implies that if a bipartite graph is epsilon regular then all the edges between the the two disjoint sets are distributed uniformly.

Definition 3: An equitable partition of the vertex set $V$ of a graph $G = (V,E)$ is a partition of $V$ such that all the classes $C_0, C_1, \dots, C_k$ are pairwise disjoint. And that all classes $C_i$ with $1 \leq i \leq k$ have the same cardinality. It is noteworthy that oftentimes the vertex set might not have a cardinality that could be equally divided into the said number of classes, thus $C_0$ is present for a technical reason: To ensure that all the classes have the same cardinality.

Definition 4: For every equitable partition $P$ of the vertex set $V$ of $G = (V,E)$ into classes $C_0, C_1, \dots, C_k$ we associate a measure called the potential or the index of the partition $P$, which is defined as:

$\displaystyle ind(P) = \frac{1}{k^2} \sum_{1 \leq r \leq s \leq k} d(C_r, C_s)^2$

This measure just defines how close a partition is close to a regular one.

Definition 5: An equitable partition $C_0, C_1, \dots, C_k$ of the vertex set of graph $G$ given by $V$, where $C_0$ is the exceptional set, is called $\varepsilon$-regular if $C_0 < \varepsilon |V|$ and all but $\varepsilon k^2$ of the pairs $(C_i,C_j)$ are $\varepsilon$-regular such that $1 \leq i < j \leq k$.

We are now in a position to state the regularity lemma:

Theorem 1: [Regularity Lemma]
For every positive $\varepsilon$ and positive integer m, there are positive integers $N = N(\varepsilon, m)$ and $M = M(\varepsilon, m)$ such that the following property holds: For all graphs $G$ with $|V|\geq N$, there is an $\varepsilon$-regular partition of $G(V,E)$ into $k+1$ classes such that $m\leq k \leq M$.

The beauty of the regularity lemma is in the point that that the approximation for any graph does not depend on the number of vertices it has, but only on the error in approximation (represented by $\varepsilon$).

In the proof of the regularity lemma we start with an initial partition (with low potential) and then iteratively refine the partition in such a way that the potential increases to a point such that the partition is epsilon regular. However, this leads to a major problem (the first of our concerns) – In refining the partitions iteratively the number of classes increases exponentially in each iteration. We then end up with a partition in which the number of classes is a tower function – usually an astronomical figure. Clearly, this does not work very well for graphs arising in practice. This same point is made by Prof. Trevisan:

[…] one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.

As an aside: For a long while it was thought that maybe a tower function is not necessary. However in a remarkable paper Timothy Gowers constructed graphs that demonstrated that functions of the tower type were indeed necessary.

In any case, how can we get around this problem for approximating most graphs so that it could be useful in applications such as clustering? A possible solution to this problem in the context of clustering was proposed by Sperotto and Pelillo. They make some interesting points, however do not provide many details on their approach. We will get back to this problem in a short while.

But the first problem that we face is the following: The Szemeredi Lemma as originally proposed is an existential predicate. It does not give a method to obtain the regular partition for a given graph, but only says that one must exist! So if we are to use the Lemma in any practical setting, we need an algorithmic version. There now exist two algorithmic versions:

1. One proposed in a paper by Alon et al in 1992.

2. Another proposed by Frieze and Kannan and is based on the very intriguing relationship between singular values and regularity!

We for now, focus on the Alon et al. version.

_________________________

Algorithmic Version of the Regularity Lemma:

Theorem 2: [A Constructive Version of the Regularity Lemma – Alon et al.]

For every $\varepsilon > 0$ and every positive integer $t$ there is an integer $Q = Q(\varepsilon, t)$ such that every graph with $n > Q$ vertices has an $\varepsilon$-regular partition into $k + 1$ classes, where $t \le k \le Q$. For every fixed $\varepsilon > 0$ and $t \ge 1$ such a partition can be found in $O(M(n))$ sequential time, where $M(n)$ is the time for multiplying two $n$ by $n$ matrices with 0,1 entries over the integers. It can also be found in time $O(logn)$ on an ERPW PRAM with a polynomial number of parallel processors.

To really understand how the above theorem would require the introduction of several supporting definitions and lemmas (including one from Szemeredi’s original paper). Since our focus is on the application, we’d just state one lemma using which the idea behind the constructive version could be revealed in a more concrete sense.

Lemma 1: [Alon et al.] Let $H$ be a bipartite graph with equally sized classes $|A| = |B| = n$. Let $\displaystyle 2n^{\frac{-1}{4}} < \varepsilon <\frac{1}{16}$. There is an $O(M(n))$ algorithm that verifies that $H$ is $\varepsilon$-regular or finds two subset $A' \subset A$, $B' \subset B$, $\displaystyle |A'| \ge \frac{{\varepsilon}^4}{16}n$, $\displaystyle |B'| \ge \frac{{\varepsilon}^4}{16}n$, such that $\displaystyle |d(A, B) - d(A', B')| \ge \varepsilon^4$. The algorithm can be parallelized and implemented in $NC$.

This lemma basically says that whether or not a bipartite graph is epsilon regular is a question that can be answered very quickly. If it is – then we can have an algorithm to say so and if it is not, it should return the answer with a certificate or proof that it is not so. This certificate is nothing but subsets of the classes from the original bipartite graph. The idea of the certificate is to help to proceed to the next step.

The general idea in the Alon algorithm is:

Start with a random equitable partition of the graph $C_0, C_1, \dots, C_b$, where $\displaystyle |C_i| = \lfloor\ \frac{n}{b} ;\rfloor$ and $i = 1, 2, \dots, b$. Also $|C_0| < b$. Also let $k_1 = b$.

Then, for each pair $C_r, C_s$ in the partition $P_i$ check for regularity. If it is regular report so, if not find the certificate for the pair. If out of all the pairs, only $\varepsilon k_i^2$ are not epsilon regular – then halt, the partition $P_i$ is epsilon regular. If not then we refine this partition using the information gained from the certificates in the cases when the pairs not epsilon regular. On refining this partition we obtain a new partition $P'$ with $1 + k_i4^{k_i}$ classes. We repeat this process till we hit a partition which is epsilon regular.

_________________________

Using the Constructive Version for Clustering:

A little earlier it was pointed out that there are two main difficulties in using the Szemeredi Lemma in practical settings. At one level, to use the lemma in a practical setting we need a constructive version. However the constructive versions are so designed that they work for all graphs. To illustrate how this is a problem, we quote a line from the algorithmic version based on singular values (Freize and Kannan):

The Algorithm finishes in atmost $O(\varepsilon^{-45})$ steps with an $\varepsilon$-regular partition

Now, consider what this implies – If $\varepsilon$ is $\frac{1}{16}$ (a typical value). Then the number of steps in which we are guaranteed to find a regular partition is 1.53249554 × 1054.

This is astonishingly large number. To make the lemma truly practical, we would have to give up the aim of making it work for all graphs. Instead we should be content that it works on most graphs. Most graphs would largely be graphs that appear in practice. Like was mentioned earlier some directions in this regard were provided by Sperotto and Pellilio.

Another problem (as was mentioned earlier too) was that the number of classes grows exponentially with each refinement step. We can not allow the number of classes to grow astronomically either because if we do so then consequently we would never be able to refine the partitions far enough. Here we would have to compromise on the generality again. Instead of allowing the number of classes to grow exponentially we could allow it to grow by a defined value in each iteration. This value could be chosen depending on a number of parameters, including the data set size.

Even in making such an approximation – the potential would always increase with each iteration albeit much slowly as compared to the methodology as originally described. Infact we would say that it would work for most graphs. In our work this approach seems to work quite well.

Given these two changes, what still remains is how could the lemma be used for clustering datasets? One possible way is suggested by another result called the Key Lemma and the implication that it might have. This is stated below:

Lemma 2: Given an arbitrary graph $G=(V,E)$, a partition of $V$ in $k$ clusters as in regularity lemma described above and two parameters $\varepsilon, d$, we describe the reduced graph $G^R$ as the graph whose vertices are associated to the clusters and whose edges are associated to the $\varepsilon$-regular pairs with density above $d$. If we have a coloring on the edges of $G$, then the edges of the reduced graph will be colored with a color that appears on most of the edges between the two clusters.

Using the properties of what is called the Key-Lemma/Blow-Up Lemma, it is understood that this reduced graph should retain properties of the original graph. And thus any changes made on this reduced graph would reflect on the original graph.

Thus, one possible way to cluster is using a two-phase clustering strategy: First, we use the modified regularity lemma to construct a reduced graph using the method above. In the second phase, we use a pairwise clustering method such as spectral clustering to do cluster the the reduced graph. The results obtained on this reduced graph are then projected back onto the original graph using Lemma 2.

Two Phase Strategy to use the Regularity Lemma

Infact such a methodology gives results quite encouraging as compared to other standard clustering methods. These results would be reported here once the paper under consideration is published. :) Another thing to be noted is that the reduced graph is usually quite small as compared to the original graph and working on this smaller – reduced graph is much faster.

_________________________

Other Possible Changes:

1. The Alon et al. algorithm is used and modified to use the regularity lemma for clustering. It would be interesting to explore the results given by the Frieze & Kannan method as it has a different way to find the certificate.

2. There has been some work on the sparse regularity lemma. This is something that would have a lot of practical value as in the above we construct a dense graph. Using the sparse version would allow us to use nearest neighbor graphs instead of dense graphs. This would reduce the computational burden significantly.

3. A recent paper by Fischer et al.Approximate Hypergraph Partitioning and Applications” has received a lot of attention. In this paper they give a new approach for finding regular partitions. All the previous ones work to find partitions of the tower type, while this paper gives a method to find a smaller regular partition if one exists in the graph. Employing this methodology for refinement instead of using an approximate version of the algorithmic regularity lemma could be a fruitful direction of work.

_________________________

Onionesque Reality Home >>

## Face Recognition/Authentication Using Support Vector Machines

This post is part of a series on face recognition, I have been posting on face recognition for a while. There would be at least 7-8 more posts in the near future on the topic. Though I can not promise a time frame within which all would be up.

Previous Related Posts:

3. A Huge Collection of Datasets (Post links to a number of face image databases)

This post would reference two of my posts. One on SVMs and the other on Face Recognition using Eigenfaces.

Note: This post focuses on the idea behind using SVMs for face recognition and authentication. In future posts I will cover the various packages that can be used to implement SVMs and how to go about using them, and specifically for face recognition. The same can be easily extended to other similar problems such as content based retrieval systems, speech recognition, character or signature verification systems as well.

_____

Difference between Face Authentication (Verification) and Face Recognition (also called identification):

This might seem like a silly thing to start with. But for the sake of completeness, It is a good point to start with.

Face Authentication can be considered a subset of face recognition. Though due to the small difference there are a few non-concurrent parts in both the systems.

Face Authentication (also called verification) involves a one to one check that compares an input image (also called a query image, probe image or simply probe) with only the image (or class) that the user claims to be. In simple words, if you stand in front of a face authentication system and claim to be a certain user, the system will ONLY check if you are that user or not.

Face Recognition (or Identification) is another thing, though ofcourse related. It involves a one to many comparison of the input image (or probe or query image) with a template library. In simple words, in a face recognition system the input image will be compared with ALL the classes and then a decision will be made so as to identify to WHO the the input image belongs to. Or if it does not belong to the database at all.

Like I just said before, though both Authentication and Recognition are related there are some differences in the method involved, which are obvious due to the different nature of both.

_____

A Touch-Up of Support Vector Machines:

A few posts ago I wrote a post on why Support Vector Machines had this rather “seemingly” un-intuitive name. It had a brief introduction to SVMs as well. For those completely new to Support Vector Machines this post should help. I’ll still add a little for this post.

Support Vector Machine is a binary classification method that finds the optimal linear decision surface between two classes. The decision surface is nothing but a weighted combination of the support vectors. In other words, the support vectors decide the nature of the boundary between the two classes. Take a look at the image below:

The SVM takes in labeled training examples $\{\; x_i, y_i \}$, where $x_i$ represents the features and $y_i$ the class label, that could be either 1 or -1.  On training we obtain a set of Support Vectors $m$, multipliers $\alpha_i$, $y_i$and the term $b$. To understand what $b$ does, look at the above figure. It is somewhat like the intercept term $c$ in the equation of a straight line, $y = mx + c$. The terms $w$ and $x$ determine the orientation of the hyperplane while $b$ determines the actual position of the hyperplane.

As is indicated in the diagram, the linear decision surface is :

$w\star x + b = 0 \qquad(1)$

where $\displaystyle w = \sum_{i=1}^m \alpha_i y_i s_i$

where $s_i$ are the support vectors.

The above holds when the data (classes) is linearly separable. Sometimes however, that’s not the case. Take the following example:

The two classes are indicated by the two different colors. The data is clearly not LINEARLY separable.

However when mapped onto two dimensions, a linear decision surface between them can be made with ease.

Take another example. In this example the data is not linearly separable in 2-D, so they are mapped onto three dimensions where a linear decision surface between the classes can be made.

By Cover’s Theorem it is more likely that a data-set not linearly separable in some dimension would be linearly separable  in a higher dimension. The above two examples are simple, sometimes the data might be linearly separable at very high dimensions, maybe at infinite dimensions.

But how do we realize it? This done by employing the beautiful Kernel Trick. In place of the inner products we use a suitable Mercer Kernel. I don’t believe it is a good idea to discuss kernels here, or it will be a needless digression from face recognition. I promise to discuss it some time later.

Thus the non-linear decision surface changes from $\qquad(1)$ to:

$\displaystyle w = \sum_{i=1}^m \alpha_i y_i K(s_i, x) +b = 0 \qquad(2)$

Where $K$ represents a Kernel. It could be a Radial Basis (Gaussian) Kernel, A linear Kernel, A polynomial Kernel or a custom Kernel. :)

_____

Face Authentication is a two class problem. As I have mentioned earlier, here the system is presented with a claimed identity and it has to make a decision whether the claimant is really that person or not. The SVM in such applications will have to be fed with the images of one person, which will constitute one class and the other class will consist of images of other people other than that person. The SVM will then generate a linear decision surface.

For a input/probe image $p$, the identity is accepted if:

$w \star p + b < 0$

Or it is rejected. We can parameterize the decision surface by modifying the above as:

$w \star x + b = \Delta$

Then, a claim will be accepted if for a probe, $p$

$w \star p + b < \Delta$

_____

Now face recognition is a $\mathcal{K}$ class problem. Where $\mathcal{K}$ is the number of classes (or individuals).  Whereas the traditional Support Vector Machine is a binary classifier. So we’ll make a few changes to the way we are representing the faces to suit our classifier. I will come back to this in a while.

Feature Extraction: The faces will have to be represented by some appropriate features, these could be weights obtained using the Eigenfaces method, or using gabor features or anything else. I have written a post earlier that talked of a face recognition system based on Eigenfaces. I would direct the reader to check face representation using Eigenfaces there.

Using Eigenfaces, each probe $\Phi$could be represented as a vector of weights:

$\Omega = \begin{bmatrix}w_1\\w_2\\ \vdots\\w_M \end{bmatrix}$

After obtaining such a weight vector for the input or probe image and for all the other images stored in the library, we were simply finding the Euclidean or the Mahalanobis distance of the weight vector of the probe image with those of the images in the template library.  And then were recognizing the probe as a face that gave the minimum score provided it was below a certain threshold. I have discussed this is much detail there. And since I have, I would not discuss this again here.

_____

Representation in Difference Space:

SVMs are binary classifiers, that is – they give the class which might be 1 or -1, so we would have to modify the representation of faces a little bit than what we were doing in that previous post to make it somewhat more desirable. In the previous approach that is “a view based or face space approach”, each image was encoded separately. Here, we would change the representation and encode faces into a difference space. The difference space takes into account the dissimilarities between faces.

In the difference space there can be two different classes.

1. The class that encodes the dissimilarities between different images of the same person,

2. The other class encodes the dissimilarities between images of other people. These two classes are then given to a SVM which then generates a decision surface.

As  I wrote earlier, Face recognition traditionally can be thought of as a $\mathcal{K}$ class problem and face authentication can be thought of as a $\mathcal{K}$ instances two class problem. To reduce it to a two class problem we formulate the problem into a difference space as I have already mentioned.

Now consider a training set $\mathcal{T} = \{ \;t_1, \ldots, t_M\}$ having ${M}$ training images belonging to $\mathcal{K}$ individuals. Each individual can have more than one image, that means $M > \mathcal{K}$ ofcourse. It is from $\mathcal{T}$ that we generate the two classes I mentioned above.

1. The within class differences set. This set takes into account the differences in the images of the same class or individual. In more formal terms:

$\mathcal{C}_1 = \{ \; t_i - t_j | t_i \backsim t_j \}$

Where $t_i$ and $t_j$ are images and $t_i \backsim t_j$ indicates that they belong to the same person.

This set contains the differences not just for one individual but for all $\mathcal{K}$ individuals.

2. The between class differences set. This set gives the dissimilarities of different images of different individually. In more formal terms:

$\mathcal{C}_2 = \{ \; t_i - t_j | t_i \nsim t_j\}$

Where $t_i$ and $t_j$ are images and $t_i \nsim t_j$ indicates that they do not belong to the same person.

_____

Face Authentication:

For Authentication the incoming probe $p$ and a claimed identity $i$ is presented.

Using this, we first find out the similarity score:

$\delta = \displaystyle \sum_{i=1}^m \alpha_i y_i K(s_i, ClaimedID - p) +b$

We then accept this claim if it lies below a certain threshold $\Delta$ or else reject it. I have discussed the need for a threshold at the end of this post, please have a look. $\Delta$ is to be found heuristically.

_____

Face Recognition:

Consider a set of images $\mathcal{T} = \{ \;t_1, \ldots, t_M\}$, and a probe $p$ which is to be indentified.

We take $p$ and score it with every image in the set $t_i$:

$\delta = \displaystyle \sum_{i=1}^m \alpha_i y_i K(s_i, t_i - p) + b$

The image with the lowest score but below a threshold is recognized. I have written at the end of this post explaining why this threshold is important. This threshold is mostly chose heuristically.

_____

References and Important Papers

1. Face Recognition Using Eigenfaces, Matthew A. Turk and Alex P. Pentland, MIT Vision and Modeling Lab, CVPR ‘91.

2. Eigenfaces Versus Fischerfaces : Recognition using Class Specific Linear Projection, Belhumeur, Hespanha, Kreigman, PAMI ‘97.

3. Eigenfaces for Recognition, Matthew A. Turk and Alex P. Pentland, Journal of Cognitive Neuroscience ‘91.

4. Support Vector Machines Applied to Face Recognition, P. J. Phillips, Neural Information Processing Systems ’99.

5. The Nature of Statistical Learning Theory (Book), Vladimir Vapnik, Springer ’99.

6. A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J. C. Burges, Data Mining and Knowledge Discovery, ’99

_____

Onionesque Reality Home >>

Pete Skomoroch of Datawrangling admits that work and other commitments have made him cut down on blogging significantly. However every now and then, he comes up with posts that literally show the astonishing amounts of sustained efforts put in by him.

Though the posting frequency is very low, It still qualifies as one of my favorite blogs simply because it is of great help to me.

Last week, he posted a notification that he had updated his old post which indexed a very large number of datasets on various topics/fields/projects. This post is an absolute life saver.

### Some Datasets Available on the Web – Data Wrangling Blog

Click above to follow

Whenever I have searched for datasets for my problem/instructional/experimental requirements, I have almost always landed on Pete’s page.

Please check that link out for sure! I maintain a list largely constructed using the Data-Wrangling blog for face recognition. And since I have written two posts on face recognition before this one, it makes perfect sense to post that list.

_____

Some Datasets for Face Recognition/Authentication/Detection purposes:

6. The AR Face Database (126 People, over 4000 color images, different illumination conditions, facial expressions and occlusions, two sessions per person)

7. Olivetti Research Limited (a database of about 400 images).

8. University of Berne, Switzerland Face Database (Frontal images of 30 people, 10 images for each with different orientation. Profile images of the same 30 people, 5 images for each).

9. University of Oulu – Physics Face Database (125 faces in 16 different illumination and camera calibration conditions, additional 16 if the person wears glasses).

10. The Georgia Tech Face Database (Images of 50 different subjects taken over two three sessions)

11. The Yale Database (One of the most widely used! 165 grayscale images of 15 indivduals)

12. The Yale Database B (5760 single light source images of 10 subjects in 576 viewing conditions )

13. Labeled Faces in the Wild (U-Mass, For the problem of unconstrained facial recognition. With 13000 images collected from the web, two different images for some individuals)

15. University of Sheffield Face Database (564 images of 20 individuals, mixed race, gender and appearance)

16. University of Essex Face Database (total 7900 images of 395 individuals with 20 images each )

17. Indian Face Database (Collected at IIT-K, 40 subejcts with 11 different poses)

21. The MIT-CBCL Database (Used in a previous post by me. Face Images of 10 subject. Huge and rather simplistic database )

23. Facial Expression Database (Cohn-Kanade-AU-Coded Facial Expression Database. Image data consists of 500 image sequences from 100 subjects)

24. AT&T Database (10 different images each of 40 different subjects)

26. BJUT 3D Face Database (500 individuals – 250 males and 250 females, 3-D Face Data)

29. CAS-PEAL Face Database (99,594 images if 1040 individuals, 595 males and 445 females with varying pose, expression, accessory and lighting)

33. 3D_RMA (3-D Face Image Database)

34. Equinox (Human Identification at a Distance)

I would try to edit this post when I get the time and insert details for each database to facilitate ease in navigation. I apologize for not doing so right away, I did not do so as it is a very time consuming process.

_____

PS: I was tempted to re-blog the list by Pete, but I decided against it. It is his work, he deserves all the kudos!

_____

Onionesque Reality Home >>

## Face Recognition in Bees

I have been following Dr Adrian Dyer’s work for about a year. I discovered him and his work while searching on the Monash University website for research on vision, ML and related fields when a friend of mine was thinking of applying for a PhD there. Dr Dyer is a vision scientist and a top expert on Bees. His present research is based on investigating aspects of insect vision and  how insects make decisions in complex colored natural environments.

His current findings (published just about a  month or so back) could have important cues on how to make improvements in computer based facial recognition systems.

[Thermographic Image of a Bumblebee on a flower showing temperature difference : Image Source ]

_____

When i read about Dr Dyer’s most recent work I was instantly reminded of a rather similar experiment on pigeons done almost a decade and a half back by Watanabe et al [1] .

Digression: Van Gogh, Chagall and Pigeons

Allow me to take a brief digression and have a look at that classic experiment before returning to Bees. The experiment by Watanabe et al is my favorite when i have to initiate a general talk on WHAT is pattern recognition and generalization etc.

In the experiment, pigeons were placed in a skinner box and were trained to recognize paintings by two different artists (eg Van Gogh and Chagall). The training to recognize paintings by a particular artist was done by giving a reward on pecking when presented paintings by that artist.

_____

The pigeons were trained on a marked set of paintings by Van Gogh and Chagall. When tested on the SAME set they gave a discriminative ability of 95 %.  Then they were presented with paintings by the same artists that they had not previously seen. The ability of the pigeons to discriminate was still about 85%, which is a pretty high recognition rate. Interestingly this performance wasn’t much off human performance.

What this means is that pigeons do not memorize pictures but they can extract patterns (the “style” of painting etc) and can generalize from already seen data to make predictions on data never seen before. Isn’t this really what Artificial Neural Networks and Support Vector Machines do? And there is a lot of research going on to work to improve the generalization of these systems.

_____

Coming Back to Bees:

Dr Adrian Dyer’s group has been working on the visual information processing in miniature brains, brains that are much smaller than a primate brain. And honeybees provide a good model for working on such brains. One reason being that a lot about their behavior is known.

Their work has mostly been on color information processing in bees and how it could lend cues  for improvement in vision in robotic systems. Investigations have shown that the very small Bee brain can be taught to recognize extremely difficult/ complex objects.

_____

Color Vision in Bees [2] :

Research by Dr Dyer’s group on color vision in bees has provided key insights on the relationship between the perceptive abilities of bees and the flowers that they frequent. As an example it was known that flowers had evolved particular colors to suit the visual system of bees, but it was not known how it was that certain flower colors were extremely rare in nature. The color discrimination abilities of the Bees were tested by making a virtual meadow. There was high variability in the strategies used by bees to solve problems, however the discriminative abilities of the Bees were somewhat comparable to that of the Human sensory abilities. This research has shown not only how plants evolve flower colors to attract pollinators but also on what is the nature of color vision systems in nature and what are their limitations.

_____

Recognition of Rotated Objects (Faces) [3] :

The ability to recognize 3-Dimensional objects in natural environments is a challenging problem for visual systems especially in condition of rotation of object, variation in lighting conditions, occlusion etc. The effect of rotation on an object sometimes may cause it to look more dissimilar when compared to its rotated version than a non-rotated different object. Human and primate brains are known to recognize novel orientations of rotated objects by a method of interpolating between stored views. Primate brains perform better when presented with novel views that can be interpolated to from stored views than those novel views that required extrapolation beyond the stored views. Some animals, for example pigeons perform equally well on interpolated and extrapolated views.

To test how miniature brains, such as in Bees would deal with the problem posed by rotation of objects, Dr Dyer’s group presented Bees a face recognition task.  The bees were trained for different views (0, 30 and 60 degrees) of two similar faces S1 and S2, these are enough for getting an insight on how the brain solves the problem of a view with which it has no prior experience. The Bees were trained with face images from a standard test for studying vision for human infants.

[Image Source – 3]

Group 1 of bees was trained for 0 degree orientation and then was given a non-reward test on the same view and that of a novel 30 degree view.

Group 2 was trained for 60 degree orientation and then was tested on the same view (non-reward) and a novel 30 degree view.

Group 3 was trained for both the 0 degree and 60 degree orientations and then was tested on the same and novel 30 degree interpolation visual stimuli.

Group 4 was trained with both 0 degree and 30 degree orientations and then was tested on the same and also the novel 60 degree extrapolation view.

Bees in all of the four groups were able to recognize the trained target stimuli well above the chance values. But when presented with novel views Group 1 and Group 2 could not recognize novel views, while Group 3 could. They could do so by interpolating between the 0 degree and the 60 degree views. Group 4 too could not recognize novel views presented to it, indicating that Bees could not extrapolate from the 0 and 30 degree training to recognize faces oriented at 60 degrees.

The experiment puts forth that a miniature brain can learn to recognize complex visual stimuli by experience by a method of image interpolation and averaging. However the performance is bad when the stimuli required extrapolation.

_____

Bees performed poorly on images that were inverted. But in this regard they are in good company,  even humans do not perform well when images are inverted. As an illustration consider these two images:

How similar do these two inverted images look to be at first glance? Very similar? Let’s now have a look at the upright images. The one on the left below corresponds to the image on the left above.

The upright images look very different, even though the inverted images looked very similar.

I would recommend an exercise on rotating the image on the right to its upright position and back, please follow this link for doing so. This was a good illustration of the fact that the human visual system does not perform AS well with inverted images.

_____

The findings show that despite the constrained neural resources of the bee brain (about 0.01 percent as compared to humans) they can perform remarkably well at complex visual recognition tasks. Dr Dyer says this could lead to improved models in artificial systems as this test is evidence that face recognition does not require an extremely advanced nervous system.

The conclusion is that the Bee brain, which is small (about 850,000 neurons) can be simulated with relative ease (as compared to what was previously thought for complex pattern analysis and recognition) for performing rather complex pattern recognition tasks.

On a lighter note, Dr Dyer also says that his findings don’t justify the adage that you should not kill a bee otherwise its nest mates will remember you to take revenge. ;-)

_____

References and Recommended Papers:

[1] Van Gogh, Chagall and Pigeons: Picture Discrimination in Pigeons and Humans in “Animal Cognition”, Springer Varlag – Shigeru Watanabe.

[2] The Evolution of Flower Signals to Attract Pollinators – Adrian Dyer.

[3] Insect Brains use Image Interpolation Mechanisms to Recognise Rotated Objects in “PLos One” – Adrian Dyer, Quoc Vuong.

[4] Honeybees can recognize complex images of Natural scenes for use as Potential Landmarks – Dyer, Rosa, Reser.

_____

Onionesque Reality Home >>

## Face Recognition using Eigenfaces and Distance Classifiers: A Tutorial

Why?

Two Reasons:

1. Eigenfaces is probably one of the simplest face recognition methods and also rather old, then why worry about it at all? Because, while it is simple it works quite well. And it’s simplicity also makes it a good way to understand how face recognition/dimensionality reduction etc works.

2. I was thinking of writing a post based on face recognition in Bees next, so this should serve as a basis for the next post too. The idea of this post is to give a simple introduction to the topic with an emphasis on building intuition. For more rigorous treatments, look at the references.

_____

Introduction

Like almost everything associated with the Human body –  The Brain, perceptive abilities, cognition and consciousness, face recognition in humans is a wonder. We are not yet even close to an understanding of how we manage to do it. What is known is that it is that the Temporal Lobe in the brain is partly responsible for this  ability. Damage to the temporal lobe can result in the condition in which the concerned person can lose the ability to recognize faces. This specific condition where  an otherwise normal person who suffered some damage to a specific region in the temporal lobe loses the ability to recognize faces is called prosopagnosia. It is a very interesting condition  as  the perception of faces remains normal (vision pathways and perception is fine) and the person can recognize people by their voice but not by faces. In one of my previous posts, which had links to a series of lectures by Dr Vilayanur Ramachandran, I did link to one lecture by him in which he talks in detail about this condition. All this aside, not much is known how the perceptual information for a face is coded in the brain too.

_____

A Motivating Parallel

Eigenfaces has a parallel to one of the most fundamental ideas in mathematics and signal processing – The Fourier Series. This parallel is also very helpful to build an intuition to what Eigenfaces (or PCA) sort of does and hence must be exploited. Hence we review the Fourier Series in a few sentences.

Fourier series are named so in the honor of Jean Baptiste Joseph Fourier (Generally Fourier is pronounced as “fore-yay”, however the correct French pronunciation is “foor-yay”) who made important contributions to their development. Representation of a signal in the form of a linear combination of complex sinusoids is called the Fourier Series. What this means is that you can’t just split a periodic signal into simple sines and cosines, but you can also approximately reconstruct that signal given you have information how the sines and cosines that make it up are stacked.

More Formally: Put in more formal terms, suppose $f(x)$ is a periodic function with period $2\pi$ defined in the interval $c\leq x \leq c+2\pi$ and satisfies a set of conditions called the Dirichlet’s conditions:

1. $f(x)$ is finite, single valued and its integral exists in the interval.

2. $f(x)$ has a finite number of discontinuities in the interval.

3. $f(x)$ has a finite number of extrema in that interval.

then $f(x)$ can be represented by the trigonometric series

$f(x) = \displaystyle\frac{a_0}{2} + \displaystyle \sum_{n=1}^\infty [a_n cos(nx) + b_n sin(nx) ]\qquad(1)$

The above representation of $f(x)$ is called the Fourier series and the coefficients $a_0$, $a_n$ and $b_n$ are called the fourier coefficients and are determined from $f(x)$ by Euler’s formulae. The coefficients are given as :

$a_0 = \displaystyle \frac{1}{\pi} \int_{c}^{c+2\pi} f(x)dx$

$a_n = \displaystyle\frac{1}{\pi}\int_{c}^{c+2\pi} f(x)cos(nx)dx$

$\displaystyle b_n = \frac{1}{\pi}\int_{c}^{c+2\pi} f(x)sin(nx)dx$

Note: It is common to define the above using $c = -\pi$

An example that illustrates $\qquad (1)$ or the Fourier series is:

A square wave (given in black) can be approximated to by using a series of sines and cosines (result of this summation shown in blue). Clearly in the limiting case, we could reconstruct the square wave exactly with simply sines and cosines.

_____

Though not exactly the same, the idea behind Eigenfaces is similar. The aim is to represent a face as a linear combination of a set of basis images (in the Fourier Series the bases were simply sines and cosines). That is :

$\displaystyle\Phi_i = \displaystyle\sum_{j=1}^{K}w_j u_j$

Where $\displaystyle \Phi_i$ represents the $i^{th}$ face with the mean subtracted from it, $w_j$ represent weights and $u_j$ the eigenvectors. If this makes somewhat sketchy sense then don’t worry. This was just like mentioning at the start what we have to do.

The big idea is that you want to find a set of images (called Eigenfaces, which are nothing but Eigenvectors of the training data) that if you weigh and add together should give you back a image that you are interested in (adding images together should give you back an image, Right?). The way you weight these basis images (i.e the weight vector) could be used as a sort of a code for that image-of-interest and could be used as features for recognition.

This can be represented aptly in a figure as:

Click to Enlarge

In the above figure, a face that was in the training database was reconstructed by taking a weighted summation of all the basis faces and then adding to them the mean face. Please note that in the figure the ghost like basis images (also called as Eigenfaces, we’ll see why they are called so) are not in order of their importance. They have just been picked randomly from a pool of 70 by me. These Eigenfaces were prepared using images  from the MIT-CBCL database (also I have adjusted the brightness of the Eigenfaces to make them clearer after obtaining them, therefore the brightness of the reconstructed image looks different than those of the basis images).

_____

An Information Theory Approach:

First of all the idea of Eigenfaces considers face recognition as a 2-D recognition problem, this is based on the assumption that at the time of recognition, faces will be mostly upright and frontal. Because of this, detailed 3-D information about the face is not needed. This reduces complexity by a significant bit.

Before the method for face recognition using Eigenfaces was introduced, most of the face recognition literature dealt with local and intuitive features, such as distance between eyes, ears and similar other features. This wasn’t very effective. Eigenfaces inspired by a method used in an earlier paper was a significant departure from the idea of using only intuitive features. It uses an Information Theory appraoch wherein the most relevant face information is encoded in a group of faces that will best distinguish the faces. It transforms the face images in to a set of basis faces, which essentially are the principal components of the face images.

The Principal Components (or Eigenvectors) basically seek directions in which it is more efficient to  represent the data. This is particularly useful for reducing the computational effort. To understand this,  suppose we get 60 such directions, out of these about 40 might be insignificant and only 20 might represent the variation in data significantly, so for calculations it would work quite well to only use the 20 and leave out the rest.  This is illustrated by this figure:

Click to Enlarge

Such an information theory approach will encode not only the local features but also the global features. Such features may or may not be intuitively understandable. When we find the principal components or the Eigenvectors of the image set, each Eigenvector has some contribution from EACH face used in the training set. So the Eigenvectors also have a face like appearance. These look ghost like and are ghost images or Eigenfaces. Every image in the training set can be represented as a weighted linear combination of these basis faces.

The number of Eigenfaces that we would obtain therefore would be equal to the number of images in the training set. Let us take this number to be $M$. Like I mentioned one paragraph before, some of these Eigenfaces are more important in encoding the variation in face images, thus we could also approximate faces using only the $K$ most significant Eigenfaces.

_____

Assumptions:

1. There are $M$ images in the training set.

2. There are $K$ most significant Eigenfaces using which we can satisfactorily approximate a face. Needless to say K < M.

3. All images are $N \times N$ matrices, which can be represented as $N^2 \times 1$ dimensional vectors. The same logic would apply to images that are not of equal length and breadths. To take an example: An image of size 112 x 112 can be represented as a vector of dimension 12544 or simply as a point in a 12544 dimensional space.

_____

Algorithm for Finding Eigenfaces:

1. Obtain $M$ training images $I_1$, $I_2$$I_M$, it is very important that the images are centered.

2. Represent each image $I_i$ as a vector $\Gamma_i$  as discussed above.

$I_i = \begin{bmatrix} a_{11} & a_{12} &\ldots & a_{1N} \ a_{21} & a_{22} & \ldots & a_{2N} \ \vdots & \vdots & \ddots & \vdots \ a_{N1} & a_{N2} & \ldots & a_{NN}\end{bmatrix}_{N\times N} \xrightarrow{\rm concatenation} \begin{bmatrix}\ a_{11} \ \vdots \ a_{1N} \ \vdots\ a_{2N} \ \vdots \ a_{NN} \end{bmatrix}_{N^2\times 1} = \Gamma_i$

Note: Due to a recent WordPress $\LaTeX$ bug, there is some trouble with constructing matrices with multiple columns. To avoid confusion and to maintain continuity, for the time being I am posting an image for the above formula that’s showing an error message. Same goes for some formulae below in the post.

3. Find the average face vector $\Psi$.

$\Psi = \displaystyle\frac{1}{M}\sum_{i=1}^M\Gamma_i$

4. Subtract the mean face from each face vector $\Gamma_i$ to get a set of vectors $\Phi_i$. The purpose of subtracting the mean image from each image vector is to be left with only the distinguishing features from each face and “removing” in a way information that is common.

$\Phi_i = \Gamma_i - \Psi$

5. Find the Covariance matrix $C$:

$C = AA^T$, where $A=[\Phi_1, \Phi_2 \ldots \Phi_M]$

Note that the Covariance matrix has simply been made by putting one modified image vector obtained in  one column each.

Also note that $C$ is a $N^2 \times N^2$ matrix and $A$ is a $N^2\times M$ matrix.

6. We now need to calculate the Eigenvectors $u_i$ of $C$, However note that $C$ is a $N^2 \times N^2$ matrix and it would return $N^2$ Eigenvectors each being $N^2$ dimensional. For an image this number is HUGE.  The computations required would easily make your system run out of memory. How do we get around this problem?

7. Instead of the Matrix $AA^T$ consider the matrix $A^TA$. Remember $A$ is a $N^2\times M$ matrix, thus $A^TA$ is a $M\times M$ matrix. If we find the Eigenvectors of this matrix, it would return $M$ Eigenvectors, each of Dimension $M \times 1$, let’s call these Eigenvectors $v_i$.

Now from some properties of matrices, it follows that: $u_i = Av_i$. We have found out $v_i$ earlier. This implies that using $v_i$ we can calculate the M largest Eigenvectors of $AA^T$. Remember that $M\ll N^2$ as M is simply the number of training images.

8. Find the best $M$ Eigenvectors of $C=AA^T$ by using the relation discussed above. That is: $u_i = Av_i$. Also keep in mind that $\begin{Vmatrix}u_i\end{Vmatrix}=1$.

[6 Eigenfaces for the training set chosen from the MIT-CBCL database, these are not in any order]

9. Select the best $K$ Eigenvectors, the selection of these Eigenvectors is done heuristically.

_____

Finding Weights:

The Eigenvectors found at the end of the previous section, $u_i$ when converted to a matrix in a process that is reverse to that in STEP 2, have a face like appearance. Since these are Eigenvectors and have a face like appearance, they are called Eigenfaces. Sometimes, they are also called as Ghost Images because of their weird appearance.

Now each face in the training set (minus the mean), $\Phi_i$ can be represented as a linear combination of these Eigenvectors $u_i$:

$\Phi_i = \sum_{j=1}^{K}w_ju_j$m, where $u_j$ ‘s are Eigenfaces.

These weights can be calculated as :

$w_j = u_j^T\Phi_i$.

Each normalized training image is represented in this basis as a vector.

$\Omega_i = \begin{bmatrix}w_1\w_2\ \vdots \w_k\end{bmatrix}$

where i = 1,2… M. This means we have to calculate such a vector corresponding to every image in the training set and store them as templates.

_____

Now consider we have found out the Eigenfaces for the training images , their associated weights after selecting a set of most relevant Eigenfaces and have stored these vectors corresponding to each training image.

If an unknown probe face $\Gamma$ is to be recognized then:

1. We normalize the incoming probe $\Gamma$ as $\Phi = \Gamma - \Psi$.

2. We then project this normalized probe onto the Eigenspace (the collection of Eigenvectors/faces) and find out the weights.

$w_i = u_i^T\Phi$.

3. The normalized probe $\Phi$ can then simply be represented as:

$\Omega = \begin{bmatrix}w_1\w_2\vdots\w_K\end{bmatrix}$

After the feature vector (weight vector) for the probe has been found out, we simply need to classify it. For the classification task we could simply use some distance measures or use some classifier like Support Vector Machines (something that I would cover in an upcoming post). In case we use distance measures, classification is done as:

Find $e_r = min\begin{Vmatrix}\Omega - \Omega_i\end{Vmatrix}$. This means we take the weight vector of the probe we have just found out and find its distance with the weight vectors associated with each of the training image.

And if $e_r < \Theta$, where $\Theta$ is a threshold chosen heuristically, then we can say that the probe image is recognized as the image with which it gives the lowest score.

If however $e_r > \Theta$ then the probe does not belong to the database. I will come to the point on how the threshold should be chosen.

For distance measures the most commonly used measure is the Euclidean Distance. The other being the Mahalanobis Distance. The Mahalanobis distance generally gives superior performance. Let’s take a brief digression and look at these two simple distance measures and then return to the task of choosing a threshold.

_____

Distance Measures:

Euclidean Distance: The Euclidean Distance is probably the most widely used distance metric. It is a special case of a general class of norms and is given as:

$\displaystyle\begin{Vmatrix}x-y\end{Vmatrix}_e = \displaystyle\sqrt{\begin{vmatrix}x_i-y_i\end{vmatrix}^2}$

The Mahalanobis Distance: The Mahalanobis Distance is a better distance measure when it comes to pattern recognition problems. It takes into account the covariance between the variables and hence removes the problems related to scale and correlation that are inherent with the Euclidean Distance. It is given as:

$d(x,y) =\sqrt{ (x-y)^TC^{-1}(x-y)}$

Where $C$ is the covariance between the variables involved.

_____

Deciding on the Threshold:

Why is the threshold, $\Theta$ important?

Consider for simplicity we have ONLY 5 images in the training set. And a probe that is not in the training set comes up for the recognition task. The score for each of the 5 images will be found out with the incoming probe. And even if an image of the probe is not in the database, it will still say the probe is recognized as the training image with which its score is the lowest. Clearly this is an anomaly that we need to look at. It is for this purpose that we decide the threshold. The threshold $\Theta$ is decided heuristically.

Click to Enlarge

Now to illustrate what I just said, consider a simpson image as a non-face image, this image will be scored with each of the training images. Let’s say $S_4$ is the lowest score out of all. But the probe image is clearly not beloning to the database. To choose the threshold we choose a large set of random images (both face and non-face), we then calculate the scores for images of people in the database and also for this random set and set the threshold $\Theta$ (which I have mentioned in the “recognition” part above) accordingly.

_____

More on the Face Space:

To conclude this post, here is a brief discussion on the face space.

[Image Source – [1]]

Consider a simplified representation of the face space as shown in the figure above. The images of a face, and in particular the faces in the training set should lie near the face space. Which in general describes images that are face like.

The projection distance $e_r$ should be under a threshold $\Theta$ as already seen. The images of known individual fall near some face class in the face space.

There are four possible combinations on where an input image can lie:

1. Near a face class and near the face space : This is the case when the probe is nothing but the facial image of a known individual (known = image of this person is already in the database).

2. Near face space but away from face class : This is the case when the probe image is of a person (i.e a facial image), but does not belong to anybody in the database i.e away from any face class.

3. Distant from face space and near face class : This happens when the probe image is not of a face however it still resembles a particular face class stored in the database.

4. Distant from both the face space and face class: When the probe is not a face image i.e is away from the face space and is nothing like any face class stored.

Out of the four, type 3 is responsible for most false positives. To avoid them, face detection is recommended to be a part of such a system.

_____

References and Important Papers

1.Face Recognition Using Eigenfaces, Matthew A. Turk and Alex P. Pentland, MIT Vision and Modeling Lab, CVPR ’91.

2. Eigenfaces Versus Fischerfaces : Recognition using Class Specific Linear Projection, Belhumeur, Hespanha, Kreigman, PAMI ’97.

3. Eigenfaces for Recognition, Matthew A. Turk and Alex P. Pentland, Journal of Cognitive Neuroscience ’91.

_____

Related Posts:

_____

MATLAB Codes:

I suppose the MATLAB codes for the above are available on atleast 2-3 locations around the internet.

However it might be useful to put out some starter code. Note that the variables have the same names as those in the description above.You may use this mat file for testing. These are the labels for that mat file.

%This is the starter code for only the part for finding eigenfaces.
%In this case load the .mat filed attached above.
% use reshape command to see an image which is stored as a row in the above.

% The code is only skeletal.
% Find psi - mean image
Psi_train = mean(features_faces')';
% Find Phi - modified representation of training images.
% 548 is the total number of training images.
for i = 1:548
Phi(:,i) = raw_features(:,i) - Psi_train;
end
% Create a matrix from all modified vector images
A = Phi;
% Find covariance matrix using trick above
C = A'*A;
[eig_mat, eig_vals] = eig(C);
% Sort eigen vals to get order
eig_vals_vect = diag(eig_vals);
[sorted_eig_vals, eig_indices] = sort(eig_vals_vect,'descend');
sorted_eig_mat = zeros(548);
for i=1:548
sorted_eig_mat(:,i) = eig_mat(:,eig_indices(i));
end
% Find out Eigen faces
Eig_faces = (A*sorted_eig_mat);
size_Eig_faces=size(Eig_faces);
% Display an eigenface using the reshape command.

% Find out weights for all eigenfaces
% Each column contains weight for corresponding image
W_train = Eig_faces'*Phi;
face_fts = W_train(1:250,:)'; % features using 250 eigenfaces.

_____
Onionesque Reality Home >>