A semi-useful fact for people working in Machine Learning.
Blurb: Recently, my supervisor was kind enough to institute a small reading group on Deep Neural Networks to help us understand them better. A side product of this would hopefully be that I get to explore an old interest of mine (I have always wanted to study Neural Nets in detail, especially work from the mid 90s when the area matured and connections to many other areas, such as statistical physics, became apparent. But before I got around to that, they seem to be back in vogue! I dub this resurgence Connectionism 2.0. Although the hipster in me dislikes hype, it always lends a good excuse to explore an interest). I also attended a Summer School on the topic at UCLA; find slides and talks over here.
Coming back: we have been making our way through this Foundations and Trends volume by Yoshua Bengio (which I highly recommend). Bengio spends a lot of time giving intuitions and arguments for why and when Deep Architectures are useful. One particular argument (which is actually quite general, and not specific to trees as it might appear) went like this:
Ensembles of trees (like boosted trees, and forests) are more powerful than a single tree. They add a third level to the architecture which allows the model to discriminate among a number of regions exponential in the number of parameters. As illustrated in […], they implicitly form a distributed representation […]
This is followed by an illustration with the following figure:
[Image Source: Learning Deep Architectures for AI]
and accompanying text to the above figure:
Whereas a single decision tree (here just a two-way partition) can discriminate among a number of regions linear in the number of parameters (leaves), an ensemble of trees can discriminate among a number of regions exponential in the number of trees, i.e., exponential in the total number of parameters (at least as long as the number of trees does not exceed the number of inputs, which is not quite the case here). Each distinguishable region is associated with one of the leaves of each tree (here there are three 2-way trees, each defining two regions, for a total of seven regions). This is equivalent to a multi-clustering, here three clusterings each associated with two regions.
Now, the following question was considered: referring to the text in boldface above, is the number of regions obtained exponential in the general case? It is easy to see that there are cases where it is not: for example, the number of regions obtained by three trees would be the same as that obtained by a single tree if all three trees overlap completely, giving no benefit. The claim above refers to a paper (also by Yoshua Bengio) which constructs examples showing that the number of regions can indeed be exponential in some cases.
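To make the counting concrete, here is a minimal sketch (my own toy setup, not the construction from the paper) in which each "tree" is reduced to a single 2-way linear split of the plane, so a point's "leaf code" is simply the tuple of sides it falls on across the ensemble. Counting distinct codes over a dense sample approximates the number of regions the ensemble can discriminate; the function and variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_regions(splits, n_samples=500_000):
    """Approximate how many regions an ensemble of 2-way linear splits can
    discriminate, by counting distinct side-patterns over a dense sample."""
    pts = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    # One boolean column per split; each row is a point's "leaf code".
    codes = np.column_stack([(pts @ w + b) > 0 for w, b in splits])
    return len(np.unique(codes, axis=0))

# Three distinct splits in general position: 1 + 3 + C(3,2) = 7 regions.
distinct = [(np.array([1.0, 0.0]),  0.1),
            (np.array([0.0, 1.0]), -0.2),
            (np.array([1.0, 1.0]),  0.0)]

# Three identical splits (fully "overlapping" trees): no benefit over one tree.
identical = [(np.array([1.0, 0.0]), 0.1)] * 3

print(count_regions(distinct))   # 7
print(count_regions(identical))  # 2
```

With three splits in general position the count comes out to the 7 regions of the figure, while three coinciding splits give only 2, matching the overlapping-trees observation above.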
But suppose, for our satisfaction, we are interested in actual numbers and in the following general question, an answer to which should also answer the question raised above:
What is the maximum possible number of regions that can be obtained in ${\bf R}^d$ by the intersection of $n$ hyperplanes?
Let’s consider some examples in the ${\bf R}^2$ case.

When there is one hyperplane, i.e. $n = 1$, the maximum number of possible regions is obviously two.

Clearly, for two hyperplanes, the maximum number of possible regions is 4.

In case there are three hyperplanes, the maximum number of regions possible would be 7, as illustrated in the first figure. For $n = 4$, i.e. 4 hyperplanes, this number is 11 as shown below:
[4 Hyperplanes: 11 Regions]
On inspection we can see that the number of regions given by $n$ hyperplanes (lines) in general position in ${\bf R}^2$ is:

Number of Regions = #lines + #intersections + 1

and since $n$ lines in general position intersect in $\binom{n}{2}$ points, this equals $1 + n + \binom{n}{2}$.
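As a quick sanity check, here is a small sketch (the function name is mine) that evaluates this formula for lines in general position and reproduces the counts above:

```python
from math import comb

def regions_2d(n_lines):
    """Max regions from n lines in general position in R^2:
    1 (the whole plane) + #lines + #intersections, where #intersections = C(n, 2)."""
    return 1 + n_lines + comb(n_lines, 2)

print([regions_2d(n) for n in range(1, 5)])  # [2, 4, 7, 11]
```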
Now let’s consider the general case: what is the expression that would give us the maximum number of regions possible with $n$ hyperplanes in ${\bf R}^d$? It turns out that there is a definite expression for the same:

Let $\mathcal{A}$ be an arrangement of $n$ hyperplanes in ${\bf R}^d$; then the maximum number of regions possible is given by

$r(\mathcal{A}) = \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{d} = \sum_{k=0}^{d} \binom{n}{k}$

and the number of bounded regions (closed from all sides) is given by:

$b(\mathcal{A}) = \binom{n-1}{d}$
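Here is a small sketch (function names mine) that evaluates both expressions and checks them against the ${\bf R}^2$ counts above:

```python
from math import comb

def max_regions(n, d):
    """Max number of regions cut out by n hyperplanes in R^d:
    C(n,0) + C(n,1) + ... + C(n,d)."""
    return sum(comb(n, k) for k in range(d + 1))

def max_bounded_regions(n, d):
    """Max number of bounded regions: C(n-1, d)."""
    return comb(n - 1, d)

# d = 2 recovers the line counts from above.
print([max_regions(n, 2) for n in range(1, 5)])          # [2, 4, 7, 11]
print([max_bounded_regions(n, 2) for n in range(1, 5)])  # [0, 0, 1, 3]
```

Note that when $n \le d$ the sum is exactly $2^n$, so the count is exponential in $n$ precisely in the regime Bengio mentions ("as long as the number of trees does not exceed the number of inputs"); for fixed $d$ and large $n$ it grows only polynomially, roughly like $n^d$.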
The above expressions actually derive from a more general expression (Zaslavsky's theorem) that applies when $\mathcal{A}$ is specifically a real arrangement.

I am not familiar with some of the techniques required to prove this in the general case (where $\mathcal{A}$ is a set of affine hyperplanes in a vector space defined over some field). The above might be proven by induction. However, it turns out that in the ${\bf R}^2$ case the result is related to the crossing lemma (see the discussion over here); I think it was one of the favourite results of one of my previous advisors, and he often talked about it. This connection makes me want to study the details [1], [2], which I haven't had the time to look at, and I will post the proof for the general version in a future post.
________________
References:
1. An Introduction to Hyperplane Arrangements: Richard P. Stanley (PDF)
Related (helpful for the proof):

2. Partially Ordered Sets: William Trotter (Handbook of Combinatorics) (PDF)
________________
You can find two different proofs for the general case in the book Lectures on Discrete Geometry by Jiří Matoušek, Chapter 6, Section 6.1.