
## The Origin of Kolmogorov Complexity

A mildly personal post.

The title does not imply that the lines quoted below correspond to the exact origin of Kolmogorov Complexity, though they are related and give away the essence.

Andrey Kolmogorov

Information theory must precede probability theory and not be based on it. By the very essence of this discipline, the foundations of information theory have a finite combinatorial character.

With my background in Electrical Engineering I had the opportunity to take courses in Information Theory and Coding, which made Shannon's Information Theory quite familiar. But there came a time when I had enough background to start noticing conversations that had perhaps been relegated to the background before, simply because I didn't know enough to make any sense of them; they were more or less noise to me. These conversations happened to be about Kolmogorov Complexity. I hadn't sat down and studied it, but I had been reading articles here and there that mentioned it, even alongside ideas such as the Gödel Incompleteness Theorems and the Halting Problem. This created the impression that the area must be fundamental, though it was not clear why.

And then I came across the above rather cryptic lines by Kolmogorov. Used as I was to the idea of entropy (defined in terms of probability) as information, they made my brain hurt. I spent a couple of days thinking about them and suddenly realized WHY the area was so fundamental, and things started making more sense. Of course I still knew almost nothing about it as such, but the two-day thinking session convinced me that, in a sense, it was as fundamental as calculus for the things I was interested in (along with Shannon's and Fisher's ideas). It also convinced me to learn more about it no matter what projects I was involved in, and I immediately bought a book that I have been working my way through as an aside to my main work (linked below).

I find such insightful one-liners, which cause almost a phase transition or a complete change in the way you look at something (information theory in this case), quite remarkable; they make the new view very beautiful. Of course there is a "right" time for them to occur, and this was certainly one of them. The lines below had an auxiliary effect too:

The applications of probability theory can be put on a uniform basis. It is always a matter of consequences of hypotheses about the impossibility of reducing in one way or another the complexity of the descriptions of the objects in question. Naturally this approach to the matter does not prevent the development of probability theory as a branch of mathematics being a special case of general measure theory.

The concepts of information theory as applied to infinite sequences give rise to very interesting investigations, which, without being indispensable as a basis of probability theory, can acquire a certain value in investigation of the algorithmic side of mathematics as a whole.

– Andrey Kolmogorov (1983)

While the above was a more personal story, there are many famous examples of cryptic one-liners changing a view. Here's one:

A Famous Cryptic Comment:

Robert Fano

I remember reading a story about the great mathematician and electrical engineer Robert Fano. Around the same time, the father of Cybernetics, Norbert Wiener, was at MIT and was famous for wandering around campus and talking to anybody about anything that caught his fancy. There are stories of graduate students running away to save their time when Wiener was sighted approaching. Wiener's eccentricities are famous (recommendation [2] below), but let me not digress. On one of his routine wanderings he appeared in Fano's office and made a cryptic comment:

You know, information is entropy.

Fano spent a good deal of time thinking about what this might mean, and he has himself remarked that it was in part responsible for his developing, completely independently, the first law of Shannon's theory. Claude Shannon even cited Fano in his famous paper.

I can’t help thinking that such one liners are perhaps the best examples of information compression and Kolmogorov complexity.

_________________________

Recommendations:

1. An Introduction to Kolmogorov Complexity and its Applications – Ming Li and Paul Vitanyi (on the basis of the first third)

2. Dark Hero of the Information Age – Conway and Siegelman

_________________________

Earlier Related Post:

1. Ray Solomonoff is No More (has a short discussion of Solomonoff's ideas. It is noteworthy that Solomonoff published the first paper in what is today called Kolmogorov Complexity. His approach to the area was through induction; Kolmogorov and Chaitin approached it from randomness).

## Importing the Szemerédi Regularity Lemma into Machine Learning

Synopsis of a recent direction of work with Gábor Sárközy, Endre Szemerédi and Fei Song — “The Regularity Lemma is a deep result from extremal graph theory having great utility as a fundamental tool to prove theoretical results, but can it be employed in more “practical” settings?”

More specifically, we are interested in harnessing the power of the regularity lemma to do clustering. This blog post is organized as follows: we first sketch the regularity lemma, then note that it is an existential statement and state an algorithmic version, and then look at how this constructive version may be used for clustering/segmentation. It must be noted that the idea seems to have originated from earlier interesting work by Sperotto and Pelillo.

Before a brief introduction to the regularity lemma, I'd like to quote Luca Trevisan on (the related) Szemerédi's Theorem from his blog:

Szemeredi’s theorem on arithmetic progressions is one of the great triumphs of the “Hungarian approach” to mathematics: pose very difficult problems, and let deep results, connections between different areas of math, and applications, come out as byproducts of the search for a solution.

Even though I am nothing close to being a mathematician, we'll see that even an Electrical Engineer can somewhat understand and appreciate what Prof. Trevisan means in this specific context! ;-)

I will attempt to sketch a few things below with more emphasis on intuition than rigour! To be specific – intuition and then rigour.

_________________________

Introduction: One of the most fundamental and ingenious results in graph theory and discrete mathematics is the Szemerédi Regularity Lemma. It was originally advanced by Endre Szemerédi as an auxiliary result in proving a long-standing conjecture of Erdős and Turán from 1936 (on the Ramsey properties of arithmetic progressions), which stated that any sequence of integers with positive upper density must contain arbitrarily long arithmetic progressions (now called Szemerédi's theorem, proved in 1975). The regularity lemma by itself is now considered one of the most important tools in graph theory.

Prof. Endre Szemeredi: Originator of the "Regularity Method"

A very rough statement of the regularity lemma could be made as follows:

Every graph can be approximated by random graphs. This is in the sense that every graph can be partitioned into a bounded number of equal parts such that:
1. Most edges run between different parts
2. And that these edges behave as if generated at random.

The fundamental nature of the Regularity Lemma as a tool in graph theory (and beyond) can be appreciated from the fact that two recent Fields Medals (Timothy Gowers, 1998 and Terence Tao, 2006) were awarded at least in part for breakthrough results on the regularity lemma and its use in proving fundamental results on arithmetic progressions, among other results (such as the Green-Tao Theorem).

_________________________

A Temporary Digression and Prof. Trevisan's Statement: A first look at the statement of the regularity lemma above reminds one (especially those trained in signal processing, I suppose) of the Fourier series. While this is extremely simplified, it is not entirely incorrect to say that it points in part to a deeper relationship between combinatorics and analysis. Let's look at it this way: Szemerédi employed the regularity lemma (a graph-theoretic method) to prove what is now called Szemerédi's theorem. The theorem is a difficult one, and it has four different proofs (graph-theoretic, ergodic, hypergraph-theoretic, and one based on Fourier analysis).* In fact, work on Szemerédi's theorem has contributed to a better understanding of the connections between these areas (as an example, Timothy Gowers was awarded the Fields Medal for proving very deep results on the connections between analysis and combinatorics).

Terence Tao and Timothy Gowers have made breakthrough contributions to the regularity lemma and Szemeredi's Theorem amongst numerous others (Image Source: Republic of Math)

So it could be said that it all really started with Erdős and Turán posing a hard problem: the search for its solution has led not only to some good theorems but also to some interesting connections between different areas of mathematics, and thus a better understanding of all of them. So we see what Prof. Trevisan means!

*While the connection here appears sketchy though intuitive, the strong relationship between the Regularity Lemma and analysis was shown in a breakthrough paper by László Lovász and Balázs Szegedy, in which they showed that the regularity lemma can be thought of as a result in analysis:

Szemeredi Regularity Lemma for the Analyst (László Lovász and Balázs Szegedy)

_________________________

Given the background now we get to stating Szemeredi’s Regularity Lemma:

The Szemerédi Lemma: Expanding upon the rough statement a couple of paragraphs above, we can be a little more specific and say that

The Lemma states that every graph can be partitioned into a bounded number of quasi-random bipartite graphs, called regular pairs, and a few leftover edges.

Now for a more precise statement, we introduce some definitions and notation.

Definition 1: Let $G = (V,E)$ be a graph and let $A, B$ be disjoint subsets of $V$. We denote the number of edges with one endpoint in $A$ and the other in $B$ by $e(A,B)$. The density of edges between $A$ and $B$ is then defined as:

$\displaystyle d(A,B) = \frac{e(A,B)}{|A||B|}$

This just defines the edge density in a bipartite graph.
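To make Definition 1 concrete, here is a tiny Python sketch (the function name and edge representation are my own choices, not from any paper) that computes $d(A,B)$ for a graph given as a set of edge tuples:

```python
from itertools import product

def edge_density(edges, A, B):
    """d(A, B) = e(A, B) / (|A| |B|): the fraction of possible A-B edges present.
    Edges are unordered pairs stored as tuples, so both orientations are checked."""
    e_AB = sum(1 for u, v in product(A, B) if (u, v) in edges or (v, u) in edges)
    return e_AB / (len(A) * len(B))

# Complete bipartite graph between {0, 1} and {2, 3}: every possible edge exists.
edges = {(0, 2), (0, 3), (1, 2), (1, 3)}
print(edge_density(edges, {0, 1}, {2, 3}))  # 1.0
```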

Definition 2: The bipartite graph $G = (A, B, E)$ is called $\varepsilon$-regular if for every $X \subset A$ and $Y \subset B$ satisfying

$|X| > \varepsilon |A|, |Y| > \varepsilon |B|$

we would have $|d(X,Y) - d(A,B)| < \varepsilon$

This means that a bipartite graph is $\varepsilon$-regular if, taking any subsets $X, Y$ (of some minimum size) of $A, B$ respectively, we find that the edge density between these subsets is almost the same as the edge density of the original bipartite graph. In effect this implies that if a bipartite graph is $\varepsilon$-regular, then the edges between the two disjoint sets are distributed nearly uniformly.
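Definition 2 can be checked literally, though only on toy graphs, since the check enumerates all sufficiently large subsets. A brute-force Python sketch (names are my own):

```python
from itertools import chain, combinations, product

def edge_density(edges, A, B):
    e = sum(1 for u, v in product(A, B) if (u, v) in edges or (v, u) in edges)
    return e / (len(A) * len(B))

def large_subsets(S, eps):
    """All subsets X of S with |X| > eps * |S| (exponentially many!)."""
    S = sorted(S)
    return chain.from_iterable(
        combinations(S, r) for r in range(1, len(S) + 1) if r > eps * len(S))

def is_eps_regular(edges, A, B, eps):
    """Literal check of Definition 2; feasible only for tiny graphs."""
    d = edge_density(edges, A, B)
    return all(abs(edge_density(edges, set(X), set(Y)) - d) < eps
               for X in large_subsets(A, eps) for Y in large_subsets(B, eps))

# A complete bipartite graph is eps-regular: every subset pair has density 1.
A, B = {0, 1, 2}, {3, 4, 5}
complete = {(a, b) for a in A for b in B}
print(is_eps_regular(complete, A, B, 0.5))  # True
```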

Definition 3: An equitable partition of the vertex set $V$ of a graph $G = (V,E)$ is a partition of $V$ into pairwise disjoint classes $C_0, C_1, \dots, C_k$ such that all classes $C_i$ with $1 \leq i \leq k$ have the same cardinality. Since $|V|$ often cannot be divided exactly into the required number of classes, the exceptional class $C_0$ is present for a technical reason: to ensure that all the other classes have the same cardinality.
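A minimal sketch of building such an equitable partition in Python (the function name and the choice to put the leftover vertices in $C_0$ are my own):

```python
def equitable_partition(vertices, k):
    """Split vertices into an exceptional class C0 (size < k) plus k
    classes of equal size floor(n / k), per Definition 3."""
    vs = sorted(vertices)
    n = len(vs)
    size = n // k
    c0 = set(vs[: n - size * k])       # leftover vertices, |C0| < k
    rest = vs[n - size * k:]
    classes = [set(rest[i * size:(i + 1) * size]) for i in range(k)]
    return [c0] + classes

parts = equitable_partition(range(10), 3)
print([sorted(C) for C in parts])  # [[0], [1, 2, 3], [4, 5, 6], [7, 8, 9]]
```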

Definition 4: For every equitable partition $P$ of the vertex set $V$ of $G = (V,E)$ into classes $C_0, C_1, \dots, C_k$ we associate a measure called the potential or the index of the partition $P$, which is defined as:

$\displaystyle ind(P) = \frac{1}{k^2} \sum_{1 \leq r < s \leq k} d(C_r, C_s)^2$

This measure captures how close a partition is to a regular one.
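A small Python sketch of the index computation (my own naming; the exceptional class $C_0$ does not enter the sum, so only $C_1, \dots, C_k$ are passed in):

```python
from itertools import combinations, product

def edge_density(edges, A, B):
    e = sum(1 for u, v in product(A, B) if (u, v) in edges or (v, u) in edges)
    return e / (len(A) * len(B))

def index_of_partition(edges, classes):
    """ind(P) = (1/k^2) * sum over pairs r < s of d(C_r, C_s)^2."""
    k = len(classes)
    return sum(edge_density(edges, Cr, Cs) ** 2
               for Cr, Cs in combinations(classes, 2)) / k ** 2

# Two classes joined by a complete bipartite graph: ind = (1/4) * 1^2 = 0.25
print(index_of_partition({(0, 2), (0, 3), (1, 2), (1, 3)}, [{0, 1}, {2, 3}]))
```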

Definition 5: An equitable partition $C_0, C_1, \dots, C_k$ of the vertex set $V$ of a graph $G$, where $C_0$ is the exceptional set, is called $\varepsilon$-regular if $|C_0| < \varepsilon |V|$ and all but at most $\varepsilon k^2$ of the pairs $(C_i,C_j)$ with $1 \leq i < j \leq k$ are $\varepsilon$-regular.

We are now in a position to state the regularity lemma:

Theorem 1: [Regularity Lemma]
For every positive $\varepsilon$ and positive integer m, there are positive integers $N = N(\varepsilon, m)$ and $M = M(\varepsilon, m)$ such that the following property holds: For all graphs $G$ with $|V|\geq N$, there is an $\varepsilon$-regular partition of $G(V,E)$ into $k+1$ classes such that $m\leq k \leq M$.

The beauty of the regularity lemma is that the number of classes in the approximating partition does not depend on the number of vertices of the graph, but only on the error of approximation (represented by $\varepsilon$).

In the proof of the regularity lemma we start with an initial partition (with low potential) and then iteratively refine it so that the potential increases until the partition is $\varepsilon$-regular. However, this leads to a major problem (the first of our concerns): in refining the partitions iteratively, the number of classes increases exponentially with each iteration. We end up with a partition in which the number of classes is a tower function – usually an astronomical figure. Clearly, this does not work very well for graphs arising in practice. The same point is made by Prof. Trevisan:

[…] one may imagine using the Regularity Lemma as follows: to prove an approximate quantitative statement about arbitrary graphs, we only need to prove a related statement for the finite set of objects that approximate arbitrary graphs up to a certain level of accuracy. The latter, completely finite, goal may be performed via a computer search. Unfortunately, the size of the objects arising in the Regularity Lemma grow so astronomically fast in the approximation that this approach is completely impractical.

As an aside: for a long while it was thought that perhaps a tower function is not necessary. However, in a remarkable paper Timothy Gowers constructed graphs demonstrating that bounds of tower type are indeed necessary.
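To get a feel for how fast tower-type functions grow, here is a throwaway Python sketch (my own illustration, not from the papers discussed):

```python
def tower(height, base=2):
    """Iterated exponentiation: tower(3) = 2**(2**2) = 16."""
    result = 1
    for _ in range(height):
        result = base ** result
    return result

print([tower(h) for h in range(1, 5)])  # [2, 4, 16, 65536]
print(len(str(tower(5))))               # tower(5) already has 19729 digits
```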

In any case, how can we get around this problem so that the approximation works for most graphs and becomes useful in applications such as clustering? A possible solution in the context of clustering was proposed by Sperotto and Pelillo. They make some interesting points, but do not provide many details of their approach. We will get back to this problem in a short while.

But the first problem we face is the following: the Szemerédi Lemma as originally proposed is purely existential. It does not give a method to obtain a regular partition for a given graph, but only says that one must exist! So if we are to use the Lemma in any practical setting, we need an algorithmic version. There now exist two algorithmic versions:

1. One proposed in a paper by Alon et al. in 1992.

2. Another proposed by Frieze and Kannan, based on the very intriguing relationship between singular values and regularity!

For now, we focus on the Alon et al. version.

_________________________

Algorithmic Version of the Regularity Lemma:

Theorem 2: [A Constructive Version of the Regularity Lemma – Alon et al.]

For every $\varepsilon > 0$ and every positive integer $t$ there is an integer $Q = Q(\varepsilon, t)$ such that every graph with $n > Q$ vertices has an $\varepsilon$-regular partition into $k + 1$ classes, where $t \le k \le Q$. For every fixed $\varepsilon > 0$ and $t \ge 1$ such a partition can be found in $O(M(n))$ sequential time, where $M(n)$ is the time for multiplying two $n \times n$ matrices with 0,1 entries over the integers. It can also be found in time $O(\log n)$ on an EREW PRAM with a polynomial number of parallel processors.

Really understanding the above theorem would require introducing several supporting definitions and lemmas (including one from Szemerédi's original paper). Since our focus is on the application, we will just state one lemma, from which the idea behind the constructive version can be seen more concretely.

Lemma 1: [Alon et al.] Let $H$ be a bipartite graph with equally sized classes $|A| = |B| = n$, and let $2n^{-1/4} < \varepsilon < \frac{1}{16}$. There is an $O(M(n))$ algorithm that either verifies that $H$ is $\varepsilon$-regular or finds two subsets $A' \subset A$, $B' \subset B$ with $\displaystyle |A'| \ge \frac{\varepsilon^4}{16}n$, $\displaystyle |B'| \ge \frac{\varepsilon^4}{16}n$, such that $|d(A, B) - d(A', B')| \ge \varepsilon^4$. The algorithm can be parallelized and implemented in $NC$.

This lemma basically says that whether a bipartite graph is $\varepsilon$-regular is a question that can be answered quickly: if it is, the algorithm says so; if it is not, it returns the answer together with a certificate, or proof, that it is not. This certificate is nothing but a pair of subsets of the classes of the original bipartite graph, and its purpose is to guide the next refinement step.

The general idea in the Alon algorithm is:

Start with a random equitable partition of the graph into classes $C_0, C_1, \dots, C_b$, where $\displaystyle |C_i| = \lfloor \frac{n}{b} \rfloor$ for $i = 1, 2, \dots, b$ and $|C_0| < b$. Let $k_1 = b$.

Then, for each pair $C_r, C_s$ in the partition $P_i$, check for regularity. If a pair is regular, report so; if not, find a certificate for it. If at most $\varepsilon k_i^2$ of the pairs are not $\varepsilon$-regular, halt: the partition $P_i$ is $\varepsilon$-regular. Otherwise, refine the partition using the information gained from the certificates of the irregular pairs. Refining yields a new partition $P'$ with $1 + k_i 4^{k_i}$ classes. We repeat this process until we reach a partition that is $\varepsilon$-regular.
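The loop above can be sketched in Python. This is a heavily simplified illustration, not the Alon et al. algorithm itself: the regularity check and certificate search are brute force (usable only on toy graphs), and the refinement naively splits classes along one certificate each, rather than atomizing with respect to all certificates at once (which is what blows the class count up to $1 + k \cdot 4^k$):

```python
from itertools import chain, combinations, product

def density(edges, A, B):
    """Edge density d(A, B) between two disjoint vertex sets."""
    if not A or not B:
        return 0.0
    e = sum(1 for u, v in product(A, B) if (u, v) in edges or (v, u) in edges)
    return e / (len(A) * len(B))

def find_certificate(edges, A, B, eps):
    """Return None if (A, B) is eps-regular, else a witnessing pair (X, Y)."""
    d = density(edges, A, B)
    def subsets(S):
        S = sorted(S)
        return chain.from_iterable(
            combinations(S, r) for r in range(1, len(S) + 1) if r > eps * len(S))
    for X in subsets(A):
        for Y in subsets(B):
            if abs(density(edges, set(X), set(Y)) - d) >= eps:
                return set(X), set(Y)
    return None

def refine_step(edges, classes, eps):
    """One simplified refinement pass; returns (new_classes, halted)."""
    irregular = []
    for (i, Ci), (j, Cj) in combinations(list(enumerate(classes)), 2):
        cert = find_certificate(edges, Ci, Cj, eps)
        if cert is not None:
            irregular.append((i, j, cert))
    if len(irregular) <= eps * len(classes) ** 2:
        return classes, True               # partition is already eps-regular
    new_classes = list(classes)
    for i, j, (X, Y) in irregular:         # naive split along each certificate
        new_classes[i], new_classes[j] = X, Y
        new_classes += [classes[i] - X, classes[j] - Y]
    return [C for C in new_classes if C], False

# A complete bipartite pair is regular, so the loop halts immediately:
classes = [{0, 1, 2}, {3, 4, 5}]
complete = {(a, b) for a in classes[0] for b in classes[1]}
print(refine_step(complete, classes, 0.5)[1])  # True
```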

_________________________

Using the Constructive Version for Clustering:

A little earlier it was pointed out that there are two main difficulties in using the Szemerédi Lemma in practical settings. At one level, we need a constructive version; however, the constructive versions are designed to work for all graphs. To illustrate why this is a problem, we quote a line from the algorithmic version based on singular values (Frieze and Kannan):

The algorithm finishes in at most $O(\varepsilon^{-45})$ steps with an $\varepsilon$-regular partition.

Now consider what this implies: if $\varepsilon$ is $\frac{1}{16}$ (a typical value), then the number of steps within which we are guaranteed to find a regular partition is $16^{45} \approx 1.53 \times 10^{54}$.

This is an astonishingly large number. To make the lemma truly practical, we have to give up the aim of making it work for all graphs and instead be content that it works on most graphs – largely the graphs that appear in practice. As mentioned earlier, some directions in this regard were provided by Sperotto and Pelillo.
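As a quick sanity check of the arithmetic above (a throwaway snippet; the constant hidden in the $O(\cdot)$ is taken as 1):

```python
eps = 1 / 16
steps = (1 / eps) ** 45   # the eps^-45 bound with the constant assumed to be 1
print(f"{steps:.3e}")     # about 1.532e+54
```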

Another problem (also mentioned earlier) is that the number of classes grows exponentially with each refinement step. We cannot allow the number of classes to grow astronomically, because then we would never be able to refine the partitions far enough. Here we have to compromise on generality again: instead of allowing the number of classes to grow exponentially, we allow it to grow by a fixed amount in each iteration. This amount can be chosen depending on a number of parameters, including the size of the data set.

Even with such an approximation, the potential still increases with each iteration, albeit much more slowly than in the original method, and we would expect this to work for most graphs. In our work this approach seems to work quite well.

Given these two changes, what remains is how the lemma could be used for clustering datasets. One possible way is suggested by another result, called the Key Lemma, and its implications. It is stated below:

Lemma 2: Given an arbitrary graph $G=(V,E)$, a partition of $V$ into $k$ clusters as in the regularity lemma described above, and two parameters $\varepsilon, d$, we define the reduced graph $G^R$ as the graph whose vertices correspond to the clusters and whose edges correspond to the $\varepsilon$-regular pairs with density above $d$. If we have a coloring on the edges of $G$, then each edge of the reduced graph is colored with a color that appears on most of the edges between the two clusters.

Using the properties of what is called the Key Lemma/Blow-Up Lemma, this reduced graph retains properties of the original graph, and thus any changes made on the reduced graph are reflected in the original graph.

Thus, one possible way to cluster is a two-phase strategy: first, we use the modified regularity lemma to construct a reduced graph as above; in the second phase, we use a pairwise clustering method such as spectral clustering to cluster the reduced graph. The results obtained on the reduced graph are then projected back onto the original graph using Lemma 2.

Two Phase Strategy to use the Regularity Lemma
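The two-phase strategy can be sketched in Python. This is a toy illustration under strong assumptions: the partition and the per-pair regularity checks are taken as given, and connected components stand in for a real pairwise method such as spectral clustering:

```python
from itertools import combinations, product

def density(edges, A, B):
    e = sum(1 for u, v in product(A, B) if (u, v) in edges or (v, u) in edges)
    return e / (len(A) * len(B))

def reduced_graph(edges, classes, d):
    """Phase 1: one vertex per class; connect classes whose pair density
    exceeds d (eps-regularity of each pair is assumed verified separately)."""
    k = len(classes)
    r_edges = {(i, j) for i, j in combinations(range(k), 2)
               if density(edges, classes[i], classes[j]) > d}
    return k, r_edges

def components(k, r_edges):
    """Stand-in for phase 2: label reduced-graph vertices by connected
    component; a real pipeline would use e.g. spectral clustering here."""
    parent = list(range(k))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in r_edges:
        parent[find(i)] = find(j)
    return [find(i) for i in range(k)]

def two_phase_cluster(edges, classes, d):
    """Project reduced-graph labels back onto original vertices (in the
    spirit of Lemma 2)."""
    k, r_edges = reduced_graph(edges, classes, d)
    labels = components(k, r_edges)
    return {v: labels[i] for i, C in enumerate(classes) for v in C}

# Two groups of classes, densely connected within each group:
classes = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
edges = {(0, 2), (0, 3), (1, 2), (1, 3), (4, 6), (4, 7), (5, 6), (5, 7)}
labels = two_phase_cluster(edges, classes, d=0.5)
print(labels[0] == labels[3], labels[0] != labels[4])  # True True
```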

In fact such a methodology gives quite encouraging results compared to other standard clustering methods. These results will be reported here once the paper under consideration is published. :) Another thing to note is that the reduced graph is usually quite small compared to the original graph, and working on this smaller, reduced graph is much faster.

_________________________

Other Possible Changes:

1. The Alon et al. algorithm was used and modified to apply the regularity lemma to clustering. It would be interesting to explore the results given by the Frieze & Kannan method, as it finds the certificate in a different way.

2. There has been some work on the sparse regularity lemma. This would have a lot of practical value, since in the above we construct a dense graph. Using the sparse version would allow us to use nearest-neighbor graphs instead of dense graphs, reducing the computational burden significantly.

3. A recent paper by Fischer et al., "Approximate Hypergraph Partitioning and Applications", has received a lot of attention. In this paper they give a new approach to finding regular partitions: all previous methods find partitions of tower-type size, while this paper gives a method to find a smaller regular partition if one exists in the graph. Employing this methodology for refinement, instead of an approximate version of the algorithmic regularity lemma, could be a fruitful direction of work.

_________________________
