Feeds:
Posts

## (Very) Informal thoughts on the No-Free-Lunch Theorem

The No-Free-Lunch theorems (Wolpert and Macready) are one of the mainstays in Machine Learning and optimization. Crudely, the NFL says that “for every learner there exists a distribution in which it fails”, giving a precise handle to reason about why a given learning algorithm might fail on certain tasks (or, “specialization” of a learning algorithm might just be accidental). It gives a good justification why it is useful to incorporate prior informaion about the problem at hand to perform better. Oftentimes the NFL is used to argue that a general purpose learning algorithm is theoretically impossible. This is incorrect. Indeed, universal learning algorithms exist and there is a (possibly very small) free lunch, at least for the case of “interesting problems” (thanks to Vick for pointing out these papers).

The view “for every learner there exists a distribution in which it fails” is very useful. However I had recently been thinking that is particularly instructive to think of No Free Lunch as a kind of diagnolization argument (this is just a different way of thinking about the prior statement). I am not completely sure if this works but it sounds plausible and makes me see NFL in a different light. It might also be that there is some subtle absrudity or flaw in this kind of thinking. However, I think the NFL theorem can be seen as a simple corollary of the following incompleteness theorem in Kolmogorov Complexity.

Theorem (Chaitin): There exists a constant L (which only depends on the particular axiomatic system and the choice of description language) such that there does not exist a string s for which the statement the “The Kolmogorov Complexity of s is greater than L” (as formalized in S) can be proven within the axiomatic system S.

Stated informally: “There’s a number L such that we can’t prove the Kolmogorov complexity of any specific string of bits is more than L”. In particular, you can think of any learning machine as a compression scheme (since compression implies learning, we can think of any learning algorithm as a way to compress the data) of some fixed Kolmogorov Complexity (consider the most optimal program for that machine, the programming language is irrelevant as long as it is Turing Complete. Consider the description length of it in bits. That is its Kolmogorov Complexity). Now for any such learning machine (or class of learning machines w.l.o.g) you can construct a data string which has Kolmogorov Complexity greater than what this learning machine has. i.e. this data string is random from the point of view of this learning machine as it is incompressible for it. In short IF the data string has some Kolmogorov Complexity K and your learning machine has Kolmogorov Complexity k and k < K then you will never be able to find any structure in your data string i.e. can never show that it has a KC greater than k. Thus the data string given to our learner would appear to be structureless. Thus no free lunch is just a simple diagonalization argument in the spirit of incompleteness results. One can always construct a string that is Kolmogorov Random or structureless w.r.t our encoding scheme. Ofcourse the Kolmogorov Complexity is uncomputable. However I think the above argument works even when one is talking of upper bounds on the actual Kolmogorov Complexity, but I am not very sure about it or if there are any problems with it.

Also, if one assumes that that there is no reason to believe that structureless functions exist (if we assume that the data is generated by a computational process then there is probably no reason to believe that it is structureless i.e. the problem is “interesting”). And if this is the case, then you would always have a machine in your library of learning algorithms that will be able to find some upper bound on the Kolmogorov Complexity of the data string (much less than the machine itself). This search might be done using Levin’s universal search, for example. I think this line of reasoning leads naturally, to thinking why NFL doesn’t say anything about whether universal learning algorithms can exist.

PS: This thought train about NFL as diagonalization was inspired by a post by David McAllester on what he calls the “Free Lunch Theorem” over here. Some of this appeared as a comment there. However, now I have been inspired enough to study some of the papers concerned including his. (Unrelated to all this. I particularly found likening Chomsky’s Universal Grammar as simply an invocation of NFL on information theoretic grounds in his post very enlightening).