
Book Review: “Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World” by Leslie Valiant

Darwinian Evolution is a form of Computational Learning

The punchline of this book is perhaps: “Changing or increasing functionality of circuits in biological evolution is a form of computational learning.” Although the book also speaks of topics other than evolution, the underlying framework throughout is the Probably Approximately Correct model [1] from the theory of machine learning, from which the book gets its name.

“Probably Approximately Correct” by Leslie Valiant


I first heard of this explicit connection between machine learning and evolution in 2010 and have been quite fascinated by it since. It must be noted, however, that similar ideas have appeared in the past, though usually in the form of metaphor. It is another matter that this metaphor has generally been avoided, for reasons I underline towards the end of this review. When I first heard the idea it immediately made sense, and like all great ideas it looks completely obvious in hindsight. Ergo, I was quite excited to see this book and preordered it months in advance.

Before attempting to sketch a skiagram of the main content of the book: one of its main subthemes, constantly emphasized, is to look at computer science as an enabling tool for studying natural science. This is often ignored, perhaps because CS curricula are rarely designed with any natural science component, and hence aspiring computer scientists have little impetus to view the natural sciences through the computational lens. The relation of computer science with mathematics, on the other hand, has become quite well established, and as a technology the impact of computer science has been tremendous. All this is quite remarkable given that just about a century ago the notion of a computation was not even well defined. (Unrelated to the book: more recently people have started taking the idea of digital physics, i.e. physics from a solely computable/digital perspective, seriously. But in the other natural sciences the usage of computational ideas is still woeful.) Valiant draws upon the example of Alan Turing as a natural scientist, and not just a computer scientist, to make this point. Turing was deeply interested in natural phenomena (intelligence, the limits of mechanical calculation, pattern formation, etc.) and used tools from computer science to study them, a fact evident from his publications. That Turing was trying to understand natural phenomena was remarked upon in his obituary by Max Newman, who summarized the body of his work as concerning “the extent and the limitations of mechanistic explanations of nature”.

________________

The book begins with a delightful and quite famous quote by John von Neumann (through this paragraph I also discovered the context to the quote). This paragraph also adequately summarizes the motivation for the book very well:

“In 1947 John von Neumann, the famously gifted mathematician, was keynote speaker at the first annual meeting of the Association for Computing Machinery. In his address he said that future computers would get along with just a dozen instruction types, a number known to be adequate for expressing all of mathematics. He went on to say that one need not be surprised at this small number, since 1000 words were known to be adequate for most situations in real life, and mathematics was only a small part of life, and a very simple part at that. The audience responded with hilarity. This provoked von Neumann to respond: “If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is.”

Though counterintuitive, von Neumann’s quip contains an obvious truth. Einstein’s theory of general relativity is simple in the sense that one can write the essential content on one line as a single equation. Understanding its meaning, derivation, and consequences requires more extensive study and effort. However, this formal simplicity is striking and powerful. The power comes from the implied generality, that knowledge of one equation alone will allow one to make accurate predictions about a host of situations not even connected when the equation was first written down.

Most aspects of life are not so simple. If you want to succeed in a job interview, or in making an investment, or in choosing a life partner, you can be quite sure that there is no equation that will guarantee you success. In these endeavors it will not be possible to limit the pieces of knowledge that might be relevant to any one definable source. And even if you had all the relevant knowledge, there may be no surefire way of combining it to yield the best decision.

This book is predicated on taking this distinction seriously […]”

In a way, the aspects of life mentioned above appear theoryless, in the sense that there seems to be no mathematical or scientific theory for them like relativity: something quantitative, definitive and short. Note that they are generally not “theoryless” in the sense that no theory exists at all, since people obviously perform all these tasks in a complex world quite effectively. A specific example is how organisms adapt and species evolve without having a theory of the environment as such. How such coping mechanisms can come into being in the first place is the main question asked in the book.

Let’s stick to the specific example of biological evolution. It is clearly one of the central ideas in biology, and perhaps one of the most important theories in science to have changed the way we look at the world. But in spite of its importance, Valiant claims (correctly, in my opinion) that evolution is not understood well in a quantitative sense. The evidence that convinces us of its correctness is of the following sort: biologists show a sequence of evolving objects, or stages, where each later one is more complex than the previous. Since this is studied mostly via the fossil record, there is always a lookout for missing links between successive stages. (As a side note: Animal Eyes is an extremely fascinating and very readable book that deals with this question specifically for the evolution of the eye, which is particularly interesting given that, owing to the very nature of the eye, there can be no fossil record of it.) Darwin remarked that numerous successive paths are necessary: if it is not possible to trace a path from an earlier form to a more complicated form, then it is hard to explain how the latter came about. But, as Valiant stresses, this is not really an explanation of evolution. It is more of an “existence proof”, not a quantitative explanation: even if there is evidence for the existence of a path, one cannot say that a certain path is likely to be followed just because it exists. (As another side note: watch Carl Sagan’s video on evolution from the beautiful COSMOS.)

Related to this, one of the first questions one might ask, and which was indeed asked by Darwin himself: why has evolution taken as much time as it has? How much time would have sufficed for all the complexity we see around us to evolve? This question in fact bothered Darwin a lot in his day. In On the Origin of Species he originally gave an estimate of the Earth’s age of at least 300 million years, implying, indirectly, that there had been enough time. This estimate was immediately thrashed by Lord Kelvin, one of the pre-eminent British physicists of the time, who estimated the age of the Earth to be only about 24 million years, causing Darwin to withdraw the estimate from his book. Kelvin’s figure greatly worried Darwin, as he thought 24 million years just wasn’t enough time. To motivate the same theme Valiant writes the following:

“[…] William Paley, in a highly influential book, Natural Theology (1802), argued that life, as complex as it is, could not have come into being without the help of a designer. Numerous lines of evidence have become available in the two centuries since, through genetics and the fossil record, that persuade professional biologists that existing life forms on Earth are indeed related and have indeed evolved. This evidence contradicts Paley’s conclusion, but it does not directly address his argument. A convincing direct counterargument to Paley’s would need a specific evolution mechanism to be demonstrated capable of giving rise to the quantity and quality of the complexity now found in biology, within the time and resources believed to have been available. […]”

A specific, albeit more limited, version of this question might be the following. Consider the human genome, which has about 3 billion base pairs. If evolution is a search problem, as it naturally appears to be, why did the evolution of this genome not take exponential time? Had it required exponential time, such evolution could not have happened on any time scale since the origin of the Earth. A more pointed question to ask, then, is: what circuits could evolve in sub-exponential time (and, relatedly, what circuits are evolvable only in exponential time)? Given the context, thinking in terms of circuits might seem a little foreign, but on some more thought it is clear that the idea of a circuit is natural for modeling such systems (at least in principle). For example, one might think of the genome as a circuit, just as one might think of complex protein interaction networks, or networks of neurons in the brain, as circuits that update themselves in response to the environment.
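To get a feel for the numbers, here is a back-of-the-envelope sketch (the one-candidate-per-second rate is a made-up, absurdly generous assumption, not a biological estimate):

```python
# If evolution were a blind exhaustive search over all genomes of length n
# (4 DNA bases per position), the search space would have 4**n candidates.
def search_space(n):
    return 4 ** n

# An absurdly generous budget: one candidate tried every second for the
# ~4.5 billion years the Earth has existed (assumed figures).
budget = 4_500_000_000 * 365 * 24 * 3600   # roughly 1.4e17 trials

# Even a genome of length 29 already exhausts this budget, never mind
# one with 3 billion base pairs:
assert search_space(28) < budget < search_space(29)
```

So any explanation of evolution as search must point to a mechanism far better than exhaustive enumeration; this is precisely the kind of quantitative constraint the book is after.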

The last line contains the key idea of adaptation: entities interact with the environment and update themselves in response, hopefully to cope better. The catch is that the world/environment is far too complex for (relatively) simple entities, with limited computational power, to have a theory of. Hence the entity must somehow cope, and improve its functionality, without really “understanding” the environment, which it can only partially model. The thing to pay attention to here is the interaction, or the interface, between the limited computational entity and the environment. The central idea in Valiant’s thesis is to model this interaction as a form of computational learning: the entity absorbs information about the world and “learns” so that in the future it “can do better”.

A lot of biology can be thought of in this way: complex behaviours in environments, where by complex we mean that in different circumstances the entity might behave completely differently, complicated in the sense that there cannot possibly be a lookup table modeling such interactions. These interaction-responses can be thought of as complicated functions. For example, how a deer reacts can be seen as a function of some sensory inputs. Or consider protein circuits: the human body has about 25,000 proteins, and how much of a certain protein is expressed is a function of the quantities of other proteins in the body. This function is quite complicated, certainly nothing like a simple linear regression. Thus there are two main questions to ask. One: how did these circuits come into being without a designer to actually design them? Two: how do such circuits maintain themselves? Each “node” in a protein circuit is a function, and as circumstances change it might be best to change the way it works. How could such a thing have evolved?

Given the above, one might ask another question: at what rate can functionality increase or adapt under the Darwinian process? Valiant (like many others, such as Chaitin; see this older blog post for a short discussion) comes to the conclusion that Darwinian evolution is a very elegant computational process. And since it is one, there should be a quantitative way of saying how fast a changing environment can be kept up with, and which environments are unevolvable for the entity. It is not hard to see that this is essentially a question in computer science; no other discipline has the tools to deal with it.

Insofar as (biological) interaction-responses can be thought of as complicated functions, and the limited computational entity trying to cope has to do better in the future, this is just machine learning! The idea that changing or increasing the functionality of circuits in biological evolution is a form of computational learning is perhaps very obvious in hindsight. Changing functionality is done in machine learning in the following sense: we want to acquire complicated functions, without explicitly programming for them, from examples and labels (or “correct” answers). This addresses exactly the question of how complex mechanisms can evolve without someone designing them (consider a simple perceptron learning algorithm as a toy example). In short: we generate a hypothesis and, if it doesn’t match our expectations in performance, we update the hypothesis by a computational procedure; the hypothesis might change on the basis of a single example. One could draw an analogy to evolution, where “examples” are experiences and the genome is the circuit modified over a period of time. But note that this is not how evolution works, because the above (especially the perceptron example) sounds more Lamarckian. The Darwinian process does not change the genome directly based on experiences. Instead, many copies of the genome are made and then tested in the environment, with the better ones having higher fitness. Supervised machine learning as drawn above is very Lamarckian, not Darwinian.
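To make the perceptron aside concrete, here is a minimal sketch of the perceptron learning algorithm; the data and the OR target are invented for illustration. Note how the hypothesis is corrected example by example, which is exactly the Lamarckian flavour noted above.

```python
def perceptron_train(examples, epochs=100):
    """Learn a linear threshold function from (features, label) pairs,
    with labels in {-1, +1}: the hypothesis is updated only on mistakes."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                        # hypothesis wrong here:
                w = [wi + y * xi for wi, xi in zip(w, x)]  # update on this single
                b += y                                     # example alone
                mistakes += 1
        if mistakes == 0:          # consistent with the whole training set
            break
    return w, b

# Toy, linearly separable target: "x1 OR x2"
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = perceptron_train(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
print([predict(x) for x, _ in data])  # [-1, 1, 1, 1]
```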

Thus there is something unsatisfying in the analogy to supervised learning, which has a clear notion of a target. One might then ask: what is the target of evolution? Evolution is usually thought of as an undirected process, without a goal. This is true in a very important sense, yet also incorrect in another. Evolution does not have a goal in the sense of wanting to evolve humans or elephants. But it certainly does have a target: “good behaviour” (where behaviour is used very loosely) that aids the survival of the entity in an environment. This target is the “ideal” function (which might be quite complicated) that specifies how to behave in which situation. It is already incorporated in the study of evolution through notions such as fitness, which encode that various functions are more or less beneficial. Evolution can thus be framed as a kind of machine learning problem based on the notion of Darwinian feedback: make many copies of the current genome and let the ones that perform well in the real world win. More specifically, this is a limited kind of PAC learning. If you call your current hypothesis your genome, then your genome does not depend directly on your experiences; variants of the genome are generated by a polynomial-time randomized Turing machine. To illuminate the difference from supervised learning, recall the earlier point that PAC learning is essentially Lamarckian: there is a hidden function we want to learn, and we have examples and labels corresponding to it, which can be considered “queries” to this function. We process these example/label pairs and learn a good estimate of the hidden function in polynomial time.

Evolvability is slightly different. Again there is a hidden “ideal” function f(x), and the candidate hypotheses are genomes. However, access to the target is very limited: one can only observe, empirically, the aggregate goodness of a genome (a performance function). The task is then to mutate the genome so that its functionality improves. So the main difference from the usual supervised learning is that the hidden function can be queried only in a very limited way: we cannot act on a single example and must take aggregate statistics into account.
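As a toy illustration of Darwinian feedback (my own simplified sketch, not Valiant's formal model; the ideal function, mutation scheme and parameters are all invented), the loop below mutates a genome, observes only aggregate performance, and keeps the best-performing copy:

```python
import random

random.seed(0)
N = 10
ideal = lambda x: int(x[0] or x[3] or x[7])     # hidden "ideal" function

def performance(hyp, trials=1500):
    """Aggregate goodness only: empirical agreement with the ideal
    function over random environments. No per-example feedback."""
    score = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(N)]
        score += 1 if hyp(x) == ideal(x) else -1
    return score / trials

def as_function(genome):
    """A 'genome' here is a set of variable indices, read as the
    monotone disjunction of those variables."""
    return lambda x: int(any(x[i] for i in genome))

genome = frozenset()                             # start from the empty circuit
for _ in range(12):
    # Mutations: a polynomial-size neighbourhood of one-variable changes
    variants = [genome | {i} for i in range(N)] + [genome - {i} for i in genome]
    # Darwinian selection: keep whichever copy performs best in aggregate
    genome = max([genome] + variants, key=lambda g: performance(as_function(g)))

print(sorted(genome))  # with high probability recovers [0, 3, 7]
```

Nothing in the loop ever sees a labeled example; selection acts only on the aggregate statistic, which is the restriction that separates evolvability from ordinary PAC learning.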

One might then ask: what can one prove in this model? Valiant demonstrates some examples. Parity functions are not evolvable over the uniform distribution, while monotone disjunctions are evolvable. These functions are of course very non-biological, but they illustrate the main idea: the ideal function together with the real-world distribution imposes a fitness landscape, and in some settings we can show that certain functions can evolve while others cannot. This in turn illustrates that evolution is a computational process independent of substrate.
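The intuition behind the parity result can be seen in a small exhaustive computation (a sketch of the standard statistical-query intuition, not the book's proof): under the uniform distribution, a parity function is uncorrelated with every single variable, so aggregate statistics of this kind give selection nothing to climb, whereas a monotone disjunction is visibly correlated with its relevant variables.

```python
from itertools import product

def correlation(f, g, n):
    """Exact E[f(x) * g(x)] for ±1-valued functions under the uniform
    distribution on {0,1}^n, computed by full enumeration."""
    return sum(f(x) * g(x) for x in product((0, 1), repeat=n)) / 2 ** n

n = 6
parity = lambda x: 1 - 2 * (sum(x) % 2)              # parity of all n bits
disjunction = lambda x: 1 if (x[0] or x[2]) else -1  # monotone disjunction

# Parity looks like pure noise to every single-variable statistic...
print([correlation(parity, lambda x, i=i: 2 * x[i] - 1, n) for i in range(n)])
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# ...while the disjunction's relevant variables stand out:
print(correlation(disjunction, lambda x: 2 * x[0] - 1, n))  # 0.5
```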

In the same sense it can be shown that Boolean conjunctions are not evolvable over all distributions. There has also been similar work of late on real-valued functions, which is not reported in detail in the book. Another recent line of work, mentioned only in passing towards the end, studies multidimensional spaces of actions, that is, sets of functions (not just one function) that might be evolvable together. This is an interesting direction, as biological evolution clearly deals with the evolution of sets of functions together, not functions in isolation.

________________

Overall I think the book is quite good, although I would rate it 3/5. Let me explain. Clearly the book is aimed at the non-expert, but this might disappoint those who bought it because this recent area of work, studying evolution through the lens of computational learning, is exciting and intellectually interesting. The book is also aimed at biologists, and with them in mind the learning sections are quite dumbed down. Yet I think the book might fail to impress most of them anyway, because biologists (barring a small subset) generally have a very different way of thinking from mathematicians or computer scientists, especially when it comes to the computational lens. Over the past couple of years I have had arguments about the main ideas in the book with biologist friends who take the usage of “learning” to imply that evolution is a directed process. It would have been great if the book had spent more time on this particular aspect. The book also explicitly states that it is about quantitative aspects of evolution and has nothing to do with speciation, population dynamics, and other rich areas of study; even so, I have already seen the book criticized by biologists on this premise.

As far as I am concerned, as an admirer of Prof. Leslie Valiant’s range and quality of contributions, I would have preferred the book to go into more depth: a semi-formal monograph on the study of evolution using the tools of PAC learning, right from the person who initiated this area of study. But that is just a selfish consideration.

________________

References:

[1] A Theory of the Learnable, Leslie Valiant, Communications of the ACM, 1984 (PDF).

[2] Evolvability, Leslie Valiant, Journal of the ACM, 2007 (PDF).

________________

“Darwinian Evolution is a form of PAC (Machine) Learning”

Changing or increasing functionality of circuits in biological evolution is a form of computational learning. – Leslie Valiant

The title of this post comes from Prof. Leslie Valiant‘s ACM A.M. Turing Award lecture, titled “The Extent and Limitations of Mechanistic Explanations of Nature”.

Prof. Leslie G. Valiant


[Image Source: CACM “Beauty and Elegance”]

Short blurb: Though the lecture came out sometime in June-July 2011, and I have shared it (and a paper it quotes) on every online social network I have a presence on, I have no idea why I never blogged about it.

The fact that I have zero training in (and epsilon knowledge of) biology has not stopped me from being completely fascinated by the contents of the talk and the few papers he cites in it. I have watched the lecture a few times and have also started to read and understand some of the papers he mentions. In fact, the talk has inspired me to learn more about PAC learning than the usual machine learning graduate course might cover. Knowing more about it is now my “full time side-project”, and a very exciting side-project to say the least!

_________________________

It is widely accepted that Darwinian evolution has been the driving force behind the immense complexity observed in life. In this beautiful 10-minute video, Carl Sagan sums up the timeline and the progression:

There is, however, one problem: while evolution is considered the driving force behind such complexity, there is no satisfactory explanation of how 13.75 billion years of it could have been enough. Many have complained that this reduces evolution to little more than an intuitive explanation. Can we understand the underlying mechanism of evolution (one that can in turn give reasonable time bounds)? Valiant makes the case that this underlying mechanism is computational learning.

There have been a number of computational models based on the general intuitive idea of Darwinian evolution, including genetic algorithms and genetic programming. However, Valiant, among others, finds such methods useful in an engineering sense but unsatisfying with respect to the question above.

In the talk Valiant mentions that this question was asked in Darwin’s day as well, and that Darwin proposed a bound of 300 million years for such evolution to have occurred. This immediately ran into a problem, as Lord Kelvin, one of the leading physicists of the time, put the age of the Earth at 24 million years. Obviously, evolution could not have been going on for more than 24 million years if Kelvin’s estimate was right. The estimate of the age of the Earth is now much higher. ;-)

The question can be rephrased as: how much time is enough? Can biological circuits evolve in sub-exponential time?

For more I would point to his paper:

Evolvability: Leslie Valiant (Journal of the ACM – PDF)

Towards the end of the talk he shows a Venn diagram of the type usually seen in complexity theory textbooks for classes such as P, NP and BQP, but with one major difference: these containments are proven facts, not conjectures:

$\text{Fact:}\quad \text{Evolvable} \subseteq \text{SQ Learnable} \subseteq \text{PAC Learnable}$

*SQ or Statistical Query Learning is due to Michael Kearns (1993)

Coda: Valiant claims that the problem of evolution is no more mysterious than the problem of learning. The mechanism that underlies biological evolution is “evolvable target pursuit”, which in turn is the same as “learnable target pursuit”.

_________________________


Union Bound, Hoeffding Inequality and Some Bounds in Learning Theory – Part I

Once again, this post is adapted from parts of my notes (modified suitably to be blog posts) for a discussion session!

This post would be the first in a series of four posts. The objective of each post would be as follows:

1. This post would introduce learning theory and the bias-variance trade-off, and sum up the need for learning theory.

2. The next would discuss two simple lemmas, the union bound and the Hoeffding inequality, and then use them to reach some very deep results in learning theory. It would also introduce and discuss empirical risk minimization.

3. Continuing from that discussion, the third would derive results on uniform convergence and tie the discussions together into a theorem. This theorem would make formal the bias-variance trade-off discussed in the first post.

4. The last would talk about the VC dimension and the VC bound.

Basically all the results are derived using two very simple lemmas, hence the name of these posts.
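As a tiny preview of the second lemma (with made-up parameters): the Hoeffding inequality bounds the probability that the empirical mean of $m$ i.i.d. Bernoulli draws deviates from the true mean by more than $\gamma$, namely $P(|\hat{\phi} - \phi| > \gamma) \le 2\exp(-2\gamma^2 m)$. A quick simulation shows the bound holding comfortably for coin flips:

```python
import math
import random

random.seed(0)

def hoeffding_bound(gamma, m):
    """Hoeffding: P(|empirical mean - true mean| > gamma) <= this value."""
    return 2 * math.exp(-2 * gamma ** 2 * m)

# Estimate the deviation probability empirically for a fair coin:
p, m, gamma, trials = 0.5, 100, 0.1, 2000
deviations = sum(
    abs(sum(random.random() < p for _ in range(m)) / m - p) > gamma
    for _ in range(trials)
)
print(deviations / trials <= hoeffding_bound(gamma, m))  # True: bound holds
```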

______

Introduction:

Learning theory gives a researcher applying machine learning algorithms some rules of thumb that tell how best to apply the algorithms that he/she has learnt.

Dr Andrew Ng likens knowing machine learning algorithms to a carpenter acquiring a set of tools. The difference between a good carpenter and a not-so-good one, however, is the skill in using those tools: in choosing which one to use, and how. In the same way, learning theory gives a “machine-learnist” some crude intuitions about how an ML algorithm would work and helps in applying it better.

A lot of people still think of learning theory as a method for getting papers published (I’d like to use that method, I need papers ;-)), as it is considered abstruse by many and not of much practical value. A good refutation of this tendency can be seen here, on John Langford’s fantastic web-log.

______

As put in a popular tutorial by Olivier Bousquet, the process of inductive learning can be summarized as:

1. Observe a phenomenon.

2. Construct a model of that phenomenon.

3. Make predictions using this model.

Dr Bousquet puts it very tersely: the above process can actually be said to be the aim of ALL natural sciences. Machine learning aims to automate the process, and learning theory tries to formalize it. I think this gives a reasonable idea of what learning theory deals with.

Learning theory formalizes terms like generalization, over-fitting and under-fitting. This series of posts (read notes) aims to introduce these terms and then jump to a recap of some important error bounds in learning theory.

______

Training Error, Generalization Error and The Bias-Variance Tradeoff:

For simplicity let’s take something as simple as linear regression. And since I want this piece to be accessible, I assume no knowledge of linear regression either.

Linear Regression essentially models the relationship between one variable $X$ and another variable $Y$ such that the model itself depends linearly on the unknown parameters to be estimated from the data. Let’s have a look at what this means:

Suppose you have a habit of collecting weird datasets, and you end up with one that records the bicep circumference of many men and the distance each of them throws a javelin. You want to predict, for an unknown individual, how far he can throw the javelin given the circumference of his biceps.

Of course, a number of factors would affect the distance a javelin goes, such as skill (which is essentially non-quantitative?), height, the kind of footwear worn, run-up distance, state of health, etc. These would be some of the many features affecting the end result (the distance the javelin is thrown). What I essentially mean is that the circumference of the biceps alone isn’t a realistic feature for predicting how far a javelin can be thrown. But let’s assume that there is only one feature and it can make reasonable predictions. This over-simplification is made only so that the process can be visualized in a graph.

Suppose you collect about 80 such examples (which you call the training examples) and plot your data as such:

Now the problem given to you is: Given you have the bicep-circumference measurement of an unknown individual, predict how far he can throw the javelin.

How would one do it?

What we would do is to fit in some curve in the above training set (the above plot). And when we have to make a prediction we simply plug in that value in our curve and find the corresponding value for the distance. Something illustrated below.

The curve can be represented in a number of ways. However, if the curve were represented linearly (which is why the method is called linear regression), it could be written as:

$h(x) = \theta_0 + \theta_1 x$

Where $h(x)$ is the hypothesis, $\theta_0$ and $\theta_1$ are unknown parameters to be learnt from the data, and $x$ is the input feature. Note that this is just the slope-intercept form of a line.

In the above, for simplicity I considered only one feature, there could be many more. In the more general case:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dotsb + \theta_n x_n \qquad (1)$

The $\theta$‘s are called the parameters (to be learnt from the data) that will decide the nature of the curve.

We see that the equation involves the features of the training examples (the $x$‘s); using these, the task of the learning algorithm is to decide the optimal values of the $\theta_i$ from the training set. This can easily be done by something like gradient descent.

For any new example we have its features $x$, and the parameters are already known from running gradient descent on the training set. We simply plug the values of $x$ into equation $(1)$ to get a prediction.

To sum up: as mentioned, we use the training set to fit an optimal curve, and then predict on unseen inputs by simply plugging their values into the equation of the curve.
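The whole pipeline can be sketched in a few lines (the dataset is invented, and the learning rate and step count are arbitrary choices): fit $h(x) = \theta_0 + \theta_1 x$ by batch gradient descent on the squared error, then predict by plugging a new $x$ into the learnt curve.

```python
def fit_linear(xs, ys, lr=0.01, steps=10000):
    """Batch gradient descent on mean squared error for h(x) = t0 + t1*x."""
    t0, t1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        t0 -= lr * sum(errs) / m                              # gradient w.r.t. theta_0
        t1 -= lr * sum(e * x for e, x in zip(errs, xs)) / m   # gradient w.r.t. theta_1
    return t0, t1

# Invented training set lying on y = 2x + 1:
xs, ys = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
t0, t1 = fit_linear(xs, ys)
print(round(t0, 2), round(t1, 2))   # 1.0 2.0

predict = lambda x: t0 + t1 * x     # plug a new x into the learnt curve
print(round(predict(5.0), 2))       # 11.0
```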

Now, it goes without saying that we could fit a “simple” model to the training set or a more “complex” one. A simple model would be linear, say something like:

$y=\theta_0 + \theta_1x$

and a complex model could be something like this:

$y=\theta_0 + \theta_1 x + \dotsb + \theta_5 x^5$.

Note that in the above, the same feature $x$ is used in different ways: the second model uses $x$ to create more features, such as $x^2, x^3$ and so on. Clearly the second representation is more complex than the first, as it can exploit more patterns in the data (it has more parameters).
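In code, this reuse of a single feature is just an expansion step (a hypothetical helper, named here purely for illustration):

```python
def poly_features(x, degree):
    """Turn one raw feature x into [x, x**2, ..., x**degree]; the model
    stays linear in the parameters, only the features change."""
    return [x ** d for d in range(1, degree + 1)]

print(poly_features(2.0, 5))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```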

However, this increase in complexity can lead to problems, just as a model that is too simple can. This is illustrated below:

[Fig 1 (Left) and Fig 2 (Right)]

The figure on the left has a “simple” model fit to the training set. Clearly there are patterns in the data that this model will never take into account, no matter how big the training set grows. In more concrete terms, the relationship between $x$ and $y$ is not linear, so if we try to fit a linear model to it, no matter how much we train it there will always be patterns in the data that the model fails to capture.

This means that what is learnt from the training set will not generalize well to unknown examples: an unknown example may come from a part of the distribution that the model fails to account for, and the prediction for it will then be very inaccurate.

The figure on the right has a “complex” model fit to the same set; clearly the model fits the data very well. But it is not a good predictor either, as it does not represent the general shape of the data but rather its idiosyncrasies. This model makes very good predictions on the training set itself, but it will not generalize well to unknown examples.

A more appropriate fit would be something like this :

Now we can move to a definition: the generalization error of a hypothesis is its expected error on examples that are not from the training set. For an example to aid understanding of generalization, refer to the part labeled “Van Gogh, Chagall and Pigeons” in this post.

The models shown in figures 1 and 2 both have HIGH generalization error, yet each suffers from an entirely different problem.

______

Bias: As already mentioned, in the model shown in fig. 1, no matter how much the model is trained there will always be patterns in the data that it fails to capture. This is because the model has high BIAS. The bias of a model is its expected generalization error even if it were fit to a very large training set.

Thus the linear model shown in figure 1 suffers from high bias and will underfit the data.

Variance : Apart from bias there is another component that has a bearing on the generalization error. That is the variance of the model fit into the training set.

This is shown in fig. 2. Even though the model fits the training set very well, there is the risk that we are fitting patterns idiosyncratic to the training examples, patterns that may not represent the general relationship between $x$ and $y$.

Since we might be fitting spurious patterns and exaggerating minor fluctuations in the data, such a model will still have a high generalization error and will over-fit the data. In such a case we say that the model has high variance.

The Trade-off: When deciding on a model to fit to the training set, there is a trade-off between bias and variance. If either is high, the generalizing ability of the model will be low (the generalization error will be high). In other words, if the model is too simple, i.e. has too few parameters, it will have high bias; if it is too complex, it will have high variance. In deciding on a model we have to strike a balance between the two.

A very famous example that illustrates this trade-off goes like this:

[Suppose there is an exacting biologist who studies and classifies green trees in detail. He would be an example of an over-trained or over-fit model: on seeing a tree with non-green leaves like the one above, he would declare that it is not a tree at all.]

[An under-trained or under-fit model would be like the above biologist’s lazy brother who, on seeing a cucumber, which is green, declares that it is a tree.]

Both of the above have poor generalization. We wish to select a model that has an appropriate trade-off between the two.
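The trade-off can be watched directly in a small synthetic experiment: the "complex" model below is a polynomial that interpolates every training point (a stand-in for fig. 2), the "simple" model is the true linear trend (a stand-in for a well-chosen low-variance fit), and the noise values are made-up numbers. The interpolant achieves zero training error yet loses badly on held-out points.

```python
def lagrange(points):
    """Interpolating polynomial through every training point: the
    high-variance model, which memorizes the noise."""
    def h(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return h

def mse(h, data):
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

# True relationship y = 2x, plus fixed "noise" (made-up numbers):
train_noise = [0.3, -0.4, 0.1, 0.5, -0.2, 0.4, -0.5, 0.2]
test_noise = [-0.2, 0.3, -0.1, 0.4, 0.1, -0.3, 0.5, -0.4]
train = [(float(x), 2 * x + e) for x, e in enumerate(train_noise)]
test = [(x + 0.5, 2 * (x + 0.5) + e) for x, e in enumerate(test_noise)]

complex_h = lagrange(train)   # fits every training point exactly
simple_h = lambda x: 2 * x    # the underlying trend, for contrast

print(mse(complex_h, train) < 1e-6)                # True: zero training error
print(mse(complex_h, test) > mse(simple_h, test))  # True: worse generalization
```

The interpolant here plays the role of the exacting biologist: perfect on everything it has seen, and wildly wrong just half a step away.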

______

So why do we need Learning Theory?

Learning theory is an interesting subject in its own right. However, it also hones our intuitions about how to apply learning algorithms properly, giving us a set of rules of thumb that guide us in applying learning algorithms well.

Learning theory can answer quite a few questions:

1. The previous section contained a short discussion of bias and variance and the trade-off between the two. The discussion sounds logical, but it means little unless it is formalized. Learning theory can formalize the bias-variance trade-off, which helps us choose a model with just the right bias and variance.

2. Learning Theory leads to model selection methods by which we can choose automatically what model would be appropriate for a certain training set.

3. In machine learning, models are fit on the training set, so what we essentially measure is the training error. But what we really care about is the generalization ability of the model, that is, its ability to give good predictions on unseen data.

Learning theory relates the training error and the generalization error, and tells us how doing well on the training set can help us achieve better generalization.

4. Learning theory proves conditions under which learning algorithms will work well. It proves bounds on the worst-case performance of models, giving us an idea of when an algorithm will work properly and when it won’t.

The next post would answer some of the above questions.

______
