Feeds:
Posts

## Where does the Sigmoid in Logistic Regression come from?

Note: The title of this post is circular. But I use/abuse it because of the post linked below.

I noticed on the Hacker News front page (and via multiple reshares on twitter), a discussion on why logistic regression uses a sigmoid. The article linked in the story talks about the log-odds ratio, and how it leads to the sigmoid (and gives a good intuitive plug on it).

However, I think that the more important question is – Why do you care about log-odds? Why do you use log-odds and not something else? The point of this quick post is to write out why using the log-odds is infact very well motivated in the first place, and once it is modeled by a linear function, what you get is the logistic function.

Beginning with log-odds would infact be begging the question, so let us try to understand.

____________________________

To motivate and in order to define the loss etc, suppose we had a linear classifier: $\hat{y} = h(\mathbf{x}) = sign(w_0 + \mathbf{w}^T\mathbf{x})$. This just means that for a vector input, we take the dot product with the parameters $\mathbf{w}$ and take the sign.

The learning problem would be to figure out a good direction $\mathbf{w}$ and a good location of the decision boundary $w_0$.

____________________________

We want to figure these out so as to minimize the expected 0-1 loss (or expected number of mistakes) for the classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$. The 0-1 loss for a datapoint-label pair $\mathbf{x},y$ is simply:

$\displaystyle L(h(\mathbf{x}),y) = \begin{cases} 0, & \text{if } h(\mathbf{x}) = y \\ 1, & \text{if } h(\mathbf{x}) \neq y \end{cases}$

Now, the next question we would like to ask. What is the risk of this classifier that we want to minimize? The risk is the expected loss. That is, if we draw a random sample from the (unknown) distribution $p(\mathbf{x},y)$, what would be the expected error? More concretely:

$R(h) = \mathbb{E}_{\mathbf{x},y} [L(h(\mathbf{x}),y)]$

Writing out the expectation:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(\mathbf{x}, y=c) d\mathbf{x}$

Using the chain rule this becomes:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x}) p(\mathbf{x})d\mathbf{x}$

It is important to understand this expression. This is not assuming anything about the data. However, it is this expression that we want to minimize if we want to get a good classifier. To minimize this expression, it suffices to simply minimize for the conditional risk for any point $\mathbf{x}$ (i.e. the middle part of the above expression):

$\displaystyle R(h|\mathbf{x}) =\sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x})$

But this conditional risk can be written as:

$\displaystyle R(h|\mathbf{x}) =0 \times p(y=h(\mathbf{x})|\mathbf{x}) + 1 \times \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x})$

Note that, $\displaystyle \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Therefore, the conditional risk is simply:

$\displaystyle R(h|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Now, it is this conditional risk that we want to minimize given a point $\mathbf{x}$. And in order to do so, looking at the expression above, the classifier must make the following decision:

$\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$

It is again important to note that so far we have made absolutely no assumptions about the data. So the above classifier is the best classifier that we can have in terms of generalization, in the sense of what might be the expected loss on a new sample point. Such a classifier is called the Bayes Classifier or sometimes called the Plug-in classifier.

But the optimal decision rule mentioned above i.e. $\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$ is equivalent to saying that:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 1 \text{ }\forall c$

by taking log, this would be:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \log \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 0 \text{ }\forall c$

If, we were only dealing with binary classification, this would imply:

$\displaystyle h(\mathbf{x}) = 1 \iff \log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} \geq 0$

Notice that by making no assumptions about the data, simply by writing out the conditional risk, the log-odds ratio has fallen out directly. This is not an accident, because the optimal bayes classifier has this form for binary classification. But the question still remains, how do we model this log-odds ratio? The simplest option is to consider a linear model (there is no reason to stick to a linear model, but due to some reasons, one being convexity, we stick to a linear model):

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} = w_0 + \mathbf{w}^T \mathbf{x} = 0$

Now, we know that $p(y=1|\mathbf{x}) = 1 - p(y=0|\mathbf{x})$, plugging this in the above, and exponentiating, we have:

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{1 - p(y=0|\mathbf{x})} = \exp(w_0 + \mathbf{w}^T \mathbf{x}) = 1$

Rearranging, yields the familiar logistic model (and the sigmoid):

$\displaystyle p(y = 1|\mathbf{x}) = \frac{1}{1+ \exp(- w_0 - \mathbf{w}^T\mathbf{x})} = \frac{1}{2}$.

As noted in the post linked in the beginning, the logistic model, $\sigma(x) = \frac{1}{1+ e^{-x}}$, which for any $x$ is, $0 \leq \sigma(x) \leq 1$, and is monotonic $\sigma(-\inf) = 0, \sigma(+\inf) = 1$.

____________________________

This derivation shows that the log-odds is not an arbitrary choice, infact a very natural choice. The sigmoid is simply a consequence of modeling the log-odds with a linear function (infact logistic regression is arguably the simplest example of a log-linear model, if we had structured outputs, a natural extension of such a model would be the Conditional Random Field. The choice of using a linear function is simply to make the optimization convex, amongst some other favourable properties).

____________________________

Note: This post was inspired by some very succinct notes by Gregory Shakhnarovich from his Machine Learning class, that I both took and served as a TA for.

## Conditional Random Fields: A Beginner’s Survey

One interesting project that I am involved in these days involves certain problems in Intelligent Tutors. It turns out that perhaps one of the best ways to tackle them is by using Conditional Random Fields (CRFs). Many attempts to solving these problems still involve Hidden Markov Models (HMMs). Since I have never really been a Graphical Models guy (though I am always fascinated) so I found the going on studying CRFs quite difficult. Now that the survey is more or less over, here are my suggestions for beginners to go about learning them.

Tutorials and Theory

1. Log-Linear Models and Conditional Random Fields (Tutorial by Charles Elkan)

Log-linear Models and Conditional Random Fields
Charles Elkan

6 videos: Click on Image above to view

Two directions of approaching CRFs are especially useful to get a good perspective on their use. One of these is considering CRFs as an alternate to Hidden Markov Models (HMMs) while another is to think of CRFs building over Logistic Regression.

This tutorial makes an approach from the second direction and is easily one of the most basic around. Most people interested in CRFs would ofcourse be familiar with ideas of maximum likelihood, logistic regression etc. This tutorial does a good job, starting with the absolute basics – talking about logistic regression (for a two class problem) to a more general multi-label machine learning problem with a structured output (outputs having a structure). I tried reading a few tutorials before this one, but found this to be the most comprehensive and the best place to start. It however seems that there is one lecture missing in this series which (going by the notes) covered more training algorithms.

2. Survey Papers on Relational Learning

These are not really tutorials on CRFs, but talk of sequential learning in general. For beginners, these surveys are useful to clarify the range of problems in which CRFs might be useful while also discussing other methods for the same briefly. I would recommend these two tutorials to help put CRFs in perspective in the broader machine learning sub-area of Relational Learning.

— Machine Learning for Sequential Learning: A Survey (Thomas Dietterich)

PDF

This is a very broad survey that talks of sequential learning, defines the problem and some of the most used methods.

— An Introduction to Structured Discriminative Learning (R Memisevic)

PS

This tutorial is like the above, however focuses more on comparing CRFs with large margin methods such as SVM. Giving yet another interesting perspective in placing CRFs.

3. Comprehensive CRF Tutorial (Andrew McCallum and Charles Sutton)

PDF

This tutorial is the most compendious tutorial available for CRF. While it claims to start from the bare bone basics, I found it hard for a start and took it on third (after the above two). It is potentially the starting and ending point for a more advanced Graphical Models student. It is extensive (90 pages) and gives a feeling of comfort with CRFs when done. It is definitely the best tutorial available though by no means the most easiest point to start if you have never done any sequential learning before.

This might be considered an extension to this tutorial by McCallum et al : CRFs for Relational Learning (PDF)

4. Original CRF Paper (John Lafferty et al.)

PDF

Though not necessary to learn CRFs given many better tutorials, this paper is still recommended, being the first on CRFs.

5. Training/Derivations (Rahul Gupta)

PDF

This report is good for the various training methods and for one to go through the derivations associated.

6. Applications to Vision (Nowozin/Lampert)

If your primary focus is using structured prediction in Computer Vision/Image Analysis then a good tutorial (with a large section on CRFs) can be found over here:

Structured prediction and learning in Computer Vision (Foundations and Trends Volume).

PDF

___________________

Extensions to the CRF concept

There are a number of extensions to CRFs. The two that I have found most helpful in my work are (these are easy to follow given the above):

Both of these extensions work to include hidden variables in the CRF framework.

___________________

Software Packages

1. Kevin Murphy’s CRF toolbox (MATLAB)

2. MALLET (I haven’t used MALLET, it is Java based)

3. HCRF – LDCRF Library (MATLAB, C++, Python). As as the name suggests, this package is for HCRF and LDCRF, though can be used as a standalone package for CRF as well.