Archive for May, 2016

Note: The title of this post is circular. But I use/abuse it because of the post linked below.

I noticed on the Hacker News front page (and via multiple reshares on twitter), a discussion on why logistic regression uses a sigmoid. The article linked in the story talks about the log-odds ratio, and how it leads to the sigmoid (and gives a good intuitive plug on it).

However, I think that the more important question is – Why do you care about log-odds? Why do you use log-odds and not something else? The point of this quick post is to write out why using the log-odds is infact very well motivated in the first place, and once it is modeled by a linear function, what you get is the logistic function.

Beginning with log-odds would infact be begging the question, so let us try to understand.


To motivate and in order to define the loss etc, suppose we had a linear classifier: \hat{y} = h(\mathbf{x}) = sign(w_0 + \mathbf{w}^T\mathbf{x}). This just means that for a vector input, we take the dot product with the parameters \mathbf{w} and take the sign.

The learning problem would be to figure out a good direction \mathbf{w} and a good location of the decision boundary w_0.


We want to figure these out so as to minimize the expected 0-1 loss (or expected number of mistakes) for the classifier h: \mathcal{X} \rightarrow \mathcal{Y} . The 0-1 loss for a datapoint-label pair \mathbf{x},y is simply:

\displaystyle L(h(\mathbf{x}),y) = \begin{cases} 0, & \text{if } h(\mathbf{x}) = y \\ 1, & \text{if } h(\mathbf{x}) \neq y \end{cases}

Now, the next question we would like to ask. What is the risk of this classifier that we want to minimize? The risk is the expected loss. That is, if we draw a random sample from the (unknown) distribution p(\mathbf{x},y), what would be the expected error? More concretely:

R(h)  = \mathbb{E}_{\mathbf{x},y} [L(h(\mathbf{x}),y)]

Writing out the expectation:

\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(\mathbf{x}, y=c) d\mathbf{x}

Using the chain rule this becomes:

\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x}) p(\mathbf{x})d\mathbf{x}

It is important to understand this expression. This is not assuming anything about the data. However, it is this expression that we want to minimize if we want to get a good classifier. To minimize this expression, it suffices to simply minimize for the conditional risk for any point \mathbf{x} (i.e. the middle part of the above expression):

\displaystyle R(h|\mathbf{x}) =\sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x})

But this conditional risk can be written as:

\displaystyle R(h|\mathbf{x}) =0 \times p(y=h(\mathbf{x})|\mathbf{x}) + 1 \times \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x})

Note that, \displaystyle \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})

Therefore, the conditional risk is simply:

\displaystyle R(h|\mathbf{x})  = 1 - p(y=h(\mathbf{x})|\mathbf{x})

Now, it is this conditional risk that we want to minimize given a point \mathbf{x}. And in order to do so, looking at the expression above, the classifier must make the following decision:

\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})

It is again important to note that so far we have made absolutely no assumptions about the data. So the above classifier is the best classifier that we can have in terms of generalization, in the sense of what might be the expected loss on a new sample point. Such a classifier is called the Bayes Classifier or sometimes called the Plug-in classifier.

But the optimal decision rule mentioned above i.e. \displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x}) is equivalent to saying that:

\displaystyle h(\mathbf{x}) = c^\ast \iff \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 1 \text{  }\forall c

by taking log, this would be:

\displaystyle h(\mathbf{x}) = c^\ast \iff \log \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 0 \text{  }\forall c

If, we were only dealing with binary classification, this would imply:

\displaystyle h(\mathbf{x}) = 1 \iff \log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} \geq 0

Notice that by making no assumptions about the data, simply by writing out the conditional risk, the log-odds ratio has fallen out directly. This is not an accident, because the optimal bayes classifier has this form for binary classification. But the question still remains, how do we model this log-odds ratio? The simplest option is to consider a linear model (there is no reason to stick to a linear model, but due to some reasons, one being convexity, we stick to a linear model):

\displaystyle log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} = w_0 + \mathbf{w}^T \mathbf{x} = 0

Now, we know that p(y=1|\mathbf{x}) = 1 - p(y=0|\mathbf{x}), plugging this in the above, and exponentiating, we have:

\displaystyle log \frac{p(y = 1|\mathbf{x})}{1 - p(y=0|\mathbf{x})} = \exp(w_0 + \mathbf{w}^T \mathbf{x}) = 1

Rearranging, yields the familiar logistic model (and the sigmoid):

\displaystyle p(y = 1|\mathbf{x}) = \frac{1}{1+ \exp(- w_0 - \mathbf{w}^T\mathbf{x})} = \frac{1}{2}.

As noted in the post linked in the beginning, the logistic model, \sigma(x) = \frac{1}{1+ e^{-x}}, which for any x is, 0 \leq \sigma(x) \leq 1, and is monotonic \sigma(-\inf) = 0, \sigma(+\inf) = 1.


This derivation shows that the log-odds is not an arbitrary choice, infact a very natural choice. The sigmoid is simply a consequence of modeling the log-odds with a linear function (infact logistic regression is arguably the simplest example of a log-linear model, if we had structured outputs, a natural extension of such a model would be the Conditional Random Field. The choice of using a linear function is simply to make the optimization convex, amongst some other favourable properties).


Note: This post was inspired by some very succinct notes by Gregory Shakhnarovich from his Machine Learning class, that I both took and served as a TA for.


Read Full Post »