Where does the Sigmoid in Logistic Regression come from?

May 18, 2016 by Shubhendu Trivedi

Note: The title of this post is circular. But I use/abuse it because of the post linked below.

I noticed on the Hacker News front page (and via multiple reshares on twitter), a discussion on why logistic regression uses a sigmoid. The article linked in the story talks about the log-odds ratio, and how it leads to the sigmoid (and gives a good intuitive plug on it).

However, I think that the more important question is – Why do you care about log-odds? Why do you use log-odds and not something else? The point of this quick post is to write out why using the log-odds is infact very well motivated in the first place, and once it is modeled by a linear function, what you get is the logistic function.

Beginning with log-odds would infact be begging the question, so let us try to understand.

____________________________

To motivate and in order to define the loss etc, suppose we had a linear classifier: $\hat{y} = h(\mathbf{x}) = sign(w_0 + \mathbf{w}^T\mathbf{x})$ . This just means that for a vector input, we take the dot product with the parameters $\mathbf{w}$ and take the sign.

The learning problem would be to figure out a good direction $\mathbf{w}$ and a good location of the decision boundary $w_0$ .

____________________________

We want to figure these out so as to minimize the expected 0-1 loss (or expected number of mistakes) for the classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$ . The 0-1 loss for a datapoint-label pair $\mathbf{x},y$ is simply:

$\displaystyle L(h(\mathbf{x}),y) = \begin{cases} 0, & \text{if } h(\mathbf{x}) = y \\ 1, & \text{if } h(\mathbf{x}) \neq y \end{cases}$

Now, the next question we would like to ask. What is the risk of this classifier that we want to minimize? The risk is the expected loss. That is, if we draw a random sample from the (unknown) distribution $p(\mathbf{x},y)$ , what would be the expected error? More concretely:

$R(h) = \mathbb{E}_{\mathbf{x},y} [L(h(\mathbf{x}),y)]$

Writing out the expectation:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(\mathbf{x}, y=c) d\mathbf{x}$

Using the chain rule this becomes:

$\displaystyle R(h) =\int_x \sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x}) p(\mathbf{x})d\mathbf{x}$

It is important to understand this expression. This is not assuming anything about the data. However, it is this expression that we want to minimize if we want to get a good classifier. To minimize this expression, it suffices to simply minimize for the conditional risk for any point $\mathbf{x}$ (i.e. the middle part of the above expression):

$\displaystyle R(h|\mathbf{x}) =\sum_{c=1}^C L(h(\mathbf{x}),c) p(y=c|\mathbf{x})$

But this conditional risk can be written as:

$\displaystyle R(h|\mathbf{x}) =0 \times p(y=h(\mathbf{x})|\mathbf{x}) + 1 \times \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x})$

Note that, $\displaystyle \sum_{c \neq h(\mathbf{x})} p(y=c|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Therefore, the conditional risk is simply:

$\displaystyle R(h|\mathbf{x}) = 1 - p(y=h(\mathbf{x})|\mathbf{x})$

Now, it is this conditional risk that we want to minimize given a point $\mathbf{x}$ . And in order to do so, looking at the expression above, the classifier must make the following decision:

$\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$

It is again important to note that so far we have made absolutely no assumptions about the data. So the above classifier is the best classifier that we can have in terms of generalization, in the sense of what might be the expected loss on a new sample point. Such a classifier is called the Bayes Classifier or sometimes called the Plug-in classifier.

But the optimal decision rule mentioned above i.e. $\displaystyle h(\mathbf{x}) = \arg\max_c p(y=c| \mathbf{x})$ is equivalent to saying that:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 1 \text{ }\forall c$

by taking log, this would be:

$\displaystyle h(\mathbf{x}) = c^\ast \iff \log \frac{p(y = c^\ast|\mathbf{x})}{p(y=c|\mathbf{x})} \geq 0 \text{ }\forall c$

If, we were only dealing with binary classification, this would imply:

$\displaystyle h(\mathbf{x}) = 1 \iff \log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} \geq 0$

Notice that by making no assumptions about the data, simply by writing out the conditional risk, the log-odds ratio has fallen out directly. This is not an accident, because the optimal bayes classifier has this form for binary classification. But the question still remains, how do we model this log-odds ratio? The simplest option is to consider a linear model (there is no reason to stick to a linear model, but due to some reasons, one being convexity, we stick to a linear model):

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{p(y=0|\mathbf{x})} = w_0 + \mathbf{w}^T \mathbf{x} = 0$

Now, we know that $p(y=1|\mathbf{x}) = 1 - p(y=0|\mathbf{x})$ , plugging this in the above, and exponentiating, we have:

$\displaystyle log \frac{p(y = 1|\mathbf{x})}{1 - p(y=0|\mathbf{x})} = \exp(w_0 + \mathbf{w}^T \mathbf{x}) = 1$

Rearranging, yields the familiar logistic model (and the sigmoid):

$\displaystyle p(y = 1|\mathbf{x}) = \frac{1}{1+ \exp(- w_0 - \mathbf{w}^T\mathbf{x})} = \frac{1}{2}$ .

As noted in the post linked in the beginning, the logistic model, $\sigma(x) = \frac{1}{1+ e^{-x}}$ , which for any $x$ is, $0 \leq \sigma(x) \leq 1$ , and is monotonic $\sigma(-\inf) = 0, \sigma(+\inf) = 1$ .

____________________________

This derivation shows that the log-odds is not an arbitrary choice, infact a very natural choice. The sigmoid is simply a consequence of modeling the log-odds with a linear function (infact logistic regression is arguably the simplest example of a log-linear model, if we had structured outputs, a natural extension of such a model would be the Conditional Random Field. The choice of using a linear function is simply to make the optimization convex, amongst some other favourable properties).

____________________________

Note: This post was inspired by some very succinct notes by Gregory Shakhnarovich from his Machine Learning class, that I both took and served as a TA for.

Posted in Machine Learning | Tagged Bayes Classifiers, Bayes Risk, Log Linear Models, Logistic Regression, Machine Learning, Statistics | 4 Comments

4 Responses

on June 5, 2016 at 6:06 am | Reply CAA

It falls out naturally using the maximum entropy approach.
on June 5, 2016 at 12:14 pm | Reply Alex Keil

But this doesn’t seem to answer the actual question. Your mapping of the conditional risk for y=c to the conditional odds of y=c is the key point, but it’s an arbitrary step. The same mapping could be made for any monotonic transformation of the risk, such as a probit function or simply a log-function. The reason for going to the log-odds, historically, is that, unlike a linear risk model, the parameters of the classifier are not bound by [0,1] and unlike the log-risk model, it’s not bound above. The parameters of a linear-logistic model (and probit) live on (-infty,infty), so computational algorithms have an easy time coming up with a solution that converges. That’s the real reason for using logistic versus, say, a linear or log-linear risk model. This just seems like a lot of hand-waving around why the logistic function is chosen versus any other arbitrary monotonic transformation of the optimal decision rule.
- on June 5, 2016 at 3:48 pm | Reply Shubhendu Trivedi
  
  I am aware of what you are saying, however I don’t think that was the point of the post, while making the post I thought about it and decided to not write about it. I think the main point was to show how the logistic falls off when you write down the Bayes risk. I do mention in passing both the efficiency considerations and that one could use any other likn function. Also, I don’t see how the mapping to the odds is arbitrary.
on August 9, 2019 at 4:07 pm | Reply Deep

Great post! Just one comment though: There are typos in your steps when you are assuming the log-odds as a linear function and carrying on the rest of the derivations.

Cheers!

Comments RSS

	Vihan on 1966 Film on John von Neu…
	Shubhendu Trivedi on Strangeness Minus Three (Richa…
	Shubhendu Trivedi on “Darwinian Evolution is…
	Deep on “Darwinian Evolution is…
	Deep on Strangeness Minus Three (Richa…
	Ofir Gagulashvily on Maximum Number of Regions in A…
	Perception and Visio… on Blind Geometers
	Andrea on A Note on the Graph Lapla…
	eJournal (Final Entr… on A Fractal with Zero Area and a…
	Deep on Where does the Sigmoid in Logi…

Onionesque Reality

..::A Random Walk::..