Logistic Regression


There are many cases where the output is discrete rather than continuous, and many problems have binary outcomes. It would be very useful to assign probabilities over the binary labels rather than giving a hard answer from the set {0, 1}: this way, we can be more confident that labels further away from the decision boundary are correct. In short, we want the conditional distribution of the label Y given the input variables, P(Y|X).

Now instead of outputting 0 or 1, we output P(Y=1|X). But how do we model this probability as a function of X? We cannot use a linear function directly, since it is unbounded while a probability must lie between 0 and 1. Instead we model the log-odds (the logit), $ log \frac{p}{1-p} $, as a linear function of x:

$ log \frac{P(y=1|x)}{1 - P(y=1|x)} = \theta_0 + \theta x $

Solving for the probability gives

$ P(y=1|X, \theta , \theta_0) = \sigma (\theta_0 + \theta X) $, where $ \sigma (z) = \frac{1}{1 + e^{-z}} $ is called the sigmoid function.

The graph of $ \sigma $ as a function of its argument is shown below. As you can see, the value of $ \sigma $ is bounded by 0 and 1, and can therefore be interpreted as a probability.

[Figure: plot of the sigmoid function $ \sigma $, bounded between 0 and 1]
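
For a quick numerical check, here is a minimal NumPy sketch of the sigmoid; the function name and the sample inputs are only illustrative.

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# roughly [0.000045, 0.269, 0.5, 0.731, 0.999955], all squashed into (0, 1)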

To train such a model, we try to maximize the probability of predicting the correct labels. Assuming the observations are independent of each other, the conditional likelihood is given by

$ \prod_{i=1}^{n} P(Y=y_i|X_i) = \prod_{i=1}^{n} p(x_i)^{y_i}(1-p(x_i))^{1-y_i} $, where $ p(x_i) = P(Y=1|X_i) $.

Taking the log of both sides gives the log-likelihood (a product of many small probabilities can underflow, so we work with logs),

$ l(\theta , \theta_0) = \sum_{i=1}^{n} y_i log p(x_i) + (1 - y_i) log (1 - p(x_i)) $

= $ \sum_{i=1}^{n} log (1 - p(x_i)) + \sum_{i=1}^{n} y_i log \frac{p(x_i)}{1 - p(x_i)} $

= $ \sum_{i=1}^{n} log (1-p(x_i)) + \sum_{i=1}^{n} y_i (\theta_0 + x_i \theta) $

= $ \sum_{i=1}^{n} - log (1 + e^{\theta_0 + x_i \theta}) + \sum_{i=1}^{n} y_i (\theta_0 + x_i \theta) $
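
As a sanity check on this expression, the log-likelihood can be evaluated directly in NumPy; the names theta0, theta, X, and y below are hypothetical placeholders for the intercept, the weight vector, the feature matrix, and the 0/1 labels.

import numpy as np

def log_likelihood(theta0, theta, X, y):
    # l(theta, theta0) = sum_i [ y_i * z_i - log(1 + exp(z_i)) ],  with z_i = theta0 + x_i . theta
    z = theta0 + X @ theta
    return np.sum(y * z - np.log1p(np.exp(z)))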

To find the maximum likelihood estimate we differentiate the log-likelihood with respect to the parameters,

$ \frac{\partial l}{\partial \theta_j} = \sum_{i=1}^{n} x_{ij} (y_i - P(y = 1 | x_i, \theta , \theta_0)) $
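
In code, the same gradient (of the log-likelihood, using the placeholder names from the previous snippet) might look like this sketch:

def log_likelihood_gradient(theta0, theta, X, y):
    p = 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))   # P(y=1 | x_i) for every example
    grad_theta = X.T @ (y - p)                        # one component per feature j
    grad_theta0 = np.sum(y - p)                       # intercept component
    return grad_theta0, grad_theta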

Setting this derivative to zero does not yield a closed-form solution, so the parameters have to be found by numerical optimization. Several methods are used in practice, e.g. Newton's Method, Conjugate Gradient, L-BFGS and Trust Region (liblinear); a small L-BFGS sketch is shown below. After that, let's talk about Multinomial Logistic Regression, or Softmax Regression.
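
Here is a minimal, self-contained sketch that fits the parameters with SciPy's L-BFGS-B solver; the function name, the toy Gaussian blobs, and the variable names are all illustrative assumptions. We minimize the negative log-likelihood and supply its gradient.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(params, X, y):
    # params packs [theta0, theta]; we minimize the negative log-likelihood
    theta0, theta = params[0], params[1:]
    z = theta0 + X @ theta
    p = 1.0 / (1.0 + np.exp(-z))
    nll = np.sum(np.log1p(np.exp(z)) - y * z)
    grad = -np.concatenate(([np.sum(y - p)], X.T @ (y - p)))
    return nll, grad

# toy data: two overlapping Gaussian blobs (purely illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

res = minimize(neg_log_likelihood_and_grad, x0=np.zeros(X.shape[1] + 1),
               args=(X, y), jac=True, method='L-BFGS-B')
theta0_hat, theta_hat = res.x[0], res.x[1:]
print(theta0_hat, theta_hat)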

Softmax Regression

In logistic regression $ y_i \in \{0, 1\} $, while in softmax regression $ y_i \in \{1, 2, \dots, K\} $. Given a test input x, we want our hypothesis to estimate P(y = k | x) for each k = 1, 2, ..., K. The output is therefore a K-dimensional vector whose entries sum to 1. The probability in softmax regression is given by

$ P(y^{(i)} = k | x^{(i)} ; \theta) = \frac{exp(\theta^{(k)T} x^{(i)})}{\sum_{j=1}^{K}exp(\theta^{(j)T} x^{(i)})} $

$ h_{\theta}(x) = \begin{bmatrix} P(y = 1 | x)\\ P(y = 2 | x)\\ \vdots \\ P(y = K | x) \end{bmatrix} = \frac{1}{\sum_{i=1}^{K}exp(\theta^{(i)T} x)} \begin{bmatrix} exp(\theta^{(1)T} x)\\ exp(\theta^{(2)T} x)\\ \vdots \\ exp(\theta^{(K)T} x) \end{bmatrix} $
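
A minimal NumPy sketch of this hypothesis, where Theta is a hypothetical K x n matrix whose rows are the $ \theta^{(k)} $ and x is a single feature vector:

import numpy as np

def softmax_hypothesis(Theta, x):
    scores = Theta @ x                     # theta^(k)T x for every class k
    scores -= scores.max()                 # subtract the max for numerical stability (result unchanged)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # K probabilities that sum to 1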

The cost function (the negative log-likelihood) is given by

$ l( \theta ) = - \left [ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{ y^{(i)} = k\} log \frac{exp(\theta^{(k)T}x^{(i)})}{\sum_{j=1}^{K} exp(\theta^{(j)T} x^{(i)})} \right ] $
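
In NumPy this cost can be written by using the integer labels to index the log-probabilities, which plays the role of the indicator $ 1\{ y^{(i)} = k\} $. The sketch below reuses the placeholder Theta from the previous snippet and assumes labels encoded as 0, ..., K-1.

def softmax_cost(Theta, X, y):
    # X: m x n, y: m integer labels in {0, ..., K-1}, Theta: K x n
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)                            # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.sum(log_probs[np.arange(len(y)), y])                        # -sum_i log P(y_i | x_i)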

The gradient with respect to the parameter vector $ \theta^{(k)} $ is

$ \nabla_{\theta^{(k)}} l(\theta) = -\sum_{i=1}^{m} x^{(i)} \left( 1\{ y^{(i)} = k \} - P(y^{(i)} = k | x^{(i)}; \theta) \right) $

Given the gradient, we can now use a numerical optimization algorithm such as L-BFGS or gradient descent to minimize the negative log-likelihood. For example, in gradient descent each parameter vector is updated as

$ \theta^{(k)} := \theta^{(k)} - \alpha \nabla_{\theta^{(k)}} l(\theta) $
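
Putting the cost and gradient together, a single gradient-descent update for the softmax parameters might look like the following sketch (same placeholder names and 0-based integer labels as above; alpha is the learning rate):

def softmax_gradient_step(Theta, X, y, alpha):
    m, K = X.shape[0], Theta.shape[0]
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # m x K matrix of P(y = k | x_i)
    one_hot = np.zeros((m, K))
    one_hot[np.arange(m), y] = 1.0                                       # indicator 1{y_i = k}
    grad = -(one_hot - probs).T @ X                                      # gradient of the negative log-likelihood, K x n
    return Theta - alpha * grad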

Regularization

Large parameter values often lead to overfitting. We penalize large parameter values by adding a regularization, or weight decay, term $ \frac{\lambda}{2} \sum_{i=1}^{K} \sum_{j=0}^{n} \theta_{ij}^2 $ to the cost function. The gradient then becomes

$ \nabla_{\theta^{(k)}} l(\theta) = -\sum_{i=1}^{m} x^{(i)} \left( 1\{ y^{(i)} = k \} - P(y^{(i)} = k | x^{(i)}; \theta) \right) + \lambda \theta^{(k)} $
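
As a tiny sketch of this change, with lam standing in for $ \lambda $ and Theta for the parameter matrix from the earlier snippets:

def l2_penalty_and_gradient(Theta, lam):
    penalty = 0.5 * lam * np.sum(Theta ** 2)   # weight-decay term added to the cost
    grad_term = lam * Theta                    # its contribution to the gradient, one lambda * theta^(k) per class
    return penalty, grad_term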

Below is a very simple Octave/MATLAB implementation of Logistic Regression using Gradient Descent.


function [theta,cost] = logistic_regression_train(X,y,num_iter)
    % Simple logistic regression trained with batch gradient descent.
    % X: m x n feature matrix, y: m x 1 vector of 0/1 labels.

    %X = featureNorm(X);            % optionally normalize the features first
    [m,n] = size(X);
    X = [ones(m,1), X];             % add a column of ones for the intercept
    theta = ones(n+1,1);            % initial parameters
    cost = zeros(1,num_iter);
    for i=1:num_iter
        alpha = 1/sqrt(i);          % decaying step size; try other schedules like 2/i
        h = sigmoid(X * theta);     % predicted probabilities P(y=1|x)
        theta = theta - alpha * X' * (h - y);   % gradient descent step on the negative log-likelihood
        cost(i) = sum(abs(h - y)) / m;          % track mean absolute error per iteration
    end
end

function g = sigmoid(z)
    g = 1 ./ (1 + exp(-z));         % logistic (sigmoid) function
end
However, in Python we can simply use the built-in implementation from scikit-learn.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(penalty='l2', dual=False, tol=0.0001,
                            C=1.0, fit_intercept=True, intercept_scaling=1,
                            class_weight=None, random_state=None,
                            solver='liblinear', max_iter=100,
                            multi_class='ovr', verbose=0)

We can tune these parameters once we know the fundamentals of Logistic Regression. For example, the regularization type is set via the penalty parameter, and the optimization algorithm via the solver parameter; here it is set to liblinear, which implements a Trust Region method.
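
A minimal usage sketch on a toy two-class dataset (the data and variable names here are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])   # two Gaussian blobs
y = np.concatenate([np.zeros(50), np.ones(50)])

logreg = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
logreg.fit(X, y)
print(logreg.predict(X[:3]))         # hard 0/1 labels
print(logreg.predict_proba(X[:3]))   # P(y=0|x) and P(y=1|x) for each row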
