Logistic Regression implementation in Python from scratch




Logistic regression is another classification algorithm used in machine learning; it is straightforward and efficient. Unlike linear regression, it has a binary or categorical dependent variable. In this article we’ll cover the case where the dependent variable is binary; for dependent variables with more than two categories, multinomial logistic regression is used, which is out of scope for now.


Similar to the naive Bayes algorithm, logistic regression can take both continuous and categorical variables as input, and it outputs the probability that an input vector belongs to a particular class.

So to understand logistic regression we first solve the problem by hand (i.e. work through the mathematics) and then write the Python implementation.

1. Why Logistic Regression
We start by looking at a very basic example of a tumor being malignant or not. The outcome is in a binary format, i.e. 1 if the tumor is malignant and 0 if it is benign. Now, if we fit a straight line using linear regression and set a threshold at point M, we see that the results are not bad at all. In fact, for this case it works quite decently and allows us to determine whether the tumor is malignant or not based on tumor size: tumor size > M is classified as malignant, and tumor size below M as benign. This is a very simple example just to let the reader know that linear regression is not completely wrong for fitting a line to a categorical dependent variable.

Now let’s extend this example by adding one more data point like the following, fitting a line on the data, and setting the threshold to M. The results are now quite bad, because we can see from the graph that the line wrongly classifies the tumor in some cases.

One more problem usually arises when fitting a line to such data: the output we observe in the training data is either 0 or 1, so we would like the model output to stay between 0 and 1. A straight line does not respect this, and we can get outputs greater than 1 for large tumor sizes, which is not desired. For these reasons, we need a model that fits this kind of data well.

If we look at some mathematical functions, we’ll realize that the “sigmoid function” or “logistic function” below solves both of our problems. First, it fits data whose output variable is either 0 or 1 and produces a probability value for each data point. Secondly, it is asymptotic at 0 and 1, which makes sure the resulting value never exceeds these limits.

In machine learning the method that fits the logistic function is called logistic regression. Fitting a logistic function has some other advantages as well, like continuity and easy derivatives. So we see why we need logistic regression to solve a classification problem.

High level overview

After realizing the necessity of using the logistic function, we’ll now see a high-level overview of the problem. The logistic function can be written as \[P = \frac{1}{1+e^{-X}}\] If we write it in terms of linear regression then it becomes \[P = \frac{1}{1+e^{-\theta^TX}}\] Here $\theta^TX$ is the dot product of the parameter vector $\theta$ and the input vector $X$. Input vectors are given to us (both training and test data), and the $\theta$ vector is estimated from the training set of that particular data set; the trained model is then used to calculate the probability for a new input (test data). $\theta$ can be optimized using various optimization methods, but generally we use the gradient descent algorithm. The complete process of optimizing $\theta$ is explained in the next section. We can see how easy the method becomes once $\theta$ is optimized: we just calculate the dot product between the new input vector and $\theta$, then plug that value into the logistic function to calculate the probability for $X$.
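As a quick illustration, here is a minimal sketch of that prediction step once $\theta$ is known; the parameter values and the input vector below are made up purely for illustration.

```python
import numpy as np

# hypothetical, already-optimized parameter vector and a new input vector
theta = np.array([-4.0, 0.05, 0.04])    # [intercept, weight for x1, weight for x2]
x_new = np.array([1.0, 45.0, 85.0])     # leading 1 multiplies the intercept

z = theta @ x_new                        # theta^T X = -4 + 0.05*45 + 0.04*85 = 1.65
probability = 1.0 / (1.0 + np.exp(-z))   # logistic function, giving P(class = 1 | x_new) ~ 0.84
print(probability)
```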

Cost function optimization for optimal parameters


An integral objective in learning algorithms is to optimize the parameters according to the dataset. For this reason, cost functions are used. If we recall from linear regression, its cost function is simply the sum of squared errors, which needs to be minimized. Following is the cost function for linear regression: \[J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)^2\] From the function we can see that it is a simple quadratic function and can be minimized easily using gradient descent, because its local minimum is also the global minimum. On the other hand, if we used the same cost function for logistic regression, we would end up with a non-convex function that is not guaranteed to reach the global optimum using gradient descent, and therefore we could not be certain whether our trained parameters are optimal or not.

For this reason, the following cost function is used to estimate the parameters. The idea behind it is quite simple: \[Cost(h_\theta(x), y) = \begin{cases} -log(h_\theta(x)) & \text{if } y = 1 \\ -log(1-h_\theta(x)) & \text{if } y = 0 \end{cases}\] If we plot this cost function, we see that both branches are convex, so the cost function can be optimized using gradient descent with a guarantee of reaching the global optimum.

The above formula can be written in a simplified form \[Cost(h_\theta(x), y) = -ylog(h_\theta(x))-(1-y)log(1- h_\theta(x))\] and used to express the total cost over all data points in a training set of size m in the following mathematical form, which is then minimized, $min_\theta J(\theta)$, to get the optimal parameter vector $\theta$: \[J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^ilog(h_\theta (x^i)) + (1-y^i)log(1-h_\theta(x^i))\right]\]

A detailed implementation for logistic regression in Python

We start by loading the data from a csv file. The data is quite simple, with a couple of independent variables, so that we can better understand the example; later the same implementation can be applied to more complex datasets. The data consists of two grades, grade1 and grade2, which determine whether a student is admitted or not (admitted = 1, not admitted = 0).
We use the pandas library to handle data in dataframes, which is fast and efficient. First, we define a dataframe (df) from the csv file. Then we apply a few preprocessing steps to clean and transform the data. Data cleaning and transformation is done using pandas and scikit-learn, which can be understood from here. Now we have the independent variables separated as X and the dependent variable separated as Y.
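A minimal loading-and-preprocessing sketch could look like the following; the file name data.csv and the column names grade1, grade2 and admitted are assumptions for illustration, and scikit-learn's preprocessing.scale is used as a simple transformation step.

```python
import pandas as pd
from sklearn import preprocessing

# assumed file and column names, for illustration only
df = pd.read_csv("data.csv", header=0)
df.columns = ["grade1", "grade2", "admitted"]

# independent variables X (scaled to zero mean / unit variance) and dependent variable Y
X = preprocessing.scale(df[["grade1", "grade2"]].values)
Y = df["admitted"].values
```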


We then introduce the logistic (sigmoid) function along with the hypothesis function, which applies the sigmoid to $\theta^TX$, the dot product of the parameter vector $\theta$ and the input vector $X$.
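A minimal sketch of these two helpers, using numpy:

```python
import numpy as np

def sigmoid(z):
    # logistic function, asymptotic at 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = sigmoid(theta^T x): the predicted probability that y = 1
    return sigmoid(np.dot(theta, x))
```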

After this, we implement the cost function we discussed above in detail. In short, the cost function calculates $h_\theta(x^i)$ for each data point and returns the average error. We sum the error (cost) over the data points and put a large penalty on a data point whose class (Y = 1 or 0 in our case) is predicted wrongly. So if Y = 1, the second part of the cost function vanishes and the first part, $-log(h_\theta(x))$, heavily penalizes predicted probabilities close to 0, and vice versa for Y = 0.
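A vectorized sketch of this cost function, assuming X is an m-by-n numpy array of input vectors, Y a length-m array of 0/1 labels, and theta a length-n parameter vector:

```python
import numpy as np

def cost_function(X, Y, theta):
    # cross-entropy cost J(theta), averaged over all m training examples
    m = len(Y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^i) for every row of X at once
    # the y = 1 term penalizes probabilities near 0, the y = 0 term penalizes probabilities near 1
    return -(1.0 / m) * np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))
```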

Now, in order to apply the gradient descent algorithm, we need to calculate the partial derivatives of the cost function with respect to each parameter $\theta_j$. For this reason we implement the following derivative of the cost function.
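For the cost function above, the partial derivative works out to $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)x^i_j$. A sketch, under the same array-shape assumptions as before:

```python
import numpy as np

def cost_gradient(X, Y, theta):
    # vector of partial derivatives dJ/dtheta_j = (1/m) * sum_i (h_theta(x^i) - y^i) * x^i_j
    m = len(Y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return (X.T @ (h - Y)) / m
```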

Next we create the implementation of gradient descent, which uses the partial derivative function above and optimizes $\theta$ over a fixed number of iterations. The main idea behind gradient descent can be understood from this animation.
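A sketch of such a fixed-iteration loop, reusing the cost_gradient helper above; the learning rate alpha and the iteration count are illustrative defaults, not values from the original post.

```python
def gradient_descent(X, Y, theta, alpha=0.01, num_iterations=1000):
    # repeatedly move theta a small step against the gradient of the cost
    for _ in range(num_iterations):
        theta = theta - alpha * cost_gradient(X, Y, theta)
    return theta
```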


After the implementation of gradient descent, we are all set to implement the main logistic regression method, which optimizes the $\theta$ vector by applying gradient descent to minimize the cost function $J(\theta)$. We run gradient descent for a fixed number of iterations and check the cost function value along the way.
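One way this could look, stringing together the cost_function and cost_gradient sketches above and starting from an all-zero $\theta$:

```python
import numpy as np

def logistic_regression(X, Y, alpha=0.01, num_iterations=1000):
    # start from all-zero parameters and refine them with gradient descent,
    # printing the cost every so often to check that it keeps decreasing
    theta = np.zeros(X.shape[1])
    for i in range(num_iterations):
        theta = theta - alpha * cost_gradient(X, Y, theta)
        if i % 100 == 0:
            print("iteration", i, "cost", cost_function(X, Y, theta))
    return theta
```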


Now here is the main method that sets the initial parameters and runs the method above, which in turn calls the respective implemented methods. The complete implementation with the data to test the algorithm is available here on github.
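A hypothetical driver tying the sketches above together might look like this; again, data.csv and the column names are assumptions, and a column of ones is stacked onto X so that $\theta$ includes an intercept term.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

if __name__ == "__main__":
    # same assumed file and column names as in the loading step above
    df = pd.read_csv("data.csv", header=0)
    df.columns = ["grade1", "grade2", "admitted"]
    X = preprocessing.scale(df[["grade1", "grade2"]].values)
    X = np.column_stack([np.ones(len(X)), X])   # column of ones so theta has an intercept
    Y = df["admitted"].values

    theta = logistic_regression(X, Y, alpha=0.1, num_iterations=3000)
    predictions = (1.0 / (1.0 + np.exp(-X @ theta)) >= 0.5).astype(int)
    print("trained theta:", theta)
    print("training accuracy:", (predictions == Y).mean())
```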


