Support Vector Machines - Understanding the theory and implementation in Python


Support Vector Machine (SVM) is a classification algorithm which, in its basic form, separates the data into two classes depending on the problem definition. There are many classification algorithms, including Naive Bayes, logistic regression and neural nets, but SVM is one of the more sophisticated methods and a must-have tool in a data scientist's toolkit.
Today, we’ll first talk about the mathematical perspective of SVM to better understand what exactly is going on and then we’ll solve a classification problem with our conventional Pima Indians Diabetes dataset and determine our model accuracy.


1. General Overview
Let's consider that we have a space which consists of negative (-) and positive (+) examples. So, how can we separate the + and - examples using just a straight line (hyperplane)? For example, let us look at the following graph

From the above graph we can see that the space can be separated using various hyperplanes. But the question now is which hyperplane to choose so that the classes in the data are separated in an optimal way?
Fortunately we have a very decent answer to this question, given to us by the discoverers of this algorithm: choose the hyperplane that keeps the largest possible margin between itself and the closest data point(s) of the two classes. Confusing? Let's look at the following graph to understand what I just said.
As discussed, a separating hyperplane can be represented in an infinite number of ways. Even if we fix the direction of the hyperplane, we can still scale its coefficients by different magnitudes in infinitely many ways. From the many possibilities, let's assume that the optimal hyperplane satisfies
\[\left | \beta_0 + \beta^T x \right | = 1 --- (1)\]
Where x are the training examples, $\beta^{T}$ is the weight vector and $\beta_{0}$ is the intercept (or bias). Also, the training examples that are closest to the hyperplane are called support vectors. So, generally, we optimize the distance (margin) between the support vectors and the hyperplane, which is done using the basic formula for the distance from a point to a line \[distance = \frac{\left | \beta_0 + \beta^T x \right |}{\left \| \beta \right \|} --- (2)\]
From eq(1) and eq(2) \[distance = \frac{1}{\left \| \beta \right \|} --- (3)\] Note that this distance is only from one point (a point of one class) to the hyperplane. In order to calculate the total distance between the closest points of the two classes we multiply it by 2 and get \[distance(M) = \frac{2}{\left \| \beta \right \|} --- (4)\]
Where M is the margin, or the perpendicular distance between the closest + and - examples. Now let's look at eq(4): minimizing $\left \| \beta \right \|$ leads to the largest margin, and that is exactly our objective. So eq(4) can be rewritten as the following quadratic optimization problem
\[min_{\beta_0,\beta^T} L(\beta) = \frac{1}{2}\left \| \beta \right \|^2\] subject to $y_i (\beta_0+\beta^Tx_i) \geq 1 --- (5)$

Where $y_i$ are the labels of the training examples, taken as +1 or -1 in the SVM formulation (as we will see in the next section). This problem can be solved using Lagrange multipliers, which will give us the optimal $\beta_0$ and $\beta^T$.


2. Advanced Mathematical Concepts
This section is optional for those who do not want to go into the mathematical details of SVMs. Because I am trying to explain the above concepts in more detail, I'll repeat a few concepts already explained above, so don't mind the repetition. Let's start by assuming the following scenario, where $\bar{x}$ is an unknown vector and $\beta^T$ is the weight vector perpendicular to the median line, such that
If $\beta^T.\bar{x} \geq -\beta_0$, then the sample is +, otherwise -; in other words, $\beta_0+\beta^T\bar{x} \geq 0$ for positive samples. Now in this case we do not know which $\beta^T$ and $\beta_0$ to choose. We know that $\beta^T$ has to be perpendicular to the median line, but there are many such $\beta^T$ because $\beta^T$ can be of any length.

We move forward by introducing the following few constraints in order to solve our problem

\[\beta^T.\bar{x}_+ + \beta_0 \geq 1 --- (6) \]
\[\beta^T.\bar{x}_- + \beta_0 \leq -1 --- (7) \]
Where $\bar{x}_+$ and $\bar{x}_-$ are for + and - samples. In other words, for an unknown $\bar{x}$, either eq(6) or eq(7) has to be satisfied for concluding that $\bar{x}$ is + or -. At this point we introduce a little trick of involving a variable $y$ such that $y_i = +1 $ for + samples and $y_i = -1 $ for - samples. Now eq(6) and (7) will become
\[\beta^T.\bar{x}_+ + \beta_0 \geq 1 \rightarrow y_i(\beta^T.\bar{x}_+ + \beta_0) \geq 1\]\[\beta^T.\bar{x}_- + \beta_0 \leq -1 \rightarrow y_i(\beta^T.\bar{x}_- + \beta_0) \geq 1\]
We can see that introducing this trick makes equations the same. So our final constraint will be \[y_i(\beta^T.\bar{x} + \beta_0) - 1 \geq 0 --- (8) \]. We can also derive from eq(8) that \[y_i(\beta^T.\bar{x} + \beta_0) - 1 = 0 --- (9)\] if $x_i$ lies on the separating hyperplanes (Gutter).
Here we redraw the initial samples graph
If $\bar{x}_+$ is a sample exactly on the right-hand gutter of the street and $\bar{x}_-$ one on the left-hand gutter, we can determine the width of the street. \[Width = (\bar{x}_+ - \bar{x}_-).\frac{\beta^T}{\left \| \beta^T \right \|} \] \[Width = \bar{x}_+ .\frac{\beta^T}{\left \| \beta^T \right \|} - \bar{x}_- .\frac{\beta^T}{\left \| \beta^T \right \|} --- (10)\] Now from eq(9), putting $y_i = 1$ for + samples and $y_i=-1$ for - samples and substituting into eq(10) leads to \[Width = \frac{(1-\beta_0)-(-1-\beta_0)}{\left \| \beta^T \right \|} = \frac{2}{\left \| \beta^T \right \|} --- (11)\] In other words, eq(11) is the width of the street, which we plan to maximize. \[max \frac{1}{\left \| \beta^T \right \|} \rightarrow min \left \| \beta^T \right \|\rightarrow min \frac{1}{2} \left \| \beta^T \right \|^2 --- (12)\] and the reason for this conversion is to make (11) mathematically convenient to solve.
Now we can use (12) as the quadratic optimization objective with (9) as the constraints. We know from calculus that Lagrange multipliers are the standard way to solve such a problem. Therefore \[L = \frac{1}{2}\left \| \beta^T \right \|^2 - \sum \alpha_i[y_i(\beta^Tx_i+\beta_0) -1] --- (13)\] Where $\alpha_i$ are the Lagrange multipliers. Now we can find the extremum by calculating the partial derivatives and setting them to zero as follows \[\frac{\partial L}{\partial \beta^T} = \bar{\beta^T} - \sum \alpha_iy_ix_i = 0\] \[\bar{\beta^T} =\sum \alpha_iy_ix_i --- (14)\] Similarly, \[\frac{\partial L}{\partial \beta_0} = - \sum \alpha_iy_i= 0\] \[\sum \alpha_iy_i = 0 --- (15)\] Eq(14) gives us a very interesting result, i.e. $\beta^T$ is actually a linear combination of the samples in the training data set. We can substitute (14) and (15) into (13), which gives \[L = \frac{1}{2} (\sum \alpha_iy_ix_i)(\sum \alpha_jy_jx_j) - (\sum \alpha_iy_ix_i)(\sum \alpha_jy_jx_j) - \sum \alpha_iy_i\beta_0 + \sum\alpha_i\] \[L = \sum\alpha_i - \frac{1}{2} (\sum \alpha_iy_ix_i)(\sum \alpha_jy_jx_j)\] \[L = \sum\alpha_i - \frac{1}{2} \sum \sum \alpha_i\alpha_jy_iy_j(x_i.x_j) --- (16)\] From (16) we can see that the whole expression depends only on the dot product of pairs of samples. We can now use (16) to solve our problem and determine the coefficients of the hyperplane.
Note: The above explanation obtains a hyperplane when the classes in the data are linearly separable. If they are not linearly separable, we use kernel functions to map the data into a higher-dimensional space where they become linearly separable.
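For instance, in (16) the dot product $x_i.x_j$ can simply be replaced by a kernel function $K(x_i, x_j)$ evaluated in the original space. Two commonly used choices, the radial basis function (RBF) kernel and the polynomial kernel (the same ones scikit-learn exposes as kernel='rbf' and kernel='poly', with $\gamma$, $r$ and $d$ as their parameters), are \[K(x_i, x_j) = e^{-\gamma \left \| x_i - x_j \right \|^2} \qquad K(x_i, x_j) = (\gamma \, x_i.x_j + r)^d\]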


3. A detailed example of SVM with implementation in Python using Scikit-learn

We continue to use the Pima Indians Diabetes dataset for this problem as well, so we can compare different model performances later if we want. The data and its meta information are available here. We start by loading the data and splitting it into training and testing sets. In this example we are using scikit-learn, which requires the data in a specific format: the explanatory variables must be provided as a 2D list and the dependent (classification) variable as a 1D list with classes coded as either 1 or 0. The following code does that


import csv
import random

# load the csv file into memory and convert every value to float
def loadData(filename):
    lines = csv.reader(open(filename, "rt"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

# randomly split the dataset and separate the features (X) from the class label (y),
# which is assumed to be the last column of each row
def train_test_data(dataset, splitRatio):
    trainData = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainData:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    X_train = [trainSet[i][:-1] for i in range(len(trainSet))]
    y_train = [trainSet[i][-1] for i in range(len(trainSet))]
    X_test = [copy[i][:-1] for i in range(len(copy))]
    y_test = [copy[i][-1] for i in range(len(copy))]
    return X_train, y_train, X_test, y_test


Now, after splitting the data set, we need two key methods from scikit-learn, i.e. fit() and predict(). I am using the built-in methods from the library for now, but a very good implementation of these methods from scratch is available here. Those interested in the nitty-gritty of the SVM algorithm, especially the quadratic optimization to obtain the optimal weights, should go through it because it is written in a very simple way. But for now let's use the following code from scikit-learn to learn our model and get the results.

from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix

def main():
    filename = 'dataset.csv'
    splitRatio = 0.8  # 80% training and 20% testing
    dataset = loadData(filename)
    X_train, y_train, X_test, y_test = train_test_data(dataset, splitRatio)
    # linear kernel SVM; gamma is only used by the rbf/poly kernels
    model = svm.SVC(kernel='linear', C=1, gamma=1)
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    accuracy = accuracy_score(y_test, predicted)
    confusion = confusion_matrix(y_test, predicted)
    print(accuracy)
    print(confusion)

main()

After fitting the model to learn the decision boundary using the support vectors, we calculate the accuracy of our model on the test data which we split off in the beginning. After several runs, the accuracy of our model is roughly 75% on average with a linear kernel. We can also try kernel='rbf' or kernel='poly', which are the radial basis function and polynomial kernels. Note that the model accuracy will vary a bit on each run because we split the data randomly, so the split will not be the same every time. Secondly, the model accuracy will also differ for different kernels.
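As a quick sketch of that comparison (reusing the loadData() and train_test_data() helpers above and assuming the same dataset.csv file name), we could loop over the kernels on a single split:

from sklearn import svm
from sklearn.metrics import accuracy_score

# compare the accuracies of the three kernels on the same random split
dataset = loadData('dataset.csv')
X_train, y_train, X_test, y_test = train_test_data(dataset, 0.8)
for kernel in ['linear', 'rbf', 'poly']:
    model = svm.SVC(kernel=kernel, C=1, gamma='scale')
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(kernel, acc)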

Naive Bayes implementation in Python from scratch

Naive Bayes (NB) is considered one of the basic algorithms in the class of classification algorithms in machine learning. It is popular because it is not only straightforward but also sometimes produces effective results on hard problems. In this blog, I am trying to explain the NB algorithm from scratch and make it very simple, even for those who have very little background in machine learning. Later, I'll compare the results obtained using our custom algorithm with the ones obtained from scikit-learn.

There are two main classes of classification algorithms, i.e. discriminative and generative. The NB class of algorithms lies in the category of generative learning algorithms. The good thing about generative algorithms is that they are really fast and efficient with a small number of training examples and therefore can be implemented very easily at scale.


Introduction

The basis of the NB algorithm comes from Bayes rule, which is \[Pr(h|D) = \frac{Pr(D|h)Pr(h)}{Pr(D)}\] Where h is the hypothesis and D is the data with one or more features impacting the hypothesis. In simple words, the probability (Pr) of a hypothesis h given data D is determined by the above formula: $Pr(D|h)$ is the likelihood, $Pr(h)$ is the prior and $Pr(h|D)$ is the posterior probability.
In the NB algorithm we make a very strong assumption: that all the features are independent of each other given the class. Based on Bayes rule we can compute the maximum a posteriori (MAP) hypothesis for any data set \[h_{MAP} = argmax_{h\in H}Pr(h|D)\] \[h_{MAP} = argmax_{h\in H}\frac{Pr(D|h)Pr(h)}{Pr(D)} --- (1)\] \[h_{MAP} = argmax_{h\in H}Pr(D|h)Pr(h) --- (2)\]
We can drop Pr(D) because it does not depend on the hypothesis h and therefore does not change the argmax. Further, if the class probabilities Pr(h) are uniformly distributed, the formula can be simplified even more to \[h_{MAP} = argmax_{h\in H}Pr(D|h) --- (3)\]


Understanding Naive Bayes using examples

The above formula can be understood better if we solve a couple of simple examples with the data.

Example 1
Days | Weather | Play?
-----|---------|------
1    | SUNNY   | YES
2    | RAINY   | YES
3    | CLOUDY  | NO
4    | SUNNY   | YES
5    | RAINY   | NO
6    | SUNNY   | YES
7    | SUNNY   | YES
8    | CLOUDY  | NO
9    | SUNNY   | YES
10   | RAINY   | YES


Find → P(Play=YES|RAINY) = P(RAINY|Play=YES) * P(Play=YES) / P(RAINY)
→ P(Play=YES|RAINY) = (2/7 * 7/10) / (3/10) = 0.667
→ P(Play=NO|RAINY) = (1/3 * 3/10) / (3/10) = 0.333

The 2/7 comes from the RAINY days on which play happened (2 of the 7 YES days). The 7/10 is simply the overall probability of play (YES), which is 7 days out of 10. Similarly, the 3/10 is the overall probability of a RAINY day, which is 3 days out of 10. The resulting probability is 0.667, which means there is a 66.7% chance of play given rainy weather.
Now from (2), argmax(0.667, 0.333) = 0.667, which corresponds to the class (Play=YES).
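As a small sketch, the same numbers can be reproduced with plain counting in Python (the weather and play lists below are simply the table above transcribed by hand):

weather = ['SUNNY', 'RAINY', 'CLOUDY', 'SUNNY', 'RAINY',
           'SUNNY', 'SUNNY', 'CLOUDY', 'SUNNY', 'RAINY']
play    = ['YES', 'YES', 'NO', 'YES', 'NO',
           'YES', 'YES', 'NO', 'YES', 'YES']

n = len(play)
p_yes = play.count('YES') / n            # P(Play=YES) = 7/10
p_rainy = weather.count('RAINY') / n     # P(RAINY) = 3/10
p_rainy_given_yes = sum(1 for w, p in zip(weather, play)
                        if w == 'RAINY' and p == 'YES') / play.count('YES')  # 2/7

print(p_rainy_given_yes * p_yes / p_rainy)   # P(Play=YES|RAINY) = 0.667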


Example 2

Let's add some more features to the data that might influence the decision of whether to play or not. We have added 2 more features to the same data, i.e. temperature and windy (the weather column is now labelled outlook).

Days | Outlook | Temperature | Windy | Play?
-----|---------|-------------|-------|------
1    | SUNNY   | HOT         | TRUE  | YES
2    | RAINY   | MILD        | TRUE  | YES
3    | CLOUDY  | HOT         | FALSE | NO
4    | SUNNY   | HOT         | TRUE  | YES
5    | RAINY   | MILD        | FALSE | NO
6    | SUNNY   | HOT         | TRUE  | YES
7    | SUNNY   | COOL        | FALSE | YES
8    | CLOUDY  | HOT         | TRUE  | NO
9    | SUNNY   | COOL        | TRUE  | YES
10   | RAINY   | HOT         | TRUE  | YES

Now suppose that on day 11 we observe the following features: outlook=RAINY, temperature=MILD and windy=TRUE. We have to determine whether there will be play or not.
P(Play=YES | RAINY, MILD, TRUE) ∝ P(Play=YES) * P(RAINY|YES) * P(MILD|YES) * P(TRUE|YES)
P(Play=NO | RAINY, MILD, TRUE) ∝ P(Play=NO) * P(RAINY|NO) * P(MILD|NO) * P(TRUE|NO)

→ 7/10 * 2/7 * 1/7 * 6/7 = 0.0245
→ 3/10 * 1/3 * 1/3 * 1/3 = 0.0111

We have calculated the above scores using the naive independence assumption together with (2). Taking argmax(0.0245, 0.0111) = 0.0245 corresponds to the class (Play=YES).
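Again as a small sketch (with the table hand-transcribed into a list of tuples), the two class scores can be computed directly:

# columns of the table above: outlook, temperature, windy, play
rows = [
    ('SUNNY', 'HOT', 'TRUE', 'YES'), ('RAINY', 'MILD', 'TRUE', 'YES'),
    ('CLOUDY', 'HOT', 'FALSE', 'NO'), ('SUNNY', 'HOT', 'TRUE', 'YES'),
    ('RAINY', 'MILD', 'FALSE', 'NO'), ('SUNNY', 'HOT', 'TRUE', 'YES'),
    ('SUNNY', 'COOL', 'FALSE', 'YES'), ('CLOUDY', 'HOT', 'TRUE', 'NO'),
    ('SUNNY', 'COOL', 'TRUE', 'YES'), ('RAINY', 'HOT', 'TRUE', 'YES'),
]
new_day = ('RAINY', 'MILD', 'TRUE')  # the observation on day 11

for label in ('YES', 'NO'):
    class_rows = [r for r in rows if r[-1] == label]
    score = len(class_rows) / len(rows)        # prior P(label)
    for i, value in enumerate(new_day):        # naive (independent) likelihoods
        score *= sum(1 for r in class_rows if r[i] == value) / len(class_rows)
    print(label, round(score, 4))              # YES 0.0245, NO 0.0111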

A relatively detailed example

Let's try to work with a relatively large, practical and detailed example so we can understand how to implement Naive Bayes in an effective way. For our implementation of the NB algorithm, we'll use the Pima Indians Diabetes data set. From this file you can download the whole data to your local drive. Metadata can be found in this file.
Briefly, the data set contains 768 observations of Pima Indian patients. Each row of the data set contains the following measurements:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

Various types of variables can exist in combination in a data set, but for simplicity our dataset contains only numerical variables leading to one outcome, i.e. whether a patient is diabetic or not.


Python implementation of Naive Bayes Algorithm

Using the above example, we can write a Python implementation of the problem. With real datasets we usually have to work hard on preprocessing first, i.e. cleaning the data so that it makes sense, but in our example we are already provided with a clean data set, which has cut our work by at least 50%. So now, we follow these steps

1. Loading data

We start by loading the data into memory from a csv file. There are several Python packages for this purpose, but we are using the built-in csv package. We iterate through each line in the data file and convert the whole data set into a list of lists. Each sublist is a single line with all features as floats.
The next thing we do straight away is split our data set into a training and a testing set. In this example, our split ratio is 70/30, i.e. 70% of the data will be randomly selected for training the model and 30% for testing it. Following is the code that does this

import csv
import random
import math, sys
# load the data into memory and return a list of lists
def loadData(filename):
    lines = csv.reader(open(filename, "rt"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
# split the data into training and testing
def train_test_data(dataset, splitRatio):
    trainData = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainData:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]


2. Data Summary
We discussed earlier that Naive Bayes is a generative algorithm, so the model consists of per-feature summaries of the data set, which are then used for making predictions for a new data point. For continuous features the summary consists of the mean and standard deviation of each attribute. Remember that the summary will be different for different types of features: for continuous variables the mean and standard deviation are straightforward, while for boolean variables a binomial distribution is used and for categorical variables a multinomial distribution is used, so in those cases the summary is simply the per-category probabilities.

We start by separating the data by class and calculating the summary separately for each class. The statistics_continuous() method calculates the mean and standard deviation of a continuous variable, and the statistics_discrete() method calculates the probability of each category of a categorical variable. After that we summarize the features by class. If there are any categorical variables, we have to specify their column indexes in a list passed to the summarize method. Following is the code that does this

def splitByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

def statistics_continuous(numbers):
    mean = sum(numbers) / float(len(numbers))
    variance = sum([pow(x - mean, 2) for x in numbers]) / float(len(numbers) - 1)
    return [sum(numbers) / float(len(numbers)), math.sqrt(variance)]

def statistics_discrete(feature):
    n = len(feature)
    vals = list(set(feature))
    feature_stats = {}
    for val in vals:
        n_i = feature.count(val)
        p_i = n_i/n # probability for a specific object
        feature_stats[val] = (p_i)
    return feature_stats

def summarize(dataset, categoricalList=[]):
    # transpose rows into columns so each feature can be summarized separately
    unpack_data = [x for x in zip(*dataset)]
    summaries = []
    if categoricalList != []:
        for cat in categoricalList:
            summaries.insert(cat, statistics_discrete(unpack_data[cat]))
    continuous_var = sorted(set(range(len(unpack_data))) - set(categoricalList))
    continuous_data = [x for i, x in enumerate(unpack_data) if i in continuous_var]
    for i in range(len(continuous_data)):
        mean, stdev = statistics_continuous(continuous_data[i])
        summaries.insert(continuous_var[i], (mean, stdev))
    del summaries[-1]  # the decision variable must be the last column of the dataset
    return summaries

def summarizeByClass(dataset, categoricals):
    separated = splitByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        # categoricals is the list of categorical column indexes ([] if there are none)
        summaries[classValue] = summarize(instances, categoricals)
    return summaries
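For example, here is a tiny hypothetical call just to illustrate the categoricalList argument (the toy rows below are made up; for the Pima dataset we simply pass an empty list because all of its features are continuous):

# two features, a continuous one and a categorical one (column index 1),
# followed by the class label in the last column
toy_data = [
    [5.1, 'RED', 1],
    [4.9, 'BLUE', 0],
    [6.0, 'RED', 1],
    [5.5, 'BLUE', 0],
]
print(summarizeByClass(toy_data, [1]))
# per class: a (mean, stdev) tuple for the continuous column and a
# dictionary of category probabilities for the categorical one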



3. Learning the model

After summarizing each class, we have to calculate a Gaussian probability for each attribute of an input vector, using the mean and standard deviation calculated from the training dataset. Gaussian probabilities are calculated using the probability density function (PDF) of the Gaussian distribution. Following is the code that does this

def estimateProbability_gaussian(x, mean, stdev):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent


We’ll then calculate the conditional probability of an instance belonging to a particular class by multiplying together the individual probabilities of each feature. This results in a map of class values to probabilities. The probabilities for categorical variables were calculated earlier in the statistics_discrete() method; for a new input vector we determine the category of the variable and use that category's probability in the product. If a category has not been observed in the training data, its probability could be set to 1/lengthOfTrainingData, but we have used 0.0001 for simplicity. Following is the code that does this

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            try:
                mean, stdev = classSummaries[i]
                x = inputVector[i]
                probabilities[classValue] *= estimateProbability_gaussian(x, mean, stdev)
            except ValueError:
                category = inputVector[i]
                try:
                    probability_categorical = classSummaries[i][category]
                    probabilities[classValue] *= probability_categorical
                except KeyError:  # if the category has not been observed in the training data
                    probabilities[classValue] *= 0.0001 
    return probabilities

After calculating the class probabilities, predict() picks the class with the highest probability for an input vector and classifies the vector into that class. We then use getPredictions() to record the classifications of all test vectors in a list. Following is the code that does this

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel


def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions


4. Accuracy

The final step when learning any machine learning model is to determine its accuracy, and it is no different for Naive Bayes. We'll compare the results from getPredictions() with the original testing data and determine how well our model has performed. Following is the code that does this

def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0
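To tie all the pieces together, here is a minimal driver sketch. It assumes the Pima data has been saved as dataset.csv (the file name is just a placeholder) and passes an empty categoricals list because all columns in this dataset are continuous:

def main():
    dataset = loadData('dataset.csv')            # hypothetical file name
    trainSet, testSet = train_test_data(dataset, 0.7)
    summaries = summarizeByClass(trainSet, [])   # no categorical columns in this dataset
    predictions = getPredictions(summaries, testSet)
    print('Accuracy: {0}%'.format(getAccuracy(testSet, predictions)))

main()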


Conclusion

There are many Naive Bayes implementations, but I have added the very common scenario of working with categorical features. I did not find enough examples like this, so I thought I should add one. This allows the reader to use continuous, boolean or categorical variables in the feature set by simply listing the boolean or categorical feature columns.


Note: If you want to understand the above implementation and run the code yourself, then download the data from the link above. After that use this github link to clone the repository on your computer and paste the data file in the same repository. 

Logistic Regression implementation in Python from scratch




Logistic regression is another classification algorithm used in machine learning which is straightforward and efficient. Unlike linear regression, it has a binary or categorical dependent variable. In this article we'll cover the case where the dependent variable is binary; for cases where the dependent variable has more than two categories, multinomial logistic regression is used, which is out of scope for now.


Similar to the Naive Bayes algorithm, logistic regression can take continuous and categorical variables as input and outputs the probability of an input vector belonging to a particular class.

So for understanding the logistic regression we first solve the problem by hand (i.e. solve it mathematically) and then write the Python implementation.

1. Why Logistic Regression
We start by looking at a very basic example of a tumor being malignant or not. The outcome is in binary format, i.e. 1 if the tumor is malignant and 0 if it is benign. Now if we fit a straight line using linear regression and set a threshold at point M, we see that the results are not bad at all. In fact, in this case it works quite decently and allows us to determine whether the tumor is malignant or not based on tumor size: tumor size > M is classified as malignant and tumor size < M as benign. This is a very simple example just to show the reader that linear regression is not completely wrong for fitting a line to a categorical dependent variable.

Now let's extend this example by adding one more data point, as shown below, and then fit a line to the data and set the threshold to M. The results are now quite bad: we can see from the graph that the fitted line wrongly classifies the tumor in some cases.

One more problem usually arises when fitting a line to such data: the output we observe in the training data is either 0 or 1, so we would like the model output to stay between 0 and 1. With a straight line that is not the case, and we can get outputs greater than 1 (or less than 0) for extreme tumor sizes, which is not desired. For these reasons, we need a model that fits this kind of data well.

If we look at some mathematical functions, we'll realize that the "sigmoid function" or "logistic function" below solves both of our problems, i.e. it fits data whose output variable is either 0 or 1 and produces a probability value for each data point. Secondly, it is asymptotic at 0 and 1, which makes sure the resulting value never exceeds these limits.

In machine learning, the method that fits the logistic function is called logistic regression. Fitting a logistic function has some other advantages as well, such as continuity and easy derivatives. So we see why we need logistic regression to solve a classification problem.

High level overview

After realizing the necessity of the logistic function, we'll now see a high level overview of the problem. The logistic function can be written as \[P = \frac{1}{1+e^{-X}}\] If we write it in terms of the linear model, it becomes \[P = \frac{1}{1+e^{-\theta^TX}}\] Here $\theta^TX$ is the dot product of the parameter vector $(\theta)$ and the input vector $X$. Input vectors are given to us (both for training and testing), and the $\theta$ vector is estimated using the training set of that particular data set. $\theta$ can be optimized using various optimization methods, but generally we use the gradient descent algorithm to do this. The complete process of optimizing $\theta$ is explained in the next section. We can see how easy the method becomes once $\theta$ is optimized: we just need to calculate the dot product between the new input vector and $\theta$, then plug that value into the logistic function to calculate the probability for X.

Cost function optimization for optimal parameters


An integral objective in learning algorithms is to optimize the parameters according to the dataset. For this, cost functions are used. If we remember from linear regression, the cost function is simply the sum of squared errors, which needs to be minimized. Following is the cost function for linear regression \[J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)^2\] From the function we can see that it is a simple quadratic (convex) function and can be minimized easily using gradient descent because its local minimum is also the global minimum. On the other hand, if we used the same cost function for logistic regression, we would end up with a non-convex function which is not guaranteed to reach the global optimum using gradient descent, and therefore we would not be certain whether our trained parameters are optimal or not.

For this reason, the following cost function is used to optimize the parameters. The idea behind it is quite simple: if we plot it, we see that it is convex, so it can be optimized using gradient descent with a guarantee of reaching the global optimum.

The above formula can be written in the simplified per-example form \[J(\theta) = -ylog(h_\theta(x))-(1-y)log(1- h_\theta(x))\] and summing it over all data points in a training set of size m gives the total cost, which is then minimized, $min_\theta J(\theta)$, to get the optimal parameter vector $\theta$. \[J(\theta) = \frac{-1}{m}\left[\sum_{i=1}^{m}y^ilog(h_\theta (x^i)) + (1-y^i)log(1-h_\theta(x^i)) \right]\]
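For the gradient descent step further below, we will also need the partial derivative of this cost with respect to each parameter $\theta_j$, which works out to \[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}(h_\theta (x^i)-y^i)x_j^i\]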

A detailed implementation for logistic regression in Python

We start by loading the data from a csv file. The data is quite simple, with only a couple of independent variables, so that we can better understand the example; later we can apply the same implementation to more complex datasets. The data consists of two grades, i.e. grade1 and grade2, which determine whether a student is admitted or not, i.e. admitted=1, not admitted=0.
We are using the pandas library to handle the data in dataframes, which is fast and efficient. First, we define a dataframe (df) from the csv file. Then we apply a few preprocessing steps to clean and transform the data. Data cleaning and transformation are done using pandas and scikit-learn, which can be understood from here. Now we have the independent variables separated as X and the dependent variable separated as Y.
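Since the original code listing is not reproduced here, the following is only a minimal sketch of this step. The file name grades.csv, the column names grade1, grade2 and admitted, and the use of scikit-learn's MinMaxScaler for the transformation are assumptions chosen to match the description above:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('grades.csv')                              # hypothetical file/column names
df = df.dropna()                                            # drop incomplete rows

scaler = MinMaxScaler()
X = scaler.fit_transform(df[['grade1', 'grade2']].values)   # features scaled to [0, 1]
X = np.column_stack((np.ones(len(X)), X))                   # prepend a column of 1s for the intercept term
Y = df['admitted'].values                                   # binary labels (0/1)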


We then introduce the logistic (sigmoid) function along with the hypothesis, which applies it to $\theta^TX$, the dot product of the parameter vector $\theta$ and the input vector $X$.
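A minimal sketch of these two functions (written with numpy so the hypothesis is evaluated for every row of X at once; the function names are my own):

import numpy as np

def sigmoid(z):
    # logistic function: maps any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = sigmoid(theta^T x), computed for all rows of X at once
    return sigmoid(np.dot(X, theta))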

After this, we implement the cost function we discussed above in detail. In short, the cost function calculates $h_\theta(x^i)$ and returns the average cost over the data points, putting a large penalty on data points assigned to the wrong class (in our case Y=1 or 0). If Y=1, the second term of the cost function is zero and only the first term penalizes the prediction, and vice versa.
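Continuing the sketch (reusing the hypothesis() helper above), the cost function could look like this:

def cost_function(theta, X, Y):
    # J(theta) = -1/m * sum( y*log(h) + (1-y)*log(1-h) )
    m = len(Y)
    h = hypothesis(theta, X)
    return (-1.0 / m) * np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))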

Now, in order to apply the gradient descent algorithm, we need the partial derivatives of the cost function with respect to each $\theta_j$. For this reason we implement the following derivative (gradient) of the cost function.
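A sketch of that gradient, using the partial derivative formula given earlier (all components are computed at once with a matrix product):

def cost_gradient(theta, X, Y):
    # dJ/dtheta_j = 1/m * sum_i (h_theta(x^i) - y^i) * x_j^i, for every j at once
    m = len(Y)
    return (1.0 / m) * np.dot(X.T, hypothesis(theta, X) - Y)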

Next we implement gradient descent, which uses the partial derivative function above and updates the parameters over a fixed number of iterations. The main idea behind gradient descent can be understood from this animation.
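Here is one way to sketch a single gradient descent step (alpha is the learning rate; the fixed number of iterations is handled by the training loop in the next sketch):

def gradient_descent(theta, X, Y, alpha):
    # one update: move theta a small step against the gradient of the cost
    return theta - alpha * cost_gradient(theta, X, Y)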


After implementing gradient descent, we are all set to implement the main logistic regression method, which optimizes the $\theta$ vector by applying gradient descent to minimize the cost function $J(\theta)$. For a fixed number of iterations we run gradient descent and check the cost function value.
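A sketch of that training loop, with alpha and num_iters as hypothetical default hyperparameters, printing the cost every so often so we can watch it decrease:

def logistic_regression(X, Y, alpha=0.1, num_iters=1000):
    theta = np.zeros(X.shape[1])                 # initial parameter vector
    for i in range(num_iters):
        theta = gradient_descent(theta, X, Y, alpha)
        if i % 100 == 0:                         # check the cost function value periodically
            print('iteration', i, 'cost', cost_function(theta, X, Y))
    return theta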


Now here is the main method, which sets the initial parameters and runs the methods implemented above. The complete implementation, with the data to test the algorithm, is available here on github.
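A sketch of such a main method, using the X and Y produced by the pandas preprocessing step and the functions sketched above (classifying with a 0.5 probability threshold is my choice here):

def main():
    theta = logistic_regression(X, Y, alpha=0.1, num_iters=1000)
    predictions = (hypothesis(theta, X) >= 0.5).astype(int)   # threshold the probabilities
    print('training accuracy:', np.mean(predictions == Y))

main()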