Support Vector Machines - Understanding the theory and implementation in Python


Support Vector Machine (SVM) is a classification algorithm which, in its basic form, separates the data into two classes depending on the problem definition. There are many classification algorithms, including Naive Bayes, logistic regression and neural networks, but SVM is one of the more sophisticated methods and a must-have tool in a data scientist's toolkit.
Today, we'll first look at SVM from a mathematical perspective to better understand what exactly is going on, and then we'll solve a classification problem on our usual Pima Indians Diabetes dataset and measure the model's accuracy.


1. General Overview
Let's consider a space which consists of negative (-) and positive (+) examples. How can we separate the + and - examples using just a straight line (a hyperplane)? For example, let us look at the following graph.

From the above graph we can see that the space can be separated by many different hyperplanes. The question now is: which hyperplane separates the classes in the data in an optimal way?
Fortunately, we have a very decent answer to this question, given to us by the discoverers of this algorithm: choose the hyperplane that separates the two classes while keeping the widest possible margin to the closest data point(s) of each class. Confusing? Let's look at the following graph to understand what I just said.
As discussed, the optimal hyperplane can be represented in an infinite number of ways: even once the direction of the hyperplane is fixed, we can still scale its coefficients by any magnitude. From these many possibilities, let's choose the canonical scaling in which the training examples closest to the optimal hyperplane satisfy
\[\left | \beta_0 + \beta^Tx \right | = 1 --- (1)\]
Where $x$ are the training examples, $\beta^{T}$ is the weight vector and $\beta_{0}$ is the intercept (or bias). The training examples that are closest to the hyperplane are called support vectors. We want to optimize the distance (margin) between the support vectors and the hyperplane, which is computed using the basic formula for the distance from a point to a line \[distance = \frac{\left | \beta_0 + \beta^Tx \right |}{\left \| \beta \right \|} --- (2)\]
From eq(1) and eq(2) \[distance = \frac{1}{\left \| \beta \right \|} --- (3)\] Note that this distance is only from one point (the closest point of one class) to the hyperplane. In order to calculate the total distance between the closest points of the two classes we multiply it by 2 and get \[distance(M) = \frac{2}{\left \| \beta \right \|} --- (4)\]
Where M is the margin, i.e. the perpendicular distance between the closest + and - examples. Looking at eq(4), minimizing $\left \| \beta \right \|$ leads to the widest margin, which is exactly our objective. So eq(4) can be rewritten as the following quadratic optimization problem
\[\min_{\beta_0,\beta^T} L(\beta) = \frac{1}{2}\left \| \beta \right \|^2\] subject to $y_i (\beta_0+\beta^Tx_i) \geq 1 --- (5)$

Where $y_i$ are the labels of the training examples, taken as +1 and -1 for the two classes. This problem can be solved using Lagrange multipliers, which gives us the optimal $\beta_0$ and $\beta^T$.
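To see the margin from eq(4) on actual numbers, here is a minimal sketch of my own (the toy points are made up for illustration) that fits a linear SVM with scikit-learn and computes $M = 2/\left \| \beta \right \|$ from the learned coefficients.

import numpy as np
from sklearn import svm

# Tiny, linearly separable toy data (made up for illustration)
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],   # class -1
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# Linear SVM with a large C so the solution is close to the hard-margin one
clf = svm.SVC(kernel='linear', C=1e6)
clf.fit(X, y)

beta = clf.coef_[0]         # weight vector (beta^T in the text)
beta_0 = clf.intercept_[0]  # bias (beta_0 in the text)

# Margin from eq(4): M = 2 / ||beta||
margin = 2.0 / np.linalg.norm(beta)
print("beta:", beta, "beta_0:", beta_0)
print("margin:", margin)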


2. Advanced Mathematical Concepts
This section is optional for those who do not want to go into the mathematical details of SVMs. Since I am explaining the above concepts in more depth, a few ideas from the previous section will be repeated, so don't mind the repetition. Let's start by assuming the following scenario, where $\bar{x}$ is an unknown vector and $\beta^T$ is the weight vector perpendicular to the median line of the separating street, such that
If $\beta^T.\bar{x} \geq c$ for some constant $c$, then the sample is +, otherwise -; setting $c = -\beta_0$, this is the same as saying $\beta_0+\beta^T\bar{x} \geq 0$ for positive samples. At this point we do not know which $\beta^T$ and $\beta_0$ to choose. We know that $\beta^T$ has to be perpendicular to the median line, but there are many such $\beta^T$ because $\beta^T$ can be of any length.
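As a quick illustration (my own addition, with arbitrary placeholder values for $\beta^T$ and $\beta_0$), the decision rule above is just the sign of $\beta_0 + \beta^T.\bar{x}$:

import numpy as np

# Hypothetical, arbitrary hyperplane parameters (for illustration only)
beta = np.array([0.5, -1.0])  # beta^T
beta_0 = 0.25                 # beta_0

def classify(x_bar):
    """Return +1 if beta_0 + beta^T . x_bar >= 0, otherwise -1."""
    return 1 if beta_0 + beta @ x_bar >= 0 else -1

print(classify(np.array([3.0, 1.0])))   # +1 for this choice of beta, beta_0
print(classify(np.array([0.0, 2.0])))   # -1 for this choice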

To move forward, we introduce the following constraints in order to solve our problem:

\[\beta^T.\bar{x}_+ + \beta_0 \geq 1 --- (6) \]
\[\beta^T.\bar{x}_- + \beta_0 \leq -1 --- (7) \]
Where $\bar{x}_+$ and $\bar{x}_-$ are + and - samples respectively. In other words, for an unknown $\bar{x}$, either eq(6) or eq(7) has to be satisfied to conclude whether $\bar{x}$ is + or -. At this point we introduce a little trick by defining a variable $y$ such that $y_i = +1$ for + samples and $y_i = -1$ for - samples. Now eq(6) and eq(7) become
\[\beta^T.\bar{x}_+ + \beta_0 \geq 1 \rightarrow y_i(\beta^T.\bar{x}_+ + \beta_0) \geq 1\]\[\beta^T.\bar{x}_- + \beta_0 \leq -1 \rightarrow y_i(\beta^T.\bar{x}_- + \beta_0) \geq 1\]
We can see that introducing this trick makes the two equations identical. So our final constraint will be \[y_i(\beta^T.\bar{x} + \beta_0) - 1 \geq 0 --- (8) \] In addition, \[y_i(\beta^T.\bar{x} + \beta_0) - 1 = 0 --- (9)\] holds when $x_i$ lies exactly on one of the margin boundaries (the gutter).
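As a small sanity check (not part of the original derivation), the following sketch evaluates constraint (8) for a toy set of labelled points and a hypothetical choice of $\beta^T$ and $\beta_0$:

import numpy as np

# Toy labelled samples (made up for illustration); y_i is +1 or -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# Hypothetical hyperplane parameters
beta = np.array([1.0, 1.0])
beta_0 = -1.0

# Constraint (8): y_i * (beta^T . x_i + beta_0) - 1 >= 0 for every sample
margins = y * (X @ beta + beta_0) - 1
print(margins)               # [2. 4. 3. 5.] for these values
print(np.all(margins >= 0))  # True if the hyperplane satisfies eq(8)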
Here we redraw the initial samples graph
If $\bar{x}_+$ is a sample exactly on the right edge of the street and $\bar{x}_-$ one exactly on the left edge, we can determine the width of the street by projecting their difference onto the unit normal of the hyperplane. \[Width = (\bar{x}_+ - \bar{x}_-).\frac{\beta^T}{\left \| \beta^T \right \|} \] \[Width = \bar{x}_+ .\frac{\beta^T}{\left \| \beta^T \right \|} - \bar{x}_- .\frac{\beta^T}{\left \| \beta^T \right \|} --- (10)\] Now from eq(9), putting $y_i = 1$ for the + sample gives $\beta^T.\bar{x}_+ = 1-\beta_0$ and putting $y_i = -1$ for the - sample gives $\beta^T.\bar{x}_- = -1-\beta_0$, so eq(10) becomes \[Width = \frac{(1-\beta_0)-(-1-\beta_0)}{\left \| \beta^T \right \|} = \frac{2}{\left \| \beta^T \right \|} --- (11)\] In other words, eq(11) is the width of the street, which we plan to maximize. \[max \frac{1}{\left \| \beta^T \right \|} \rightarrow min \left \| \beta^T \right \|\rightarrow min \frac{1}{2} \left \| \beta^T \right \|^2 --- (12)\] The reason for this last conversion is to make the problem mathematically convenient to solve.
Now we can use (12) as the quadratic objective function with (9) as the constraints. We know from calculus that Lagrange multipliers are the standard way to solve such a problem. Therefore \[L = \frac{1}{2}\left \| \beta^T \right \|^2 - \sum \alpha_i[y_i(\beta^Tx_i+\beta_0) -1] --- (13)\] Where $\alpha_i$ are the Lagrange multipliers. Now we can find the extremum by computing the partial derivatives and setting them to zero, as follows \[\frac{\partial L}{\partial \beta^T} = \bar{\beta^T} - \sum \alpha_iy_ix_i = 0\] \[\bar{\beta^T} =\sum \alpha_iy_ix_i --- (14)\] Similarly, \[\frac{\partial L}{\partial \beta_0} = - \sum \alpha_iy_i= 0\] \[\sum \alpha_iy_i = 0 --- (15)\] Eq(14) gives us a very interesting result, i.e. $\beta^T$ is actually a linear combination of the samples in the training data set. We can substitute (14) and (15) into (13), which gives \[L = \frac{1}{2} (\sum \alpha_iy_ix_i)(\sum \alpha_jy_jx_j) - (\sum \alpha_iy_ix_i)(\sum \alpha_jy_jx_j) - \sum \alpha_iy_i\beta_0 + \sum\alpha_i\] \[L = \sum\alpha_i - \frac{1}{2} (\sum \alpha_iy_ix_i)(\sum \alpha_jy_jx_j)\] \[L = \sum\alpha_i - \frac{1}{2} \sum \sum \alpha_i\alpha_jy_iy_j(x_i.x_j) --- (16)\] From (16) we can see that the whole expression depends on the samples only through the dot products of pairs of samples. We can now solve (16) to determine the coefficients of the hyperplane.
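To connect eq(14) with the scikit-learn model used later in this post, here is a sketch of my own (with made-up toy data) that reconstructs $\beta^T = \sum \alpha_iy_ix_i$ from a fitted linear SVC; scikit-learn's dual_coef_ attribute stores the products $\alpha_iy_i$ for the support vectors, so the sum runs only over them.

import numpy as np
from sklearn import svm

# Small toy dataset (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],
              [5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# Eq(14): beta^T = sum_i alpha_i * y_i * x_i, summed over the support vectors
beta_from_dual = clf.dual_coef_ @ clf.support_vectors_

print(beta_from_dual)   # should match...
print(clf.coef_)        # ...the weight vector scikit-learn reports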
Note: The above explanation covers the case where the classes in the data are linearly separable. If they are not linearly separable, we use kernel functions, which implicitly map the data into a higher-dimensional space where it becomes (closer to) linearly separable. Since eq(16) depends on the data only through dot products, a kernel simply replaces $x_i.x_j$ with $K(x_i, x_j)$.
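For illustration (not from the original text), here is a sketch of the RBF kernel $K(x,z) = exp(-\gamma\left \| x-z \right \|^2)$, one of the kernels we try later with kernel='rbf', computed by hand and checked against scikit-learn's rbf_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma=1.0):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Two arbitrary points for illustration
x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

print(rbf(x, z, gamma=0.5))
print(rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=0.5))  # same value, as a 1x1 array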


3. A detailed example of SVM with implementation in Python using Scikit-learn

We continue to use the Pima Indians Diabetes dataset for this problem as well, so we can compare the performance of different models later if we want. The data and its meta information are available here. We start by loading the data and splitting it into training and testing sets. Since we are using scikit-learn in this example, the data must be provided in a specific format: the explanatory variables as a 2D array-like (a list of rows) and the dependent (classification) variable as a 1D list with classes coded as either 1 or 0. The following code does that:


import csv
import random

def loadData(filename):
    # Read the CSV file and convert every value to float
    lines = csv.reader(open(filename, "rt"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def train_test_data(dataset, splitRatio):
    # Randomly move splitRatio of the rows into the training set;
    # whatever is left over becomes the test set
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    # The last column is the class label; the rest are explanatory variables
    X_train = [row[:-1] for row in trainSet]
    y_train = [row[-1] for row in trainSet]
    X_test = [row[:-1] for row in copy]
    y_test = [row[-1] for row in copy]
    return X_train, y_train, X_test, y_test
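For clarity, here is a quick hypothetical usage of the two helpers above, assuming the file is called dataset.csv (as in main() below) and contains only numeric columns with the class label in the last column:

# Hypothetical usage of the helpers above
dataset = loadData('dataset.csv')                 # list of rows of floats
X_train, y_train, X_test, y_test = train_test_data(dataset, 0.8)

print(len(X_train), len(X_test))   # roughly an 80/20 split of the rows
print(X_train[0], y_train[0])      # one feature row and its 0/1 label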


Now, after splitting the data set, we need two key scikit-learn methods, fit() and predict(). I am using the built-in methods from the library here, but a very good implementation of these methods from scratch is available here. Those who are interested in the nitty-gritty of the SVM algorithm, especially the quadratic optimization used to obtain the optimal weights, should go through it because it is written in a very simple way. For now, let's use the following scikit-learn code to train our model and get the results.

from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix

def main():
    filename = 'dataset.csv'
    splitRatio = 0.8  # 80% training and 20% testing
    dataset = loadData(filename)
    X_train, y_train, X_test, y_test = train_test_data(dataset, splitRatio)
    # gamma only affects the 'rbf', 'poly' and 'sigmoid' kernels;
    # it is ignored when kernel='linear'
    model = svm.SVC(kernel='linear', C=1, gamma=1)
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    accuracy = accuracy_score(y_test, predicted)
    confusion = confusion_matrix(y_test, predicted)
    print(accuracy)
    print(confusion)

if __name__ == '__main__':
    main()

After fitting the model to learn the decision boundary from the support vectors, we calculate the accuracy of our model on the test data that we split off at the beginning. Over several runs, the accuracy with a linear kernel is roughly 75% on average. We can also try kernel='rbf' or kernel='poly', which are the radial basis function and polynomial kernels. Note that the accuracy will vary a bit from run to run because we split the data using random sampling, which is not the same every time. The accuracy will also differ between kernels.
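If you want to compare the kernels mentioned above side by side, a small sketch along the following lines reuses the helpers defined earlier; seeding random is optional but makes the split, and therefore the numbers, reproducible.

import random
from sklearn import svm
from sklearn.metrics import accuracy_score

random.seed(42)  # optional: fix the split so results are reproducible
dataset = loadData('dataset.csv')
X_train, y_train, X_test, y_test = train_test_data(dataset, 0.8)

# Try the linear, RBF and polynomial kernels on the same split
for kernel in ('linear', 'rbf', 'poly'):
    model = svm.SVC(kernel=kernel, C=1)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(kernel, acc)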
