Support Vector Machine (SVM) is a classification algorithm that separates the data, generally into two classes, depending on the problem definition. There are many classification algorithms, including Naive Bayes, logistic regression, and neural nets, but SVM is one of the more sophisticated methods and a must-have tool in a data scientist's toolkit. Today, we'll first talk about the mathematical perspective of SVM to better understand what exactly is going on, and then we'll solve a classification problem with our conventional Pima Indians Diabetes dataset and determine our model accuracy.
1. General Overview
Let's consider that we have a space consisting of negative (-) and positive (+) examples. How can we separate the + and - examples using just a straight line (hyperplane)? For example, let us look at the following graph.
From the above graph we can see that the space can be separated by various hyperplanes. But the question now is which hyperplane to choose to separate the classes in the data in an optimal way. Fortunately, we have a very decent answer to this question, given to us by the discoverer of this algorithm: choose the hyperplane that separates the two classes while keeping the farthest possible margin from the closest data point(s) of each class. Confusing? Let's look at the following graph to understand what I just said.
As discussed, the optimal hyperplane can be represented in an infinite number of ways: even once we fix its direction, we can still scale its coefficients by any magnitude. Out of the many possibilities, let's fix the scaling so that, for the training examples closest to the optimal hyperplane,
\[\left | \beta_0 + \beta^T x \right | = 1 --- (1)\]
where x denotes such a training example, $\beta^{T}$ is the weight vector, and $\beta_{0}$ is the intercept (or bias). The training examples that are closest to the hyperplane are called support vectors. So generally, we maximize the distance (margin) between the support vectors, which is computed using the basic formula for the distance from a point to a line (hyperplane)
\[distance = \frac{\left | \beta_0 + \beta^T x \right |}{\left \| \beta \right \|} --- (2)\]
Here $y_i$ denotes the label of training example $x_i$, which in this formulation is +1 or -1; these labels will appear in the constraints below. The resulting optimization problem can be solved using Lagrange multipliers, which give us the optimal $\beta_0$ and $\beta^T$.
From eq(1) and eq(2),
\[distance = \frac{1}{\left \| \beta \right \|} --- (3)\]
Note that this distance is only from one point (a point of one class) to the hyperplane. In order to calculate the total distance between the closest points of the two classes we can multiply it by 2, and we get \[distance(M) = \frac{2}{\left \| \beta \right \|} --- (4)\]
where M is the margin, i.e. the perpendicular distance between the closest + and - examples. Looking at eq(4), minimizing $\left \| \beta \right \|$ leads to the widest margin, and that is exactly our objective. So eq(4) can be rewritten as the following constrained optimization problem
\[min_{\beta_0,\beta^T} L(\beta) = \frac{1}{2}\left \| \beta \right \|^2\]
subject to $y_i (\beta_0+\beta^Tx_i) \geq 1 --- (5)$
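To connect eq(4) and eq(5) to something concrete before we go on, here is a minimal sketch (not part of the original derivation; the toy data is made up purely for illustration) that fits a linear SVM with scikit-learn and reads the learned $\beta^T$ and $\beta_0$ back out to compute the margin $2/\left \| \beta \right \|$:

import numpy as np
from sklearn import svm

# Tiny, linearly separable toy data (made up for illustration only)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = svm.SVC(kernel='linear', C=1e6)   # very large C behaves like a hard margin
model.fit(X, y)

beta = model.coef_[0]          # the learned weight vector (beta^T in the text)
beta_0 = model.intercept_[0]   # the learned bias (beta_0 in the text)

margin = 2.0 / np.linalg.norm(beta)       # eq(4): M = 2 / ||beta||
print("beta:", beta, "beta_0:", beta_0, "margin:", margin)

# eq(5): every training example should satisfy y_i (beta_0 + beta^T x_i) >= 1
print(y * (X @ beta + beta_0))            # all values >= 1, up to solver tolerance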
2. Advanced Mathematical Concepts
This section is optional for those who do not want to go into the mathematical details of SVMs. Because I am trying to explain the above concepts in more detail, I'll repeat a few points already made, so don't mind the repetition. Let's start by assuming that we have the following scenario, where $\bar{x}$ is an unknown vector and $\beta^T$ is the weight vector perpendicular to the median line, such that
if $\beta^T\cdot\bar{x} \geq c$ for some constant $c$, then the sample is +, otherwise -. Writing $\beta_0 = -c$, this is the same as saying $\beta_0+\beta^T\bar{x} \geq 0$ for positive samples. At this point we do not know which $\beta^T$ and $\beta_0$ to choose. We only know that $\beta^T$ has to be perpendicular to the median line, but there are many such vectors because $\beta^T$ can be of any length.
We move forward by introducing the following constraints in order to solve our problem
\[\beta^T\cdot\bar{x}_+ + \beta_0 \geq 1 --- (6) \]
\[\beta^T\cdot\bar{x}_- + \beta_0 \leq -1 --- (7) \]
We now introduce labels $y_i = +1$ for positive samples and $y_i = -1$ for negative samples. Multiplying each constraint by its label gives
\[\beta^T\cdot\bar{x}_+ + \beta_0 \geq 1 \rightarrow y_i(\beta^T\cdot\bar{x}_+ + \beta_0) \geq 1\]
\[\beta^T\cdot\bar{x}_- + \beta_0 \leq -1 \rightarrow y_i(\beta^T\cdot\bar{x}_- + \beta_0) \geq 1\]
We can see that introducing this trick makes both equations the same. So our final constraint is \[y_i(\beta^T\cdot\bar{x}_i + \beta_0) - 1 \geq 0 --- (8) \] From eq(8), equality holds, i.e. \[y_i(\beta^T\cdot\bar{x}_i + \beta_0) - 1 = 0 --- (9)\] exactly when $x_i$ lies on one of the separating hyperplanes (the gutters).
Here we redraw the graph of the initial samples.
If $\bar{x}_+$ is a sample exactly on the right-hand gutter of the street and $\bar{x}_-$ one exactly on the left-hand gutter, we can determine the width of the street.
\[Width = (\bar{x}_+ - \bar{x}_-)\cdot\frac{\beta^T}{\left \| \beta^T \right \|} \]
\[Width = \bar{x}_+ \cdot\frac{\beta^T}{\left \| \beta^T \right \|} - \bar{x}_- \cdot\frac{\beta^T}{\left \| \beta^T \right \|} --- (10)\]
Now from eq(9), putting $y_i = 1$ for + samples gives $\beta^T\cdot\bar{x}_+ = 1-\beta_0$, and putting $y_i=-1$ for - samples gives $\beta^T\cdot\bar{x}_- = -1-\beta_0$. Substituting these into eq(10) leads to
\[Width = \frac{(1-\beta_0)-(-1-\beta_0)}{\left \| \beta^T \right \|} = \frac{2}{\left \| \beta^T \right \|} --- (11)\]
In other words, eq(11) is the width of the street, which we plan to maximize.
\[max \frac{1}{\left \| \beta^T \right \|} \rightarrow min \left \| \beta^T \right \|\rightarrow min \frac{1}{2} \left \| \beta^T \right \|^2 --- (12)\]
The reason for this conversion is to make (11) mathematically convenient to work with.
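As a quick sanity check with made-up numbers (not taken from any dataset): suppose the learned parameters were $\beta^T = (3, 4)$ and $\beta_0 = -5$. Then
\[\left \| \beta^T \right \| = \sqrt{3^2 + 4^2} = 5, \qquad Width = \frac{2}{\left \| \beta^T \right \|} = \frac{2}{5} = 0.4\]
so the two gutters $3x_1 + 4x_2 - 5 = \pm 1$ are 0.4 apart, as eq(11) predicts.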
Now we can use (12) as the quadratic objective function with (8) as the constraints. We know from calculus that Lagrange multipliers are the standard way to solve such a problem. Therefore
\[L = \frac{1}{2}\left \| \beta^T \right \|^2 - \sum \alpha_i[y_i(\beta^Tx_i+\beta_0) -1] --- (13)\]
where $\alpha_i$ are the Lagrange multipliers. Now we can find the extremum by taking partial derivatives and setting them to zero as follows
\[\frac{\partial L}{\partial \beta^T} = \bar{\beta^T} - \sum \alpha_iy_ix_i = 0\]
\[\bar{\beta^T} =\sum \alpha_iy_ix_i --- (14)\]
Similarly,
\[\frac{\partial L}{\partial \beta_0} = - \sum \alpha_iy_i= 0\]
\[\sum \alpha_iy_i = 0 --- (15)\]
Eq(14) gives us a very interesting result, i.e. $\beta^T$ is actually a linear combination of the samples in the training data set. We can substitute (14) and (15) into (13), which makes it
\[L = \frac{1}{2} (\sum \alpha_iy_ix_i)\cdot(\sum \alpha_jy_jx_j) - (\sum \alpha_iy_ix_i)\cdot(\sum \alpha_jy_jx_j) - \sum \alpha_iy_i\beta_0 + \sum\alpha_i\]
\[L = \sum\alpha_i - \frac{1}{2} (\sum \alpha_iy_ix_i)\cdot(\sum \alpha_jy_jx_j)\]
\[L = \sum\alpha_i - \frac{1}{2} \sum \sum \alpha_i\alpha_jy_iy_j(x_i\cdot x_j) --- (16)\]
From (16) we can see that this whole, seemingly complex expression depends only on the dot products of pairs of samples. We can now maximize (16) with respect to the $\alpha_i$ to solve our problem and determine the coefficients of the hyperplane.
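Eq(14) can actually be checked numerically. After fitting a linear SVM, scikit-learn exposes the products $\alpha_i y_i$ for the support vectors as dual_coef_ and the support vectors themselves as support_vectors_, so summing $\alpha_i y_i x_i$ should reproduce the weight vector. A minimal sketch, reusing the made-up toy data from the earlier snippet:

import numpy as np
from sklearn import svm

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = svm.SVC(kernel='linear', C=1e6)
model.fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so this is eq(14)
beta_from_dual = model.dual_coef_ @ model.support_vectors_
print(beta_from_dual)            # matches model.coef_
print(model.coef_)

# eq(15): the alpha_i * y_i must sum to zero
print(model.dual_coef_.sum())    # ~ 0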
Note: The above derivation gives a hyperplane when the classes in the data are linearly separable. If they are not linearly separable, we use kernel functions, which implicitly map the data into a space where it becomes linearly separable.
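To illustrate that last point with generated data (again, not the Pima dataset): two concentric rings cannot be split by a straight line, so a linear kernel performs at roughly chance level while an RBF kernel handles them easily.

from sklearn import svm
from sklearn.datasets import make_circles

# Two concentric rings: not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_model = svm.SVC(kernel='linear', C=1).fit(X, y)
rbf_model = svm.SVC(kernel='rbf', C=1, gamma='scale').fit(X, y)

# Training accuracy is enough to make the point here
print("linear kernel:", linear_model.score(X, y))  # roughly 0.5 (chance level)
print("rbf kernel:", rbf_model.score(X, y))        # close to 1.0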
3. A detailed example of SVM with implementation in Python using Scikit-learn
We continue to use the Pima Indians Diabetes dataset for this problem as well, so we can compare different model performances later if we want. The data and its meta information are available here. We start by loading the data and splitting it into training and testing sets. In this example we are using scikit-learn, which requires the data in a specific format: the explanatory variables must be provided as a 2D list and the dependent (classification) variable as a 1D list with classes encoded as either 1 or 0. The following code does that.
import csv
import random

def loadData(filename):
    # Read the CSV file and convert every value to float
    lines = csv.reader(open(filename, "rt"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

def train_test_data(dataset, splitRatio):
    # Randomly move splitRatio of the rows into the training set;
    # the remaining rows become the test set
    trainData = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainData:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    # The last column is the class label, the rest are explanatory variables
    X_train = [trainSet[i][:-1] for i in range(len(trainSet))]
    y_train = [trainSet[i][-1] for i in range(len(trainSet))]
    X_test = [copy[i][:-1] for i in range(len(copy))]
    y_test = [copy[i][-1] for i in range(len(copy))]
    return X_train, y_train, X_test, y_test
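As an aside (this is not how the code above works, just an alternative sketch): the same split can be done in a couple of lines with numpy and scikit-learn's own train_test_split, assuming dataset.csv is purely numeric with the class label in the last column, as loadData expects.

import numpy as np
from sklearn.model_selection import train_test_split

data = np.loadtxt('dataset.csv', delimiter=',')   # assumes a purely numeric CSV
X, y = data[:, :-1], data[:, -1]                  # label assumed in the last column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)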
Now, after splitting the data set, we need two key scikit-learn methods, i.e. fit() and predict(). I am using the built-in methods from the library for now, but a very good from-scratch implementation of these methods is available here. Those who are interested in the nitty-gritty of the SVM algorithm, especially the quadratic optimization used to obtain the optimal weights, should go through it, because it is written in a very simple way. But for now, let's use the following scikit-learn code to learn our model and get the results.
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix

def main():
    filename = 'dataset.csv'
    splitRatio = 0.8  # 80% training and 20% testing
    dataset = loadData(filename)
    X_train, y_train, X_test, y_test = train_test_data(dataset, splitRatio)

    # Learn the decision boundary with a linear kernel
    model = svm.SVC(kernel='linear', C=1, gamma=1)
    model.fit(X_train, y_train)

    # Evaluate on the held-out test data
    predicted = model.predict(X_test)
    accuracy = accuracy_score(y_test, predicted)
    confusion = confusion_matrix(y_test, predicted)
    print(accuracy)
    print(confusion)

main()
After running the model to learn the decision boundary using the support vectors, we calculate the accuracy of our model on the test data that we split off at the beginning. Over several runs, the accuracy is roughly 75% on average with a linear kernel. We can also try kernel='rbf' or kernel='poly', which are the radial basis function kernel and the polynomial kernel. Note that the model accuracy will vary a bit from run to run, because the data is split using random numbers that will not be the same every time; the accuracy will also differ between kernels.
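If you want that averaging over runs done for you instead of re-running the script by hand, one option (a sketch, assuming the same numeric dataset.csv with the label in the last column) is scikit-learn's cross_val_score, which also makes it easy to compare kernels:

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

data = np.loadtxt('dataset.csv', delimiter=',')   # assumes a purely numeric CSV
X, y = data[:, :-1], data[:, -1]

for kernel in ('linear', 'rbf', 'poly'):
    # 5-fold cross-validation: mean accuracy over five different splits
    scores = cross_val_score(svm.SVC(kernel=kernel, C=1, gamma='scale'), X, y, cv=5)
    print(kernel, scores.mean())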