What are kernel functions?
In the most general terms, a kernel function converts a non-linearly separable space into a linearly separable one. It also measures the similarity between two points, so what is the difference between ordinary similarity functions and kernel functions? A straightforward answer is that other similarity functions measure similarity in the same space and dimension (say $S$ and $D$) in which the data originally lives, whereas a kernel function transforms the data into another space ${S}'$ with dimension ${D}'$.
The notion of similarity here is also task dependent. For instance, if your task is object recognition, then a good kernel will assign a high score to a pair of images that contain the same objects and a low score to a pair of images with different objects. Note that such a kernel captures a much more abstract notion of similarity than a similarity function that simply compares two images pixel by pixel.
Mathematically, kernel functions are defined as $K(x, y) = \langle \phi(x), \phi(y) \rangle$. Here $K$ is the kernel function, $x$ and $y$ are $n$-dimensional inputs, and $\phi$ is a map from the $n$-dimensional space to an $m$-dimensional space. $\langle x, y \rangle$ denotes the dot product, and usually $m$ is much larger than $n$.
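As a concrete illustration (a standard textbook example, not specific to this post), take two-dimensional inputs and the kernel $K(x,y) = (x^T y)^2$. Expanding the square gives
$$K(x,y) = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = \langle \phi(x), \phi(y) \rangle,$$
with $\phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$. The kernel thus computes an inner product in a 3-dimensional space while only ever doing arithmetic on the original 2-dimensional inputs.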
Types of kernel functions
Usually, the following kernel functions are used to transform the data so that it becomes linearly separable; a small code sketch of each appears after the list.
Linear: $K(x,y)=x^Ty$
Polynomial: $K(x,y)= (1+ x^Ty)^d$
Sigmoid: $K(x,y)= \tanh(ax^Ty + b)$
Radial Basis Function (RBF): $K(x,y)= \exp(-\gamma\left \| x-y \right \|^2)$
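To make the formulas above concrete, here is a minimal NumPy sketch of the four kernels; the parameter values ($d$, $a$, $b$, $\gamma$) are illustrative defaults chosen for this example, not values prescribed by the text.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x^T y
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3):
    # K(x, y) = (1 + x^T y)^d, with d the polynomial degree (illustrative default)
    return (1.0 + np.dot(x, y)) ** d

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    # K(x, y) = tanh(a * x^T y + b); a and b are illustrative defaults
    return np.tanh(a * np.dot(x, y) + b)

def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2); gamma is an illustrative default
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y), rbf_kernel(x, y))
```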
Intuition behind kernel functions
A common argument for using kernels rather than explicit feature vectors is that computing a kernel value is cheap, whereas computing the feature vector corresponding to the kernel can be very expensive. The feature vector's dimension can grow exponentially (for example, with the degree of a polynomial kernel), while the kernel computation time grows only linearly with the input dimension; even for kernels like the RBF kernel, whose feature vector is infinite-dimensional, the kernel itself is easy to evaluate.
Any machine learning algorithm that depends on the data only through dot products can have those dot products replaced by kernel evaluations, so the feature vectors never need to be computed at all. This lets us work with very complex, high-performance and computationally efficient kernels without ever handling high-dimensional feature vectors. This is called the kernel trick; a small numerical check follows below.
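As a minimal numerical check of the kernel trick (using the degree-2 polynomial kernel $K(x,y)=(x^T y)^2$ from the worked example above, not a kernel singled out by the text), the dot product of the explicit feature vectors and the direct kernel evaluation give the same value:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, y):
    # The same quantity computed directly in the original 2-D space
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Both routes give the same number: <phi(x), phi(y)> == (x^T y)^2
print(np.dot(phi(x), phi(y)))  # 1.0
print(poly2_kernel(x, y))      # 1.0
```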
A simple example of a transformation that turns a non-linearly separable space into a linearly separable one is as follows. We use the following data set as our example.
We can clearly see that the data is divided into two classes, one marked with ^ and the other with +. The challenge is that they are not linearly separable. However, by looking at the data and recalling some analytic geometry from high school, we observe that the boundary between the classes can be approximated by an ellipse. Applying the feature map ${S}' = x_1^2 + x_2^2$, the map underlying such an elliptic kernel, lets us visualize the same data in the following way, where the two classes can be separated by a straight line.
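Since the original figures are not reproduced here, the sketch below uses synthetic circular data as a stand-in (a special case of the elliptical picture described above) to show that the transform ${S}' = x_1^2 + x_2^2$ makes the two classes separable by a single threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the plotted data: one class inside the unit circle, the other outside.
X = rng.uniform(-2.0, 2.0, size=(400, 2))
inside = X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0

# The transform from the text: S' = x1^2 + x2^2
S_prime = X[:, 0] ** 2 + X[:, 1] ** 2

# In the transformed (one-dimensional) space a single linear threshold separates the classes.
separable = np.all((S_prime < 1.0) == inside)
print("linearly separable after the transform:", separable)  # True
```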