The kernel trick — What is it and Why does it matter? + Short Crash Course on SVM

Genevieve
Artificial Intelligence in Plain English
5 min read · Feb 20, 2021


If you have dabbled in machine learning, you might have come across the word ‘kernel’ being thrown around casually. In the sklearn library, there are options to specify the type of kernel you want to use in some classifiers, such as SVMs (support vector machines). So what exactly is a kernel, and why does it matter?

A kernel, in the context of machine learning, usually refers to a function that can be plugged into a classifier’s decision function and that accesses the training data only through inner products; in a Euclidean vector space over the real numbers, this is simply the dot product. The reason this is useful is that in many applications we want to transform the data points into a different space. For example, in classification problems we may transform the data points to make them more easily separable. The classic illustration of the kernel trick is its use in Support Vector Machines (SVMs), where the classifier tries to find a plane (or, in 2D, a line) that separates the dataset into two classes.

Generated from sklearn classifier comparison example

As you can see, there is no way to linearly separate the two classes cleanly. However, visually we can see that the blue points all sit in the centre, surrounded by the red points. So in another space, in this case radial coordinates, the data points can be easily separated.

So the reason that kernels are so useful is that we don’t have to first transform the data points into radial coordinates and then apply the classifier; we can apply the classifier directly. To illustrate how this is done with SVM, we will first go through a very brief introduction to the math behind SVM.
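Before the math, here is what ‘applying the classifier directly’ looks like in practice. This is a minimal sketch, assuming scikit-learn’s make_circles as a stand-in for the concentric data pictured above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric-rings data, similar to the sklearn classifier comparison example
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# A linear kernel cannot separate the rings...
linear_svm = SVC(kernel="linear").fit(X, y)
print("linear kernel accuracy:", linear_svm.score(X, y))  # roughly chance level

# ...but an RBF kernel separates them, without us ever transforming the data ourselves
rbf_svm = SVC(kernel="rbf").fit(X, y)
print("RBF kernel accuracy:", rbf_svm.score(X, y))  # close to 1.0
```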

So SVM works by finding the plane that best separates the points from the different classes. Once we have found this plane, we can classify a new point x by taking its dot product with w, the vector orthogonal to the plane. Hence the decision rule is:

decision rule of SVM
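In the usual hard-margin formulation, this decision rule reads: classify x as positive if

w · x + b ≥ 0

and as negative otherwise.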

We can understand b as an offset (a threshold term), and x·w as the projection of the vector x onto w (above figure); when this projection crosses the threshold set by b, the classification changes.

So how do we find this plane that best separates the points? We assume that the points at the boundaries (we will call them x+ and x-) are at least 1 unit of distance away from the plane. Hence:

The width is just the difference of the boundary points, x+ − x−, projected onto the unit vector orthogonal to the plane
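To spell this step out (using the margin constraints w·x+ + b = 1 and w·x− + b = −1 for the boundary points), the width works out to:

width = (x+ − x−) · w/‖w‖ = ((1 − b) − (−1 − b))/‖w‖ = 2/‖w‖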

So, to maximise the width of the margin (equivalently, to minimise ‖w‖, or more conveniently ½‖w‖²) while satisfying the constraints we put on the training points, y(w·x + b) ≥ 1, we define the Lagrangian (you can just think of this as the objective function) as:

The Lagrangian
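In the standard hard-margin formulation, this Lagrangian is:

L = ½‖w‖² − Σᵢ αᵢ [ yᵢ(w·xᵢ + b) − 1 ]

where the αᵢ ≥ 0 are the Lagrange multipliers, one per training point.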

To obtain the w that minimises the Lagrangian, we differentiate L with respect to w and set the derivative to zero.

obtain an expression for w
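Setting the derivative to zero gives, in the standard derivation:

∂L/∂w = w − Σᵢ αᵢ yᵢ xᵢ = 0  ⟹  w = Σᵢ αᵢ yᵢ xᵢ

(differentiating with respect to b similarly gives Σᵢ αᵢ yᵢ = 0).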

We now have an expression for w. Recall that we defined a decision rule for the classifier earlier. We can also write that rule as a decision function, and by plugging in the expression we just obtained for w, we get:
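Substituting w = Σᵢ αᵢ yᵢ xᵢ into the decision rule gives the familiar form:

f(x) = sign( Σᵢ αᵢ yᵢ (xᵢ · x) + b )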

We can see that the decision function of the SVM depends on the input data only through dot products. Hence we can replace the dot product (which is itself a kernel, the linear kernel) with a more general kernel function k(x, x′).

So now that we have the decision function for SVM, let’s see how to define a kernel and why kernels are useful. Recall that we wanted to separate some data points but couldn’t do it with a straight line in the original coordinate system? We can do it by first transforming the data points into radial coordinates. We can define a mapping Phi(x,y)=sqrt(x²+y²)=r. Then we will be able to find a threshold b that cleanly separates the data points. In this case, the transformation is simple. However, what if the transformation goes to a higher dimension (rather than a lower one, as in this case)?
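Before looking at higher-dimensional mappings, here is a minimal sketch of the simple radial mapping just described, again assuming scikit-learn’s make_circles for the toy data:

```python
import numpy as np
from sklearn.datasets import make_circles

# Toy data: an inner ring (class 1) surrounded by an outer ring (class 0)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The explicit mapping Phi(x, y) = sqrt(x^2 + y^2) = r
r = np.sqrt((X ** 2).sum(axis=1))

# In radial coordinates a single threshold separates the two classes
threshold = 0.65  # any value between the two rings works for this toy data
pred = (r < threshold).astype(int)
print("accuracy of the one-dimensional threshold classifier:", (pred == y).mean())
```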

Suppose we need to consider non-linear, higher-order effects such as xy, x², y² in addition to x, y and some constant. That would be 6 dimensions, which we can represent as a mapping Phi(x,y) → (x², y², xy, x, y, 1). The corresponding kernel is the degree-2 polynomial kernel k(x, x′) = (x·x′ + c)², which (up to constant factors on the features) computes exactly this inner product. If we apply a classifier like SVM directly on x², y², xy, x, y, 1, evaluating the decision function over i training points requires 6i operations: feature 1 × feature 1′, feature 2 × feature 2′, and so on.

Without a kernel function we would have to perform 6 × i operations

If we instead ‘kernelise’ the decision function, we realise that we only need to apply one function to the two features x and y, hence about 2i operations.

With polynomial kernel
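A minimal numerical sketch of this equivalence, using the properly scaled version of the feature map (with c = 1) so that the explicit inner product matches the kernel exactly:

```python
import numpy as np

c = 1.0

def phi(u):
    """Explicit degree-2 polynomial feature map for 2-D input, scaled so that
    phi(u) . phi(v) == (u . v + c) ** 2 exactly."""
    x, y = u
    return np.array([x ** 2, y ** 2,
                     np.sqrt(2) * x * y,
                     np.sqrt(2 * c) * x,
                     np.sqrt(2 * c) * y,
                     c])

def poly_kernel(u, v):
    """Degree-2 polynomial kernel: never builds the 6-D features."""
    return (np.dot(u, v) + c) ** 2

u = np.array([0.3, -1.2])
v = np.array([2.0, 0.5])

print(np.dot(phi(u), phi(v)))  # explicit route: dot product in the 6-D feature space
print(poly_kernel(u, v))       # kernel route: one dot product in the original 2-D space
# Both print the same value.
```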

This greatly speeds up computation when there is a large number of dimensions. Being able to plug any kernel into the decision function is also extremely convenient, and allows us to turn linear classifiers into non-linear classifiers without ever needing to evaluate the mapping explicitly. More formally:

Definition of a kernel
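The standard definition the caption refers to: a function k is a kernel if there exists a feature map Phi into some inner-product space such that

k(x, x′) = ⟨Phi(x), Phi(x′)⟩ for all x, x′.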

Some common kernel functions are the linear kernel k(x, x′) = x·x′, the polynomial kernel k(x, x′) = (x·x′ + c)^d, the RBF (Gaussian) kernel k(x, x′) = exp(−γ‖x − x′‖²), and the sigmoid kernel k(x, x′) = tanh(γ x·x′ + r).
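Conveniently, scikit-learn’s SVC also accepts any callable that returns a Gram matrix, so plugging in your own kernel is straightforward. A small sketch, reusing the degree-2 polynomial kernel from above:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def quadratic_kernel(A, B):
    """Custom polynomial kernel k(x, x') = (x . x' + 1)^2, returned as a Gram matrix."""
    return (A @ B.T + 1.0) ** 2

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# SVC calls quadratic_kernel on pairs of data matrices to build the kernel (Gram) matrix
clf = SVC(kernel=quadratic_kernel).fit(X, y)
print("custom quadratic kernel accuracy:", clf.score(X, y))
```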

Hopefully, this short introduction has made it clearer what a kernel is and why it is useful. It is a very simple concept: directly using the training data to compute the final form of a decision function, as if the data had been transformed into a higher-dimensional space, without ever having to perform the transformation. Yet it is extremely powerful. Classifiers that can be kernelised are computationally efficient and very convenient. Now we are all set to apply kernels liberally to separate all sorts of funny data in weird spaces!

Beautifully separated data using the kernel trick

