
Support Vector Machine from Scratch

\(\newcommand{\t}[1]{ {#1}^T }\) \(\newcommand{\inv}[1]{ {#1}^{-1} }\) \(\newcommand{\vec}[1]{\boldsymbol{#1}}\) \(\newcommand{\diff}[1]{\mathop{}\!\mathrm{\partial}#1}\) \(\newcommand{\fun}[2]{#1\left(#2\right)}\) \(\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}\) \(\newcommand{\abs}[1]{\left| #1 \right|}\) \(\newcommand{\name}[1]{\mathrm{#1}}\) \(\newcommand{\new}{\name{new}}\) \(\DeclareMathOperator*{\argmax}{arg\,max}\) \(\DeclareMathOperator*{\argmin}{arg\,min}\) \(\DeclareMathOperator*{\diag}{diag}\)

Introduction

Similar to most of my previous posts, this article serves as a personal note on the principles and derivation of the Support Vector Machine.

Representation of Hyperplanes

In this section, I will discuss the mathematical representation of hyperplanes, an understanding of which is pivotal to the formulation of the optimisation problem for deriving the SVM later. To make things easier to understand, I illustrate the discussion in a two-dimensional space where the “hyperplanes” are just straight lines, but the derivation presented here applies to hyperplanes in higher dimensions without loss of generality.

Traditionally, hyperplanes, particularly in 2D spaces, are often specified mathematically using the “slope-intercept form”:

\[ y = \beta x + \alpha \] where \(\beta\) specifies how much \(y\) changes per one unit change in \(x\), and \(\alpha\) represents the value \(y\) takes when \(x\) is zero. This representation, while intuitive, has two problems: first, it cannot represent “vertical” lines, in which case the slope becomes infinite; second, it treats variables asymmetrically, i.e. \(y\) in this case is a special variable defined with respect to all the other variables — asymmetries in mathematics are generally much harder to deal with than symmetries.

The diagram below illustrates an alternative and arguably more elegant way of specifying a hyperplane. It is based on the simple observation that we can describe a hyperplane by first specifying its facing, which gives a hyperplane that passes through the origin of the space it is in, and then moving the plane along the direction it is facing by a certain bias so that it lands in the desired position.

Using this observation, we can specify a hyperplane using two parameters:

  1. The norm vector \(\vec{w}\), which specifies the facing of the hyperplane, and
  2. The bias scalar \(b\), which specifies how much we should move the hyperplane to get it to where we want it to be; note that \(b\) is different from the geometric distance \(d\) shown in the diagram, because \(b\) is measured in terms of the length of \(\vec{w}\) while \(d\) is measured in absolute terms. We will come back to this later.

Under this formulation, a hyperplane can be represented using the following equation:

\[\t{\vec{w}} \vec{x} = b\]

The diagram below outlines the key elements in the specification of a hyperplane. The vector \(\vec{w}\) that specifies the facing of the hyperplane is also called the norm vector, and it is always perpendicular to the hyperplane.

An important quantity in the above diagram that will become relevant later is the distance from the hyperplane to the origin, i.e. \(\norm{\name{OB}}\). As mentioned earlier, \(\norm{\name{OB}}\) is proportional to the bias \(b\). To establish the connection between \(\norm{\name{OB}}\) and \(b\), we first observe that for any point \(\name{A}\) on the hyperplane \(\t{\vec{w}} \vec{x} = b\), the projection of \(\vec{x}\), the vector that connects the origin and the point \(\name{A}\), onto the norm vector \(\vec{w}\) is a constant, namely \(\norm{\name{OB}}\) in the above diagram. Using the geometric interpretation of dot products:

\[ \begin{align*} \t{\vec{w}} \vec{x} &= \norm{\vec{w}} \norm{\vec{x}} \cos{\theta} \\ b &= \norm{\vec{w}} \norm{\name{OB}} \\ \frac{b}{\norm{\vec{w}}} &= \norm{\name{OB}} \end{align*} \]

Note that the quantity \(\frac{b}{\norm{\vec{w}}}\) can be negative — this happens when the hyperplane has been moved in the direction opposite to the norm vector \(\vec{w}\). Also, when the norm vector is a unit vector, i.e. \(\norm{\vec{w}} = 1\), \(b\) is simply the geometric distance from the origin to the hyperplane.

An important characteristic of this representation of hyperplanes is that one can find an infinite number of \(\vec{w}\) and \(b\) pairs that represent the same hyperplane just by scaling them with the same non-zero coefficient; for example, all of the following equations describe the same hyperplane: \[ \begin{array}{rcl} \t{\vec{w}} \vec{x} &=& b \\ -\t{\vec{w}} \vec{x} &=& -b \\ 3\t{\vec{w}} \vec{x} &=& 3b \end{array} \] It is therefore often desirable to impose some kind of constraint on the length of the norm vector when dealing with hyperplanes; a common example would be to force the length of the norm vector to be \(1\), i.e. \(\norm{\vec{w}} = 1\), or more conveniently, \(\t{\vec{w}}\vec{w} = \vec{w}^2 = 1\). This observation will become relevant later when formulating the optimisation problem for SVM.
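To make this concrete, here is a quick numerical sketch in R (a toy example of my own, not part of the original derivation): scaling \(\vec{w}\) and \(b\) by the same positive coefficient leaves both the hyperplane equation and the distance \(\frac{b}{\norm{\vec{w}}}\) from the origin unchanged.

# Toy example: the line 2 * x1 + 1 * x2 = 3, represented with differently
# scaled (w, b) pairs. The point x0 = (1, 1) lies on the line.
w0 <- c(2, 1)
b0 <- 3
x0 <- c(1, 1)

for (k in c(1, 3, 0.5)) {
  cat("scale =", k,
      "| equation satisfied:", isTRUE(all.equal(sum((k * w0) * x0), k * b0)),
      "| distance to origin:", (k * b0) / sqrt(sum((k * w0)^2)), "\n")
}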

Distance from a Hyperplane to a Point

Another quantity that is worthy of some attention is the distance between a point and a hyperplane. The following diagram illustrates the relevant concepts:

Let \(\name{C}\) be an arbitrary point in the space of interest. In the above diagram, we are interested in \(\norm{\name{BA}}\), the distance from the hyperplane \(\t{\vec{w}} \vec{x} = b\) to the point \(\name{C}\) (again, this quantity can be positive or negative). Observe that \(\norm{\name{BA}} = \norm{\name{OA}} - \norm{\name{OB}}\); we already know that \(\norm{\name{OB}} = \frac{b}{\norm{\vec{w}}}\), and \(\norm{\name{OA}}\) can be easily derived by noting that it is just the projection of \(\name{OC}\) onto the norm vector \(\vec{w}\). Let \(\vec{x}_C\) denote the vector pointing from \(\name{O}\) to \(\name{C}\); then \(\norm{\name{OA}}\), the length of this projection, is:

\[ \norm{\name{OA}} = \norm{\name{OC}} \cos{\theta} = \frac{\t{\vec{w}}\vec{x}_C}{\norm{\vec{w}}} \]

and we can derive the distance from a hyperplane to a point (i.e. \(\norm{\name{BA}}\) in this case):

\[ \norm{\name{BA}} = \frac{\t{\vec{w}}\vec{x}_C}{\norm{\vec{w}}} - \frac{b}{\norm{\vec{w}}} = \frac{\t{\vec{w}}\vec{x}_C - b}{\norm{\vec{w}}} \]
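As a small concrete sketch (the helper and the example values are my own), this formula translates directly into a one-line R function:

# Signed distance from the hyperplane t(w) x = b to the point x:
# positive on the side the norm vector points towards, negative on the other.
signed.distance <- function(w, b, x) (sum(w * x) - b) / sqrt(sum(w^2))

signed.distance(w = c(2, 1), b = 3, x = c(3, 4))  # (2*3 + 1*4 - 3) / sqrt(5)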

Now that we have dealt with the basics, it is time to move on to discuss the specification of the optimisation problem for deriving SVMs.

Specification of the Support Vector Machine

A support vector machine (SVM) is a supervised binary classifier. The setting is that we have a dataset with positive and negative samples, and we want to learn a boundary that best separates the positive and the negative samples. The simplest situation is illustrated by the diagram below:

The white dots represent positive samples and the black dots represent negative samples. Of course, the positive-negative designation is completely symmetric — we can swap the labels of the samples without affecting the formulation of and the solution to the problem.

We will first discuss the linearly separable case, where the positive and negative samples can be perfectly separated by a hyperplane. There can be many ways to separate the positive samples from the negative samples using a hyperplane — each of the lines in the above graph represents one such choice. Note that fitting a logistic regression model (or a single-layer neural network, a.k.a. perceptron) on the dataset corresponds to finding one such hyperplane; what distinguishes the SVM from these other methods is that it does not find just any hyperplane that separates the two classes of samples: it finds the optimal hyperplane that separates the positive and negative samples.

But what does it mean to say a hyperplane is the optimal hyperplane for separating the positive and the negative samples? Consider the solid green and blue lines in the following graph as two candidate hyperplanes for the task. One can argue that the green line represents a better hyperplane for separating the two sets of points than the blue line because it leaves more margin between itself and the point nearest to it. The margin between a hyperplane and its nearest neighbouring sample can be thought of as representing the “confidence” of the classification — a larger margin leaves more room for error, so the larger the margin, the more confident the classification.

By observing the above diagram, we can immediately note an “intuition”: if a hyperplane’s margin touches only one sample point, then the hyperplane cannot be optimal because we can increase the margin by pushing the hyperplane towards the samples of the other class until it touches another sample, as illustrated below:

which automatically implies that an optimal hyperplane will have equal margins to the nearest data points in both classes. The situation is illustrated by the following diagram:

In this case, we introduce a positive quantity \(\delta\) and express the hyperplanes that represent the two margin boundaries by adding/subtracting \(\delta\) to/from the bias term \(b\) of the equation that gives the optimal hyperplane. In this way, we see that \(M\), the actual margin for each side of the optimal hyperplane, is \[ M = \frac{b + \delta}{\norm{\vec{w}}} - \frac{b}{\norm{\vec{w}}} = \frac{b}{\norm{\vec{w}}} - \frac{b - \delta}{\norm{\vec{w}}} = \frac{\delta}{\norm{\vec{w}}}\]

Immediately, we have an optimisation objective: \[ \begin{array}{rll} \text{maximise:} & \qquad \frac{\delta}{\norm{\vec{w}}} & \text{w.r.t. } \vec{w} \text{ and } b \end{array} \]

We also want to specify the constraint that “all points must be classified correctly”, which is to say that the absolute distance from any data point to the optimal separation hyperplane must be greater than or equal to the length of the margin \(M\), i.e. \(\frac{\delta}{\norm{\vec{w}}}\). To specify these constraints, we need a way to express the absolute distance from a data point to the optimal separation hyperplane. From the earlier discussions, we already know that the absolute distance from a hyperplane \(\t{\vec{w}} \vec{x} = b\) to a point \(\vec{x}_i\) is: \[ \abs{\frac{\t{\vec{w}} \vec{x}_i - b}{\norm{\vec{w}}}}\] Constraints involving absolute value functions, however, are unpleasant to deal with. We can make our life a bit easier by exploiting the following fact, which is always true for any point that is classified correctly: \[ \abs{\frac{\t{\vec{w}} \vec{x}_i - b}{\norm{\vec{w}}}} = y_i \left( \frac{\t{\vec{w}} \vec{x}_i - b}{\norm{\vec{w}}} \right)\] where \(y_i\) encodes the label or class associated with \(\vec{x}_i\): \(y_i = +1\) when \(\vec{x}_i\) is a positive sample and \(y_i = -1\) when \(\vec{x}_i\) is a negative sample1. Now we can specify the constraints: \[ y_i \left( \frac{\t{\vec{w}} \vec{x}_i - b}{\norm{\vec{w}}} \right) \ge \frac{\delta}{\norm{\vec{w}}},\ \forall\ i \]

Note that the left hand side will be negative if a data point is classified wrongly by the separation hyperplane, thus violating the constraint because the right hand side is strictly positive. Cancelling out the \(\norm{\vec{w}}\) from each side, we end up with our full optimisation objective: \[ \begin{array}{rll} \text{maximise: } & \qquad \frac{\delta}{\norm{\vec{w}}} & \text{w.r.t. } \vec{w} \text{ and } b \\ \text{subject to: } & \qquad y_{i} \left( \t{\vec{w}} \vec{x}_i - b \right) \ge \delta, & \forall\ i \end{array} \]

Solving the optimisation problem with Lagrange multipliers and the KKT conditions

The optimisation problem of the SVM is traditionally approached as a minimisation problem, simply because most software packages that perform numeric optimisation assume minimisation rather than maximisation. To convert the above maximisation into a minimisation, we can state its equivalent like this: \[ \begin{align*} \text{minimise: } & \qquad \frac{1}{2\delta^2} {\vec{w}^2} & \text{w.r.t. } \vec{w} \text{ and } b \\ \text{subject to: } & \qquad y_{i} \left( \t{\vec{w}} \vec{x}_i - b \right) \ge \delta, & \forall\ i \end{align*} \]

That is, we take the reciprocal of the original objective \(\frac{\delta}{\norm{\vec{w}}}\), square it (which does not change the minimiser, since \(\frac{\norm{\vec{w}}}{\delta} \gt 0\)), and multiply by \(\frac{1}{2}\), which is just an algebraic convenience.
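Spelled out step by step, the equivalence (over the same feasible set) is:

\[ \argmax_{\vec{w}, b} \frac{\delta}{\norm{\vec{w}}} \;=\; \argmin_{\vec{w}, b} \frac{\norm{\vec{w}}}{\delta} \;=\; \argmin_{\vec{w}, b} \frac{1}{2} \left( \frac{\norm{\vec{w}}}{\delta} \right)^2 \;=\; \argmin_{\vec{w}, b} \frac{1}{2\delta^2} \t{\vec{w}} \vec{w} \]

since taking the reciprocal of a positive quantity reverses the ordering, while halving and squaring a positive quantity preserve it.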

This is a quadratic optimisation problem with linear inequality constraints, and it is solved using Lagrange multipliers and the KKT (Karush–Kuhn–Tucker) conditions. The KKT conditions are a set of necessary conditions for a set of parameters \(\vec{x}\) to be a local optimum of a function \(\fun{f}{\vec{x}}\) under both equality and inequality constraints. More formally, given the following minimisation problem: \[ \begin{array}{rll} \text{minimise:} & \qquad \fun{f}{\vec{x}} & \vec{x} \in \mathbb{R}^n\\ \text{subject to:} & \qquad \fun{g_i}{\vec{x}} \le 0, & \forall\ i=1, \dots, p \\ & \qquad \fun{h_j}{\vec{x}} = 0, & \forall\ j=1, \dots, q \end{array} \] let the Lagrangian function of the above optimisation, \(\fun{\mathcal{L}}{\vec{x}, \vec{\lambda}, \vec{\mu}}\), be: \[ \fun{\mathcal{L}}{\vec{x}, \vec{\lambda}, \vec{\mu}} = \fun{f}{\vec{x}} + \sum_{i = 1}^{p} \lambda_i \fun{g_i}{\vec{x}} + \sum_{j = 1}^{q} \mu_j \fun{h_j}{\vec{x}}\]

Then for any \(\vec{x}^{\ast}\) that is a local minimum of \(\fun{f}{\vec{x}}\), the following must be true2: \[ \begin{array}{rll} \frac{\diff{}}{\diff{\vec{x}}}\fun{\mathcal{L}}{\vec{x}^{\ast}, \vec{\lambda}, \vec{\mu}} &= \vec{0} & \\ \lambda_i &\ge 0, & \forall\ i = 1, \dots, p \\ \lambda_i \fun{g_i}{\vec{x}^{\ast}} &= 0, & \forall\ i = 1, \dots, p \\ \fun{g_i}{\vec{x}^{\ast}} &\le 0, & \forall\ i = 1, \dots, p \\ \fun{h_j}{\vec{x}^{\ast}} &= 0, & \forall\ j = 1, \dots, q \\ \end{array} \]

Note that the condition \(\lambda_i \fun{g_i}{\vec{x}^{\ast}} = 0\) implies that either \(\lambda_i = 0\) or \(\fun{g_i}{\vec{x}^{\ast}} = 0\). What this condition is saying is that when \(\lambda_i \gt 0\), it must be the case that \(\fun{g_i}{\vec{x}^{\ast}} = 0\), which is to say the \(i\)-th inequality constraint has reduced to an equality constraint. A key insight from the KKT conditions that will be pivotal to the interpretation of the SVM later is that an inequality constraint is only “in use” or “active” if it reduces to an equality constraint — that is, only when the Lagrange multiplier of an inequality constraint is greater than zero does the constraint make a difference to the location of \(\vec{x}^{\ast}\); otherwise, the constraint can be ignored without affecting the solution to the optimisation problem. For a more detailed exposition of the KKT conditions, see this post.
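As a toy illustration of the active/inactive distinction (my own example, not part of the SVM derivation itself), consider minimising \(\fun{f}{x} = x^2\) subject to a single inequality constraint:

\[ \begin{array}{ll} \text{constraint } x \ge 1 \text{, i.e. } \fun{g}{x} = 1 - x \le 0: & \fun{\mathcal{L}}{x, \lambda} = x^2 + \lambda \left( 1 - x \right) \\ & \frac{\diff{\mathcal{L}}}{\diff{x}} = 2x - \lambda = 0, \quad \lambda \left( 1 - x \right) = 0 \quad \Rightarrow \quad x^{\ast} = 1,\ \lambda = 2 \gt 0 \quad \text{(active)} \\ \text{constraint } x \ge -1 \text{, i.e. } \fun{g}{x} = -1 - x \le 0: & \fun{\mathcal{L}}{x, \lambda} = x^2 + \lambda \left( -1 - x \right) \\ & \frac{\diff{\mathcal{L}}}{\diff{x}} = 2x - \lambda = 0, \quad \lambda \left( -1 - x \right) = 0 \quad \Rightarrow \quad x^{\ast} = 0,\ \lambda = 0 \quad \text{(inactive)} \end{array} \]

In the first case the constraint pins the solution to its boundary and its multiplier is positive; in the second case the unconstrained minimum already satisfies the constraint, the multiplier is zero, and the constraint could be dropped without changing the solution.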

In our case, we note that we only have inequality constraints: \(\fun{g_i}{\vec{w}, b} = - \left( y_i \left( \t{\vec{w}} \vec{x}_i - b \right) - \delta \right),\ \forall\ i \in 1, 2, \dots, N\). The Lagrangian for our minimisation problem is therefore a function of \(\vec{w}\), \(b\), and \(\vec{\lambda}\): \[ \begin{align} \fun{\mathcal{L}}{\vec{w}, b, \vec{\lambda}} &= \frac{1}{2\delta^2} {\vec{w}^2} - \sum_{i = 1}^{N} \lambda_i \left( y_i \left( \t{\vec{w}} \vec{x}_i - b \right) - \delta \right) \\ &= \frac{1}{2\delta^2} {\vec{w}^2} - \sum_{i = 1}^{N} \lambda_i y_i \t{\vec{w}} \vec{x}_i + b \sum_{i = 1}^{N} \lambda_i y_i + \delta \sum_{i = 1}^{N} \lambda_i \label{eq:primal-problem} \end{align} \] Take the derivatives with respect to both \(\vec{w}\) and \(b\): \[ \begin{align} \frac{\diff{\mathcal{L}}}{\diff{\vec{w}}} &= \frac{\vec{w}}{\delta^2} - \sum_{i = 1}^{N} \lambda_i y_i \vec{x}_i \\ \frac{\diff{\mathcal{L}}}{\diff{b}} &= \sum_{i = 1}^{N} \lambda_i y_i \end{align} \]

Set the above derivatives both to zero and we have: \[ \begin{align} \vec{w} &= \delta^2 \sum_{i = 1}^{N} \lambda_i y_i \vec{x}_i \label{eq:equation-for-w} \\ 0 &= \sum_{i = 1}^{N} \lambda_i y_i \label{eq:equation-for-b} \end{align} \]

Before going any further, it is worthwhile to contemplate Equation \(\eqref{eq:equation-for-w}\) for a moment. What it says is that the norm vector of the optimal separation hyperplane is a linear combination of the input data:

\[ \vec{w} = \delta^2 \begin{bmatrix} \vec{x}_1 & \vec{x}_2 & \cdots & \vec{x}_N \\ \end{bmatrix} \begin{bmatrix} y_1 & & & \\ & y_2 & & \\ & & \ddots & \\ & & & y_N \end{bmatrix} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_N \\ \end{bmatrix} \] where \(\vec{x}_i\) is the column vector that represents the \(i\)-th data point in the sample and \(y_i\) is its corresponding label. The coefficients of the linear combination are the Lagrange multipliers \(\lambda_i\). Another important implication of Equation \(\eqref{eq:equation-for-w}\) is that \(\delta\) only appears as a scaling factor for \(\vec{w}\) — that is, it only affects the length of the norm vector of the optimal separation hyperplane. As mentioned earlier in the section Representation of Hyperplanes, it is necessary to constrain the length of the norm vector \(\vec{w}\) so that we find exactly one optimal separation hyperplane instead of a family of such hyperplanes.

For our case, we can “fix” the length of the norm vector by selecting a specific value for \(\delta\); traditionally, the literature on the derivation of SVM uses \(\delta = 1\). Once \(\delta\) is set to \(1\), the norm vector \(\vec{w}\) of the optimal separation hyperplane will be scaled such that the geometric distance from the hyperplane to its nearest sample point is exactly \(\frac{1}{\norm{\vec{w}}}\); equivalently, the nearest sample points satisfy \(\t{\vec{w}} \vec{x} - b = \pm 1\). Note that this choice is completely arbitrary — we could have set \(\delta\) to any positive constant, but setting it to \(1\) is the most convenient.

After setting \(\delta\) to \(1\), let us plug the results from Equations \(\eqref{eq:equation-for-w}\) and \(\eqref{eq:equation-for-b}\) back into our Lagrangian, using the fact that \(\vec{w}^2 = \t{\vec{w}}\vec{w}\) and being careful with the subscripts within the summations:

\[ \begin{align} \fun{\mathcal{L}}{\vec{\lambda}} &= \frac{1}{2} {\left( \sum_{j = 1}^{N} \lambda_j y_j \t{\vec{x}_j} \right)} {\left( \sum_{i = 1}^{N} \lambda_i y_i \vec{x}_i \right)} - \sum_{i = 1}^{N} {\lambda_i y_i \left( \sum_{j = 1}^{N} \lambda_j y_j \t{\vec{x}_j } \right) } \vec{x}_i + 0 + \sum_{i = 1}^{N} \lambda_i \\ &= \frac{1}{2} \left( \sum_{j = 1}^{N} \sum_{i = 1}^{N} \lambda_j \lambda_i y_j y_i \t{\vec{x}_j} \vec{x}_i \right) - \left( \sum_{j = 1}^{N} \sum_{i = 1}^{N} \lambda_j \lambda_i y_j y_i \t{\vec{x}_j} \vec{x}_i \right) + \sum_{i = 1}^{N} \lambda_i\\ &= - \frac{1}{2} \left( \sum_{j = 1}^{N} \sum_{i = 1}^{N} \lambda_j \lambda_i y_j y_i \t{\vec{x}_j} \vec{x}_i \right) + \sum_{i = 1}^{N} \lambda_i \end{align} \] Note that we have eliminated \(\vec{w}\) and \(b\) from the Lagrangian — the Lagrangian \(\fun{\mathcal{L}}{\vec{\lambda}}\) is now a function of \(\vec{\lambda}\) only.

Using the rest of the constraints in the KKT conditions, we can transform our original optimisation problem into the following form: \[ \begin{array}{rll} \text{maximise:} & \qquad - \frac{1}{2} \left( \sum_{j = 1}^{N} \sum_{i = 1}^{N} \lambda_j \lambda_i y_j y_i \t{\vec{x}_j} \vec{x}_i \right) + \sum_{i = 1}^{N} \lambda_i & \text{w.r.t. } \lambda_i,\ \forall\ i = 1, \dots, N \\ \text{subject to:} & \qquad \lambda_i \ge 0, & \forall\ i=1, \dots, N \\ & \qquad \sum_{i = 1}^{N} \lambda_i y_i = 0 & \end{array} \] This optimisation problem is also called the dual problem of the original primal problem, i.e. the optimisation of \(\eqref{eq:primal-problem}\). Before moving forward, it would be useful to rewrite the maximisation objective as a minimisation objective in matrix form: \[ \begin{array}{rll} \text{minimise:} & \qquad \frac{1}{2} \t{\vec{\lambda}} \t{\fun{\diag}{\vec{y}}} \t{\vec{X}} \vec{X} \fun{\diag}{\vec{y}} \vec{\lambda} + \t{\left( - \vec{1} \right)} \vec{\lambda} & \text{w.r.t. } \lambda_i,\ \forall\ i = 1, \dots, N \\ \text{subject to:} & \qquad \lambda_i \ge 0, & \forall\ i=1, \dots, N \\ & \qquad \t{\vec{y}} \vec{\lambda} = 0 & \end{array} \]

The optimisation is now represented in a format that is ready to be fed into software packages that solve quadratic programming problems. In this case, I will be solving the quadratic programming problem using the LowRankQP package in R.

Implement a Hard-Margin SVM from Scratch

Prepare Training Data

pacman::p_load(tidyverse)
pacman::p_load(LowRankQP)
set.seed(42)

Let us first generate some linearly separable data and visualise them:

N <- 20
training.data <- tibble(y = c(rep(1, N / 2), rep(-1, N / 2))) %>% 
          mutate(x1 = runif(n(), min = y - 0.5, max = y + 0.5), 
                 x2 = runif(n(), min = y - 0.5, max = y + 0.5))

training.data %>% mutate(id = 1:N, y = as.character(y)) %>% 
  ggplot() + geom_label(aes(x = x1, y = x2, label = id, colour = y))

Solve for the Lagrange multipliers, i.e. \(\vec{\lambda}\)

I will be using LowRankQP::LowRankQP function from the LowRankQP package; the following screenshot shows its usage:

Comparing the specification in the manual page with the optimisation problem we have above, we can see the following correspondence:

  • Vmat is \(\fun{\diag}{\vec{y}} \vec{X}\);
  • dvec is \(\vec{-1}\);
  • Amat is \(\t{\vec{y}}\);
  • bvec is \(\vec{0}\);
  • uvec should be \(\vec{+\infty}\).

We now prepare the data as follows, replacing the \(\vec{+\infty}\) with a reasonably large positive number:

X <- training.data[, 2:3] %>% as.matrix
Y <- training.data[, 1] %>% as.matrix

Vmat <- diag(c(Y)) %*% X
dvec <- rep(-1, N)
Amat <- t(Y)
bvec <- rep(0, 1)
uvec <- rep(500, N)
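As a sanity check of the mapping (my own addition, and it assumes, per my reading of the package, that LowRankQP forms the quadratic term of its objective as Vmat %*% t(Vmat)), the product Vmat %*% t(Vmat) should reproduce the quadratic-term matrix of the dual, written here as \(\fun{\diag}{\vec{y}} \vec{X} \t{\vec{X}} \fun{\diag}{\vec{y}}\) because in the code the rows of X are the samples:

# The dual's quadratic-term matrix, built explicitly (rows of X are samples).
D <- diag(c(Y)) %*% X %*% t(X) %*% diag(c(Y))
# Should be numerically zero if Vmat %*% t(Vmat) is indeed the quadratic term.
max(abs(D - Vmat %*% t(Vmat)))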

And call the function to solve for \(\vec{\lambda}\):

solutions <- LowRankQP(Vmat, dvec, Amat, bvec, uvec)
## LowRankQP CONVERGED IN 16 ITERATIONS
## 
##     Primal Feasibility    =   4.3246623e-15
##     Dual Feasibility      =   4.2669022e-17
##     Complementarity Value =   2.1134695e-12
##     Duality Gap           =   2.1134297e-12
##     Termination Condition =   1.4588601e-12

The solved \(\vec{\lambda}\) is stored in the alpha field of the solutions variable:

lambdas <- solutions[["alpha"]]
print(lambdas)
##               [,1]
##  [1,] 2.490838e-13
##  [2,] 3.612079e-14
##  [3,] 7.137997e-13
##  [4,] 2.746960e-13
##  [5,] 4.487130e-01
##  [6,] 5.899271e-13
##  [7,] 2.000331e-13
##  [8,] 3.447680e-12
##  [9,] 2.805455e-13
## [10,] 2.934919e-13
## [11,] 3.949479e-14
## [12,] 5.698585e-14
## [13,] 1.650124e-13
## [14,] 3.176032e-14
## [15,] 1.629877e-13
## [16,] 4.487130e-01
## [17,] 2.330404e-13
## [18,] 1.028228e-13
## [19,] 1.921889e-14
## [20,] 7.064577e-14

Support Vectors

The \(k\)-th component of the \(\vec{\lambda}\) vector corresponds to the constraint imposed by the \(k\)-th sample in the dataset. We see that most of the components of \(\vec{\lambda}\) are close to zero. Recall from our earlier discussion of the KKT conditions that if the Lagrange multiplier \(\lambda_k\) for the \(k\)-th constraint is zero, then its corresponding constraint \(\fun{g_k}{\vec{x}} \le 0\) is inactive. In other words, if the \(k\)-th component of \(\vec{\lambda}\) is (numerically) zero, then the \(k\)-th sample from the dataset does not contribute to the determination of the optimal separation hyperplane; the optimal separation hyperplane can be specified fully using only the samples whose associated Lagrange multipliers are greater than zero. Such samples, usually represented as vectors, are called the support vectors of the SVM. We can also visualise these support vectors:

training.data %>% 
  mutate(lambda = c(lambdas), 
         y = as.character(y)) %>% 
  mutate(id = 1:N) %>% 
  ggplot() + 
    geom_label(aes(x = x1, 
                   y = x2, 
                   label = id, 
                   size = lambda, 
                   colour = y))

Predicting new samples

Once we have computed the \(\vec{w}\) and \(b\) of the optimal separation hyperplane, we can predict the label of a new sample \(\vec{x}_{\new}\) from the sign of \(\t{\vec{w}} \vec{x}_{\new} - b\). With \(\delta = 1\), Equation \(\eqref{eq:equation-for-w}\) gives us the \(\vec{w}\) of the optimal separation hyperplane: \[ \vec{w} = \sum_{i = 1}^{N} \lambda_i y_i \vec{x}_i \]

Here we also see how an SVM can be fully specified using only the samples whose associated \(\lambda\) is non-zero, since the terms in the sum where \(\lambda_i = 0\) vanish anyway. We can exploit this fact to reduce the computation needed for finding \(\vec{w}\) by extracting and operating only on the support vectors:

# Treat samples whose multiplier is (numerically) non-zero as support vectors.
sv.index <- c(lambdas) >= 0.0001
# w = sum_i lambda_i * y_i * x_i, computed over the support vectors only.
w <- diag(lambdas[sv.index]) %*% diag(c(Y)[sv.index]) %*% X[sv.index, ] %>% colSums()
print(w)
##        x1        x2 
## 0.7635888 0.5606765
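As a quick sanity check (my own addition), computing the full sum \(\sum_{i = 1}^{N} \lambda_i y_i \vec{x}_i\) over all samples should give essentially the same \(\vec{w}\), since the near-zero multipliers contribute almost nothing:

# Row i of X is scaled by lambda_i * y_i, then the rows are summed column-wise;
# the result should match w up to numerical noise.
w.full <- colSums(c(lambdas) * c(Y) * X)
print(w.full)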

To compute the \(b\) of the optimal separation hyperplane, we note that if \(\vec{x}_i^{\ast}\) is a support vector of the SVM, then the following equality holds by definition: \[ \begin{align} y_i \left( \t{\vec{w}} \vec{x}_i^{\ast} - b \right) &= 1 \\ \t{\vec{w}} \vec{x}_i^{\ast} - b &= \frac{1}{y_i} = y_i \qquad \text{since } y_i \in \left\lbrace -1, +1 \right\rbrace \\ b &= \t{\vec{w}} \vec{x}_i^{\ast} - y_i \end{align} \] That is, we can compute \(b\) for the optimal separation hyperplane using any support vector \(\vec{x}_i^{\ast}\) together with its label \(y_i\) and the norm vector \(\vec{w}\). A numerically more stable estimate is obtained by averaging the values of \(b\) computed from all support vectors and their corresponding labels:

b <- mean(X[sv.index, ] %*% as.matrix(w) - Y[sv.index])
print(b)
## [1] 0.1983831
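Before visualising, we can also check numerically (another sanity check of my own) that the solution behaves as the KKT analysis predicts: the functional margins \(y_i \left( \t{\vec{w}} \vec{x}_i - b \right)\) should be (numerically) equal to \(1\) for the support vectors and strictly greater than \(1\) for every other training sample:

# Functional margins y_i * (w' x_i - b) for all training samples.
margins <- c(Y) * c(X %*% as.matrix(w) - b)
margins[sv.index]          # should all be (numerically) 1
range(margins[!sv.index])  # should lie strictly above 1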

We can visualise the optimal separation hyperplane on the training dataset:

training.data %>% 
  mutate(lambda = c(lambdas), 
         y = as.character(y)) %>% 
  mutate(id = 1:N) %>% 
  ggplot() + 
    geom_label(aes(x = x1, 
                   y = x2, 
                   label = id, 
                   size = lambda, 
                   colour = y)) + 
    stat_normline(norm.vector = w, bias = b)
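A note on stat_normline: it is not a ggplot2 built-in, so I am assuming it is a small helper defined elsewhere in the original post that draws the line \(\t{\vec{w}} \vec{x} = b\) given a norm vector and a bias. If that helper is not at hand, an equivalent layer can be obtained with geom_abline by solving \(w_1 x_1 + w_2 x_2 = b\) for \(x_2\):

# Drop-in replacement for stat_normline(norm.vector = w, bias = b), assuming
# that helper simply draws the line w1 * x1 + w2 * x2 = b:
#   x2 = b / w2 - (w1 / w2) * x1
geom_abline(intercept = b / w[["x2"]], slope = -w[["x1"]] / w[["x2"]])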

We may test out the SVM by applying the optimal \(\vec{w}\) and \(b\) found previously to newly generated testing data; I have allowed the two classes of data points to overlap to demonstrate misclassified cases. I have also mapped the distance from each data point to the hyperplane to the size of the point — the larger the distance, the more “confident” the classification.

N <- 40
testing.data <- tibble(y = c(rep(1, N / 2), rep(-1, N / 2))) %>% 
                  mutate(x1 = runif(n(), min = y - 3, max = y + 3), 
                         x2 = runif(n(), min = y - 3, max = y + 3))

testing.X <- testing.data[, 2:3] %>% as.matrix
testing.Y <- testing.data[, 1] %>% as.matrix
# Predicted labels: the sign of w' x - b for each testing sample.
prediction <- sign(testing.X %*% w - b)
# Use the magnitude |w' x - b| as a relative measure of confidence.
confidence <-  abs(testing.X %*% w - b)
# prediction * testing.Y is +1 when correct and -1 when wrong, so adding 1
# yields 2 (truthy) for correct predictions and 0 (falsy) for incorrect ones.
is_prediction_correct <- as.logical(prediction * testing.Y + 1)

testing.data %>% mutate(id = 1:N, is_prediction_correct = is_prediction_correct, confidence = confidence ) %>% 
                  ggplot() + 
                    geom_label(aes(x = x1, 
                                   y = x2, 
                                   label = id,
                                   size = confidence,
                                   colour = is_prediction_correct)) + 
                    stat_normline(norm.vector = w, bias = b)
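To put a number on the test result (my own addition), the overall accuracy and a small confusion table can be computed directly from the objects above:

# Fraction of correctly classified test samples, and predicted vs. actual labels.
mean(is_prediction_correct)
table(predicted = c(prediction), actual = c(testing.Y))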


  1. The “positive”/“negative” classification is completely arbitrary and interchangeable — one can call the two classes anything she wants.↩︎

  2. [https://www.encyclopediaofmath.org/index.php/Karush-Kuhn-Tucker_conditions]↩︎