Consider a two-class classifier (for the same
problem described in the beginning of the chapter) with oriented
hyper-planes. The decision function of the classifier
is f(*x*) = sign(*w* · *x* + *b*), where *w* is the normal to the
hyper-plane and *b* is the bias.
If the training data is linearly separable, then a pair
(*w*, *b*) can be found such that the constraints in Eqn. are
satisfied. Notice
that there is ambiguity in the magnitude of *w* and *b*: they can be
arbitrarily scaled such that y_i(*w* · *x*_i + *b*) = 1, where *x*_i is the
training data nearest to the decision plane.
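The decision function and the separability constraints can be sketched numerically. The data, weight vector, and bias below are illustrative values chosen by hand (not from the text), scaled so that the points nearest the decision plane satisfy the constraint with equality:

```python
import numpy as np

# Toy 2-D linearly separable data with labels y in {-1, +1}
# (assumed for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.array([0.25, 0.25])  # normal to the separating hyper-plane
b = 0.0                     # bias term

def decide(x, w, b):
    """Decision function f(x) = sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

# Separability constraint: y_i (w . x_i + b) >= 1 for every point,
# with equality for the data nearest the decision plane.
margins = y * (X @ w + b)
print(np.all(margins >= 1))  # True
```

With this scaling the nearest points on each side have margin value exactly 1, which is the normalisation the text describes.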

Intuitively, the classifier with the largest margin will give lower
expected risk (Eqn. ), i.e. better generalisation.
Comparing the two different decision planes in Figure ,
the classifier with the smaller margin will have higher expected risk.
The margin for this linear classifier is just 2/||*w*|| (Figure
). Hence, to maximise the margin, one needs to minimise
||*w*|| subject to the constraints in Eqn. . In short, the
training of this classifier is achieved by solving a linearly
constrained optimisation problem.

**Figure:** Comparing linear classifiers with different margin sizes

**Figure:** Decision margin for an oriented hyper-plane classifier
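The constrained optimisation problem can be posed directly to a general-purpose solver. The sketch below uses SciPy's SLSQP solver on hypothetical toy data (neither the data nor the solver choice comes from the text): minimise (1/2)||*w*||² subject to y_i(*w* · *x*_i + *b*) ≥ 1.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Primal variables packed as v = [w_1, w_2, b].
objective = lambda v: 0.5 * np.dot(v[:2], v[:2])  # (1/2)||w||^2

# One inequality constraint per training point:
# y_i (w . x_i + b) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w.round(3), round(b, 3))
```

Note that minimising (1/2)||*w*||² rather than ||*w*|| itself is a standard convenience: the square is differentiable everywhere and has the same minimiser.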

This optimisation problem is solved using the Lagrangian
formulation. The Lagrangian for minimising ||*w*||, with one Lagrange
multiplier α_i for each constraint, is shown in
Eqn. . This needs to be minimised with respect to *w* and *b*, while
it is simultaneously required that α_i ≥ 0 and that
Eqn. be satisfied. Applying
the primal-dual formulation, this can be achieved by maximising the
Lagrangian subject to the constraints that its gradients with respect
to *w* and *b* vanish, that α_i ≥ 0, and that Eqn. is satisfied.

From the constraints in the dual formulation, the conditions in Eqn. and , namely *w* = Σ_i α_i y_i *x*_i and Σ_i α_i y_i = 0, are obtained. These can be substituted into Eqn. to give the new Lagrangian in Eqn. . The training problem is thus transformed into maximising Eqn. subject to the constraints in Eqn. and α_i ≥ 0.
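A sketch of the dual problem on hypothetical toy data (data and solver are illustrative assumptions, not from the text): maximise L_D(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (*x*_i · *x*_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0, then recover *w* from the multipliers.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Gram matrix of the training data, weighted by the labels.
H = (y[:, None] * y[None, :]) * (X @ X.T)

# Maximising L_D(alpha) = sum(alpha) - (1/2) alpha^T H alpha
# is the same as minimising its negation.
neg_dual = lambda a: 0.5 * a @ H @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0, None)] * len(y),               # alpha_i >= 0
               constraints=[{'type': 'eq',
                             'fun': lambda a: a @ y}])     # sum alpha_i y_i = 0
alpha = res.x
w = (alpha * y) @ X  # w = sum_i alpha_i y_i x_i
print(alpha.round(4), w.round(3))
```

Only the inner products *x*_i · *x*_j of the training data appear in the dual objective, which is what later makes the kernel substitution possible.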

This is a convex QP problem, since ||*w*|| is convex (Figure
shows that the norm of a vector is convex) and all the
constraints are linear. For this particular problem, the
Karush-Kuhn-Tucker (KKT) conditions are Eqn. ,
and , with an additional complementarity condition,
α_i (y_i(*w* · *x*_i + *b*) − 1) = 0, as stated in
Eqn. . From these KKT conditions, the following conclusions can
be made.

- if α_i > 0, then y_i(*w* · *x*_i + *b*) = 1
- if y_i(*w* · *x*_i + *b*) > 1, then α_i = 0
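The complementarity condition can be verified numerically. The multipliers, weights, and data below are illustrative values assumed to satisfy the KKT conditions for a toy problem; they are not taken from the text.

```python
import numpy as np

# Hypothetical hard-margin solution on toy data:
# alpha, w, b are assumed values satisfying the KKT conditions.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0625, 0.0, 0.0625, 0.0])
w, b = np.array([0.25, 0.25]), 0.0

margins = y * (X @ w + b)
# Complementarity: alpha_i * (y_i (w . x_i + b) - 1) = 0 for every i,
# so alpha_i > 0 forces margin 1, and margin > 1 forces alpha_i = 0.
print(np.allclose(alpha * (margins - 1.0), 0.0))  # True
```

In this toy instance the second point has margin 1.5 > 1 and correspondingly α = 0, while the points with α > 0 all sit exactly on the margin.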

At this point it is beneficial to consider the significance of the α_i value for each training datum. Those training data with nonzero α_i will fall on the +1 or -1 plane (see Figure ), i.e. these are the data that contribute to defining the decision boundary. If the other data are removed and the classifier is retrained on the remaining data, the training will result in the same decision boundary. These data points are called the Support Vectors (SV). SVs with larger α_i are more important, since they have a stronger influence on the decision boundary.

Solving Eqn. will give the α_i value for all the
training data.
From Eqn. , *w* is a linear combination of the
training data (only those training points with nonzero α_i
contribute). *b* is found using Eqn. , by
selecting any training datum with nonzero α_i; it is
numerically wiser to average *b* over all such training data.
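Recovering *w* and the averaged *b* can be sketched as follows. The multipliers and data are hypothetical illustrative values (a toy problem, not from the text):

```python
import numpy as np

# Toy data and assumed dual solution alpha (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0625, 0.0, 0.0625, 0.0])

# w is a linear combination of the training data; only the
# points with nonzero alpha (the SVs) contribute.
w = (alpha * y) @ X

# Support vectors: training data with nonzero alpha.
sv = alpha > 1e-8
# For each SV, y_i (w . x_i + b) = 1  =>  b = y_i - w . x_i
# (using y_i = 1/y_i for labels in {-1, +1}).
# Averaging over all SVs is numerically more stable than
# using a single one.
b = np.mean(y[sv] - X[sv] @ w)
print(w, b)
```

Each support vector gives the same *b* in exact arithmetic; averaging simply damps the round-off differences that appear when the α_i come from a numerical solver.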

Thu Sep 10 11:05:30 BST 1998