Consider a two-class classifier (for the same problem described in the beginning of the chapter) with oriented hyper-planes. The decision function of the classifier is , where . If the training data is linearly separable, then a set of pairs can be found such that the constraints in Eqn. are satisfied. Notice that there is ambiguity in the magnitude of w and b. They can be arbitrary scaled such that , where is the training data nearest to the decision plane.
Intuitively, the classifier with the largest margin will give lower expected risk (Eqn. ), i.e. better generalisation. Comparing the two different decision planes in Figure , the classifier with smaller margin will have higher expected risk. The margin for this linear classifier is just 2/||w|| (Figure ). Hence to maximise the margin, one needs to minimise the ||w|| with the constraints in Eqn. . In short, the training of this classifier is achieved by solving a linearly constraint-ed optimisation problem.
Figure: Comparing liner classifier with different margin size
Figure: Decision margin for a oriented hyper-planes classifier
This optimisation problem is solved using the Lagrangian formulation. The Lagrangian for minimising ||w|| is shown in Eqn. . This needs to be minimised with respect to w, b, and it is simultaneously required that and Eqn. be satisfied. Applying the prima-dual formulation, this can be achieved by maximising subject to the constraints that , and also satisfying Eqn. .
From the constraints in the dual formulation, the condition in Eqn. and is obtained. These can be substituted into Eqn. to give the new Lagrangian in Eqn. . The training problem is transformed to maximise Eqn. with respect to constraints in Eqn. , and .
This is a convex QP problem, since ||w|| is convex (Figure shows that the norm of a vector is convex) and all constraints are linear. For this particular problem, the Karush-Kuhn-Tucker (KKT) conditions are Eqn. , and with an additional condition as stated in Eqn. . From these KKT condition, the following conclusion can be made.
At this point it would be beneficial to consider the significance the value for each set of training data. Those training data with nonzero values will fall on the +1 or -1 plane (see Figure ), i.e. these are the data that contribute to defining the decision boundary. If the other data are removed and the classifier is retrained on the remaining data, the training will result in the same decision boundary. These data points are called the Support Vectors (SV). SV with larger are more important, since they have stronger influence on the decision boundary.
Solving Eqn. will give the value for all . From Eqn. , w is the a linear combination of all the training data (only those training points with nonzero value contribute). b is found by using Eqn. , by selecting any training data with nonzero values. It is numerically wiser to average b over all such training data.