In this post I will go over some simple examples of applications for the optimization techniques that we have seen in the previous lectures. The general theme of these examples is the canonical Machine Learning problem of classification. Since the focus here is on optimization I will not discuss the strong interplay between the design of the optimization problems and their statistical properties in terms of generalization error. The interested reader is referred to the book ‘Learning with Kernels’ by Schölkopf and Smola, and the paper ‘Learning the kernel matrix with semi-definite programming’ by Lanckriet, Cristianini, Bartlett, El Ghaoui and Jordan.
Linear classification
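Let $(x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\}$, $i \in [m]$, be a ‘labeled’ data set. Imagine for instance that $x_i$ is a picture, and $y_i$ is equal to $1$ if the picture contains an apple (and to $-1$ otherwise). Our goal is to construct a map (or a classifier) from $\mathbb{R}^n$ to $\{-1, 1\}$ that correctly classifies our data set. We restrict our attention to linear classifiers of the form $x \mapsto \mathrm{sign}(w^{\top} x + b)$, with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. In other words we are looking for a separating hyperplane between $\{x_i : y_i = 1\}$ and $\{x_i : y_i = -1\}$. Let us assume that such a hyperplane exists; we then say that the data is linearly separable.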
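For several reasons it seems reasonable to search for the hyperplane that maximizes the margin, that is, the distance between the hyperplane and its closest data points. Up to renormalization one can assume that for all points one has $y_i (w^{\top} x_i + b) \geq 1$, with equality for the closest points. Since the distance from $x_i$ to the hyperplane is $|w^{\top} x_i + b| / \|w\|_2$, a quick picture shows that the margin is proportional to $1 / \|w\|_2$. Thus we arrive at the following optimization problem:
$$\min_{w \in \mathbb{R}^n, \, b \in \mathbb{R}} \ \frac{1}{2} \|w\|_2^2 \quad \text{subject to} \quad y_i (w^{\top} x_i + b) \geq 1, \ \forall i \in [m] .$$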
This problem is a Quadratic Program (a convex quadratic objective with linear constraints), and it can be efficiently solved with Interior Point Methods.
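As a rough illustration, the margin-maximization problem can be handed to a generic constrained solver. The sketch below minimizes $\frac{1}{2}\|w\|_2^2$ subject to $y_i (w^{\top} x_i + b) \geq 1$ on a made-up toy data set. It uses SciPy's SLSQP routine as a convenient stand-in for an interior point method; the data, the variable names and the choice of solver are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data in R^2.
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = X.shape[1]

# The decision variable z stacks (w, b).
def objective(z):
    w = z[:n]
    return 0.5 * w @ w

def objective_grad(z):
    g = z.copy()
    g[n] = 0.0  # the objective does not depend on b
    return g

def margin_constraints(z):
    # SLSQP expects inequality constraints in the form fun(z) >= 0.
    return y * (X @ z[:n] + z[n]) - 1.0

def margin_jacobian(z):
    return np.hstack([y[:, None] * X, y[:, None]])

result = minimize(
    objective,
    x0=np.zeros(n + 1),
    jac=objective_grad,
    constraints=[{"type": "ineq", "fun": margin_constraints, "jac": margin_jacobian}],
    method="SLSQP",
)

w_opt, b_opt = result.x[:n], result.x[n]
margin = 1.0 / np.linalg.norm(w_opt)
print("w =", w_opt, "b =", b_opt, "margin =", margin)
```

On this data the two closest points are $(1, 1)$ and $(-1, -1)$, so one expects $w = (1/2, 1/2)$ and $b = 0$, which is a useful sanity check on the solver output.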
Support Vector Machine (SVM)
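Let us look at the Lagrangian for the linear classification problem described above:
$$L(w, b, \alpha) = \frac{1}{2} \|w\|_2^2 + \sum_{i=1}^m \alpha_i \left( 1 - y_i (w^{\top} x_i + b) \right), \qquad \alpha \in \mathbb{R}^m_+ .$$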
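Since strong duality holds true for this problem (it is convex, its constraints are affine, and it is feasible by linear separability; see previous post), one has by the KKT conditions that at the optimum $(w, b)$ for the primal, there exists $\alpha \in \mathbb{R}^m_+$ (a solution to the dual problem) such that
$$w = \sum_{i=1}^m \alpha_i y_i x_i , \qquad (1)$$
$$\sum_{i=1}^m \alpha_i y_i = 0 , \qquad (2)$$
$$\alpha_i \left( 1 - y_i (w^{\top} x_i + b) \right) = 0 , \quad \forall i \in [m] . \qquad (3)$$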
The last equation (3) states that at any point $x_i$ which is not on the boundary of decision, that is with $y_i (w^{\top} x_i + b) > 1$, the corresponding Lagrange multiplier $\alpha_i$ must be equal to $0$. The points on the boundary of decision, that is with $y_i (w^{\top} x_i + b) = 1$, are called support vectors. In particular, given equation (1) one can see that in fact the optimal classifier depends only on these support vectors, as $w = \sum_{i : \alpha_i > 0} \alpha_i y_i x_i$, hence the name Support Vector Machine.
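To make the role of the KKT conditions concrete, here is a small numerical sketch on a made-up toy data set. The multiplier vector `alpha` below was worked out by hand for this particular data and is an assumption of the example, not the output of a solver. The code recovers $w$ from the stationarity condition $w = \sum_i \alpha_i y_i x_i$, recovers $b$ from a point with $\alpha_i > 0$, and checks that $\sum_i \alpha_i y_i = 0$ and $\alpha_i (1 - y_i (w^{\top} x_i + b)) = 0$.

```python
import numpy as np

# Hypothetical, linearly separable toy data in R^2.
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Dual variables for this toy set, computed by hand (assumption of the example).
alpha = np.array([0.25, 0.0, 0.0, 0.25, 0.0, 0.0])

# Stationarity in w: w is a combination of the data points weighted by alpha_i * y_i.
w = (alpha * y) @ X

# Support vectors are the points with a strictly positive multiplier.
support = np.flatnonzero(alpha > 1e-12)

# Complementary slackness at a support vector gives y_i (w^T x_i + b) = 1, hence b.
i0 = support[0]
b = y[i0] - X[i0] @ w

# Constraint values 1 - y_i (w^T x_i + b): non-positive means feasible.
slack = 1.0 - y * (X @ w + b)

print("w =", w, "b =", b)
print("support vectors:", X[support])
print("sum_i alpha_i y_i =", alpha @ y)
print("alpha_i * slack_i =", alpha * slack)
```

Only the two points closest to the hyperplane carry a nonzero multiplier; removing any of the other four points would leave $w$ and $b$ unchanged.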
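Another consequence of (1) is that the solution to the primal problem can be immediately obtained from the solution of the dual problem: $w$ is given by (1), and $b$ follows from (3) applied to any support vector. Since the KKT conditions are necessary and sufficient for optimality (by convexity), we can restrict our search to $(w, b, \alpha)$ that satisfy equations (1)–(2)–(3). In particular the dual problem can now be written as:
$$\max_{\alpha \in \mathbb{R}^m_+} \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \Big\| \sum_{i=1}^m \alpha_i y_i x_i \Big\|_2^2 \quad \text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 .$$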
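A simple rewriting now yields the standard dual form of the SVM, with $Y = y y^{\top}$, $K$ the Gram matrix defined by $K_{i,j} = x_i^{\top} x_j$, $\mathbb{1} \in \mathbb{R}^m$ the all-ones vector, and $\circ$ the Hadamard product between two matrices,
$$\max_{\alpha \in \mathbb{R}^m_+} \ \alpha^{\top} \mathbb{1} - \frac{1}{2} \alpha^{\top} (K \circ Y) \alpha \quad \text{subject to} \quad \alpha^{\top} y = 0 .$$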
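The rewriting rests on the identity $\alpha^{\top} (K \circ Y) \alpha = \| \sum_i \alpha_i y_i x_i \|_2^2$, where $K = X X^{\top}$ is the Gram matrix and $Y = y y^{\top}$. The following sketch checks this identity numerically on made-up data, and also evaluates the dual objective at a hand-computed feasible point to compare it with the primal value $\frac{1}{2} \|w\|_2^2$. The data and the vector `alpha_opt` are assumptions of the example.

```python
import numpy as np

# Hypothetical toy data; rows of X are the points x_i.
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T          # Gram matrix, K[i, j] = <x_i, x_j>
Y = np.outer(y, y)   # Y = y y^T
Q = K * Y            # Hadamard (entrywise) product K o Y

def dual_objective(alpha):
    return alpha @ np.ones(m) - 0.5 * alpha @ Q @ alpha

# Check the identity on an arbitrary nonnegative vector.
rng = np.random.default_rng(0)
alpha_rand = rng.uniform(0.0, 1.0, size=m)
quad_form = alpha_rand @ Q @ alpha_rand
norm_form = np.sum(((alpha_rand * y) @ X) ** 2)

# Dual value at a hand-computed feasible point versus the primal value at the induced w.
alpha_opt = np.array([0.25, 0.0, 0.0, 0.25, 0.0, 0.0])
w = (alpha_opt * y) @ X
dual_value = dual_objective(alpha_opt)
primal_value = 0.5 * w @ w

print(quad_form, norm_form)
print(dual_value, primal_value)
```

The two printed values on the last line coincide, which is what strong duality predicts when `alpha_opt` is optimal for the dual and `w` is optimal for the primal.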
The kernel trick
In the dual formulation of the SVM it happens that the original data only appears in the form of the Gram matrix $K$. The kernel trick consists in replacing the Gram matrix in the original space by the Gram matrix in a potentially much higher dimensional space. Formally one considers a mapping $\phi : \mathcal{X} \to \mathcal{H}$ from the original space of data $\mathcal{X}$ ($\mathbb{R}^n$ in our example so far, but more generally it could be an arbitrary set) to some Hilbert space $\mathcal{H}$. Then the Gram matrix is computed in the feature space $\mathcal{H}$, that is $K_{i,j} = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$. Solving the (dual) SVM with this Gram matrix then gives a linear classifier in the feature space, which when viewed in the original space might be highly non-linear.
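What is important to understand is that one does not need to compute the mapping $\phi$, but only inner products of the form $\langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. In many interesting cases the latter operation is much cheaper, and it makes sense to focus the method around this last object. The idea is that it suffices to have a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for any $x_1, \dots, x_m \in \mathcal{X}$ the kernel Gram matrix $(k(x_i, x_j))_{i,j \in [m]}$ is positive semi-definite. We call such a kernel a positive definite kernel. Typical examples on $\mathbb{R}^n$ are the polynomial kernel $k(x, x') = (x^{\top} x' + 1)^d$, and the Gaussian kernel $k(x, x') = \exp\left( - \|x - x'\|_2^2 / (2 \sigma^2) \right)$.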
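A small sketch may help to see why the feature map never has to be formed. For the degree-2 polynomial kernel $(x^{\top} x' + 1)^2$ on $\mathbb{R}^2$ one can write down an explicit 6-dimensional feature map and check that its inner products agree with the kernel. The code also builds a Gaussian kernel Gram matrix and checks that its eigenvalues are nonnegative. The function names, the parametrization of the two kernels and the random data are assumptions of this example.

```python
import numpy as np

def polynomial_kernel(A, B, degree=2):
    # k(x, x') = (x^T x' + 1)^degree, evaluated for all pairs of rows of A and B.
    return (A @ B.T + 1.0) ** degree

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows of A and B.
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def phi_quadratic(A):
    # Explicit feature map on R^2 whose inner product equals (x^T x' + 1)^2.
    x1, x2 = A[:, 0], A[:, 1]
    r2 = np.sqrt(2.0)
    return np.stack([np.ones_like(x1), r2 * x1, r2 * x2,
                     x1 ** 2, r2 * x1 * x2, x2 ** 2], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))  # 8 random points in R^2

K_poly = polynomial_kernel(X, X, degree=2)
Phi = phi_quadratic(X)
K_explicit = Phi @ Phi.T         # Gram matrix computed in the feature space

K_gauss = gaussian_kernel(X, X, sigma=1.0)
min_eig_gauss = np.min(np.linalg.eigvalsh(K_gauss))

print(np.max(np.abs(K_poly - K_explicit)))
print(min_eig_gauss)
```

For the Gaussian kernel the corresponding feature space is infinite dimensional, so only the kernel evaluation is available in practice; the Gram matrix is all the dual SVM needs.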
Kernel learning
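A natural idea is to put the kernel Gram matrix as part of the objective, in order to maximize the resulting margin. Recall that the margin is proportional to $1 / \|w\|_2$ and that, by strong duality, the optimal value of the dual SVM is equal to $\frac{1}{2} \|w\|_2^2$, so maximizing the margin amounts to minimizing the optimal value of the dual over $K$. This leads to the following optimization problem
$$\min_{K \succ 0} \ \max_{\alpha \in \mathbb{R}^m_+ , \, \alpha^{\top} y = 0} \ \alpha^{\top} \mathbb{1} - \frac{1}{2} \alpha^{\top} (K \circ Y) \alpha .$$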
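We show now how to express the above problem as an SDP. The idea is to apply the KKT conditions and strong duality to transform the inner maximization problem (over $\alpha$) into a minimization problem. Let us see how this works. First we consider the Lagrangian associated with the maximization problem:
$$L(\alpha, \lambda, \mu) = \alpha^{\top} \mathbb{1} - \frac{1}{2} \alpha^{\top} (K \circ Y) \alpha + \alpha^{\top} \lambda + \mu \, \alpha^{\top} y , \qquad \lambda \in \mathbb{R}^m_+ , \ \mu \in \mathbb{R} .$$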
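The KKT conditions show that at the optimum, one has (note that when $K \succ 0$ the matrix $K \circ Y = \mathrm{diag}(y) K \mathrm{diag}(y)$ is also positive definite, and thus it is invertible)
$$\alpha = (K \circ Y)^{-1} (\mathbb{1} + \lambda + \mu y) .$$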
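Thus the dual of the maximization problem above can be written as
$$\min_{\lambda \in \mathbb{R}^m_+ , \, \mu \in \mathbb{R}} \ \frac{1}{2} (\mathbb{1} + \lambda + \mu y)^{\top} (K \circ Y)^{-1} (\mathbb{1} + \lambda + \mu y) .$$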
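The closed form behind this dual can be checked numerically. For fixed multipliers $\lambda \geq 0$ and $\mu$, write $v = \mathbb{1} + \lambda + \mu y$ and $Q = K \circ Y$; the Lagrangian $\alpha^{\top} v - \frac{1}{2} \alpha^{\top} Q \alpha$ is a concave quadratic in $\alpha$, maximized at $\alpha = Q^{-1} v$ with value $\frac{1}{2} v^{\top} Q^{-1} v$. The sketch below verifies this on random data; the matrix `K`, the labels and the multipliers are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])

# An arbitrary positive definite matrix standing in for a kernel Gram matrix.
G = rng.standard_normal((m, m))
K = G @ G.T + np.eye(m)
Q = K * np.outer(y, y)  # K o Y, positive definite as well

# Arbitrary multipliers: lam >= 0 for the constraint alpha >= 0, mu for alpha^T y = 0.
lam = rng.uniform(0.0, 1.0, size=m)
mu = 0.3
v = np.ones(m) + lam + mu * y

def lagrangian(alpha):
    return (alpha @ np.ones(m) - 0.5 * alpha @ Q @ alpha
            + alpha @ lam + mu * (alpha @ y))

# Stationary point of the Lagrangian in alpha, and the resulting dual function value.
alpha_star = np.linalg.solve(Q, v)
dual_function_value = 0.5 * v @ alpha_star

# No perturbation of alpha_star should increase the Lagrangian.
perturbed = [lagrangian(alpha_star + 0.5 * rng.standard_normal(m)) for _ in range(100)]

print(lagrangian(alpha_star), dual_function_value, max(perturbed))
```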
Finally we use the Schur complement lemma to move this minimization into the constraints and obtain an SDP.
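Lemma 1 (Schur complement) Let
$$S = \begin{pmatrix} A & B \\ B^{\top} & C \end{pmatrix} ,$$
where $A$ is positive definite. Then $S$ is positive semi-definite if and only if $C - B^{\top} A^{-1} B$ (the Schur complement of $A$ in $S$) is positive semi-definite.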
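The lemma is easy to probe numerically in the case where $B$ is a single column $v$ and $C$ is a scalar $t$. The Schur complement is then the number $t - v^{\top} A^{-1} v$, so the block matrix should be positive semi-definite exactly when $t \geq v^{\top} A^{-1} v$. The sketch below tests one value of $t$ on each side of this threshold; the random matrices and the helper names are assumptions of the example.

```python
import numpy as np

def is_psd(M, tol=1e-9):
    # A symmetric matrix is positive semi-definite iff its smallest eigenvalue is >= 0.
    return np.min(np.linalg.eigvalsh(M)) >= -tol

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
A = G @ G.T + np.eye(4)        # positive definite block
v = rng.standard_normal(4)     # off-diagonal block B, here a single column

threshold = v @ np.linalg.solve(A, v)  # v^T A^{-1} v

def block_matrix(t):
    # S = [[A, v], [v^T, t]]
    return np.block([[A, v[:, None]],
                     [v[None, :], np.array([[t]])]])

psd_above = is_psd(block_matrix(threshold + 0.1))  # Schur complement = +0.1
psd_below = is_psd(block_matrix(threshold - 0.1))  # Schur complement = -0.1

print(psd_above, psd_below)
```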
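In our case we use this lemma to write, for any $t \in \mathbb{R}$,
$$\frac{1}{2} (\mathbb{1} + \lambda + \mu y)^{\top} (K \circ Y)^{-1} (\mathbb{1} + \lambda + \mu y) \leq t \quad \Longleftrightarrow \quad \begin{pmatrix} K \circ Y & \mathbb{1} + \lambda + \mu y \\ (\mathbb{1} + \lambda + \mu y)^{\top} & 2 t \end{pmatrix} \succeq 0 .$$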
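Putting everything together we have proved that our original problem is equivalent to:
$$\min_{K \succ 0 , \, t \in \mathbb{R} , \, \lambda \in \mathbb{R}^m_+ , \, \mu \in \mathbb{R}} \ t \quad \text{subject to} \quad \begin{pmatrix} K \circ Y & \mathbb{1} + \lambda + \mu y \\ (\mathbb{1} + \lambda + \mu y)^{\top} & 2 t \end{pmatrix} \succeq 0 ,$$
which is clearly an SDP with equality constraints. Note that for statistical reasons one usually adds a constraint of the form $\mathrm{Tr}(K) = c$ (for some fixed constant $c$) in the above optimization problem.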
By Anonymous May 5, 2013 - 9:21 pm
From the definition of the Lagrangian in the previous post, shouldn’t alpha belong to R^{-}_{m} or alternatively shouldn’t there be a – in front of alpha^{T} lambda?
Thank you
By t.k. April 26, 2013 - 8:37 am
Thanks for your blog post, it is really helpful. I have a question that I hope you can help me with:
You say “strong duality clearly holds true for this problem”, but why? I have read about Slater’s condition, but I do not see why the SVM primal problem satisfies it.