Support Vector Machine (SVM): Step-by-step Explanation and Implementation in MATLAB

Mohammad Jamhuri

Support Vector Machine (SVM) is a supervised machine learning algorithm for classification and regression tasks. Here’s a detailed description of the SVM algorithm for binary classification:

1. Problem Formulation

Given a set of training examples (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ), where each xᵢ is a feature vector and yᵢ ∈ {−1, +1} is its class label, the goal is to find a decision boundary (a hyperplane) that maximizes the margin between the two classes.

2. Define the Hyperplane

A hyperplane can be defined as wᵀx + b = 0, where w is the weight vector (normal to the hyperplane), and b is the bias term. Requiring every training example to lie on the correct side of the hyperplane with a margin of at least one, the classification condition can be rewritten as:

yᵢ(wᵀxᵢ + b) ≥ 1 for all i.

3. Maximize the Margin

The margin is the perpendicular distance between the closest data points (the support vectors) and the hyperplane, and the total separation between the two classes is 2 / ||w||. To maximize the margin, we need to minimize ||w||² / 2, subject to the constraints yᵢ(wᵀxᵢ + b) ≥ 1 for all i.

4. Lagrange Multipliers

Introduce a Lagrange multiplier αᵢ ≥ 0 for each constraint, and form the Lagrangian:

L(w, b, α) = ||w||² / 2 − Σᵢ αᵢ [yᵢ(wᵀxᵢ + b) − 1],

where the summation is over all i.

5. Find the saddle point

To find the saddle point of L(w, b, α), compute the gradients with respect to w and b, and set them to zero:

∂L/∂w = 0 ⟹ w = Σᵢ αᵢ yᵢ xᵢ
∂L/∂b = 0 ⟹ Σᵢ αᵢ yᵢ = 0

6. Dual Formulation

Substitute these expressions back into the Lagrangian to obtain the dual problem:

maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ)
subject to αᵢ ≥ 0 for all i and Σᵢ αᵢ yᵢ = 0

7. Solve the Quadratic Programming Problem

The dual problem is a convex quadratic programming problem. Solve it using optimization techniques such as the Sequential Minimal Optimization (SMO) algorithm, gradient ascent, or specialized quadratic programming solvers.

8. Obtain w and b

Once you have the optimal αᵢ values, compute the weight vector w:

w = Σᵢ αᵢ yᵢ xᵢ

To find the bias term b, use any support vector (xₛ, yₛ) where αₛ > 0:

b = yₛ − wᵀxₛ

In practice, b is usually averaged over all support vectors for numerical stability, as the implementation below does.

9. Make Predictions

For a new data point x, the predicted class label ŷ can be calculated using:

ŷ = sign(wᵀx + b)

The sign of the result determines the class: if the result is positive, the predicted class is 1, and if the result is negative, the predicted class is -1.

10. Kernel Trick (optional)

For non-linearly separable data, you can use the kernel trick to map the data to a higher-dimensional space where it becomes linearly separable. Replace the dot product (xᵢᵀxⱼ) in the dual problem with a kernel function K(xᵢ, xⱼ) that computes the dot product in the higher-dimensional space:

maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
subject to αᵢ ≥ 0 for all i and Σᵢ αᵢ yᵢ = 0

Common kernel functions include the linear kernel (K(xᵢ, xⱼ) = xᵢᵀxⱼ), polynomial kernel (K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ), and radial basis function (RBF) or Gaussian kernel (K(xᵢ, xⱼ) = exp(−γ||xᵢ − xⱼ||²)).
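
For instance, the Gaussian (RBF) kernel could be written as a small MATLAB function in the same style as the kernels implemented later in this article; the value gamma = 1 here is only an illustrative choice.

function k = rbf_kernel(x1, x2)
    % Gaussian (RBF) kernel; gamma is an illustrative value, tune as needed
    gamma = 1;
    % x1 may be a single row vector or a matrix of row vectors, matching
    % how the kernels below are called (relies on implicit expansion)
    d = sum((x1 - x2).^2, 2); % squared Euclidean distance per row
    k = exp(-gamma * d);
end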

11. Make Predictions with Kernels

For a new data point x, the predicted class label ŷ can be calculated using the kernel function:

ŷ = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b)

This is the complete step-by-step description of the Support Vector Machine algorithm. You can now use it to solve binary classification problems. For multi-class problems, you can apply one-vs-one or one-vs-all strategies, which involve training multiple binary classifiers and combining their results to make a final prediction (a minimal one-vs-all sketch is shown after the implementation below).

Here’s an implementation of a simple SVM with a linear kernel in MATLAB. The following code includes functions to create a Gram matrix, solve the quadratic programming problem, train the SVM, and make predictions.

First, create some example data. This code will generate two clusters of random 2D data points, one centered at (0, 0) and the other at (1, 1). Each data point will be assigned a label of -1 or 1, depending on its cluster.

num_points = 100;
X1 = randn(num_points, 2);
Y1 = -ones(num_points, 1);
X2 = randn(num_points, 2) + 1;
Y2 = ones(num_points, 1);

X = [X1; X2];
Y = [Y1; Y2];

We visualize the data points, using red circles for points with a label of 1 and blue squares for points with a label of -1.
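
A minimal plotting sketch matching this description:

figure; hold on;
plot(X(Y == 1, 1), X(Y == 1, 2), 'ro');   % label +1: red circles
plot(X(Y == -1, 1), X(Y == -1, 2), 'bs'); % label -1: blue squares
xlabel('x_1'); ylabel('x_2');
legend('y = +1', 'y = -1', 'Location', 'best');
hold off;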

Now we’ll define the kernel function. We’ll use a simple linear kernel, which is just the dot product of two input vectors.

function k = linear_kernel(x1, x2)
    k = x1 * x2';
end

Next, we’ll create the Gram matrix. This is a matrix of the kernel function evaluated for every pair of input data points.

function K = gram_matrix(X, kernel)
    num_points = size(X, 1);
    K = zeros(num_points);

    for i = 1:num_points
        for j = 1:num_points
            K(i, j) = kernel(X(i, :), X(j, :));
        end
    end
end

Now we’ll create a function to solve the quadratic programming problem. This function will return the Lagrange multipliers, which we’ll use to compute the support vectors and the decision boundary.

function alpha = solve_qp(Y, K, C)
    % Solve the SVM dual as a quadratic program in quadprog's standard form:
    %   minimize 1/2 * alpha' * H * alpha + f' * alpha
    %   subject to Aeq * alpha = beq and lb <= alpha <= ub
    num_points = length(Y);

    H = (Y * Y') .* K;        % quadratic term of the dual
    f = -ones(num_points, 1); % minimizing -sum(alpha) maximizes sum(alpha)
    A = [];
    b = [];
    Aeq = Y';                 % equality constraint: sum(alpha_i * y_i) = 0
    beq = 0;
    lb = zeros(num_points, 1);    % 0 <= alpha_i
    ub = C * ones(num_points, 1); % alpha_i <= C

    options = optimoptions('quadprog', 'Display', 'off');
    alpha = quadprog(H, f, A, b, Aeq, beq, lb, ub, [], options);
end

Finally, we can put it all together and train the SVM.

% Train the SVM
C = 1; % Regularization parameter
K = gram_matrix(X, @linear_kernel);
alpha = solve_qp(Y, K, C);

% Get the support vectors
threshold = 1e-5;
support_vectors = X(alpha > threshold, :);
support_vector_labels = Y(alpha > threshold);
support_vector_alphas = alpha(alpha > threshold);

% Compute the bias term
bias = 0;
for i = 1:length(support_vector_alphas)
    bias = bias + support_vector_labels(i) - sum(support_vector_alphas .* support_vector_labels .* linear_kernel(support_vectors, support_vectors(i, :)));
end
bias = bias / length(support_vector_alphas);
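
Because the kernel here is linear, the weight vector w from step 8 can also be formed explicitly, which gives a quick sanity check on the training data (this extra step is optional and not needed for the kernelized prediction function below).

% Optional: with a linear kernel, w = sum_i alpha_i * y_i * x_i
w = support_vectors' * (support_vector_alphas .* support_vector_labels);
y_train_pred = sign(X * w + bias); % predictions on the training data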

After training the SVM, we can use the support vectors, their labels, and the bias term to make predictions on new data.

function y_pred = svm_predict(X, support_vectors, support_vector_labels, support_vector_alphas, bias, kernel)
    y_pred = zeros(size(X, 1), 1);

    for i = 1:size(X, 1)
        prediction = 0;
        for j = 1:length(support_vector_alphas)
            prediction = prediction + support_vector_alphas(j) * support_vector_labels(j) * kernel(X(i, :), support_vectors(j, :));
        end
        y_pred(i) = prediction + bias;
    end
    y_pred = sign(y_pred);
end

Now you can use the svm_predict function to make predictions on new data points.

% Create some new test data
num_test_points = 50;
X1_test = randn(num_test_points, 2);
X2_test = randn(num_test_points, 2) + 1;
X_test = [X1_test; X2_test];

% Predict the labels of the test data
y_pred = svm_predict(X_test, support_vectors, support_vector_labels, support_vector_alphas, bias, @linear_kernel);
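
Because the test points were drawn from the same two clusters as the training data, we can also build their ground-truth labels and compute a rough accuracy estimate (this check is a small addition based on the generation scheme above; the clusters overlap, so perfect accuracy is not expected).

% Ground-truth labels follow the order in which X_test was assembled
y_true = [-ones(num_test_points, 1); ones(num_test_points, 1)];
accuracy = mean(y_pred == y_true);
fprintf('Test accuracy: %.2f%%\n', 100 * accuracy);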

If you want to use a different kernel function, simply define the kernel function and replace @linear_kernel with your new kernel function when calling gram_matrix and svm_predict.

For example, if you want to use a polynomial kernel with degree 3 and constant term 1, you can define the kernel function as:

function k = polynomial_kernel(x1, x2)
    degree = 3;
    c = 1;
    k = (x1 * x2' + c) .^ degree;
end

Then, train the SVM and make predictions using the @polynomial_kernel kernel function:

% Train the SVM with the polynomial kernel
K = gram_matrix(X, @polynomial_kernel);
alpha = solve_qp(Y, K, C);

% Re-extract the support vectors for the polynomial-kernel solution
support_vectors = X(alpha > threshold, :);
support_vector_labels = Y(alpha > threshold);
support_vector_alphas = alpha(alpha > threshold);

% Compute the bias term for the polynomial kernel
bias = 0;
for i = 1:length(support_vector_alphas)
    bias = bias + support_vector_labels(i) - sum(support_vector_alphas .* support_vector_labels .* polynomial_kernel(support_vectors, support_vectors(i, :)));
end
bias = bias / length(support_vector_alphas);

% Predict the labels of the test data using the polynomial kernel
y_pred = svm_predict(X_test, support_vectors, support_vector_labels, support_vector_alphas, bias, @polynomial_kernel);
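
Finally, as mentioned earlier, multi-class problems can be handled by combining binary SVMs. Below is a minimal one-vs-all sketch built on the functions defined above; the variables Y_multi (integer class labels for the rows of X) and X_new (points to classify) are hypothetical placeholders rather than part of the example data generated earlier.

% Hypothetical one-vs-all sketch using the linear kernel
classes = unique(Y_multi);
scores = zeros(size(X_new, 1), numel(classes));
Klin = gram_matrix(X, @linear_kernel); % the Gram matrix does not depend on the labels

for k = 1:numel(classes)
    % Relabel: current class (+1) versus the rest (-1)
    Yk = -ones(size(Y_multi));
    Yk(Y_multi == classes(k)) = 1;

    % Train one binary SVM for this class
    alphak = solve_qp(Yk, Klin, C);
    sv = alphak > threshold;
    svx = X(sv, :); svy = Yk(sv); sva = alphak(sv);

    % Bias term, averaged over the support vectors as in the binary case
    bk = mean(svy - (svx * svx') * (sva .* svy));

    % Raw decision values (no sign) for the new points
    scores(:, k) = X_new * svx' * (sva .* svy) + bk;
end

% Assign each point to the class with the largest decision value
[~, idx] = max(scores, [], 2);
y_multi_pred = classes(idx);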
