Understanding Stochastic Gradient Descent (SGD)

Mohammad Jamhuri
3 min read · Apr 22, 2023


Stochastic Gradient Descent (SGD) is a widely used optimization algorithm in machine learning, particularly for large-scale datasets. To understand SGD, let’s break it down into its main components.

Objective function

In machine learning, we train a model by minimizing a loss function, which measures the discrepancy between the model's predictions and the actual target values. Mathematically, the objective function is:

L(w) = (1/n) Σ_i L_i(w)

where L(w) is the average loss over the entire dataset, L_i(w) is the loss for a single data point i, and n is the number of data points.
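For concreteness, suppose L_i(w) is the cross-entropy loss of a classifier (an assumption for illustration; the article does not fix a particular loss). The average loss over the dataset could then be computed like this:

function L = average_loss(Y_pred, Y)
% Mean cross-entropy over the dataset (assumed choice of L_i).
% Y holds one-hot targets, Y_pred the predicted class probabilities.
n = size(Y, 1);                              % number of data points
L = -sum(sum(Y .* log(Y_pred + 1e-12))) / n; % 1e-12 guards against log(0)
end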

Gradient

The gradient of the loss function with respect to the model's weights w points in the direction of steepest increase of the loss. Since we want to minimize the loss, we move in the opposite direction. The gradient of the full objective is:

∇L(w) = (1/n) Σ_i ∇L_i(w)

where ∇L_i(w) is the gradient of the loss for a single data point i.

In Gradient Descent, we update the weights by taking a step in the direction of the negative gradient:

w ← w − η ∇L(w)

where η is the learning rate, a positive scalar that determines the step size.

In Stochastic Gradient Descent, instead of using the true gradient ∇L(w), we approximate it with the gradient of the loss for a single data point or a small subset (mini-batch) of data points. This makes each update far cheaper to compute, which is what makes the algorithm practical for large datasets. The weight update for SGD is:

w ← w − η ∇L_i(w)

where i is a randomly selected data point or mini-batch.
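For example, a single stochastic update using one randomly drawn data point could look like the sketch below, where compute_gradients is the helper defined in Step 3 and eta is an assumed learning-rate value:

i = randi(size(X_train, 1));                      % pick one example at random
[dW, db] = compute_gradients(X_train(i, :), y_train(i, :), W, b);
W = W - eta * dW;                                 % step against the noisy gradient estimate
b = b - eta * db;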

The steps involved in SGD are:

  • Step 1: Initialize the model’s weights (randomly, or to zeros, which is sufficient for a single-layer model).
W = zeros(num_features, num_labels);   % weight matrix, one column per class
b = zeros(1, num_labels);              % bias row vector
  • Step 2: Shuffle the training dataset.
function [X_train_shuffled, y_train_shuffled] = shuffle_batch(X_train, y_train)
% Reorder the training examples with a random permutation of the rows.
m = size(X_train, 1);                  % number of training examples
perm = randperm(m);                    % random permutation of 1..m
X_train_shuffled = X_train(perm, :);
y_train_shuffled = y_train(perm, :);
end
  • Step 3: For each training example or mini-batch, compute the gradient of the loss function with respect to the model’s weights.
function [dW, db] = compute_gradients(X_batch, y_batch, W, b)
% Gradient of the average loss over the mini-batch. With a softmax (or
% linear) model, y_pred - y_batch is the error term of the loss with
% respect to the scores, so dW and db are the gradients for W and b.
n = size(X_batch, 1);                  % mini-batch size
y_pred = forward_pass(X_batch, W, b);  % predicted outputs for the batch
dL = y_pred - y_batch;                 % prediction error
dW = (X_batch' * dL) / n;              % gradient w.r.t. the weights
db = sum(dL, 1) / n;                   % gradient w.r.t. the biases (sum over the batch)
end
  • Step 4: Update the weights using the computed gradient and the learning rate.
function [W, b] = SGD(W, b, dW, db, learning_rate)
% One SGD step: move each parameter against its gradient.
W = W - learning_rate * dW;
b = b - learning_rate * db;
end
  • Repeat steps 3–4 until the stopping criterion is met (a minimal end-to-end training loop is sketched below).
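The gradient code in Step 3 calls a forward_pass helper that maps inputs to predictions. The full version lives in the linked implementation; below is a minimal sketch that assumes a softmax classifier over a linear model, followed by a training loop that glues steps 1–4 together. Hyperparameters such as learning_rate, num_epochs, and batch_size are illustrative values, not ones taken from the article.

function y_pred = forward_pass(X, W, b)
% Assumed model: linear scores followed by a row-wise softmax.
scores = X * W + b;                    % n-by-num_labels scores
scores = scores - max(scores, [], 2);  % subtract the row max for numerical stability
expS = exp(scores);
y_pred = expS ./ sum(expS, 2);         % each row sums to 1
end

% Training loop: repeat steps 2-4 for a fixed number of epochs (assumed values).
learning_rate = 0.1;
num_epochs = 100;
batch_size = 16;
m = size(X_train, 1);

for epoch = 1:num_epochs
    [Xs, ys] = shuffle_batch(X_train, y_train);    % Step 2
    for first = 1:batch_size:m
        last = min(first + batch_size - 1, m);
        [dW, db] = compute_gradients(Xs(first:last, :), ys(first:last, :), W, b);  % Step 3
        [W, b] = SGD(W, b, dW, db, learning_rate);  % Step 4
    end
end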

A complete implementation of steps 1–4 for the Iris dataset using MATLAB can be accessed here.

The randomness in SGD helps the algorithm escape shallow local minima and, in practice, converge to a good solution even when a global minimum cannot be guaranteed. The trade-off is noisier convergence, which can be mitigated with techniques such as learning-rate schedules or momentum.
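As a sketch of those two remedies (hyperparameter values such as decay, step_every, and gamma are assumptions, not values from the article): a step decay shrinks the learning rate as training progresses, while momentum smooths the noisy per-batch gradients by accumulating a velocity.

% Learning-rate schedule: halve the step size every step_every epochs.
eta0 = 0.1; decay = 0.5; step_every = 20;
eta = eta0 * decay ^ floor(epoch / step_every);

% Momentum: update a velocity instead of stepping along the raw gradient.
% vW and vb start as zeros of the same size as W and b.
gamma = 0.9;
vW = gamma * vW + eta * dW;
vb = gamma * vb + eta * db;
W = W - vW;
b = b - vb;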

In summary, Stochastic Gradient Descent is an efficient optimization algorithm that approximates the true gradient using a single data point or a small subset of data points, making it suitable for large datasets. The key concepts involved in SGD are the objective function, gradient, and weight update.
