Prince Mensah

Implementing Neural Network from scratch-Part 2 (Softmax Classification)

2024-08-16T09:46:13+00:00

Introduction

In a previous post on binary classification, we explored how to build a neural network from scratch using the MNIST dataset, focusing on distinguishing between two digits. If you followed that guide, you should now be familiar with key concepts such as forward and backward propagation, as well as the use of the sigmoid activation function for binary outputs.

In this tutorial, we’ll expand on that foundation by modifying our neural network to handle multi-class classification. While binary classification involves only two possible outcomes, multi-class classification requires our model to choose from multiple classes—in this case, the digits 0 through 9. To achieve this, we’ll replace the sigmoid activation in the output layer with the softmax function, which will allow our network to output a probability distribution across all classes.

If you’re new to this series, I recommend checking out the previous tutorial on binary classification to get a solid understanding of the basics before diving into multi-class classification. For those who are already familiar, let’s jump right into extending our neural network to handle multiple classes!

Data Preprocessing

Before we can train our neural network on the MNIST dataset, we need to preprocess the data to ensure it’s in the right format. This involves flattening the images, normalizing the pixel values, and converting the labels into a one-hot encoded format.

def pre_process_data(train_x, train_y, test_x, test_y):
    # Flatten the input images
    train_x = train_x.reshape(train_x.shape[0], -1) / 255.  # Flatten and normalize
    test_x = test_x.reshape(test_x.shape[0], -1) / 255.  # Flatten and normalize

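    # scikit-learn >= 1.2 names this argument `sparse_output`; on older versions use `sparse=False`
    enc = OneHotEncoder(sparse_output=False, categories='auto')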
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y
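
To make the one-hot step concrete, here is a minimal NumPy-only sketch of what the encoder produces for a few labels. The helper name `one_hot` is hypothetical and not part of the tutorial's code; it assumes integer labels between 0 and `num_classes - 1`.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Hypothetical helper: map integer labels to one-hot rows."""
    labels = np.asarray(labels).astype(int)
    encoded = np.zeros((labels.shape[0], num_classes))
    # Put a 1 in the column that matches each label
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9])
print(example.shape)  # (3, 10)
print(example[0])     # 1.0 at index 3, zeros elsewhere
```

Each row has exactly one non-zero entry, which is the format the cost function and the backward pass expect for the labels.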

Checking the Data Shape

Next, we print the shapes of the preprocessed training and test datasets to confirm that the preprocessing steps were applied correctly. This helps ensure that the data is in the expected format before we proceed with training the neural network.

(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)

print("train_x's shape: " + str(train_x.shape))
print("test_x's shape: " + str(test_x.shape))

Defining the Neural Network

With our data preprocessed and ready, the next step is to define the architecture of our neural network. We’ll do this by creating a NeuralNetwork class that will handle everything from parameter initialization to training and prediction.

class NeuralNetwork:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.length = len(self.layers_size)
        self.n = 0
        self.costs = []

The setup we have implemented above is the foundation upon which the rest of the neural network operations—such as forward propagation, backpropagation, and parameter updates—will be built.

Activation Functions

Activation functions are very important because they introduce non-linearity into the model, allowing it to learn complex patterns in the data. Here, we will use two different activation functions: sigmoid for the hidden layers and softmax for the output layer.

def sigmoid(self, Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(self, Z):
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)
    
def softmax(self, Z):
    # Shift each column by its own maximum for numerical stability
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

The softmax function transforms the output of the network into a form that can be interpreted as probabilities, making it ideal for multi-class classification tasks like the MNIST dataset which has 10 different classes.
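
As a quick illustration, the sketch below applies a standalone softmax to a tiny matrix with one example per column. It shifts each column by its own maximum, which is a common stabilisation choice and an assumption of this sketch; the input values are made up.

```python
import numpy as np

def softmax(Z):
    # Subtracting the per-column maximum does not change the result,
    # but it keeps np.exp from overflowing on large inputs.
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

# Two examples (columns), three classes (rows)
Z = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])
probs = softmax(Z)
print(probs)
print(probs.sum(axis=0))  # each column sums to 1
```

The two columns differ only by a constant offset, so they map to the same probabilities, and the largest score in each column receives the largest probability.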

Forward Pass

With our activation functions defined, we can now implement the forward propagation process, where the input data is passed through the network layer by layer to produce the final output. This step involves calculating the weighted sums of the inputs, applying activation functions, and saving the necessary values for backpropagation.

def forward(self, X):
    save = {}
    A = X.T  # X is already flattened, so no further reshaping needed
    for layer in range(self.length - 1):
        Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
        A = self.sigmoid(Z)
        save["A" + str(layer + 1)] = A
        save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
        save["Z" + str(layer + 1)] = Z

    Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
    A = self.softmax(Z)
    save["A" + str(self.length)] = A
    save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
    save["Z" + str(self.length)] = Z

    return A, save

By passing the input data through each layer, the network transforms the raw input into a meaningful output—probabilities that represent the likelihood of each class.
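
The shapes involved are easier to follow on a toy problem. The following sketch traces one hidden layer and one softmax output layer with made-up sizes (4 features, 5 hidden units, 3 classes, 2 examples); the variable names are illustrative and independent of the class above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_features, n_hidden, n_classes = 2, 4, 5, 3

X = rng.random((n_examples, n_features))
W1 = rng.standard_normal((n_hidden, n_features)) / np.sqrt(n_features)
b1 = np.zeros((n_hidden, 1))
W2 = rng.standard_normal((n_classes, n_hidden)) / np.sqrt(n_hidden)
b2 = np.zeros((n_classes, 1))

A0 = X.T                                  # (n_features, n_examples)
A1 = 1 / (1 + np.exp(-(W1 @ A0 + b1)))    # sigmoid, (n_hidden, n_examples)
Z2 = W2 @ A1 + b2                         # (n_classes, n_examples)
expZ = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = expZ / expZ.sum(axis=0, keepdims=True)  # softmax output

print(A1.shape, A2.shape)  # (5, 2) (3, 2)
```

Every activation keeps one column per example, which is why the input is transposed before the first layer.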

Backward Pass

After completing the forward propagation and obtaining the network’s output, the next step is the backward pass (backpropagation). This is where we calculate the gradients of the cost function with respect to each parameter (weights and biases) and use these gradients to update the parameters, minimizing the error in predictions.

def backward(self, X, Y, save):
    
    gradients = {}
    
    save["A0"] = X.T
    
    A = save["A" + str(self.length)]
    dZ = A - Y.T
    
    dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
    db = np.sum(dZ, axis=1, keepdims=True) / self.n
    dAPrev = save["W" + str(self.length)].T.dot(dZ)
    
    gradients["dW" + str(self.length)] = dW
    gradients["db" + str(self.length)] = db
    
    for layer in range(self.length - 1, 0, -1):
        dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
        dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
        db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
        if layer > 1:
            dAPrev = save["W" + str(layer)].T.dot(dZ)
    
        gradients["dW" + str(layer)] = dW
        gradients["db" + str(layer)] = db
    
    return gradients

The backpropagation we’ve implemented above is the core mechanism that allows a neural network to learn from data. By calculating how much each parameter (weight and bias) contributes to the overall error, the network can adjust these parameters to minimize the error.
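
The backward pass starts from dZ = A - Y.T, the gradient of the cross-entropy loss with respect to the softmax inputs. The sketch below checks that identity for a single example with finite differences; the logits and label are made-up values, and the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # Loss for one example: -sum_k y_k * log(softmax(z)_k)
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])   # logits for one example
y = np.array([0.0, 1.0, 0.0])    # one-hot label

analytic = softmax(z) - y        # the gradient used in the backward pass

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```

A check like this is a cheap way to catch sign or transpose mistakes before training on the full dataset.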

Training the Neural Network

Once we’ve set up the forward and backward propagation methods, the next step is to train the neural network. Training involves repeatedly passing the training data through the network, calculating the error, and then adjusting the network’s parameters to reduce this error.

def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
    np.random.seed(1)
    
    self.n = X.shape[0]
    
    self.layers_size.insert(0, X.shape[1])
    
    self.initialize_parameters()
    for loop in range(n_iterations):
        A, save = self.forward(X)
        cost = -np.mean(Y * np.log(A.T + 1e-8))
        gradients = self.backward(X, Y, save)
    
        for layer in range(1, self.length + 1):
            self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients["dW" + str(layer)]
            self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients["db" + str(layer)]
    
        if loop % 10 == 0:
            print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))
    
        if loop % 1 == 0:
            self.costs.append(cost)

By repeating this process over many iterations, the network gradually learns to minimize the error, improving its ability to make accurate predictions.

Evaluating the Model

After training the neural network, the next step is to evaluate its performance on both the training and test datasets. Let’s implement two methods: the predict method, which makes predictions and calculates the accuracy of the model, and the plot_cost method, which lets us visualize the cost function over the course of training.

def predict(self, X, Y):
    A, cache = self.forward(X)
    y_hat = np.argmax(A, axis=0)
    Y = np.argmax(Y, axis=1)
    accuracy = (y_hat == Y).mean()
    return accuracy * 100

def plot_cost(self):
    plt.figure()
    plt.plot(np.arange(len(self.costs)), self.costs)
    plt.xlabel("epochs")
    plt.ylabel("cost")
    plt.show()

By calculating the accuracy of the model on the training and test datasets, we can assess how well the network has learned and how effectively it can generalize to new data.
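
The two argmax calls in predict use different axes because the network output has one column per example while the one-hot labels have one row per example. A small sketch with made-up numbers:

```python
import numpy as np

# Network output: rows are classes, columns are examples
A = np.array([[0.1, 0.7, 0.2],
              [0.8, 0.2, 0.3],
              [0.1, 0.1, 0.5]])
# One-hot labels: one row per example
Y = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 1, 0]])

y_hat = np.argmax(A, axis=0)    # predicted class per example
y_true = np.argmax(Y, axis=1)   # true class per example
accuracy = (y_hat == y_true).mean() * 100
print(y_hat, y_true, accuracy)
```

Here the first two examples are classified correctly and the third is not, so the accuracy is two out of three.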

Full Code Implementation

import numpy as np
import tensorflow as tf # Use to download the data 
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder


class NeuralNetwork:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.length = len(self.layers_size)
        self.n = 0
        self.costs = []

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))
    
    def softmax(self, Z):
        # Shift each column by its own maximum for numerical stability
        expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
        return expZ / expZ.sum(axis=0, keepdims=True)
    
    def initialize_parameters(self):
        np.random.seed(1)
    
        for layer in range(1, len(self.layers_size)):
            self.parameters["W" + str(layer)] = np.random.randn(self.layers_size[layer], self.layers_size[layer - 1]) / np.sqrt(
                self.layers_size[layer - 1])
            self.parameters["b" + str(layer)] = np.zeros((self.layers_size[layer], 1))
    
    def forward(self, X):
        save = {}
        A = X.T  # X is already flattened, so no further reshaping needed
        for layer in range(self.length - 1):
            Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
            A = self.sigmoid(Z)
            save["A" + str(layer + 1)] = A
            save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
            save["Z" + str(layer + 1)] = Z

        Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
        A = self.softmax(Z)
        save["A" + str(self.length)] = A
        save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
        save["Z" + str(self.length)] = Z

        return A, save

    
    def sigmoid_derivative(self, Z):
        s = 1 / (1 + np.exp(-Z))
        return s * (1 - s)
    
    def backward(self, X, Y, save):
    
        gradients = {}
    
        save["A0"] = X.T
    
        A = save["A" + str(self.length)]
        dZ = A - Y.T
    
        dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
        db = np.sum(dZ, axis=1, keepdims=True) / self.n
        dAPrev = save["W" + str(self.length)].T.dot(dZ)
    
        gradients["dW" + str(self.length)] = dW
        gradients["db" + str(self.length)] = db
    
        for layer in range(self.length - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
            dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
            db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
            if layer > 1:
                dAPrev = save["W" + str(layer)].T.dot(dZ)
    
            gradients["dW" + str(layer)] = dW
            gradients["db" + str(layer)] = db
    
        return gradients
    
    def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
        np.random.seed(1)
    
        self.n = X.shape[0]
    
        self.layers_size.insert(0, X.shape[1])
    
        self.initialize_parameters()
        for loop in range(n_iterations):
            A, save = self.forward(X)
            cost = -np.mean(Y * np.log(A.T+ 1e-8))
            gradients = self.backward(X, Y, save)
    
            for layer in range(1, self.length + 1):
                self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients[
                    "dW" + str(layer)]
                self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients[
                    "db" + str(layer)]
    
            if loop % 10 == 0:
                print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))
    
            if loop % 1 == 0:
                self.costs.append(cost)
    
    def predict(self, X, Y):
        A, cache = self.forward(X)
        y_hat = np.argmax(A, axis=0)
        Y = np.argmax(Y, axis=1)
        accuracy = (y_hat == Y).mean()
        return accuracy * 100
    
    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()


def pre_process_data(train_x, train_y, test_x, test_y):
    # Flatten the input images
    train_x = train_x.reshape(train_x.shape[0], -1) / 255.  # Flatten and normalize
    test_x = test_x.reshape(test_x.shape[0], -1) / 255.  # Flatten and normalize

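    # scikit-learn >= 1.2 names this argument `sparse_output`; on older versions use `sparse=False`
    enc = OneHotEncoder(sparse_output=False, categories='auto')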
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y



if __name__ == '__main__':
    (train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()

    train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)
    
    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))
    
    dims_of_layer = [50, 10]
    
    model = NeuralNetwork(dims_of_layer)
    model.fit(train_x, train_y, learning_rate=0.1, n_iterations=100)
    print("Train Accuracy:", model.predict(train_x, train_y))
    print("Test Accuracy:", model.predict(test_x, test_y))
    model.plot_cost()

Conclusion

In this post, we explored the process of building a neural network from scratch to perform multi-class classification on the MNIST dataset. We started by preprocessing the data, defining the network architecture, and implementing key components such as forward and backward propagation. By training the network, we minimized the error and improved its ability to classify handwritten digits accurately.

We also implemented methods to evaluate the model’s performance and visualize the cost function, providing insights into the network’s learning process. Understanding these foundational concepts equips you with the tools to tackle more complex problems and refine your models for better accuracy and efficiency. If you have any questions, feel free to leave them in the comment section.

Implementing Neural Network from scratch-Part 1 (Binary Classification)

2024-08-14T09:46:13+00:00

Introduction

Neural networks have become a powerful tool, forming the backbone of modern deep learning and powering applications from computer vision to natural language processing. Although it is simpler to use pre-built libraries like PyTorch or TensorFlow to build and train neural networks, I think it’s important to know how these models fundamentally work. In this blog post, we will build a very simple neural network from scratch using only NumPy and perform binary classification on the MNIST dataset.

We’ll focus on classifying between two distinct digits: 1 and 2. Before we dive into building the model, let’s start by downloading the MNIST dataset and performing some preprocessing that will be necessary for training the model.

Data Loading and Preprocessing

We’ll begin by loading the MNIST dataset using TensorFlow, which provides a convenient method to download and load the data. The MNIST dataset is a collection of 70,000 images of handwritten digits, each 28x28 pixels in size. After loading the data, we’ll filter it to only include the classes 1 and 2.

import numpy as np
import tensorflow as tf # Use to download the data 
np.random.seed(42) # Reproducibility.

def dataset():
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

    # Filter training data for classes 1 and 2
    index_1 = np.where(y_train == 1)
    index_2 = np.where(y_train == 2)

    index = np.concatenate([index_1[0], index_2[0]])
    np.random.shuffle(index)

    train_x = x_train[index]
    train_y = y_train[index]

    train_y[np.where(train_y == 1)] = 0
    train_y[np.where(train_y == 2)] = 1
    
    # Filter test data for classes 1 and 2
    index_1 = np.where(y_test == 1)
    index_2 = np.where(y_test == 2)

    index = np.concatenate([index_1[0], index_2[0]])
    np.random.shuffle(index)

    test_y = y_test[index]
    test_x = x_test[index]

    test_y[np.where(test_y == 1)] = 0
    test_y[np.where(test_y == 2)] = 1

    return train_x, train_y, test_x, test_y

In the above code, we load the dataset and then use NumPy to filter the images based on their labels. Finally, we relabel the data so that 1 becomes 0 and 2 becomes 1, making this a binary classification problem.

Preprocessing the Data

Next, we’ll normalize the data, which means the pixel values of the MNIST images, which range from 0 to 255, are scaled to the range 0 to 1. Since our neural network is fully connected (dense), we also need to flatten each 28x28 image into a 784-dimensional vector.

def data_preprocessing(train_x, test_x):
    # Normalize the pixel values to [0, 1]
    train_x = train_x / 255.
    test_x = test_x / 255.

    # Flatten the images from 28x28 to 784
    train_x = train_x.reshape(train_x.shape[0], -1)
    test_x = test_x.reshape(test_x.shape[0], -1)

    return train_x, test_x

train_x, train_y, test_x, test_y = dataset()
train_x, test_x = data_preprocessing(train_x, test_x)

print("train_x's shape: " + str(train_x.shape))
print("test_x's shape: " + str(test_x.shape))

Output

train_x's shape: (12700, 784)
test_x's shape: (2167, 784)

Implementing The Neural Network

Now, let’s dive into the core of this project starting with initializing the network and moving through the forward pass, backward pass, training, and prediction phases.

Initializing the Neural Network

The first step in building our neural network is to define its structure and initialize some key components. This is done in the __init__ method of the neural network class.

class NeuralNet:
  def __init__(self, size_of_layers):
    self.size_of_layers = size_of_layers
    self.parameters = {}
    self.length = len(self.size_of_layers) # number of layers
    self.n = 0 # number of training examples
    self.costs = []

With this initialization, we’ve set up the basic structure of our neural network. In the next steps, we’ll define how the network initializes its weights, performs forward passes, and updates its parameters during training.

Initializing the Network Parameters

Once we have defined the structure of our neural network, the next step is to initialize the parameters, specifically the weights and biases, for each layer.

def initialize_parameters(self):
  np.random.seed(42)
  for layer in range(1, len(self.size_of_layers)):
    self.parameters["W" + str(layer)] = np.random.randn(self.size_of_layers[layer], self.size_of_layers[layer - 1]) / np.sqrt(self.size_of_layers[layer - 1])
    self.parameters["b" + str(layer)] = np.zeros((self.size_of_layers[layer], 1))

We initialize each weight matrix W from a Gaussian distribution, with dimensions determined by the number of neurons in the current layer and the previous layer. The weights are scaled by the inverse square root of the number of neurons in the previous layer, which helps keep the scale of each layer’s pre-activations from growing with the layer width. This scaling is closely related to Xavier (Glorot) initialization; He initialization uses a factor of sqrt(2 / n) instead and is usually paired with ReLU activations. The biases b for each layer are initialized to zeros.
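
The effect of dividing by the square root of the previous layer's size can be checked numerically. This sketch assumes unit-variance random inputs (an assumption for illustration; real pixel data is distributed differently) and uses the layer sizes 784 and 196 from this post.

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out = 784, 196

# Same scaling as in initialize_parameters
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
x = rng.standard_normal((n_in, 1000))   # unit-variance inputs (assumed)

z = W @ x
print(W.std(), 1 / np.sqrt(n_in))  # both close to 0.0357
print(z.std())                     # close to 1
```

Without the scaling, the pre-activations would have a standard deviation near 28, which would push the sigmoid into its flat regions and slow down learning.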

Forward Pass: Feeding Data Through the Network

After initializing the parameters of our neural network, the next step is to define the forward pass. This is where we pass our preprocessed data through the network to generate predictions. In this step, the input data is transformed layer by layer until we reach the final output.

def forward_pass(self, X):
    save = {}

    A = X.T
    for layer in range(self.length - 1):
        Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
        A = self.sigmoid(Z)
        save["A" + str(layer + 1)] = A
        save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
        save["Z" + str(layer + 1)] = Z

    Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
    A = self.sigmoid(Z)
    save["A" + str(self.length)] = A
    save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
    save["Z" + str(self.length)] = Z

    return A, save

The forward pass we have just implemented is where the neural network processes the input data, transforms it through each layer, and produces an output prediction. And by storing intermediate results, the network prepares itself for the backward pass, where it will adjust its parameters to minimize the prediction error.

Backward Pass: Updating Parameters through Backpropagation

After implementing the forward pass and making predictions, the next important step is the backward pass, also known as backpropagation. This is where the neural network calculates the gradients of the loss function with respect to each parameter (weights and biases) and adjusts them to minimize the error in predictions.

def backward_pass(self, X, Y, save):
    save_gradients = {} 
    save["A0"] = X.T

    A = save["A" + str(self.length)]
    dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)

    dZ = dA * self.sigmoid_derivative(save["Z" + str(self.length)])
    dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
    db = np.sum(dZ, axis=1, keepdims=True) / self.n
    dAPrev = save["W" + str(self.length)].T.dot(dZ)

    save_gradients["dW" + str(self.length)] = dW
    save_gradients["db" + str(self.length)] = db

    for layer in range(self.length - 1, 0, -1):
        dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
        dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
        db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
        if layer > 1:
            dAPrev = save["W" + str(layer)].T.dot(dZ)

        save_gradients["dW" + str(layer)] = dW
        save_gradients["db" + str(layer)] = db

    return save_gradients

The backpropagation we have implemented above is an important part of the neural network. By calculating how much each parameter (weight and bias) contributes to the overall error, the network can adjust these parameters to minimize the error. This process is repeated over many iterations, gradually improving the network’s ability to make accurate predictions.
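
For a sigmoid output with binary cross-entropy, the product of dA and the sigmoid derivative at the output layer simplifies algebraically to A - Y. The sketch below confirms this on a few made-up values; it is a check of the algebra, not a replacement for the method above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Z = np.array([[-2.0, 0.3, 1.5, 4.0]])   # output-layer pre-activations, shape (1, n)
Y = np.array([0, 1, 1, 0])              # binary labels, shape (n,)

A = sigmoid(Z)
dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)
dZ_chain = dA * A * (1 - A)             # chain rule, as in backward_pass
dZ_short = A - Y                        # simplified form

print(np.allclose(dZ_chain, dZ_short))  # True
```

Using the simplified form avoids dividing by A or 1 - A, which can become numerically unsafe when the output saturates near 0 or 1.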

Training the Neural Network

Let’s now start training the neural network. The training process involves iteratively updating the network’s parameters (weights and biases) to minimize the prediction error.

def fit(self, X, Y, learning_rate=0.01, n_iterations=3000):
    np.random.seed(42)
    self.n = X.shape[0]
    self.size_of_layers.insert(0, X.shape[1])

    self.initialize_parameters()
    for loop in range(n_iterations):
        A, save = self.forward_pass(X)
        cost = np.squeeze(-(Y.dot(np.log(A.T)) + (1 - Y).dot(np.log(1 - A.T))) / self.n)
        gradients = self.backward_pass(X, Y, save)

        for layer in range(1, self.length + 1):
            self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients["dW" + str(layer)]
            self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients["db" + str(layer)]

        if loop % 10 == 0:
            print(cost)
            self.costs.append(cost)

The fit method we have implemented above is simply the training process, which repeatedly adjusts the network’s parameters using the gradients of the cost function. By the end of the training process, the network should have learned a set of parameters that minimize the error on the training data, allowing it to make accurate predictions.
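
One practical caveat: if an output saturates to exactly 0 or 1, the logarithm in the cost produces infinities or NaN. A minimal sketch of a clipped version of the same binary cross-entropy follows; the helper name `binary_cross_entropy` and the `eps` value are assumptions for illustration, not part of the class above.

```python
import numpy as np

def binary_cross_entropy(A, Y, eps=1e-12):
    """Hypothetical helper: mean binary cross-entropy with clipping.

    A has shape (1, n) (network outputs), Y has shape (n,) (0/1 labels).
    """
    A = np.clip(A, eps, 1 - eps)   # keep log() away from 0
    Y = Y.reshape(1, -1)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

A = np.array([[0.9, 0.2, 1.0]])    # the last output has saturated to exactly 1
Y = np.array([1, 0, 1])
print(binary_cross_entropy(A, Y))
```

The clipping barely changes the value for well-behaved outputs and keeps the reported cost finite for saturated ones.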

Making Predictions

After training the neural network, the next step is to use it to make predictions on new data. The predict method handles this task, taking input data and using the trained model to predict the output labels. Additionally, it calculates the accuracy of the predictions compared to the actual labels.

def predict(self, X, Y):
    A, cache = self.forward_pass(X)
    n = X.shape[0]
    pred = np.zeros((1, n))

    for idx in range(0, A.shape[1]): 
        if A[0, idx] > 0.5:
            pred[0, idx] = 1
        else:
            pred[0, idx] = 0

    print("Accuracy: " + str(np.sum((pred == Y) / n)))

def plot_cost(self):
    plt.figure()
    plt.plot(np.arange(len(self.costs)), self.costs)
    plt.xlabel("epochs")
    plt.ylabel("cost")
    plt.show()

The predict method we have implemented above allows us to evaluate how well our trained model performs on new, unseen data. This method is important for testing the generalizability of the neural network and ensuring that it can make accurate predictions outside of the training data. Lastly, we generate a plot of the cost function over the iterations, allowing us to visualize how well the model is learning over time.
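
The thresholding loop in predict can also be written without a Python loop. The sketch below applies the same 0.5 rule to a few made-up sigmoid outputs.

```python
import numpy as np

A = np.array([[0.91, 0.08, 0.55, 0.49]])   # sigmoid outputs, shape (1, n)
Y = np.array([1, 0, 0, 0])                 # true labels, shape (n,)

pred = (A > 0.5).astype(int)               # same rule as the loop
accuracy = np.mean(pred == Y)              # fraction of correct predictions
print(pred, accuracy)
```

On large datasets the vectorized comparison is noticeably faster than iterating over every example.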

Putting It All Together

With the neural network class fully implemented, we can now put everything together to train the model, make predictions, and evaluate its performance.

size_of_layers = [196, 1]

model = NeuralNet(size_of_layers)
model.fit(train_x, train_y, learning_rate=0.1, n_iterations=100)
model.predict(train_x, train_y)
model.predict(test_x, test_y)
model.plot_cost()

The above implementation is the final step: it defines the structure of our neural network, trains it on the training data, and then tests its accuracy on both the training and test datasets.

Full Code Implementation

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf # Use to download the data 
np.random.seed(42) #reproducibility.

class NeuralNet:
  def __init__(self, size_of_layers):
    self.size_of_layers = size_of_layers
    self.parameters = {}
    self.length = len(self.size_of_layers)
    self.n = 0
    self.costs = []


  def sigmoid(self, z):
    return 1/(1 + np.exp(-z))
    

  def sigmoid_derivative(self, z):
    sigma = 1/(1 + np.exp(-z))
    return sigma * (1 - sigma)
    

  def initialize_parameters(self):
    np.random.seed(42) # reproducibility
    for layer in range(1, len(self.size_of_layers)):
      self.parameters["W" + str(layer)] = np.random.randn(self.size_of_layers[layer], self.size_of_layers[layer - 1])/np.sqrt(self.size_of_layers[layer - 1])
      self.parameters["b" + str(layer)] = np.zeros((self.size_of_layers[layer], 1))

  # forward pass
  def forward_pass(self, X):
    save = {}

    A = X.T
    for layer in range(self.length - 1):
      Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
      A = self.sigmoid(Z)
      save["A" + str(layer + 1)] = A
      save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
      save["Z" + str(layer + 1)] = Z

    Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
    A = self.sigmoid(Z)
    save["A" + str(self.length)] = A
    save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
    save["Z" + str(self.length)] = Z

    return A, save

  # backward pass
  def backward_pass(self, X, Y, save):
      save_gradients = {} 
      save["A0"] = X.T

      A = save["A" + str(self.length)]
      dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)

      dZ = dA * self.sigmoid_derivative(save["Z" + str(self.length)])
      dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
      db = np.sum(dZ, axis=1, keepdims=True) / self.n
      dAPrev = save["W" + str(self.length)].T.dot(dZ)

      save_gradients["dW" + str(self.length)] = dW
      save_gradients["db" + str(self.length)] = db

      for layer in range(self.length - 1, 0, -1):
          dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
          dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
          db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
          if layer > 1:
              dAPrev = save["W" + str(layer)].T.dot(dZ)

          save_gradients["dW" + str(layer)] = dW
          save_gradients["db" + str(layer)] = db

      return save_gradients

  def fit(self, X, Y, learning_rate=0.01, n_iterations=3000):
      np.random.seed(42) # reproducibility
      self.n = X.shape[0]
      self.size_of_layers.insert(0, X.shape[1])

      self.initialize_parameters()
      for loop in range(n_iterations):
          A, save = self.forward_pass(X)
          cost = np.squeeze(-(Y.dot(np.log(A.T)) + (1 - Y).dot(np.log(1 - A.T))) / self.n)
          gradients = self.backward_pass(X, Y, save)

          for layer in range(1, self.length + 1):
              self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients[
                  "dW" + str(layer)]
              self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients[
                  "db" + str(layer)]

          if loop % 100 == 0:
              print(cost)
              self.costs.append(cost)

  def predict(self, X, Y):
      A, cache = self.forward_pass(X)
      n = X.shape[0]
      pred = np.zeros((1, n))

      for idx in range(0, A.shape[1]): 
          if A[0, idx] > 0.5:
              pred[0, idx] = 1
          else:
              pred[0, idx] = 0

      print("Accuracy: " + str(np.sum((pred == Y) / n)))

  def plot_cost(self):
      plt.figure()
      plt.plot(np.arange(len(self.costs)), self.costs)
      plt.xlabel("epochs")
      plt.ylabel("cost")
      plt.show()

def dataset():
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

  index_1 = np.where(y_train == 1)
  index_2 = np.where(y_train == 2)

  index = np.concatenate([index_1[0], index_2[0]])
  np.random.seed(42)
  np.random.shuffle(index)

  train_x = x_train[index]
  train_y = y_train[index]

  train_y[np.where(train_y == 1)] = 0
  train_y[np.where(train_y == 2)] = 1
  
  index_1 = np.where(y_test == 1)
  index_2 = np.where(y_test == 2)

  index = np.concatenate([index_1[0], index_2[0]])
  np.random.shuffle(index)


  test_y = y_test[index]
  test_x = x_test[index]

  test_y[np.where(test_y == 1)] = 0
  test_y[np.where(test_y == 2)] = 1

  return train_x, train_y, test_x, test_y

def data_preprocessing(train_x, test_x):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    # Flatten the images
    train_x = train_x.reshape(train_x.shape[0], -1)
    test_x = test_x.reshape(test_x.shape[0], -1)

    return train_x, test_x

train_x, train_y, test_x, test_y = dataset()
train_x, test_x = data_preprocessing(train_x, test_x)

print("train_x's shape: " + str(train_x.shape))
print("test_x's shape: " + str(test_x.shape)) 

size_of_layers = [196, 1]

model = NeuralNet(size_of_layers)
model.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
model.predict(train_x, train_y)
model.predict(test_x, test_y)
model.plot_cost()

Conclusion

I hope this tutorial gives a detailed walkthrough of the process of building a neural network from scratch. Understanding the core components like forward and backward propagation is crucial since they form the backbone of any neural network. From here, we can explore various optimizations to improve accuracy, speed up computation, and enhance performance. In the next steps, we’ll look at how to implement similar neural networks using popular frameworks like TensorFlow and PyTorch, which offer powerful tools for more advanced applications.

Implementing Stochastic Gradient Descent and variants from scratch.

2024-08-09T10:33:13+00:00

Welcome to the implementation of some important optimization techniques in machine learning! In this post, we’ll look at Gradient Descent (GD) and Stochastic Gradient Descent (SGD), which are two essential methods for training machine learning models. Whether you’re new to these concepts or looking to refine your understanding, this post is designed to make these methods comprehensible and practical.

We’ll walk through various SGD variants like constant and shrinking step sizes, momentum, and averaging, comparing how each one impacts the speed and accuracy of the model’s convergence. Along the way, we’ll discuss when to use each technique, what makes them effective, and how to balance computational cost with performance.

Let’s dive in together and discover the best method for training your machine learning model!

# The following libraries will be essential for our implementation.
import numpy as np
from numpy import linalg as la
from scipy.linalg import norm
import matplotlib.pyplot as plt
from numba import njit, jit  # A just in time compiler to speed things up!
%matplotlib inline

Linear Regression with Ridge Penalization

In our linear regression model with Ridge penalization, the goal is to find the weight vector $w$ that minimizes the following objective function:

\begin{equation} \label{eq:linear-regression} f(w) = \frac{1}{2n} ||Xw - y||^2 + \frac{\lambda}{2} ||w||^2, \end{equation}

where $X$ is our feature matrix, $y$ is the vector of true values, and $\lambda$ is the regularization parameter that controls the strength of the penalty on the size of the weights.

To optimize this objective function using gradient-based methods, we need to compute the gradient, whose negative gives the direction in which the function decreases most rapidly. The gradient of the objective function $f(w)$ is:

\begin{equation} \label{eq:gradient} \nabla f(w) = \frac{1}{n} X^T(Xw - y) + \lambda w, \end{equation}

where the first term $\frac{1}{n} X^T(Xw - y)$ represents the gradient of the least-squares loss, while the second term $\lambda w$ accounts for the regularization.

For stochastic gradient descent (SGD), we often update the weights using the gradient calculated from a single data point rather than the entire dataset. The gradient for a single data point $i$ is given by:

\begin{equation} \label{eq:sgd} \nabla f_i(w) = (X_i w - y_i) X_i + \lambda w \end{equation}

We will implement this as well, which will allow us to perform efficient updates in each iteration of SGD.
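As a quick sanity check on the two gradient formulas above, here is a minimal sketch showing that the per-sample gradients average out to the full gradient. The toy sizes, the random data and the variable names (per_sample, full_grad, gap) are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lbda = 20, 3, 0.1          # toy sizes, chosen only for illustration
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

# Full gradient: (1/n) X^T (Xw - y) + lambda * w
full_grad = X.T @ (X @ w - y) / n + lbda * w

# Per-sample gradients: (X_i w - y_i) X_i + lambda * w, one row per data point
residuals = X @ w - y
per_sample = residuals[:, None] * X + lbda * w

# The mean over i of the per-sample gradients recovers the full gradient
gap = np.max(np.abs(per_sample.mean(axis=0) - full_grad))
print(gap)
```

This is the property that makes a single-sample gradient a reasonable stand-in for the full gradient: on average over a uniformly chosen index, it points in the same direction.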

To ensure stable and efficient updates in gradient-based methods, it’s important to set an appropriate step size. The Lipschitz constant $L$ provides an upper bound on the gradient’s rate of change and helps in choosing this step size:

\begin{equation} \label{eq:step-size} L = \frac{||X||_2^2}{n} + \lambda, \end{equation}

which guides us in selecting a step size that prevents overshooting during optimization.

In stochastic gradient descent, where updates are made based on individual data points, the relevant constant is the largest smoothness constant over the individual data points, where $X_i$ denotes the $i$-th row of $X$:

\begin{equation} \label{eq:lmax} L_{\text{max}} = \max_{i} ||X_i||_2^2 + \lambda \end{equation}

This constant ensures that the step size is appropriately scaled, even for the most ‘difficult’ data points, preventing instability in the updates.

Lastly, when dealing with strongly convex functions, the strong convexity constant $\mu$ provides a lower bound on the curvature of the objective function.

\begin{equation} \label{eq:muconstant} \mu = \frac{\min(\text{eigenvalues}(X^TX))}{n} + \lambda \end{equation}

The strong convexity constant helps in determining how aggressively we can update our weights without risking divergence.
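To make the three constants concrete, the following sketch computes them on a small random matrix and checks the ordering $\mu \le L \le L_{\text{max}}$. The data and sizes are assumptions for illustration, and I use np.linalg.eigvalsh here because $X^TX$ is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lbda = 50, 4, 0.05         # toy sizes, chosen only for illustration
X = rng.standard_normal((n, d))

# Smoothness constant of the full objective: ||X||_2^2 / n + lambda
L = np.linalg.norm(X, ord=2) ** 2 / n + lbda

# Largest per-sample smoothness constant: max_i ||X_i||^2 + lambda
L_max = np.max(np.sum(X ** 2, axis=1)) + lbda

# Strong convexity constant: smallest eigenvalue of X^T X / n, plus lambda
mu = np.linalg.eigvalsh(X.T @ X).min() / n + lbda

print(mu, L, L_max)
print(mu <= L <= L_max)
```

The gap between L and L_max is why the stochastic methods later in the post use more conservative steps than full gradient descent.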

Let’s now put all the pieces we’ve discussed above into a LinearRegression class, which will be important for our optimization tasks.

from scipy.linalg import svd

class LinearRegression(object):
    def __init__(self, X, y, lbda):
        self.X = X
        self.y = y
        self.n, self.d = X.shape
        self.lbda = lbda  
    def grad(self, w):
        return self.X.T.dot(self.X.dot(w) - self.y) / self.n + self.lbda * w
    
    def f_i(self, i, w):
        return norm(self.X[i].dot(w) - self.y[i]) ** 2 / (2.) + self.lbda * norm(w) ** 2 / 2.  
    
    def f(self, w):
        return norm(self.X.dot(w) - self.y) ** 2 / (2. * self.n) + self.lbda * norm(w) ** 2 / 2.

    def grad_i(self, i, w):
        x_i = self.X[i]
        return (x_i.dot(w) - self.y[i]) * x_i + self.lbda * w

    def lipschitz_constant(self):
        L = norm(self.X, ord=2) ** 2 / self.n + self.lbda
        return L
    
    def L_max_constant(self):
        L_max = np.max(np.sum(self.X ** 2, axis=1)) + self.lbda
        return L_max 
    
    def mu_constant(self):
        mu =  min(abs(la.eigvals(np.dot(self.X.T,self.X)))) / self.n + self.lbda
        return mu     

Whether you’re using full-batch gradient descent or stochastic methods, this class forms the backbone of our optimization experiments, enabling us to test and compare different techniques effectively.

Logistic Regression with Ridge Penalization

Similarly, in logistic regression, our goal is to find the weight vector $w$ that minimizes the following objective function, which includes both the logistic loss and an L2 regularization term:

\begin{equation} \label{eq:logistic-regression} f(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \cdot X_i w)\right) + \frac{\lambda}{2} ||w||^2, \end{equation}

where, $X$ is the feature matrix, $y$ is the vector of binary labels, and $\lambda$ is the regularization parameter that controls the penalty on the magnitude of the weights.

To minimize this objective function using gradient-based methods, we need to compute its gradient, whose negative gives the direction in which the function decreases most rapidly. The gradient of $f(w)$ is:

\begin{equation} \label{eq:log_grad} \nabla f(w) = -\frac{1}{n} X^T \left(\frac{y}{1 + \exp(y \cdot Xw)}\right) + \lambda w . \end{equation}

The first term represents the gradient of the logistic loss, and the second term $\lambda w$ is the gradient of the L2 regularization.

For stochastic gradient descent, where we update the weights based on one data point at a time, we use the gradient calculated from that individual data point. The gradient for a single data point $i$ is:

\begin{equation} \label{eq:log_sgdgrad} \nabla f_i(w) = -\frac{y_i \cdot X_i}{1 + \exp(y_i \cdot X_i w)} + \lambda w . \end{equation}

This allows us to perform efficient updates during each iteration of SGD.
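The logistic gradients above can be checked in the same spirit. The sketch below is only an illustration on random toy data: it uses scipy.special.expit (a numerically stable sigmoid) for the factor $1 / (1 + \exp(y_i \cdot X_i w))$, confirms that it matches the formula as written, and verifies that the per-sample gradients average to the full gradient.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(2)
n, d, lbda = 30, 5, 0.1               # toy sizes, chosen only for illustration
X = rng.standard_normal((n, d))
y = np.sign(rng.standard_normal(n))   # labels in {-1, +1}
w = rng.standard_normal(d)

margins = y * (X @ w)

# 1 / (1 + exp(m)) is the sigmoid evaluated at -m
weights = expit(-margins)

# Full gradient: -(1/n) X^T (y * weights) + lambda * w
full_grad = -X.T @ (y * weights) / n + lbda * w

# Per-sample gradients stacked row by row
per_sample = -(y * weights)[:, None] * X + lbda * w
gap = np.max(np.abs(per_sample.mean(axis=0) - full_grad))

# Same factor written exactly as in the formula
naive = 1.0 / (1.0 + np.exp(margins))
print(gap, np.max(np.abs(naive - weights)))
```

For very large margins np.exp can overflow, which is one reason to prefer the expit form in practice.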

To ensure that our gradient-based methods converge efficiently, we need to carefully choose the step size. The Lipschitz constant $L$ gives us an upper bound on how much the gradient can change, helping us set a stable step size:

\begin{equation} \label{eq:log_L} L = \frac{||X||_2^2}{4n} + \lambda . \end{equation}

This helps us select a step size that prevents overshooting during optimization.

When using stochastic gradient descent, the relevant constant is again the largest smoothness constant over the individual data points, where $X_i$ denotes the $i$-th row of $X$:

\begin{equation} \label{eq:log_Lmax} L_{\text{max}} = \frac{\max_{i} ||X_i||_2^2}{4} + \lambda \end{equation}

This constant ensures that our step sizes are appropriately scaled, even for the most challenging data points.

In strongly convex optimization problems, the strong convexity constant $\mu$ plays an important role in accelerating convergence. For our logistic regression problem, the strong convexity constant is given by:

\begin{equation} \label{eq:log_mu} \mu = \lambda \end{equation}

This constant reflects the curvature of our loss function, helping us fine-tune our optimization algorithms for faster convergence.

class LogisticRegression(object):
    def __init__(self, X, y, lbda):
        self.X = X
        self.y = y
        self.n, self.d = X.shape
        self.lbda = lbda
 
    def grad(self, w):
        bAx = self.y * self.X.dot(w)
        temp = 1. / (1. + np.exp(bAx))
        grad = - (self.X.T).dot(self.y * temp) / self.n + self.lbda * w
        return grad
    
    def f_i(self,i, w):
        bAx_i = self.y[i] * np.dot(self.X[i], w)
        return np.log(1. + np.exp(- bAx_i)) + self.lbda * norm(w) ** 2 / 2.
    
    def f(self, w):
        bAx = self.y * self.X.dot(w)
        return np.mean(np.log(1. + np.exp(- bAx))) + self.lbda * norm(w) ** 2 / 2.

    def grad_i(self, i, w):
        grad = - self.X[i] * self.y[i] / (1. + np.exp(self.y[i] 
                                                      * self.X[i].dot(w)))
        grad += self.lbda * w
        return grad

    def lipschitz_constant(self):
        L = norm(self.X, ord=2) ** 2  / (4. * self.n) + self.lbda
        return L
    def L_max_constant(self):
        L_max = np.max(np.sum(self.X ** 2, axis=1))/4 + self.lbda
        return L_max 
    
    def mu_constant(self):
        mu =  self.lbda
        return mu    

Whether you’re using full-batch gradient descent, stochastic gradient descent, momentum or averaging, this class gives us the tools we need to achieve stable and efficient convergence.

Data Functions

To test and compare our optimization methods, we first need to create datasets that simulate real-world least-squares and logistic regression tasks. The code block below defines a function called simulate_linreg, which generates such a dataset for the linear regression model.

Data simulation for linear regression

from numpy.random import multivariate_normal, randn
from scipy.linalg import toeplitz  # public import path for toeplitz

    
def simulate_linreg(w, n, std=1., corr=0.5):
    """
    Simulation of the least-squares problem
    
    Parameters
    ----------
    w : np.ndarray, shape=(d,)
        The coefficients of the model
    
    n : int
        Sample size
    
    std : float, default=1.
        Standard-deviation of the noise

    corr : float, default=0.5
        Correlation of the features matrix
    """    
    d = w.shape[0]
    cov = toeplitz(corr ** np.arange(0, d))
    X = multivariate_normal(np.zeros(d), cov, size=n)
    noise = std * randn(n)
    y = X.dot(w) + noise
    return X, y

Data simulation for logistic regression

def simulate_logreg(w, n, std=1., corr=0.5):
    """
    Simulation of the logistic regression problem
    
    Parameters
    ----------
    w : np.ndarray, shape=(d,)
        The coefficients of the model
    
    n : int
        Sample size
    
    std : float, default=1.
        Standard-deviation of the noise

    corr : float, default=0.5
        Correlation of the features matrix
    """    
    # Forward the caller's std and corr instead of hard-coded defaults
    X, y = simulate_linreg(w, n, std=std, corr=corr)
    return X, np.sign(y)

Both functions are essential because they allow us to create controlled datasets, making it easier to evaluate how well our models perform under different conditions.
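To see what the corr parameter does, here is a small illustration of the Toeplitz covariance matrix built inside simulate_linreg. The dimension d = 4 is an assumption chosen so that the matrix is easy to read.

```python
import numpy as np
from scipy.linalg import toeplitz

corr, d = 0.5, 4
cov = toeplitz(corr ** np.arange(d))
print(cov)

# Entry (i, j) equals corr ** |i - j|: neighbouring features are the most
# correlated, and the correlation decays geometrically with the distance.
i, j = np.indices((d, d))
expected = corr ** np.abs(i - j)
print(np.allclose(cov, expected))
```

A larger corr makes the features more strongly correlated, which typically makes the optimization problem harder to condition.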

Generating the Dataset

In this step, we create the dataset that will be used to test our linear and logistic regression models.

Define Dimensions

d = 50
n = 1000

We set the number of features $d = 50$ and the number of data points $n = 1000.$ This means our dataset will have 50 features per data point, and we’ll generate $1000$ such data points.

Setting Up Ground Truth Coefficients

idx = np.arange(d)
w_model_truth = (-1)**idx * np.exp(-idx / 10.)

plt.stem(w_model_truth); 

We create the true coefficients $w_{\text{model_truth}}$ that the model will try to learn. These coefficients are generated using an exponential decay function, alternating signs with each feature.

Generate the Dataset

#X, y = simulate_linreg(w_model_truth, n, std=1., corr=0.1)
X, y = simulate_logreg(w_model_truth, n, std=1., corr=0.7)

Using the simulate_logreg function, we generate the feature matrix $X$ and the target labels $y$. The call requests a moderate noise level (std=1.0) and a feature correlation of corr=0.7.

This dataset simulates a realistic logistic regression problem, providing the data we need to test and refine our optimization algorithms. Please note that we will not be using the linear regression dataset for this task, which is why that line is commented out.

Selecting the Model

In this step, we choose the model that will be used for our optimization experiments. Here’s what the code does:

Set the Regularization Parameter

lbda = 1. / n ** (0.5)

We define the regularization parameter $\lambda$ as $1 / \sqrt{n}$, where $n$ is the number of data points. This setting helps balance the model complexity and prevents overfitting.

Choose the Model

#model = LinearRegression(X, y, lbda)
model = LogisticRegression(X, y, lbda)

Again I chose the logistic regression model with L2 regularization as the preferred model for this task. However, you can choose to use the linear regression model with Ridge penalization as your preferred model.

This choice determines whether you’ll be performing regression (with LinearRegression) or classification (with LogisticRegression). Depending on the dataset you’ve generated (X, y), you’ll select the appropriate model for the task.

Gradient Verification

What we want to ensure is that the analytical gradient $\nabla f_i(w)$ calculated by the model matches a numerical approximation derived from the objective function $f_i(w)$. We compare the two through the directional derivative along a random direction $\text{vec}$.

We compute the numerical directional derivative as follows: \begin{equation} \label{eq:num-grad} \text{numerical_grad} = \frac{f_i(w + \epsilon \cdot \text{vec}) - f_i(w)}{\epsilon} \end{equation}

We then compute the analytical directional derivative $\text{analytical_grad} = \nabla f_i(w)^T \text{vec}$ and check the difference: \begin{equation} \label{eq:ana-grad} \text{grad_error} = \text{numerical_grad} - \text{analytical_grad} \end{equation}

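grad_error = []
eps = 1e-7  # finite-difference step
for _ in range(n):
    i = np.random.choice(n)      # random data point
    w = np.random.randn(d)       # random evaluation point
    vec = np.random.randn(d)     # random direction
    # Forward-difference directional derivative of f_i along vec
    numerical = (model.f_i(i, w + eps * vec) - model.f_i(i, w)) / eps
    # Analytical directional derivative
    analytical = np.dot(model.grad_i(i, w), vec)
    grad_error.append(numerical - analytical)
print(np.mean(grad_error))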

Output:

2.7469189607901637e-06

The small value of 2.7469189607901637e-06 indicates that the gradients computed by the model closely match the numerical gradients. This low error confirms that our gradient implementation is correct, which matters because our optimization algorithms rely on accurate gradient calculations to update the model weights.
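If you want a tighter check, a central difference usually gives a smaller approximation error than the forward difference for the same step. The sketch below is a self-contained illustration and not part of the model classes: f_i and grad_i are standalone helper functions written for this example, following the per-sample logistic formulas from earlier, and the data is random.

```python
import numpy as np

def f_i(x_i, y_i, w, lbda):
    # Per-sample logistic loss with L2 penalty
    return np.log1p(np.exp(-y_i * x_i.dot(w))) + lbda * w.dot(w) / 2.0

def grad_i(x_i, y_i, w, lbda):
    # Per-sample gradient, matching the formula for a single data point
    return -y_i * x_i / (1.0 + np.exp(y_i * x_i.dot(w))) + lbda * w

rng = np.random.default_rng(3)
d, lbda, eps = 10, 0.1, 1e-6
x_i, y_i = rng.standard_normal(d), 1.0
w, vec = rng.standard_normal(d), rng.standard_normal(d)

f_plus = f_i(x_i, y_i, w + eps * vec, lbda)
f_minus = f_i(x_i, y_i, w - eps * vec, lbda)
f_zero = f_i(x_i, y_i, w, lbda)

forward = (f_plus - f_zero) / eps            # error of order eps
central = (f_plus - f_minus) / (2 * eps)     # error of order eps ** 2
exact = grad_i(x_i, y_i, w, lbda).dot(vec)   # analytical directional derivative

err_forward = abs(forward - exact)
err_central = abs(central - exact)
print(err_forward, err_central)
```

The price is one extra function evaluation per check, which is negligible for a one-off verification.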

Alternatively, we can also use the check_grad function from the scipy.optimize module to verify the accuracy of the gradient calculations in our LinearRegression and LogisticRegression models.

from scipy.optimize import check_grad
modellin = LinearRegression(X, y, lbda)
check_grad(modellin.f, modellin.grad, np.random.randn(d))

Output:

1.2288105629057588e-06

modellog = LogisticRegression(X, y, lbda)
check_grad(modellog.f, modellog.grad, np.random.randn(d))

Output:

1.8667365426265916e-07

What we want to do now is to use the L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) algorithm to find a highly accurate solution which will serve as a benchmark for evaluating the performance of the SGD method we’re going to implement.

from scipy.optimize import fmin_l_bfgs_b
w_init = np.zeros(d)
w_min, obj_min, _ = fmin_l_bfgs_b(model.f, w_init, model.grad, args=(), pgtol=1e-30, factr =1e-30)

print(obj_min)
print(norm(model.grad(w_min)))

Output:

0.2736626885606007
7.144141131678549e-09

From the output, obj_min = 0.2736626885606007 is the value of the objective function at the minimum found, and norm(model.grad(w_min)) = 7.144141131678549e-09 indicates that the algorithm has converged to a point where the gradient is nearly zero, meaning the solution is highly accurate.
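The same solver is also reachable through scipy.optimize.minimize with method="L-BFGS-B". As a hedged illustration of using it as a benchmark, the sketch below runs it on a small ridge regression problem (random toy data, not the dataset of this post), where the exact minimizer is known in closed form and can be compared against.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d, lbda = 200, 5, 0.1         # toy sizes, chosen only for illustration
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def f(w):
    r = X @ w - y
    return r.dot(r) / (2.0 * n) + lbda * w.dot(w) / 2.0

def grad(w):
    return X.T @ (X @ w - y) / n + lbda * w

res = minimize(f, np.zeros(d), jac=grad, method="L-BFGS-B",
               options={"gtol": 1e-12, "ftol": 1e-15})

# Ridge regression has a closed-form minimizer to compare against
w_closed = np.linalg.solve(X.T @ X / n + lbda * np.eye(d), X.T @ y / n)
dist = np.linalg.norm(res.x - w_closed)
print(dist)
```

For logistic regression there is no closed form, which is exactly why a tightly converged L-BFGS solution is used as the reference point.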

Implementing Stochastic Gradient Descent

Unlike the gradient descent method, which updates the model parameters using the entire dataset, SGD performs updates using a randomly selected data point at each iteration.

The update rule for SGD is: \begin{equation} \label{eq:sgd-update} w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla f_{i_t}(w^{(t)}) \end{equation}

where $\gamma^{(t)}$ is the learning rate at iteration $t$, and $\nabla f_{i_t}(w^{(t)})$ is the gradient with respect to the randomly chosen data point $i_t$.

To further enhance this, we can add a momentum term that helps accelerate convergence: \begin{equation} \label{eq:momentum} w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla f_{i_t}(w^{(t)}) + \text{momentum} \times (w^{(t)} - w^{(t-1)}), \end{equation} where $\text{momentum}$ is a hyperparameter that controls the influence of the previous step.

Additionally, we can use iterative averaging to improve the stability and convergence of the algorithm. After a certain number of iterations, we start averaging the iterates: \begin{equation} \label{eq:sgd_averaging} w_{\text{avg}}^{(t+1)} = \frac{1}{t - t_0 + 1} \sum_{j=t_0}^t w^{(j)} \end{equation} where $t_0$ is the iteration at which we begin averaging. Averaging can be particularly useful in the later stages of optimization to smooth out the noise introduced by stochastic updates.
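The average above does not require storing all the iterates: it can be maintained with a running update. The sketch below illustrates this on made-up iterates (random vectors standing in for $w^{(j)}$) with a hypothetical averaging start t0, and checks that the running update reproduces the plain mean.

```python
import numpy as np

rng = np.random.default_rng(5)
iterates = rng.standard_normal((12, 3))   # stand-in for w^(0), ..., w^(11)
t0 = 4                                    # hypothetical averaging start

w_avg = iterates[t0].copy()
for t in range(t0, len(iterates)):
    k = t - t0                            # number of iterates already averaged
    # Running mean: old average weighted by k, new iterate weighted by 1
    w_avg = k / (k + 1) * w_avg + iterates[t] / (k + 1)

batch_mean = iterates[t0:].mean(axis=0)
gap = np.max(np.abs(w_avg - batch_mean))
print(gap)
```

This way the memory cost of averaging is a single extra vector, regardless of how many iterations are averaged.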

Now, let’s implement the above SGD with options for momentum, averaging and step sizes.

def sgd(w0, model, indices, steps, w_min, n_iter=100, averaging_on=False, momentum =0, verbose=True, start_late_averaging = 0):
    w = w0.copy()
    w_new = w0.copy()
    n_samples, n_features = model.n, model.d  # use the model's data, not the global X
    w_average = w0.copy()
    w_test = w0.copy()
    w_old = w0.copy()
    errors = []
    err = 1.0
    objectives = []
    # Current estimation error
    if np.any(w_min):
        err = norm(w - w_min) / norm(w_min)
        errors.append(err)
    # Current objective
    obj = model.f(w) 
    objectives.append(obj)
    if verbose:
        print("Launching SGD solver...")
        print(' | '.join([name.center(8) for name in ["it", "obj", "err"]]))
    for k in range(n_iter):
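        # SGD step. Note that the momentum term sits inside the step-size factor here,
        # so it is scaled by steps[k] and subtracted, unlike in the update rule above,
        # where momentum * (w - w_old) is added outside the step-size factor.
        w_new[:] = w - steps[k] * (model.grad_i(indices[k], w) + momentum * (w - w_old))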
        w_old[:] = w
        w[:] = w_new
        if k < start_late_averaging:
            w_average[:] = w
        else:    
            k_new = k-start_late_averaging
            w_average[:] = k_new / (k_new+1) * w_average + w / (k_new+1)
            
        if averaging_on:
            w_test[:] = w_average
        else:
            w_test[:] = w
        obj = model.f(w_test) 
        if np.any(w_min):
            err = norm(w_test - w_min) / norm(w_min)
            errors.append(err)
        objectives.append(obj)
        if k % n_samples == 0 and verbose:
            if np.any(w_min):
                print(' | '.join([("%d" % k).rjust(8), 
                              ("%.2e" % obj).rjust(8), 
                              ("%.2e" % err).rjust(8)]))
            else:
                print(' | '.join([("%d" % k).rjust(8), 
                              ("%.2e" % obj).rjust(8)]))
    if averaging_on:
        w_output = w_average.copy()
    else:
        w_output = w.copy()    
    return w_output, np.array(objectives), np.array(errors)

This function provides a flexible framework for experimenting with different variants of SGD, allowing us to test the effects of momentum, averaging, and various step size schedules.

Constant and Shrinking Step Sizes (With Replacement)

Now that we’ve implemented our SGD function, it’s time to show how different step sizes impact the optimization process. Specifically, we’ll implement and compare SGD with a constant step size and SGD with a shrinking step size, both using sampling with replacement.

First, lets set up the number of iterations:

datapasses = 30 
n_iter = int(datapasses * n)

datapasses refers to the number of complete passes over the dataset. The total number of iterations, n_iter, is calculated by multiplying the number of data points $n$ by the number of passes. This means each data point is visited multiple times on average during training.

Constant Step Size (With Replacement)

In our first approach, we’ll use a constant step size throughout the optimization:

Lmax = model.L_max_constant()

indices = np.random.choice(n, n_iter + 1, replace=True)
steps = np.ones(n_iter + 1) / (2*Lmax)
w0 = np.zeros(d)
w_sgdcr, obj_sgdcr, err_sgdcr = sgd(w0, model, indices, steps, w_min, n_iter)

Shrinking Step Size (With Replacement)

Next, we’ll implement SGD using a shrinking step size:

Lmax = model.L_max_constant()

indices = np.random.choice(n, n_iter+1, replace=True)
steps =  2/(Lmax*(np.sqrt(np.arange(1, n_iter + 2))))
w_sgdsr, obj_sgdsr, err_sgdsr = sgd(w0, model, indices, steps, w_min, n_iter)

Comparing SGD with Constant and Shrinking Step Sizes

Let’s now compare the difference between SGD with constant step size and shrinking step size and observe their rate of convergence.

# Error of objective on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.semilogy(obj_sgdcr - obj_min, label="SGD const", lw=2)
plt.semilogy(obj_sgdsr - obj_min, label="SGD shrink", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Error of objective", fontsize=14)
plt.legend()
# Distance to the minimizer on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.yscale("log")
plt.semilogy(err_sgdcr , label="SGD const", lw=2)
plt.semilogy(err_sgdsr , label="SGD shrink", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Distance to the minimum", fontsize=14)
plt.legend()

A plot showing the difference between constant step size and shrinking step size in terms of convergence.

Comparing these two methods, we can see that the constant step size may be faster initially but tends to oscillate around the minimum as the iterations increase. Shrinking step sizes provide a more reliable path to convergence, making them a preferred choice in scenarios where stability and accuracy are critical.

SGD with Switching to Shrinking Step Sizes

It’s often beneficial to start with a larger, constant step size for faster convergence early on, and then transition to smaller, shrinking step sizes to fine-tune the solution.

Constant Step Size (Early Iterations)

For the first $t^*$ iterations, we use a constant step size: \begin{equation} \label{eq:const_to_switch} \gamma_t = \frac{1}{2L_{\max}} \end{equation} This ensures rapid progress toward minimizing the objective function.

Switching to Shrinking Step Sizes (Later Iterations)

After $t^*$, we switch to a shrinking step size: \begin{equation} \gamma_t = \frac{2t + 1}{(t + 1)^2 \mu}, \end{equation} where $\mu$ is the strong convexity constant of the function. The shrinking step size ensures that the updates become more conservative as the algorithm nears the optimal solution, which helps to reduce oscillations and improve stability.

Switch Point

The switch occurs at the iteration index $t^*$, which is determined by the condition: \begin{equation} t^* = 4 \times \lceil \kappa \rceil, \end{equation} where $\kappa = \frac{L_{\max}}{\mu}$ is the condition number of the problem. This point is chosen to balance between fast initial convergence and the need for more precision as we get closer to the solution.
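The schedule described above can be sketched as a small vectorized helper. The function name switching_steps and the toy values of L_max and mu are assumptions made for this illustration only.

```python
import numpy as np

def switching_steps(n_steps, L_max, mu):
    """Constant step 1 / (2 L_max) up to t*, then (2t + 1) / ((t + 1)^2 mu)."""
    kappa = L_max / mu                    # condition number
    t_star = 4 * int(np.ceil(kappa))      # switch point
    t = np.arange(n_steps)
    constant = np.full(n_steps, 1.0 / (2.0 * L_max))
    shrinking = (2.0 * t + 1.0) / ((t + 1.0) ** 2 * mu)
    return np.where(t <= t_star, constant, shrinking), t_star

steps, t_star = switching_steps(n_steps=200, L_max=2.0, mu=0.125)
print(t_star)                  # switch point for these toy constants
print(steps[0], steps[t_star + 1])
```

With these toy constants, the first shrinking step is only slightly smaller than the constant step, so the schedule changes smoothly at the switch point rather than jumping.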

Let’s now implement the above.

mu = model.mu_constant()
Kappa = Lmax/mu
tstar = 4 * int(np.ceil(Kappa))

steps_switch = np.zeros(n_iter + 1)
for i in range(n_iter):
    if i <= tstar:
        steps_switch[i] = 1 / (2 * Lmax)
    else:
        steps_switch[i] = (2 * i + 1) / ((i + 1) ** 2 * mu)

indices = np.random.choice(n, n_iter + 1, replace=True)
np.size(indices)
w_sgdss, obj_sgdss, err_sgdss = sgd(w0, model, indices, steps_switch, w_min, n_iter)

This switching approach effectively combines the advantages of both constant and shrinking step sizes. The constant step size in the early iterations allows for quick progress toward reducing the objective function. As we approach the minimum, the gradients become smaller, and the shrinking step sizes help to ensure that the updates do not overshoot the minimum.

Comparing SGD with Constant to Switching Step Sizes

Let’s now compare SGD with a constant step size against SGD with the switching step size and observe their rates of convergence.

# Plotting to compare with constant stepsize
plt.figure(figsize=(7, 5))
plt.semilogy(obj_sgdcr - obj_min, label="SGD const", lw=2)
plt.semilogy(obj_sgdss - obj_min, label="SGD switch", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Error of objective", fontsize=14)
plt.legend()
plt.axvline(x=tstar, color = "orange", linestyle='dashed')

# Distance to the minimizer on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.yscale("log")
plt.semilogy(err_sgdcr, label="SGD const", lw=2)
plt.semilogy(err_sgdss , label="SGD switch", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Distance to the minimum", fontsize=14)
plt.legend()
plt.axvline(x=tstar,  color = "orange", linestyle='dashed')

A plot showing the difference between constant step size and switching step size in terms of convergence.

The plot demonstrates that switching to shrinking step sizes outperforms the constant step size approach by reducing the oscillations and providing a smoother convergence towards the minimum.

SGD With Averaging

One powerful technique that can enhance the performance of SGD is averaging. Averaging works by calculating the mean of the iterates towards the end of the optimization process.

Here, we start averaging the iterates only in the last quarter of the total iterations, so that the average is taken over iterates that are already close to the minimum. Let’s implement it.

# Implementing averaging with SGD
indices = np.random.choice(n, n_iter+1, replace=True)
start_late_averaging = 3*n_iter/4
averaging_on = True 

w_sgdar, obj_sgdar, err_sgdar = sgd(w0, model, indices, steps_switch, w_min, n_iter, averaging_on, 0.0, True, start_late_averaging)

Comparing the Results.

Let’s now compare SGD with a constant step size, with switching step sizes, and with switching step sizes plus averaging, and observe their rates of convergence.

# Plotting to compare constant stepsize, switching, switching + averaging
plt.figure(figsize=(7, 5))
plt.semilogy(obj_sgdcr - obj_min, label="SGD const", lw=2)
plt.semilogy(obj_sgdss - obj_min, label="SGD switch", lw=2)
plt.semilogy(obj_sgdar - obj_min, label="SGD average end", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Loss function", fontsize=14)
plt.legend()
plt.axvline(x=tstar, color = "orange", linestyle='dashed')
plt.axvline(x=start_late_averaging, color = "green", linestyle='dashed')

# Distance to the minimizer on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.semilogy(err_sgdcr, label="SGD const", lw=2)
plt.semilogy(err_sgdss , label="SGD switch", lw=2)
plt.semilogy(err_sgdar , label="SGD average end", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Distance to the minimum", fontsize=14)
plt.legend()
plt.axvline(x=tstar, color = "orange", linestyle='dashed')
plt.axvline(x=start_late_averaging, color = "green", linestyle='dashed')

A plot showing the difference between constant, switch and averaging step size.

We can see that the averaging technique (green line) helps to stabilize the objective function, especially towards the end of the optimization process. This is particularly useful when we want a stable final solution, as averaging smooths out the fluctuations that stochastic updates cause in the later stages.

SGD with Momentum

Momentum is a technique used to accelerate convergence, especially in scenarios where gradients oscillate. By adding a fraction of the previous update to the current update, this method can lead to faster convergence. Please note that I have already given and explained the update rule for SGD with momentum in $\eqref{eq:momentum}$.

Now let’s implement SGD with momentum:

indices = np.random.choice(n, n_iter+1, replace=True)
averaging_on = True
start_late_averaging = 0.0
momentum = 1.0
w_sgdm, obj_sgdm, err_sgdm = sgd(w0,model, indices, steps_switch, w_min, n_iter, averaging_on, momentum, True, start_late_averaging)

For simplicity, we have set the momentum parameter to $1$. However, you can work with different values of momentum to check which one works best.
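To get a feel for how the momentum value matters, here is a minimal sketch of the update rule in $\eqref{eq:momentum}$ on a toy deterministic quadratic, with the momentum term added outside the step-size factor. The helper heavy_ball, the matrix A, the step size and the momentum value 0.5 are all assumptions chosen for illustration. This is not the sgd function used in this post.

```python
import numpy as np

def heavy_ball(grad, w0, step, momentum, n_iter):
    """w_new = w - step * grad(w) + momentum * (w - w_old)."""
    w_old = w0.copy()
    w = w0.copy()
    for _ in range(n_iter):
        w_new = w - step * grad(w) + momentum * (w - w_old)
        w_old, w = w, w_new
    return w

A = np.diag([1.0, 10.0])          # toy quadratic f(w) = 0.5 * w^T A w
grad = lambda w: A @ w            # its gradient; the minimizer is w = 0
w0 = np.array([1.0, 1.0])

w_plain = heavy_ball(grad, w0, step=0.1, momentum=0.0, n_iter=100)
w_mom = heavy_ball(grad, w0, step=0.1, momentum=0.5, n_iter=100)

print(np.linalg.norm(w_plain), np.linalg.norm(w_mom))
```

On this ill-conditioned toy problem, the run with momentum ends up much closer to the minimizer than the run without it. In the stochastic setting the picture is noisier, so it is worth trying several momentum values.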

Comparing the Results

Let’s now compare the performance of SGD with constant step size, switching step size, switching step size with averaging, and SGD with momentum.

# Plotting to compare constant stepsize, switching, switching + averaging, and momentum
plt.figure(figsize=(7, 5))
plt.semilogy(obj_sgdcr - obj_min, label="SGD const", lw=2)
plt.semilogy(obj_sgdss - obj_min, label="SGD switch", lw=2)
plt.semilogy(obj_sgdar - obj_min, label="SGD average end", lw=2)
plt.semilogy(obj_sgdm - obj_min, label="SGDm", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Loss function", fontsize=14)
plt.legend()
plt.axvline(x=tstar, color = "orange", linestyle='dashed')
plt.axvline(x=start_late_averaging, color = "purple", linestyle='dashed')

# Distance to the minimizer on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.semilogy(err_sgdcr, label="SGD const", lw=2)
plt.semilogy(err_sgdss , label="SGD switch", lw=2)
plt.semilogy(err_sgdar , label="SGD average end", lw=2)
plt.semilogy(err_sgdm , label="SGDm", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Distance to the minimum", fontsize=14)
plt.legend()
plt.axvline(x=tstar, color = "orange", linestyle='dashed')
plt.axvline(x=start_late_averaging, color = "purple", linestyle='dashed')

A plot showing the difference between constant, switch and momentum step sizes.

We can observe that SGD with momentum (red curve) shows the fastest convergence, outperforming other methods in both loss reduction and distance to the minimum. The vertical dashed lines indicate the point at which the step size switching occurs and where the late averaging begins.

SGD without Replacement

SGD without replacement selects each data point exactly once per epoch, ensuring that the model sees the entire dataset in each pass.

niters = int(datapasses * n) - 1
# Draw one random permutation of the data and repeat it for every pass
indices = np.tile(np.random.permutation(n), datapasses)
w_sgdsw, obj_sgdsw, err_sgdsw = sgd(w0, model, indices, steps_switch, w_min, niters)
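The code above reuses the same random order in every pass. A common variant draws a fresh permutation for each epoch instead. The sketch below shows one way to build such an index sequence. The helper epoch_indices and its reshuffle flag are hypothetical names introduced for this example.

```python
import numpy as np

def epoch_indices(n, n_epochs, rng, reshuffle=True):
    """Concatenate one permutation of range(n) per epoch."""
    if reshuffle:
        # Fresh random order in every epoch
        perms = [rng.permutation(n) for _ in range(n_epochs)]
    else:
        # Same random order repeated in every epoch
        perms = [rng.permutation(n)] * n_epochs
    return np.concatenate(perms)

rng = np.random.default_rng(6)
idx = epoch_indices(n=5, n_epochs=3, rng=rng)
print(idx.reshape(3, 5))   # one row per epoch, each row a permutation of 0..4
```

Either way, every data point appears exactly once per pass, which is the defining property of sampling without replacement.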

Compare Results

Let’s now compare the performance of SGD with replacement and without replacement.

# Error of objective on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.yscale("log")
plt.plot(obj_sgdss - obj_min, label="SGD with replacement", lw=2)
plt.plot(obj_sgdsw - obj_min, label="SGD without replacement", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Error of objective", fontsize=14)
plt.legend()

# Distance to the minimizer on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.yscale("log")
plt.plot(err_sgdss, label="SGD with replacement", lw=2)
plt.plot(err_sgdsw , label="SGD without replacement", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Distance to the minimum", fontsize=14)
plt.legend()

A plot showing the comparison between SGD with replacement and without replacement.

SGD without replacement demonstrates better convergence to the minimum in this experiment. A likely reason is that every data point is used exactly once per pass, which avoids the redundant updates that sampling with replacement can produce and thus leads to faster convergence.

Comparing Gradient Descent with Stochastic Gradient Descent

After looking at various forms of Stochastic Gradient Descent (SGD), it’s important to compare these results with the traditional Gradient Descent (GD) method.

Gradient Descent Implementation

In each iteration of the gradient descent algorithm, the gradient is computed using the entire dataset, and the model’s weights are updated accordingly.

import numpy as np
from numpy.linalg import norm

def gd(w0, model, step, w_min=None, n_iter=100, verbose=True):
    """Gradient descent with a constant step size.

    Returns the final iterate, the objective history and, when a reference
    point w_min is given, the history of relative errors.
    """
    w = w0.copy()
    # Track the estimation error only when a non-zero reference point is given
    track_error = w_min is not None and np.any(w_min)
    # estimation error history
    errors = []
    err = 1.
    # objective history
    objectives = []
    # Current estimation error
    if track_error:
        err = norm(w - w_min) / norm(w_min)
        errors.append(err)
    # Current objective
    obj = model.f(w)
    objectives.append(obj)
    if verbose:
        print("Launching GD solver...")
        print(' | '.join([name.center(8) for name in ["it", "obj", "err"]]))
    for k in range(n_iter):
        # Full-batch update: model.grad(w) uses the entire dataset
        w[:] = w - step * model.grad(w)
        obj = model.f(w)
        if track_error:
            err = norm(w - w_min) / norm(w_min)
            errors.append(err)
        objectives.append(obj)
        if verbose:
            print(' | '.join([("%d" % k).rjust(8),
                              ("%.2e" % obj).rjust(8),
                              ("%.2e" % err).rjust(8)]))
    return w, np.array(objectives), np.array(errors)

To ensure stable convergence in Gradient Descent, we select the step size (step) as the inverse of the Lipschitz constant of the gradient:

step = 1. / model.lipschitz_constant()
w_gd, obj_gd, err_gd = gd(w0, model, step, w_min, datapasses)
print(obj_gd)
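
To illustrate why 1/L is a safe step size, here is a minimal sketch on a hypothetical least-squares model. The `LeastSquares` class, its data, and the variable names are assumptions made for this example and are not the `model` object used in this post. For this objective the gradient's Lipschitz constant is the largest eigenvalue of XᵀX / n, and with step = 1/L the objective should never increase from one iteration to the next.

```python
import numpy as np

class LeastSquares:
    """Hypothetical model: f(w) = ||Xw - y||^2 / (2n)."""
    def __init__(self, X, y):
        self.X, self.y = X, y
        self.n = X.shape[0]

    def f(self, w):
        r = self.X @ w - self.y
        return r @ r / (2.0 * self.n)

    def grad(self, w):
        return self.X.T @ (self.X @ w - self.y) / self.n

    def lipschitz_constant(self):
        # Largest eigenvalue of X^T X / n = (largest singular value of X)^2 / n
        return np.linalg.norm(self.X, ord=2) ** 2 / self.n

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(200)

model_ls = LeastSquares(X, y)
step_ls = 1.0 / model_ls.lipschitz_constant()

w = np.zeros(10)
objectives = [model_ls.f(w)]
for _ in range(50):
    w = w - step_ls * model_ls.grad(w)
    objectives.append(model_ls.f(w))

# With step = 1/L the objective is non-increasing (up to rounding).
assert all(b <= a + 1e-12 for a, b in zip(objectives, objectives[1:]))
```

A larger step can still converge on some problems, but 1/L is the standard choice that guarantees a decrease at every iteration for a smooth objective.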

To compare GD with SGD fairly, we measure the cost of each method in per-sample gradient evaluations. One GD step requires a full pass over the n samples, whereas one SGD step uses a single sample, so the cumulative cost of GD after each iteration is:

complexityofGD = n * np.arange(0, datapasses + 1)
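
As a small worked example of this bookkeeping, the sketch below uses assumed values for `n` and `datapasses` (they are illustrative, not the values used in this post) to show that GD and SGD curves end at the same x-position once GD iterations are scaled by n.

```python
import numpy as np

n = 1000          # assumed number of training samples
datapasses = 5    # assumed number of passes over the data

# One GD step touches all n samples; one SGD step touches a single sample.
complexity_gd = n * np.arange(0, datapasses + 1)   # x-positions of the GD curve
sgd_iterations = np.arange(0, datapasses * n + 1)  # x-positions of an SGD curve

print(complexity_gd.tolist())   # [0, 1000, 2000, 3000, 4000, 5000]
print(sgd_iterations[-1])       # 5000
```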

Compare Results

Let’s now compare the performance of SGD with GD.

# Error of objective on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.semilogy(complexityofGD, obj_gd - obj_min, label="gd", lw=2)
plt.semilogy(obj_sgdss - obj_min, label="sgd switch", lw=2)
plt.semilogy(obj_sgdm - obj_min, label="sgdm", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("# SGD iterations", fontsize=14)
plt.ylabel("Loss function", fontsize=14)
plt.legend()

# Distance to the minimum on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.semilogy(complexityofGD, err_gd, label="gd", lw=2)
plt.semilogy(err_sgdss , label="sgd switch", lw=2)
plt.semilogy(err_sgdm, label="sgdm", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("# SGD iterations", fontsize=14)
plt.ylabel("Distance to the minimum", fontsize=14)
plt.legend()

A plot showing the comparison between SGD and GD.

From our comparison, the SGD variants are more computationally efficient than GD: for the same number of gradient evaluations, they make faster progress in the initial stages, which is crucial for large-scale datasets. GD converges more stably, but at a higher computational cost per step.

Comparing Test Error: Gradient Descent vs. SGD with Momentum

In this final comparison, we focus on the test error, which is important for understanding how well our models generalize to unseen data.

datapasses = 30
n_iter = int(datapasses * n)
# Step sizes decaying as 1/sqrt(t), one per SGD iteration
steps = 0.25 / np.sqrt(np.arange(1, n_iter + 2))

# Without replacement: one random permutation repeated for every pass
indices_sw = np.tile(np.random.permutation(n), datapasses)
w_sgdsw, obj_sgdsw, err_sgdswt = sgd(w0, model, indices_sw, steps, w_model_truth, n_iter - 1, verbose=False)

# With replacement
indices = np.random.choice(n, n_iter + 1, replace=True)
w_sgdar, obj_sgdar, err_sgdart = sgd(w0, model, indices, steps_switch, w_model_truth, n_iter, True, False, 3*n_iter/4)  # (datapasses-5)*n

## SGD with momentum
averaging_on = True
start_late_averaging = 0.0
momentum = 1.0
w_sgdm, obj_sgdm, err_sgdmt = sgd(w0, model, indices, steps_switch, w_model_truth, n_iter, averaging_on, momentum, True, start_late_averaging)  # (datapasses-5)*n

## GD (errors are measured against w_model_truth)
step = 1. / model.lipschitz_constant()
w_gd, obj_gd, err_gdt = gd(w0, model, step, w_model_truth, datapasses, verbose=False)
complexityofGD = n * np.arange(0, datapasses + 1)
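
The error histories above are computed against `w_model_truth` instead of the minimizer of the training objective. In `gd`, that quantity is the relative distance between the current weights and the reference weights. A minimal sketch of this measure follows; the helper name `relative_error` and the numbers are illustrative assumptions.

```python
import numpy as np
from numpy.linalg import norm

def relative_error(w, w_truth):
    """Relative distance between an estimate w and reference weights w_truth."""
    return norm(w - w_truth) / norm(w_truth)

w_truth = np.array([1.0, -2.0, 2.0])   # illustrative ground-truth weights (norm 3)
w_est = np.array([1.0, -2.0, 0.5])     # illustrative estimate
print(relative_error(w_est, w_truth))  # 0.5
```

A value of 0 means the estimate matches the reference weights exactly, and a value of 1 corresponds to the error of the all-zeros vector.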

Compare Results

Let’s compare the test error convergence of Gradient Descent (GD) and Stochastic Gradient Descent with Momentum (SGDm).

# Test error (relative distance to w_model_truth) on a logarithmic scale
plt.figure(figsize=(7, 5))
plt.yscale("log")
plt.semilogy(complexityofGD, err_gdt , label="GD", lw=2)
# plt.semilogy(err_sgdswt, label="SGD without replacement", lw=2)
# plt.semilogy(err_sgdart , label="SGD averaging end", lw=2)
plt.semilogy(err_sgdmt,  label="SGDm", lw=2)
plt.title("Convergence plot", fontsize=16)
plt.xlabel("#iterations", fontsize=14)
plt.ylabel("Test error", fontsize=14)
plt.legend()

A plot showing the test error of SGD with momentum compared with GD.

In this plot, SGDm converges faster than GD and also reaches a lower final test error. This suggests better generalization on this problem, which makes SGDm attractive for real-world applications where test performance is critical.

Conclusion

By comparing these methods with Gradient Descent, we’ve highlighted the practical advantages of SGD, particularly on large-scale datasets where computational efficiency is key. In our final comparison of test error, SGD with momentum both converged faster and reached a lower test error than GD, which makes it a strong choice for problems of this kind.