Implementing Neural Network from scratch-Part 2 (Softmax Classification)

Introduction

In a previous post on binary classification, we explored how to build a neural network from scratch using the MNIST dataset, focusing on distinguishing between two digits. If you followed that guide, you should now be familiar with key concepts such as forward and backward propagation, as well as the use of the sigmoid activation function for binary outputs.

In this tutorial, we’ll expand on that foundation by modifying our neural network to handle multi-class classification. While binary classification involves only two possible outcomes, multi-class classification requires our model to choose from multiple classes—in this case, the digits 0 through 9. To achieve this, we’ll replace the sigmoid activation in the output layer with the softmax function, which will allow our network to output a probability distribution across all classes.

If you’re new to this series, I recommend checking out the previous tutorial on binary classification to get a solid understanding of the basics before diving into multi-class classification. For those who are already familiar, let’s jump right into extending our neural network to handle multiple classes!

Data Preprocessing

Before we can train our neural network on the MNIST dataset, we need to preprocess the data to ensure it’s in the right format. This involves flattening the images, normalizing the pixel values, and converting the labels into a one-hot encoded format.

def pre_process_data(train_x, train_y, test_x, test_y):
    # Flatten the input images
    train_x = train_x.reshape(train_x.shape[0], -1) / 255.  # Flatten and normalize
    test_x = test_x.reshape(test_x.shape[0], -1) / 255.  # Flatten and normalize

    enc = OneHotEncoder(sparse=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y

Checking the Data Shape

Next, we print the shapes of the preprocessed training and test datasets to confirm that the preprocessing steps were applied correctly. This helps ensure that the data is in the expected format before we proceed with training the neural network.

(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)

print("train_x's shape: " + str(train_x.shape))
print("test_x's shape: " + str(test_x.shape))

Defining the Neural Network

With our data preprocessed and ready, the next step is to define the architecture of our neural network. We’ll do this by creating a NeuralNetwork class that will handle everything from parameter initialization to training and prediction.

class NeuralNetwork:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.length = len(self.layers_size)
        self.n = 0
        self.costs = []

The setup we have implemented above is the foundation upon which the rest of the neural network operations—such as forward propagation, backpropagation, and parameter updates—will be built.

Activation Functions

Activation functions are very impotant since they introduce non-linearity into model, helping to learn more complex patterns. for introducing non-linearity into the model, allowing it to learn complex patterns in the data. Here, we will use two different activation functions: sigmoid for the hidden layers and softmax for the output layer.

def sigmoid(self, Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(self, Z):
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)
    
def softmax(self, Z):
    expZ = np.exp(Z - np.max(Z))
    return expZ / expZ.sum(axis=0, keepdims=True)

The softmax function transforms the output of the network into a form that can be interpreted as probabilities, making it ideal for multi-class classification tasks like the MNIST dataset which has 10 different classes.

Forward Pass

With our activation functions defined, we can now implement the forward propagation process, where the input data is passed through the network layer by layer to produce the final output. This step involves calculating the weighted sums of the inputs, applying activation functions, and saving the necessary values for backpropagation.

def forward(self, X):
    save = {}
    A = X.T  # X is already flattened, so no further reshaping needed
    for layer in range(self.length - 1):
        Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
        A = self.sigmoid(Z)
        save["A" + str(layer + 1)] = A
        save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
        save["Z" + str(layer + 1)] = Z

    Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
    A = self.softmax(Z)
    save["A" + str(self.length)] = A
    save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
    save["Z" + str(self.length)] = Z

    return A, save

By passing the input data through each layer, the network transforms the raw input into a meaningful output—probabilities that represent the likelihood of each class.

Backward Pass

After completing the forward propagation and obtaining the network’s output, the next step is backward pass (backpropagation). This is where we calculate the gradients of the cost function with respect to each parameter (weights and biases) and use these gradients to update the parameters, minimizing the error in predictions.

def backward(self, X, Y, save):
    
    gradients = {}
    
    save["A0"] = X.T
    
    A = save["A" + str(self.length)]
    dZ = A - Y.T
    
    dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
    db = np.sum(dZ, axis=1, keepdims=True) / self.n
    dAPrev = save["W" + str(self.length)].T.dot(dZ)
    
    gradients["dW" + str(self.length)] = dW
    gradients["db" + str(self.length)] = db
    
    for layer in range(self.length - 1, 0, -1):
        dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
        dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
        db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
        if layer > 1:
            dAPrev = save["W" + str(layer)].T.dot(dZ)
    
        gradients["dW" + str(layer)] = dW
        gradients["db" + str(layer)] = db
    
    return gradients

The Backpropagation we’ve implemented above is the core mechanism that allows a neural network to learn from data. By calculating how much each parameter (weight and bias) contributes to the overall error, the network can adjust these parameters to minimize the error.

Training the Neural Network

Once we’ve set up the forward and backward propagation methods, the next step is to train the neural network. Training involves repeatedly passing the training data through the network, calculating the error, and then adjusting the network’s parameters to reduce this error.

def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
    np.random.seed(1)
    
    self.n = X.shape[0]
    
    self.layers_size.insert(0, X.shape[1])
    
    self.initialize_parameters()
    for loop in range(n_iterations):
        A, save = self.forward(X)
        cost = -np.mean(Y * np.log(A.T + 1e-8))
        gradients = self.backward(X, Y, save)
    
        for layer in range(1, self.length + 1):
            self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients["dW" + str(layer)]
            self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients["db" + str(layer)]
    
        if loop % 10 == 0:
            print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))
    
        if loop % 1 == 0:
            self.costs.append(cost)

By repeating this process over many iterations, the network gradually learns to minimize the error, improving its ability to make accurate predictions.

Evaluating the Model

After training the neural network, the next step is to evaluate its performance on both the training and test datasets Let’s implement two methods; the predict method which is used to make predictions and calculate the accuracy of the model, and the plot_cost method which allows us to visualize the cost function over the course of the training process.

def predict(self, X, Y):
    A, cache = self.forward(X)
    y_hat = np.argmax(A, axis=0)
    Y = np.argmax(Y, axis=1)
    accuracy = (y_hat == Y).mean()
    return accuracy * 100

def plot_cost(self):
    plt.figure()
    plt.plot(np.arange(len(self.costs)), self.costs)
    plt.xlabel("epochs")
    plt.ylabel("cost")
    plt.show()

By calculating the accuracy of the model on the training and test datasets, we can assess how well the network has learned and how effectively it can generalize to new data.

Full Code Implementation

import numpy as np
import tensorflow as tf # Use to download the data 
import matplotlib.pylab as plt
from sklearn.preprocessing import OneHotEncoder


class NeuralNetwork:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.length = len(self.layers_size)
        self.n = 0
        self.costs = []

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))
    
    def softmax(self, Z):
        expZ = np.exp(Z - np.max(Z))
        return expZ / expZ.sum(axis=0, keepdims=True)
    
    def initialize_parameters(self):
        np.random.seed(1)
    
        for layer in range(1, len(self.layers_size)):
            self.parameters["W" + str(layer)] = np.random.randn(self.layers_size[layer], self.layers_size[layer - 1]) / np.sqrt(
                self.layers_size[layer - 1])
            self.parameters["b" + str(layer)] = np.zeros((self.layers_size[layer], 1))
    
    def forward(self, X):
        save = {}
        A = X.T  # X is already flattened, so no further reshaping needed
        for layer in range(self.length - 1):
            Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
            A = self.sigmoid(Z)
            save["A" + str(layer + 1)] = A
            save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
            save["Z" + str(layer + 1)] = Z

        Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
        A = self.softmax(Z)
        save["A" + str(self.length)] = A
        save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
        save["Z" + str(self.length)] = Z

        return A, save

    
    def sigmoid_derivative(self, Z):
        s = 1 / (1 + np.exp(-Z))
        return s * (1 - s)
    
    def backward(self, X, Y, save):
    
        gradients = {}
    
        save["A0"] = X.T
    
        A = save["A" + str(self.length)]
        dZ = A - Y.T
    
        dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
        db = np.sum(dZ, axis=1, keepdims=True) / self.n
        dAPrev = save["W" + str(self.length)].T.dot(dZ)
    
        gradients["dW" + str(self.length)] = dW
        gradients["db" + str(self.length)] = db
    
        for layer in range(self.length - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
            dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
            db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
            if layer > 1:
                dAPrev = save["W" + str(layer)].T.dot(dZ)
    
            gradients["dW" + str(layer)] = dW
            gradients["db" + str(layer)] = db
    
        return gradients
    
    def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
        np.random.seed(1)
    
        self.n = X.shape[0]
    
        self.layers_size.insert(0, X.shape[1])
    
        self.initialize_parameters()
        for loop in range(n_iterations):
            A, save = self.forward(X)
            cost = -np.mean(Y * np.log(A.T+ 1e-8))
            gradients = self.backward(X, Y, save)
    
            for layer in range(1, self.length + 1):
                self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients[
                    "dW" + str(layer)]
                self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients[
                    "db" + str(layer)]
    
            if loop % 10 == 0:
                print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))
    
            if loop % 1 == 0:
                self.costs.append(cost)
    
    def predict(self, X, Y):
        A, cache = self.forward(X)
        y_hat = np.argmax(A, axis=0)
        Y = np.argmax(Y, axis=1)
        accuracy = (y_hat == Y).mean()
        return accuracy * 100
    
    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()


def pre_process_data(train_x, train_y, test_x, test_y):
    # Flatten the input images
    train_x = train_x.reshape(train_x.shape[0], -1) / 255.  # Flatten and normalize
    test_x = test_x.reshape(test_x.shape[0], -1) / 255.  # Flatten and normalize

    enc = OneHotEncoder(sparse=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y



if __name__ == '__main__':
    (train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()

    train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)
    
    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))
    
    dims_of_layer = [50, 10]
    
    model = NeuralNetwork(dims_of_layer)
    model.fit(train_x, train_y, learning_rate=0.1, n_iterations=100)
    print("Train Accuracy:", model.predict(train_x, train_y))
    print("Test Accuracy:", model.predict(test_x, test_y))
    model.plot_cost()

Conclusion

In this post, we explored the process of building a neural network from scratch to perform multi-class classification on the MNIST dataset. We started by preprocessing the data, defining the network architecture, and implementing key components such as forward and backward propagation. By training the network, we minimized the error and improved its ability to classify handwritten digits accurately.

We also implemented methods to evaluate the model’s performance and visualize the cost function, providing insights into the network’s learning process. Understanding these foundational concepts equips you with the tools to tackle more complex problems and refine your models for better accuracy and efficiency. If you have any questions, feel free to leave them in the comment section.