Introduction

Neural networks have become a powerful tool these days, forming the backbone of modern deep learning and powering almost everything from computer vison, natural language processing etc. In as much as it’s quite simpler to use pre-built libraries like Pytorch or TensorFlow to build and train neural networks, I think it’s quite important for us to know how these models fundamentally works. In this blog post, we will build a very simple neural network from scratch using on Numpy and perfom a binary classification using MNIST dataset.

We’ll focus on classifying between two distinct digits: 1 and 2. Before we dive into building the model, let’s start by downloading the MNIST dataset and perfom some preprocessing that will necessary for training the model.

Data Loading and Preprocessing

We’ll begin by loading the MNIST dataset using TensorFlow, which provides a convenient method to download and load the data. The MNIST dataset is a collection of 70,000 images of handwritten digits, each 28x28 pixels in size. After loading the data, we’ll filter it to only include the classes 1 and 2.

1
2
3
import numpy as np
import tensorflow as tf # Use to download the data 
np.random.seed(42) # Reproducibility.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
def dataset():
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

    # Filter training data for classes 1 and 2
    index_1 = np.where(y_train == 1)
    index_2 = np.where(y_train == 2)

    index = np.concatenate([index_1[0], index_2[0]])
    np.random.shuffle(index)

    train_x = x_train[index]
    train_y = y_train[index]

    train_y[np.where(train_y == 1)] = 0
    train_y[np.where(train_y == 2)] = 1
    
    # Filter test data for classes 1 and 2
    index_1 = np.where(y_test == 1)
    index_2 = np.where(y_test == 2)

    index = np.concatenate([index_1[0], index_2[0]])
    np.random.shuffle(index)

    test_y = y_test[index]
    test_x = x_test[index]

    test_y[np.where(test_y == 1)] = 0
    test_y[np.where(test_y == 2)] = 1

    return train_x, train_y, test_x, test_y

In the above code, we loaded the dataset and then use NumPy to filter the images based on their labels Finally, we relabeled the data so that 1 becomes 0 and 2 becomes 1, making this a binary classification problem.

Preprocessing the Data

The next thing we’ll do it to normalize the data, which means that the pixel values of the mnist data which ranges from 0 to 255 will now be scaled to a range between 0 and 1. And yes, since our neural network will be a fully connected (dense) network, we need to flatten each 28x28 image into a 784-dimensional vector.

1
2
3
4
5
6
7
8
9
10
def data_preprocessing(train_x, test_x):
    # Normalize the pixel values to [0, 1]
    train_x = train_x / 255.
    test_x = test_x / 255.

    # Flatten the images from 28x28 to 784
    train_x = train_x.reshape(train_x.shape[0], -1)
    test_x = test_x.reshape(test_x.shape[0], -1)

    return train_x, test_x
1
2
print("train_x's shape: " + str(train_x.shape))
print("test_x's shape: " + str(test_x.shape)) 

Output

1
2
train_x's shape: (12700, 784)
test_x's shape: (2167, 784)

Implementing The Neural Network

Now, let’s dive into the core of this project starting with initializing the network and moving through the forward pass, backward pass, training, and prediction phases.

Initializing the Neural Network

The first step in building our neural network is to define its structure and initialize some key components. This is done in the __init__ method of the neural network class.

1
2
3
4
5
6
7
class NeuralNet:
  def __init__(self, size_of_layers):
    self.size_of_layers = size_of_layers
    self.parameters = {}
    self.length = len(self.size_of_layers) # number of layers
    self.n = 0 # number of traing examples
    self.costs = []

With this initialization, we’ve set up the basic structure of our neural network. In the next steps, we’ll define how the network initializes its weights, performs forward passes, and updates its parameters during training.

Initializing the Network Parameters

Once we have defined the structure of our neural the next step is to initialize the parameters, specifically the weights and biases—for each layer.

1
2
3
4
5
def initialize_parameters(self):
  np.random.seed(42)
  for layer in range(1, len(self.size_of_layers)):
    self.parameters["W" + str(layer)] = np.random.randn(self.size_of_layers[layer], self.size_of_layers[layer - 1]) / np.sqrt(self.size_of_layers[layer - 1])
    self.parameters["b" + str(layer)] = np.zeros((self.size_of_layers[layer], 1))

We initialize a weight matrix W using a Gaussian distribution where the dimensions of this matrix are determined by the number of neurons in the current layer and the previous layer. The weights are scaled by the inverse square root of the number of neurons in the previous layer. This technique is sometimes called He or Xavier initialization. The biases b for each layer are initialized to zeros.

Forward Pass: Feeding Data Through the Network

After initializing the parameters of our neural network, the next step is to define the forward pass. This is where we pass our preprocessed data through the network to generate predictions. In this step, the input data is transformed layer by layer until we reach the final output.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def forward_pass(self, X):
    save = {}

    A = X.T
    for layer in range(self.length - 1):
        Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
        A = self.sigmoid(Z)
        save["A" + str(layer + 1)] = A
        save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
        save["Z" + str(layer + 1)] = Z

    Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
    A = self.sigmoid(Z)
    save["A" + str(self.length)] = A
    save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
    save["Z" + str(self.length)] = Z

    return A, save

The forward pass we have just implemented is where the neural network processes the input data, transforms it through each layer, and produces an output prediction. And by storing intermediate results, the network prepares itself for the backward pass, where it will adjust its parameters to minimize the prediction error.

Backward Pass: Updating Parameters through Backpropagation

After implementing the forward pass and making predictions, the next important step is the backward pass, also known as backpropagation. This is where the neural network calculates the gradients of the loss function with respect to each parameter (weights and biases) and adjusts them to minimize the error in predictions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def backward_pass(self, X, Y, save):
    save_gradients = {} 
    save["A0"] = X.T

    A = save["A" + str(self.length)]
    dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)

    dZ = dA * self.sigmoid_derivative(save["Z" + str(self.length)])
    dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
    db = np.sum(dZ, axis=1, keepdims=True) / self.n
    dAPrev = save["W" + str(self.length)].T.dot(dZ)

    save_gradients["dW" + str(self.length)] = dW
    save_gradients["db" + str(self.length)] = db

    for layer in range(self.length - 1, 0, -1):
        dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
        dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
        db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
        if layer > 1:
            dAPrev = save["W" + str(layer)].T.dot(dZ)

        save_gradients["dW" + str(layer)] = dW
        save_gradients["db" + str(layer)] = db

    return save_gradients

The backpropagation we have implemented above is an important part of the neural network. By calculating how much each parameter (weight and bias) contributes to the overall error, the network can adjust these parameters to minimize the error. This process is repeated over many iterations, gradually improving the network’s ability to make accurate predictions.

Training the Neural Network

Let’s now start training the neural network. The training process involves iteratively updating the network’s parameters (weights and biases) to minimize the prediction error.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def fit(self, X, Y, learning_rate=0.01, n_iterations=3000):
    np.random.seed(42)
    self.n = X.shape[0]
    self.size_of_layers.insert(0, X.shape[1])

    self.initialize_parameters()
    for loop in range(n_iterations):
        A, save = self.forward_pass(X)
        cost = np.squeeze(-(Y.dot(np.log(A.T)) + (1 - Y).dot(np.log(1 - A.T))) / self.n)
        gradients = self.backward_pass(X, Y, save)

        for layer in range(1, self.length + 1):
            self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients["dW" + str(layer)]
            self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients["db" + str(layer)]

        if loop % 10 == 0:
            print(cost)
            self.costs.append(cost)

The fit method we have implemented above is simply the training process which repeatedly adjusts the network’s parameters based on the outputs from the cost function. By the end of the training process, the network should have learned a set of parameters that minimize the error on the training data, allowing it to make accurate predictions.

Making Predictions

After training the neural network, the next step is to use it to make predictions on new data. The predict method handles this task, taking input data and using the trained model to predict the output labels. Additionally, it calculates the accuracy of the predictions compared to the actual labels.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def predict(self, X, Y):
    A, cache = self.forward_pass(X)
    n = X.shape[0]
    pred = np.zeros((1, n))

    for idx in range(0, A.shape[1]): 
        if A[0, idx] > 0.5:
            pred[0, idx] = 1
        else:
            pred[0, idx] = 0

    print("Accuracy: " + str(np.sum((pred == Y) / n)))

def plot_cost(self):
    plt.figure()
    plt.plot(np.arange(len(self.costs)), self.costs)
    plt.xlabel("epochs")
    plt.ylabel("cost")
    plt.show()

The predict method we have implemented above allows us to evaluate how well our trained model performs on new, unseen data. This method is import for testing the generalizability of the neural network and ensuring that it can make accurate predictions outside of the training data. Lastly, we generate a plot of the cost function over the iterations, allowing us to visualize how well the model is learning over time.

Putting It All Together

With the neural network class fully implemented, we can now put everything together to train the model, make predictions, and evaluate its performance.

1
2
3
4
5
6
7
size_of_layers = [196, 1]

model = NeuralNet(size_of_layers)
model.fit(train_x, train_y, learning_rate=0.1, n_iterations=100)
model.predict(train_x, train_y)
model.predict(test_x, test_y)
model.plot_cost()

The above implementation is the final step, which define the structure of our neural network, train it on the training data, and then test its accuracy on both the training and test datasets.

Full Code Implementation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf # Use to download the data 
np.random.seed(42) #reproducibility.

class NeuralNet:
  def __init__(self, size_of_layers):
    self.size_of_layers = size_of_layers
    self.parameters = {}
    self.length = len(self.size_of_layers)
    self.n = 0
    self.costs = []


  def sigmoid(self, z):
    return 1/(1 + np.exp(-z))
    

  def sigmoid_derivative(self, z):
    sigma = 1/(1 + np.exp(-z))
    return sigma * (1 - sigma)
    

  def initialize_parameters(self):
    np.random.seed(42) # reproducibility
    for layer in range(1, len(self.size_of_layers)):
      self.parameters["W" + str(layer)] = np.random.randn(self.size_of_layers[layer], self.size_of_layers[layer - 1])/np.sqrt(self.size_of_layers[layer - 1])
      self.parameters["b" + str(layer)] = np.zeros((self.size_of_layers[layer], 1))

  # forward pass
  def forward_pass(self, X):
    save = {}

    A = X.T
    for layer in range(self.length - 1):
      Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
      A = self.sigmoid(Z)
      save["A" + str(layer + 1)] = A
      save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
      save["Z" + str(layer + 1)] = Z

    Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
    A = self.sigmoid(Z)
    save["A" + str(self.length)] = A
    save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
    save["Z" + str(self.length)] = Z

    return A, save

  # backward pass
  def backward_pass(self, X, Y, save):
      save_gradients = {} 
      save["A0"] = X.T

      A = save["A" + str(self.length)]
      dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)

      dZ = dA * self.sigmoid_derivative(save["Z" + str(self.length)])
      dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
      db = np.sum(dZ, axis=1, keepdims=True) / self.n
      dAPrev = save["W" + str(self.length)].T.dot(dZ)

      save_gradients["dW" + str(self.length)] = dW
      save_gradients["db" + str(self.length)] = db

      for layer in range(self.length - 1, 0, -1):
          dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
          dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
          db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
          if layer > 1:
              dAPrev = save["W" + str(layer)].T.dot(dZ)

          save_gradients["dW" + str(layer)] = dW
          save_gradients["db" + str(layer)] = db

      return save_gradients

  def fit(self, X, Y, learning_rate=0.01, n_iterations=3000):
      np.random.seed(42) # reproducibility
      self.n = X.shape[0]
      self.size_of_layers.insert(0, X.shape[1])

      self.initialize_parameters()
      for loop in range(n_iterations):
          A, save = self.forward_pass(X)
          cost = np.squeeze(-(Y.dot(np.log(A.T)) + (1 - Y).dot(np.log(1 - A.T))) / self.n)
          gradients = self.backward_pass(X, Y, save)

          for layer in range(1, self.length + 1):
              self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients[
                  "dW" + str(layer)]
              self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients[
                  "db" + str(layer)]

          if loop % 100 == 0:
              print(cost)
              self.costs.append(cost)

  def predict(self, X, Y):
      A, cache = self.forward_pass(X)
      n = X.shape[0]
      pred = np.zeros((1, n))

      for idx in range(0, A.shape[1]): 
          if A[0, idx] > 0.5:
              pred[0, idx] = 1
          else:
              pred[0, idx] = 0

      print("Accuracy: " + str(np.sum((pred == Y) / n)))

  def plot_cost(self):
      plt.figure()
      plt.plot(np.arange(len(self.costs)), self.costs)
      plt.xlabel("epochs")
      plt.ylabel("cost")
      plt.show()

def dataset():
  (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

  index_1 = np.where(y_train == 1)
  index_2 = np.where(y_train == 2)

  index = np.concatenate([index_1[0], index_2[0]])
  np.random.seed(42)
  np.random.shuffle(index)

  train_x = x_train[index]
  train_y = y_train[index]

  train_y[np.where(train_y == 1)] = 0
  train_y[np.where(train_y == 2)] = 1
  
  index_1 = np.where(y_test == 1)
  index_2 = np.where(y_test == 2)

  index = np.concatenate([index_1[0], index_2[0]])
  np.random.shuffle(index)

  index = np.concatenate([index_1[0], index_2[0]])
  np.random.shuffle(index)

  test_y = y_test[index]
  test_x = x_test[index]

  test_y[np.where(test_y == 1)] = 0
  test_y[np.where(test_y == 2)] = 1

  return train_x, train_y, test_x, test_y

def data_preprocessing(train_x, test_x):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    # Flatten the images
    train_x = train_x.reshape(train_x.shape[0], -1)
    test_x = test_x.reshape(test_x.shape[0], -1)

    return train_x, test_x

train_x, train_y, test_x, test_y = dataset()
train_x, test_x = data_preprocessing(train_x, test_x)

print("train_x's shape: " + str(train_x.shape))
print("test_x's shape: " + str(test_x.shape)) 

size_of_layers = [196, 1]

model = NeuralNet(size_of_layers)
model.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
model.predict(train_x, train_y)
model.predict(test_x, test_y)
model.plot_cost()

Conclusion

I hope this tutorial provides a detailed approach of the process of building a neural network from scratch. Understanding the core components like forward and backward propagation is crucial since they form the backbone of any neural network. From here, we can explore various optimizations to improve accuracy, speed up computation, and enhance performance. In the next steps, we’ll look at how to implement similar neural networks using popular frameworks like TensorFlow and PyTorch, which offer powerful tools for more advanced applications.

Sigmoid activation function plot
Binary cross entropy loss plot
Training and validation accuracy plot