Implementing a Neural Network from Scratch, Part 2 (Softmax Classification)
In this post, we implement a neural network from scratch to perform multi-class classification on the MNIST dataset. We preprocess the data, define the network architecture, implement forward and backward propagation, and train the network to classify handwritten digits.
Introduction
In a previous post on binary classification, we explored how to build a neural network from scratch using the MNIST dataset, focusing on distinguishing between two digits. If you followed that guide, you should now be familiar with key concepts such as forward and backward propagation, as well as the use of the sigmoid activation function for binary outputs.
In this tutorial, we’ll expand on that foundation by modifying our neural network to handle multi-class classification. While binary classification involves only two possible outcomes, multi-class classification requires our model to choose from multiple classes—in this case, the digits 0 through 9. To achieve this, we’ll replace the sigmoid activation in the output layer with the softmax function, which will allow our network to output a probability distribution across all classes.
If you’re new to this series, I recommend checking out the previous tutorial on binary classification to get a solid understanding of the basics before diving into multi-class classification. For those who are already familiar, let’s jump right into extending our neural network to handle multiple classes!
Data Preprocessing
Before we can train our neural network on the MNIST dataset, we need to preprocess the data to ensure it’s in the right format. This involves flattening the images, normalizing the pixel values, and converting the labels into a one-hot encoded format.
from sklearn.preprocessing import OneHotEncoder

def pre_process_data(train_x, train_y, test_x, test_y):
    # Flatten each 28x28 image into a 784-vector and scale pixels to [0, 1]
    train_x = train_x.reshape(train_x.shape[0], -1) / 255.
    test_x = test_x.reshape(test_x.shape[0], -1) / 255.
    # One-hot encode the labels. `sparse_output` replaced `sparse` in
    # scikit-learn 1.2; on older versions use `sparse=False` instead.
    enc = OneHotEncoder(sparse_output=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))
    return train_x, train_y, test_x, test_y
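To make the flattening, normalization, and one-hot steps concrete, here is a minimal NumPy-only sketch on a toy input. The `one_hot` helper and the tiny 2x2 "images" are assumptions made for illustration; they are not part of the original code, which relies on scikit-learn's `OneHotEncoder`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: return a (n_samples, num_classes) one-hot matrix."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes))
    # Row i gets a 1 in the column given by labels[i].
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Toy stand-in for MNIST: two 2x2 "images" with pixel values in [0, 255].
images = np.array([[[0, 255], [128, 64]],
                   [[255, 255], [0, 0]]], dtype=np.uint8)
labels = np.array([2, 0])

flat = images.reshape(images.shape[0], -1) / 255.0  # shape (2, 4), values in [0, 1]
targets = one_hot(labels, num_classes=3)            # shape (2, 3)

print(flat.shape, flat.min(), flat.max())
print(targets)
```

Each row of `targets` contains a single 1 in the position of the true class, which is the format the cross-entropy cost expects.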
Checking the Data Shape
Next, we print the shapes of the preprocessed training and test datasets to confirm that the preprocessing steps were applied correctly. This helps ensure that the data is in the expected format before we proceed with training the neural network.
import tensorflow as tf  # used only to download MNIST

(train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)
print("train_x's shape: " + str(train_x.shape))  # expected: (60000, 784)
print("test_x's shape: " + str(test_x.shape))    # expected: (10000, 784)
Defining the Neural Network
With our data preprocessed and ready, the next step is to define the architecture of our neural network. We’ll do this by creating a NeuralNetwork class that will handle everything from parameter initialization to training and prediction.
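class NeuralNetwork:
    def __init__(self, layers_size):
        self.layers_size = layers_size           # units per layer, e.g. [50, 10]
        self.parameters = {}                     # weights "W1", "W2", ... and biases "b1", "b2", ...
        self.length = len(self.layers_size)      # number of layers, not counting the input
        self.n = 0                               # number of training samples, set in fit
        self.costs = []                          # cost recorded at each iteration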
The setup we have implemented above is the foundation upon which the rest of the neural network operations—such as forward propagation, backpropagation, and parameter updates—will be built.
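The `parameters` dictionary is filled in later by `initialize_parameters`, which appears only in the full code listing. As a rough sketch of what that step produces, the hypothetical `init_params` function below builds weights and biases for a list of layer sizes and prints their shapes. It scales the weights by 1/sqrt(fan_in) as the full listing does, but it uses NumPy's `default_rng` rather than `np.random.seed`, so the actual values will differ.

```python
import numpy as np

def init_params(layer_sizes, seed=1):
    """Hypothetical helper: build W and b for each layer after the input."""
    rng = np.random.default_rng(seed)
    params = {}
    for layer in range(1, len(layer_sizes)):
        fan_in = layer_sizes[layer - 1]
        # Scale by 1/sqrt(fan_in) so pre-activations do not grow with the input size.
        params["W" + str(layer)] = rng.standard_normal((layer_sizes[layer], fan_in)) / np.sqrt(fan_in)
        params["b" + str(layer)] = np.zeros((layer_sizes[layer], 1))
    return params

# 784 inputs (28x28 pixels), 50 hidden units, 10 output classes.
params = init_params([784, 50, 10])
for name, value in params.items():
    print(name, value.shape)
```

Each weight matrix has one row per unit in its layer and one column per unit in the previous layer, so `W.dot(A)` works when the samples are stored as columns of `A`.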
Activation Functions
Activation functions are important because they introduce non-linearity into the model, allowing it to learn complex patterns in the data. Here, we will use two different activation functions: sigmoid for the hidden layers and softmax for the output layer.
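    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))

    def sigmoid_derivative(self, Z):
        s = self.sigmoid(Z)
        return s * (1 - s)

    def softmax(self, Z):
        # Subtract each column's maximum before exponentiating for numerical
        # stability; this does not change the result.
        expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
        return expZ / expZ.sum(axis=0, keepdims=True)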
The softmax function transforms the raw output of the network into values that are non-negative and sum to 1 for each sample, so they can be interpreted as class probabilities. That makes it a natural fit for multi-class classification tasks like MNIST, which has 10 classes.
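A small worked example may help here. The sketch below is a standalone function (not the class method) that applies softmax column by column, with each column holding the scores for one sample. It subtracts the maximum of each column before exponentiating, which leaves the result unchanged mathematically but keeps `np.exp` from overflowing. The input values are made up for illustration.

```python
import numpy as np

def softmax_columns(Z):
    """Column-wise softmax: each column of Z holds the scores for one sample."""
    shifted = Z - np.max(Z, axis=0, keepdims=True)  # largest score per column becomes 0
    expZ = np.exp(shifted)
    return expZ / expZ.sum(axis=0, keepdims=True)

# Three classes (rows), two samples (columns).
Z = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])
probs = softmax_columns(Z)

print(probs)
print(probs.sum(axis=0))
```

The two columns differ only by a constant shift of 999, so they produce the same probabilities, and the large scores in the second column do not cause an overflow.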
Forward Pass
With our activation functions defined, we can now implement the forward propagation process, where the input data is passed through the network layer by layer to produce the final output. This step involves calculating the weighted sums of the inputs, applying activation functions, and saving the necessary values for backpropagation.
    def forward(self, X):
        save = {}
        A = X.T  # X is already flattened; transpose so each column is one sample
        # Hidden layers: linear step followed by sigmoid
        for layer in range(self.length - 1):
            Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
            A = self.sigmoid(Z)
            save["A" + str(layer + 1)] = A
            save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
            save["Z" + str(layer + 1)] = Z
        # Output layer: linear step followed by softmax
        Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
        A = self.softmax(Z)
        save["A" + str(self.length)] = A
        save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
        save["Z" + str(self.length)] = Z
        return A, save
By passing the input data through each layer, the network transforms the raw input into a meaningful output—probabilities that represent the likelihood of each class.
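To see how the shapes evolve through the layers, here is a standalone sketch of the same computation on a tiny random network. The sizes (5 samples, 4 features, 3 hidden units, 2 classes) and the random weights are assumptions chosen only to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def softmax(Z):
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

# Toy sizes (assumptions): 5 samples, 4 features, 3 hidden units, 2 classes.
X = rng.random((5, 4))
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))

A0 = X.T                 # (4, 5): one column per sample
Z1 = W1.dot(A0) + b1     # (3, 5)
A1 = sigmoid(Z1)         # (3, 5)
Z2 = W2.dot(A1) + b2     # (2, 5)
A2 = softmax(Z2)         # (2, 5): class probabilities per sample

print(A0.shape, Z1.shape, A1.shape, Z2.shape, A2.shape)
print(A2.sum(axis=0))
```

The number of columns stays equal to the number of samples throughout, while the number of rows follows the layer sizes. Each column of the final output sums to 1.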
Backward Pass
After completing the forward propagation and obtaining the network’s output, the next step is the backward pass (backpropagation). This is where we calculate the gradients of the cost function with respect to each parameter (weights and biases). The training loop then uses these gradients to update the parameters and reduce the prediction error.
    def backward(self, X, Y, save):
        gradients = {}
        save["A0"] = X.T  # treat the input as the activation of "layer 0"
        # Output layer: softmax combined with cross-entropy gives dZ = A - Y
        A = save["A" + str(self.length)]
        dZ = A - Y.T
        dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
        db = np.sum(dZ, axis=1, keepdims=True) / self.n
        dAPrev = save["W" + str(self.length)].T.dot(dZ)
        gradients["dW" + str(self.length)] = dW
        gradients["db" + str(self.length)] = db
        # Hidden layers, from the last one back to the first
        for layer in range(self.length - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
            dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
            db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
            if layer > 1:
                dAPrev = save["W" + str(layer)].T.dot(dZ)
            gradients["dW" + str(layer)] = dW
            gradients["db" + str(layer)] = db
        return gradients
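The backpropagation step implemented above is the core mechanism that allows a neural network to learn from data. By calculating how much each parameter (weight and bias) contributes to the overall error, it tells us how to adjust these parameters to reduce that error. For the output layer, combining softmax with the cross-entropy cost gives the simple gradient dZ = A - Y, which is why the code starts from A - Y.T.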
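One way to gain confidence in the output-layer gradient is a numerical check. The sketch below, for a single sample with made-up scores, compares the analytic expression `softmax(z) - y` against central finite differences of the cross-entropy loss. The function names and the input values are assumptions for illustration and are not part of the class.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    """Loss for one sample with scores z and one-hot target y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(z) - y  # the dZ used at the output layer

# Central finite differences, one score at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(analytic)
print(numeric)
print(np.max(np.abs(analytic - numeric)))
```

If the two vectors agree to several decimal places, the analytic gradient is very likely correct. The same idea can be extended to the weights and biases of a small network when debugging a backward pass.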
Training the Neural Network
Once we’ve set up the forward and backward propagation methods, the next step is to train the neural network. Training involves repeatedly passing the training data through the network, calculating the error, and then adjusting the network’s parameters to reduce this error.
    def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
        np.random.seed(1)
        self.n = X.shape[0]
        self.layers_size.insert(0, X.shape[1])  # prepend the input dimension
        self.initialize_parameters()  # defined in the full code listing below
        for loop in range(n_iterations):
            A, save = self.forward(X)
            # Average cross-entropy per sample, matching the 1/n scaling of the
            # gradients; 1e-8 guards against log(0)
            cost = -np.sum(Y * np.log(A.T + 1e-8)) / self.n
            gradients = self.backward(X, Y, save)
            # Gradient descent update for every layer
            for layer in range(1, self.length + 1):
                self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients["dW" + str(layer)]
                self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients["db" + str(layer)]
            if loop % 10 == 0:
                print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))
            self.costs.append(cost)
By repeating this process over many iterations, the network gradually learns to minimize the error, improving its ability to make accurate predictions.
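A useful sanity check on the cost is to know what value to expect before any learning has happened. The sketch below computes the per-sample cross-entropy (summed over classes, averaged over samples) for a uniform prediction over 10 classes and for a perfect prediction. The `mean_cross_entropy` function and the toy labels are assumptions for illustration.

```python
import numpy as np

def mean_cross_entropy(Y, A, eps=1e-8):
    """Y, A: (n_samples, n_classes) one-hot targets and predicted probabilities."""
    return -np.sum(Y * np.log(A + eps)) / Y.shape[0]

n_samples, n_classes = 4, 10
Y = np.eye(n_classes)[[3, 1, 4, 1]]                         # one-hot rows for labels 3, 1, 4, 1
uniform = np.full((n_samples, n_classes), 1.0 / n_classes)  # an untrained "no idea" prediction

cost_uniform = mean_cross_entropy(Y, uniform)
cost_perfect = mean_cross_entropy(Y, Y)

print(cost_uniform, np.log(n_classes))
print(cost_perfect)
```

A network that has learned nothing should therefore report a cost near ln(10), roughly 2.30, and the value should fall toward 0 as training progresses. Note that averaging over every entry of the matrix instead (for example with `np.mean`) divides this value by the number of classes, so the two conventions differ by a constant factor.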
Evaluating the Model
After training the neural network, the next step is to evaluate its performance on both the training and test datasets. Let’s implement two methods: predict, which makes predictions and calculates the accuracy of the model, and plot_cost, which lets us visualize the cost over the course of training.
    def predict(self, X, Y):
        A, _ = self.forward(X)
        y_hat = np.argmax(A, axis=0)  # predicted class per sample (columns of A)
        Y = np.argmax(Y, axis=1)      # true class per sample (rows of one-hot Y)
        accuracy = (y_hat == Y).mean()
        return accuracy * 100

    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()
By calculating the accuracy of the model on the training and test datasets, we can assess how well the network has learned and how effectively it can generalize to new data.
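The accuracy calculation can be traced by hand on a small example. In the sketch below, the probability matrix and the one-hot targets are made-up values; the layout matches the network's, with samples as columns of the output and as rows of the targets.

```python
import numpy as np

# Columns are samples, rows are classes, matching the network's output layout.
A = np.array([[0.7, 0.1, 0.2, 0.3],
              [0.2, 0.8, 0.3, 0.4],
              [0.1, 0.1, 0.5, 0.3]])
# One-hot targets with one row per sample: true labels 0, 1, 2, 0.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

y_hat = np.argmax(A, axis=0)   # predicted class per sample
y_true = np.argmax(Y, axis=1)  # true class per sample
accuracy = (y_hat == y_true).mean() * 100

print(y_hat, y_true, accuracy)
```

Three of the four predictions match the true labels, so the accuracy is 75%. The different `axis` arguments reflect the different layouts of the two arrays.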
Full Code Implementation
import numpy as np
import tensorflow as tf  # used only to download the data
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder


class NeuralNetwork:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.length = len(self.layers_size)  # number of layers, not counting the input
        self.n = 0
        self.costs = []

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))

    def softmax(self, Z):
        # Subtract each column's maximum for numerical stability
        expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
        return expZ / expZ.sum(axis=0, keepdims=True)

    def initialize_parameters(self):
        np.random.seed(1)
        for layer in range(1, len(self.layers_size)):
            # Random weights scaled by 1/sqrt(fan_in); biases start at zero
            self.parameters["W" + str(layer)] = np.random.randn(
                self.layers_size[layer], self.layers_size[layer - 1]
            ) / np.sqrt(self.layers_size[layer - 1])
            self.parameters["b" + str(layer)] = np.zeros((self.layers_size[layer], 1))

    def forward(self, X):
        save = {}
        A = X.T  # X is already flattened; transpose so each column is one sample
        for layer in range(self.length - 1):
            Z = self.parameters["W" + str(layer + 1)].dot(A) + self.parameters["b" + str(layer + 1)]
            A = self.sigmoid(Z)
            save["A" + str(layer + 1)] = A
            save["W" + str(layer + 1)] = self.parameters["W" + str(layer + 1)]
            save["Z" + str(layer + 1)] = Z
        Z = self.parameters["W" + str(self.length)].dot(A) + self.parameters["b" + str(self.length)]
        A = self.softmax(Z)
        save["A" + str(self.length)] = A
        save["W" + str(self.length)] = self.parameters["W" + str(self.length)]
        save["Z" + str(self.length)] = Z
        return A, save

    def sigmoid_derivative(self, Z):
        s = self.sigmoid(Z)
        return s * (1 - s)

    def backward(self, X, Y, save):
        gradients = {}
        save["A0"] = X.T
        # Output layer: softmax combined with cross-entropy gives dZ = A - Y
        A = save["A" + str(self.length)]
        dZ = A - Y.T
        dW = dZ.dot(save["A" + str(self.length - 1)].T) / self.n
        db = np.sum(dZ, axis=1, keepdims=True) / self.n
        dAPrev = save["W" + str(self.length)].T.dot(dZ)
        gradients["dW" + str(self.length)] = dW
        gradients["db" + str(self.length)] = db
        # Hidden layers, from the last one back to the first
        for layer in range(self.length - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(save["Z" + str(layer)])
            dW = 1. / self.n * dZ.dot(save["A" + str(layer - 1)].T)
            db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
            if layer > 1:
                dAPrev = save["W" + str(layer)].T.dot(dZ)
            gradients["dW" + str(layer)] = dW
            gradients["db" + str(layer)] = db
        return gradients

    def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
        np.random.seed(1)
        self.n = X.shape[0]
        self.layers_size.insert(0, X.shape[1])  # prepend the input dimension
        self.initialize_parameters()
        for loop in range(n_iterations):
            A, save = self.forward(X)
            # Average cross-entropy per sample; 1e-8 guards against log(0)
            cost = -np.sum(Y * np.log(A.T + 1e-8)) / self.n
            gradients = self.backward(X, Y, save)
            for layer in range(1, self.length + 1):
                self.parameters["W" + str(layer)] = self.parameters["W" + str(layer)] - learning_rate * gradients["dW" + str(layer)]
                self.parameters["b" + str(layer)] = self.parameters["b" + str(layer)] - learning_rate * gradients["db" + str(layer)]
            if loop % 10 == 0:
                print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))
            self.costs.append(cost)

    def predict(self, X, Y):
        A, _ = self.forward(X)
        y_hat = np.argmax(A, axis=0)
        Y = np.argmax(Y, axis=1)
        accuracy = (y_hat == Y).mean()
        return accuracy * 100

    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()


def pre_process_data(train_x, train_y, test_x, test_y):
    # Flatten the input images and scale pixels to [0, 1]
    train_x = train_x.reshape(train_x.shape[0], -1) / 255.
    test_x = test_x.reshape(test_x.shape[0], -1) / 255.
    # `sparse_output` replaced `sparse` in scikit-learn 1.2;
    # on older versions use `sparse=False` instead.
    enc = OneHotEncoder(sparse_output=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))
    return train_x, train_y, test_x, test_y


if __name__ == '__main__':
    (train_x, train_y), (test_x, test_y) = tf.keras.datasets.mnist.load_data()
    train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)
    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))
    dims_of_layer = [50, 10]  # one hidden layer with 50 units, 10 output classes
    model = NeuralNetwork(dims_of_layer)
    model.fit(train_x, train_y, learning_rate=0.1, n_iterations=100)
    print("Train Accuracy:", model.predict(train_x, train_y))
    print("Test Accuracy:", model.predict(test_x, test_y))
    model.plot_cost()
Conclusion
In this post, we explored the process of building a neural network from scratch to perform multi-class classification on the MNIST dataset. We started by preprocessing the data, defining the network architecture, and implementing key components such as forward and backward propagation. By training the network, we minimized the error and improved its ability to classify handwritten digits accurately.
We also implemented methods to evaluate the model’s performance and visualize the cost function, providing insights into the network’s learning process. Understanding these foundational concepts equips you with the tools to tackle more complex problems and refine your models for better accuracy and efficiency. If you have any questions, feel free to leave them in the comment section.