Udacity Deep Learning – The Hidden Layer

The homework of fullyconnected session require the students to:

Turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units nn.relu() and 1024 hidden nodes. This model should improve your validation / test accuracy.

The last block of code was neural network where is simply a network connecting input directly to the output, of course, softmax in the middle. And the change we need to make here is to instead of mapping all the inputs (784) to (10) outputs, we first need to create a layer which maps 784 to 1024 and then another layer to map 1024 to 10. Nothing fancy, but we simply need to add relu after the first Wx+b before passing on to the next layer as the activation function.

First, let’s take a look at how the old no hidden layer SGD looks like:

# Variables.
weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_labels])
biases = tf.Variable(

# Training computation.
logits = tf.matmul(tf_train_dataset, weights) + biases
loss = tf.reduce_mean(

# Optimizer.
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(logits)
valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Now, let’s build on top of this and see if we can architect a new network based on the homework requirement. Frankly speaking, the answer I am going to provide here was mostly inspired by this notebook or even more, the answer is only a fraction of what Mr. Damien offered in his code.

First of all, our network was like softmax(w * x + b) before. Now we are going to have one extra layer. The math will look like softmax(w_2 * h+b_2) where h = relu(w_1 * x + b_1). Or put everything in one:

softmax(w_2 * relu(w_1 * x + b_1) + b_2)

It is a bit complex, but not too much. we first need to redefine the variable weights and bias in a way where all of the four variables w_1, w_2, b_1, b_2 are included.

w_1 now is of the dimension (784, 1028) and b_1 has the size of (1, 1028)
w_2 is of dimension (1028, 10) and b_2 is of size (1, 10). Keep the right dimension in mind, we need to create four variables where two of the weights need to be initialized by normal distribution where the biases can be initialized by zeros, just like the old days.

After that, here comes to developer’s personal preference. You can put all the variables into a dictionary say “myvariables” where it stores all the weights and biases. Or you can create two variables “weights”  and store the two weight variables within that weights dictionary. And the same for biases. Or create one variable all separately and pass them around whenever you need them. First of all, all the design preferences will work in the end if you implement them properly. Damien’s code was following the second design type, which looks clean and tide. Here I am going to package all the variables in one dictionary just to be different.

# Variables
myvars = {
 'w_h': tf.Variable(tf.random_normal([n_input, n_hidden])),
 'b_h': tf.Variable(tf.random_normal([n_hidden])),
 'w_o': tf.Variable(tf.random_normal([n_hidden, num_labels])),
 'b_o': tf.Variable(tf.random_normal([num_labels]))

Now we are done initializing our variables, the next step is to build the model, or how we are going to predict using all the variables that we just defined. In the network without hidden layers, it was easy and actually a one liner

logits = tf.matmul(tf_train_dataset, weights) + biases

In our case, it will look like this:

layer_hidden = tf.add(
    tf.matmul(x, myvars['w_h']), 
logits = tf.matmul(
) + myvars['b_o']

Theoretically you can put everything into one liner to make the statement that “it is still a one liner”, but I highly suggest we at least break them down into components that are more readable. In this case, breaking them down into layers is probably a good idea.

Actually, this block of code will be reused quite a few times other than defining the logits, you will use the model when building accuracy, testing against validation, test, ..etc. So let’s build them into a function:

def model(x, myvars):
    layer_hidden = tf.add(tf.matmul(x, myvars['w_h']), myvars['b_h'])
    logits = tf.matmul(tf.nn.relu(layer_hidden), myvars['w_o']) + myvars['b_o']
    return logits

The definition of loss function and even optimizer is totally independent of how the neural internally looks like internally:

pred = multilayer_perceptron(tf_train_dataset, myvars)
loss = tf.reduce_mean(
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

After all, we need to define the accuracy against training, validation and testing. Since we have packaged our model into one function, our implementation for a network even with one hidden layer is even cleaned than the homework itself. This is how it looks like right now:

train_prediction = tf.nn.softmax(pred)
valid_prediction = tf.nn.softmax(model(valid_dataset, myvars))
test_prediction  = tf.nn.softmax(model(test_dataset,myvars))

Now we have our new model. And the way we structured our code now, you only need to modify model function and add more variables to myvars even want to add more layers. Everything else should stay exactly the same. When we run our code, the accuracy got improved quite a bit, by about 10% (from 80% to 90%). And the performance difference between GPU and CPU is really substantial. Attached are two screenshots to demonstrate the performance difference between GPU (4 secs) and CPU (37 seconds).

This slideshow requires JavaScript.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s