Just a question, beforehand. Who has done some work with neural networks before? Oh, wow. OK. Quite a few people. So feel free to help me and I hope this will not be too basic for you and I hope it will at least be a good introduction to TensorFlow. But if you have never done anything with neural networks, that's fine and I will explain everything from the start. So this is the simplest possible neural network we can imagine to recognize our hand-written digits. So the digits, they come as 28 by 28 pixel images and the first thing we do is that we flatten all those pixels into one big vector of pixels and these will be our inputs. Now, we will use exactly 10 neurons. The neurons are the white circles. What a neuron does is always the same thing. A neuron does a weighted sum of all of its inputs, here the pixels. It adds another constant that is called a bias. That's just an additional degree of freedom. And then it will feed this sum through an activation function. And that is just a function– number in, transform, number out.
We will see several of those activation functions and the one thing they have in common in neural networks is that they are non-linear. So why 10 neurons? Well, simply because we are classifying those digits in 10 categories. We are trying to recognize a zero, a one, a two, on to the nine. So what we are hoping for here is that one of those neurons will light up and tell us, with a very strong output, that I have recognized here an eight. All right. And for that, since this is a classification problem, we are going to use a very specific activation function, one that, well, researchers tell us works really well on classification problems. It's called softmax and it's simply an exponential normalized. So what you do is that you make all those weighted sums, then you elevate that to the exponential. And once you have your 10 exponentials, you compute the norm of this vector and divide it by its norm so that you get values between zero and one. And those values, you will be able to interpret them as probabilities, probabilities of this being an eight, a one, or something else.
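As a rough illustration of the softmax just described, here is a minimal sketch in plain Python. The ten input values are made up for the example, and subtracting the maximum before exponentiating is a common numerical-stability habit that was not mentioned in the talk; it does not change the result.

```python
import math

def softmax(weighted_sums):
    """Exponentiate each value, then divide by the sum of the exponentials."""
    # Assumption: shifting by the max is only for numerical stability.
    shift = max(weighted_sums)
    exps = [math.exp(v - shift) for v in weighted_sums]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weighted sums for the 10 neurons (digits 0 to 9).
sums = [1.0, 0.5, 0.2, 2.0, 0.1, 0.3, 1.5, 0.7, 6.0, 1.2]
probs = softmax(sums)

# The index of the largest probability is the recognized digit.
predicted_digit = probs.index(max(probs))
```

With these made-up inputs, the neuron at index 8 ends up with most of the probability mass while the others stay small but non-zero.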
You will be asking which norm? Any norm, doesn't matter, the length of the vector. You pick your favorite norm. There are several. Usually, for softmax, we use L1, but L2, which is the Euclidean norm, would work just as well. So what does softmax do actually? You see, it's an exponential so it's a very steeply increasing function. It will pull the data apart, increase the differences, and when you divide all of that, when you normalize the whole vector, you usually end up with one of the values being very close to one and all the other values being very close to zero. So it's a way of pulling the winner out on top without actually destroying the information. All right. So now we need to formalize this using a matrix multiply. I will remind you of what a matrix multiply is, but we will do it not for one image, we are going to do this for a batch of 100 images at a time. So what we have here in my matrix is 100 images, one image per line. The images are flattened, all the pixels on one line.
So I take my matrix of weights, for the time being, I don't know what these weights are, it's just weights so I'm doing weighted sums. And I start the matrix multiplication. So I do a weighted sum of all the pixels of the first image. Here it is. And then if I continue this matrix multiply using the second column of weights, I get a weighted sum of all the pixels of the first image for the second neuron and then for the third neuron and the fourth and so on. What is left is to add the biases, just an additional constant. Again, we don't know what it is for the time being. And there is one bias per neuron, that's why we have 10 biases. And now if I continue this matrix multiply, I'm going to obtain these weighted sums for the second image, and the third image, and so on, until I have processed all my images. I would like to write this as a simple formula there. You see there is a problem, x times w, you know that's a matrix of 10 columns by 100 images, and I have only 10 biases.
I can't simply add them together. Well, never mind. We will redefine addition and it's OK if everybody accepts it. And actually, people have already accepted it. It's called a broadcasting add and that's the way you do additions in NumPy, for instance, which is the numerical library for Python. The way a broadcasting add works is that if you're trying to add two things which don't match, not the same dimensions, you can't do the addition, you try to replicate the small one as much as needed to make the sizes match and then you do the addition. That's exactly what we need to do here. We have only those 10 biases. So it's the same biases on all the lines. We just need to replicate this bias vector on all the lines, and that's exactly what this generalized broadcasting add does. So we will just write it as a plus. And this is where I wanted to get to. I want you to remember this as the formula describing one layer in a neural network. So let's go through this again.
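A small NumPy sketch of the broadcasting add described above. The array contents are placeholders chosen for the example; only the shapes (100 lines of 10 weighted sums, and 10 biases) come from the talk.

```python
import numpy as np

# Stand-in for x times w: 100 images, 10 weighted sums each.
weighted_sums = np.ones((100, 10))

# One bias per neuron, shape (10,). Values are arbitrary.
biases = np.arange(10, dtype=float)

# Shapes (100, 10) and (10,) do not match, so NumPy replicates
# the bias vector on every line before adding.
result = weighted_sums + biases
```

Every line of `result` receives the same 10 biases, which is the behavior the formula relies on.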
In x, we have a batch of images, 100 images, all the pixels on one line. In w, we have all of our weights for the 10 neurons, all the weights in the system. x times w are all our weighted sums. We add the biases, and then we feed this through our activation function, in this case softmax, and the way it works is line by line. Line by line, we take the 10 values, elevate them to the exponential, normalize the line. Next line, 10 values, elevate them to the exponential, normalize the line, and so on. So what we get in the output is, for each image, 10 values which look like probabilities and which are our predictions. So, of course, we still don't know what those weights and biases are and that's where the trick is in neural networks. We are going to train this neural network to actually figure out the correct weights and biases by itself. Well, this is how we write this in TensorFlow. You see, not very different. OK. TensorFlow has this nn library for neural networks which has all sorts of very useful functions for neural networks, for example, softmax and so on.
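The one-layer formula, softmax of x times w plus the biases, can be sketched in NumPy as follows. This is not the TensorFlow code from the slide; the random inputs, the weight scale, and the helper name `softmax_rows` are assumptions made for the example.

```python
import numpy as np

def softmax_rows(z):
    """Apply softmax line by line: exponentiate, then normalize each line."""
    # Assumption: subtracting the per-line max is only for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.random((100, 784))                   # 100 flattened 28x28 images (random stand-ins)
W = rng.normal(scale=0.01, size=(784, 10))   # one column of weights per neuron
b = np.zeros(10)                             # one bias per neuron

# Weighted sums, broadcasting add of the biases, then softmax on each line.
Y = softmax_rows(X @ W + b)
```

`Y` has one line per image and 10 values per line, each line summing to one.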
So let's go train. When you train, you've got images, but you know what those images are. So you initialize your weights and biases at random values and your network will output some probabilities. Since you know what this image is, you can tell it that it's not this, it should be that. So that is called a one-hot encoded vector. It's a not very fancy way of encoding numbers. Basically, here are our numbers from zero to nine. We encode them as 10 bits, all at zero and just one of them is a one at the index of the number we want to encode. Here is a six. Why? Well, because then, it's in the same shape as our predictions and we can compute a distance between those two. So again, many ways of computing distances. The Euclidean distance, the normal distance, sum of differences squared would work, not a problem. But scientists tell us that for classification problems, this distance, the cross entropy, works slightly better. So we'll use this one. How does it work?
It's the sum across the vectors of the values on the top multiplied by the logarithms of the values on the bottom, and then we add a minus sign because all the values on the bottom are less than one, so all the logarithms are negative. So that's the distance. And of course, we will tell the system to minimize the distance between what it thinks is the truth and what we know to be true. So this we will call our error function and the training will be guided by an effort to minimize the error function. So let's see how this works in practice. So in this little visualization, I'm showing you over there, my training images. You see it's training so you see these batches of 100 training images being fed into the system. On the white background, you have the images that have been already correctly recognized by the system. On a red background, images that are still missed. So then, on the middle graph, you see our error function, computed both on the training dataset and we also kept aside a set of images which we have never seen during training for testing.
Of course, if you want to test the real world performance of your neural network, you have to do this on a set of images which you have never seen during training. So here we have 60,000 training images and I set aside 10,000 test images which you see in the bottom graph over there. They are a bit small. You see only 1,000 of them here. So imagine, there are nine more screens of pictures like that. But I sorted all the badly recognized ones at the top. So you see all the ones that have been badly recognized and below are nine screens of correctly recognized images, here after 2,000 rounds of training. So there is a little scale on the side here. It shows you that it's already capable of recognizing 92% of our images with this very simple model, just 10 neurons, nothing else. And that's what you get on the top graph, the accuracy graph, as well. That's simply the percentage of correctly recognized images, both on test and training data. So what else do we have? We have our weights and biases, those two diagrams are simply percentiles, so it shows you the spread of all the weights and biases.
And that's just useful to see that they are moving. They both started at zero and they took some values, between one and minus one for the weights, between two and minus two for the biases. It's helpful to keep an eye on those diagrams and see that we are not diverging completely. So that's the training algorithm. You give it training images, it gives you a prediction, you compute the distance between the prediction and what you know to be true. You use that distance as an error function to guide a mechanism that will drive the error down by modifying weights and biases. So now let's write this in TensorFlow. And I'll get more explicit about exactly how this training works. So we need to write this in TensorFlow. The first thing you do in TensorFlow is define variables and placeholders. A variable is a degree of freedom of our system, something we are asking TensorFlow to compute for us through training. So in our case, those are our weights and biases. And we will need to feed in training data.
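To make the one-hot encoding and the cross entropy concrete, here is a small sketch in plain Python. The two prediction vectors are invented for the example, and the helper names are not from the talk.

```python
import math

def one_hot(digit, num_classes=10):
    """Ten values, all zero except a one at the index of the digit."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]

def cross_entropy(label, prediction):
    """Minus the sum of label values times the log of predicted values."""
    # Terms where the label is zero contribute nothing, so they are skipped;
    # this also avoids taking the log of a zero prediction.
    return -sum(t * math.log(p) for t, p in zip(label, prediction) if t > 0)

label = one_hot(6)

# Hypothetical predictions: one confident and right, one completely undecided.
confident = [0.01] * 10
confident[6] = 0.91
undecided = [0.1] * 10

error_confident = cross_entropy(label, confident)
error_undecided = cross_entropy(label, undecided)
```

The confident, correct prediction gives a much smaller error than the undecided one, which is what the training tries to exploit.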
So for this data that will be fed in at training time, we define a placeholder. You see here x is a placeholder for our training images. Let's look at the shape in brackets. What you have is the shape of this multidimensional matrix, which we call a tensor. So the first dimension is none. It says I don't know yet so this will be the number of images in a batch. This will be determined at training time. If we give 100 images, this will be 100. Then 28 by 28 is the size of our images and one is the number of values per pixel. So that's not useful at all because we are handling grayscale images. I just put it there. In case you wanted to handle color images, that would be three values per pixel. So OK. We have our placeholders, we have our variables, now we are ready to write our model. So that line you see on the top is our model. It's what we have determined to be the line representing one layer of a neural network. The only change is that reshape operation. You remember, our images, they come in as 28 by 28 pixel images and we want to flatten them as one big vector of pixels.
So that's what reshape does. 784 is 28 by 28. It's all the pixels in one line. All right. I need a second placeholder for the known answers, the labels of my training images, labels like this is a one, this is a zero, this is a seven, this is a five. And now that I have my predictions and my known labels, I'm ready to compute my error function, which is the cross entropy using the formula we've seen before. So the sum across the vector of the elements of the labels multiplied by elements of the logarithm of the predictions. So now I have my error function. What do I do with it? What you have on the bottom, I won't go into that. That is simply the computation of the percentage of correctly recognized images. You can skip that. OK. Now we get to the actual heart of what TensorFlow will do for you. So we have our error function. We pick an optimizer. There is a full library of them. They have different characteristics. And we ask the optimizer to minimize our error function.
So what is this going to do? When you do this, TensorFlow takes your error function and computes the partial derivatives of that error function relative to all the weights and all the biases in the system. That's a big vector because there are lots of weights and lots of biases. How many? w, the weights, is a variable of almost 8,000 values. So this vector we get mathematically is called a gradient. And the gradient has one nice property. Who knows what is the nice property of the gradient? It points... Yeah. Almost. It points up, we add a minus sign, it points down, exactly. Down in which space? We are in the space of all the weights and all the biases and the function we are computing is our error function. So when we say down in this space, it means it gives us a direction in the space of weights and biases into which to go to modify our weights and biases in order to make our error function smaller. So that is the training. You compute this gradient and it gives you an arrow.
You take a little step along this arrow. Well, you are in the space of weights and biases, so taking a little step means you modify your weights and biases by this little delta, and you get into a location where the error is now smaller. Well, that's fantastic. That's exactly what you want. Then you repeat this using a second batch of training images. And again, using a third batch of training images, and so on. So it's called gradient descent because you follow the gradient to head down. And so we are ready to write our training loop. There is one more thing I need to explain to you about TensorFlow. TensorFlow has a deferred execution model. So everything we wrote up to now, all the tf dot something here commands, does not actually– when that is executed, it doesn't produce values. It builds a graph, a computation graph, in memory. Why is that important? Well, first of all, this derivation trick here, the computation of the gradient, that is actually a formal derivation.
TensorFlow takes the formula that you give it to define your error function and does a formal derivation on it. So it needs to know the full graph of how you computed this to do this formal derivation. And the second thing it will use this graph for is that TensorFlow is built for distributed computing. And there, as well, to distribute a graph on multiple machines, it helps to know what the graph is. OK. So this is all very useful, but it means for us that we have to go through an additional loop to actually get values from our computations. The way you do this in TensorFlow is that you define a session and then in the session, you call sess.run on one edge of your computation graph. That will give you actual values, but of course, for this to work, you have to fill in all the placeholders that you have defined now with real values. So for this to work, I will need to fill in the training images and the training labels for which I have defined placeholders. And the syntax is simply the train_data dictionary there.
You see the keys of the dictionary, x and y underscore, are the placeholders that I have defined. And then I can sess.run on my training step. I pass in this training data and that is where the actual magic happens. Just a reminder, what is this training step? Well it's what you got when you asked the optimizer to minimize your error function. So the training step, when executed, is actually what computes this gradient using the current batch of images, training images and labels, and follows it a little to modify the weights and biases and end up with better weights and biases. I said a little. I come back to this. What is that learning rate over there? Well, I can't make a big step along the gradient. Why not? Imagine you're in the mountains, you know where down is. We have senses for that. We don't need to derive anything. We know where down is. And you want to reach the bottom of the valley. Now, if every step you make is a 10 mile step, you will probably be jumping from one side of the valley to the other without ever reaching the bottom.
So if you want to reach the bottom, even if you know where down is, you have to make small steps in that direction, and then you will reach the bottom. So the same here, when we compute this gradient, we multiply it by this very small value so as to take small steps and be sure that we are not jumping from one side of the valley to the other. All right. So let's finish our training. Basically, in a loop, we load a batch of 100 training images and labels. We run this training step which adjusts our weights and biases. And we repeat. All the rest of the stuff on the bottom, it's just for display. I'm computing the accuracy and the cross entropy on my training data and again, on my test data so that I can show you four curves over there. It is just for display. It has nothing to do with the training itself. All right. So that was it. That's the entire code here on one slide. Let's go through this again. At the beginning, you define variables for everything that you want TensorFlow to compute for you.
So here are our weights and biases. You define placeholders for everything that you will be feeding during the training, namely our images and our training labels. Then you define your model. Your model gives you predictions. You can compare those predictions with your known labels, compute the distance between the two, which is the cross entropy here, and use that as an error function. So you pick an optimizer and you ask the optimizer to minimize your error function. That gives all the gradients and all that, it gives you a training step. And now, in a loop, you load a batch of images and labels, you run your training step, and you do this in a loop, hoping this will converge, and usually it does. You see here, it did converge and with this approach, we got 92% accuracy. Small recap of all the ingredients we put in our pot so far. We have a softmax activation function. We have the cross entropy as an error function. And we did this mini batching thing where we train on 100 images at a time, do one step, and then load another batch of images.
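The valley analogy can be tried on a toy problem. The sketch below runs gradient descent on a made-up one-variable error function, error(w) = w squared, which stands in for the real cross entropy; the starting point, step counts, and learning rates are arbitrary choices for the example.

```python
def descend(learning_rate, steps=20, start=5.0):
    """Gradient descent on error(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w
        # Step against the gradient, scaled by the learning rate.
        w = w - learning_rate * gradient
    return w

# Small steps walk down to the bottom of the valley (w close to 0).
small_steps = descend(learning_rate=0.1)

# Steps that are too big jump from one side of the valley to the other
# and end up further from the bottom than where they started.
huge_steps = descend(learning_rate=1.1)
```

The same gradient, scaled by two different learning rates, either converges towards the minimum or bounces away from it.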
So is 92% accuracy good? No, it's horrible. Imagine you're actually using this in production. I don't know, in the post office, you're decoding zip codes. 92%: out of 100 digits, you have eight bad values? No, not usable in production. Forget it. So how do we fix it? Well, deep learning. We'll go deep. We can just stack those layers. How do we do that? Well, it's very simple. Look at the top layer of neurons. It does what we just did. It computes weighted sums of pixels. But we can just as easily add a second layer of neurons that will compute weighted sums of all the outputs of the first layer. And that's how you stack layers to produce a deep neural network. Now we are going to change our activation function. We keep softmax for the output layer because softmax has these nice properties of pulling a winner apart and producing numbers between zero and one. But for the rest, we use a very classical activation function. In neural networks, it's called the sigmoid, and it's basically the simplest possible continuous function that goes from zero to one.
OK. All right. Let's write this model. So we have now one set of weights and one set of biases per layer. That's why we have five pairs here. And our model will actually look very familiar to you. Look at the first line. It's exactly what we have seen before for one layer of a neural network. Now what we do with the output, Y1, is that we use it as the input in the second line, and so on, we chain those. It's just that on the last line, the activation function we use is the softmax. So that's all the changes we did. And we can try to run this again. So oops. This one. Run. Run. Run. And it's coming. Well, I don't like this slope here. It should be shooting up really sharp. It's a bit slow. Actually, I have a solution for that. I lied to you when I said that the sigmoid was the most widely used activation function. That was true in the past, and today, people invented a new activation function, which is called the relu, and this is a relu. It's even simpler.
It's just zero for all negative values and identity for all positive values. Now this actually works better. It has lots of advantages. Why does it work better? We don't know. People tried it, it worked better. [LAUGHTER] I'm being honest here. If you had a researcher here, he would fill your head with equations and prove it, but he would have done those equations after the fact. People already tried it, it worked better. Actually, they got inspiration from biology. It is said, I don't know if it is true, but I heard that the sigmoid was the preferred model of biologists for our actual biological neurons and that today, biologists think that neurons in our head work more like this. And the guys in computer science got inspiration from that, tried it, works better. How much better? Well, this is just the beginning of the training. This is what we get with our sigmoids, just 300 iterations, so just the beginning. And this is what we get from relus. Well, I prefer this. The accuracy shoots up really sharp.
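For reference, the two activation functions mentioned here can be written in a few lines of plain Python. The sigmoid formula used is the standard logistic function, which the talk describes only informally as a continuous function going from zero to one.

```python
import math

def sigmoid(x):
    """Standard logistic function: smooth, always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Zero for all negative values, identity for all positive values."""
    return max(0.0, x)

# A few sample inputs to compare the two functions side by side.
samples = [-10.0, -1.0, 0.0, 1.0, 10.0]
sigmoid_outputs = [sigmoid(x) for x in samples]
relu_outputs = [relu(x) for x in samples]
```

The sigmoid flattens out at both ends, while the relu keeps growing for positive inputs.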
The cross entropy goes down really sharp. It's much faster. And actually, here on this very simple problem, the sigmoid would have recovered, it's not an issue, but in very deep networks, sometimes with the sigmoid, you don't converge at all. And the relu solves that problem to some extent. So the relu it is for most of our issues. OK. So now let's train. Let's do this for 10,000 iterations, five layers, look at that. 98% accuracy. First of all, oh, yeah. We went from 92 to 98 just by adding layers. That's fantastic. But look at those curves. They're all messy. What is all this noise? Well, when you see noise like that, it means that you are going too fast. You're actually jumping from one side of the valley to the other, without actually reaching the bottom of your error function. So we have a solution for that, but it's not just to go slower, because then you would spend 10 times more time training. The solution, actually, is to start fast and then slow down as you train.
It's called learning rate decay. We usually decay the learning rate on an exponential curve. So yes, I hear you. It sounds very simple, just this little trick, but let me play you the video of what this does. It's actually quite spectacular. So it's almost there. I should have the end of it on a slide. Yeah, that's it. So this is what we had using a fixed learning rate and just by switching to a decaying learning rate, look, it's spectacular. All the noise is gone. And just with this little trick, really, this is not rocket science, it's just going slightly slower towards the end and all the noise is gone. And look at the blue curve, the training accuracy curve. Towards the end, it's stuck at 100%. So here, for the first time, we built a neural network that was capable of learning all of our training set perfectly. It doesn't make one single mistake in the entire training set, which doesn't mean that it's perfect in the real world.
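One possible shape for an exponentially decaying learning rate is sketched below in plain Python. The starting rate, the floor, and the decay speed are illustrative assumptions; the talk does not give the actual values or formula at this point.

```python
import math

def decayed_learning_rate(step, max_rate=0.003, min_rate=0.0001, decay_steps=2000.0):
    """Start at max_rate and decay exponentially towards min_rate.

    All three constants are hypothetical values chosen for the example.
    """
    return min_rate + (max_rate - min_rate) * math.exp(-step / decay_steps)

# Learning rate at a few points of a 10,000 iteration training run.
schedule = [decayed_learning_rate(step) for step in (0, 1000, 5000, 10000)]
```

The rate starts fast and shrinks smoothly, so the early iterations make quick progress and the late ones take small, careful steps.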
As you see on the test dataset, it has a 98% accuracy. But, well, it's something. We got 100% at least on the training. All right. So we still have something that is a bit bizarre. Look at those two curves. This is our error function. So the blue curve, the training error function, that is what we minimize. OK? So as expected, it goes down. And the error function computed on our test data at the beginning, well, it follows. That's quite nice. And then it disconnects. So this is not completely unexpected, you know. We are minimizing the training error function. That's what we are actively minimizing. We are not doing anything at all on the test side. It's just a byproduct of the way neural networks work that the training you do on your training data actually carries over to your test data, to the real world. Well, it carries over or it doesn't. So as you see here, until some point, it does and then, there is a disconnect, it doesn't carry over anymore. You keep optimizing the error on the training data, but it has no positive effect on the test performance, the real world performance, anymore.
So if you see curves like this, you take the textbook, you look it up, it's called overfitting. You look at the solutions, they tell you overfitting, you need regularization. OK. Let's regularize. What regularization options do we have? My preferred one is called dropout. It's quite dramatic. You shoot the neurons. No, really. So this is how it works. You take your neural network, and pick a probability, let's say 50%. So at each training iteration, you will shoot, physically remove from the network, 50% of your neurons. Do the pass, then put them back, next iteration, again, randomly shoot 50% of your neurons. Of course, when you test, you don't test with a half brain dead neural network, you put all the neurons back. But that's what you do for training. So in TensorFlow, there is a very simple function to do that, which is called dropout, that you apply at the outputs of the layer. And what it simply does is it takes the probability and in the output of that layer, it will replace randomly some values by zeros and, small technicality, it will actually boost the remaining values proportionally so that the average stays constant, that's a technicality.
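The dropout behavior described above, random zeros plus a proportional boost of the survivors, can be sketched in plain Python. This is not TensorFlow's implementation; the function name, the `keep_prob` parameter, and the seeded generator are choices made for this example.

```python
import random

def dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob, boost the rest.

    Survivors are divided by keep_prob so the average stays about constant.
    """
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)        # seeded only to make the example repeatable
outputs = [1.0] * 10000       # hypothetical layer outputs, all equal to one
dropped = dropout(outputs, keep_prob=0.5, rng=rng)

average_after = sum(dropped) / len(dropped)
```

With a 50% keep probability, about half of the values become zero and the others are doubled, so the average stays close to one. At test time you would simply skip this function and use all the outputs.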
So why does shooting neurons help? Well, first of all, let's see if it helps. So let's try to recap all the tricks we tried to play with our neural network. This is what we had initially with our five layers using the sigmoid as an activation function. The accuracy got up to 97.9% using five layers. So first, we replaced the sigmoid by the relu activation function. You see, it's faster to converge at the beginning and we actually gained a couple of fractions of percentage of accuracy. But we have these messy curves. So we train slower using the exponential learning rate decay and we get rid of the noise, and now we are stable or above 98% accuracy. But we have that weird disconnect between the error on our test data and the error on our training data. So let us try to add dropout. This is what you get with dropout. And actually, the cross entropy function, the test cross entropy function, the red one over there on the right, has been largely brought under control. You see, there is still some disconnect, but it's not shooting up as it was before.
That's very positive. Let's look at the accuracy. No improvement. Actually, I'm even amazed that it hasn't gone down seeing how brutal this technique is, you shoot neurons while you train. But here, I was very hopeful to get it up. No, nothing. We have to keep digging. So what is really overfitting? Let's go beyond the simple recipe in the textbook. Overfitting, in a neural network, is primarily when you give it too many degrees of freedom. Imagine you have so many neurons and so many weights in a neural network that it's somehow feasible to simply store all the training images in those weights and variables. You have enough room for that. And the neural network could figure out some cheap trick to pattern match the training images in what it has stored and just perfectly recognize your training images because it has stored copies of all of them. Well, if it has enough space to do that, that would not translate to any kind of recognition performance in the real world.
And that's the trick about neural networks. You have to constrain their degrees of freedom to force them to generalize. And mostly, when you get overfitting, it's because you have too many neurons. You need to get that number down to force the network to produce generalizations that will then produce good predictions, even in the real world. So either you get the number of neurons down or you apply some trick, like dropout, that is supposed to mitigate the consequences of too many degrees of freedom. The flip side of too many neurons is a very small dataset: even if you have only a small number of neurons, if the training dataset is very small, the network can still fit it all in. So that's a general truth in neural networks. You need big datasets for training. And then what happened here? We have a big dataset, 60,000 digits, that's enough. We know that we don't have too many neurons because we added five layers, that's a bit overkill, but I tried, I promise, with four and three and two.
And we tried dropout which is supposed to mitigate the fact that you have too many neurons. And it didn't do anything to the accuracy. So the conclusion here that we come to is that our network, the way it is built, is inadequate. It's not capable by its architecture to extract the necessary information from our data. And maybe someone here can pinpoint something really stupid we did at the beginning. Someone has an idea? Remember, we have images? Images with shapes like curves and lines. And we flattened all the pixels in one big vector. So all that shape information is lost. This is terrible. That's why we are performing so badly. We lost all of the shape information. So what is the solution? Well, people have invented a different type of neural networks to handle specifically images and problems where shape is important. It's called convolutional networks. Here we go back to the general case of an image, of a color image. So that's why it has red, green, and blue components.
And in a convolutional network, one neuron will still be doing weighted sums of pixels, but only of a small patch of pixels above its head, only a small patch. And the next neuron would, again, be doing a weighted sum of the small patch of pixels above itself, but using the same weights. OK? That's the fundamental difference from what we have seen before. The second neuron is using the same weights as the first neuron. So we are actually taking just one set of weights and we are scanning the image in both directions, using that set of weights and producing weighted sums. So we scan it in both directions and we obtain one layer of weighted sums. So how many weights do we have? Well, as many weights as we have input values in that little highlighted cube, that's 4 times 4 times 3, which is 48. What? 48? We had 8,000 degrees of freedom in our simplest network with just 10 neurons. How can it work with such a drastic reduction in the number of weights? Well, it won't work. We need more degrees of freedom.
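The two weight counts being compared here are simple arithmetic, spelled out below in Python. The variable names are made up for the example; the numbers come from the talk (784 pixels feeding 10 neurons, and a 4 by 4 patch over 3 color channels).

```python
# Fully connected layer from earlier: every one of the 28*28 pixels
# has its own weight for each of the 10 neurons.
dense_weights = 28 * 28 * 10

# Convolutional case: one shared set of weights covering a 4x4 patch
# of a color image with 3 values per pixel.
patch_weights = 4 * 4 * 3

reduction_factor = dense_weights / patch_weights
```

So the "almost 8,000" weights of the simplest network are exactly 7,840, against 48 for a single shared patch.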
How do we do that? Well, we pick a second set of weights and do this again. And we obtain the second, let's call it a channel, of values using different weights. Now since those are multi-dimensional matrices, it's fairly easy to write those two matrices as one by simply adding a dimension of size two because we have two sets of values. And this here will be the shape of the weights matrix for one convolutional layer in a neural network. Now, we still have one problem left which is that we need to bring the amount of information down. At the end, we still want only 10 outputs with our 10 probabilities to recognize what this number is. So traditionally, this was achieved by what we call a subsampling layer. I think it's quite useful to understand how this works because it gives you a good feeling for what this network is doing. So basically, we were scanning the image using a set of weights and during training, these weights will actually specialize in some kind of shape recognizer.
There will be some weights that will become very sensitive to horizontal lines and some weights that will become very sensitive to vertical lines, and so on. So basically, when you scan the image, if you simplify, you get an output which is mostly I've seen nothing, I've seen nothing, I've seen nothing, oh, I've seen something, I've seen nothing, I've seen nothing, oh, I've seen something. The subsampling basically takes four of those outputs, two by two, and it takes the maximum value. So it retains the biggest signal of I've seen something and passes that down to the layer below. But actually, there's a much simpler way of condensing information. What if we simply play with the stride of the convolution? Instead of scanning the image pixel by pixel, we scan it every two pixels, we jump by two pixels between each weighted sum. Well, mechanically, instead of obtaining 28 by 28 output values, we obtain only 14 by 14 output values. So we have condensed our information.
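Both ways of condensing information can be sketched in plain Python. The function names are illustrative, and the output-size formula assumes the scan is padded so that every input position is covered, which is what takes 28 down to exactly 14.

```python
import math

def max_pool_2x2(plane):
    """Keep the largest value in each non-overlapping 2 x 2 block."""
    rows, cols = len(plane), len(plane[0])
    return [
        [max(plane[r][c], plane[r][c + 1], plane[r + 1][c], plane[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

def strided_output_size(input_size, stride):
    """Number of weighted sums per row when jumping `stride` pixels (padded scan)."""
    return math.ceil(input_size / stride)

# Mostly "I've seen nothing", with two "I've seen something" signals.
signals = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 7, 0],
]
print(max_pool_2x2(signals))          # [[9, 0], [0, 7]]
print(strided_output_size(28, 2))     # 14
print(strided_output_size(14, 2))     # 7
```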
And today, and I'm not saying this is better, it's just simpler, people who build convolutional networks mostly just use convolutional layers and play with the stride to condense the information. You don't need, in this way, to have these subsampling layers. So this is the network that I would like to build with you. Let's go through it. There is a first convolutional layer that uses patches of five by five. I'm reading through the W1 tensor. And we have seen that in this shape, the first two digits are the size of the patch. The third digit is the number of channels it's reading from the input. So here I'm back to my real example. This is a grayscale image. It has one value per pixel. So I'm reading one channel of information. And I will be applying four of those patches to my image. So I obtain four channels of output values. OK? Now, the second convolutional layer. This time, my stride is two. So here, my outputs become planes of 14 by 14 values.
So let's go through it. I'm applying patches of four by four. I'm reading in four channels of values because that's what I output in the first layer. And this time, I will be using eight different patches, so I will actually produce eight different channels of weighted sums. Next layer, again, a stride of two. That's why I'm getting down from 14 by 14 to seven by seven. Patches of four by four, reading in eight channels of values because that's what I had in the previous layer, and outputting 12 channels of values this time because I used 12 different patches. And now I apply a fully connected layer. So the kind of layer we've seen before. OK? This fully connected layer, remember, the difference is that in this one, each neuron does a weighted sum of all the values in the little cube of values above, not just a patch, all the values. And the next neuron in the fully connected layer does, again, a weighted sum of all the values using its own weights. It's not sharing weights.
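The shapes in the convolutional part of this walkthrough can be traced with a short loop. This is a sketch under two assumptions: the first layer uses a stride of one (implied by its 28 by 28 output), and the scan is padded so the output size is the input size divided by the stride, rounded up.

```python
import math

# (patch size, input channels, output channels, stride) for each convolutional layer:
# 5x5 reading 1 channel into 4, then 4x4 reading 4 into 8, then 4x4 reading 8 into 12.
conv_layers = [
    (5, 1, 4, 1),
    (4, 4, 8, 2),
    (4, 8, 12, 2),
]

size = 28        # input images are 28 x 28
channels = 1     # grayscale: one value per pixel
shapes = [(size, size, channels)]
for patch, in_ch, out_ch, stride in conv_layers:
    assert in_ch == channels, "each layer must read what the previous one wrote"
    size = math.ceil(size / stride)    # padded scan: only the stride changes the size
    channels = out_ch
    shapes.append((size, size, channels))

flattened = size * size * channels     # values each fully connected neuron will see

print(shapes)       # [(28, 28, 1), (28, 28, 4), (14, 14, 8), (7, 7, 12)]
print(flattened)    # 588
```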
That's the normal neural network layer as we have seen before. And finally, I apply my softmax layer with my 10 outputs. All right. So can we write this in TensorFlow? Well, we need one set of weights and biases for each layer. The only difference is that for the convolutional layers, our weights will have this specific shape that we have seen before. So two numbers for the filter size, one number for the number of input channels, and one number for the number of patches, which corresponds to the number of output channels that you produce. For our normal layers, we have the weights and biases defined as before. And so you see this truncated normal thingy up there? That's just random. OK? It's a complicated way of saying random. So we initialize those weights to random values, initially. And now this is what our model will look like. So TensorFlow has this helpful conv2d function. If you give it the weights matrix and a batch of images, it will scan them in both directions.
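For the curious, the "truncated normal" initializer mentioned above can be imitated in plain Python. This is a sketch, not TensorFlow's implementation: it follows the behaviour TensorFlow's documentation describes (values further than two standard deviations from the mean are dropped and drawn again), and the standard deviation of 0.1 and the name `truncated_normal` as a standalone function are assumptions made for illustration.

```python
import random

def truncated_normal(count, stddev=0.1, seed=None):
    """Draw `count` values from a zero-mean normal, redrawing any beyond 2 * stddev."""
    rng = random.Random(seed)
    limit = 2 * stddev
    values = []
    while len(values) < count:
        candidate = rng.gauss(0.0, stddev)
        if abs(candidate) <= limit:    # keep only values within two standard deviations
            values.append(candidate)
    return values

# Initial weights for a 5 x 5 patch, 1 input channel, 4 output channels, as a flat list.
w1_initial = truncated_normal(5 * 5 * 1 * 4, stddev=0.1, seed=0)

print(len(w1_initial))                            # 100
print(all(abs(w) <= 0.2 for w in w1_initial))     # True
```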
It's just a double loop to scan the image in both directions and produce the weighted sums. So we do those weighted sums. We add a bias. We feed this through an activation function, in this case, the relu, and that's our output. And again, the way of stacking these layers is to feed Y1, the first output, as the input of the next layer. All right. After our three convolutional layers, we need to do a weighted sum this time of all the values in this seven by seven by 12 little cube. So to achieve that, we will flatten this cube as one big vector of values. That's what the reshape here does. And then, two additional lines that you should recognize, those are normal neural network layers as we have seen before. All right. How does this work? So this time, it takes a little bit more time to process so I have a video. You see the accuracy is shooting up really fast. I will have to zoom. And the promised 99% accuracy is actually not too far. We're getting there. We're getting there.
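The double loop, the bias, the relu, and the flattening described above can be written out by hand. This is a teaching sketch in plain Python, not TensorFlow's conv2d: the names `conv2d_relu` and `flatten` are illustrative, and the padding is simplified (the patch is anchored at its top-left corner and anything outside the image counts as zero), which is not TensorFlow's exact padding arithmetic.

```python
def conv2d_relu(image, weights, bias, stride=1):
    """Scan `image` with shared `weights`, add `bias`, then apply relu.

    image:   [height][width][in_channels]
    weights: [patch_h][patch_w][in_channels][out_channels]
    bias:    [out_channels]
    """
    height, width = len(image), len(image[0])
    patch_h, patch_w = len(weights), len(weights[0])
    in_ch, out_ch = len(weights[0][0]), len(weights[0][0][0])
    out_h = -(-height // stride)    # ceiling division
    out_w = -(-width // stride)
    output = []
    for oy in range(out_h):                  # the double loop: down the image...
        row = []
        for ox in range(out_w):              # ...and across it
            sums = list(bias)                # start each weighted sum from the bias
            for py in range(patch_h):
                for px in range(patch_w):
                    y, x = oy * stride + py, ox * stride + px
                    if y >= height or x >= width:
                        continue             # outside the image: contributes zero
                    for c in range(in_ch):
                        for k in range(out_ch):
                            sums[k] += image[y][x][c] * weights[py][px][c][k]
            row.append([max(0.0, s) for s in sums])    # relu
        output.append(row)
    return output

def flatten(cube):
    """Turn a [height][width][channels] cube into one flat vector."""
    return [value for row in cube for pixel in row for value in pixel]

image = [[[1.0] for _ in range(4)] for _ in range(4)]              # 4 x 4, one channel
weights = [[[[1.0, -1.0]] for _ in range(2)] for _ in range(2)]    # 2 x 2 patch, 1 in, 2 out
y1 = conv2d_relu(image, weights, bias=[0.0, 0.0], stride=2)

print(len(y1), len(y1[0]), len(y1[0][0]))    # 2 2 2
print(flatten(y1))                           # [4.0, 0.0, 4.0, 0.0, 4.0, 0.0, 4.0, 0.0]
```

The second output channel is all zeros because its weights are negative and the relu clips negative sums to zero.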
Are we getting there? We're not getting there. Oh, damn. I'm so disappointed again. I really wanted to bring this to 99% accuracy. We'll have to do something more. 98.9. Dammit, that was so close. All right. Yes. Exactly. This should be your WTF moment. What is that? On the cross entropy loss curve. OK, let me zoom on it. You see that? That disconnect? Do we have a solution for this? Dropout. Yes. Let's go shooting our neurons. It didn't work last time, maybe this time it will. So actually, what we will do here, it's a little trick. It's almost a methodology for coming up with the ideal neural network for a given situation. And what I like doing is to restrict the degrees of freedom until it's apparent that it's not optimal. It's hurting the performance. Here, I know that I can get about 99%. So I restricted a little bit too much. And from that point, I give it a little bit more freedom and apply dropout to make sure that this additional freedom will not result in overfitting.
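Dropout itself is only a few lines. The sketch below uses the common "inverted dropout" convention, where surviving values are scaled up by 1 / keep_prob so the expected sum stays the same; the keep probability of 0.75 and the function name are assumptions for illustration, not values taken from the talk.

```python
import random

def dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob; scale survivors by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 10

# During training, some neurons are "shot" on every iteration.
dropped = dropout(activations, keep_prob=0.75, rng=rng)

# At test time nothing is dropped: keep_prob = 1.0 leaves the values unchanged.
untouched = dropout(activations, keep_prob=1.0, rng=rng)
print(untouched == activations)    # True
```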
And that's basically how you obtain a pretty optimal neural network for a given problem. So that's what I have done here. You see, the patches are slightly bigger: six by six, five by five, four by four, instead of five by five, four by four, and so on. And I've used a lot more patches. So six patches in the first layer, 12 in the second layer, and 24 in the third layer, instead of four, eight, and 12. And I applied dropout in the fully connected layer. So why not in the other layers? I tried both, it's possible to apply dropout in convolutional layers. But actually, if you count the number of neurons, there are a lot more neurons in the fully connected layer. So it's a lot more efficient to be shooting them there. I mean, it hurts a little bit too much to shoot neurons where you have only a few of them. So with this, let's run this again. So again, the accuracy shoots up very fast. I will have to zoom in. Look where the 99% is and we are above! Yes! [APPLAUSE] Thank you.
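To see how much extra freedom the bigger patches and extra channels give, the convolutional weights of both configurations can be counted. This sketch counts weights only (biases and the fully connected layers are left out), and the function name is illustrative.

```python
def conv_weight_count(layers):
    """Sum patch_size * patch_size * in_channels * out_channels over all layers."""
    return sum(size * size * in_ch * out_ch for size, in_ch, out_ch in layers)

# (patch size, input channels, output channels)
smaller = [(5, 1, 4), (4, 4, 8), (4, 8, 12)]       # the network that stalled at 98.9%
bigger = [(6, 1, 6), (5, 6, 12), (4, 12, 24)]      # the network used with dropout

print(conv_weight_count(smaller))    # 100 + 512 + 1536 = 2148
print(conv_weight_count(bigger))     # 216 + 1800 + 4608 = 6624
```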
I promised you we would get above 99 and we are actually quite comfortably above. We get to 99.3%. And this time, let's see what our dropout actually did. So this is what we had with a five layer network and already a few more degrees of freedom. So more patches in each layer. You see, we are already above 99%. But we have this big disconnect between the test and the training cross entropy. Let's apply dropout, boom. The test cross entropy function is brought under control. It's not shooting up as much. And look, this time, we actually had a problem and this fixed it. With just applying dropout, we got two tenths of a percent more accuracy. And here, we are fighting for the last percent, between 99 and 100. So getting two tenths is enormous with just a little trick. All right. So there we have it. We built this network and brought it all the way to 99% accuracy. The Cliff's Notes is just a summary. And to finish, so this was mostly about TensorFlow. We also have a couple of pre-trained APIs, which you can use just as APIs if your problem is standard enough to fit into one of those Cloud Vision, Cloud Speech, Natural Language, or Translate APIs.
And if you want to run your TensorFlow jobs in the cloud, we also have this Cloud ML Engine service that allows you to execute your TensorFlow jobs in the cloud for training. And what is even more important, with just the click of a button, you can take a trained model and push it to production behind an API and start serving predictions from the model in the cloud. So I think that's a little technical detail, but from an engineering perspective, it's quite significant that you have a very easy way of pushing something to prod. Thank you. You have the code on GitHub and this slide deck is freely available at that URL. And with that, we have five minutes for questions, if you have any. [APPLAUSE] Thank you. [MUSIC PLAYING]
With TensorFlow, deep machine learning transitions from an area of research to mainstream software engineering. In this video, Martin Gorner demonstrates how to construct and train a neural network that recognises handwritten digits. Along the way, he’ll describe some “tricks of the trade” used in neural network design, and finally, he’ll bring the recognition accuracy of his model above 99%.
Content applies to software developers of all levels. Experienced machine learning enthusiasts, this video will introduce you to TensorFlow through well known models such as dense and convolutional networks. This is an intense technical video designed to help beginners in machine learning ramp up quickly.