Deep learning--CNN: Edge detection
Computer vision is one of the areas advancing rapidly thanks to deep learning. Deep learning now helps self-driving cars figure out where the other cars and pedestrians around them are, so they can avoid them. It has made face recognition much better than ever before: you can unlock a phone, or even a door, using just your face. The convolutional neural network (CNN) is the deep learning technique that drives much of this progress in computer vision.
The very first step of CNN processing on an image is to learn low-level features of the image, such as edges. In this article we talk about edge detection. Say we have a 6x6 grayscale image and we want to detect edges in it, for example vertical edges. What we can do is construct a 3x3 matrix, called a filter (or kernel), and convolve the 6x6 image with the 3x3 filter to obtain a 4x4 image as the output. The upper-left element of this 4x4 matrix is computed by placing the 3x3 filter on top of the upper-left 3x3 region of the image, taking the element-wise products, and adding up the resulting 9 numbers. To get the second element in the first row of the 4x4 matrix, you shift the filter one step to the right and again add up the element-wise products. By repeating this shift-and-sum-of-products operation you can compute the whole first row of the 4x4 matrix. For the second row, you move the filter one step down and once more take the element-wise products and sum them up.
So you shift right to compute the next element in a row, shift down to start the first element of the next row, and so on, until you have filled in the whole 4x4 output matrix.
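To make the sliding mechanics concrete, here is a minimal numpy sketch of the operation described above. The name conv2d_valid is my own, and one caveat: the "convolution" used in deep learning does not flip the kernel, so strictly speaking it is cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; at each position take the
    element-wise product with the covered region and sum the results."""
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))   # 6x6 image, 3x3 kernel -> 4x4
    for i in range(out.shape[0]):            # move down one row at a time
        for j in range(out.shape[1]):        # shift right one step at a time
            region = image[i:i + h, j:j + w]
            out[i, j] = np.sum(region * kernel)
    return out
```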
To illustrate edge detection using the convolution operation, we'll use a simplified image. Say we have a 6x6 image where the left half is all 10s and the right half is all 0s. If you think of it as a picture, the 10s in the left half give you brighter pixel intensity values and the 0s in the right half give you darker pixel intensity values (here a shade of gray denotes the 0s), so there is clearly a very strong vertical edge right down the middle of the image as it transitions from white to gray. We also have a 3x3 filter which, visualized as a picture, has lighter pixels on the left, zeros in the middle column, and darker pixels on the right. When you convolve the image with this filter, you get a 4x4 output matrix. If you plot this output as an image, the lighter region is in the middle, and that corresponds to having detected the vertical edge down the middle of the input image. In case the dimensions seem a little off because the detected edge in the output looks really thick, that is only because we are working with a very small image in this example; if you use, say, a 1000x1000 image, you'll find that this does a pretty good job.
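Here is a sketch of this example. The array values follow the description above, and scipy.signal.correlate2d computes the same no-flip "convolution" that deep learning uses:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])  # bright left, dark right
vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

output = correlate2d(image, vertical_filter, mode='valid')
print(output)
# [[ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]]   <- the lighter (30) region marks the vertical edge
```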
One intuition to take away from vertical edge detection is that, since we're using a 3x3 filter, a vertical edge is a 3x3 region with bright pixels on the left, dark pixels on the right, and where you don't care that much about what's in the middle. The middle of the input image is exactly where there are bright pixels on the left and dark pixels on the right, which is why the filter reports a vertical edge right down the middle. Convolution gives you a convenient way to specify how to find these vertical edges in an image.
Now, if the input image is flipped so that it is darker on the left and brighter on the right, the shade of the transition is reversed, and the -30s in the output show a vertical edge that transitions from dark to light rather than from light to dark.
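A quick check of this, reusing the same image and filter as above:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
vertical_filter = np.array([[1, 0, -1]] * 3)

# Flip the image left-to-right: now dark is on the left, bright on the right.
print(correlate2d(np.fliplr(image), vertical_filter, mode='valid'))
# the middle columns now come out as -30: a dark-to-light transition
```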
To find horizontal edges, you use a 3x3 filter that is bright on top and dark at the bottom. Say you have a 6x6 input image that looks like a checkerboard pattern because it is brighter in the upper-left and bottom-right corners. When you convolve this input image with the 3x3 horizontal edge detector, you end up with a 4x4 output image, where the 30s denote a positive edge, i.e., bright pixels on top and dark pixels on the bottom, whereas the -30s denote a negative edge.
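Here is the same kind of sketch for the horizontal case, with the checkerboard image and the bright-on-top filter described above:

```python
import numpy as np
from scipy.signal import correlate2d

bright, dark = np.full((3, 3), 10.0), np.zeros((3, 3))
image = np.block([[bright, dark],
                  [dark, bright]])   # checkerboard: bright upper-left and bottom-right

horizontal_filter = np.array([[ 1,  1,  1],
                              [ 0,  0,  0],
                              [-1, -1, -1]])

print(correlate2d(image, horizontal_filter, mode='valid'))
# [[  0.   0.   0.   0.]
#  [ 30.  10. -10. -30.]   <- 30s: positive (bright-on-top) horizontal edge
#  [ 30.  10. -10. -30.]   <- -30s: negative edge; the 10/-10 in between
#  [  0.   0.   0.   0.]]     come from working with such a tiny image
```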
The 3x3 edge detector we used is just one possible choice; historically, in the computer vision literature there was a fair amount of debate about which set of numbers is best. One alternative is the Sobel filter, which puts a little more weight on the central row, and this arguably makes it more robust; another is the Scharr filter, which uses a different set of numbers with slightly different properties.
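The vertical-edge versions of these two filters look like this:

```python
import numpy as np

# Sobel: extra weight on the central row makes it a bit more robust to noise.
sobel_vertical = np.array([[1, 0, -1],
                           [2, 0, -2],
                           [1, 0, -1]])

# Scharr: a different weighting with slightly different properties.
scharr_vertical = np.array([[ 3, 0,  -3],
                            [10, 0, -10],
                            [ 3, 0,  -3]])
```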
And with the rise of deep learning, one of the things we learned is that when you really want to detect edges in some complicated image, maybe you don't need to hand-pick the 9 numbers in the filter; you can instead treat them as parameters and learn them using backpropagation. The goal is to learn these 9 parameters so that when you take the input image and convolve it with your 3x3 filter, you get a good edge detector. Backpropagation can learn filters that capture the statistics of your data even better than any of these hand-coded filters, and rather than just vertical or horizontal edges, it might learn to detect edges at 45 degrees, or 70 degrees, or 73 degrees, or whatever orientation suits the data.
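As a rough sketch of that idea, here is a tiny numpy gradient-descent loop that recovers a vertical edge filter purely from input/target pairs. The setup (random training images, MSE loss, learning rate) is my own illustration, not a prescribed recipe:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)

# Hand-coded filter used only to generate training targets; the learner
# sees input/target pairs and must recover the 9 numbers by itself.
true_filter = np.array([[1, 0, -1]] * 3, dtype=float)
images = rng.normal(size=(64, 6, 6))
targets = np.stack([correlate2d(x, true_filter, mode='valid') for x in images])

W = rng.normal(scale=0.1, size=(3, 3))   # the 9 learnable parameters
lr = 0.05
for step in range(300):
    preds = np.stack([correlate2d(x, W, mode='valid') for x in images])
    grad_out = 2 * (preds - targets) / preds.size        # dMSE/dOutput
    # The gradient w.r.t. the filter is itself a cross-correlation
    # of each input with the corresponding output gradient.
    grad_W = sum(correlate2d(x, g, mode='valid') for x, g in zip(images, grad_out))
    W -= lr * grad_W

print(np.round(W, 2))   # -> approximately [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
```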
Chen Yang