#14 Coding U-Net Architecture from Scratch
Now that we have a good foundation in Image Segmentation, we will look into another model used for such tasks. U-Net, a convolutional neural network, was proposed in 2015 in a paper by Olaf Ronneberger, Philipp Fischer, and Thomas Brox. It excels at achieving precise segmentation even with limited training data, a common challenge in medical imaging.
The paper builds on the structure proposed by Ciresan, Giusti, Gambardella, and Schmidhuber (2012). The new U-Net model by Ronneberger, Fischer, and Brox outperformed the earlier model at segmenting neuronal structures in electron microscopy (EM) stacks.
The 2012 model trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel. The advantages are that this network can localize, and that the training data in terms of patches is much larger than the number of training images. There are two drawbacks. First, it is slow: the network must be run separately for each patch, and overlapping patches introduce a lot of redundancy. Second, there is a trade-off between localization accuracy and the use of context: larger patches require more max-pooling layers, which reduce localization accuracy, while smaller patches allow the network to see only little context.
U-Shaped
U-Net's name is a true representation of its U-shaped architecture. There is symmetry between the contracting path (encoder) and the expanding path (decoder). The encoder progressively captures image features while reducing spatial resolution. This is achieved through repeated applications of convolutional layers with increasing numbers of filters, followed by max-pooling operations. Imagine a 128 x 128 image fed into the encoder. After passing through convolutional layers and a max-pooling operation, the image size might be reduced to 64 x 64, capturing essential features while discarding less important details.
Encoder
On the left of the U is the encoder. The input image size is 128 x 128. It is passed through two convolutional layers with 64 filters each, followed by a max-pooling layer with a 2 x 2 window and stride 2, which halves the spatial dimensions to 64 x 64.
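The halving done by the max-pooling layer can be sketched in a few lines of NumPy. This is a minimal illustration of 2 x 2 max pooling with stride 2, not the full convolutional block; the channel count of 64 matches the first encoder level described above.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: halves height and width."""
    h, w, c = x.shape
    # Group pixels into 2x2 blocks and take the maximum of each block.
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

# A 128x128 feature map with 64 channels, as after the first conv block.
features = np.random.rand(128, 128, 64)
pooled = max_pool_2x2(features)
print(pooled.shape)  # (64, 64, 64)
```

Each output pixel keeps only the strongest activation in its 2 x 2 neighbourhood, which is how the encoder discards fine detail while retaining salient features.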
The encoder has downsampled our original image (128 x 128) to a size of (8 x 8).
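Going from 128 x 128 down to 8 x 8 takes four pooling steps, since each step halves the spatial size. The short sketch below tabulates the size at each level; the filter counts are illustrative, assuming the common convention of doubling filters at every level (the paper follows this pattern).

```python
# Spatial size and (illustrative) filter count at each encoder level,
# assuming four 2x2 max-pooling steps and filter doubling per level.
size, filters = 128, 64
levels = []
for level in range(1, 5):
    size //= 2       # each pooling halves height and width
    filters *= 2     # filters conventionally double at each level
    levels.append((level, size, filters))
    print(f"level {level}: {size} x {size} spatial, {filters} filters")
```

The last line printed is level 4 at 8 x 8, matching the downsampled size stated above.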
Connecting Paths: The Bottleneck
A bottleneck layer acts as a bridge between the encoder and decoder. There is no pooling layer here, so the spatial dimensions remain the same. It maintains the spatial resolution reached by the encoder while extracting even more features from the data.
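A "same"-padded convolution is what lets the bottleneck add features without shrinking the map. The naive NumPy sketch below shows only the shape behaviour, not an efficient or trained layer; the 512-to-1024 channel counts are illustrative, following the doubling convention.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive 'same' 2D convolution: spatial size preserved, channels change."""
    h, w, c_in = x.shape
    kh, kw, _, c_out = kernels.shape
    pad = kh // 2
    # Zero-pad the borders so the output keeps the input's height and width.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = xp[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

bottleneck_in = np.random.rand(8, 8, 512)        # 8x8 map from the encoder
kernels = np.random.rand(3, 3, 512, 1024)        # 3x3 kernels, 1024 filters
bottleneck_out = conv2d_same(bottleneck_in, kernels)
print(bottleneck_out.shape)  # (8, 8, 1024)
```

The spatial size stays 8 x 8 while the channel depth grows, which is exactly the bottleneck's role.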
Upsampling and Recovering Resolution
The decoder path takes over from the bottleneck. At each level it upsamples the feature maps to increase their spatial dimensions. However, upsampling is not the only operation at each level. The decoder also has skip connections from the corresponding encoder level: the upsampled features are concatenated with the high-resolution features taken from the encoder at the same level. This merge allows the decoder to recover precise spatial details while retaining the learned features.
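The upsample-then-concatenate step can be sketched as follows. Nearest-neighbour upsampling stands in for the paper's learned up-convolution, and the channel counts are illustrative; the point is how the skip connection merges two feature maps of the same spatial size.

```python
import numpy as np

def upsample_2x(x):
    """Nearest-neighbour upsampling: doubles height and width."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

decoder_features = np.random.rand(8, 8, 1024)    # coming from the bottleneck
encoder_features = np.random.rand(16, 16, 512)   # skip connection, same level

upsampled = upsample_2x(decoder_features)        # (16, 16, 1024)
# Concatenate along the channel axis: spatial sizes must match.
merged = np.concatenate([upsampled, encoder_features], axis=-1)
print(merged.shape)  # (16, 16, 1536)
```

The concatenation stacks channels (1024 + 512 = 1536), so subsequent convolutions at this level see both the coarse decoder features and the fine encoder details.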
Segmentation Output
The final step involves applying a 1x1 convolution to the upsampled features. The number of filters in this convolution layer corresponds to the number of classes you want to segment. For instance, if you want to classify each pixel as belonging to one of 10 different tissue types, you would use 10 filters. This final step produces a segmentation map, assigning a class label (e.g., a specific tissue type) to every pixel in the original image.
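A 1x1 convolution is just a per-pixel linear map over channels, so it can be sketched with a single matrix multiplication. The 64-channel input and 10 classes below are illustrative, matching the tissue-type example above; a real model would follow the logits with a softmax and a loss during training.

```python
import numpy as np

def conv1x1(x, weights):
    """A 1x1 convolution: a per-pixel linear map over channels."""
    return x @ weights  # (H, W, C_in) @ (C_in, n_classes) -> (H, W, n_classes)

features = np.random.rand(128, 128, 64)   # final decoder feature map
weights = np.random.rand(64, 10)          # 10 classes (illustrative)

logits = conv1x1(features, weights)
segmentation_map = logits.argmax(axis=-1)  # class label per pixel
print(logits.shape, segmentation_map.shape)  # (128, 128, 10) (128, 128)
```

The argmax over the class axis turns the per-class scores into the final segmentation map: one label for every pixel of the original image.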
Advantages and Applications
U-Net offers several advantages: