#14 Coding U-Net Architecture from Scratch

Now that we have a good foundation in image segmentation, we will look at another model used for such tasks. U-Net, a convolutional neural network, was proposed in 2015 in a paper by Olaf Ronneberger, Philipp Fischer, and Thomas Brox. It excels at precise segmentation even with limited training data, a common challenge in medical imaging.

The paper builds on the structure proposed by Ciresan, Gambardella, Giusti, and Schmidhuber (2012). The new U-Net model by Ronberger, Fischer, and Brox — correction: Ronneberger, Fischer, and Brox — outperformed the existing model on the segmentation of neuronal structures in electron microscopy (EM) stacks.

The 2012 model trained a network in a sliding-window setup to predict the class label of each pixel from a local region (patch) around that pixel. The advantages are that the network can localise, and that the training data, counted in patches, is much larger than the number of training images. There are two drawbacks. First, it is slow: the network must be run separately for each patch, and overlapping patches mean a lot of redundant computation. Second, there is a trade-off between localisation accuracy and the use of context: larger patches require more max-pooling layers, which reduce localisation accuracy, while smaller patches let the network see only a little context.

U-Shaped

U-Net's name is a true representative of its U-shaped architecture: there is symmetry between the contracting path (encoder) and the expanding path (decoder). The encoder progressively captures image features while reducing resolution. This is achieved through repeated convolutional layers with increasing numbers of filters, each pair followed by a max-pooling operation. Imagine a 128 x 128 image fed into the encoder: after passing through convolutional layers and a max-pooling operation, it is reduced to 64 x 64, capturing essential features while discarding finer detail.

Encoder

On the left of the U is the encoder. The 128 x 128 input image is passed through two convolutional layers with 64 filters each, then through a MaxPooling layer with a 2 x 2 window and stride 2. This halves the spatial size, giving 64 x 64 feature maps.

  • The 64 x 64 feature maps are passed through two convolutional layers with 128 filters each, then max-pooled to 32 x 32.
  • These are passed through two convolutional layers with 256 filters each, then pooled to 16 x 16.
  • These are passed through two convolutional layers with 512 filters each.
  • At the fifth level, they are pooled to 8 x 8.

The encoder has downsampled our original image (128 x 128) to a size of (8 x 8).
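The encoder levels above can be sketched in Keras. This is a minimal sketch, not the paper's exact configuration (the original uses unpadded convolutions; here `padding="same"` keeps the sizes matching the walkthrough, and the variable names are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, n_filters):
    # Two 3x3 convolutions, as at each U-Net level.
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    return x

def encoder_block(x, n_filters):
    f = conv_block(x, n_filters)   # features kept for the skip connection
    p = layers.MaxPooling2D(2)(f)  # 2x2 window, stride 2: halves H and W
    return f, p

inputs = tf.keras.Input((128, 128, 3))
f1, p1 = encoder_block(inputs, 64)   # p1: 64 x 64
f2, p2 = encoder_block(p1, 128)      # p2: 32 x 32
f3, p3 = encoder_block(p2, 256)      # p3: 16 x 16
f4, p4 = encoder_block(p3, 512)      # p4: 8 x 8
```

Each `f` tensor is saved because the decoder will concatenate it back in at the matching level.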

Connecting Paths: The Bottleneck

A bottleneck layer acts as a bridge between the encoder and decoder. There is no pooling layer, so the spatial dimensions stay the same: it maintains the 8 x 8 resolution produced by the encoder while extracting even more features from the data.
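As a sketch, the bottleneck is just two more convolutions with no pooling (the 512-channel, 8 x 8 input here stands in for the encoder's final output; 1024 filters follows the doubling pattern of the walkthrough):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for the encoder's pooled output at the deepest level.
p4 = tf.keras.Input((8, 8, 512))

# Two 3x3 convolutions with 1024 filters; padding="same" and the
# absence of pooling keep the spatial size at 8 x 8.
b = layers.Conv2D(1024, 3, padding="same", activation="relu")(p4)
b = layers.Conv2D(1024, 3, padding="same", activation="relu")(b)
```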

Upsampling and Recovering Resolution

The decoder path takes over from the bottleneck. At each level it upsamples the feature maps to increase their dimensions. But upsampling is not the only operation at each level: the decoder also has skip connections from the corresponding encoder level. The upsampled features are concatenated with the high-resolution features taken from the encoder at the same level. This merge allows the decoder to recover precise spatial details while keeping the learned features.

  1. Bottleneck: the 8 x 8 feature maps are upsampled to 16 x 16.
  2. You use the encoder features at the same level. They have the same height and width (16 x 16) and the same number of filters (512). You concatenate the encoder's filters with the decoder's.
  3. You pass the concatenated 1024 filters through two convolutional layers, then upsample to 32 x 32 and concatenate the encoder features at that level.
  4. Upsample to 64 x 64 and repeat the concatenation and convolutions.
  5. Upsample to 128 x 128 and repeat, recovering the original resolution.
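The steps above can be sketched for a single decoder level. This assumes `Conv2DTranspose` for the up-convolution (one common choice; plain upsampling plus a convolution also works), with stand-in inputs shaped like the bottleneck output and the matching encoder features:

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(x, skip, n_filters):
    # Transposed convolution with stride 2 doubles height and width.
    x = layers.Conv2DTranspose(n_filters, 3, strides=2, padding="same")(x)
    # Skip connection: concatenate the encoder features from the same
    # level along the channel axis (512 + 512 = 1024 channels here).
    x = layers.Concatenate()([x, skip])
    # Two convolutions reduce the merged features back to n_filters.
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    return x

bottleneck = tf.keras.Input((8, 8, 1024))   # stand-in bottleneck output
skip = tf.keras.Input((16, 16, 512))        # stand-in encoder features
d = decoder_block(bottleneck, skip, 512)    # 16 x 16 x 512
```

Repeating this block with 256, 128, and 64 filters walks back up to 128 x 128.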

Segmentation Output

The final step involves applying a 1x1 convolution to the upsampled features. The number of filters in this convolution layer corresponds to the number of classes you want to segment. For instance, if you want to classify each pixel as belonging to one of 10 different tissue types, you would use 10 filters. This final step produces a segmentation map, assigning a class label (e.g., a specific tissue type) to every pixel in the original image.
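A sketch of this final step, using the 10-class example from the text (the 64-channel input stands in for the decoder's full-resolution output):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_classes = 10  # e.g. 10 tissue types, as in the example above
features = tf.keras.Input((128, 128, 64))  # stand-in decoder output

# A 1x1 convolution maps the feature channels to one score per class
# for every pixel; softmax turns the scores into class probabilities.
logits = layers.Conv2D(n_classes, 1, padding="same")(features)
seg_map = layers.Softmax(axis=-1)(logits)
```

Taking the argmax over the last axis of `seg_map` gives the class label per pixel.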

Advantages and Applications

U-Net offers several advantages:

  • Efficient Training: It requires fewer training samples compared to other approaches, making it ideal for situations with limited data. Data augmentation is essential to teach the network the desired invariance and robustness properties when only a few training samples are available.
  • Precise Segmentation: Skip connections allow the decoder to recover detailed spatial information. This leads to more accurate segmentation.
  • Well-Suited for Biomedical Imaging: U-Net addresses the challenge of limited training data frequently encountered in medical image analysis.


GitHub:

https://github.com/RiyaChhikara/100daysofComputerVision/blob/main/Day14_UNet.ipynb


Resources:

  1. Advanced Computer Vision with TensorFlow by DeepLearning.ai (Coursera)
  2. Wikipedia: U-Net
  3. Original Paper: 'U-Net: Convolutional Networks for Biomedical Image Segmentation'
