Using padding in CNN layers distorts edges of feature maps - deep-learning

I'm trying to implement a CNN for image denoising. The training set consists of noisy image fragments (32x32) paired with same-sized clean fragments as ground truth. When I applied the trained network to a noisy image, I noticed that the denoised image contains artifacts, like a grid with a 32x32-pixel period. I visualised the feature maps produced by the convolution layers and noticed that layers with zero-padding produce distorted edges. I found a topic that describes, as a solution to the same problem, a convolution whose last step (e.g. for a 3x3 kernel) is to divide the result by 9 or by 4 when zero-padding is used.
None of the articles about convolution operations that I have read mention this. Does anyone know where I can read more about this?
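For what it's worth, the divide-by-the-number-of-valid-pixels idea can be written down directly; it is essentially what "partial convolution based padding" does. The sketch below is my own illustration (not code from the topic mentioned above): it rescales a zero-padded convolution by how many real input pixels each output position actually saw, which gives the 9-vs-4 factor for a 3x3 kernel.

```python
# Sketch (not from the original thread): normalize a zero-padded convolution
# by the number of valid (non-padded) input pixels under each kernel position,
# so border outputs are not attenuated by the padded zeros.
import torch
import torch.nn.functional as F

def normalized_conv2d(x, weight, padding=1):
    # x: (N, C_in, H, W), weight: (C_out, C_in, kH, kW)
    out = F.conv2d(x, weight, padding=padding)

    # Count, for every output location, how many input pixels were real
    # (ones where the image is, zeros in the padding) and rescale by
    # full_kernel_count / valid_count. In the interior the factor is 1;
    # at the borders it grows (e.g. 9/4 in a corner for a 3x3 kernel).
    ones = torch.ones(1, 1, x.shape[2], x.shape[3], dtype=x.dtype)
    count_kernel = torch.ones(1, 1, weight.shape[2], weight.shape[3], dtype=x.dtype)
    valid = F.conv2d(ones, count_kernel, padding=padding)
    full = float(weight.shape[2] * weight.shape[3])
    return out * (full / valid)

# Example: a 3x3 averaging kernel on a constant image stays constant
# all the way to the edges, instead of fading out at the border.
x = torch.ones(1, 1, 32, 32)
w = torch.full((1, 1, 3, 3), 1.0 / 9.0)
y = normalized_conv2d(x, w)
print(y[0, 0, 0, 0].item(), y[0, 0, 16, 16].item())  # both ~1.0
```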

Related

U-Net: how to understand the cropped output

I'm looking for a U-Net implementation for a landmark detection task, where the architecture is intended to be similar to the figure above. For reference, please see this: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms.
From the figure, we can see that the input dimension is 572x572 but the output dimension is 388x388. My question is: how do we visualize and correctly understand the cropped output? As far as I know, we would ideally expect the output size to be the same as the input size (572x572) so we can apply the mask to the original image to carry out segmentation. However, some tutorials (like this one) recreate the model from scratch and use "same" padding to get around this issue, but I would prefer not to use same padding to achieve the same output size.
I couldn't use same padding because I chose to use a pretrained ResNet34 as my encoder backbone, and PyTorch's pretrained ResNet34 implementation doesn't use same padding in the encoder, which means the result is exactly what you see in the figure above (intermediate feature maps are cropped before being copied). If I continue building the decoder this way, the output will be smaller than the input image.
The question is: if I want to use the output segmentation maps, should I pad them on the outside until their dimensions match the input, or should I just resize the map? I'm worried that the first option will lose information about the boundary of the image and that the latter will dilate the landmark predictions. Is there a best practice for this?
The reason I must use a pretrained network is that my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from the experience gained on ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose the spatial size of the image because of convolution; its implementation does in fact use same padding. An illustration:
Input (3, 512, 512) -> Layer1 (64, 128, 128) -> Layer2 (128, 64, 64) -> Layer3 (256, 32, 32) -> Layer4 (512, 16, 16)
so we can use deconvolution (ConvTranspose2d in PyTorch) to bring the spatial dimension back up to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps).
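For concreteness, here is a rough sketch of the decoder just described (untested scaffolding, not the asker's actual model), assuming torchvision's resnet34 as the encoder; channel counts follow the illustration above and skip connections are omitted for brevity.

```python
# Sketch: torchvision ResNet34 encoder, ConvTranspose2d blocks back up to
# 128x128, then a final x4 upsampling to recover 512x512 heatmaps.
import torch
import torch.nn as nn
import torchvision

class ResNet34UNet(nn.Module):
    def __init__(self, num_classes=1):
        super().__init__()
        backbone = torchvision.models.resnet34(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V1"
        # Encoder: (3,512,512) -> layer1 (64,128,128) -> layer2 (128,64,64)
        #          -> layer3 (256,32,32) -> layer4 (512,16,16)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4

        def up(cin, cout):  # one deconvolution block: doubles the spatial size
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=2, stride=2),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.up1 = up(512, 256)   # 16 -> 32
        self.up2 = up(256, 128)   # 32 -> 64
        self.up3 = up(128, 64)    # 64 -> 128
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)
        self.final_up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)  # 128 -> 512

    def forward(self, x):
        x = self.stem(x)
        x = self.layer4(self.layer3(self.layer2(self.layer1(x))))
        x = self.up3(self.up2(self.up1(x)))
        return self.final_up(self.head(x))   # (N, num_classes, 512, 512)

# One heatmap per landmark; the count 19 is arbitrary here.
heatmaps = ResNet34UNet(num_classes=19)(torch.randn(1, 3, 512, 512))
print(heatmaps.shape)  # torch.Size([1, 19, 512, 512])
```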

How do CNNs process RGB images?

In a Convolutional Neural Network, convolution is everywhere.
It is known that if you take a 5x5 greyscale image (1 channel) and convolve it with a 3x3 filter (containing certain weights), you get a 3x3 feature map as a result, as demonstrated by this picture: Convolutions
But what happens once you extend this notion of convolving to RGB images, where you now have 3 channels (R, G, B) to convolve over? You simply add channels to your filter in proportion to the number of channels in your original image, right? Let's say we did; convolving with an RGB image would then look like the following: a 6x6x3 RGB image convolved with a 3x3x3 filter. This apparently results in a 4x4x1 output rather than the 4x4x3 one would expect.
My question is: why is this so?
If you search the internet for visualizations of feature maps, they come back as colorful low- and high-level features. Are those visualizations of the kernels themselves or of the feature maps? Either way, they all have color, which means they must have more than 1 channel, no?
Look at PyTorch's Conv2d and you'll notice that the size of the kernel is determined not only by its spatial width and height (3x3 in your question), but also by the number of input channels and output channels.
So, if you have an RGB input image (= 3 input channels) and a filter of size 3x3x3 (= a single output channel, for 3 input channels and spatial width/height = 3), then your output would indeed be 4x4x1.
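For concreteness, a minimal PyTorch shape check of the above:

```python
# One 3x3x3 filter over a 6x6x3 RGB image (no padding) collapses the channel
# dimension and yields a 4x4x1 feature map; more filters give more channels.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, bias=False)
print(conv.weight.shape)            # torch.Size([1, 3, 3, 3]) -> (out, in, kH, kW)

rgb = torch.randn(1, 3, 6, 6)       # (batch, channels, height, width)
print(conv(rgb).shape)              # torch.Size([1, 1, 4, 4])

# With 16 filters you would get a 4x4x16 stack of feature maps instead.
conv16 = nn.Conv2d(3, 16, kernel_size=3, bias=False)
print(conv16(rgb).shape)            # torch.Size([1, 16, 4, 4])
```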
You can visualize this filter since you can interpret it as a tiny 3x3 RGB image.
Visualizing features/filters that are deeper in the network is not at all trivial, and the images you see are usually the result of optimization processes designed to "uncover" the filters. This page gives an overview of some intricate methods for feature visualization.
Well, color images have 3 channels by definition; you can view a color picture as a stack of 3 matrices of values, so two of the channels (e.g. red and blue) can be set to zero. You should also read about the sparsity of a network...

What is a class score map?

I was going through the VGGNet paper and I came across the testing phase of VGGNet.
During the testing phase, the test image goes through VGGNet and a class score map is obtained. This class score map is spatially averaged to produce a fixed-size vector.
I have googled "class score map", but I couldn't find any relevant results. I would like to know what the role of the class score map is.
Any hint would be greatly helpful. Thanks
When you train an image recognition model, you train it for a specific image size (and resolution), let's say n_dims = [256, 256]. Now, in the prediction phase, you have images of different sizes (in pixels), e.g. [1024, 1024]. You extract patches (you can resize the image first by lowering the resolution) and slide your model over the image patches, and for each patch you obtain a prediction for all classes (more than one of the objects might be present in a patch), which you have to average somehow for the whole image at the end.
See OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.
Instead, we explore the entire image by densely running the network at each location and at multiple scales. While the sliding window approach may be computationally prohibitive for certain types of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields significantly more views for voting, which increases robustness while remaining efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.
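To make the quoted idea concrete, here is a minimal sketch (a stand-in architecture, not the actual VGG code): make the classifier fully convolutional, run it on a larger-than-training image to get a C-channel class score map, and spatially average that map into a fixed-size score vector.

```python
# Sketch of "class score map -> spatial average": a fully convolutional net
# applied to an arbitrary-size image yields a C-vector at every location.
import torch
import torch.nn as nn

C = 1000  # number of classes

features = nn.Sequential(          # stand-in for the VGG conv stack (overall stride 32)
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(32),              # crude placeholder for the five pooling stages
)
classifier = nn.Sequential(        # FC layers rewritten as 7x7 / 1x1 convolutions
    nn.Conv2d(64, 4096, kernel_size=7), nn.ReLU(),
    nn.Conv2d(4096, C, kernel_size=1),
)

img = torch.randn(1, 3, 384, 512)          # arbitrary test-time size
score_map = classifier(features(img))      # (1, C, h, w): a C-vector at each location
scores = score_map.mean(dim=(2, 3))        # spatial average -> fixed-size (1, C) vector
print(score_map.shape, scores.shape)
```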

Slicing up large heterogenous images with binary annotations

I'm working on a deep learning project and have encountered a problem. The images that I'm using are very large and extremely detailed. They also contain a huge amount of necessary visual information, so it's hard to lower the resolution. I've gotten around this by slicing my images into 'tiles' with a resolution of 512x512. There are several thousand tiles for each image.
Here's the problem: the annotations are binary and the images are heterogeneous. Thus, an annotation can be applied to a tile of the image that has no bearing on the actual classification. How can I lessen the impact of tiles that are 'improperly' labeled?
One thought is to cluster the tiles with something like a t-SNE plot and compare the ratio of the binary annotations across the different regions (or 'clusters'). I could then assign weights to tiles based on where they fall and use that as an extra signal in my training. I'm very new to all of this, so I wouldn't be surprised if that's an awful idea! Just thought I'd take a stab.
For background, I'm using transfer learning on Inception v3.
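One way the clustering-and-weighting idea could look in code is sketched below. This is purely an illustration with assumed shapes: it uses k-means rather than t-SNE (t-SNE is mainly a visualization tool), and the particular weighting rule is an assumption, not an established recipe.

```python
# Sketch: embed each tile (e.g. with the Inception v3 backbone), cluster the
# embeddings, and down-weight positive-labeled tiles in clusters whose tiles
# are almost never positive, since their label is likely inherited noise.
import numpy as np
from sklearn.cluster import KMeans

def tile_weights(tile_features, tile_labels, n_clusters=20):
    """tile_features: (n_tiles, d) NumPy array of tile embeddings;
    tile_labels: (n_tiles,) binary labels inherited from the whole image."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(tile_features)
    weights = np.ones(len(tile_labels), dtype=np.float32)
    for c in range(n_clusters):
        idx = clusters == c
        pos_ratio = tile_labels[idx].mean()   # how often this tile "type" is positive
        # Positive tiles in clusters that are rarely positive get less influence.
        weights[idx & (tile_labels == 1)] = max(pos_ratio, 0.1)
    return weights  # use as per-example sample weights in the training loss

# Usage (shapes only): w = tile_weights(features, labels); pass w to the loss.
```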

Size of image for prediction with SageMaker object detection?

I'm using the AWS SageMaker "built in" object detection algorithm (SSD) and we've trained it on a series of annotated 512x512 images (image_shape=512). We've deployed an endpoint and when using it for prediction we're getting mixed results.
If the image we use for prediction is around that 512x512 size, we're getting great accuracy and good results. If the image is significantly larger (e.g. 8000x10000), we get either wildly inaccurate results or no results at all. If I manually resize those large images to 512x512 pixels, the features we're looking for are no longer discernible to the eye, which suggests that if my endpoint is resizing images, that would explain why the model is struggling.
Note: although the size in pixels is large, my images are basically line drawings on a white background. They have very little color and large patches of solid white, so they compress very well. I'm not running into the 6MB request size limit.
So, my questions are:
Does training the model at image_shape=512 mean my prediction images should also be that same size?
Is there a generally accepted method for doing object detection on very large images? I can envisage how I might chop the image into smaller tiles and then feed each tile to my model, but if there's something "out of the box" that will do it for me, that would save some effort.
Your understanding is correct. The endpoint resizes images based on the image_shape parameter. To answer your questions:
As long as the scale of objects (i.e., their extent in pixels) in the resized images is similar between training and prediction data, the trained model should work.
Cropping the large image into tiles is one option. Another is to train separate models for large and small images, as David suggested.
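If you do go the tiling route, a rough sketch might look like the following (the endpoint name is a placeholder, and the assumed response layout of [class, score, xmin, ymin, xmax, ymax] with normalized coordinates should be verified against the SageMaker docs for the built-in SSD algorithm).

```python
# Sketch: cut the large drawing into 512x512 tiles, send each tile to the
# endpoint, and shift the returned boxes back into full-image coordinates.
import io
import json
import boto3
from PIL import Image

TILE = 512
runtime = boto3.client("sagemaker-runtime")

def detect_tiled(image_path, endpoint_name="my-ssd-endpoint", threshold=0.5):
    img = Image.open(image_path).convert("RGB")
    W, H = img.size
    detections = []
    for top in range(0, H, TILE):
        for left in range(0, W, TILE):
            tile = img.crop((left, top, left + TILE, top + TILE))
            buf = io.BytesIO()
            tile.save(buf, format="JPEG")
            resp = runtime.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="image/jpeg",
                Body=buf.getvalue(),
            )
            preds = json.loads(resp["Body"].read())["prediction"]  # assumed key
            for cls, score, xmin, ymin, xmax, ymax in preds:
                if score < threshold:
                    continue
                # De-normalize to tile pixels, then shift into full-image coords.
                detections.append((
                    int(cls), score,
                    left + xmin * TILE, top + ymin * TILE,
                    left + xmax * TILE, top + ymax * TILE,
                ))
    return detections
```

Note that objects straddling a tile boundary will be cut in two; overlapping the tiles and merging duplicate boxes (e.g. with non-maximum suppression) is a common refinement of this approach.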