How to format the image data for training/prediction when images are different in size? - deep-learning

I am trying to train my model, which classifies images.
The problem I have is that the images have different sizes. How should I format my images, or adapt my model architecture?

You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content regardless; do scale and perspective mean anything for your content anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a square size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the heavy lifting.
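As a minimal sketch of that crop-or-pad route (assumptions: a fixed 224x224 target size, and the current TensorFlow 2 name of that op, tf.image.resize_with_crop_or_pad):

import tensorflow as tf

def preprocess(image):
    # Center-crop images larger than 224x224 and zero-pad smaller ones,
    # so every example ends up with the same spatial shape before batching.
    image = tf.image.resize_with_crop_or_pad(image, target_height=224, target_width=224)
    return tf.image.convert_image_dtype(image, tf.float32)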
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.

Try making a spatial pyramid pooling layer and putting it after your last convolution layer, so that the FC layers always get constant-dimensional vectors as input. During training, train on images from the entire dataset using a particular image size for one epoch. Then for the next epoch, switch to a different image size and continue training.
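A minimal sketch of such a layer in PyTorch (the class name and pooling levels below are my own illustrative choices, not a reference implementation): adaptive pooling produces a fixed-size grid per level regardless of the feature map's height and width, so the concatenated vector length depends only on the channel count.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    # Pools a (B, C, H, W) feature map into a fixed-length vector for any H, W.
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        pooled = []
        for level in self.levels:
            # Adaptive pooling always yields a level x level grid, whatever H and W are.
            p = F.adaptive_max_pool2d(x, output_size=level)
            pooled.append(p.flatten(start_dim=1))  # (B, C * level * level)
        return torch.cat(pooled, dim=1)            # length: C * sum(level ** 2)

With levels (1, 2, 4) and C channels, every input yields a vector of length 21 * C, so the fully connected layers that follow can keep a fixed weight matrix while the image size varies between epochs.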

Related

U-net how to understand the cropped output

I'm looking for U-net implementation for landmark detection task, where the architecture is intended to be similar to the figure above. For reference please see this: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms
From the figure, we can see the input dimension is 572x572 but the output dimension is 388x388. My question is, how do we visualize and correctly understand the cropped output? From what I know, we ideally expect the output size to be the same as the input size (572x572) so we can apply the mask to the original image to carry out segmentation. However, in some tutorials (like this one), the author recreates the model from scratch and uses "same" padding to sidestep this question, but I would prefer not to use same padding to achieve the same output size.
I couldn't use same padding because I chose to use a pretrained ResNet34 as my encoder backbone; PyTorch's pretrained ResNet34 implementation didn't use same padding in the encoder part, which means the result is exactly what you see in the figure above (intermediate feature maps are cropped before being copied). If I were to continue building the decoder this way, the output would be smaller than the input image.
The question being: if I want to use the output segmentation maps, should I pad the outside until the dimensions match the input, or should I just resize the map? I'm worried the first option will lose information about the boundary of the image, and the latter will dilate the landmark predictions. Is there a best practice for this?
The reason I must use a pretrained network is because my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from the experiences gained from ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose spatial size to the convolutions; its implementation does in fact use same padding. An illustration is
Input(3,512,512) -> Layer1(64,128,128) -> Layer2(128,64,64) -> Layer3(256,32,32) -> Layer4(512,16,16)
so we can use deconvolution (or ConvTranspose2d in PyTorch) to bring the spatial dimension back up to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps).
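A rough sketch of that decoder idea (the channel counts and layer choices below are illustrative assumptions, not the actual implementation): three stride-2 transposed convolutions bring 16x16 up to 128x128, and a final bilinear upsampling by 4 restores the 512x512 input resolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2),  # 16x16 -> 32x32
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),  # 32x32 -> 64x64
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 1, kernel_size=2, stride=2),    # 64x64 -> 128x128
)

features = torch.randn(1, 512, 16, 16)  # stand-in for the Layer4 encoder output
mask = decoder(features)                # (1, 1, 128, 128)
mask = F.interpolate(mask, scale_factor=4, mode="bilinear", align_corners=False)
print(mask.shape)                       # torch.Size([1, 1, 512, 512])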

CNN - Proper way to adjust existing network (e.g. UNet) to fit the input/output size

I've implemented the original UNet architecture from its paper. It works with 572x572 images and predicts 388x388 images (the original paper uses no padding). I have used this network for another task which has 2048x1024 images as input to create a target of the same size (2048x1024). This fails, because the image size doesn't agree with the network architecture. Then I found code on GitHub which sets padding = 1 for all convolutions, and everything works. Fine.
My question: is that a common thing? "Randomly" (maybe "experimentally" is a better word) tweaking padding or stride parameters until the dimensions fit? But then it isn't the original UNet anymore, right?
I am glad for any advice, because I want to learn a good way of using existing networks for different challenges.
Best
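For reference, the reason the padding = 1 variant fits is that a 3x3 convolution with padding 1 leaves the spatial size unchanged, so each stage only halves or doubles the resolution and a 2048x1024 input comes back out at 2048x1024 as long as the sides stay divisible by 2 at every pooling step. A quick sketch with hypothetical channel counts:

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # 3x3 conv, padding 1: size preserved
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # size preserved again
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 1024, 2048)  # (B, C, H, W)
print(block(x).shape)              # torch.Size([1, 64, 1024, 2048])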

How to concat two tensors of size [B,C,13,18] and [B,C,14,18] respectively in Pytorch?

I often run into this problem when the height or width of an image or a tensor becomes odd.
For example, suppose the original tensor is of size [B,C,13,18]. After passing through a stride-2 conv and several other conv layers, its size becomes [B,C,7,9]. If we upsample the output by 2 and concat it with the original feature map, as in most cases, the error occurs.
I found that many source codes use even sizes like (512,512) for training, so this kind of problem doesn't happen. But for testing, I use the original image size to keep fine details and often run into this problem.
What should I do? Do I need to change the network architecture?
Concatenating tensors with incompatible shapes does not make sense. Information is missing, and you need to specify it yourself. The question is: what do you expect from this concatenation? Usually, you pad the input with zeros, or truncate the output, in order to get compatible shapes (in the general case, being even is not the required condition). If the height and width are large enough, the edge effect should be negligible (well, except perhaps right on the edge, it depends).
So if you are dealing with convolutions only, there is no need to change the architecture strictly speaking; just add a padding layer wherever it seems appropriate.
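A small sketch of both options for the [B,C,13,18] / [B,C,14,18] case from the question (the tensor values here are random placeholders):

import torch
import torch.nn.functional as F

skip = torch.randn(2, 8, 13, 18)  # original feature map (B, C, 13, 18)
up = torch.randn(2, 8, 14, 18)    # upsampled deeper feature map (B, C, 14, 18)

# Option 1: truncate the upsampled tensor down to the skip connection's size.
up_cropped = up[:, :, :skip.shape[2], :skip.shape[3]]
merged = torch.cat([skip, up_cropped], dim=1)  # (2, 16, 13, 18)

# Option 2: zero-pad the skip connection up to the larger size instead.
dh = up.shape[2] - skip.shape[2]
dw = up.shape[3] - skip.shape[3]
skip_padded = F.pad(skip, (0, dw, 0, dh))      # pad order: (left, right, top, bottom)
merged2 = torch.cat([skip_padded, up], dim=1)  # (2, 16, 14, 18)

Either way the mismatch here is only a single row, so the edge effect should indeed be negligible.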

How to use two different sized images as input into a deep network?

I am trying to train a deep neural network which uses information from two separate images in order to get a final image output, similar to this. The difference is that my two input images don't have any spatial relation, as they are completely different images with different amounts of information. How can I use a two-stream CNN or any other architecture with these kinds of inputs?
For reference: One image has size (5184x3456) and other has size (640x240).
First of all: it doesn't matter that you have two images. You have exactly the same problem when you have one image as input, and that single image can have different sizes.
There are multiple strategies to solve this problem:
Cropping and scaling: Just force the input into the size you need. The cropping is done to make sure the aspect ratio is correct. Sometimes the same image, but different parts of it, are fed into the network and the results are combined (e.g. averaged).
Convolutions + Global pooling: Convolutional layers don't care about the input size. At the point where you do care about it, you can apply global pooling. This means you have a pooling region which always covers the complete input, no matter its size (see the sketch after this list).
Special layers: I don't remember the concept or name, but there are some layers which allow differently sized inputs... maybe it was one of the attention-based approaches?
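Here is a minimal sketch of the "convolutions + global pooling" idea (the layer sizes and the 10-class head are arbitrary assumptions): nn.AdaptiveAvgPool2d(1) collapses whatever spatial extent arrives into a single value per channel, so the same classifier head works for any input resolution.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),  # global pooling: (B, 64, 1, 1) regardless of H, W
    nn.Flatten(),
    nn.Linear(64, 10),        # 10 hypothetical classes
)

for h, w in [(432, 648), (240, 640)]:  # two very different input sizes
    x = torch.randn(1, 3, h, w)
    print(model(x).shape)              # torch.Size([1, 10]) both times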
Combining two inputs
Look for "merge layer" or "concatenation layer" in the framework of your choice:
Keras
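For instance, in Keras a two-stream model might look roughly like this (the branch definition, input sizes, and class count are assumptions for illustration; the merge itself is layers.concatenate):

import tensorflow as tf
from tensorflow.keras import layers, Model

def branch(height, width, name):
    # Each stream ends in global pooling, so it emits a fixed-length vector.
    inp = layers.Input(shape=(height, width, 3), name=name)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.GlobalAveragePooling2D()(x)
    return inp, x

inp_a, feat_a = branch(432, 648, "large_image")   # e.g. a downscaled 5184x3456 photo
inp_b, feat_b = branch(240, 640, "small_image")
merged = layers.concatenate([feat_a, feat_b])     # the "merge"/"concatenation" layer
out = layers.Dense(10, activation="softmax")(merged)
model = Model(inputs=[inp_a, inp_b], outputs=out)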
See also
Keras: Variable-size image to convolutional layer
Caffe: Allow images of different sizes as inputs

Image from concept of a convolutional neural network

Suppose I have a CNN which is trained to classify images of different animals. The output of such a model is a point (the output point) in an n-dimensional space, where n is the number of animal classes the model is trained on; that output is then transformed into a one-hot vector of n parameters, giving the label for the image from the point of view of the CNN. But let's stick with the n-dimensional point, which is the CNN's concept of an input image.
Suppose then that I want to take that point and transform it so that the final output is an image with constrained width and height (the dimensions should be the same for different input images) which produces the same point as the input image did. How do I do that?
I'm basically asking about the methods used (mostly the training) for this kind of task, where an image must be reconstructed based on the output point of the CNN. I know the image will never be identical, but I'm looking for images that generate the same (or at least not very different) output point as an input image when fed to the CNN. Keep in mind that the input of the model I'm asking for is n-dimensional and the output is a two- (or three-, if it's not grayscale) dimensional tensor. I noticed that DeepDream does exactly this kind of thing (I think), but every time I put "deepdream" and "generate" into Google, an online generator is almost always shown, not the actual techniques; so if there are some answers to this, I'd love to hear about them.
The output label does not contain enough information to reconstruct an entire image.
Quoting from the DeepDream example ipython notebook:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer.
So the algorithm modifies an existing image such that outputs of certain nodes in the network (can be in an intermediate layer, not necessarily the output nodes) become large. In order to do that, it has to calculate the gradient of the node output with respect to the input pixels.
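A bare-bones sketch of that gradient-ascent loop in PyTorch (assumptions: torchvision's pretrained VGG16 as the network, and a single class logit as the objective rather than the L2 norm of an intermediate layer that DeepDream uses):

import torch
import torchvision.models as models

model = models.vgg16(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)          # only the input image gets updated

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)
target_class = 207                   # an arbitrary ImageNet class index

for step in range(200):
    optimizer.zero_grad()
    logits = model(image)
    loss = -logits[0, target_class]  # ascend on the chosen output activation
    loss.backward()                  # gradient of that node w.r.t. the input pixels
    optimizer.step()
    image.data.clamp_(0, 1)          # keep pixel values in a valid range

Starting from an existing photo instead of noise, and maximizing an intermediate layer's activations instead of an output logit, gives the DeepDream-style result the notebook describes.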