Determining position of anchor boxes in original image using downsampled feature map

From what I have read, I understand that methods such as Faster R-CNN and SSD involve generating a set of anchor boxes. We first downsample the training image using a CNN, and for every pixel in the downsampled feature map (which will form the center of our anchor boxes) we project it back onto the training image. We then draw the anchor boxes centered around that pixel using our pre-determined scales and ratios. What I don't understand is why we don't directly assume the centers of our anchor boxes on the training image with a suitable stride and use the CNN only to output the classification and regression values. What do we gain by using the CNN to determine the centers of our anchor boxes, which are ultimately going to be distributed evenly over the training image?
To state more clearly -
Where will the centers of our anchor boxes be on the training image before our first prediction of the offset values and how do we decide those?

I think the confusion comes from this:
What are we gaining by using the CNN to determine the centers of our anchor boxes which are ultimately going to be distributed evenly on the training image
The network usually doesn't predict centers but corrections to a prior belief. The initial anchor centers are distributed evenly across the image and, as such, don't fit the objects in the scene tightly. Those anchors just constitute a prior in the probabilistic sense. Exactly what your network outputs is implementation-dependent, but it will most likely be updates, i.e. corrections to those initial priors. This means that the "centers" predicted by your network are some delta_x, delta_y that adjust the anchor boxes.
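To make that concrete, here is a minimal NumPy sketch (variable names are my own, and the delta parameterization shown is the common Faster R-CNN-style one, not the only possibility): the anchor centers come directly from the feature-map grid and the stride, and the network only supplies the corrections.

import numpy as np

def anchor_centers(feat_h, feat_w, stride):
    """Project every feature-map cell back onto the input image.

    The center of cell (i, j) lands at ((j + 0.5) * stride, (i + 0.5) * stride)
    in image coordinates, so the anchors form an even grid by construction.
    """
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    return np.stack([cx.ravel(), cy.ravel()], axis=1)  # shape (feat_h * feat_w, 2)

def make_anchors(centers, sizes):
    """Attach each pre-determined (w, h) pair (from scales/ratios) to every center."""
    reps = np.repeat(centers, len(sizes), axis=0)
    whs = np.tile(np.asarray(sizes, dtype=float), (len(centers), 1))
    return np.concatenate([reps, whs], axis=1)  # shape (N, 4) as (cx, cy, w, h)

def decode(anchors, deltas):
    """Apply predicted corrections (dx, dy, dw, dh) to anchors (cx, cy, w, h).

    Offsets are relative to the anchor size, and width/height corrections
    are in log-space, as in the Faster R-CNN parameterization.
    """
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx, cy, w, h], axis=1)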
Regarding this part:
why dont we directly assume the centers of our anchor boxes on the training image with a suitable stride and use the CNN to only output the classification and regression values
The regression values should still contain enough information to determine a bounding box uniquely. Predicting width, height and center offsets (corrections) is a straightforward way to do it, but it's certainly not the only way. For example, you could modify the network to predict, for each pixel, the distance vector to its nearest object center, or you could use parametric curves. However, crude, fixed anchor centers are not a good idea, since they will also cause problems in classification: you use them to pool features that are supposed to be representative of the object.
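As a rough illustration of that per-pixel alternative (a hedged sketch with invented names, not any specific paper's formulation), each feature-map cell could predict an offset to its nearest object center plus a width and height, with no anchors involved:

import numpy as np

def decode_pixelwise(pred, stride):
    """pred: (H, W, 4) array of per-pixel (dx, dy, w, h) predictions.

    Each feature-map cell votes for a box centered at its own image-space
    position plus the predicted offset; no anchors are involved.
    """
    H, W, _ = pred.shape
    xs = (np.arange(W) + 0.5) * stride
    ys = (np.arange(H) + 0.5) * stride
    px, py = np.meshgrid(xs, ys)
    cx = px + pred[..., 0]
    cy = py + pred[..., 1]
    w, h = pred[..., 2], pred[..., 3]
    return np.stack([cx, cy, w, h], axis=-1)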

Related

How to do object detection on high resolution images?

I have images of around 2000 x 2000 pixels. The objects I am trying to identify are smaller (typically around 100 x 100 pixels), but there are a lot of them.
I don't want to resize the input images, apply object detection and rescale the output back to the original size. The reason for this is I have very few images to work with and I would prefer cropping (which would lead to multiple training instances per image) over resizing to smaller size (this would give me 1 input image per original image).
Is there a sophisticated way of cropping and reassembling images for object detection, especially at inference time on test images?
For training, I suppose I would just take random crops and use those for training. But for testing, I want to know if there is a specific way of cropping the test image, applying object detection, and combining the results back to get the output for the original large image.
I guess running the detector on several crops simultaneously is an option (I've never tried it): for your images, split them into a 4x4 grid of overlapping tiles (roughly (500+50) x (500+50) pixels each, instead of processing each image as a single whole), then reassemble the detections at the output stage, probably with NMS at the tile borders since you mentioned the targets are dense.
But it's a bit clunky.
One useful insight for detection on high-resolution images is to alter the backbone with "U"-shaped shortcuts, which addresses some of these problems without resizing the images. See U-Net.
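A minimal sketch of the tile-and-merge idea above (the detector and nms functions here are placeholders, not a particular library's API): run detection on overlapping crops, shift each crop's boxes back into full-image coordinates, then run NMS once over the combined set so duplicates near tile borders collapse.

import numpy as np

def detect_tiled(image, detector, tile=500, overlap=50):
    """Run `detector(crop) -> (boxes, scores)` on overlapping tiles.

    Boxes are assumed to be (x1, y1, x2, y2) in crop coordinates; they are
    shifted back into full-image coordinates before merging.
    """
    H, W = image.shape[:2]
    all_boxes, all_scores = [], []
    step = tile - overlap
    for y0 in range(0, H, step):
        for x0 in range(0, W, step):
            crop = image[y0:y0 + tile, x0:x0 + tile]
            boxes, scores = detector(crop)
            if len(boxes):
                boxes = np.asarray(boxes, dtype=float)
                boxes[:, [0, 2]] += x0
                boxes[:, [1, 3]] += y0
                all_boxes.append(boxes)
                all_scores.append(np.asarray(scores, dtype=float))
    if not all_boxes:
        return np.zeros((0, 4)), np.zeros((0,))
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores, iou_threshold=0.5)  # any standard NMS implementation
    return boxes[keep], scores[keep]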

Initializing convolutional layers in CNN

Is there a function to initialize weights of a convolutional layer to focus more on information closer to the center of input images?
All my input images are centered, so pixels further away from the center of an image matter less than pixels closer to the center.
Please see the GIFs here for a demonstration of convolutions:
https://github.com/vdumoulin/conv_arithmetic#convolution-animations
As you can see, convolutions operate the same way regardless of position in the image, so weight initialization cannot change where in the image the network focuses.
It is also not advisable to rush into assumptions about what the net will and won't need in order to learn your task. There is sometimes a surprising amount of signal outside what you as a human would focus on. I would suggest training the net and seeing how it performs, and then (as others have suggested) thinking about cropping.
Is there a function to initialize weights of a convolutional layer to focus more on information closer to the center of input images?
This is not possible, because initialization is only there to kick off the learning process.
The model, however, is what can implement functions that achieve that kind of attention.
You also don't need to initialize conv layers manually, because in PyTorch this is already done automatically.
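For illustration, a small PyTorch sketch (assuming recent PyTorch; the layer sizes are arbitrary): conv layers are already initialized on construction (Kaiming-uniform by default), and a custom scheme only changes the starting point of training, not where in the image the network looks.

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # already initialized on construction
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

# Optional: override the default initialization. This changes the starting
# point of training, not the spatial focus of the convolutions.
def init_weights(m):
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_weights)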

how to format the image data for training/prediction when images are different in size?

I am trying to train a model that classifies images.
The problem I have is that the images have different sizes. How should I format my images and/or model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights, and that's not possible.
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective even mean much for your content?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image is split into N different images of the correct size.
Pad the images with a solid color to a square size, then resize.
Do a combination of the above.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the bigger work.
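For example, a minimal sketch of the crop-or-pad route (in TF 2.x the op is tf.image.resize_with_crop_or_pad; the target size and helper name here are just illustrative):

import tensorflow as tf

def load_and_standardize(path, target_size=224):
    """Decode an image and center-crop or zero-pad it to a fixed square size.

    In TF 1.x the same op is called tf.image.resize_image_with_crop_or_pad.
    """
    image = tf.io.decode_image(tf.io.read_file(path), channels=3,
                               expand_animations=False)
    image = tf.image.resize_with_crop_or_pad(image, target_size, target_size)
    return tf.cast(image, tf.float32) / 255.0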
As for simply not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a special way.
Try adding a spatial pyramid pooling layer after your last convolutional layer, so that the FC layers always receive constant-dimensional vectors as input. During training, train on images from the entire dataset using one particular image size for an epoch, then switch to a different image size for the next epoch and continue training.
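Here is one possible (hedged) PyTorch sketch of such a layer: pool the last feature map at a few fixed grid sizes and concatenate the results, so the flattened vector length no longer depends on the input image size.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pools a (N, C, H, W) feature map at several fixed grid sizes.

    Output length is C * sum(g*g for g in grid_sizes), independent of H and W,
    so fully connected layers after it always see the same dimensionality.
    """
    def __init__(self, grid_sizes=(1, 2, 4)):
        super().__init__()
        self.grid_sizes = grid_sizes

    def forward(self, x):
        n = x.size(0)
        pooled = [F.adaptive_max_pool2d(x, g).view(n, -1) for g in self.grid_sizes]
        return torch.cat(pooled, dim=1)

# Example: two differently sized inputs give equally sized outputs.
spp = SpatialPyramidPooling()
a = torch.randn(1, 64, 13, 13)
b = torch.randn(1, 64, 7, 9)
print(spp(a).shape, spp(b).shape)  # both are (1, 64 * (1 + 4 + 16)) = (1, 1344)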

Fully convolutional autoencoder for variable-sized images in keras

I want to build a convolutional autoencoder where the size of the input is not constant. I'm doing that by stacking up conv-pool layers until I reach an encoding layer, and then doing the reverse with upsample-conv layers. The problem is that no matter what settings I use, I can't get exactly the same size in the output layer as in the input layer. The reason is that the UpSampling layer (given, say, a (2,2) size) doubles the size of its input, so I can't get odd dimensions, for instance. Is there a way to tie the output dimension of a given layer to the input dimension of an earlier layer for individual samples (as I said, the input size to the max-pool layers is variable)?
Yes, there is.
You can use three methods:
Padding
Resizing
Crop or Pad
Padding will only increase the dimensions, so it is not useful for reducing the size.
Resizing is more costly, but it handles both cases (up- and downsampling): it keeps all the values in range and simply resamples them to the given dimensions.
Crop or Pad works like a resize but is more compute-efficient, since there is no interpolation involved. However, if you shrink to a smaller dimension, it will crop from the edges.
Using these three, you can arrange your layers' dimensions.
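As a hedged Keras sketch of one way to do this (layer sizes and names are arbitrary): keep everything fully convolutional with 'same' padding, and crop the decoder output back to the input's spatial size in a final Lambda layer. Since 'same' pooling rounds up, the decoded map is never smaller than the input, so cropping is sufficient.

import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder():
    """Fully convolutional autoencoder that accepts variable-sized inputs."""
    inp = layers.Input(shape=(None, None, 1))
    x = layers.Conv2D(16, 3, activation='relu', padding='same')(inp)
    x = layers.MaxPooling2D(2, padding='same')(x)
    x = layers.Conv2D(8, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2, padding='same')(x)

    x = layers.Conv2D(8, 3, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(16, 3, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)
    decoded = layers.Conv2D(1, 3, activation='sigmoid', padding='same')(x)

    # Crop the decoded image back to the exact input height/width; with
    # 'same' pooling the decoded tensor is always at least as large as inp.
    out = layers.Lambda(
        lambda t: t[0][:, :tf.shape(t[1])[1], :tf.shape(t[1])[2], :]
    )([decoded, inp])
    return tf.keras.Model(inp, out)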

Isometric depth sorting issue with big objects

I'm currently building an AS3 isometric game, but I'm having a lot of problems with depth sorting. I've searched for a solution but didn't find anything that matches my problem (rectangular objects).
Here is a screenshot of my game:
As you can see, depth sorting works well between 1x1-tile objects. I simply use their x and y coordinates (relative to the isometric map) to sort them.
The problem comes when I have bigger objects, like 2x2 or 1x4 or 4x1.
Any idea how should I handle depth sorting then?
I don't think it is possible to sort a scene based on a single x,y value for each object if some of them can be long enough that one end should be at a different depth than the other. For instance, consider how you'd handle the rendering if the brown chair in your picture was moved one square down-left (to the square between the blue chair and the long couch). It would be deeper in the scene than the red table behind the couch, but would need to be rendered on top of the couch, which would need to be on top of the table.
I think there are two simple solutions:
Design your levels using only one sort of overlap for large objects. For instance, you could specify that an object's depth is based on its nearest corner, which would require you to avoid putting things in front of its most distant bits (since it will render on top of them). Or you could stick with your current code (which seems to use the most distant corner for depth) and avoid putting anything behind the nearer parts. You may still have trouble with characters and other objects that move around, though. You might be able to make the troublesome tiles inaccessible if you're careful with your design, but in some cases this may be too restrictive.
Break up your large objects into smaller ones, each with its own depth. You will probably want to go right down to 1x1 pieces, each of which has an unambiguous depth. You might choose to keep the larger objects in the code as invisible containers for the smaller pieces, or they could be eliminated entirely, whichever makes it easier for you to load up and enable interaction with the various bits.
Splitting larger objects into 1x1-sized pieces can also be nice because you can make them modular. That is, you can build differently sized objects by putting 1x1 pieces together in different combinations. If you cut the 2x1 tables in your image in half vertically, for instance, and created a 1x1 middle tile that fits in between them, you could stretch the design out to 3x1 or 10x1, depending on how many times you repeat the middle tile. There are a lot of other ways to make tiled graphics look good with only a modest amount of art required.
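The game itself is AS3, so this Python sketch only illustrates the bookkeeping behind the second suggestion: expand each multi-tile footprint into 1x1 cells that remember which slice of the parent sprite they draw, then sort all cells by a single unambiguous depth.

def split_into_cells(obj):
    """Expand an object with an integer footprint (x, y, w, h on the iso grid)
    into 1x1 cells, each remembering which slice of the parent sprite it draws."""
    cells = []
    for dy in range(obj['h']):
        for dx in range(obj['w']):
            cells.append({
                'x': obj['x'] + dx,
                'y': obj['y'] + dy,
                'sprite': obj['sprite'],
                'slice': (dx, dy),  # which strip of the big sprite this cell draws
            })
    return cells

def render_order(objects):
    """Flatten every object into 1x1 cells and sort them back-to-front.

    For a standard isometric projection, drawing cells in increasing (x + y)
    order (ties broken consistently) gives a correct painter's ordering,
    because every 1x1 cell has an unambiguous depth.
    """
    cells = [c for obj in objects for c in split_into_cells(obj)]
    return sorted(cells, key=lambda c: (c['x'] + c['y'], c['y']))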
Ultima Online emulators (specifically, POL, though there may be others) achieve this through the concept of a 'multi': a single object composed of sections of cut-up larger graphics. The cut-up graphics are such that their sprites are split vertically at the left- and right-corner points of the iso grid boundaries.
Other considerations:
- render 'multi' pieces sorted along the screen Y axis, from top to bottom.
- the southern (i.e. screen bottom-left) component of a 'multi' becomes the anchoring tile position (in the case of your couch, its left-most piece).
- consider that each map location can also hold its own vertical stack of objects; offsetting each object's render by screen Y simulates height/altitude, and these must be sorted bottom-to-top (i.e. lowest altitude to highest altitude).
Good luck!