Can too many background images decrease YOLOv5 model performance? - deep-learning

I have a dataset with many background images (those without labels), at least 50% of all images in the dataset. Now I read in the YOLOv5 tutorials that it is recommended that about 10% of the whole dataset are such background images. But in my dataset it would be quite difficult to identify all those background images.
Thus, if a dataset includes that many background images, would that just extend training time, or would it also have a negative impact on the overall model training performance?

Yes, it will have a negative impact on overall model training performance, so the recommended ~10% guideline is worth following. You don't need to identify or label these backgrounds individually: just include them as negative images, i.e. images with no label file or an empty label file. In the right proportion, they help decrease false positives.
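As a minimal sketch of the "empty label" convention: YOLOv5 treats an image whose label .txt file is missing or empty as pure background, so registering negatives is just a matter of creating empty label files. The function name and directory layout below are hypothetical, not part of YOLOv5 itself:

```python
import os

def add_backgrounds(image_paths, labels_dir):
    """Register images as YOLO-style negatives by writing empty label files.

    An image with an empty (or absent) label .txt is treated as pure
    background by YOLOv5, so no annotation work is needed.
    """
    os.makedirs(labels_dir, exist_ok=True)
    written = []
    for img in image_paths:
        stem = os.path.splitext(os.path.basename(img))[0]
        label_path = os.path.join(labels_dir, stem + ".txt")
        open(label_path, "w").close()  # empty file marks the image as background
        written.append(label_path)
    return written
```

To keep the recommended ratio, you would sample only enough backgrounds to make up roughly 10% of the final dataset rather than passing in all of them.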

Related

Object Classification - Augmented dataset with or without originals?

I am training with YOLOv5 and have a small dataset, so I decided to enlarge it through augmentation (rotation, shearing, etc.) to increase its size and improve accuracy.
I have seen augmented datasets labeled both as "with original images" and "without original images".
I was wondering: is there a difference between training with and without the original images, beyond there simply being more images?

CNN - Proper way to adjust existing network (e.g. UNet) to fit the input/output size

I've implemented the original UNet architecture from its paper. It works with 572x572 input images and predicts 388x388 outputs (the original paper uses no padding). I used this network for another task with 2048x1024 input images to produce same-sized (2048x1024) targets. This fails because the image size doesn't agree with the network architecture. Then I saw code on GitHub that sets padding = 1 for all convolutions, and everything works. Fine.
My question: is that a common thing, "randomly" (or rather experimentally) tweaking padding or stride parameters until the dimensions fit? But then it isn't the original UNet anymore, right?
I am glad for any advice, because I want to learn a good way of adapting existing networks to different challenges.
Best

How to do object detection on high resolution images?

I have images of around 2000 x 2000 pixels. The objects I am trying to identify are smaller (typically around 100 x 100 pixels), but there are a lot of them.
I don't want to resize the input images, apply object detection, and rescale the output back to the original size. The reason is that I have very few images to work with, and I would prefer cropping (which yields multiple training instances per image) over resizing to a smaller size (which gives only one input image per original image).
Is there a sophisticated way of cropping and reassembling images for object detection, especially at inference time on test images?
For training, I suppose I would just take random crops and use those. But for testing, I want to know if there is a specific way of cropping the test image, applying object detection, and combining the results back into an output for the original large image.
One option (I've never tried it) is to run the detector on a grid of overlapping crops, e.g. a 4x4 grid of roughly 550x550 tiles (500 px plus 50 px of overlap) covering the 2000x2000 image, then reassemble the detections at the output stage, probably with NMS at the tile borders since you mentioned the targets are dense.
But it's somewhat awkward.
One useful insight for detection on high-resolution images is to alter the backbone with a "U"-shaped shortcut, which solves some of these problems without resizing the images. See U-Net.
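The tiling idea above can be sketched as pure coordinate arithmetic. This is a hypothetical helper, not part of any detection library; it assumes the image is at least one tile wide and tall. At inference you would run the detector on each crop, offset its boxes by the tile origin, and merge near the seams with NMS:

```python
def tile_coords(img_w, img_h, tile, overlap):
    """Top-left corners of overlapping tiles covering a large image.

    Square crops of side `tile` share `overlap` pixels with their
    neighbours; the last tile in each row/column is shifted back so it
    still lies inside the image. Assumes img_w >= tile and img_h >= tile.
    """
    step = tile - overlap
    xs = list(range(0, img_w - tile + 1, step))
    ys = list(range(0, img_h - tile + 1, step))
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)  # extra column so the right edge is covered
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)  # extra row so the bottom edge is covered
    return [(x, y) for y in ys for x in xs]
```

For a 2000x2000 image with 512 px tiles and 64 px overlap this yields a 5x5 grid whose union covers every pixel exactly once or more.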

How to use two different sized images as input into a deep network?

I am trying to train a deep neural network which uses information from two separate images in order to get a final image output similar to this. The difference is that my two input images don't have any spatial relation as they are completely different images with different amounts of information. How can I use a two-stream CNN or any other architecture using these kinds of input?
For reference: One image has size (5184x3456) and other has size (640x240).
First of all: it doesn't matter that you have two images. You face exactly the same problem with a single input image that can have different sizes.
There are multiple strategies to solve this problem:
Cropping and scaling: Just force the input into the size you need. The cropping is done to keep the aspect ratio correct. Sometimes different parts of the same image are fed into the network and the results are combined (e.g. averaged).
Convolutions + global pooling: Convolutional layers don't care about the input size. At the point where the size starts to matter, you can apply global pooling, i.e. a pooling region that always covers the complete input, regardless of its size.
Special layers: I don't remember the exact name, but there are layers that allow differently sized inputs; it may have been one of the attention-based approaches.
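The global-pooling strategy is easy to demonstrate without any framework. A minimal numpy sketch (the function name is mine, and (C, H, W) channel-first layout is assumed): feature maps of any spatial size collapse to vectors of identical length, which a fully connected head can then consume; two such vectors from two input streams can simply be concatenated:

```python
import numpy as np

def global_avg_pool(feature_map):
    """Global average pooling: (C, H, W) -> (C,), for any H and W.

    Because the pooling region spans the whole spatial extent, feature
    maps from differently sized inputs all reduce to vectors of the
    same length C.
    """
    return feature_map.mean(axis=(1, 2))
```

For the two-stream case, `np.concatenate([global_avg_pool(a), global_avg_pool(b)])` gives a fixed 2C-dimensional vector even when the two images (and hence the two feature maps) have unrelated sizes.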
Combining two inputs
Look for "merge layer" or "concatenation layer" in the framework of your choice:
Keras
See also
Keras: Variable-size image to convolutional layer
Caffe: Allow images of different sizes as inputs

how to format the image data for training/prediction when images are different in size?

I am trying to train my model which classifies images.
The problem I have is that they have different sizes. How should I format my images, or my model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're up for trouble: Here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here's some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective mean anything to the content in your case?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
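The pad-then-resize option can be sketched in a few lines of numpy. The helper name is hypothetical; it assumes an (H, W, C) or (H, W) array and splits the padding evenly between the two sides of the shorter axis, so a subsequent resize to the network's input size preserves the aspect ratio:

```python
import numpy as np

def pad_to_square(img, fill=0):
    """Pad an (H, W, ...) image with a solid colour so that H == W."""
    h, w = img.shape[:2]
    size = max(h, w)
    top = (size - h) // 2
    left = (size - w) // 2
    out = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    out[top:top + h, left:left + w] = img  # centre the original content
    return out
```

As noted above, the solid border is exactly what the network may become biased to, so the fill colour (and whether to use reflection padding instead) is worth treating as a design choice rather than a default.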
If you need some ideas, have a look at the Images section of the TensorFlow documentation, there's pieces like resize_image_with_crop_or_pad that take away the bigger work.
As for just don't caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there is actually a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very particular way.
Try adding a spatial pyramid pooling layer after your last convolutional layer, so that the FC layers always receive constant-dimensional vectors as input. During training, train on the entire dataset at one particular image size for an epoch, then switch to a different image size for the next epoch and continue.
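The core of SPP is small enough to sketch in numpy (function name and channel-first (C, H, W) layout are my assumptions; the paper's levels 1x1, 2x2, 4x4 are used as defaults). Each level splits the map into an n x n grid and max-pools per cell, so the output length C * sum(n^2) is independent of H and W:

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling: (C, H, W) -> fixed-length vector.

    For each level n, the map is divided into an n x n grid and
    max-pooled per cell, giving an output of length C * sum(n*n)
    regardless of the spatial input size (assumes H, W >= max(levels)).
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        ys = np.linspace(0, h, n + 1).astype(int)  # grid row boundaries
        xs = np.linspace(0, w, n + 1).astype(int)  # grid column boundaries
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)
```

With the default levels, a map with C = 8 channels always yields an 8 * (1 + 4 + 16) = 168-dimensional vector, whether the input was 13x13 or 7x9, which is exactly what lets the FC layers stay fixed while the image size varies between epochs.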