Object Classification - Augmented dataset with or without originals?

I am training YOLOv5 and had a small dataset. I decided to enlarge it by augmenting it with rotations, shearing, etc., to increase its size and improve accuracy.
Now I have seen augmented datasets labeled both as "with original images" and "without original images".
I was wondering: is there a difference between training with and without the original images, besides there simply being more images?
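A common way to keep the originals is simply to write the augmented copies alongside them, so training sees both. A minimal sketch, assuming JPEGs under a hypothetical dataset/images/train folder (for detection, the bounding-box labels would also have to be transformed in the same way, e.g. with a library such as Albumentations):

    import random
    from pathlib import Path
    from PIL import Image

    # Hypothetical paths/parameters: write a rotated copy next to each original,
    # so the augmented dataset contains the originals plus their variants.
    src = Path("dataset/images/train")
    for img_path in src.glob("*.jpg"):
        img = Image.open(img_path)
        aug = img.rotate(random.uniform(-15, 15))  # small random rotation
        aug.save(img_path.with_name(img_path.stem + "_aug.jpg"))
        # NOTE: for object detection, the corresponding label file must be
        # transformed with the same rotation; this sketch only handles images.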

Related

Predicting on images that are a different size than the training data (different shape but same resolution)?

I am trying to train my UNET model, which segments images. I used random crops of a large image for training. The problem I have is that my images have different sizes at training and test time. Which method can I use for prediction on a large image?
I tried predicting on the full image and predicting patch by patch, with each patch the same size as the training crops. But I still don't understand why the two methods don't give the same result.
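One common pattern is to predict patch by patch with patches the same size as the training crops and stitch the results back together; the mismatch with full-image prediction usually comes from border padding and from the different context the network sees. A minimal sketch, assuming a trained Keras-style `model`, a 3-channel H x W x C image array, and a hypothetical training crop size of 256:

    import numpy as np

    PATCH = 256  # assumed training crop size

    def predict_large_image(model, image):
        """Tile the image, predict each tile, and stitch the masks together."""
        h, w = image.shape[:2]
        mask = np.zeros((h, w), dtype=np.float32)
        for y in range(0, h, PATCH):
            for x in range(0, w, PATCH):
                patch = image[y:y + PATCH, x:x + PATCH]
                ph, pw = patch.shape[:2]
                # pad border patches so the network always sees PATCH x PATCH input
                padded = np.pad(patch, ((0, PATCH - ph), (0, PATCH - pw), (0, 0)),
                                mode="edge")
                pred = model.predict(padded[None, ...])[0, ..., 0]
                mask[y:y + ph, x:x + pw] = pred[:ph, :pw]
        return mask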

Can too many background images decrease YOLOv5 model performance?

I have a dataset with many background images (those without labels), at least 50% of all images in the dataset. Now I read in the YOLOv5 tutorials that it is recommended that about 10% of the whole dataset be such background images. But in my dataset it would be quite difficult to identify all those background images.
Thus, if a dataset includes that many background images, would that just extend training time, or would it also have a negative impact on the overall model training performance?
It will have a negative impact on the overall model training performance; the recommended ratio is a good guideline. You don't need to identify or label these backgrounds. Just add them as negative images to your dataset: simply put the (background) images with no label file, or an empty label file, in your dataset. This will decrease false positives.
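In practice that just means making sure each background image either has no label file at all or an empty one. A minimal sketch, with hypothetical paths, following the usual YOLOv5 images/labels layout:

    from pathlib import Path

    # Hypothetical dataset layout: dataset/images/train + dataset/labels/train.
    images_dir = Path("dataset/images/train")
    labels_dir = Path("dataset/labels/train")
    labels_dir.mkdir(parents=True, exist_ok=True)

    for img in images_dir.glob("*.jpg"):
        label = labels_dir / (img.stem + ".txt")
        if not label.exists():
            label.touch()  # empty label file -> YOLOv5 treats the image as background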

What does "rewritten box" mean during YOLO training on a custom dataset?

I am training a YOLO network on my custom dataset, and during training I get information about rewritten boxes (as shown in the figure).
For example: total_bbox = 29159, rewritten_bbox = 0.006859 %
What does that mean? Is my training proceeding correctly?
The optimal number of layers and the optimal resolution depend on the dataset.
The smaller the objects, the higher the resolution required.
The larger the objects, the more layers required. There is an article on choosing the optimal number of layers, filters and resolution for the MS COCO dataset: https://arxiv.org/pdf/1911.09070.pdf
It depends on what accuracy and speed you want. To reduce the rewritten_bbox %, increase the resolution and/or move some masks from the [yolo] layers with low resolution to the [yolo] layers with higher resolution, and retrain. Setting iou_thresh=1 may also reduce the rewritten_bbox %.

Slicing up large heterogeneous images with binary annotations

I'm working on a deep learning project and have encountered a problem. The images that I'm using are very large and extremely detailed. They also contain a huge amount of necessary visual information, so it's hard to downgrade the resolution. I've gotten around this by slicing my images into 'tiles' with a resolution of 512 x 512. There are several thousand tiles for each image.
Here's the problem: the annotations are binary and the images are heterogeneous. Thus, an annotation can be applied to a tile of the image that has no bearing on the actual classification. How can I lessen the impact of tiles that are 'improperly' labeled?
One thought is to cluster the tiles with something like a t-SNE plot and compare the ratio of the binary annotations for different regions (or 'classes'). I could then assign weights to tiles based on where they are located and use that as an extra layer in my training. I'm very new to all of this, so I wouldn't be surprised if that's an awful idea! Just thought I'd take a stab.
For background, I'm using transfer learning on Inception v3.
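As a rough sketch of that weighting idea (swapping t-SNE for a simple k-means clustering of tile features; all file names, the cluster count, and the feature source are hypothetical, not from the post):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical inputs: `features` is an (n_tiles, d) array of tile embeddings
    # (e.g. pooled Inception v3 features) and `labels` the inherited binary labels.
    features = np.load("tile_features.npy")
    labels = np.load("tile_labels.npy")

    clusters = KMeans(n_clusters=20, random_state=0).fit_predict(features)

    # Down-weight tiles whose inherited label disagrees with their cluster:
    # a positive-labeled tile in a cluster with almost no positive tiles is
    # likely "improperly" labeled and gets a small sample weight.
    weights = np.empty(len(labels), dtype=np.float32)
    for c in np.unique(clusters):
        idx = clusters == c
        pos_ratio = labels[idx].mean()
        weights[idx] = np.where(labels[idx] == 1, pos_ratio, 1.0 - pos_ratio)

    # `weights` could then be passed as sample_weight during training.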

Why does googlenet (inception) work well on the ImageNet dataset?

Some people say that the reason Inception works well on the ImageNet dataset is that the original ImageNet images have different resolutions and are resized to the same size when they are used, so Inception, which can deal with different resolutions, is very suitable for ImageNet. Is this description true? Can anyone give a more detailed explanation? I am really confused by this. Thanks so much!
First of all, deep convolutional neural networks receive a fixed input image size (if by size you mean the number of pixels), so all images should have the same size or dimensions, which means the same resolution. On the other hand, if the image resolution is high and has a lot of detail, the results of any network get better. ImageNet images are high-resolution images from Flickr, so resizing them needs little interpolation and the resized images remain in good shape.
Second, the main goal of the Inception module is dimension reduction: with a 1x1 convolution, the kernel extent in the output-dimension calculation is ONE:
output_dim = (input_dim + 2 * pad - kernel_extent) / stride + 1
Inception, or in other words GoogLeNet, is a huge network (more than 100 layers), and it is computationally infeasible for many CPUs or even GPUs to go through all the convolutions, so it needs dimension reduction.
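To make the dimension reduction concrete, here is a small sketch (PyTorch assumed, not from the original answer) showing that a 1x1 convolution shrinks the channel dimension while leaving the spatial size given by the formula above unchanged:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 256, 28, 28)             # (batch, channels, height, width)
    reduce = nn.Conv2d(256, 64, kernel_size=1)  # 1x1 conv: 256 -> 64 channels
    y = reduce(x)

    # output_dim = (input_dim + 2*pad - kernel_extent) / stride + 1
    #            = (28 + 0 - 1) / 1 + 1 = 28, so spatial size stays 28 x 28
    print(y.shape)  # torch.Size([1, 64, 28, 28])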
You can use a deeper AlexNet (with more layers) on the ImageNet dataset and I bet it will give you a good result, but when you want to go deeper than 30 layers you need a good strategy, like Inception. By the way, the ImageNet dataset has over 5 million images (last time I checked); with deep nets, more images == more accuracy.