Why does googlenet (inception) work well on the ImageNet dataset? - deep-learning

Some people say that the reason Inception works well on the ImageNet dataset is this: the original images in ImageNet have different resolutions, and they are resized to the same size before being used, so Inception, which can deal with different resolutions, is particularly well suited to ImageNet. Is this description true? Can anyone give a more detailed explanation? I am really confused about this. Thanks so much!

First of all, deep convolutional neural networks receive a fixed input image size (if by size you mean the number of pixels), so all images must be brought to the same size or dimensions, which means the same resolution. On the other hand, if the image resolution is high and contains a lot of detail, any network's results get better. ImageNet images are high-resolution photos (largely collected from Flickr), so resizing them preserves enough detail and the resized images remain in good shape.
Second, the main goal of the Inception module's 1x1 convolutions is dimension reduction; with a 1x1 convolution, the kernel extent in the output-size calculation is one:
output_dim = (input_dim + 2 * pad_data[i] - kernel_extent) / stride_data[i] + 1;
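As a quick sanity check of that formula, here is a small Python sketch (not part of the original answer; the sizes are just example values):

def conv_output_dim(input_dim, kernel, pad=0, stride=1):
    # Same formula as above: output = (input + 2*pad - kernel) / stride + 1
    return (input_dim + 2 * pad - kernel) // stride + 1

print(conv_output_dim(28, kernel=1))            # 28 -> 28: a 1x1 conv keeps H and W
print(conv_output_dim(28, kernel=3, pad=1))     # 28 -> 28: padded 3x3 conv
print(conv_output_dim(28, kernel=5, stride=2))  # 28 -> 12

So a 1x1 convolution leaves the spatial size unchanged; the reduction it performs is in the channel (depth) dimension, which is exactly what the Inception module uses it for before the expensive 3x3 and 5x5 convolutions.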
Inception, or in other words GoogLeNet, is a huge network (roughly 100 building blocks, 22 layers with trainable parameters), and it would be computationally prohibitive for many CPUs or even GPUs to go through all the convolutions at full depth, so it needs to reduce dimensions.
You could use a deeper AlexNet (with more layers) on the ImageNet dataset and I bet it would give you a good result, but when you want to go deeper than about 30 layers you need a good strategy, like Inception. By the way, ImageNet is huge (over 14 million images in the full dataset; the ILSVRC subset has about 1.2 million training images), and for deep nets, more images generally means more accuracy.

Related

What does "rewritten box" mean during YOLO training on a custom dataset?

I am training a YOLO network on my custom dataset, and during training I am getting info about rewritten boxes in the training log,
for example: total_bbox = 29159, rewritten_bbox = 0.006859 %
What does that mean? Is my training proceeding correctly?
The optimal number of layers and the optimal resolution depend on the dataset.
The smaller the objects, the higher the resolution required.
The larger the objects, the more layers required. There is an article on choosing the optimal number of layers, filters and resolution for the MS COCO dataset: https://arxiv.org/pdf/1911.09070.pdf
It also depends on what accuracy and speed you want. To reduce the rewritten_bbox %, increase the resolution and/or move some masks from the [yolo] layers with low resolution to the [yolo] layers with higher resolution, then retrain. Setting iou_thresh=1 may also reduce the rewritten_bbox %.
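For intuition, here is a simplified Python sketch (not darknet's actual code; the anchor choice is a stand-in) of what a rewritten box is: each ground-truth box is assigned to one grid cell and one anchor in a [yolo] layer, and if two boxes map to the same slot, the later one overwrites the earlier one.

def count_rewritten(boxes, grid_w, grid_h, pick_anchor):
    # boxes: list of (x_center, y_center) in [0, 1); pick_anchor: box -> anchor index
    occupied = set()
    rewritten = 0
    for box in boxes:
        cell = (int(box[0] * grid_w), int(box[1] * grid_h))
        slot = (cell, pick_anchor(box))
        if slot in occupied:
            rewritten += 1      # this box overwrites an earlier assignment
        else:
            occupied.add(slot)
    return rewritten

boxes = [(0.50, 0.50), (0.52, 0.50)]                 # two nearby object centers
print(count_rewritten(boxes, 13, 13, lambda b: 0))   # 1: same cell, same anchor
print(count_rewritten(boxes, 52, 52, lambda b: 0))   # 0: a finer grid separates them

This is one way to see why increasing the network resolution (a finer grid) reduces the rewritten_bbox percentage.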

What is a class score map?

I was going through the VGGNet paper and came across the description of the testing phase.
During testing, the test image goes through VGGNet and a class score map is obtained. This class score map is spatially averaged to produce a fixed-size vector.
I have googled "class score map" but couldn't find any relevant results. I wish to know what the role of the class score map is.
Any hint would be greatly helpful. Thanks.
When you train an image recognition model, you train it for a specific image size (and resolution), say n_dims = [256, 256]. In the prediction phase, however, you have images of different sizes (in pixels), e.g. [1024, 1024]. You extract patches (you can resize the image first to lower the resolution) and slide your model over those patches; for each patch you obtain a prediction over all classes (more than one object might be present in a patch), which you then have to average somehow for the whole image at the end.
See OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.
Instead, we explore the entire image by densely running the network at each location and at multiple scales. While the sliding window approach may be computationally prohibitive for certain types of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields significantly more views for voting, which increases robustness while remaining efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.
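To make the spatial averaging concrete, here is a minimal NumPy sketch (with a made-up score map; this is not the VGG authors' code): a fully convolutional pass over a larger test image yields a class score map of shape (height, width, classes), and averaging over the spatial dimensions produces the fixed-size class score vector.

import numpy as np

# Hypothetical class score map from running the net (FC layers converted to
# convolutions) densely over a test image larger than the training crop.
score_map = np.random.rand(8, 8, 1000)          # (height, width, num_classes)

# Spatial averaging ("sum-pooling") described in the VGG paper: one score
# vector of length num_classes, regardless of the test image size.
class_scores = score_map.mean(axis=(0, 1))      # shape: (1000,)
predicted_class = class_scores.argmax()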

What does Nvidia mean when they say samples per pixel in regards to DLSS?

"NVIDIA researchers have successfully trained a neural network to find
these jagged edges and perform high-quality anti-aliasing by
determining the best color for each pixel, and then apply proper
colors to create smoother edges and improve image quality. This
technique is known as Deep Learning Super Sample (DLSS). DLSS is like
an “Ultra AA” mode-- it provides the highest quality anti-aliasing
with fewer artifacts than other types of anti-aliasing.
DLSS requires a training set of full resolution frames of the aliased
images that use one sample per pixel to act as a baseline for
training. Another full resolution set of frames with at least 64
samples per pixel acts as the reference that DLSS aims to achieve."
https://developer.nvidia.com/rtx/ngx
At first I thought of "sample" as it is used in graphics, an intersection of a channel and a pixel. But that really doesn't make any sense in this context: going from 1 channel to 64 channels?
So I am thinking it is "sample" as in the statistics term, but I don't understand how a static image could come up with 64 variations to compare to. Even going from FHD to 4K UHD is only 4 times the number of pixels. Trying to parse that second paragraph, I really can't make any sense of it.
Maybe 16 bits × RGBA equals 64 samples per pixel? They say "at least", so higher accuracy could take as much as 32 bits × RGBA, or 128 samples per pixel, for doubles.

How to format the image data for training/prediction when images are different in size?

I am trying to train my model which classifies images.
The problem I have is that they have different sizes. How should I format my images and/or my model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights, and that's not possible.
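To see the problem in numbers, here is a small back-of-the-envelope sketch (the total stride of 32, the 512 channels and the 4096 units are assumed VGG-style values, not taken from the question): the weight count of the first fully connected layer is tied to the flattened size of the last feature map, which changes with the input resolution.

def fc_weight_count(input_hw, total_stride=32, channels=512, fc_units=4096):
    # Size of the last feature map for a conv stack with the given total stride;
    # the first FC layer needs one weight per (feature map element, FC unit) pair.
    h, w = input_hw[0] // total_stride, input_hw[1] // total_stride
    return h * w * channels * fc_units

print(fc_weight_count((224, 224)))   # 102,760,448 weights
print(fc_weight_count((320, 320)))   # 209,715,200 weights: a different layer entirely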
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective mean anything for the content anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take away the bigger work.
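For example, a minimal sketch of the crop-or-pad route (using the TF 1.x name mentioned above; in newer TensorFlow the function is tf.image.resize_with_crop_or_pad, and the target size is just an example):

import tensorflow as tf

def to_fixed_size(image, target_height=224, target_width=224):
    # Zero-pads images that are smaller than the target and center-crops
    # images that are larger, so every image ends up at the fixed size the
    # fully connected layers expect.
    return tf.image.resize_image_with_crop_or_pad(image, target_height, target_width)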
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer and put it after your last convolutional layer so that the FC layers always get constant-dimensional vectors as input. During training, train on images from the entire dataset using one particular image size for an epoch, then switch to a different image size for the next epoch and continue training.
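For illustration, here is a rough NumPy sketch of the idea (max pooling over a 1x1/2x2/4x4 pyramid; a simplified stand-in, not the paper's implementation): the output length depends only on the channel count and the pyramid levels, never on the spatial size of the feature map.

import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    # feature_map: (height, width, channels) output of the last conv layer.
    h, w, c = feature_map.shape
    pooled = []
    for n in levels:                      # pool into an n x n grid of bins
        for i in range(n):
            for j in range(n):
                y0, y1 = (i * h) // n, ((i + 1) * h) // n
                x0, x1 = (j * w) // n, ((j + 1) * w) // n
                pooled.append(feature_map[y0:y1, x0:x1, :].max(axis=(0, 1)))
    return np.concatenate(pooled)         # length = channels * sum(n*n for n in levels)

print(spp(np.random.rand(13, 17, 256)).shape)   # (5376,) for a 13x17 map
print(spp(np.random.rand(9, 11, 256)).shape)    # (5376,) for a 9x11 map too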

Parameter fine tuning for training Alexnet with smaller image size

AlexNet is intended to be used with a 227x227x3 image size.
If I would like to train with a smaller image size, such as 32x80x3, what parameters need to be fine-tuned?
I initially trained with 64x80x3 images with all parameters as provided, except the stride in the first Conv1 layer, which was changed to 2.
I achieved very high testing accuracy, as high as 0.999. In real use I also get reasonably high detection accuracy.
Then I wanted to use the smaller image size 32x80x3. I used the same parameters as when training with the 64x80x3 image size, but the accuracy is as low as 0.9671.
I tried fine-tuning parameters such as the Conv1 layer's filter size (set to 5) and the Gaussian weight filler's std (10 times and 100 times smaller), but none of them reached the accuracy achieved when training on 64x80x3 images.
For training with smaller image sizes, what parameters should be fine-tuned to achieve higher accuracy?
I used a dataset of 24000 images: 20000 for training and 4000 for testing.
For both 32x80x3 and 64x80x3 I used the same images, just resized to 32x80 and 64x80.
Maybe you can resize the 32x80x3 images to 64x80x3 and then use similar parameter settings.
Also, maybe you can find something useful here: https://github.com/BVLC/caffe/tree/master/examples/cifar10.
There are solver and train_val files there for fine-tuning on CIFAR-10, which is a dataset consisting of small images too.
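If you want to try the resize route suggested above, here is a minimal sketch (OpenCV is just one convenient choice and the file name is hypothetical; a Caffe data transformer could do the same job): upscale the 32x80x3 images to 64x80x3 so the network and solver settings that worked for 64x80 can be reused unchanged.

import cv2

img = cv2.imread("example.png")          # hypothetical 32x80x3 training image
img_resized = cv2.resize(img, (80, 64),  # cv2.resize takes (width, height)
                         interpolation=cv2.INTER_LINEAR)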