Why SSD training doesn't give accurate results - deep-learning

I am trying to detect human heads, including the neck and shoulders, using the single-shot multibox detector (SSD).
I have about 800 images and trained for 50,000 iterations.
But detection_eval reaches a maximum of only 0.29 during training.
In deployment, there are no accurate detections.
What could be the issue?
My rectangles are around 40 x 40 pixels.
One of the images tested is attached.
Is the number of training images too small, or is the object size too small for SSD?
Since my PC is not powerful, I used a batch size of only 3; could that be an issue?
The training loss does converge, from about 60 down to ~0.9 over the 50,000 iterations.

The issues I found: my training labels for xmin, ymin, xmax, ymax were normalized floating-point values, but SSD needs integer (non-normalized) pixel coordinates. Also, each feature map's prior boxes have min and max sizes, e.g. conv4_3_norm_mbox_priorbox has min_size: 30.0 and max_size: 60.0, and these are calculated from min_dim = 300. If you change min_dim, all the min/max sizes change, and if the resulting sizes don't fit the objects in your training data, detection_eval stays low.
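
For reference, here is roughly how the stock Caffe SSD300 script (ssd_pascal.py) derives those min/max sizes from min_dim; the layer count and ratios below are the SSD300 defaults, so treat this as a sketch of the calculation rather than your exact training script:

import math

# Sketch of how the original Caffe SSD script (ssd_pascal.py) derives
# prior-box sizes from min_dim; the values assume min_dim = 300 and the
# six stock mbox source layers of SSD300.
min_dim = 300
num_source_layers = 6          # conv4_3, fc7, conv6_2, conv7_2, conv8_2, conv9_2
min_ratio, max_ratio = 20, 90  # percent of min_dim

step = int(math.floor((max_ratio - min_ratio) / (num_source_layers - 2)))
min_sizes, max_sizes = [], []
for ratio in range(min_ratio, max_ratio + 1, step):
    min_sizes.append(min_dim * ratio / 100.0)
    max_sizes.append(min_dim * (ratio + step) / 100.0)

# conv4_3 gets an extra, smaller prior: 10% / 20% of min_dim
min_sizes = [min_dim * 10 / 100.0] + min_sizes   # [30, 60, 111, 162, 213, 264]
max_sizes = [min_dim * 20 / 100.0] + max_sizes   # [60, 111, 162, 213, 264, 315]

With ~40 x 40 pixel heads and min_dim = 300, only the conv4_3 priors (30-60 pixels) are in the right size range, which is why a mismatch between these sizes and your ground-truth boxes shows up directly as a low detection_eval.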

Related

Calculate loss (MSE) on neural network output with different measurement units

I am training a neural network with PyTorch. The input/features are 6-dimensional and all share the same unit of measurement. The outputs/labels are 7-dimensional with 2 different units of measurement (in my case mm and deg). How can I compute an MSE loss on this output? The problem is that 1 deg doesn't correspond to 1 mm, so even if I normalize my data, the network cannot differentiate between the units and treats degrees like millimetres. I know that some people use empirical weight factors to scale the different labels to get better results. Is there a better way?
I know I could simply train two different networks, one with its output in mm and the other in deg.
I am thankful for any informed advice!
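
As a concrete form of the empirical-weighting idea mentioned above, a per-dimension weighted MSE in PyTorch could look like the sketch below. The 5 mm / 2 deg split and the weight values are placeholders, since the question doesn't say which of the 7 outputs use which unit:

import torch

def weighted_mse(pred, target, weights):
    # MSE where each of the 7 output dimensions is scaled by its own weight,
    # e.g. larger weights for the deg outputs so that an error of 1 deg costs
    # as much (relative to 1 mm) as you decide it should.
    return (weights * (pred - target) ** 2).mean()

# hypothetical split: first 5 outputs in mm, last 2 in deg
weights = torch.tensor([1.0] * 5 + [10.0] * 2)

pred = torch.randn(32, 7, requires_grad=True)   # stand-in for the network output
target = torch.randn(32, 7)                     # stand-in for the labels
loss = weighted_mse(pred, target, weights)
loss.backward()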

How to determine steps_per_epoch when using image augmentation, since it increases the number of images

model.fit_generator(datagen.flow(X_train, y_train, batch_size=32), epochs=10, steps_per_epoch=5000, validation_data=(X_test, y_test))
My total data size is 5000 and the batch size is 32. How do I determine the value for steps_per_epoch?
Case 1: when not using image augmentation.
Case 2: when using image augmentation (because the number of images will increase, and how do I account for that in steps_per_epoch?).
steps_per_epoch will be the total number of samples in your training set (before augmentation) divided by the batch size. So:
steps_per_epoch = 5000 / 32 ≈ 156
Using data augmentation does not affect this calculation. You can also get more info about working with this parameter, as well as fit_generator(), in my video on training a CNN with Keras; the steps_per_epoch coverage starts around 4:08.
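
If it helps, the calculation can be written directly into the fit call, reusing model, datagen, X_train, etc. from the snippet in the question, so it stays correct if the dataset size changes; whether you use math.ceil or integer division only decides if the last partial batch is seen each epoch:

import math

batch_size = 32
steps_per_epoch = math.ceil(len(X_train) / batch_size)   # 5000 / 32 -> 157 (156 with //)

model.fit_generator(
    datagen.flow(X_train, y_train, batch_size=batch_size),
    epochs=10,
    steps_per_epoch=steps_per_epoch,   # same value with or without augmentation
    validation_data=(X_test, y_test),
)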

How to train the RPN in Faster R-CNN?

Link to paper
I'm trying to understand the region proposal network (RPN) in Faster R-CNN. I understand what it does, but I still don't understand exactly how training works, especially the details.
Let's assume we're using VGG16's last conv layer with shape 14x14x512 (before max-pooling, with 224x224 input images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding-box coordinates. My intermediate layer is a 512-dimensional vector.
(The figure in the paper shows 256-d because it illustrates the ZF network.)
In the paper they write
"we randomly sample 256 anchors in an image to compute the loss
function of a mini-batch, where the sampled positive and negative
anchors have a ratio of up to 1:1"
That's the part I'm not sure about. Does this mean that for each of the 9 (= k) anchor types, that type's particular classifier and regressor are trained with mini-batches that only contain positive and negative anchors of that type?
So that I basically train k different networks with shared weights in the intermediate layer? Each mini-batch would then consist of training data x = the 3x3x512 sliding window over the conv feature map and y = the ground truth for that specific anchor type.
And at inference time I would put them all together.
I appreciate your help.
Not exactly. From what I understand, the RPN predicts W*H*k bounding boxes per feature map; 256 of them are then randomly sampled according to the 1:1 criterion, and those samples are used to compute the loss function for that particular mini-batch. You're still only training one network, not k, since the 256 random samples are not restricted to any particular anchor type.
Disclaimer: I only started learning about CNNs a month ago, so I may not understand what I think I understand.
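
To make the sampling step concrete, here is a rough NumPy sketch of what "randomly sample 256 anchors with up to a 1:1 positive:negative ratio" amounts to. It assumes anchor labels (1 = positive, 0 = negative, -1 = ignored) have already been assigned by the usual IoU rules over all anchors of all k types, and it is an illustration, not the actual py-faster-rcnn code:

import numpy as np

def sample_rpn_minibatch(anchor_labels, batch_size=256, positive_fraction=0.5):
    # anchor_labels has shape (num_anchors,) over ALL anchor types at ALL
    # positions, so the 256 samples are drawn from one pool rather than from
    # k per-type pools.
    pos_idx = np.flatnonzero(anchor_labels == 1)
    neg_idx = np.flatnonzero(anchor_labels == 0)

    num_pos = min(len(pos_idx), int(batch_size * positive_fraction))
    num_neg = min(len(neg_idx), batch_size - num_pos)   # pad with negatives if positives are scarce

    pos_sample = np.random.choice(pos_idx, num_pos, replace=False)
    neg_sample = np.random.choice(neg_idx, num_neg, replace=False)
    return pos_sample, neg_sample   # indices whose cls/reg losses are averaged for this image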

How to analyse and interpret features learned by a RBM?

I've trained a single-layer, 100-hidden-unit RBM with binary input units and ReLU activations on the hidden layer. Using a training set of 50k MNIST images, I end up with ~5% RMSE on the 10k-image test set after 500 epochs of full-batch training with momentum and an L1 weight penalty.
Looking at the visualisation below, it is clear that there are big differences between the hidden units: some appear to have converged to a very well-defined response pattern, while others are indistinguishable from noise.
My question is: how would you interpret this apparent variety, and what technique could help achieve a more balanced result? Does a situation like this call for more regularization, slower learning, longer learning, or something else?
Raw weights of the 100 hidden units, reshaped into the input image size.
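
For context, a visualisation like this is usually produced by reshaping each hidden unit's weight vector back to the image shape; a minimal matplotlib sketch, assuming a trained weight matrix W of shape (784, 100) for 28 x 28 MNIST inputs:

import numpy as np
import matplotlib.pyplot as plt

# W: (num_visible, num_hidden) weight matrix of the trained RBM;
# the random values here are only a placeholder for your trained weights.
W = np.random.randn(784, 100)

fig, axes = plt.subplots(10, 10, figsize=(10, 10))
for unit, ax in enumerate(axes.flat):
    # each column of W is one hidden unit's "receptive field"
    ax.imshow(W[:, unit].reshape(28, 28), cmap="gray")
    ax.axis("off")
plt.tight_layout()
plt.show()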

Low validation accuracy with ResNet-50

https://github.com/slavaglaps/ResNet_cifar10/blob/master/resnet.ipynb
This is my model, trained for 100 epochs.
Accuracy of similar models on similar data reaches 90%.
What is my problem?
I think it's worth reducing the learning rate as the epochs progress.
What do you think could help me?
There are a few subtle differences.
You are trying to apply an ImageNet-style architecture to CIFAR-10. In the CIFAR version, the first convolution is 3 x 3, not 7 x 7, and there is no max-pooling layer; the image is downsampled purely by stride-2 convolutions.
You should probably also do mean-centering by setting featurewise_center = True in ImageDataGenerator.
Do not use a very high number of filters such as [512, 1024, 2048]. There are only 50,000 images to train on, unlike ImageNet, which has about a million.
In short, read section 4.2 of the deep residual learning paper and try to replicate that network. You may also read this blog.
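
To illustrate those points, the entry and downsampling pattern of a CIFAR-style network (3 x 3 first convolution, no max pooling, downsampling only via stride-2 convolutions, modest filter counts) might look roughly like this in Keras. Residual blocks are omitted for brevity, so this is a sketch of the section 4.2 design choices rather than a drop-in fix for the notebook linked above:

from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# CIFAR-10 inputs are 32 x 32 x 3
inputs = layers.Input(shape=(32, 32, 3))

# 3 x 3 stem; no 7 x 7 convolution and no max pooling as in the ImageNet ResNet
x = layers.Conv2D(16, 3, padding="same", use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)

# downsample only with stride-2 convolutions between stages, and keep the
# filter counts small (16 -> 32 -> 64) instead of 512/1024/2048
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)

x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = models.Model(inputs, outputs)

# mean-centering as suggested above; call datagen.fit(x_train) before training
datagen = ImageDataGenerator(featurewise_center=True)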