What is the concept of mini-batch for FCN (semantic segmentation)?

What is the concept of mini-batch when we are sending one image to FCN for semantic segmentation?
The default value in the data layers is batch_size: 1. That means that on every forward and backward pass, one image is sent to the network. So what is the mini-batch size in that case? Is it the number of pixels in the image?
The other question is: what if we send a few images together to the net? Does it affect convergence? In some papers I see a figure of 20 images.
Thanks

The batch size is the number of images sent through the network in a single training step. The gradient is calculated for all the samples in one sweep, which yields large performance gains through parallelism when training on a graphics card or a CPU cluster.
The batch size has multiple effects on training. First, it provides more stable gradient updates by averaging the gradients within the batch. This can be both beneficial and detrimental; in my experience it was more beneficial than detrimental, but others have reported different results.
To exploit parallelism, the batch size is usually a power of 2, so 8, 16, 32, 64 or 128. Finally, the batch size is limited by the VRAM of the graphics card: the card needs to store all the images, the activations at every node of the graph, and additionally all the gradients.
This can blow up very fast. In that case you need to reduce the batch size or the network size.
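To make the terminology concrete, here is a minimal PyTorch-style sketch (purely illustrative, not the Caffe setup from the question) of one training step with a mini-batch of 4 images. Note that even though the per-pixel losses get averaged into one scalar, the batch size is still counted in images, not pixels:

    import torch
    import torch.nn as nn
    from torchvision.models.segmentation import fcn_resnet50

    NUM_CLASSES, BATCH = 21, 4                     # BATCH images per forward/backward pass
    model = fcn_resnet50(num_classes=NUM_CLASSES)  # untrained FCN, used only to show shapes
    criterion = nn.CrossEntropyLoss()              # averages the per-pixel loss by default
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    images = torch.randn(BATCH, 3, 224, 224)                   # one mini-batch of 4 RGB images
    masks = torch.randint(0, NUM_CLASSES, (BATCH, 224, 224))   # one class label per pixel

    logits = model(images)["out"]                  # shape (4, NUM_CLASSES, 224, 224)
    loss = criterion(logits, masks)                # one scalar averaged over 4 * 224 * 224 pixels
    loss.backward()                                # a single gradient for the whole mini-batch
    optimizer.step()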

Related

Is there an actual minimum input image size for popular computer vision models? (E.g., VGG, ResNet, etc.)

According to the documentation on pre-trained computer vision models for transfer learning (e.g., here), input images should come in "mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224".
However, when running transfer learning experiments on 3-channel images with height and width smaller than expected (e.g., smaller than 224), the networks generally run smoothly and often achieve decent performance.
Hence, it seems to me that the "minimum height and width" is somehow a convention and not a critical parameter. Am I missing something here?
There is a limit on your input size which corresponds to the receptive field of the last convolutional layer of your network. Intuitively, the spatial dimensionality decreases as you progress through the network, at least for feature-extractor CNNs that aim at producing feature embeddings from the input image; that is, most pre-trained models such as vanilla VGG and ResNet do not retain spatial dimensionality. If the input of a convolutional layer is smaller than the kernel size (even when padded), then you simply won't be able to perform the operation.
TLDR: adaptive pooling layer
For example, the standard resnet50 model accepts input only in the range 193-225, and this is due to the architecture and the downscaling layers (see the links below).
The only reason the default PyTorch model works is that it uses an adaptive pooling layer, which does not restrict the input size. So it will work, but you should be ready for performance decay and other fun things :)
Hope you will find it useful:
https://discuss.pytorch.org/t/how-can-torchvison-models-deal-with-image-whose-size-is-not-224-224/51077/3
What is Adaptive average pooling and How does it work?
https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L118
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L151
https://developpaper.com/pytorch-implementation-examples-of-resnet50-resnet101-and-resnet152/
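A quick way to check the effect of the adaptive pooling layer yourself (a small illustrative experiment, not part of the answer's links): feed a default torchvision resnet50 inputs smaller than 224 and watch the output shape stay the same:

    import torch
    from torchvision import models

    model = models.resnet50()           # default torchvision architecture, untrained weights
    model.eval()

    with torch.no_grad():
        for size in (224, 160, 97):     # heights/widths at and below the documented 224 minimum
            out = model(torch.randn(1, 3, size, size))
            print(size, out.shape)      # torch.Size([1, 1000]) for every size

The AdaptiveAvgPool2d layer right before the fully connected head squeezes whatever spatial size arrives into 1 x 1, which is why all the sizes above run; whether the learned features are still meaningful at those scales is a separate question (the "performance decay" mentioned above).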

How to choose the hyperparameters and strategy for a neural network with a small dataset?

I'm currently doing semantic segmentation, but I have a really small dataset: only around 700 images, which data augmentation (for example, flipping) could bring to 2100 images. I'm not sure whether that's enough for my task (semantic segmentation with four classes).
I want to use batch normalization and mini-batch gradient descent. What really makes me scratch my head is that if the batch size is too small, batch normalization doesn't work well, but with a larger batch size it seems equivalent to full-batch gradient descent.
I wonder if there is something like a standard ratio between the number of samples and the batch size?
Let me first address the second part of your question, the "strategy for a neural network with a small dataset". You may want to take a network pretrained on a larger dataset and fine-tune it using your smaller dataset. See, for example, this tutorial.
Second, you ask about the size of the batch. Indeed, a smaller batch will make the algorithm wander around the optimum, as in classical stochastic gradient descent; the sign of this is noisy fluctuations of your losses. With a larger batch size there is typically a smoother trajectory towards the optimum. In any case, I suggest that you use an algorithm with momentum such as Adam, which will aid the convergence of your training.
Heuristically, the batch size is kept as large as your GPU memory can fit; if the GPU memory is not sufficient, the batch size is reduced.
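A minimal sketch of the fine-tuning suggestion for the four-class segmentation task, using torchvision's FCN purely as an example (depending on your torchvision version the pretrained weights are requested with weights=... instead of pretrained=True, and the layer indices below match torchvision's FCNHead, so check them against your own model):

    import torch
    import torch.nn as nn
    from torchvision.models.segmentation import fcn_resnet50

    NUM_CLASSES = 4

    # Load a model pretrained on a large dataset, then swap the head for the 4 classes.
    model = fcn_resnet50(pretrained=True)
    model.classifier[4] = nn.Conv2d(512, NUM_CLASSES, kernel_size=1)       # new 4-class head
    model.aux_classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)   # auxiliary head

    # Optionally freeze the backbone so only the new head is trained at first.
    for p in model.backbone.parameters():
        p.requires_grad = False

    # Adam, as suggested above, on the trainable parameters only.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )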

caffe - how to properly train alexnet with only 7 classes

I have a small dataset collected from ImageNet (7 classes, each with 1000 training images). I try to train it with the AlexNet model, but somehow the accuracy just can't go any higher (about 68% maximum). I removed the conv4 and conv5 layers to prevent the model from overfitting and also decreased the number of neurons in each layer (conv and fc). Here is my setup.
Did I do anything wrong that makes the accuracy so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a pseudo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
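For the kernel visualizations, pycaffe exposes the learned weights directly; a minimal sketch of dumping the first convolutional layer's filters as an image grid might look like the following (the prototxt/caffemodel paths and the layer name "conv1" are placeholders for your own files, and the layer is assumed to see 3-channel input):

    import numpy as np
    import matplotlib.pyplot as plt
    import caffe

    # Paths and the layer name are illustrative; substitute your own deploy/snapshot files.
    net = caffe.Net("deploy.prototxt", "snapshot_iter_10000.caffemodel", caffe.TEST)

    filters = net.params["conv1"][0].data            # shape: (num_kernels, 3, h, w)
    filters = (filters - filters.min()) / (filters.max() - filters.min())  # rescale to [0, 1]

    cols = 8
    rows = int(np.ceil(filters.shape[0] / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax in axes.flat:
        ax.axis("off")
    for ax, f in zip(axes.flat, filters):
        ax.imshow(f.transpose(1, 2, 0))              # (h, w, channels) for display
    plt.savefig("conv1_filters.png")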
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.

User defined transformation in CNTK

Problem setting
I have a dataset with N images.
A certain network (e.g., AlexNet) has to be trained from scratch on this dataset.
For each image, 10 augmented versions are to be produced. These augmentations involve resizing, cropping and flipping. For example, an image has to be resized so that its minimum dimension is 256 pixels, and then a random 224 x 224 crop of it is taken; that crop also has to be flipped. Five such random crops have to be taken and their flipped versions prepared as well.
Those augmented versions have to go into the network for training instead of the original image.
What would additionally be very beneficial is if multiple images in the dataset were augmented in parallel and put in a queue or some other container, from which a batch-size number of samples is pushed to the GPU for training.
The reason is that ideally we would not like multiple augmented versions of the same image going into the network for training simultaneously.
Context
This is not an arbitrary requirement. There are papers, such as OverFeat, that use exactly these augmentations. Moreover, such randomized training can be a very good way to improve the training of the network.
My understanding
From my searching so far, I could not find anything inside CNTK that can do this.
Questions
Is this possible to achieve in CNTK?
Please take a look at the CNTK 201 tutorial:
https://github.com/Microsoft/CNTK/blob/penhe/reasonet_tutorial/Tutorials/CNTK_201B_CIFAR-10_ImageHandsOn.ipynb
The image reader has built-in transforms that address many of your requirements. Unfortunately, they do not run on the GPU.
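If the built-in transforms don't cover the exact OverFeat-style scheme, the augmentation itself is easy to produce outside the reader. A hedged sketch with Pillow (the function and its parameters are illustrative, not CNTK API):

    import random
    from PIL import Image, ImageOps

    def augment(path, n_crops=5, min_side=256, crop=224):
        """Resize so the shorter side is min_side, then return random crops plus their mirrors."""
        img = Image.open(path).convert("RGB")
        scale = min_side / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))

        outputs = []
        for _ in range(n_crops):
            x = random.randint(0, img.width - crop)
            y = random.randint(0, img.height - crop)
            patch = img.crop((x, y, x + crop, y + crop))
            outputs.append(patch)                     # random 224 x 224 crop
            outputs.append(ImageOps.mirror(patch))    # its horizontally flipped version
        return outputs                                # 10 augmented versions per image

Interleaving the outputs of different images before batching then keeps multiple crops of the same image out of a single mini-batch, as the question requires.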

How to increase validation accuracy with deep neural net?

I am trying to build an 11-class image classifier with 13000 training images and 3000 validation images. I am using a deep neural network trained with mxnet. Training accuracy is increasing and has reached above 80%, but validation accuracy stays in the range of 54-57% and is not increasing.
What can the issue be here? Should I increase the number of images?
The issue here is that your network stops learning useful general features at some point and starts adapting to peculiarities of your training set (overfitting it as a result). You want to 'force' your network to keep learning useful features, and you have a few options here:
Use weight regularization. It tries to keep the weights small, which very often leads to better generalization. Experiment with different regularization coefficients: try 0.1, 0.01, 0.001 and see what impact they have on accuracy (see the sketch after this list).
Corrupt your input (e.g., randomly substitute some pixels with black or white). This way you remove information from your input and 'force' the network to pick up on important general features. Experiment with the noising coefficient, which determines how much of your input should be corrupted. Research shows that anything in the range of 15%-45% works well.
Expand your training set. Since you're dealing with images, you can expand your set by rotating, scaling, etc. your existing images (as suggested). You could also experiment with pre-processing your images (e.g., mapping them to black and white or grayscale), but the effectiveness of this will depend on your exact images and classes.
Pre-train your layers with a denoising criterion. Here you pre-train each layer of your network individually before fine-tuning the entire network. Pre-training 'forces' layers to pick up on important general features that are useful for reconstructing the input signal. Look into auto-encoders, for example (they've been applied to image classification in the past).
Experiment with the network architecture. Your network might not have sufficient learning capacity. Experiment with different neuron types, numbers of layers, and numbers of hidden neurons. Make sure to try compressing architectures (fewer neurons than inputs) and sparse architectures (more neurons than inputs).
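The weight-regularization and input-corruption suggestions translate to only a few lines. The sketch below uses PyTorch purely for illustration (mxnet's optimizers expose an equivalent weight-decay setting); model and loader are placeholders for your own network and data, and the images are assumed to be scaled to [0, 1]:

    import torch
    import torch.nn as nn

    def corrupt(images, frac=0.25):
        """Randomly overwrite a fraction of pixels with black or white (input corruption)."""
        mask = torch.rand_like(images) < frac
        noise = (torch.rand_like(images) < 0.5).float()   # black (0.0) or white (1.0)
        return torch.where(mask, noise, images)

    # weight_decay adds the L2 weight penalty; try 0.1, 0.01, 0.001 as suggested above.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-3)

    for images, labels in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(corrupt(images)), labels)
        loss.backward()
        optimizer.step()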
Unfortunately, the process of training a network that generalizes well involves a lot of experimentation and an almost brute-force exploration of the parameter space with a bit of human supervision (you'll see many research works employing this approach). It's good to try 3-5 values for each parameter and see whether it leads you somewhere.
When you experiment, plot accuracy / cost / F1 as a function of the number of iterations and see how it behaves. Often you'll notice a peak in accuracy on your test set, followed by a continuous drop. So apart from a good architecture, regularization, corruption, etc., you're also looking for the number of iterations that yields the best results.
One more hint: make sure each training epoch randomizes the order of the images.
This clearly looks like a case where the model is overfitting the training set, as the validation accuracy improved step by step until it got stuck at a particular value. If the learning rate were a bit higher, you would have ended up seeing the validation accuracy decrease while the training accuracy kept increasing.
Increasing the size of the training set is the best solution to this problem. You could also try applying different transformations (flipping, cropping random portions from a slightly bigger image) to the existing image set and see whether the model learns better.