Mini-batch performs worse than batch gradient descent? - deep-learning

I am able to get pretty good results from batch gradient descent (batch size 37000), but when I try out mini-batch gradient descent I get very poor results (even with Adam and dropout).
With batch GD I'm able to get 100% train and 97% dev/CV accuracy.
With a mini-batch size of 128 I'm getting only around 88% accuracy in both.
The training loss hovers around 1.6 and doesn't decrease with further iterations, but it slowly decreases when I increase the batch size (hence improving accuracy), and eventually I arrive at a batch size of 37000 for maximum accuracy.
I tried tweaking alpha but still get the same accuracy.
I'm training on the MNIST digits dataset.
What could be the reason? Please help.

In batch gradient descent, all of the training data is taken into consideration before taking a single step. In mini-batch gradient descent you consider only a subset of the data before taking a step, so the model update frequency is higher than in batch gradient descent.
But mini-batch gradient descent comes with trade-offs:
Firstly, mini-batches turn some learning problems from technically intractable into tractable, because of the reduced computation demanded by a smaller batch size.
Secondly, a reduced batch size does not necessarily mean reduced gradient accuracy: the training samples may contain a lot of noise, outliers, or bias.
I believe that because of the oscillations in mini-batch training you might have fallen into a local minimum. Try increasing the learning rate with mini-batches; it may solve the problem. Also try normalizing the images; that may help too.
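To make the contrast concrete, here is a minimal NumPy sketch on a toy least-squares problem (not the asker's code; all names and values are illustrative). Full-batch GD makes one accurate update per pass over the data, while mini-batch GD makes many noisier updates per pass:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the (mini-)batch (Xb, yb)
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: one accurate update per pass over the data
w_full = np.zeros(10)
for _ in range(100):
    w_full -= 0.1 * grad(w_full, X, y)

# Mini-batch gradient descent: many noisier updates per pass (epoch)
w_mini = np.zeros(10)
batch_size = 128
for _ in range(100):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        w_mini -= 0.1 * grad(w_mini, X[idx], y[idx])

print(np.linalg.norm(w_full - true_w), np.linalg.norm(w_mini - true_w))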

I found the solution.
The lmbda value I used for batch GD (i.e. 10) seems to be too big for mini-batch GD.
By decreasing it to 0.1, I fixed the problem.
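For anyone hitting the same issue: assuming lmbda is the L2 regularization strength, a hedged Keras-style sketch of how that knob might be exposed looks roughly like this (the layer sizes and the values 10 / 0.1 are illustrative; the asker's actual code isn't shown):

# With mini-batches the weight-decay shrinkage is applied on every update, so a
# lmbda tuned for a single full-batch update can regularize far more aggressively
# per epoch (one plausible reason the value had to drop).
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(l2_lambda):
    return keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(784,),
                     kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(10, activation="softmax",
                     kernel_regularizer=regularizers.l2(l2_lambda)),
    ])

# lmbda = 10 worked with full-batch GD; lmbda = 0.1 worked with batch_size = 128
model = build_model(l2_lambda=0.1)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=20, validation_split=0.1)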

Related

Batch size without batch normalization

I'm working on image super-resolution tasks with EDSR as a baseline model. Following EDSR, I'm not using any batch-norm layers in my model. I suddenly came up with a stupid question about batch sizes.
Currently, I'm training my model with batch_size=32 (as in EDSR). But since I'm not using any batch-normalization technique, I can't see any reason for using batch sizes greater than 1. But I'm not confident in my thoughts, since the authors' implementations use batch sizes greater than 1.
Could someone help me with this? What am I missing?
In Rethinking "Batch" in BatchNorm, research carried out by FAIR, batch normalization and batch size are discussed. The graph below (caption reproduced from the paper) shows the relation between batch normalization and batch size. It shows that when you use a smaller batch size you do not need batch normalization; batch normalization is helpful when you have a bigger batch size, and using a smaller batch size with batch normalization leads to training/testing inconsistency.
Figure caption (from the paper): Classification error under different normalization batch sizes, with a fixed total batch size of 1024. Green: error rate on the unaugmented training set using mini-batch statistics; Red: error rate on the validation set using population statistics estimated by PreciseBN; Blue: error rate on the validation set using mini-batch statistics of random batches (with the same normalization batch size used in training). The gap between the red and blue curves is caused by train-test inconsistency, while the gap between the blue and green curves is the generalization gap on the unseen dataset.
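To see the underlying effect numerically, here is a small NumPy illustration (mine, not from the paper): the per-batch mean that batch normalization uses at training time is a much noisier estimate of the population mean when the normalization batch is tiny, which is the train/test inconsistency described above.

import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the pre-normalization activations of one channel
activations = rng.normal(loc=2.0, scale=3.0, size=100_000)

for norm_batch_size in (2, 8, 32, 1024):
    batches = activations[:norm_batch_size * 64].reshape(64, norm_batch_size)
    batch_means = batches.mean(axis=1)
    # Spread of the per-batch mean around the population mean; it shrinks as the
    # normalization batch grows, so train-time and test-time statistics agree better
    print(norm_batch_size, np.std(batch_means))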

Neural Network: Epoch and Batch Size

I am trying to train a neural network to classify words into different categories.
I notice two things:
When I use a smaller batch_size (like 8, 16, 32) the loss does not decrease, but rather varies sporadically. When I use a larger batch_size (like 128, 256), the loss goes down, but very slowly.
More importantly, when I use a larger EPOCH value, my model does a good job at reducing the loss. However, I'm using a really large value (EPOCHS = 10000).
Question:
How do I choose good EPOCH and batch_size values?
There is no way to decide on these values based on fixed rules; unfortunately, the best choices depend on the problem and the task. However, I can give you some insights.
When you train a network, you calculate a gradient which would reduce the loss. To do that, you need to backpropagate the loss. Ideally, you compute the loss based on all of the samples in your data, because then you consider every sample and come up with a gradient that captures all of them. In practice, this is not possible due to the computational cost of calculating the gradient over all samples, since for every update you would have to compute a forward pass over all of your samples. That case corresponds to batch_size = N, where N is the total number of data points you have.
Therefore, we use a small batch_size as an approximation: instead of considering all the samples, we compute the gradient based on a small set of samples, at the cost of losing some information about the true gradient.
Rule of thumb:
Smaller batch sizes give noisy gradients, but they converge faster because you perform more updates per epoch. If your batch size is 1 you will have N updates per epoch; if it is N, you will have only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly.
That is why for smaller batch sizes you observe varying losses: the gradient is noisy. And for larger batch sizes your gradient is informative, but you need a lot of epochs since you update less frequently.
The ideal batch size is one that gives you informative gradients while being small enough to train the network efficiently. You can only find it by actually trying, for example as in the sketch below.
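One practical way to "try" systematically is to treat EPOCHS as an upper bound and let early stopping on a validation split decide when to stop, while looping over a few candidate batch sizes. A hedged Keras sketch (the asker's model and data aren't shown, so the architecture and the random stand-in data are placeholders):

import numpy as np
from tensorflow import keras

# Random stand-in data; replace with the real word features and labels
x_train = np.random.normal(size=(2000, 300)).astype("float32")
y_train = np.random.randint(0, 10, size=2000)

def build_model():
    return keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(300,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)

for batch_size in (32, 64, 128, 256):
    model = build_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # epochs is only an upper bound here; early stopping picks the real number
    history = model.fit(x_train, y_train, validation_split=0.1,
                        batch_size=batch_size, epochs=500,
                        callbacks=[early_stop], verbose=0)
    print(batch_size, len(history.history["val_loss"]),
          min(history.history["val_loss"]))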

RNN L2 Regularization stops learning

I use a bidirectional RNN to detect an event with unbalanced occurrence. The positive class is 100 times less frequent than the negative class.
With no regularization I can get 100% accuracy on the train set and 30% on the validation set.
When I turn on L2 regularization, the result is only 30% accuracy on the train set too, instead of longer learning and 100% accuracy on the validation set.
I was thinking that maybe my data is too small, so just as an experiment I merged the train set with the test set, which I had not used before. The situation was the same as when using L2 regularization (which I did not use this time): I get 30% accuracy on train+test and on validation.
I use 128 hidden units and 80 timesteps in the mentioned experiments.
When I increased the number of hidden units to 256, I could again overfit the train+test set to get 100% accuracy, but still only 30% on the validation set.
I tried many options for the hyperparameters with almost no result. Maybe the weighted cross-entropy is causing the problem; in the experiments above the weight on the positive class is 5. With larger weights the results are often worse, around 20% accuracy.
I tried LSTM and GRU cells; no difference.
The best results I got: with 2 hidden layers of 256 hidden units, which took around 3 days of computation and 8 GB of GPU memory, I got around 40-50% accuracy before it started overfitting again, with L2 regularization on but not so strong.
Is there some general guideline for what to do in this situation? I was not able to find anything.
Too many hidden units can overfit your model; you can try a smaller number of hidden units. As you mentioned, training with more data might improve performance. If you don't have enough data, you can generate some artificial data: researchers add distortions to their training data to increase its size, but in a controlled way. This strategy works well for image data; if you are dealing with text data, you can probably use a knowledge base to improve performance.
There is a lot of ongoing work on using knowledge bases to solve NLP and deep-learning-related tasks.
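If the inputs are numeric sequences, one controlled-distortion scheme in the spirit of the above is to append jittered copies of the rare positive-class examples. A minimal sketch with stand-in data (the shapes, the noise scale, and the number of copies are illustrative, not recommendations):

import numpy as np

rng = np.random.default_rng(0)

def augment_positives(X, y, copies=5, noise_scale=0.01):
    # Append noisy duplicates of the positive-class sequences to X and y
    pos = X[y == 1]
    jittered = [pos + rng.normal(scale=noise_scale, size=pos.shape)
                for _ in range(copies)]
    X_aug = np.concatenate([X] + jittered, axis=0)
    y_aug = np.concatenate([y] + [np.ones(len(pos), dtype=y.dtype)] * copies)
    return X_aug, y_aug

X = rng.normal(size=(1000, 80, 16))          # 80 timesteps, 16 features (stand-in)
y = (rng.random(1000) < 0.01).astype(int)    # roughly 1% positive class
X_aug, y_aug = augment_positives(X, y)
print(X.shape, X_aug.shape, y_aug.mean())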

Are there any ways to reduce the GPU memory Caffe uses?

I like Caffe, but the amount of GPU memory Caffe uses is larger than MXNet's (I tested ResNet-50 with mxnet-memonger). Are there any ideas, directions, or alternative custom Caffe forks that would help me reduce the amount of GPU memory Caffe uses? Ideas and directions are enough, and I will try to implement them in detail.
Short answer: The most straightforward way to reduce the memory Caffe uses is to reduce the batch size while enabling gradient accumulation to achieve the same effective batch size, which you can do using the batch_size and iter_size parameters of the solver. For example, say the current batch_size parameter is set to 128 and you wish to cut the memory in half; then you would set, in the solver's prototxt:
batch_size: 64
iter_size: 2
Long answer: what takes up most of the memory in Caffe is not the weights of the layers (those are mostly a fixed cost), but the intermediate computations between the layers, which scale linearly with the batch size. This is why decreasing the batch size decreases memory usage. Of course, just decreasing the batch size hurts performance, because it increases the variance of the gradient estimate.
However, we can decrease the batch size of each forward-backward iteration without affecting the gradient estimate by using gradient accumulation. This means that for each forward-backward step we use a small batch size B, while we only update the weights once every N iterations, accumulating all the gradients since the last update. This gives an effective batch size of N x B.
Lastly, you might wonder whether this method hurts the runtime performance of training. While in theory it could, if the forward-backward step processed each element in the batch in parallel, in practice this is not how Caffe is implemented(*): each element in the batch is processed sequentially for each layer, so the end result has little to no effect on runtime performance.
(*) As a side note, in the past I added support to Caffe for exactly that, and you can actually gain a slight speed-up (~1.5x) during training at the expense of doubling the memory.
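To make the accumulation idea concrete, here is a framework-agnostic Python sketch (not Caffe internals; the function names are placeholders). Each call to grad_fn stands for one forward-backward pass on a small batch of size B, and the weights are only updated every iter_size calls, giving an effective batch size of iter_size x B:

import numpy as np

def train_with_accumulation(grad_fn, apply_update, batches, iter_size):
    accumulated = None
    for step, batch in enumerate(batches, start=1):
        g = grad_fn(batch)                         # forward + backward on a small batch B
        accumulated = g if accumulated is None else accumulated + g
        if step % iter_size == 0:
            apply_update(accumulated / iter_size)  # one update per iter_size x B samples
            accumulated = None

# Toy check on least squares: two batches of 64 act like one batch of 128
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(128, 4)), rng.normal(size=128), np.zeros(4)

def grad_fn(idx):
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

def apply_update(g):
    np.subtract(w, 0.1 * g, out=w)                 # in-place update of the shared weights

train_with_accumulation(grad_fn, apply_update,
                        [np.arange(0, 64), np.arange(64, 128)], iter_size=2)
print(w)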
facebook-caffe seems like what you want? It
optimizes memory usage by automatically reusing the intermediate activations when this is safe. This reduces the amount of memory required for intermediate activations by around 50% for AlexNet-style models, and around 75% for GoogLeNet-style models.
By the way, I haven't tried it myself.

Semantic Segmentation using deep learning

I have a 512x512 image and I want to perform per-pixel classification of that image. I have already trained the model using 80x80 patches, so at test time I have 512x512 = 262144 patches, each of size 80x80, and this classification is too slow. How can I improve the testing time? Please help me out.
I might be wrong, but there are not a lot of solutions to speed up the testing phase; the main one is to reduce the number of neurons in the NN in order to reduce the number of operations:
80x80 patches are really big; you may want to reduce their size and retrain your NN. That alone will greatly reduce the number of neurons.
Analyze the NN weights/inputs/outputs to detect neurons that do not matter in your NN. They may, for example, always return 0, in which case they can be deleted and you retrain the NN with the simplified architecture.
If you have not done so already, it is much faster to feed the network a batch of patches (the bigger the better) instead of one patch at a time; a rough sketch is below.
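On that last point, a rough sketch of batched patch inference (the model and the padded input image are placeholders; the exact patch extraction depends on how the borders are padded):

import numpy as np

def iter_patch_batches(image, patch=80, stride=1, batch_size=1024):
    # Yield stacked batches of patch x patch windows so the network sees large
    # batches without materializing all ~262k patches in memory at once
    h, w = image.shape[:2]
    buf = []
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            buf.append(image[r:r + patch, c:c + patch])
            if len(buf) == batch_size:
                yield np.stack(buf)
                buf = []
    if buf:
        yield np.stack(buf)

# Usage (model and padded_image are placeholders):
# preds = np.concatenate([model.predict(batch[..., None], verbose=0)
#                         for batch in iter_patch_batches(padded_image)])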