Is there any ways to reduce the GPU Memory Caffe use? - caffe

I like caffe, but the amount of gpu memory caffe use is larger than mxnet(i test in ResNet-50 with mxnet-memonger). Is there any ideas, directions or alternative custom caffe for me to reduce the amount of gpu memory caffe use. Ideas and directions are enough and i will try to implement it in detail.

Short answer: The most straightforward method to reduce the memory Caffe uses is to reduce the batch size while enabling gradient accumulation to achieve the same effective batch size, which you can do using the batch_size and iter_size parameters of the solver. For example, let's say the current batch_size parameter is set to 128 and you wish to decrease the memory by half, then you would set in the solver's prototxt:
batch_size: 64
iter_size: 2
Long answer: what takes up most of the memory in Caffe are not the weights of the layers (these are mostly fixed cost), but the intermediate computations between the layers, which scale linearly with the batch size. This is why decreasing the batch size will decrease the memory usage. Of course, just decreasing the batch size will hurt performance because it increases the variance of the gradient estimation.
However, we can decrease the batch size of each forward-backward iteration without affecting the gradient estimation by using gradient accumulation. What this means is that for each forward-backward step we use a small batch size B, while we only update the weights once every N iterations and accumulate all the gradients since the last update. This will give us an effective batch size of NxB.
Lastly, you might wonder if using this method will hurt the runtime performance of training a network. While in theory it could hurt performance if the forward-backward step would have processed each element in the batch in parallel, in practice this is not how Caffe is implemented(*), and each element in the batch is processed sequentially for each layer, so the end result has little to no effect on runtime performance.
(*) As a side note, at the past I've added support for Caffe for exactly that, and you can actually gain a slight speed up (~1.5x) during training at the expense of doubling the memory.

facebook-caffe seems like what you want?
which optimizes memory usage by automatically reusing the intermediate activations when this is safe. This reduces the amount of memory required for intermediate activations by around 50% for AlexNet-style models, and around 75% for GoogLeNet-style models.
Oh, by the way I haven't tried it before.

Related

Neural Network : Epoch and Batch Size

I am trying to train a neural network to classify words into different categories.
I notice two things:
When I use a smaller batch_size (like 8,16,32) the loss is not decreasing, but rather sporadically varying. When I use a larger batch_size (like 128, 256), the loss is going going down, but very slowly.
More importantly, when I use a larger EPOCH value, my model does a good job at reducing the loss. However I'm using a really large value (EPOCHS = 10000).
Question:
How to get the optimal EPOCH and batch_size values?
There is no way to decide on these values based on some rules. Unfortunately, the best choices depend on the problem and the task. However, I can give you some insights.
When you train a network, you calculate a gradient which would reduce the loss. In order to do that, you need to backpropagate the loss. Now, ideally, you compute the loss based on all of the samples in your data because then you consider basically every sample and you come up with a gradient that would capture all of your samples. In practice, this is not possible due to the computational complexity of calculating gradient on all samples. Because for every update, you have to compute forward-pass for all your samples. That case would be batch_size = N, where N is the total number of data points you have.
Therefore, we use small batch_size as an approximation! The idea is instead of considering all the samples, we say I compute the gradient based on some small set of samples but the thing is I am losing information regarding the gradient.
Rule of thumb:
Smaller batch sizes give noise gradients but they converge faster because per epoch you have more updates. If your batch size is 1 you will have N updates per epoch. If it is N, you will only have 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient but they convergence slower.
That is the reason why for smaller batch sizes, you observe varying losses because the gradient is noisy. And for larger batch sizes, your gradient is informative but you need a lot of epochs since you update less frequently.
The ideal batch size should be the one that gives you informative gradients but also small enough so that you can train the network efficiently. You can only find it by trying actually.

Mini-batch performs poorly than Batch gradient descent?

I am able to get pretty good results from batch gradient descent(batch size 37000) but when i try out mini-batch gradient descent i get very poor results (even with adam and dropout).
In batch gd i'm able to get 100% train and 97% dev/cv accuracy.
Whereas in mini-batch of size 128 i'm getting only around 88% accuracy in both.
The train loss seems to revolve around 1.6 and doesn't decrease with any further iteration but slowly decreases when i increase the batch size(hence improving accuracy).And eventually i arrive at batch size of 37000 for max accuracy.
I tried tweaking alpha but still same accuracy.
I'm training the mnist digits dataset.
What could be the reason? please help
In Batch Gradient Descent, all the training data is taken into consideration to take a single step. In mini batch gradient descent you consider some of data before taking a single step so the model update frequency is higher than batch gradient descent.
But mini-batch gradient descent comes with a cost:
Firstly, mini-batch makes some learning problems from technically untackleable to be tackleable due to the reduced computation demand with smaller batch size.
Secondly, reduced batch size does not necessarily mean reduced gradient accuracy. The training samples many have lots of noises or outliers or biases.
I believe that because of the oscillations in mini-batch you might fell into a local minima. Try to increase the learning rate with mini-batch it may solve the problem. also try to normalize the pictures it may help too.
I found the solution
The lmbda value i used for batch gd (i.e 10) seems to to be too big for mini batch gd.
And by decreasing it to 0.1 , i fixed the problem.

RNN L2 Regularization stops learning

I use Bidirectional RNN to detect an event of unbalanced occurence. The positive class is 100times less often than the negative class.
While no regularization use I can get 100% accuracy on train set and 30% on validation set.
I turn on l2 regularization and the result is only 30% accuracy on train set too instead of longer learning and 100% accuracy on validation set.
I was thinking that maybe my data is too small so just for experiment I merged train set with test set which I did not use before. Situation was the same as I would use l2 regularization, which I did not now. I get 30% accuracy on train+test and validation.
In use 128hidden units and 80 timesteps in the mentioned experiments
When I increased the number of hidden units to 256 I can again overfit on train+test set to get 100% accuracy but still only 30% on validation set.
I did try so many options for hyperparameters and almost no result. Maybe the weighted cross entropy is causing the problem, in given experiments the weight on positive class is 5. While trying larger weights the results are often worse around 20% of accuracy.
I tried LSTM and GRU cells, no difference.
The best results I got. I tried 2 hidden layers with 256 hidden units, it took around 3 days of computation and 8GB of GPU memory. I got around 40-50% accuracy before it starts overfitting again while l2 regularization was on but no so strong.
Is there some general guideline what to do in this situation? I was not able to find anything.
Too much hidden units can overfit your model. You can try with smaller number of hidden units. As you mentioned, training with more data might improve the performance. If you don't have enough data, you can generate some artificial data. Researchers add distortions to their training data to increase their data size but in a controlled way. This type of strategy is pretty good for image data but certainly if you are dealing with text data, probably you can use some knowledge base that can improve the performance.
There are many works going on using Knowledge-bases to solve NLP and deep learning related tasks.

Implementing large linear regression models using CUDA

For analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11), I have numerous and independent linear regression problems which are probably suitable for analysis on GPUs.
The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.
As far as I know, the Householder QR factorization could be the solution for fitting regression models, however, given that each regression matrix in this particular work could easily approach the size of ~ 10'000x10, even each single regression matrix does not seem to fit in GPU on-chip memory (shared, registers etc.).
Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?
EDIT
Here are more details about the problem:
There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix:10'000x10^6). The algorithm in each iteration should select two columns of this genetic matrix (10'000x2) and also around 6 other variables unrelated to genetic data (age, gender etc) so the final regression model will be dealing with a matrix like the size of 10'000x[2(genetic factors)+6(covariates)+2(intercept&interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times each time with a given pair of genetic factors. Those models passing a predefined statistical threshold should be saved as output.
The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.
I also guess that sticking with CUDA libraries may not be the solution here as this mandates most of the data manipulation to take place on the CPU side and only sending each QR decomposition to GPU?
You whole data matrix (1e4 x 1e6) may be too large to fit in the global memory, while each of your least squares solving (1e4 x 10) may be too small to fully utilize the GPU.
For each least squares problem, you could use cuSolver for QR factorization and triangular solving.
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
For the whole data matrix, if it can not fit into the global memory, you could work on only part of it at a time. For example you could divide the matrix into ten (1e4 x 1e5) blocks, each time you load two of the blocks through PCIe, select all possible two-column combinations from the two blocks respectively, solve the equation and then load another two blocks. Maximize the block size will help you minimize the PCIe data transfer. With proper design, I'm sure the time for PCIe data transfer will be much smaller than solving 1e12 equations. Furthermore you could overlap the data transfer with solver kernel executions.
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/

Why order of dimension makes big difference in performance?

To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is opt to the user, for example, it could mean (width, height) or (rows, cols), which has the meaning reversed.
So I did an experiment with the CUDA sample in the SDK: 3_Imaging/convolutionSeparable, simply exchage .x and .y in the kernel function, and reverse the dimensions of blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same, however, the performance decreases, the original one takes about 26ms, while the modified one takes about 40ms on my machine(SM 3.0).
My question is, what makes the difference? is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, no convolutionColumns
EDIT: The change can be found here.
There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to the main memory so the
access is as not coalesced as in the original case.
You should think about the GPU main memory in the same way as it was
a "CPU" memory, i.e., prefetching, blocking, sequential accesses...
techniques to applies in order to get performance.
If you want to know more about this topic, it is mandatory to read
this paper. What every programmer should know about memory.
You'll find an example a comparison between row and column major
access to the elements of a matrix there.
To get and idea on how important this is, think that most -if not
all- GPU high performance codes perform a matrix transposition
before any computation in order to achieve a more coalesced memory
access, and still this additional step worths in terms on
performance. (sparse matrix operations, for instance)
Second. This is more subtle, but in some scenarios it has a deep impact on the performance of a kernel; the launching configuration. It is not the same launching 20 blocks of 10 threads as launching 10 blocks of 20 threads. There is a big difference in the amount of resources a thread needs (shared memory, number of registers,...). The more resources a thread needs the less warps can be mapped on a single SM so the less occupancy... and the -most of the times- less performance.
This not applies to your question, since the number of blocks is equal to the number of threads.
When programming for GPUs you must be aware of the architecture in order to understand how that changes will modify the performance. Of course, I am not familiar with the code so there will be others factors among these two.