How can I clear graphics card memory after training in PyTorch? - deep-learning

I am working with PyTorch in Colab.
While training, PyTorch consumes an enormous amount of GPU memory.
After training, I saved the model and loaded it in another notebook (note 2).
In note 2, after loading the state_dict and everything else, PyTorch consumes far less memory than it did during training.
So I suspect 'useless' data is being kept in graphics card memory during training (in my case, about 13 GB).
If so, how do I delete this useless data after training?
P.S. I tried deleting the variables used during training, but that wasn't big enough (it freed only about 2 GB).

This is to be expected while training. During the training process, the operations themselves will take up memory.
For example, consider the following NumPy operation:
import numpy as np

a = np.random.rand(100, 500, 300)
b = np.random.rand(200, 500, 300)
c = (a[:, None, :, :] * b[None, :, :, :]).sum(-1).sum(-1)
Individually, a and b only take around 120 MB and 240 MB respectively (and c is tiny). However, if you measure the peak memory of the operation itself, e.g. with memory_profiler's %memit magic:
%memit (a[:, None, :, :] * b[None, :, :, :]).sum(-1).sum(-1)
That's about 23 GB! The line itself takes up a lot of memory to actually perform the operation, because massive intermediate arrays are involved. These arrays are temporary and are deleted automatically once the operation is over, so deleting a few of your own variables isn't going to do much to reduce the footprint.
The way to get around this is to use memory optimized operations.
For example, doing np.tensordot(a, b, ((1, 2), (1, 2))) instead of multiplying by broadcasting leaves a much better memory footprint.
So what you need to do is identify which operation in your code requires such a huge amount of memory, and see whether you can replace it with a more memory-efficient equivalent (which might not even be possible, depending on your specific use case).
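To make the comparison concrete, here is a minimal sketch of the two equivalent computations, using smaller, made-up array sizes so that the broadcasted intermediate stays small enough to run anywhere:
import numpy as np
from numpy.testing import assert_allclose

a = np.random.rand(10, 50, 30)
b = np.random.rand(20, 50, 30)

# Broadcasting materializes a (10, 20, 50, 30) intermediate before reducing it.
c_broadcast = (a[:, None, :, :] * b[None, :, :, :]).sum(-1).sum(-1)

# tensordot contracts axes 1 and 2 of both arrays directly, without the huge intermediate.
c_tensordot = np.tensordot(a, b, ((1, 2), (1, 2)))

assert_allclose(c_broadcast, c_tensordot)  # both give the same (10, 20) result
The same idea carries over to the PyTorch training code itself: broadcast-then-reduce patterns can often be rewritten with contracted equivalents such as torch.tensordot or torch.einsum.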

Related

What do BatchNorm2d's running_mean / running_var mean in PyTorch?

I'd like to know what exactly the running_mean and running_var that I can access on nn.BatchNorm2d are.
Example code is below, where bn is an nn.BatchNorm2d module.
vector = torch.cat([
    torch.mean(self.conv3.bn.running_mean).view(1), torch.std(self.conv3.bn.running_mean).view(1),
    torch.mean(self.conv3.bn.running_var).view(1), torch.std(self.conv3.bn.running_var).view(1),
    torch.mean(self.conv5.bn.running_mean).view(1), torch.std(self.conv5.bn.running_mean).view(1),
    torch.mean(self.conv5.bn.running_var).view(1), torch.std(self.conv5.bn.running_var).view(1)
])
I couldn't figure out what running_mean and running_var mean from the PyTorch official documentation or the user community.
What do nn.BatchNorm2d.running_mean and nn.BatchNorm2d.running_var mean?
From the original BatchNorm paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe and Christian Szegedy, ICML 2015
Algorithm 1 in the paper shows how the statistics of a given batch are measured.
However, what is kept in memory across batches are the running stats, i.e. the statistics that are updated iteratively at each batch inference. The computation of the running mean and running variance is actually quite well explained in the documentation page of nn.BatchNorm2d; the update has the form
running_stat = (1 - momentum) * running_stat + momentum * batch_stat
By default, the momentum coefficient is set to 0.1. It regulates how much the current batch statistics affect the running statistics:
a value closer to 1 means the new running stat is closer to the current batch statistics, whereas
a value closer to 0 means the current batch stats will not contribute much to updating the new running stats.
It's worth pointing out that BatchNorm2d is applied across the spatial dimensions, in addition, of course, to the batch dimension. Given a batch of shape (b, c, h, w), it will compute the statistics across (b, h, w). This means the running statistics have shape (c,), i.e. there are as many statistics components as there are input channels (for both mean and variance).
Here is a minimal example:
>>> import torch
>>> import torch.nn as nn
>>> bn = nn.BatchNorm2d(10)
>>> x = torch.rand(2,10,2,2)
Since track_running_stats is set to True by default on BatchNorm2d, it will track the running stats when inferring in training mode.
The running mean and variance are initialized to zeros and ones, respectively.
>>> running_mean, running_var = torch.zeros(x.size(1)), torch.ones(x.size(1))
Let's perform inference on bn in training mode and check its running stats:
>>> bn(x)
>>> bn.running_mean, bn.running_var
(tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
         0.0622, 0.0651, 0.0660, 0.0406, 0.0446]),
 tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
         0.9026, 0.9136, 0.9043, 0.9126, 0.9122]))
Now let's compute those stats by hand (the variance that feeds the running stats is the unbiased estimate, taken over the batch and spatial dimensions):
>>> momentum = bn.momentum  # 0.1 by default
>>> xmean = x.mean(dim=(0, 2, 3))
>>> xvar = x.var(dim=(0, 2, 3), unbiased=True)
>>> (1-momentum)*running_mean + momentum*xmean
tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
        0.0622, 0.0651, 0.0660, 0.0406, 0.0446])
>>> (1-momentum)*running_var + momentum*xvar
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
        0.9026, 0.9136, 0.9043, 0.9126, 0.9122])

Slow prediction speed for translation model opus-mt-en-ro

I'm using the model Helsinki-NLP/opus-mt-en-ro from huggingface.
To produce output, I'm using the following code:
inputs = tokenizer(
    questions,
    max_length=max_input_length,
    truncation=True,
    return_tensors='pt',
    padding=True).to('cuda')
translation = model.generate(**inputs)
For small inputs (i.e. a small number of sentences in questions), it works fine. However, when the number of sentences increases (e.g. a batch size of 128), it is very slow.
I have a dataset of 100K examples for which I have to produce output. How can I make it faster? (I already checked GPU usage; it varies between 25% and 70%.)
Update: Following the comment of dennlinger, here is the additional information:
Average question length: Around 30 tokens
Definition of slowness: With a batch of 128 questions, it takes around 25 seconds, so for my dataset of 100K examples it will take more than 5 hours. I'm using an Nvidia V100 GPU (16 GB), hence the to('cuda') in the code. I cannot increase the batch size because that results in an out-of-memory error.
I didn't try different generation parameters, but I know that by default the number of beams equals 1.
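For reference, the 5-hour figure follows directly from those numbers; a rough back-of-the-envelope check, assuming every batch takes roughly the same 25 seconds:
import math

n_examples = 100_000
batch_size = 128
seconds_per_batch = 25                                # measured for a 128-question batch

n_batches = math.ceil(n_examples / batch_size)        # 782 batches
total_hours = n_batches * seconds_per_batch / 3600    # about 5.4 hours
print(n_batches, round(total_hours, 1))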

How to convert a sparse histogram into a dense histogram in CUDA?

I am implementing an algorithm using raw CUDA kernels in which every thread block needs the dense histogram of the data available to that thread block. The question is: do I have to calculate the dense histogram from scratch? (Is it worth calculating the dense histogram at all, given that I already have the sparse histogram, implemented in shared memory?)
I have come up with an idea for the conversion, which I will try to illustrate with an example (temp and hist are both in shared memory):
0,1,2,3,4,5,6... //array indexes
4,3,0,2,1,0,5... //contents of hist[]
0,0,2,0,0,5,0... //contents of temp[]: if(hist[x]==0) temp[x]=x; (marks the empty bins)
for_every_element //this is sequential part :(
if(temp[x]>0)
shift elements from index x to 256
4,3,2,1,0,5... //pass 1 of the for loop
4,3,2,1,5... //pass 2 of the for loop
//this goes on until all the 0s are compacted
Now I know the above is sequential in nature, but the shifting can be done in constant time (and in parallel) because threads_per_block is already set to 256, so the shifting is not the main issue. The main issue is how to improve this (any other suggestion is also welcome).
Edit: I am thinking of another idea, as follows.
Assuming threads_per_block = 256, I can first count which of the histogram bins are non-zero (this operation is parallel because each thread is assigned to one bin, and I can atomicAdd the values generated by each thread). Then I can introduce a shared index variable sindex = 0; each time a thread wants to store a value into d_hist[], it takes the latest value of sindex, stores its value with d_hist[sindex] = hist[threadIdx.x], and then atomicAdds sindex.
Now there is only one problem: there is going to be a race condition on reading sindex, so I may have to set up a flag that can be locked or unlocked while a thread is adding a value to d_hist (but I think there could be a deadlock situation here).
Will this technique work, and is there any better technique?
Converting a sparse histogram to a dense histogram is a scatter operation. If the sparse histogram is composed of s_index[S_N] and s_hist[S_N], then first we create a dense histogram d_hist[N] composed of all zeroes (you can do this from host code, perhaps). Then we populate the dense histogram with d_hist[s_index[i]] = s_hist[i]; This can be done in parallel and uses as many threads as there are valid indices in your sparse histogram (i < S_N). Assuming your histogram is sorted, you'll get whatever coalescing benefit may be possible based on the distribution of your sparse histogram indices.
It may not make sense for your case where each threadblock is doing a separate histogram, but you may also be interested in thrust scatter.
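To make the scatter concrete, here is the same operation written out in numpy, with values consistent with the example histogram in the question; in the CUDA kernel, each thread i with i < S_N would perform the single assignment d_hist[s_index[i]] = s_hist[i] instead:
import numpy as np

N = 8                                     # total number of dense bins (made up)
s_index = np.array([0, 1, 3, 4, 6])       # indices stored in the sparse histogram
s_hist  = np.array([4, 3, 2, 1, 5])       # counts stored in the sparse histogram

d_hist = np.zeros(N, dtype=s_hist.dtype)  # dense histogram, all zeroes
d_hist[s_index] = s_hist                  # the scatter: one write per valid index

print(d_hist)                             # [4 3 0 2 1 0 5 0]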
Well, I guess the simplest method is to find out which bins are > 0; after that, an exclusive scan can be done (in order to calculate the target indexes, say sum_array[]); after that, for all bins > 0, move s_hist[threadIdx.x] to d_hist[sum_array[threadIdx.x]-1].
0,1,2,3,4,5,6... //s_indexes[]
4,3,0,2,1,0,5... //contents of s_hist[]
1,1,0,1,1,0,1... //all bins which are > 0 = sum_array[]
1,2,2,3,4,4,5... //inclusive scan of sum_array[]
//after the moving part
0,1,3,4,6... //s_indexes[]
4,3,2,1,5... //d_hist[]
0,1,2,3,4... //d_indexes[]
The reason why I am inclined to use this pattern is that calculating the scan of sum_array takes only log2(256) steps; other than that, the moving and checking parts are constant-time operations. If anyone has a different idea, please share.
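Here is the same flag / scan / move sequence written out in numpy so the indexing can be checked against the example above; np.cumsum stands in for the parallel inclusive scan that a 256-thread block would compute in log2(256) steps:
import numpy as np

s_indexes = np.arange(7)                           # 0,1,2,3,4,5,6
s_hist    = np.array([4, 3, 0, 2, 1, 0, 5])        # contents of s_hist[], zeros included

flags = (s_hist > 0).astype(np.int32)              # 1,1,0,1,1,0,1
scan  = np.cumsum(flags)                           # 1,2,2,3,4,4,5 (inclusive scan)

nonzero   = s_hist > 0
d_hist    = np.empty(scan[-1], dtype=s_hist.dtype)
d_indexes = np.empty(scan[-1], dtype=s_indexes.dtype)
d_hist[scan[nonzero] - 1]    = s_hist[nonzero]     # each "thread" writes to its own slot
d_indexes[scan[nonzero] - 1] = s_indexes[nonzero]

print(d_indexes)                                   # [0 1 3 4 6]
print(d_hist)                                      # [4 3 2 1 5]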

CUDA linear interpolation using textures

I have a curve as follows:
float points[] = {1, 4, 6, 9, 14, 25, 69};
float images[] = {0.3, 0.4, 0.7, 0.9, 1, 2.5, 5.3};
In order to interpolate, let's say, f(3), I would use linear interpolation between 1 and 4.
In order to interpolate, let's say, f(15), I would apply a binary search on the array of points to get the lower bound, which is 25, consider interpolation in the interval [14, 25], and so on.
I have found that this method makes my device function very slow. I've heard I can use texture memory and tex1D to do this. Is that possible even if points[] is not uniform (i.e. not incremented by a constant step)?
Any idea?
It looks like this problem can be broken into two parts:
Use the points array to convert the x value in f(x) to a floating point index between 0 and 7 (requires binary search on points[])
Use that floating point index to get a linearly interpolated value from the images array
CUDA texture memory can make step 2 very fast. I am guessing, however, that most of the time in your kernel is spent on step 1, and I don't think texture memory can help you there.
If you aren't already taking advantage of shared memory, moving your arrays to shared memory will give you a much bigger speedup than using texture memory. There is 48k of shared memory on recent hardware, so if your arrays are less than 24k (6k elements) they should both fit in shared memory. Step 1 can benefit greatly from shared memory because it requires non-contiguous reads of points[], which is very very slow in global memory.
If your arrays don't fit in shared memory, you should break them up into equally sized pieces of 6k elements each and assign each piece to a block. Have each block read through all of the points you are interpolating, and have it ignore a point if it's not within the portion of the points[] array stored in its shared memory.
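For reference, here is the two-step lookup written out in numpy with the points/images arrays from the question; searchsorted plays the role of the binary search in step 1, and the fractional position drives the linear interpolation of step 2 (roughly what tex1D with linear filtering does on the device):
import numpy as np

points = np.array([1, 4, 6, 9, 14, 25, 69], dtype=np.float32)
images = np.array([0.3, 0.4, 0.7, 0.9, 1.0, 2.5, 5.3], dtype=np.float32)

def interp(x):
    # Step 1: binary search for the interval [points[i], points[i+1]] containing x.
    i = int(np.clip(np.searchsorted(points, x) - 1, 0, len(points) - 2))
    # Step 2: linear interpolation inside that interval.
    t = (x - points[i]) / (points[i + 1] - points[i])
    return images[i] + t * (images[i + 1] - images[i])

print(interp(3.0))    # between f(1)=0.3 and f(4)=0.4
print(interp(15.0))   # between f(14)=1.0 and f(25)=2.5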

Correct Effective Bandwidth calculations of y = Ax+b?

I would like to calculate the effective bandwidth of the matrix-vector multiplication and addition (assume A is M times N big):
y = A*x + b
But I am a bit confused about which reads and writes count toward the number of bytes read from and written to global memory.
Is the effective bandwidth based on
bytesReadWrite = M*N (for reading A) + N (for reading x) + M (for reading b) + M (for writing y)
or is it
bytesReadWrite = M*N (for reading A) + M*N (for reading x) + M (for reading b) + M (for writing y)
with M*N for x because we basically read the whole of x once per row (and even if we work with shared memory, we still eventually read the whole x vector once per row)?
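For concreteness, here is how either count would be turned into an effective-bandwidth figure, assuming 4-byte float elements, made-up matrix sizes, and a made-up measured kernel time; which of the two element counts is the right one is exactly what this question asks:
M, N = 4096, 4096                       # hypothetical problem size
element_size = 4                        # bytes per float
t_seconds = 2.5e-3                      # hypothetical measured kernel time

elements_1 = M * N + N + M + M          # first count: A once, x once, b read, y written
elements_2 = M * N + M * N + M + M      # second count: x re-read for every row

bw_1 = elements_1 * element_size / t_seconds / 1e9   # GB/s
bw_2 = elements_2 * element_size / t_seconds / 1e9
print(round(bw_1, 1), round(bw_2, 1))   # about 26.9 vs 53.7 GB/s for these numbers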
Does somebody have some good advice on what the right choice is? I don't really get this.
I tend to use the first calculation, but why? Does it make sense?
Thanks a lot!
It's almost certainly none of the above. In terms of memory bandwidth, modern processors will load all of the items to be operated on once into Level 2 cache, and operate on them from there, after which the results will be written back out to memory for any items changed. Effectively, your bandwidth is just the sum total size of all of the elements involved.

Note that even this is an oversimplification, because it doesn't take into account the effects of streaming, not to mention memory pagination. For streaming, it's not uncommon to have a single matrix operate on a large set of data (3D graphics calculations, for example); in that case, the matrix gets loaded into L2 cache (and, for reasonably optimized code, presumably into the registers from there) once, and then the vectors get loaded through.

Once again, the model isn't really complete without an understanding of modern memory paging techniques; there's a gigantic difference in the above if the matrix and the vectors are stored in different memory pages, for example, not to mention serious optimizations in packing vectors together for "streaming" into L2 cache. And even then, that's assuming a CPU model of performing the matrix math; bringing a GPU into the picture changes things very dramatically once again.