Do the number of parameters and the computation time per epoch increase with the number of input image channels? - deep-learning

I have different sets of images at the input (i.e. 56x56x3, 56x56x5, 56x56x10, 56x56x30, 56x56x61) with the same network.
1) Will the number of network parameters be the same for each input?
2) The computation time per epoch is slightly higher as the number of input channels increases; is this normal?
UPDATE
Parameter calculation for 3 channels
3*3*3*64 = 1728
3*3*64*128 = 73728
3*3*128*256 = 294912
5*5*256*512 = 3276800
1*1*512*1024 = 524288
1*1*1024*4 = 4096
Parameter calculation for 10 channels
3*3*10*64 = 5760
3*3*64*128 = 73728
3*3*128*256 = 294912
5*5*256*512 = 3276800
1*1*512*1024 = 524288
1*1*1024*4 = 4096

To perform a convolution, each kernel (or filter) must have the same number of channels as the input feature map (or image) of the corresponding layer. The number of parameters for that layer is therefore:
No. of Kernels x Kernel Height x Kernel Width x No. of Channels in the Kernel
So the number of parameters in a layer is directly proportional to the number of channels in its input feature map; in your network only the first layer sees a different channel count, which is exactly what your calculation above shows. And as the number of parameters increases, the number of computations increases as well, hence the increased computational time.
You can find a detailed explanation of the convolution operation in my post here.
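For concreteness, here is a minimal PyTorch sketch of the counts above, assuming bias-free convolutions so that the totals are exactly the sums of the products listed in the update (the layer sizes are taken from that calculation):

import torch.nn as nn

def count_params(in_channels):
    # Mirrors the layer stack from the calculation above (bias=False so the
    # parameter counts match the listed products exactly).
    layers = nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, bias=False),
        nn.Conv2d(64, 128, 3, bias=False),
        nn.Conv2d(128, 256, 3, bias=False),
        nn.Conv2d(256, 512, 5, bias=False),
        nn.Conv2d(512, 1024, 1, bias=False),
        nn.Conv2d(1024, 4, 1, bias=False),
    )
    return sum(p.numel() for p in layers.parameters())

print(count_params(3))   # 4175552 in total; the first layer contributes 1728
print(count_params(10))  # 4179584 in total; the first layer contributes 5760

Only the first convolution changes with the number of input channels, so the total parameter count (and with it the amount of computation) grows only slightly, which matches the small increase in epoch time you observe.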

Related

Slow prediction speed for translation model opus-mt-en-ro

I'm using the model Helsinki-NLP/opus-mt-en-ro from huggingface.
To produce output, I'm using the following code:
inputs = tokenizer(
    questions,
    max_length=max_input_length,
    truncation=True,
    return_tensors='pt',
    padding=True,
).to('cuda')
translation = model.generate(**inputs)
For small inputs (i.e., a small number of sentences in questions), it works fine. However, when the number of sentences increases (e.g., batch size = 128), it is very slow.
I have a dataset of 100K examples for which I have to produce output. How can I make it faster? (I have already checked GPU usage; it varies between 25% and 70%.)
Update: Following the comment of dennlinger, here is the additional information:
Average question length: Around 30 tokens
Definition of slowness: With a batch of 128 questions, it takes around 25 seconds, so for my dataset of 100K examples it will take more than 5 hours. I'm using an Nvidia V100 GPU (16 GB), hence the .to('cuda') in the code. I cannot increase the batch size because that results in an out-of-memory error.
I haven't tried different generation parameters, but I know that by default the number of beams equals 1.
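For reference, here is a self-contained version of the setup described above, run as a loop over the whole dataset; questions_100k, batch_size and the value of max_input_length are placeholders for the poster's own values, and torch.no_grad() is assumed since no gradients are needed during generation:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'Helsinki-NLP/opus-mt-en-ro'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to('cuda').eval()

max_input_length = 512      # placeholder; use the same value as in the snippet above
batch_size = 128            # the batch size used in the timing above
questions_100k = ["..."]    # placeholder for the 100K source sentences

translations = []
with torch.no_grad():                     # generation needs no gradients
    for start in range(0, len(questions_100k), batch_size):
        batch = questions_100k[start:start + batch_size]
        inputs = tokenizer(batch, max_length=max_input_length,
                           truncation=True, padding=True,
                           return_tensors='pt').to('cuda')
        outputs = model.generate(**inputs)
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True))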

Depthwise separable convolutions require more GPU memory

I have read many papers and web articles that claim that depthwise-separable convolutions reduce the memory required by a deep learning model compared to standard convolution. However, I do not understand how this would be the case, since depthwise-separable convolution requires storing an extra intermediate-step matrix as well as the final output matrix.
Here are two scenarios:
Typical convolution: You have a single 3x3 filter (spanning all 3 channels), which is applied to a 7x7 RGB input volume. This results in an output of size 5x5x1, which needs to be stored in GPU memory. With float32 activations, this requires 100 bytes of memory.
Depthwise-separable convolution: You have three 3x3x1 filters applied to a 7x7 RGB input volume. This results in three output volumes, each of size 5x5x1. You then apply a 1x1 convolution to the concatenated 5x5x3 volume to get a final output volume of size 5x5x1. With float32 activations, this requires 300 bytes for the intermediate 5x5x3 volume and 100 bytes for the final output, a total of 400 bytes of memory.
As additional evidence, when using an implementation of U-Net in PyTorch with typical nn.Conv2d convolutions, the model has 17.3M parameters and a forward/backward pass size of 320 MB. If I replace all convolutions with depthwise-separable convolutions, the model has 2M parameters and a forward/backward pass size of 500 MB. So: fewer parameters, but more memory required.
I am sure I am going wrong somewhere, as every article states that depthwise-separable convolutions require less memory. Where am I going wrong with my logic?
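For what it's worth, the shapes behind those byte counts can be checked with a small PyTorch sketch of the two scenarios above (variable names are illustrative only):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)  # 7x7 RGB input

# Typical convolution: one 3x3 filter spanning all 3 input channels
standard = nn.Conv2d(3, 1, kernel_size=3, bias=False)
out_std = standard(x)              # (1, 1, 5, 5): 25 float32 values = 100 bytes

# Depthwise-separable: 3x3 depthwise (groups=3) followed by a 1x1 pointwise conv
depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)
pointwise = nn.Conv2d(3, 1, kernel_size=1, bias=False)
mid = depthwise(x)                 # (1, 3, 5, 5): 75 float32 values = 300 bytes
out_dw = pointwise(mid)            # (1, 1, 5, 5): 25 float32 values = 100 bytes

print(out_std.shape, mid.shape, out_dw.shape)

The intermediate (1, 3, 5, 5) activation is exactly the extra 300 bytes described in the second scenario above.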

Pytorch: How do I deal with different input sizes within one batch?

I am implementing something closely related to the DeepSets architecture on point clouds:
https://arxiv.org/abs/1703.06114
That means I am working with a set of inputs (coordinates), have fully connected layers process each of them separately, and then perform average pooling over them (to then do further processing).
The input for each sample i is a tensor of shape [L_i, 3], where L_i is the number of points and the last dimension is 3 because each point has x, y, z coordinates. Crucially, L_i depends on the sample, so I have a different number of points per instance. When I put everything into a batch, I currently have the input in the shape [B, L, 3], where L is larger than L_i for all i. The individual samples are padded with 0's. The issue is that the 0's are not ignored by the network; they are processed and fed into the average pooling. Instead, I would like the average pooling to only consider actual points (not padded 0's). I do have another array which stores the lengths [L_1, L_2, L_3, L_4, ...], but I am not sure how to use it.
My question is: how do you handle different input sizes within one batch in the most graceful manner?
This is how the model is defined:
encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128))
x = encoder(x)          # [B, L, 128]
x = x.max(dim=1)[0]     # pool over the L (points) dimension
decoder = ...
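One possible way to use that lengths array is masked mean pooling: build a [B, L] mask from the lengths, zero out the padded positions after the encoder, and divide each sample's sum by its own L_i rather than by L. A minimal sketch, assuming lengths is a [B] tensor holding [L_1, L_2, ...]:

import torch

def masked_mean(x, lengths):
    # x: [B, L, D] encoder output, lengths: [B] number of real points per sample
    mask = torch.arange(x.size(1), device=x.device)[None, :] < lengths[:, None]  # [B, L]
    x = x * mask.unsqueeze(-1)                   # zero out padded positions
    return x.sum(dim=1) / lengths.unsqueeze(-1)  # divide by the true length, not L

# usage: pooled = masked_mean(encoder_output, lengths)  # [B, D]

(The snippet above pools with max instead of the mean; masking a max pool would instead mean filling padded positions with a very negative value before taking the maximum.)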

CUDA Kernel Execution Time does not change on larger array

I am profiling a CUDA application on different 1D input sizes. However, the NSIGHT profiler reports similar kernel execution times for small vector sizes: for example, there is no difference between vector sizes 512 and 2048. Kernel execution time increases linearly for larger vectors, but there is no difference between smaller ones such as 512 and 2048. Is this an expected result?
Let's suppose it takes 3 microseconds of execution time to launch a kernel of any size and, after that overhead, 1 ns of execution time per point in your vector. Now let's ask: what is the percent difference in execution time between kernels over x and 2x points, when x is small (say 1024) and when x is large (say 1048576)?
x = 1024:
execution_time(x) = 3000+1024 = 4024ns
execution_time(2x) = 3000+2048 = 5048ns
%difference = (5048-4024)/4024 * 100% = 25.45%
x = 1048576:
execution_time(x) = 3000+1048576 = 1051576ns
execution_time(2x) = 3000+2097152 = 2100152ns
%difference = (2100152-1051576)/1051576 * 100% = 99.71%
This demonstrates what to expect when making measurements of execution time (and changes in execution time) when the execution time is small compared to the fixed overhead, vs. when it is large compared to the fixed overhead.
In the small case, the execution time is "swamped" by the overhead. Doubling the "work" does not lead to a doubling in the execution time. In the large case, the overhead is insignificant compared to the execution time. Therefore in the large case we see approximately the expected result, that doubling the "work" (vector length) approximately doubles the execution time.
Note that the "fixed overhead" here may be composed of a number of items, the "kernel launch overhead" being just one. CUDA typically has other fixed start-up "overheads" associated with initialization, which also play a role.
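The model behind those numbers can be written out directly; the 3 µs launch overhead and 1 ns per point are just the illustrative assumptions stated above:

# Illustrative latency model from the answer above: fixed overhead plus a linear term
overhead_ns = 3000   # assumed kernel launch overhead (3 microseconds)
per_point_ns = 1     # assumed work per vector element

def execution_time(n):
    return overhead_ns + per_point_ns * n

for x in (1024, 1048576):
    t1, t2 = execution_time(x), execution_time(2 * x)
    print(x, f"{(t2 - t1) / t1 * 100:.2f}%")   # prints 25.45% and 99.71%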

CUDA - device memory, searching a string in a text

I have a string of at most 500 characters and a text file of size 200 MB. I want to write a CUDA program to search for the string in the text file. The text file is too large, so I think I have to put it in the device's global memory, but what about the string? Which is best among shared, constant, and texture memory, and why?
I also have an array of at most 2500 elements. Which type of device memory is suitable for storing it?
For a naive implementation on Fermi:
Store the text file in global memory and the search string in constant memory. Set up a result buffer of the same size as the text file. Fill the result buffer with zeroes.
Let the number of threads per block, t, be the same as the length of the search string. To determine the grid dimensions, consider the size of your text file and the grid-dimension limit of 64K. To cover the whole file with one block per character offset, pick the x dimension to be, for instance, 10K, and then obtain the y dimension by dividing the size of the text file by x and rounding up: 200M / 10K = 20K (which is within 64K). Launch the kernel with t threads per block and an (x, y) grid.
In the kernel:
Calculate the block's offset into the text file as d = x + x_dim * y, where x and y are the block indices and x_dim is the grid's x dimension chosen above.
Since the y dimension was rounded up above, some blocks at the end of the grid fall past the end of the file; abort the thread if d + t is greater than the size of the text file.
Otherwise, have each thread compare the character at its thread index in the search string with the character at (d + thread index) in the text file. If the characters don't match, store a "1" in the result buffer at index d; otherwise do nothing.
When the kernel completes, scan through the result buffer with Thrust. Each location that is 0 marks the starting point of a match.
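As a sanity check of that result-buffer scheme, here is a plain-Python CPU reference of the same logic (an entry stays 0 only when the pattern matches at that offset); it only models the kernel's behaviour and is not CUDA code:

def match_offsets(text, pattern):
    # result[d] stays 0 only if pattern matches text starting at offset d,
    # mirroring the per-thread mismatch writes described above
    result = [0] * (len(text) - len(pattern) + 1)
    for d in range(len(result)):
        for i, ch in enumerate(pattern):
            if text[d + i] != ch:
                result[d] = 1
                break
    return [d for d, r in enumerate(result) if r == 0]

print(match_offsets("abcabcab", "cab"))   # [2, 5]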
Assuming you write the kernel so that all threads in a half-warp access the same element of the search string simultaneously, constant memory is likely to yield good results; it's optimized for that case.
Here's some pseudo-code for a simple baseline implementation:
...load blocksize+strlen bytes of the file into shared memory...
__syncthreads();

bool found = true;
for (int i = 0; i < strlen; i++) {
    if (file_chunk_in_sharedmem[threadIdx.x + i] !=
        search_str_in_constantmem[i]) {
        found = false;
        break;
    }
}
if (found) {
    ...output the result...
}
If the loop is structured such that each thread accesses a different element of the search string, 1D texture memory might be faster.
The profiler and/or CUDA timing functions are your friends.