Can different calls to a kernel share memory? - cuda

In my code, I need to call a CUDA kernel to parallelize some matrix computation. However, this computation must be done iteratively about 60,000 times (the kernel is called inside a for loop with 60,000 iterations).
That means that if I do cudaMalloc/cudaMemcpy around every single call to the kernel, most of the time will be spent on memory allocation and transfer, and I get a significant slowdown.
Is there a way to, say, allocate a piece of memory before the for loop, use that memory in each iteration of the kernel, and then, after the for loop, copy that memory back from device to host?
Thanks.

Yes, you can do exactly what you describe:
int *h_data, *d_data;
cudaMalloc((void **)&d_data, DSIZE*sizeof(int));
h_data = (int *)malloc(DSIZE*sizeof(int));
// fill up h_data[] with data
cudaMemcpy(d_data, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
for (int i = 0; i < 60000; i++)
    my_kernel<<<grid_dim, block_dim>>>(d_data);  // the same device buffer is reused on every launch
cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
...

Related

Transferring data from CPU to GPU and vice versa where exactly?

I understand that cudaMalloc and cudaMemcpy transfer CPU (host) data to GPU (device), but I want to know exactly from which memory to which memory (if indeed it is a memory and not a register, because I'm not sure), because I read that a GPU has more than one kind of memory.
The cudaMalloc function allocates the requested number of bytes in the device global memory of the GPU and returns a pointer to that chunk of memory. The memory itself is not cleared.
cudaMemcpy takes 4 parameters:
Pointer to the destination memory where the copy is to be written
Pointer to the source memory
Number of bytes to copy
The direction of the copy, i.e. host to device or device to host
For example
void Add(float *A, float *B, float *C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    // Allocate device global memory and copy each host array into it
    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);
    cudaMemcpy(d_C, C, size, cudaMemcpyHostToDevice);
    // further processing code
    ........
    // Copy the result from device global memory back to the host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    .......
}
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are constants (values of the cudaMemcpyKind enumeration) defined in the CUDA programming environment.
In CUDA, host and device have separate memory spaces. GPUs have on-board DRAM, known as device global memory, and some boards have more than 4 GB of it. To execute a kernel on a device, the programmer needs to allocate device global memory and transfer the relevant data from host memory to device memory. After the GPU processing is done, the result is transferred back to the host. These operations are shown in the code snippet above.

How to measure overhead of a kernel launch in CUDA

I want to measure the overhead of a kernel launch in CUDA.
I understand that there are various parameters which affect this overhead. I am interested in the following:
number of threads created
size of data being copied
I am doing this mainly to measure the advantage of using managed memory which has been introduced in CUDA 6.0. I will update this question with the code I develop and from the comments. Thanks!
How to measure the overhead of a kernel launch in CUDA is dealt with in Section 6.1.1 of "The CUDA Handbook" by N. Wilt. The basic idea is to launch an empty kernel. Here is a sample code snippet:
#include <stdio.h>
__global__ void EmptyKernel() { }
int main() {
    const int N = 100000;
    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i=0; i<N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1,1>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;
    }
    printf("Kernel launch overhead time: %3.5f ms \n", cumulative_time / N);
    return 0;
}
On my laptop GeForce GT540M card, the kernel launch overhead is 0.00245ms.
If you want to check how this time depends on the number of threads launched, just change the kernel launch configuration <<<*,*>>>. It appears that the timing does not change significantly with the number of threads launched, which is consistent with the book's statement that most of that time is spent in the driver.
You may also be interested in these test results from the University of Virginia:
Memory transfer overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html
Kernel launch overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html
They were measured in a way similar to JackOLantern's proposal.

CUDA context lifetime

In my application I have some part of the code that works as follows
main.cpp
int main()
{
    //First dimension usually small (1-10)
    //Second dimension (100 - 1500)
    //Third dimension (10000 - 1000000)
    vector<vector<vector<double>>> someInfo;
    Object someObject(...); //Host class
    for (int i = 0; i < N; i++)
        someObject.functionA(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
    //GPU COMPUTING
    computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
    //CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
    //Copy values to constant memory
    //Allocate memory on GPU
    //Copy data to GPU global memory
    //Launch Kernel
    //Copy data back to CPU
    //Free memory
}
As (I hope) you can see from the code, the function that prepares the GPU is called many times, depending on the size of the first dimension.
All the values that I send to constant memory always remain the same, and the sizes of the buffers allocated in global memory are always the same (only the data changes).
This is the actual workflow in my code, but I'm not getting any speedup when using the GPU. The kernel does execute faster, but the memory transfers have become my problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes, to see if there is a way to do the copies to constant memory and the memory allocations only once.
Normally, the CUDA context begins with the first CUDA call in your application and ends when the application terminates.
You should be able to do what you have in mind: do the allocations only once (at the beginning of your app), do the corresponding free operations only once (at the end of your app), and populate __constant__ memory only once, before it is used for the first time.
It's not necessary to allocate and free the data structures in GPU memory repeatedly if they are not changing in size.

Thrust: transform_reduce : cudaMalloc in unary_op.operator

In my unary_op.operator, I need to create a temporary array.
I guess cudaMalloc is the way to go.
But, is it performance efficient or is there a better design?
struct my_unary_op
{
    __host__ __device__ int operator()(const int& index) const
    {
        int* array;
        cudaMalloc((void**)&array, 10*sizeof(int));
        for(int i = 0; i < 10; i++)
            array[i] = index;
        int sum=0;
        for(int i=0; i < 10 ; i++)
            sum += array[i];
        return sum;
    };
};
int main()
{
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first+100;
    my_unary_op unary_op = my_unary_op();
    thrust::plus<int> binary_op;
    int init = 0;
    int sum = thrust::transform_reduce(first, last, unary_op, init, binary_op);
    return 0;
};
You won't be able to compile cudaMalloc() in a __device__ function, because it is a host-only function. You can, however, use plain malloc() or new (on devices of compute capability >= 2.0), but these are not very efficient when running on the device. There are two reasons for this. The first is that concurrently running threads are serialized during the memory allocation call. The second is that the calls allocate global memory in chunks that become arranged in such a way that when the memory load and store instructions are run by the 32 threads in a warp, they are not adjacent, so you don't get properly coalesced memory accesses.
You can address both of these issues by using fixed-size C-style arrays in your __device__ functions (i.e., int array[10];). Small, fixed-size arrays can sometimes be optimized by the compiler so that they are stored in the register file, for extremely fast access. If the compiler cannot keep them in registers, it places them in local memory. Local memory physically resides in global memory, but it is interleaved in such a way that when the 32 threads in a warp run a load or store instruction, each thread accesses adjacent locations in memory, enabling the transactions to be fully coalesced.
If you don't know at compile time what size your C arrays will need to be, declare the array at the maximum size and leave some of it unused.
I think that the total amount of memory used by the fixed-size array will depend on the total number of threads that are processed concurrently on the GPU, not on the total number of threads launched by the kernel. In this answer @mharris shows how to calculate the maximum possible number of concurrent threads, which is 24,576 for a GTX580. So, if the fixed-size array is 16 32-bit values, the maximum possible amount of memory used by the array would be 1536 KiB.
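The arithmetic behind the 1536 KiB figure can be checked with a few lines of host code. This is only a sketch: the function name localArrayFootprintBytes is hypothetical, and the inputs are the GTX580 numbers quoted above rather than values queried from a device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Upper bound on the bytes consumed by one fixed-size per-thread array.
// All three parameters are illustrative inputs, not values read from the GPU.
constexpr std::size_t localArrayFootprintBytes(std::size_t maxConcurrentThreads,
                                               std::size_t elementsPerThread,
                                               std::size_t bytesPerElement)
{
    return maxConcurrentThreads * elementsPerThread * bytesPerElement;
}

// Prints the footprint in KiB for 24,576 threads, 16 elements, 4 bytes each.
inline void printGtx580Example()
{
    const std::size_t bytes = localArrayFootprintBytes(24576, 16, 4);
    assert(bytes % 1024 == 0);  // the figures above divide evenly into KiB
    std::printf("%zu KiB\n", bytes / 1024);
}
```

With those inputs the product is 24,576 x 16 x 4 = 1,572,864 bytes, which is 1536 KiB.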
If you need a wide range of array sizes, you can use templates to compile kernels with a number of different sizes. Then, at runtime, select one that is able to accommodate the size that you need. However, chances are that if you simply allocate the maximum of what you might need, the memory usage will not be the limiting factor in the number of threads that you can launch.
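The template approach might look roughly like the host-only sketch below. It assumes three compiled capacities (8, 32 and 128), and the names sumWithScratch and sumDispatch are hypothetical. In real CUDA code the templated function would be a __device__ function or a __global__ kernel, and the dispatcher would launch the selected instantiation instead of calling it directly.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in for a templated device function: the scratch array size is a
// compile-time constant, so no dynamic allocation is needed.
template <int Capacity>
int sumWithScratch(int index, int used)
{
    int scratch[Capacity];
    for (int i = 0; i < used; ++i)
        scratch[i] = index;
    int sum = 0;
    for (int i = 0; i < used; ++i)
        sum += scratch[i];
    return sum;
}

// Runtime selection of the smallest precompiled capacity that fits `used`.
inline int sumDispatch(int index, int used)
{
    if (used < 0)
        throw std::invalid_argument("negative size");
    if (used <= 8)
        return sumWithScratch<8>(index, used);
    if (used <= 32)
        return sumWithScratch<32>(index, used);
    if (used <= 128)
        return sumWithScratch<128>(index, used);
    throw std::invalid_argument("size exceeds the largest compiled capacity");
}
```

Each instantiation is compiled separately, so the set of capacities trades binary size against wasted per-thread storage.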

memset in cuda that allows to set values within kernel

I am making several cudaMemset calls in order to set my values to 0, as below:
void allocateByte( char **gStoreR,const int byte){
    char **cStoreR = (char **)malloc(N * sizeof(char*));
    for( int i =0 ; i< N ; i++){
        char *c;
        cudaMalloc((void**)&c, byte*sizeof(char));
        cudaMemset(c,0,byte);
        cStoreR[i] = c;
    }
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
}
However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?
Every cudaMemset call launches a kernel, so if N is large and byte is small, then you will have a lot of kernel launch overhead slowing down the code. There is no device side memset, so the solution would be to write a kernel which traverses the allocations and zeros the storage in a single launch.
As an aside, I would strongly recommend against using an array of pointers to separately allocated arrays (as in your code) in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome using a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
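The indexing scheme for a single linear block can be sketched on the host as follows. This is an illustration, not the answerer's code: flatIndex and allocateZeroed are hypothetical names, and std::vector stands in for what would be one cudaMalloc plus one cudaMemset over N * byte bytes on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of byte j inside logical row `row`, where every row holds `rowBytes` bytes.
// This integer arithmetic replaces the pointer lookup of the array-of-pointers layout.
inline std::size_t flatIndex(std::size_t row, std::size_t rowBytes, std::size_t j)
{
    return row * rowBytes + j;
}

// Host stand-in for a single allocation that is zeroed in one step.
inline std::vector<char> allocateZeroed(std::size_t numRows, std::size_t rowBytes)
{
    return std::vector<char>(numRows * rowBytes, '\0');
}
```

For example, with 4 rows of 8 bytes each, byte 3 of row 2 sits at offset 2 * 8 + 3 = 19 in the flat buffer.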