How to avoid memcpy if number of blocks depends on device variable? - cuda

I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?

I see two possibilities:

1. Use dynamic parallelism (if feasible). Rather than copying the result back to the host to determine the execution parameters of the next launch, have the device perform the next launch itself (sketched just below).
2. Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer (a zero-copy sketch follows the managed-memory example further down).

Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA-compatible device ever made.
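For completeness, here is a minimal sketch of the dynamic parallelism option (requires compute capability 3.5+ and compiling with relocatable device code, -rdc=true, linked against cudadevrt). The value of X below is just a placeholder for whatever the first kernel actually computes:

#include <cuda_runtime.h>

__global__ void childKernel()
{
    // the real work of the second kernel would go here
}

__global__ void parentKernel()
{
    // ... compute X on the device ...
    if (threadIdx.x == 0) {
        int X = 100000;                      // placeholder for the computed value
        int blocks = (X + 1023) / 1024;      // ceil(X / 1024)
        childKernel<<<blocks, 1024>>>();     // launched from device code, no memcpy
    }
}

int main()
{
    parentKernel<<<1, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}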

Here is an example of using managed memory, as outlined by @talonmies, that allows kernel1 to determine the number of blocks for kernel2 without an explicit memcpy:
#include <stdio.h>
#include <cuda_runtime.h>

// Managed variable: visible to both host and device without an explicit copy
__device__ __managed__ int kernel2_blocks;

__global__ void kernel1()
{
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;   // stands in for whatever the real computation produces
    }
}

__global__ void kernel2()
{
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main()
{
    kernel1<<<1, 1024>>>();
    cudaDeviceSynchronize();                 // required before the host reads the managed variable
    kernel2<<<kernel2_blocks, 1024>>>();     // launch configuration read directly from managed memory
    cudaDeviceSynchronize();
    return 0;
}
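For comparison, here is a sketch of the zero-copy (mapped pinned memory) variant, which also works on older hardware that lacks managed memory support. The variable names are only illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void kernel1(int *blocks_out)
{
    if (threadIdx.x == 0) {
        *blocks_out = 42;   // written straight to mapped host memory
    }
}

__global__ void kernel2()
{
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main()
{
    int *h_blocks, *d_blocks;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped pinned allocations
    // pinned host allocation that is mapped into the device address space
    cudaHostAlloc((void **)&h_blocks, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_blocks, h_blocks, 0);

    kernel1<<<1, 1024>>>(d_blocks);
    cudaDeviceSynchronize();                 // make sure the device write has landed

    kernel2<<<*h_blocks, 1024>>>();          // no cudaMemcpy needed to read the value
    cudaDeviceSynchronize();

    cudaFreeHost(h_blocks);
    return 0;
}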

Related

kernels accessing host memory

I am trying to get a better grasp of memory management in CUDA, and something has just occurred to me that exposes a major gap in my understanding: how do kernels access values that, as I understand it, should be in host memory?
When vectorAdd() is called, it runs on the device, but only the elements are stored in device memory; the lengths of the vectors are stored on the host. How is it that the kernel does not exit with an error when it accesses foo.length, something that should live on the host?
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    float *elements;
    int length;
} vector;

__global__ void vectorAdd(vector foo, vector bar)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < foo.length) {   // this is the part that I do not understand
        foo.elements[idx] += bar.elements[idx];
    }
}

int main(void)
{
    vector foo, bar;
    foo.length = bar.length = 50;
    cudaMalloc(&(foo.elements), sizeof(float) * 50);
    cudaMalloc(&(bar.elements), sizeof(float) * 50);
    // these vectors are empty, so adding is just a 0.0 += 0.0
    int blocks_per_grid = 10;
    int threads_per_block = 5;
    vectorAdd<<<blocks_per_grid, threads_per_block>>>(foo, bar);
    return 0;
}
In C and C++, the typical mechanism for making arguments available to the body of a function call is pass-by-value. The basic idea is that a separate copy of each argument is made, for use by the function.
CUDA claims compliance with C++ (subject to various limitations), and it therefore provides a mechanism for pass-by-value. On a kernel call, the CUDA compiler and runtime make copies of the arguments for use by the function (kernel). In the case of a kernel call, these copies are placed in a particular area of __constant__ memory, which lives in GPU memory space and is therefore accessible to device code.
So, in your example, the entire structures passed as the arguments for the parameters vector foo, vector bar are copied to GPU device memory (specifically, constant memory) by the CUDA runtime. The compiler structures the device code so that it accesses these arguments as needed directly from constant memory.
Since those structures contain both the elements pointer and the scalar quantity length, both items are accessible in CUDA device code, and the compiler will structure references to them (e.g. foo.length) so as to retrieve the needed quantities from constant memory.
So the kernels are not accessing host memory in your example. The pass-by-value mechanism makes the quantities available to device code, in GPU constant memory.
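To make the contrast concrete, here is a hedged sketch (not part of the original question) comparing the working pass-by-value kernel with a hypothetical pass-by-pointer variant that really would try to read host memory:

// uses the same struct as in the question
typedef struct {
    float *elements;
    int length;
} vector;

// Pass-by-value: the whole struct (device pointer + length) is copied into
// kernel parameter space (constant memory), so reading foo.length is safe.
__global__ void vectorAddByValue(vector foo, vector bar)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < foo.length) {
        foo.elements[idx] += bar.elements[idx];
    }
}

// Hypothetical pass-by-pointer variant: foo now points at a struct that lives
// in ordinary host memory. Only the pointer value is copied to the device, not
// the struct it points to, so dereferencing foo->length from device code is
// invalid and would typically fault.
__global__ void vectorAddByPointer(vector *foo, vector *bar)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < foo->length) {                  // illegal access to host memory
        foo->elements[idx] += bar->elements[idx];
    }
}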

How to measure overhead of a kernel launch in CUDA

I want to measure the overhead of a kernel launch in CUDA.
I understand that there are various parameters which affect this overhead. I am interested in the following:
number of threads created
size of data being copied
I am doing this mainly to measure the advantage of using managed memory, which was introduced in CUDA 6.0. I will update this question with the code I develop, based on the comments. Thanks!
Measuring the overhead of a kernel launch in CUDA is dealt with in Section 6.1.1 of the "CUDA Handbook" by N. Wilt. The basic idea is to launch an empty kernel. Here is a sample code snippet:
#include <stdio.h>

__global__ void EmptyKernel() { }

int main()
{
    const int N = 100000;

    float time, cumulative_time = 0.f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < N; i++) {
        cudaEventRecord(start, 0);
        EmptyKernel<<<1, 1>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);            // wait until the launch has completed
        cudaEventElapsedTime(&time, start, stop);
        cumulative_time = cumulative_time + time;
    }

    printf("Kernel launch overhead time: %3.5f ms \n", cumulative_time / N);
    return 0;
}
On my laptop's GeForce GT540M card, the kernel launch overhead is 0.00245 ms.
If you want to check the dependence of this time on the number of threads launched, just change the kernel launch configuration <<<*,*>>>. The timing does not appear to change significantly with the number of threads launched, which is consistent with the book's statement that most of that time is spent in the driver.
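If the cost of the events themselves is a concern, a rough alternative (just a sketch, not from the book) is to time a batch of back-to-back launches with a host timer and divide by the count; note that this measures launch throughput rather than the latency of a single isolated launch:

#include <stdio.h>
#include <chrono>

__global__ void EmptyKernel() { }

int main()
{
    const int N = 100000;

    // warm-up launch so that context creation is not counted
    EmptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        EmptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();   // wait for the whole batch to drain
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("Average time per launch: %.5f ms\n", ms / N);
    return 0;
}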
Perhaps you should be interested in these test results from the University of Virginia:
Memory transfer overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html
Kernel launch overhead: http://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html
They were measured in a way similar to JackOLantern's proposal.

CUDA context lifetime

In my application I have some part of the code that works as follows
main.cpp
int main()
{
    // First dimension usually small (1-10)
    // Second dimension (100 - 1500)
    // Third dimension (10000 - 1000000)
    vector<vector<vector<double>>> someInfo;

    Object someObject(...); // Host class

    for (int i = 0; i < N; i++)
        someObject.functionA(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
    // GPU COMPUTING
    computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
    // CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So, as (I hope) you can see in the code, the function that prepares the GPU is called many times, depending on the value of the first dimension.
All the values that I send to constant memory always remain the same, and the sizes of the buffers allocated in global memory are always the same (only the data changes).
This is the actual workflow in my code, but I'm not getting any speedup when using the GPU: the kernel itself executes faster, but the memory transfers have become the problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes, to see if there is a way to do the copies to constant memory and the memory allocations only once.
Normally, the CUDA context begins with the first CUDA call in your application, and ends when the application terminates.
You should be able to do what you have in mind: do the allocations only once (at the beginning of your app), do the corresponding free operations only once (at the end of your app), and populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repeatedly if they are not changing in size.
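As a sketch of that idea (the function names, sizes, and constant symbols are made up for illustration), the one-time work can be hoisted into setup/teardown helpers that main() calls once, while computeOnGPU only moves the data that actually changes:

#include <cuda_runtime.h>

__constant__ int c_aConstValue;           // populated once
__constant__ int c_aSecondConstValue;

static double *d_buffer = NULL;           // persistent device allocation, sized once

extern "C" void gpuSetup(size_t maxBytes, int aConstValue, int aSecondConstValue)
{
    cudaMalloc((void **)&d_buffer, maxBytes);                                  // allocate once
    cudaMemcpyToSymbol(c_aConstValue, &aConstValue, sizeof(int));              // constants once
    cudaMemcpyToSymbol(c_aSecondConstValue, &aSecondConstValue, sizeof(int));
}

extern "C" void computeOnGPU(double *hostData, size_t bytes)
{
    cudaMemcpy(d_buffer, hostData, bytes, cudaMemcpyHostToDevice);   // only the data moves each call
    // ... launch kernel(s) that read the constant symbols and work on d_buffer ...
    cudaMemcpy(hostData, d_buffer, bytes, cudaMemcpyDeviceToHost);
}

extern "C" void gpuTeardown()
{
    cudaFree(d_buffer);                   // free once, when the application is done
}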

Free shared memory in CUDA

Is there any application-level API available to free shared memory allocated by a CTA in CUDA? I want to reuse my CTA for another task, and before starting that task I need to clear the memory used by the previous task.
Shared memory is statically allocated at kernel launch time. You can optionally specify an unsized shared allocation in the kernel:
__global__ void MyKernel()
{
    __shared__ int fixedShared;
    extern __shared__ int extraShared[];
    ...
}
The third kernel launch parameter then specifies how much shared memory corresponds to that unsized allocation.
MyKernel<<<blocks, threads, numInts*sizeof(int)>>>( ... );
The total amount of shared memory allocated for the kernel launch is the sum of the amount declared in the kernel, plus the shared memory kernel parameter, plus alignment overhead. You cannot "free" it - it stays allocated for the duration of the kernel launch.
For kernels that go through multiple phases of execution and need to use the shared memory for different purposes, what you can do is reuse the memory with shared memory pointers - use pointer arithmetic on the unsized declaration.
Something like:
__global__ void MyKernel()
{
    __shared__ int fixedShared;
    extern __shared__ int extraShared[];
    ...
    __syncthreads();
    char *nowINeedChars = (char *) extraShared;
    ...
}
I don't know of any SDK samples that use this idiom, though the threadFenceReduction sample declares a __shared__ bool and also uses shared memory to hold the partial sums of the reduction.
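Here is a small self-contained sketch of that reuse idiom (the kernel and names are illustrative, not taken from an SDK sample): the same dynamic allocation is first used as an array of ints and, after a barrier, reinterpreted as chars.

#include <stdio.h>

__global__ void ReuseSharedKernel()
{
    extern __shared__ int extraShared[];        // sized by the third launch parameter

    // Phase 1: treat the dynamic allocation as ints
    extraShared[threadIdx.x] = threadIdx.x;
    __syncthreads();

    int myValue = extraShared[threadIdx.x];     // finish all reads of the int view
    __syncthreads();                            // barrier before the bytes are reused

    // Phase 2: reinterpret the same bytes as chars
    char *nowINeedChars = (char *)extraShared;
    nowINeedChars[threadIdx.x] = (char)(myValue & 0xff);
    __syncthreads();

    if (threadIdx.x == 0) {
        printf("first char value: %d\n", (int)nowINeedChars[0]);
    }
}

int main()
{
    const int threads = 128;
    // dynamic shared memory sized for the int view (also large enough for the char view)
    ReuseSharedKernel<<<1, threads, threads * sizeof(int)>>>();
    cudaDeviceSynchronize();
    return 0;
}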

__syncthreads not working in CUDA

I wrote a simple kernel to test the functionality of CUDA __syncthreads(). In the kernel, each thread prints an error message if the updated value is not visible to it. Ideally no thread should print the "Not Visible to me" error message, but some threads end up printing it.
Here is the kernel.
// Requires the cuPrintf library (cuPrintf.cu / cuPrintf.cuh) from the old CUDA SDK
__device__ int a = 0;
__device__ bool isItOK;   // declared here so the snippet compiles

__global__ void kernel()
{
    isItOK = false;
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        atomicAdd(&a, 1);
        __threadfence();
    }
    __syncthreads();
    if (atomicAdd(&a, 0) == 0) {
        cuPrintf("Not Visible to me\n");
    }
}

int main()
{
    cudaPrintfInit();
    kernel<<<16, 16>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
    return 0;
}
Please help me with this; it is a very simple test program, but it still does not work. Do I need to set some compiler flags?
__syncthreads() is a synchronization barrier primitive that only synchronizes threads in the same block.
CUDA has no mechanism for safely synchronizing across thread blocks.
Communication and synchronization between thread blocks is not recommended because it breaks scalability of execution across GPUs with varying numbers of multiprocessors, which is the reason for having thread blocks in the first place.
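By contrast, within a single block the pattern does work. Here is a hedged sketch (using the built-in printf rather than cuPrintf) in which a value written by thread 0 of each block is made visible to every other thread of the same block:

#include <stdio.h>

__global__ void blockVisibilityKernel()
{
    __shared__ int value;

    if (threadIdx.x == 0) {
        value = 1;              // written by a single thread of the block
    }
    __syncthreads();            // barrier: after this, the write is visible block-wide

    if (value == 0) {
        // with __syncthreads() in place, this should never print
        printf("Not visible to me (block %d, thread %d)\n", blockIdx.x, threadIdx.x);
    }
}

int main()
{
    blockVisibilityKernel<<<16, 16>>>();
    cudaDeviceSynchronize();    // flush device-side printf output
    return 0;
}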