Can cudaMemcpy accept variables from the device as its argument?

cudaMemcpy(dst, src, filesize, cudaMemcpyDeviceToHost);
Where filesize is a variable stored in device global memory.

The simple answer is no.
The size argument is passed by value, meaning the value must be known in the host code. Therefore you need a first call to cudaMemcpy() to copy the size to the host and a second call to cudaMemcpy() to perform the actual copy.
If you're using Thrust vectors you can just read the element in the host code, but that's because Thrust handles the copy for you.
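As a minimal sketch of the two-call approach: the helper name copy_with_device_size and its parameters are made up for illustration, and it assumes the size is stored on the device as a size_t.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the CUDA API).
// d_size points to a size_t in device global memory, d_src to the device
// buffer, and h_dst to a host buffer that can hold h_capacity bytes.
cudaError_t copy_with_device_size(void* h_dst, const void* d_src,
                                  const size_t* d_size, size_t h_capacity,
                                  size_t* h_size_out)
{
    assert(h_dst != nullptr && h_size_out != nullptr);

    // First copy: bring the size itself to the host.
    size_t h_size = 0;
    cudaError_t err = cudaMemcpy(&h_size, d_size, sizeof(h_size),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) return err;

    // Guard against overflowing the host buffer.
    if (h_size > h_capacity) return cudaErrorInvalidValue;
    *h_size_out = h_size;

    // Second copy: the size is now an ordinary host value.
    return cudaMemcpy(h_dst, d_src, h_size, cudaMemcpyDeviceToHost);
}
```

The first cudaMemcpy blocks until the size has arrived, so the second call always sees an up-to-date value.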

Related

Can I obtain the amount of allocated dynamic shared memory from within a kernel?

On the host side, I can save the amount of dynamic shared memory I intend to launch a kernel with, and use it. I can even pass that as an argument to the kernel. But - is there a way to get it directly from device code, without help from the host side? That is, have the code for a kernel determine, as it runs, how much dynamic shared memory it has available?
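Yes, there's a special register holding that value, named %dynamic_smem_size. You can obtain this register's value in your CUDA C/C++ code by wrapping some inline PTX in a getter function (note that the % of a special register name must be doubled inside an inline asm string):
__device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // %% emits a literal %, so the generated PTX reads the %dynamic_smem_size register
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}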
You can similarly obtain the total size of allocated shared memory (static + dynamic) from the register %total_smem_size.
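A small sketch of how such a getter might be exercised from the host. The kernel report_smem and the helper query_dynamic_smem are hypothetical names, and the 48 KB bound is only a conservative assumption about the default per-block limit.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Getter wrapping the PTX special register (note the doubled %).
__device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}

// Hypothetical kernel: thread 0 reports what the register says.
__global__ void report_smem(unsigned* out)
{
    if (threadIdx.x == 0) *out = dynamic_smem_size();
}

// Hypothetical host helper: launch with `bytes` of dynamic shared memory
// and return the value the kernel observed.
unsigned query_dynamic_smem(size_t bytes)
{
    assert(bytes <= 48 * 1024);  // assumed default per-block limit
    unsigned* d_out = nullptr;
    unsigned h_out = 0;
    cudaMalloc((void**)&d_out, sizeof(unsigned));
    report_smem<<<1, 32, bytes>>>(d_out);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_out, d_out, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling query_dynamic_smem(1024) should report the size requested in the third launch parameter, without the host passing it as a kernel argument.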

Is there any way to dynamically allocate constant memory?

I'm confused about copying arrays to constant memory.
According to the programming guide, there's at least one way to allocate constant memory and use it to store an array of values. This is called static memory allocation:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
Also according to the programming guide, we can use:
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
It looks like dynamic constant memory allocation is used, but I'm not sure about it. And also no qualifier __constant__ is used here.
So here are some questions:
Is this pointer stored in constant memory?
Is the memory assigned to this pointer stored in constant memory too?
Is this pointer constant, so that neither device nor host functions are allowed to change it? Is changing the values of the array prohibited or not? If changing the values is allowed, does that mean constant memory is not used to store them?
The developer can declare up to 64K of constant memory at file scope. In SM 1.0, the constant memory used by the toolchain (e.g. to hold compile-time constants) was separate and distinct from the constant memory available to developers, and I don't think this has changed since. The driver dynamically manages switching between different views of constant memory as it launches kernels that reside in different compilation units. Although you cannot allocate constant memory dynamically, this pattern suffices because the 64K limit is not system-wide; it only applies per compilation unit.
Use the first pattern cited in your question: statically declare the constant data and update it with cudaMemcpyToSymbol before launching kernels that reference it. In the second pattern, the pointer as declared with __device__ resides in global memory; even if it were declared __constant__, only reads of the pointer itself would go through constant memory. Reads through the pointer are serviced by the normal L1/L2 cache hierarchy, and the array it points to can be modified freely.
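To make the first pattern concrete, here is a hedged sketch. The names coeffs, scale, and scale_with_constants are invented for illustration; the only real API calls are cudaMemcpyToSymbol, cudaMalloc, cudaMemcpy, and cudaFree.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

#define NUM_COEFFS 256

// Statically declared constant memory, as in the first pattern.
__constant__ float coeffs[NUM_COEFFS];

// Hypothetical kernel: multiplies each element by a coefficient
// read through the constant cache.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * coeffs[i % NUM_COEFFS];
}

// Hypothetical host helper. h_coeffs must hold NUM_COEFFS floats.
cudaError_t scale_with_constants(const float* h_coeffs, const float* h_in,
                                 float* h_out, int n)
{
    assert(n > 0);
    // Update the constant data before launching the kernel that reads it.
    cudaError_t err = cudaMemcpyToSymbol(coeffs, h_coeffs,
                                         NUM_COEFFS * sizeof(float));
    if (err != cudaSuccess) return err;

    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

The size of coeffs is fixed at compile time; only its contents change between launches.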

Passing value from device memory as kernel parameter in CUDA

I'm writing a CUDA application that has a step where the variance of some complex-valued input data is computed, and then that variance is used to threshold the data. I've got a reduction kernel that computes the variance for me, but I'm not sure if I have to pull the value back to the host to pass it to the thresholding kernel or not.
Is there a way to pass the value directly from device memory?
You can use a __device__ variable to hold the variance value in-between kernel calls.
Put this before the definition of the kernels that use it:
__device__ float my_variance = 0.0f;
Variables defined this way can be used by any kernel executing on the device (without requiring that they be explicitly passed as a kernel function parameter) and persist for the lifetime of the context, i.e. beyond the lifetime of any single kernel call.
It's not entirely clear from your question, but you can also define an array of data this way.
__device__ float my_variance[32] = {0.0f};
Likewise, allocations created by cudaMalloc live for the duration of the application/context (or until an appropriate cudaFree is encountered) and so there is no need to "pull back the data" to the host if you want to use it in a successive kernel:
float *d_variance;
cudaMalloc((void **)&d_variance, sizeof(float));
my_reduction_kernel<<<...>>>(..., d_variance, ...);
my_thresholding_kernel<<<...>>>(..., d_variance, ...);
Any value set in *d_variance by the reduction kernel above will be properly observed by the thresholding kernel.
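A compact sketch of that flow, under stated assumptions: toy_variance_kernel is a single-thread stand-in for a real reduction, real-valued data replaces the complex input, and the thresholding rule (zero out values whose square does not exceed the variance) is invented purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Toy stand-in for a real reduction: one thread computes the
// population variance and leaves it in device memory.
__global__ void toy_variance_kernel(const float* data, int n, float* variance)
{
    float mean = 0.0f;
    for (int i = 0; i < n; ++i) mean += data[i];
    mean /= n;
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += (data[i] - mean) * (data[i] - mean);
    *variance = acc / n;
}

// Reads the variance straight from device memory; the host never sees it.
__global__ void threshold_kernel(float* data, int n, const float* variance)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] * data[i] <= *variance) data[i] = 0.0f;
}

// Hypothetical host helper that thresholds h_data in place.
cudaError_t threshold_by_variance(float* h_data, int n)
{
    assert(n > 0);
    float* d_data = nullptr;
    float* d_variance = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void**)&d_data, bytes);
    cudaMalloc((void**)&d_variance, sizeof(float));
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Both kernels run in order on the default stream.
    toy_variance_kernel<<<1, 1>>>(d_data, n, d_variance);
    threshold_kernel<<<(n + 255) / 256, 256>>>(d_data, n, d_variance);

    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_variance);
    return err;
}
```

Only the thresholded data is copied back; the variance itself stays on the device between the two launches.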

why do we have to pass a pointer to a pointer to cudaMalloc

The following code is widely used for GPU global memory allocation:
float *M;
cudaMalloc((void**)&M,size);
I wonder why we have to pass a pointer to a pointer to cudaMalloc, and why it was not designed like:
float *M;
cudaMalloc((void*)M,size);
Thanks for any plain descriptions!
cudaMalloc needs to write the value of the pointer to M (not *M), so M must be passed by reference.
Another way would be to return the pointer in the classic malloc fashion. Unlike malloc, however, cudaMalloc returns an error status, like all CUDA runtime functions.
To explain the need in a little more detail:
Before the call to cudaMalloc, M points... anywhere; its value is undefined. After the call to cudaMalloc you want a valid array to be present at the memory location it points to. One could naïvely say "then just allocate the memory at this location", but that's of course not possible in general: the undefined address will normally not even be inside valid memory. cudaMalloc needs to be able to choose the location. But if the pointer is passed by value, there's no way to tell the caller where.
In C++, one could make the signature
template<typename PointerType>
cudaError_t cudaMalloc(PointerType& ptr, size_t);
where passing ptr by reference allows the function to change the location, but since cudaMalloc is part of the CUDA C API, this is not an option. The only way to pass something as modifiable in C is to pass a pointer to it. Since the object is itself a pointer, what you need to pass is a pointer to a pointer.
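The difference can be seen with two hypothetical wrappers (alloc_by_value and alloc_by_pointer are made-up names, not CUDA functions). This is only a sketch of the argument above:

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// p is a local copy of the caller's pointer. cudaMalloc updates the copy,
// so the caller never learns the address.
cudaError_t alloc_by_value(float* p, size_t bytes)
{
    assert(bytes > 0);
    cudaError_t err = cudaMalloc((void**)&p, bytes);
    // Free immediately: nobody outside this function can reach the memory.
    if (err == cudaSuccess) cudaFree(p);
    return err;
}

// p points at the caller's pointer, so cudaMalloc can write the new
// device address into it.
cudaError_t alloc_by_pointer(float** p, size_t bytes)
{
    assert(p != nullptr && bytes > 0);
    return cudaMalloc((void**)p, bytes);
}
```

After float* M = nullptr; alloc_by_value(M, 64); the caller's M is still null, whereas alloc_by_pointer(&M, 64) leaves M holding a device address that must later be released with cudaFree.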

Variable Sizes Array in CUDA

Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About printf: the problem is that it only works on devices of compute capability 2.0 and higher. For older devices there is an alternative, cuPrintf, that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you write extern __shared__ int array[];
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
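Put together, the two steps might look like the sketch below. The kernel reverse_in_block and the helper reverse_on_device are illustrative names, and the example assumes a single block of at most 1024 threads.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// The array has no size here; it is set by the third launch parameter.
__global__ void reverse_in_block(int* data, int n)
{
    extern __shared__ int array[];
    int t = threadIdx.x;
    if (t < n) array[t] = data[t];
    __syncthreads();  // all loads must finish before anyone reads back
    if (t < n) data[t] = array[n - 1 - t];
}

// Hypothetical host helper: reverses h_data in place using one block.
cudaError_t reverse_on_device(int* h_data, int n)
{
    assert(n > 0 && n <= 1024);  // one thread per element, single block
    int* d_data = nullptr;
    size_t bytes = n * sizeof(int);
    cudaError_t err = cudaMalloc((void**)&d_data, bytes);
    if (err != cudaSuccess) return err;
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    reverse_in_block<<<1, n, bytes>>>(d_data, n);

    err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err;
}
```

The shared array is sized per launch, so the same kernel works for any n that fits in one block's shared memory.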
If your arrays can be large, one solution is a two-pass scheme: a first kernel computes the required array sizes and stores them in an array; after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads; then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because the arrays would be allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
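Here is a rough sketch of the two-pass idea. It swaps the array of pointers for a single pooled allocation plus per-thread offsets, which is a variation and not what the answer literally describes; the kernels, the helper run_two_pass, and the toy sizing rule (thread i needs i + 1 ints) are all invented for illustration.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Pass 1: each thread reports how many ints it needs (toy rule).
__global__ void compute_sizes(int* sizes, int nthreads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nthreads) sizes[i] = i + 1;
}

// Pass 2: thread i fills its slice [offsets[i], offsets[i + 1]) of the pool.
__global__ void fill_slices(int* pool, const int* offsets, int nthreads)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nthreads) return;
    for (int k = offsets[i]; k < offsets[i + 1]; ++k) pool[k] = i;
}

// Hypothetical host driver: returns the filled pool.
std::vector<int> run_two_pass(int nthreads)
{
    assert(nthreads > 0 && nthreads <= 256);  // single block of 256 threads
    int* d_sizes = nullptr;
    cudaMalloc((void**)&d_sizes, nthreads * sizeof(int));
    compute_sizes<<<1, 256>>>(d_sizes, nthreads);

    std::vector<int> sizes(nthreads);
    cudaMemcpy(sizes.data(), d_sizes, nthreads * sizeof(int),
               cudaMemcpyDeviceToHost);

    // Host-side exclusive prefix sum turns sizes into slice offsets.
    std::vector<int> offsets(nthreads + 1, 0);
    for (int i = 0; i < nthreads; ++i) offsets[i + 1] = offsets[i] + sizes[i];
    int total = offsets[nthreads];

    int* d_pool = nullptr;
    int* d_offsets = nullptr;
    cudaMalloc((void**)&d_pool, total * sizeof(int));
    cudaMalloc((void**)&d_offsets, (nthreads + 1) * sizeof(int));
    cudaMemcpy(d_offsets, offsets.data(), (nthreads + 1) * sizeof(int),
               cudaMemcpyHostToDevice);
    fill_slices<<<1, 256>>>(d_pool, d_offsets, nthreads);

    std::vector<int> pool(total);
    cudaMemcpy(pool.data(), d_pool, total * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_sizes);
    cudaFree(d_pool);
    cudaFree(d_offsets);
    return pool;
}
```

A single pooled allocation avoids one cudaMalloc per thread; an array of separately allocated pointers would work the same way at the cost of more host calls.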