2D array use in the kernel - CUDA

In the CUDA examples I have read, I don't find any direct use of 2D array notation [][] in kernel code when the array is in global memory, unlike when it is in shared memory (e.g. matrix multiplication). Is there any performance related reason behind this?
Also, I read in an old thread that the following code is incorrect:
int **d_array;
cudaMalloc( (void**)&d_array , 5 * sizeof(int*) );
for(int i = 0 ; i < 5 ; i++)
{
    cudaMalloc((void **)&d_array[i],10 * sizeof(int));
}
According to the author, "once the main thread assigns memory on the device the main thread loses access to it, that is, it can only be accessed within kernels. So, When you try call cudaMalloc on the 2nd dimension of the array it throws an "Access violation writing location" exception."
I don't understand what the author really means; actually, I find the above code correct.
Thank you for your help
SS

Is there any performance related reason behind this?
Yes, a doubly-subscripted array normally requires an extra pointer lookup, i.e. an extra memory read, before the data referenced can be accessed. By using "simulated" 2D access:
int val = d[i*columns+j];
instead of:
int val = d[i][j];
then only a single memory read is required. The index is computed directly instead of first reading a row pointer. GPUs generally have much more compute throughput than memory bandwidth, so trading an extra memory read for a multiply-add is usually a good deal.
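To make the index arithmetic concrete, here is a minimal host-only C++ sketch of the flattened row-major access (it should also compile unchanged as CUDA host code). The helper names flat_index and make_matrix are made up for illustration; inside a kernel the same expression would be applied to a pointer obtained from a single cudaMalloc call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps the 2D coordinate (i, j) to an offset in a
// row-major array that has `columns` elements per row.
inline std::size_t flat_index(std::size_t i, std::size_t j, std::size_t columns)
{
    assert(j < columns);  // j must stay inside the row
    return i * columns + j;
}

// Hypothetical helper: builds a rows x columns matrix in ONE contiguous
// allocation, where element (i, j) holds the value 10 * i + j.
inline std::vector<int> make_matrix(std::size_t rows, std::size_t columns)
{
    std::vector<int> d(rows * columns);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < columns; ++j)
            d[flat_index(i, j, columns)] = static_cast<int>(10 * i + j);
    return d;
}
```

Reading d[flat_index(i, j, columns)] touches memory once, whereas d[i][j] on an array of row pointers first loads the row pointer and then the element.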
I don't understand what the author really means; actually, I find the above code correct
The code is in fact incorrect.
This operation:
cudaMalloc( (void**)&d_array , 5 * sizeof(int*) );
creates a single contiguous allocation on the device, large enough to store 5 pointers, and stores the starting address of that allocation in the host memory location associated with d_array. That is what cudaMalloc does: it creates a device allocation of the requested length and stores the starting device address of that allocation in the provided host variable.
So let's deconstruct what is being asked for here:
cudaMalloc((void **)&d_array[i],10 * sizeof(int));
This says, create a device allocation of length 10*sizeof(int) and store the starting address of it in the location d_array[i]. But the location associated with d_array[i] is on the device, not the host, and requires dereferencing of the d_array pointer to actually access it, to store something there.
cudaMalloc does not do this. You cannot ask for the starting address of the device allocation to be stored in device memory. You can only ask for the starting address of the device allocation to be stored in host memory.
&d_array
is a pointer to host memory.
&d_array[i]
is a pointer to device memory.
The canonical 2D array worked example is now referenced in the cuda tag info link.

Related

Is there any way to dynamically allocate constant memory? CUDA

I'm confused about copying arrays to constant memory.
According to the programming guide, there is at least one way to allocate constant memory and use it to store an array of values. This is called static memory allocation:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
According to the programming guide again, we can use:
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
It looks like dynamic constant memory allocation is used, but I'm not sure about it. And also no qualifier __constant__ is used here.
So here are some questions:
Is this pointer stored in constant memory?
Is the memory this pointer points to stored in constant memory too?
Is this pointer constant? It's not allowed to change that pointer from a device or host function, but is changing the values of the array prohibited or not? If changing the values of the array is allowed, does that mean constant memory is not used to store these values?
The developer can declare up to 64 KB of constant memory at file scope. In SM 1.0, the constant memory used by the toolchain (e.g. to hold compile-time constants) was separate and distinct from the constant memory available to developers, and I don't think this has changed since. The driver dynamically manages switching between different views of constant memory as it launches kernels that reside in different compilation units. Although you cannot allocate constant memory dynamically, this pattern suffices because the 64 KB limit is not system-wide; it applies only per compilation unit.
Use the first pattern cited in your question: statically declare the constant data and update it with cudaMemcpyToSymbol before launching kernels that reference it. In the second pattern, nothing is allocated in constant memory: cudaMalloc returns ordinary global memory, and the pointer as declared (__device__) also lives in global memory. Even if the pointer were declared __constant__, only reads of the pointer itself would go through constant memory; reads using the pointer would be serviced by the normal L1/L2 cache hierarchy.

Writing from Device to Host and notifying the host

Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
The data size is too large to fit on the device memory and wait until the end.
The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take and the CUDA concepts and functions I have to use to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk of data is ready for host processing?
N.B. Threads do not share any generated data with each other; they run independently. So, as far as I know (and please correct me if I am wrong), the concepts of blocks, threads and warps do not affect the question. In other words, if it helps the answer, I am free to alter their combination.
Below is sample code that shows what I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cuda_runtime_api.h>
__global__ void Kernel(size_t length, float* hResult)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Processing multiple data chunks
    for(int i = 0;i < length;i++)
    {
        // Once this is assigned, I don't need it on the device anymore.
        hResult[i + (tid * length)] = i * 100;
    }
}
void main()
{
    size_t length = 10;
    size_t threads = 2;
    float* hResult;
    // An array that will hold all data from all threads
    cudaMallocHost((void**)&hResult, threads * length * sizeof(float));
    Kernel<<<threads,1>>>(length, hResult);
    // I DO NOT want to wait to the end and block to get the data
    cudaError_t error = cudaDeviceSynchronize();
    if (error != cudaSuccess) { throw error; }
    for(int i = 0;i < threads * length;i++)
    {
        printf("%f\n", hResult[i]);
    }
    cudaFreeHost(hResult);
    system("pause");
}
Here is one possible approach. At a high level, on the device:
You'll need to write the data to either device global memory (allocated previously with cudaMalloc) or else directly to host memory (allocated previously with cudaHostAlloc). This memory should be accessed via a volatile pointer.
You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps
You'll then want to issue a __threadfence() call (if you're using device global memory) or a __threadfence_system() call (if using host memory) prior to the following steps
Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a volatile pointer.
Optionally issue another __threadfence() or __threadfence_system() call
If device memory is used, then on the receiving end both regions (payload and "mailbox") should again be accessed using a volatile pointer.
On the host:
Before launching the kernel, the host will need to set the mailbox location to a default value.
After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
Once the specific value is seen, indicating that the data is ready, the host can consume the data
Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
Both the mailbox location and the payload region should be accessed by the host thread using a volatile pointer.
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
EDIT: responding to a question in the comments.
It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
while (more_frames)
    process a frame
    signal host
    increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
while (more_frames)
    call kernel to process frame
    consume frame
    increment frame pointer
By doing this, the completion of each kernel launch provides the explicit synchronization needed to know when the frame processing is complete and the data can be consumed.
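The shape of the refactored loop can be sketched in host-only C++. This is only an illustration of the control flow, not a CUDA implementation: process_frame is a hypothetical stand-in for "launch the kernel and copy the result back" (it just doubles every value), and consume_frame and run_all are likewise made-up names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one kernel launch plus a blocking copy back to
// the host: when it returns, the whole frame is known to be finished.
inline std::vector<float> process_frame(const std::vector<float>& frame)
{
    std::vector<float> result(frame.size());
    for (std::size_t i = 0; i < frame.size(); ++i)
        result[i] = 2.0f * frame[i];  // placeholder "algorithm"
    return result;
}

// Hypothetical consumer: here it simply sums the processed frame.
inline float consume_frame(const std::vector<float>& result)
{
    float sum = 0.0f;
    for (float v : result)
        sum += v;
    return sum;
}

// The loop from the refactored pseudocode, living in host code.
inline std::vector<float> run_all(const std::vector<std::vector<float>>& frames)
{
    std::vector<float> consumed;
    for (const auto& frame : frames)  // while (more_frames)
        consumed.push_back(consume_frame(process_frame(frame)));
    return consumed;
}
```

Because each call returns only after its frame is complete, no mailbox or fence is needed between producing and consuming a frame.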

CUDA allocation alignment is 256 bytes - seriously?

The "CUDA C Programming Guide 5.0" (p. 73) says: "Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes". I do not know the exact meaning of this sentence. Could anyone show me an example? Many thanks.
A derivative question:
So, what about allocating a one-dimensional array of basic elements (like int) or user-defined ones? The starting address of the array will be a multiple of 256 bytes, while the address of each element in the array is not necessarily a multiple of 256 bytes?
Pointers allocated by any of the CUDA runtime's device memory allocation functions, e.g. cudaMalloc or cudaMallocPitch, are guaranteed to be 256-byte aligned, i.e. the address is a multiple of 256.
Consider the following example:
char *ptr1, *ptr2;
int bytes = 1;
cudaMalloc((void**)&ptr1,bytes);
cudaMalloc((void**)&ptr2,bytes);
Suppose the address returned in ptr1 is some multiple of 256. The address returned in ptr2 will then also be a multiple of 256, so even though only one byte was requested each time, the two addresses are at least 256 bytes apart (typically ptr2 is at least ptr1 + 256).
This is a restriction imposed by the device on which the memory is being allocated. Pointers are mostly aligned for performance reasons. (Someone from NVIDIA should be able to tell whether there are other reasons as well.)
Important:
Pointer alignment is not always 256. On my device (GTX460M), it is 512. You can get the device pointer alignment from the cudaDeviceProp::textureAlignment field.
Alignment of pointers is also a requirement for binding the pointer to textures.
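The derivative question about array elements can be checked with a little address arithmetic. The following host-only C++ sketch uses made-up helper names (is_aligned, align_up, element_address) and a made-up base address; it does not call the CUDA runtime, it only shows what "aligned to 256 bytes" means for a base pointer versus the elements behind it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `address` is a multiple of `alignment` (must be a power of two).
inline bool is_aligned(std::uintptr_t address, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}

// Rounds `bytes` up to the next multiple of `alignment` (a power of two).
inline std::size_t align_up(std::size_t bytes, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Address of element `i` of an int array that starts at `base`.
inline std::uintptr_t element_address(std::uintptr_t base, std::size_t i)
{
    return base + i * sizeof(int);
}
```

With a hypothetical base address of 0x1000 (a multiple of 256), element 0 is 256-byte aligned, element 1 is not, and the next aligned element is at index 256 / sizeof(int). So only the start of the allocation carries the guarantee; individual elements are aligned only to their own size.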

How to share a common value between threads in a given block?

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
    __shared__ int nIter;
    if( threadIdx.x == 0 )
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();
    for( int i=0; i < nIter; i++ )
        ...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1. If the amount of data you access per block is relatively small, then after the initial compulsory miss in each thread block, every later access should hit in the cache.
You also won't need to use any shared memory or transfer from global to shared memory.
If my info is up-to-date, the shared memory is the second fastest memory, second only to the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).

Variable Sizes Array in CUDA

Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About the printf: the problem is that it only works on compute capability 2.x and above. There is an alternative, cuPrintf, that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you write extern __shared__ int array[];
On the kernel call you pass the shared memory size in bytes as the third launch parameter, like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes, stores them in an array, then after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because it would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
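One simple way for threads to split a single buffer is an exclusive prefix sum over the sizes they request: each thread's slice starts where the previous one ends. The host-only C++ sketch below shows just that arithmetic. partition_buffer is a made-up name, and in a real kernel the sums would have to be computed cooperatively (for example with a block-wide scan) rather than in a serial loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: given the number of elements each thread wants,
// returns each thread's starting offset into one shared buffer
// (an exclusive prefix sum of the requests). Returns an empty vector
// when the requests do not fit into `capacity` elements.
inline std::vector<std::size_t> partition_buffer(
    const std::vector<std::size_t>& requested, std::size_t capacity)
{
    std::vector<std::size_t> offsets(requested.size());
    std::size_t next = 0;  // first free element in the buffer
    for (std::size_t t = 0; t < requested.size(); ++t) {
        offsets[t] = next;
        next += requested[t];
    }
    if (next > capacity)
        return {};  // does not fit: fall back to global memory
    return offsets;
}
```

For requests of 3, 0, 5 and 2 elements and a capacity of 10, the offsets are 0, 3, 3 and 8; with a capacity of 9 the same requests no longer fit.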