How to handle a device variable using OpenACC - cuda

I'm trying to optimize my code, which I have accelerated using basically OpenACC only.
Is it a good approach to insert CUDA calls such as in the example that follows?
In this case, u_device and v_device are used by the device only. Using cudaMalloc ensures that the memory is allocated on the device and not on the host as well.
int size = NVAR * sizeof(double);
// Declare pointers that will point to the memory allocated on the device.
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) v_device[nv] = V[nv][k][j][i];
PrimToCons (v_device, u_device);
#pragma acc loop
for (nv = 0; nv < NVAR; nv++) U[k][j][i][nv] = u_device[nv];
}
cudaFree(u_device);
cudaFree(v_device);
Before I would have used OpenACC and written something like this:
double* v_device = (double*)malloc(size);
double* u_device = (double*)malloc(size);
#pragma acc enter data create(u_device[:size],v_device[:size])
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
...
}
#pragma acc exit data delete(u_device[:size],v_device[:size])
Is there a way with OpenACC to avoid host memory allocation?
Another doubt I have regarding cudaMalloc is whether it is possible to call it inside the kernel, in order to make the arrays private:
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
.
.
.
cudaFree(u_device);
cudaFree(v_device);
}
Written this way, I get the error:
182, Accelerator restriction: call to 'cudaMalloc' with no acc routine information

Is there a way with OpenACC to avoid host memory allocation?
You can use cudaMalloc, but for pure OpenACC, you'd use "acc_malloc" and "acc_free". For example: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/acc_malloc.c
Note the use of the "deviceptr" clause, which indicates that the pointer is a device pointer. Though here, you're wanting to privatize these arrays, so you can keep the private clause.
I've never used a device pointer in a private clause, but I just tried it and it seems to work. Which makes sense, since all the compiler really needs is the size and type of the private array to make the private copies. In this case, since it's on the gang loop, the compiler will attempt to put the private arrays in shared memory, assuming they aren't too big to fit. I'd recommend using the triplet notation for the arrays, i.e. "private(v_device[:NVAR],...)", so the compiler knows the size.
Though I'm not sure there's much of an advantage to using device arrays here. The device memory you're allocating isn't actually going to be used; it just takes up space on the device. Device memory is often much smaller than host memory, so if you do need to waste space, it's probably better that this be on the host. Plus, having to use acc_malloc or cudaMalloc limits the portability of the code. Not that there aren't cases where using device-only memory is beneficial; I just don't think this is one of them.
Note that you can call "malloc" within device code, but it's not recommended. Mallocs get serialized, causing performance issues, and the default device heap is relatively small, which can lead to heap overflows. Granted, the heap size can be increased either by calling cudaDeviceSetLimit or via the environment variable "NV_ACC_CUDA_HEAPSIZE".
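For completeness, here is a minimal, self-contained sketch of the acc_malloc/acc_free approach with the triplet notation in the private clause. PrimToCons here is a hypothetical stand-in for the question's routine, and NVAR/NX are illustrative values; treat it as a pattern, not a drop-in replacement.
#include <openacc.h>
#include <stdio.h>

#define NVAR 8
#define NX   1024

// Hypothetical stand-in for the question's PrimToCons, marked as a
// sequential device routine so it can be called from the parallel loop.
#pragma acc routine seq
static void PrimToCons(const double *v, double *u)
{
    for (int nv = 0; nv < NVAR; nv++) u[nv] = 2.0 * v[nv];
}

int main(void)
{
    static double V[NVAR][NX], U[NX][NVAR];
    for (int nv = 0; nv < NVAR; nv++)
        for (int i = 0; i < NX; i++) V[nv][i] = nv + i;

    // Device-only scratch space: allocated with acc_malloc, never touched by the host.
    double *v_device = (double *)acc_malloc(NVAR * sizeof(double));
    double *u_device = (double *)acc_malloc(NVAR * sizeof(double));

    // Triplet notation in the private clause so the compiler knows the size
    // of the private copies it has to create.
    #pragma acc parallel loop copyin(V) copyout(U) private(v_device[:NVAR], u_device[:NVAR])
    for (int i = 0; i < NX; i++) {
        for (int nv = 0; nv < NVAR; nv++) v_device[nv] = V[nv][i];
        PrimToCons(v_device, u_device);
        for (int nv = 0; nv < NVAR; nv++) U[i][nv] = u_device[nv];
    }

    acc_free(u_device);
    acc_free(v_device);
    printf("U[0][1] = %f\n", U[0][1]);
    return 0;
}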

Related

CUDA shared memory under the hood questions

I have several questions regarding to CUDA shared memory.
First, as mentioned in this post, shared memory may be declared in two different ways:
Either as dynamically allocated shared memory, where the size is passed at the kernel launch:
// Launch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
which is then used inside the kernel as follows:
extern __shared__ int s[];
Or as static shared memory, which is declared inside the kernel like the following:
__shared__ int s[64];
Both are used for different reasons; however, which one is better and why?
Second, I'm running a multi-block kernel with 256 threads per block. I'm using static shared memory in a global kernel and in a device function, both of which use shared memory. An example is given:
__device__ float reduce(float data, unsigned int tid); // forward declaration
__global__ void startKernel(float* p_d_array)
{
    __shared__ double matA[3*3];
    float a1 = 0;
    float a2 = 0;
    float a3 = 0;
    float b = p_d_array[threadIdx.x];
    a1 += reduce(b, threadIdx.x);
    a2 += reduce(b, threadIdx.x);
    a3 += reduce(b, threadIdx.x);
    // continue...
}
__device__ float reduce(float data, unsigned int tid)
{
    __shared__ float s_data[256];
    // do reduce ...
}
I'd like to know how the shared memory is allocated in such a case. I presume each block receives its own shared memory.
What happens when block #0 enters the reduce function?
Is the shared memory allocated in advance of the function call?
I call the reduce device function three times. In that case, theoretically, in block #0, threads [0,127] may still be executing the first reduce call (delayed due to hard work), while threads [128,255] may be operating on the second reduce call. In this case, I'd like to know if both reduce calls are using the same shared memory,
even though they come from two different function calls?
On the other hand, is it possible that a single block allocates 3*256*sizeof(float) of shared memory for the function calls? That seems superfluous in CUDA terms, but I still want to know how CUDA operates in such a case.
Third, is it possible to gain higher performance from shared memory through compiler optimization, by using
const float* p_shared;
or the __restrict__ keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically": in either case it's just a kernel launch parameter, be it set by your code or by code generated by the compiler.
Re: the 2nd question, the compiler will sum the shared memory requirements of the kernel function and the functions called by the kernel.
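For reference, a minimal sketch contrasting the two declaration styles, loosely following the reversal example the question quotes (the kernel bodies and sizes here are illustrative):
#include <cstdio>

// Static shared memory: the size (64 ints) is fixed at compile time.
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];
    int t = threadIdx.x;
    if (t < n) {
        s[t] = d[t];
        __syncthreads();
        d[t] = s[n - t - 1];
    }
}

// Dynamic shared memory: the size comes from the third launch parameter.
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    int t = threadIdx.x;
    if (t < n) {
        s[t] = d[t];
        __syncthreads();
        d[t] = s[n - t - 1];
    }
}

int main()
{
    const int n = 64;
    int h[n], *d_d;
    for (int i = 0; i < n; i++) h[i] = i;
    cudaMalloc(&d_d, n * sizeof(int));

    cudaMemcpy(d_d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    staticReverse<<<1, n>>>(d_d, n);                   // no size argument needed

    cudaMemcpy(d_d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    dynamicReverse<<<1, n, n * sizeof(int)>>>(d_d, n); // size passed at launch

    cudaMemcpy(h, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d\n", h[0]); // expect 63
    cudaFree(d_d);
    return 0;
}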

std::vector to array in CUDA

Is there a way to convert a 2D vector into an array to be able to use it in CUDA kernels?
It is declared as:
vector<vector<int>> information;
I want to cudaMalloc it and copy it from host to device; what would be the best way to do it?
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice);
In a word: no, there isn't. The CUDA API doesn't support deep copying, and it doesn't know anything about std::vector either. If you insist on having a vector of vectors as the host source, it will require doing something like this:
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
int *dst = d_information;
for (std::vector<std::vector<int> >::iterator it = information.begin() ; it != information.end(); ++it) {
    int *src = &((*it)[0]);
    size_t sz = it->size();
    cudaMemcpy(dst, src, sizeof(int)*sz, cudaMemcpyHostToDevice);
    dst += sz;
}
[disclaimer: written in browser, not compiled or tested. Use at own risk]
This would copy the host memory to an allocation in GPU linear memory, requiring one copy for each vector. If the vector of vectors is a "jagged" array, you will want to store an index of the row offsets somewhere for the GPU to use as well.
As far as I understand, a vector of vectors does not need to reside in contiguous memory, i.e. the inner vectors can be scattered around the heap.
Depending on the amount of memory you need to transfer, I would do one of two things:
Reorder your memory to be a single vector, and then use your cudaMemcpy (see the sketch below).
Create a series of cudaMemcpyAsync calls, where each copy handles a single vector in your vector of vectors, and then synchronize.
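A minimal sketch of the first option, flattening the jagged 2D vector into one contiguous buffer plus a row-offset index (the variable names are illustrative):
#include <vector>
#include <cuda_runtime.h>

int main()
{
    std::vector<std::vector<int>> information = {{1, 2, 3}, {4, 5}, {6, 7, 8, 9}};

    // Flatten into one contiguous host buffer, remembering where each row starts.
    std::vector<int> flat;
    std::vector<int> offsets;
    for (const auto &row : information) {
        offsets.push_back(static_cast<int>(flat.size()));
        flat.insert(flat.end(), row.begin(), row.end());
    }

    // One allocation and one copy for the data, plus one pair for the offsets.
    int *d_information = nullptr, *d_offsets = nullptr;
    cudaMalloc(&d_information, flat.size() * sizeof(int));
    cudaMalloc(&d_offsets, offsets.size() * sizeof(int));
    cudaMemcpy(d_information, flat.data(), flat.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    // Kernels can now locate row r as d_information + d_offsets[r].

    cudaFree(d_offsets);
    cudaFree(d_information);
    return 0;
}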

Is this way of allocating a device object "correct"?

So I asked a question before about how to allocate an object on the device directly, instead of the "normal" way:
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
class Particle
{
public:
    int *data;
    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};
__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}
__global__ void test2(Particle *p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}
int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);
    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);
    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();
    printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine!
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point of creating a pointer on host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of cuda (cuda 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it) but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
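A minimal sketch of that check (the error-reporting style here is illustrative):
Particle *q = (Particle *)malloc(sizeof(Particle));
cudaError_t err = cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    // Expected: p_addr[0] came from device-side new, i.e. the device heap,
    // which host-side cudaMemcpy cannot access.
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}
free(q);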
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1), but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and the compiler warning.

Copying whole global memory buffer many times to shared memory buffer

I have a buffer in global memory that I want to copy into shared memory for each block so as to speed up my read-only access. Each thread in each block will use the whole buffer, at different positions, concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
    extern __shared__ int s_array[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if( idx < N )
    {
        ...?
    }
}
The first point to make is that shared memory is limited to a maximum of either 16kb or 48kb per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only have the scope and lifetime of the block they are associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into the same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistent results. It is probably better to think of shared memory as a small, block-scope, fully programmer-managed L1 cache than as some sort of faster version of global memory.
With those points out of the way, a kernel which loads successive segments of a large input array, processes them and then writes some per-thread final result back into global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
    __shared__ int s_array[blocksize];
    int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);
    for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
        if( pos < N ) {
            s_array[threadIdx.x] = globalMemArray[pos];
        }
        __syncthreads();
        // Calculations using partial buffer contents
        .......
        __syncthreads();
    }
    // write final per thread result to output
    globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance of performing optimizations when the shared memory array size is known at compile time (perhaps, in the worst case, there could be selection between different kernel instances done at run time).
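A host-side launch of such a templated kernel might look like the following sketch, where the grid dimension and N are placeholders:
const int blocksize = 256;
const int N = 1 << 20;
const int nblocks = 64;
int *d_in, *d_out;
cudaMalloc(&d_in,  N * sizeof(int));
cudaMalloc(&d_out, nblocks * blocksize * sizeof(int)); // one result per thread

// Instantiate the template with the block size so the shared array size is
// fixed at compile time, and launch with a matching blockDim.
foo<blocksize><<<nblocks, blocksize>>>(d_in, d_out, N);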
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.

memset in cuda that allows to set values within kernel

I am making several cudaMemset calls in order to set my values to 0, as below:
void allocateByte( char **gStoreR, const int byte){
    char **cStoreR = (char **)malloc(N * sizeof(char*));
    for( int i = 0; i < N; i++){
        char *c;
        cudaMalloc((void**)&c, byte*sizeof(char));
        cudaMemset(c, 0, byte);
        cStoreR[i] = c;
    }
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
}
However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?
Every cudaMemset call launches a kernel, so if N is large and byte is small, then you will have a lot of kernel launch overhead slowing down the code. There is no device side memset, so the solution would be to write a kernel which traverses the allocations and zeros the storage in a single launch.
As an aside, I would strongly recommend against using this sort of array-of-pointers structure in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome using a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
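A minimal sketch of that single-allocation alternative (allocateFlat and element are hypothetical names, and the row-pitch indexing is illustrative):
#include <cuda_runtime.h>

// One flat allocation replaces the N separate per-row allocations;
// row i then starts at offset i * byte, so no pointer indirection is needed.
char *allocateFlat(const int N, const int byte)
{
    char *gStore = nullptr;
    cudaMalloc((void **)&gStore, (size_t)N * byte);
    cudaMemset(gStore, 0, (size_t)N * byte);   // one memset zeroes everything
    return gStore;
}

// Device-side indexing helper: element j of row i.
__device__ char &element(char *gStore, int byte, int i, int j)
{
    return gStore[(size_t)i * byte + j];
}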