#pragma acc host_data use_device() with complex variables - cuda

I need to pass an element of an array of pointers to buffers to #pragma acc host_data use_device():
static double *send_bufL[3];
static double *recv_bufL[3];
send_bufL[IDIR] = ARRAY_1D(NVAR*grid->nghost[IDIR]*nx2*nx3, double);
recv_bufL[IDIR] = ARRAY_1D(NVAR*grid->nghost[IDIR]*nx2*nx3, double);
#pragma acc enter data copyin(send_bufL[IDIR:1][:NVAR*grid->nghost[IDIR]*nx2*nx3], \
                              recv_bufL[IDIR:1][:NVAR*grid->nghost[IDIR]*nx2*nx3])
#pragma acc parallel loop collapse(4) present(d, grid, send_bufL[1:1][:NVAR*grid->nghost[JDIR]*nx1*nx3])
for (nv = 0; nv < NVAR; nv++){
  for (k = kbeg; k <= kend; k++){
    for (i = ibeg; i <= iend; i++){
      for (j = 0; j < grid->nghost[JDIR]; j++){
        index = nv*nx3*nx1*grid->nghost[JDIR] + (k-kbeg)*nx1*grid->nghost[JDIR] + (i-ibeg)*grid->nghost[JDIR] + j;
        send_bufL[JDIR][index] = d->Vc[nv][k][jbeg+j][i];
}}}}
count = NVAR*nx3*nx2*nghost;
#pragma acc host_data use_device(send_bufL, recv_bufL)
{
  MPI_Isend (send_bufL[IDIR], count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv (recv_bufL[IDIR], count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[1]);
}
Written like this, with only send_bufL and recv_bufL in the use_device clause, I get:
[marco-Inspiron-7501:41130:0:41130] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7f4fafa608)
[marco-Inspiron-7501:41131:0:41131] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd697afa608)
If I try to add the array dimensions to the clause, I get compilation errors. What can I do?
I should add that, apparently, for some elements of the array of pointers there is no need to use #pragma acc host_data use_device() at all: the compiler seems able to use the device buffers correctly. For other elements, however, this doesn't work and the host buffers are used instead, generating wrong results.

The problem is with "send_bufL[IDIR]" and "recv_bufL[IDIR]". Inside the host_data region these resolve to device pointers, so indexing them on the host dereferences device memory and gives a seg fault.
I'm thinking that the best solution here would be to use temporary pointers to the correct elements of the buffer arrays. Something like:
double * sbtmp;
double * rbtmp;
...
sbtmp = send_bufL[IDIR];
rbtmp = recv_bufL[IDIR];
#pragma acc host_data use_device(sbtmp, rbtmp)
{
  MPI_Isend (sbtmp, count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[0]);
  MPI_Irecv (rbtmp, count, MPI_DOUBLE, nrnks[IDIR][0], 0, MPI_COMM_WORLD, &req[1]);
}
If this doesn't work, please post a small reproducing example and I'll see if we can find a solution.
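In the meantime, here is a small self-contained sketch of the temp-pointer pattern. Assumptions of mine, not from the question: a CUDA-aware MPI implementation, a PGI/NVHPC-style OpenACC compiler, two or more ranks; N and the ring exchange are placeholders, not the original application code.

/* Sketch only -- assumes CUDA-aware MPI and an OpenACC compiler,
   built with e.g. "mpicc -acc sketch.c". */
#include <mpi.h>
#include <stdlib.h>
#define N 1024

int main(int argc, char **argv) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  static double *send_buf[3], *recv_buf[3];
  send_buf[0] = (double *) malloc(N * sizeof(double));
  recv_buf[0] = (double *) malloc(N * sizeof(double));
  #pragma acc enter data copyin(send_buf[0:1][:N], recv_buf[0:1][:N])

  /* Fill the send buffer on the device. */
  #pragma acc parallel loop present(send_buf[0:1][:N])
  for (int i = 0; i < N; i++) send_buf[0][i] = rank + i;

  /* Take host copies of the pointers BEFORE the host_data region,
     then let use_device translate them to device addresses. */
  double *sbtmp = send_buf[0];
  double *rbtmp = recv_buf[0];
  int right = (rank + 1) % size;
  int left  = (rank - 1 + size) % size;
  MPI_Request req[2];
  #pragma acc host_data use_device(sbtmp, rbtmp)
  {
    MPI_Isend(sbtmp, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(rbtmp, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);
  }
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

  #pragma acc exit data copyout(recv_buf[0:1][:N])
  MPI_Finalize();
  return 0;
}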

Related

Nested Directives in OpenACC

I'm trying to use the nested-directives feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla K40c, and my OpenACC compiler is PGI version 15.7.
My code is quite simple. When I try to compile the following code, the compiler returns these messages:
PGCC-S-0155-Illegal context for pragma: acc parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors
My code structure:
#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>
  int ss = A[tid].start;
  int ee = A[tid].end;
  #pragma acc parallel loop
  for(j = ss; j < (ee + ss); j++)
  {
    // << computation >>
  }
}
I've also tried changing my code to use the routine directive, but I couldn't compile that either:
#pragma acc routine worker
void foo(...)
{
  #pragma acc parallel loop
  for(j = ss; j < (ee + ss); j++)
  {
    // << computation >>
  }
}

#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
  // << computation >>
  int ss = A[tid].start;
  int ee = A[tid].end;
  foo(...);
}
Of course, I've also tried using only routine (seq, worker, gang) without the inner parallel loop directive. That compiled, but dynamic parallelism wasn't activated:
37, Generating acc routine worker
Generating Tesla code
42, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
How am I supposed to use dynamic parallelism in OpenACC?
Although nested regions (which would presumably use dynamic parallelism) are a new feature in the OpenACC 2.0 specification, I don't believe they are implemented yet in PGI 15.7. PGI 15.7 represents a partial implementation of the OpenACC 2.0 specification.
This limitation is documented in the PGI 15.7 release notes that should ship with your PGI 15.7 compiler (pgirn157.pdf) in section 2.7 (those release notes are currently available here):
OpenACC 2.0 Missing Features
‣ The declare link directive for global data is not implemented.
‣ Nested parallelism (parallel and kernels constructs within a parallel or kernels region) is not implemented.
Based on the comments, there is some concern about #pragma acc routine worker, so here is a fully worked example of that with PGI 15.7:
$ cat t1.c
#include <stdio.h>
#include <stdlib.h>
#define D1 4096
#define D2 4096
#define OFFS 2
#pragma acc routine worker
void my_set(int *d, int len, int val){
  int i;
  for (i = 0; i < len; i++) d[i] += val+OFFS;
}
int main(){
  int i,*data;
  data = (int *)malloc(D1*D2*sizeof(int));
  for (i = 0; i < D1*D2; i++) data[i] = 1;
  #pragma acc kernels copy(data[0:D1*D2])
  for (i = 0; i < D1; i++)
    my_set(data+(i*D2), D2, 1);
  printf("%d\n", data[0]);
  return 0;
}
$ pgcc -acc -ta=tesla -Minfo=accel t1.c -o t1
my_set:
8, Generating acc routine worker
Generating Tesla code
10, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
main:
20, Generating copy(data[:16777216])
21, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
21, #pragma acc loop gang /* blockIdx.x */
$ ./t1
4
$
Note that the gang parallelism has been performed at the outer loop, and the worker parallelism has been performed in the inner (routine) loop.
This method does not depend on dynamic parallelism (instead, it relies on a partitioning of parallelism between worker at the routine level and gang at the caller level) and will not invoke dynamic parallelism.
Native use of dynamic parallelism (CDP) is not currently supported in PGI 15.7. It should be possible to call (i.e. interoperate with) other functions (e.g. CUDA, or libraries) that make use of CDP from OpenACC code, but OpenACC itself does not natively use (or support) CDP in PGI 15.7.
Try replacing "#pragma acc parallel loop" with "#pragma acc loop" inside the routine, as sketched below.
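To make that concrete, here is a sketch of the routine variant with that change (foo's signature, data, and A are placeholders of mine, not from the question): the inner loop uses an orphaned loop directive, so the worker-level partitioning happens inside the routine while gang parallelism stays at the call site.

/* Sketch only: orphaned "loop" directive inside a worker routine. */
#pragma acc routine worker
void foo(int *data, int ss, int ee)
{
  #pragma acc loop  /* worker/vector partitioning happens here */
  for (int j = ss; j < (ee + ss); j++)
  {
    // << computation on data[j] >>
  }
}

...

#pragma acc parallel loop gang
for (int i = 0; i < N; i++)
{
  // << computation >>
  int ss = A[i].start;
  int ee = A[i].end;
  foo(data, ss, ee);
}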

CUDA Dynamic Parallelism; stream synchronization from the device

I am basically looking for a way to synchronize a stream from within the device. I want to avoid using cudaDeviceSynchronize(), as it would serialize the execution of the kernels that I want to execute concurrently using streams.
More detailed description: I have written a kernel that is a stabilized bi-conjugate gradient solver. I want to launch this kernel concurrently on different data using streams.
This kernel uses cuBLAS functions, which are called from within the kernel.
One of the operations required by the solver is the calculation of a dot product of two vectors. This could be done with cublasSdot(), but as that call is synchronous, execution of kernels in different streams gets serialized. Instead of calling a dot product function, I calculate the dot product using cublasSgemv(), which is called asynchronously. The problem is that this function returns before the result is calculated. I therefore want to synchronize the stream from the device - I am looking for an equivalent of cudaStreamSynchronize() that is callable from the device.
__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, real_t * x, real_t * y) {
  float *norm; norm = new float;
  float alpha = 1.0f; float beta = 0.0f;
  cublasSgemv_v2(cublasHandle, CUBLAS_OP_N, 1, n, &alpha, x, 1, y, 1, &beta, norm, 1);
  return *norm;
}
What can I do to make sure that the result is calculated before the function returns? Of course, inserting cudaDeviceSynchronize() works, but as I mentioned, it serializes the execution of my kernel across streams.
If you read the dynamic parallelism section of the programming guide carefully (especially streams, events, and synchronization), you may get some ideas. Here's what I came up with:
There is an implicit NULL stream (on the device) associated with the execution sequence that calls your _cDdot function (oddly named, IMHO, since you're working with float quantities in that case, i.e. using Sgemv). Therefore, any cuda kernel or API call issued after the call to cublasSgemv_v2 in your function should wait until any cuda activity associated with the cublasSgemv_v2 function is complete. If you insert an innocuous cuda API call, or else a dummy kernel call, after the call to cublasSgemv_v2, it should wait for that to be complete. This should give you the thread-level synchronization you are after. You might also be able to use a cudaEventRecord call followed by a cudaStreamWaitEvent call.
Here's an example to show the implicit stream synchronization approach:
#include <stdio.h>
#include <cublas_v2.h>
#define SZ 16

__global__ void dummy_kernel(float *in, float *out){
  *out = *in;
}

__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, float * x, float * y, const int wait) {
  float *norm; norm = new float;
  float alpha = 1.0f; float beta = 0.0f;
  *norm = 0.0f;
  cublasSgemv_v2(cublasHandle, CUBLAS_OP_N, 1, n, &alpha, x, 1, y, 1, &beta, norm, 1);
  if (wait){
    dummy_kernel<<<1,1>>>(norm, norm);
  }
  return *norm;
}

__global__ void compute(){
  cublasHandle_t my_h;
  cublasStatus_t status;
  status = cublasCreate(&my_h);
  if (status != CUBLAS_STATUS_SUCCESS) printf("cublasCreate fail\n");
  float *x, *y;
  x = new float[SZ];
  y = new float[SZ];
  for (int i = 0; i < SZ; i++){
    x[i] = 1.0f;
    y[i] = 1.0f;}
  float result = _cDdot(my_h, SZ, x, y, 0);
  printf("result with no wait = %f\n", result);
  result = _cDdot(my_h, SZ, x, y, 1);
  printf("result with wait = %f\n", result);
}

int main(){
  compute<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
compile with:
nvcc -arch=sm_35 -rdc=true -o t302 t302.cu -lcudadevrt -lcublas -lcublas_device
results:
$ ./t302
result with no wait = 0.000000
result with wait = 16.000000
$
Unfortunately, a completely empty dummy_kernel did not work, unless I compiled with -G. So the compiler may be smart enough to optimize out a completely empty child kernel call.
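For completeness, here is an untested sketch of the cudaEventRecord/cudaStreamWaitEvent variant mentioned above, as a replacement for the dummy-kernel branch. Note that device-side events must be created with cudaEventDisableTiming; whether this provides the same ordering as the dummy kernel should be verified.

/* Untested sketch: event-based variant of the "wait" branch. */
if (wait){
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, 0);         // record on the NULL stream, after the cublas call
  cudaStreamWaitEvent(0, ev, 0);  // order subsequent NULL-stream work behind the event
  cudaEventDestroy(ev);
}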

Why is my rather trivial CUDA program erring with certain arguments?

I made a simple CUDA program for practice. It simply copies over data from one array to another:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
# Global constants
N = 2**20 # size of array a
a = np.linspace(0, 1, N)
e = np.empty_like(a)
block_size_x = 512
# Instantiate block and grid sizes.
block_size = (block_size_x, 1, 1)
grid_size = (N / block_size_x, 1)
# Create the CUDA kernel, and run it.
mod = SourceModule("""
__global__ void D2x_kernel(double* a, double* e, int N) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (tid > 0 && tid < N - 1) {
    e[tid] = a[tid];
  }
}
""")
func = mod.get_function('D2x_kernel')
func(a, cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
print str(e)
However, I get this error: pycuda._driver.LogicError: cuLaunchKernel failed: invalid value
When I get rid of the second argument double* e in my kernel function and invoke the kernel without the argument e, the error goes away. Why is that? What does this error mean?
Your a array does not exist in device memory, so I suspect that PyCUDA is ignoring (or otherwise handling) the first argument to your kernel invocation and only passing in e and N... so you get an error because the kernel was expecting three arguments and only received two. Removing double* e from your kernel definition might eliminate the error message you're getting, but your kernel still won't work properly.
A quick fix to this should be to wrap a in a cuda.In() call, which instructs PyCUDA to copy a to the device before launching the kernel. That is, your kernel launch line should be:
func(cuda.In(a), cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
Edit: Also, do you realize that your kernel is not copying the first and last elements of a to e? Your if (tid > 0 && tid < N - 1) statement is preventing that. For the entire array, it should be if (tid < N).
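Putting both fixes together, the kernel source would look like this (the launch line is the cuda.In() version shown above):

__global__ void D2x_kernel(double* a, double* e, int N) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (tid < N) {  // covers the whole array, including the first and last elements
    e[tid] = a[tid];
  }
}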

Use of shared memory with OpenACC

I'm trying to use shared memory to cache things with OpenACC.
Basically what I'm working on is a matrix multiplication, and what I have is this:
typedef float ff;
// Multiplies two square row-major matrices a and b, puts the result in c.
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
  #pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
    #pragma acc region
    {
      #pragma acc loop independent vector(16)
      for (int i = 0; i < n; ++i) {
        #pragma acc loop independent vector(16)
        for (int j = 0; j < n; ++j) {
          ff sum = 0;
          for (int k = 0; k < n; ++k) {
            sum += a[i + n * k] * b[k + n * j];
          }
          c[i + n * j] = sum;
        }
      }
    }
  }
}
What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.
Basically on CUDA I would know the exact size of my blocks, and would be able to:
declare a shared memory with the size of the block
copy the 'relevant' part of the data to the block
use this data
I understand I can use the
#pragma acc cache
directive, and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that's going to be mapped to the CUDA architecture.
Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cache directive, or on how to map some of the power of CUDA's shared memory to OpenACC?
If you are using the PGI Accelerator compiler, you can dump out the generated PTX file and see what is going on under the hood during execution:
pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult
The generated PTX will be stored in the current directory.
EDIT: You may prefer to see the high-level code (CUDA C or Fortran), so use the following: -ta=nvidia,cc13,keepptx,keepgpu
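As for the cache directive itself, here is a minimal sketch of its placement, modeled on the stencil example in the OpenACC specification rather than on the matrix multiply above (array names and sizes are placeholders of mine). The directive goes at the top of the loop body and names the array sections to keep in fast memory; whether the compiler actually uses shared memory for them can be checked in the -Minfo output or in the kept PTX/GPU code as described above.

/* Sketch: ask the compiler to keep a 3-element window of "a" in the
   fastest available memory (shared memory on NVIDIA targets). */
#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
for (int i = 1; i < n - 1; ++i) {
  #pragma acc cache(a[i-1:3])
  b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0f;
}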

CUDA-GDB crashes in Kernel

I've been trying to debug my code, as I know something is going wrong in the Kernel, and I've been trying to figure out what specifically. If I try to step into the kernel it seems to completely step over the kernel functions, and will eventually cause an error on quitting:
Single stepping until exit from function dyld_stub_cudaSetupArgument,
which has no line number information.
[Launch of CUDA Kernel 0 (incrementArrayOnDevice<<<(3,1,1),(4,1,1)>>>) on
Device 0]
[Termination of CUDA Kernel 0 (incrementArrayOnDevice<<<(3,1,1),
(4,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (fillinBoth<<<(40,1,1),(1,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 1 (fillinBoth<<<(40,1,1),(1,1,1)>>>) on Device 0]
add (below=0x124400, newtip=0x124430, newfork=0x125ac0) at test.cu:1223
And if I try to break in the Kernel my entire computer crashes and I have to restart it.
I figure there must be something wrong with the way I'm calling the kernel, but I can't figure out what.
The code is rather long, so I'm only including an excerpt of it:
__global__ void fillinOne(seqptr qset, long max) {
  int i, j;
  aas aa;
  int idx = blockIdx.x;
  __shared__ long qs[3];
  if (idx < max)
  {
    memcpy(qs, qset[idx], sizeof(long[3]));
    for (i = 0; i <= 1; i++)
    {
      for (aa = ala; (long)aa <= (long)stop; aa = (aas)((long)aa + 1))
      {
        if (((1L << ((long)aa)) & qs[i]) != 0)
        {
          for (j = i + 1; j <= 2; j++)
            qs[j] |= cudaTranslate[(long)aa - (long)ala][j - i];
        }
      }
    }
  }
}
// Kernel for left != NULL and rt != NULL
void fillin(node *p, node *left, node *rt)
{
  cudaError_t err = cudaGetLastError();
  size_t stepsize = chars * sizeof(long);
  size_t sitesize = chars * sizeof(sitearray);
  //int i, j;
  if (left == NULL)
  {
    // copy rt->numsteps into p->numsteps -- doesn't actually require CUDA, because no computation to do
    memcpy(p->numsteps, rt->numsteps, stepsize);
    checkCUDAError("memcpy");
    // allocate siteset (array of sitearrays) on device
    seqptr qsites;  // as in array of qs's
    cudaMalloc((void **) &qsites, sitesize);
    checkCUDAError("malloc");
    // copy rt->siteset into device array (equivalent to memcpy(qs, rs) but for whole array)
    cudaMemcpy(qsites, rt->siteset, sitesize, cudaMemcpyHostToDevice);
    checkCUDAError("memcpy");
    // do loop in device
    int block_size = 1;  // each site operated on independently
    int n_blocks = chars;
    fillinOne <<< n_blocks, block_size >>> (qsites, chars);
    cudaThreadSynchronize();
    // put qset in p->siteset -- equivalent to memcpy(p->siteset[m], qs)
    cudaMemcpy(p->siteset, qsites, sitesize, cudaMemcpyDeviceToHost);
    checkCUDAError("memcpy");
    // Cleanup
    cudaFree(qsites);
  }
If anyone has any ideas at all, please respond! Thanks in advance!
I suppose you have a single-card configuration. When you are debugging a CUDA kernel and you break inside it, you effectively pause the display driver. That causes what you think is a crash. If you want to use cuda-gdb with only one graphics card, you must use it in command-line mode (don't start X, or press ctrl-alt-Fn to leave X).
If you have two cards, you must run the code on the card that is not driving the display; use cudaSetDevice(n) to select it.
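A minimal sketch of how one might pick the non-display card programmatically; pick_compute_device is a hypothetical helper of mine, and a disabled kernel-execution watchdog is only a heuristic for "not driving a display":

#include <cuda_runtime.h>

/* Hypothetical helper: select a GPU whose kernel-execution watchdog is
   disabled, which usually means it is not driving a display. */
int pick_compute_device(void)
{
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int d = 0; d < n; ++d) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    if (!prop.kernelExecTimeoutEnabled) {  // no watchdog -> likely headless
      cudaSetDevice(d);
      return d;
    }
  }
  return -1;  // every device appears to be driving a display
}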