Nested Directives in OpenACC - cuda

I'm trying to use the nested parallelism feature of OpenACC to activate dynamic parallelism on my GPU. I have a Tesla 40c and my OpenACC compiler is PGI version 15.7.
My code is quite simple. When I try to compile the following code, the compiler returns these messages:
PGCC-S-0155-Illegal context for pragma: acc parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors
My code structure:
#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
// << computation >>
int ss = A[tid].start;
int ee = A[tid].end;
#pragma acc parallel loop
for(j = ss; j< ( ee + ss); j++)
{
// << computation >>
}
}
I also tried changing my code to use routine directives, but again it would not compile:
#pragma acc routine workers
foo(...)
{
#pragma acc parallel loop
for(j = ss; j< ( ee + ss); j++)
{
// << computation >>
}
}
#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
// << computation >>
int ss = A[tid].start;
int ee = A[tid].end;
foo(...);
}
Of course, I also tried using only routine (seq, worker, gang) without the inner parallel loop directive. That compiled, but dynamic parallelism was not activated:
37, Generating acc routine worker
Generating Tesla code
42, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
How am I supposed to use dynamic parallelism in OpenACC?

How am I supposed to use dynamic parallelism in OpenACC?
Although nested regions (which would presumably use dynamic parallelism) are a new feature in the OpenACC 2.0 specification, I don't believe they are implemented yet in PGI 15.7. PGI 15.7 represents a partial implementation of the OpenACC 2.0 specification.
This limitation is documented in the PGI 15.7 release notes that should ship with your PGI 15.7 compiler (pgirn157.pdf) in section 2.7 (those release notes are currently available here):
OpenACC 2.0 Missing Features
‣ The declare link directive for global data is not implemented.
‣ Nested parallelism (parallel and kernels constructs within a parallel or kernels region) is not implemented.
Based on the comments, there is some concern about #pragma acc routine worker, so here is a fully worked example with PGI 15.7 of that:
$ cat t1.c
#include <stdio.h>
#include <stdlib.h>
#define D1 4096
#define D2 4096
#define OFFS 2
#pragma acc routine worker
void my_set(int *d, int len, int val){
int i;
for (i = 0; i < len; i++) d[i] += val+OFFS;
}
int main(){
int i,*data;
data = (int *)malloc(D1*D2*sizeof(int));
for (i = 0; i < D1*D2; i++) data[i] = 1;
#pragma acc kernels copy(data[0:D1*D2])
for (i = 0; i < D1; i++)
my_set(data+(i*D2), D2, 1);
printf("%d\n", data[0]);
return 0;
}
$ pgcc -acc -ta=tesla -Minfo=accel t1.c -o t1
my_set:
8, Generating acc routine worker
Generating Tesla code
10, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
main:
20, Generating copy(data[:16777216])
21, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
21, #pragma acc loop gang /* blockIdx.x */
$ ./t1
4
$
Note that the gang parallelism has been performed at the outer loop, and the worker parallelism has been performed in the inner (routine) loop.
This method does not depend on dynamic parallelism (instead, it relies on a partitioning of parallelism between worker at the routine level and gang at the caller level) and will not invoke dynamic parallelism.
Native use of dynamic parallelism (CDP) is not currently supported in PGI 15.7. It should be possible to call (i.e. interoperate with) other functions (e.g. CUDA code or libraries) that make use of CDP from OpenACC code, but natively it is neither used nor supported in PGI 15.7.

Try replacing the inner "#pragma acc parallel loop" with "#pragma acc loop".
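A minimal sketch of that restructuring, assuming the inner loops touch disjoint ranges so that all iterations are independent. The `segment` type, the function name `add_to_segments`, and the data clauses are illustrative assumptions, not taken from the original code. It uses a single compute region, with gang parallelism on the outer loop and a plain `loop` directive on the inner one, so no nested region (and no dynamic parallelism) is involved. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```c
#include <stddef.h>

/* Hypothetical stand-in for the asker's A[] entries: a start offset and a length. */
typedef struct {
    int start;
    int len;
} segment;

/* Adds val to every element covered by a segment.
 * Assumes the segments do not overlap, so iterations are independent. */
void add_to_segments(float *data, size_t data_len,
                     const segment *segs, int nseg, float val)
{
    /* One compute region only: gangs take the outer loop ... */
    #pragma acc parallel loop gang copy(data[0:data_len]) copyin(segs[0:nseg])
    for (int i = 0; i < nseg; i++) {
        int ss = segs[i].start;
        int ee = segs[i].len;
        /* ... and the inner loop is a plain "loop", not a nested "parallel loop". */
        #pragma acc loop vector
        for (int j = ss; j < ss + ee; j++) {
            data[j] += val;
        }
    }
}
```

Calling `add_to_segments(d, 8, s, 2, 1.0f)` with segments `{0, 3}` and `{5, 2}` should increment elements 0..2 and 5..6 of `d` and leave the rest untouched.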

-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory

I'm new to OpenACC. I like it very much so far as I'm familiar with OpenMP.
I have two 1080Ti cards, each with 9GB, and 128GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory and the hardware would page in and out on demand.
Here's my full C code test :
$ cat firstAcc.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 10
int main()
{
float *a;
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
float sum=0.0;
#pragma acc loop reduction (+:sum)
for (int i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
As per the "Enable Unified Memory" section of this article, I compile it with:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
20, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
28, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
I need to understand those messages but for now I don't think they are relevant. Then I run it :
$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)
This works fine if I change GB to 8. I expected 10GB to work (despite the GPU card having 9GB) thanks to the Pascal 1080Ti and CUDA 8.
Have I misunderstood something, or what am I doing wrong? Thanks in advance.
$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
$ cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
Besides what Bob mentioned, I made a few more fixes.
First, you're not actually generating an OpenACC compute region since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages where it's only showing host code optimizations.
Second, the "i" index should be declared as a "long". Otherwise, you'll overflow the index.
Finally, you need to add "cc60" to your target accelerator options to tell the compiler to target a Pascal based GPU.
% cat mi.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
float *a;
size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (long i = 0; i < n; ++i) { // long index here too: n exceeds INT_MAX
a[i] = 0.1f;
}
printf("done\n");
double sum=0.0;
#pragma acc parallel loop reduction (+:sum)
for (long i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
21, Accelerator kernel generated
Generating Tesla code
21, Generating reduction(+:sum)
22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
I believe a problem is here:
size_t n = GB*1024*1024*1024/sizeof(float);
When I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler does not warn, but the same overflow is occurring under the hood. If I add a printout after the declarations of n and s, like this:
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s); // add this line
and compile with PGI 17.04, and run (on a P100, with 16GB) I get output like this:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$
So it's evident that n and s are not what you intended.
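To make the arithmetic concrete, here is a small sketch that models the wraparound without actually triggering signed overflow (which is undefined behavior in C). It assumes a 32-bit int, a 64-bit size_t, two's-complement wrapping, and sizeof(float) == 4; the function names `intended_count` and `wrapped_count` are made up for illustration.

```c
#include <stdint.h>

/* Element count the code was meant to compute: all arithmetic in 64 bits. */
uint64_t intended_count(uint64_t gb)
{
    return gb * 1024ULL * 1024ULL * 1024ULL / sizeof(float);
}

/* Models a 32-bit int product that wraps and is then converted to a
 * 64-bit unsigned size_t before the division by sizeof(float). */
uint64_t wrapped_count(uint64_t gb)
{
    /* Keep only the low 32 bits, as a wrapping 32-bit multiply would. */
    uint32_t low = (uint32_t)(gb * 1024ULL * 1024ULL * 1024ULL);
    uint64_t widened = low;
    if (low & 0x80000000u) {
        /* Negative as an int: conversion to size_t sign-extends. */
        widened |= 0xFFFFFFFF00000000ULL;
    }
    return widened / sizeof(float);
}
```

Under these assumptions, `wrapped_count(10)` reproduces the value of n printed above, while `intended_count(20)` gives the 5368709120 elements seen once the constants are marked ULL. Under the same model, 8 GB wraps to 0, which would explain why that case ran without an allocation error.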
We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:
$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
float *a;
size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
double sum=0.0;
#pragma acc loop reduction (+:sum)
for (int i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$
Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
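The effect of the accumulator type can be seen in a small host-only sketch (the function names are hypothetical, and this is serial code rather than the OpenACC reduction itself). On typical IEEE 754 hardware, a float accumulator stops growing once the running sum is large enough that 0.1 is less than half the spacing between adjacent float values, whereas a double accumulator keeps tracking the expected total.

```c
#include <stddef.h>

/* Sums n copies of 0.1f in a float accumulator.
 * volatile forces every partial sum to be rounded to float. */
float sum_in_float(size_t n)
{
    volatile float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += 0.1f;
    }
    return sum;
}

/* Same sum, but accumulated in double precision. */
double sum_in_double(size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += 0.1f;
    }
    return sum;
}
```

For example, with n = 30000000 the double result should be close to 3000000, while the float result should fall well short of it. A parallel reduction adds partial sums in a different order, so it will not fail in exactly the same way, but the precision limit is the same.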
And, as @MatColgrove pointed out in his answer, I missed a few other things as well.

verifying NVIDIA __shared__ memory is used when cache directive is present

I'm experimenting with OpenACC's cache directive using PGI 14.10. I've got a simple loop based on the one in the slides at [1]:
#include <stdlib.h>
int main(int argc, char **argv) {
int N = 1024;
int *restrict x = (int *)malloc(sizeof(int) * N);
int *restrict y = (int *)malloc(sizeof(int) * N);
#pragma acc parallel loop copy(x[0:N], y[0:N])
for (int i = 1; i < N - 1; i++) {
#pragma acc cache(x[i-1:2])
y[i] = (x[i - 1] + x[i + 1]) / 2.0;
}
return 0;
}
When I run this under nvprof with --metrics shared_load_transactions,shared_store_transactions it reports no loads or stores. So is the cache directive not having the effect I want (and if so why isn't it working)? Or is using nvprof to measure shared transactions incorrect?
Minfo output is below.
[1] http://www.pgroup.com/lit/presentations/cea-3.pdf
main:
6, Generating copy(x[:N])
Generating copy(y[:N])
Accelerator kernel generated
9, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
6, Generating Tesla code
Answered on the PGI Forums: http://www.pgroup.com/userforum/viewtopic.php?t=4611&start=0&postdays=0&postorder=asc&highlight=
Apparently, the cache directive was almost entirely disabled on PGI 14.x compilers, but will be available in the 2015 version.
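Separately from the compiler version, note that the loop in the question reads x[i - 1] and x[i + 1], while cache(x[i-1:2]) names only two elements (C subarrays are start:length). A hedged sketch of the same stencil with a three-element window follows; the function name `stencil_average` and the use of float data are assumptions for illustration, and whether the compiler actually stages the window in __shared__ memory remains implementation-dependent.

```c
/* Three-point stencil: y[i] becomes the mean of x[i-1] and x[i+1].
 * The first and last elements of y are left untouched. */
void stencil_average(const float *restrict x, float *restrict y, int n)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 1; i < n - 1; i++) {
        /* Subarray is start:length, so three elements cover x[i-1] .. x[i+1]. */
        #pragma acc cache(x[i-1:3])
        y[i] = (x[i - 1] + x[i + 1]) / 2.0f;
    }
}
```

Without an OpenACC compiler the pragmas are ignored and the function runs serially, which is enough to check the stencil logic before profiling with nvprof.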

Multi-gpu CUDA Thrust

I have CUDA C++ code that uses Thrust and currently works properly on a single GPU. I'd now like to modify it for multi-GPU. I have a host function that includes a number of Thrust calls that sort, copy, calculate differences, etc. on device arrays. I want each GPU to run this sequence of Thrust calls on its own (independent) set of arrays at the same time. I've read that Thrust functions that return values are synchronous, but can I use OpenMP to have each host thread call a function (with Thrust calls) that runs on a separate GPU?
For example (coded in browser):
#pragma omp parallel for
for (int dev=0; dev<Ndev; dev++){
cudaSetDevice(dev);
runthrustfunctions(dev);
}
void runthrustfunctions(int dev){
/*lots of Thrust functions running on device arrays stored on corresponding GPU*/
// for example, these are just a few of the lines
thrust::device_ptr<double> pos_ptr = thrust::device_pointer_cast(particle[dev].pos);
thrust::device_ptr<int> list_ptr = thrust::device_pointer_cast(particle[dev].list);
thrust::sequence(list_ptr,list_ptr+length);
thrust::sort_by_key(pos_ptr, pos_ptr+length,list_ptr);
thrust::device_vector<double> temp(length);
thrust::gather(list_ptr,list_ptr+length,pos_ptr,temp.begin());
thrust::copy(temp.begin(), temp.end(), pos_ptr);
}
I think I also need the structure "particle[0]" to be stored on GPU 0, particle[1] on GPU 1, etc., and my guess is that this is not possible. An option might be to use "switch" with separate code for each GPU case.
I'd like to know if this is a correct approach or if there is a better way?
Thanks
Yes, you can combine thrust and OpenMP.
Here's a complete worked example with results:
$ cat t340.cu
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/generate.h>
#include <vector>
#include <time.h>
#include <sys/time.h>
#define DSIZE 200000000
using namespace std;
int main(int argc, char *argv[])
{
timeval t1, t2;
int num_gpus = 0; // number of CUDA GPUs
printf("%s Starting...\n\n", argv[0]);
// determine the number of CUDA capable GPUs
cudaGetDeviceCount(&num_gpus);
if (num_gpus < 1)
{
printf("no CUDA capable devices were detected\n");
return 1;
}
// display CPU and GPU configuration
printf("number of host CPUs:\t%d\n", omp_get_num_procs());
printf("number of CUDA devices:\t%d\n", num_gpus);
for (int i = 0; i < num_gpus; i++)
{
cudaDeviceProp dprop;
cudaGetDeviceProperties(&dprop, i);
printf(" %d: %s\n", i, dprop.name);
}
printf("initialize data\n");
// initialize data
typedef thrust::device_vector<int> dvec;
typedef dvec *p_dvec;
std::vector<p_dvec> dvecs;
for(unsigned int i = 0; i < num_gpus; i++) {
cudaSetDevice(i);
p_dvec temp = new dvec(DSIZE);
dvecs.push_back(temp);
}
thrust::host_vector<int> data(DSIZE);
thrust::generate(data.begin(), data.end(), rand);
// copy data
for (unsigned int i = 0; i < num_gpus; i++) {
cudaSetDevice(i);
thrust::copy(data.begin(), data.end(), (*(dvecs[i])).begin());
}
printf("start sort\n");
gettimeofday(&t1,NULL);
// run as many CPU threads as there are CUDA devices
omp_set_num_threads(num_gpus); // create as many CPU threads as there are CUDA devices
#pragma omp parallel
{
unsigned int cpu_thread_id = omp_get_thread_num();
cudaSetDevice(cpu_thread_id);
thrust::sort((*(dvecs[cpu_thread_id])).begin(), (*(dvecs[cpu_thread_id])).end());
cudaDeviceSynchronize();
}
gettimeofday(&t2,NULL);
printf("finished\n");
unsigned long et = ((t2.tv_sec * 1000000)+t2.tv_usec) - ((t1.tv_sec * 1000000) + t1.tv_usec);
cudaError_t err = cudaGetLastError(); // read once: each call resets the stored error
if (cudaSuccess != err)
printf("%s\n", cudaGetErrorString(err));
printf("sort time = %fs\n", (float)et/(float)(1000000));
// check results
thrust::host_vector<int> result(DSIZE);
thrust::sort(data.begin(), data.end());
for (int i = 0; i < num_gpus; i++)
{
cudaSetDevice(i);
thrust::copy((*(dvecs[i])).begin(), (*(dvecs[i])).end(), result.begin());
for (int j = 0; j < DSIZE; j++)
if (data[j] != result[j]) { printf("mismatch on device %d at index %d, host: %d, device: %d\n", i, j, data[j], result[j]); return 1;}
}
printf("Success\n");
return 0;
}
$ nvcc -Xcompiler -fopenmp -O3 -arch=sm_20 -o t340 t340.cu -lgomp
$ CUDA_VISIBLE_DEVICES="0" ./t340
./t340 Starting...
number of host CPUs: 12
number of CUDA devices: 1
0: Tesla M2050
initialize data
start sort
finished
sort time = 0.398922s
Success
$ ./t340
./t340 Starting...
number of host CPUs: 12
number of CUDA devices: 4
0: Tesla M2050
1: Tesla M2070
2: Tesla M2050
3: Tesla M2070
initialize data
start sort
finished
sort time = 0.460058s
Success
$
We can see that when I restrict the program to using a single device, the sort operation takes about 0.4 seconds. Then, when I allow it to use all 4 devices (repeating the same sort on all 4 devices), the overall operation only takes 0.46 seconds, even though we're doing 4 times as much work.
For this particular case I happened to be using CUDA 5.0 with Thrust v1.7, and gcc 4.4.6 (RHEL 6.2).
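As a quick check on the timings above, the ratio of work done per second in the two runs can be computed directly. The helper below is purely illustrative (its name and signature are not part of the original program) and assumes each device performs the same amount of work as the single-device run.

```c
/* Effective speedup: (devices x single-device time) / multi-device time.
 * Assumes every device does the same amount of work as the single-device run. */
double effective_speedup(int num_devices, double single_time_s, double multi_time_s)
{
    return (double)num_devices * single_time_s / multi_time_s;
}
```

With the measured values, `effective_speedup(4, 0.398922, 0.460058)` comes out at roughly 3.5, i.e. close to, but not quite, linear scaling across the four GPUs.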

not able to use printf in cuda kernel function

It seems that printf doesn't work inside a CUDA kernel:
#include "Common.h"
#include<cuda.h>
#include <stdio.h>
__device__ __global__ void Kernel(float *a_d , float *b_d ,int size)
{
int idx = threadIdx.x ;
int idy = threadIdx.y ;
//Allocating memory in the share memory of the device
__shared__ float temp[16][16];
//Copying the data to the shared memory
temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;
printf("idx=%d, idy=%d, size=%d\n", idx, idy, size);
for(int i =1 ; i<size ;i++) {
if((idy + i) < size) { // NO Thread divergence here
float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]);
temp[i+idy][idx] = temp[i-1][idx] +((var1) * (temp[i+idy ][idx]));
}
__syncthreads(); //Synchronizing all threads before next iteration
}
b_d[idy*(size+1) + idx] = temp[idy][idx];
}
When compiling, it says:
error: calling a host function("printf") from a __device__/__global__ function("Kernel") is not allowed
The CUDA version is 4.
Quoting the CUDA Programming Guide: "Formatted output is only supported by devices of compute capability 2.x and higher". See the programming guide for additional information.
Devices of compute capability < 2.x can use cuPrintf.
If you are on a 2.x or higher device and you are trying to use printf, make sure you have specified -arch=sm_20 (or higher). The default is sm_10, which does not have sufficient features to support printf.
NVIDIA offers three source level debuggers for CUDA. You may find these more useful than printf for inspecting variables.
- Nsight Visual Studio Edition CUDA Debugger
- Nsight Eclipse Edition CUDA Debugger
- cuda-gdb
You need to use cuPrintf, as in this example. Note that printf is a pretty limited way of debugging; the Nsight or Nsight Eclipse Edition IDEs are much nicer.

Use of shared memory with OpenACC

I'm trying to use shared memory to cache things with OpenACC.
Basically what I'm working on is a matrix multiplication, and what I have is this:
typedef float ff;
// Multiplies two square row-major matrices a and b, puts the result in c.
void mmul(const ff *restrict a,
const ff *restrict b,
ff *restrict c,
const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{
#pragma acc region
{
#pragma acc loop independent vector(16)
for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
for (int j = 0; j < n; ++j) {
ff sum = 0;
for (int k = 0; k < n; ++k) {
sum += a[i + n * k] * b[k + n * j];
}
c[i + n * j] = sum;
}
}
}
}
}
What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.
Basically, in CUDA I would know the exact size of my blocks, and would be able to:
- declare a shared memory array with the size of the block
- copy the 'relevant' part of the data to the block
- use this data
I understand I can use the
#pragma acc cache
directive, and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that is going to be mapped to the CUDA architecture.
Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cache directive or on how to map some of the power of shared memory from CUDA to OpenACC?
If you are using the PGI Accelerator compiler, you can dump the generated PTX file and see what is going on under the hood:
pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult
The generated PTX will be stored in the current directory.
EDIT: You may prefer to see the high-level code (CUDA C or Fortran). In that case, use the following: -ta=nvidia,cc13,keepptx,keepgpu
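For the blocking itself, one option worth trying is the OpenACC 2.0 tile clause, which asks the compiler to split the i/j iteration space into fixed-size blocks. The sketch below is an assumption-laden illustration, not a confirmed recipe: the name `mmul_tiled` is hypothetical, the 16x16 tile size simply mirrors the vector(16) in the question, and whether the compiler stages the tiles in shared memory depends on the implementation. Inspecting the kept PTX or GPU code as described above is a way to find out.

```c
typedef float ff;

/* Same indexing as the question: element (i, j) is stored at [i + n * j]. */
void mmul_tiled(const ff *restrict a, const ff *restrict b,
                ff *restrict c, int n)
{
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    {
        /* tile(16, 16) requests 16x16 blocks over the two tightly nested loops. */
        #pragma acc parallel loop tile(16, 16)
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                ff sum = 0;
                /* The dot product along k stays sequential within each (i, j). */
                #pragma acc loop seq
                for (int k = 0; k < n; ++k) {
                    sum += a[i + n * k] * b[k + n * j];
                }
                c[i + n * j] = sum;
            }
        }
    }
}
```

Without an OpenACC compiler the pragmas are ignored and the function behaves as an ordinary serial matrix multiply, which makes it easy to validate the results on a small case first.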