Use of shared memory with OpenACC

I'm trying to use shared memory to cache things with OpenACC.
Basically what I'm working on is a matrix multiplication, and what I have is this:
typedef float ff;

// Multiplies two square row-major matrices a and b, puts the result in c.
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    {
        #pragma acc region
        {
            #pragma acc loop independent vector(16)
            for (int i = 0; i < n; ++i) {
                #pragma acc loop independent vector(16)
                for (int j = 0; j < n; ++j) {
                    ff sum = 0;
                    for (int k = 0; k < n; ++k) {
                        sum += a[i + n * k] * b[k + n * j];
                    }
                    c[i + n * j] = sum;
                }
            }
        }
    }
}
What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.
Basically on CUDA I would know the exact size of my blocks, and would be able to:
declare a shared memory with the size of the block
copy the 'relevant' part of the data to the block
use this data
I understand I can use the
#pragma acc cache
directive, and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that is going to be mapped to the CUDA architecture.
Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cache directive, or on how to map some of the power of CUDA's shared memory to OpenACC?
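To make the question concrete, this is roughly the kind of thing I have in mind (untested, and the tile size of 16 is an arbitrary choice); what I don't know is whether the cached slice actually ends up in CUDA shared memory:
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{
    #pragma acc parallel loop gang vector_length(16)
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j) {
            ff sum = 0;
            for (int kk = 0; kk < n; kk += 16) {           // walk k in tiles of 16
                #pragma acc cache(b[kk + n*j : 16])        // ask for this contiguous slice in fast memory
                for (int k = kk; k < kk + 16 && k < n; ++k) {
                    sum += a[i + n * k] * b[k + n * j];
                }
            }
            c[i + n * j] = sum;
        }
    }
}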

If you are using the PGI Accelerator compiler, you can dump out the generated PTX file and see what is going on under the hood during execution:
pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult
The generated PTX will be stored in the current directory.
EDIT: You may prefer to see the high-level code (CUDA C or CUDA Fortran); in that case use the following: -ta=nvidia,cc13,keepptx,keepgpu

Is DeviceToDevice copy in a CUDA program needed?

I am doing the following two operations:
Addition of two arrays => a + b = AddResult
Multiplication of two arrays => AddResult * a = MultiplyResult
In the above logic, AddResult is an intermediate result and is used as input to the next multiplication operation.
#define N 4096 // size of array

__global__ void add(const int* a, const int* b, int* c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

__global__ void multiply(const int* a, const int* b, int* c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] * b[tid];
    }
}

int main()
{
    int T = 1024, B = 4; // threads per block and blocks per grid
    int a[N], b[N], c[N], d[N], e[N];
    int* dev_a, * dev_b, * dev_AddResult, * dev_Temp, * dev_MultiplyResult;

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_AddResult, N * sizeof(int));
    cudaMalloc((void**)&dev_Temp, N * sizeof(int));
    cudaMalloc((void**)&dev_MultiplyResult, N * sizeof(int));

    for (int i = 0; i < N; i++)
    {
        // load arrays with some numbers
        a[i] = i;
        b[i] = i * 1;
    }

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_AddResult, c, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_Temp, d, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_MultiplyResult, e, N * sizeof(int), cudaMemcpyHostToDevice);

    // ADD
    add<<<B, T>>>(dev_a, dev_b, dev_AddResult);
    cudaDeviceSynchronize();

    // Multiply
    cudaMemcpy(dev_Temp, dev_AddResult, N * sizeof(int), cudaMemcpyDeviceToDevice); //<---------DO I REALLY NEED THIS?
    multiply<<<B, T>>>(dev_a, dev_Temp, dev_MultiplyResult);
    //multiply<<<B, T>>>(dev_a, dev_AddResult, dev_MultiplyResult);

    // Copy final results D to H
    cudaMemcpy(e, dev_MultiplyResult, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
    {
        printf("(%d+%d)*%d=%d\n", a[i], b[i], a[i], e[i]);
    }

    // clean up
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_AddResult);
    cudaFree(dev_Temp);
    cudaFree(dev_MultiplyResult);
    return 0;
}
In the above sample code, I am transferring the Addition results (i.e. dev_AddResult) to another device array (i.e. dev_Temp) to perform the multiplication operation.
QUESTION: Since the Addition results array (i.e. dev_AddResult) is already on the GPU device, do I really need to transfer it to another array? I have already tried to execute the next kernel by directly providing dev_AddResult as input and it produced the same results. Is there any risk involved in directly passing the output of one kernel as the input of the next kernel? Any best practices to follow?
Yes, for the case you have shown, you can use the "output" of one kernel as the "input" to the next, without any copying. You've already done that and confirmed it works, so I will dispense with any example. The changes are trivial anyway - eliminate the intervening cudaMemcpy operation, and use the same dev_AddResult pointer in place of the dev_Temp pointer on your multiply kernel invocation.
Regarding "risks" I'm not aware of any for the example you have given. Moving away from that example to possibly more general usage, you would want to make sure that the add output calculations are finished before being used somewhere else.
Your example already does this, redundantly, using at least 2 mechanisms:
intervening cudaDeviceSynchronize() - this forces the previously issued work to complete
stream semantics - one rule of stream semantics is that work issued into a particular stream will execute in issue order. Item B, issued into stream X, will not begin until previously issued item A in stream X has completed.
So you don't really need the cudaDeviceSynchronize() in this case. It isn't "hurting" anything from a functionality perspective, but it is probably adding a few microseconds to the overall execution time.
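Putting those two points together, the middle of your main() could be trimmed to something like this (a sketch only, with the device-to-device copy and the synchronize removed):
// Reuse the add output directly as the multiply input; default-stream
// ordering guarantees the multiply sees the completed add results.
add<<<B, T>>>(dev_a, dev_b, dev_AddResult);
multiply<<<B, T>>>(dev_a, dev_AddResult, dev_MultiplyResult);
cudaMemcpy(e, dev_MultiplyResult, N * sizeof(int), cudaMemcpyDeviceToHost);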
More generally, if you had issued your add and multiply kernel into separate streams, then CUDA provides no guarantees of execution order, even though you "issued" the multiply kernel after the add kernel.
In that case (not the one you have here) if you needed the multiply operation to use the previously computed add results, you would need to enforce that somehow (enforce the completion of the add kernel before the multiply kernel). You have already shown one method to do that here, using a synchronize call.
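Another common mechanism, if you did split the work across streams, is an event; here is a sketch (the names addDone, stream1 and stream2 are mine, not from your code):
// Two-stream version: stream2 waits on an event recorded in stream1, so the
// multiply cannot start before the add has finished.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEvent_t addDone;
cudaEventCreate(&addDone);
add<<<B, T, 0, stream1>>>(dev_a, dev_b, dev_AddResult);
cudaEventRecord(addDone, stream1);
cudaStreamWaitEvent(stream2, addDone, 0);  // later work in stream2 waits for addDone
multiply<<<B, T, 0, stream2>>>(dev_a, dev_AddResult, dev_MultiplyResult);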

-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory

I'm new to OpenACC. I like it very much so far, as I'm already familiar with OpenMP.
I have two 1080 Ti cards, each with 9 GB, and 128 GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase it to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory, and the hardware would page in and page out on demand.
Here's my full C code test :
$ cat firstAcc.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 10
int main()
{
    float *a;
    size_t n = GB*1024*1024*1024/sizeof(float);
    size_t s = n * sizeof(float);
    a = (float *)malloc(s);
    if (!a) { printf("Failed to malloc.\n"); return 1; }
    printf("Initializing ... ");
    for (int i = 0; i < n; ++i) {
        a[i] = 0.1f;
    }
    printf("done\n");
    float sum=0.0;
    #pragma acc loop reduction (+:sum)
    for (int i = 0; i < n; ++i) {
        sum+=a[i];
    }
    printf("Sum is %f\n", sum);
    free(a);
    return 0;
}
As per the "Enable Unified Memory" section of this article I compile it with :
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
20, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
28, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
I need to understand those messages, but for now I don't think they are relevant. Then I run it:
$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)
This works fine if I change GB to 8. I expected 10GB to work (despite the GPU card having 9GB) thanks to Pascal 1080Ti and CUDA 8.
Have I misunderstood something, or what am I doing wrong? Thanks in advance.
$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
$ cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
Besides what Bob mentioned, I made a few more fixes.
First, you're not actually generating an OpenACC compute region since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages where it's only showing host code optimizations.
Second, the "i" index should be declared as a "long". Otherwise, you'll overflow the index.
Finally, you need to add "cc60" to your target accelerator options to tell the compiler to target a Pascal based GPU.
% cat mi.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
    float *a;
    size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
    size_t s = n * sizeof(float);
    printf("n = %lu, s = %lu\n", n, s);
    a = (float *)malloc(s);
    if (!a) { printf("Failed to malloc.\n"); return 1; }
    printf("Initializing ... ");
    for (long i = 0; i < n; ++i) {
        a[i] = 0.1f;
    }
    printf("done\n");
    double sum=0.0;
    #pragma acc parallel loop reduction (+:sum)
    for (long i = 0; i < n; ++i) {
        sum+=a[i];
    }
    printf("Sum is %f\n", sum);
    free(a);
    return 0;
}
% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
21, Accelerator kernel generated
Generating Tesla code
21, Generating reduction(+:sum)
22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
I believe a problem is here:
size_t n = GB*1024*1024*1024/sizeof(float);
when I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler is not warning, but the same badness is occurring under the hood. After the declarations of n and s, if I add a printout like this:
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s); // add this line
and compile with PGI 17.04, and run (on a P100, with 16GB) I get output like this:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$
so it's evident that n and s are not what you intended.
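Assuming a 32-bit int and a 64-bit size_t, those printed values are consistent with the following reading (the signed overflow itself is undefined behaviour, so this is only the plausible interpretation):
// GB*1024*1024*1024 is evaluated in 32-bit int arithmetic:
//   10 * 2^30 = 10737418240, which wraps to -2147483648 (i.e. -2^31).
// Dividing by sizeof(float) (a size_t) converts that value to unsigned
// 64-bit: 2^64 - 2^31 = 18446744071562067968, and dividing by 4 gives
// 4611686017890516992, which matches the printed n; s = n * 4 then wraps
// back around to 18446744071562067968.
size_t n = GB*1024*1024*1024/sizeof(float);   // the int overflow happens here
size_t s = n * sizeof(float);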
We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:
$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
    float *a;
    size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
    size_t s = n * sizeof(float);
    printf("n = %lu, s = %lu\n", n, s);
    a = (float *)malloc(s);
    if (!a) { printf("Failed to malloc.\n"); return 1; }
    printf("Initializing ... ");
    for (int i = 0; i < n; ++i) {
        a[i] = 0.1f;
    }
    printf("done\n");
    double sum=0.0;
    #pragma acc loop reduction (+:sum)
    for (int i = 0; i < n; ++i) {
        sum+=a[i];
    }
    printf("Sum is %f\n", sum);
    free(a);
    return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$
Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
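To see the effect in isolation, here is a small host-only sketch (sequential accumulation, so it only illustrates the rounding behaviour, not the parallel reduction itself):
#include <stdio.h>

int main(void)
{
    // With a float accumulator, once the running sum reaches roughly 2^21 the
    // spacing between adjacent floats is 0.25, so adding 0.1f rounds back to
    // the same value and the sum stops growing. A double accumulator keeps
    // absorbing the contributions and lands near the printed 536870920.
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (long long i = 0; i < 5368709120LL; ++i) {
        fsum += 0.1f;
        dsum += 0.1f;
    }
    printf("float sum = %f, double sum = %f\n", fsum, dsum);
    return 0;
}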
And, as @MatColgrove pointed out in his answer, I missed a few other things as well.

Controlling the threads that go into a CUDA function from an OpenACC compute region

I'm calling a CUDA function from an OpenACC compute region, and I want to specify the number of threads that should go into the CUDA function, but I can't figure out how to control that.
%main.cpp
..
#pragma acc routine vector
extern "C" void CUDA_KERNEL_FUNCTION(double *B, int ldb, const double *A, int lda);
..
#pragma acc parallel loop independent collapse(3) gang vector(128)
for (int i0 = 0; i0 < size0 - 31; i0 += 32)
    for (int i1 = 0; i1 < size1 - 31; i1 += 32)
        for (int i2 = 0; i2 < size2; i2 += 1)
            CUDA_KERNEL_FUNCTION(B, ldb, A, lda);
..
..
%cuda_code.cu
extern "C" __device__ void CUDA_KERNEL_FUNCTION(double *B, int ldb, const double *A, int lda)
{
    Num_Threads_gpu = blockDim.x * blockDim.y * blockDim.z;
    // Num_Threads_gpu is always 32
}
The compilation is fine, but no matter what vector length I use, the number of threads that go into the CUDA function is always 32. Is there any way to specify it?
I'm using "cuda/7.0.28" and "pgi/15.10".
Thanks
Try changing vector(128) to vector_length(128). I think PGI 15.10 supports both syntaxes, but just in case...
If that doesn't work, can you please post the compiler output with -Minfo=accel so that we can see what the compiler is doing?
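For reference, the suggested change applied to your loop nest would look like this (a sketch only; whether the requested length is honoured should show up in the -Minfo output):
#pragma acc parallel loop independent collapse(3) gang vector_length(128)
for (int i0 = 0; i0 < size0 - 31; i0 += 32)
    for (int i1 = 0; i1 < size1 - 31; i1 += 32)
        for (int i2 = 0; i2 < size2; i2 += 1)
            CUDA_KERNEL_FUNCTION(B, ldb, A, lda);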

Nested Directives in OpenACC

I'm trying to use the nested parallelism feature of OpenACC to activate dynamic parallelism on my GPU. I have a Tesla K40c, and my OpenACC compiler is PGI version 15.7.
My code is quite simple. When I try to compile the following code, the compiler returns these messages:
PGCC-S-0155-Illegal context for pragma: acc parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors
My code structure:
#pragma acc parallel loop
for (i = 0; i < N; i++)
{
    // << computation >>
    int ss = A[tid].start;
    int ee = A[tid].end;
    #pragma acc parallel loop
    for (j = ss; j < (ee + ss); j++)
    {
        // << computation >>
    }
}
I've also tried changing my code to use routine directives, but I couldn't compile it either:
#pragma acc routine worker
void foo(...)
{
    #pragma acc parallel loop
    for (j = ss; j < (ee + ss); j++)
    {
        // << computation >>
    }
}

#pragma acc parallel loop
for (i = 0; i < N; i++)
{
    // << computation >>
    int ss = A[tid].start;
    int ee = A[tid].end;
    foo(...);
}
Of course, I've also tried routine (seq, worker, gang) alone, without the inner parallel loop directive. That compiles, but dynamic parallelism isn't activated:
37, Generating acc routine worker
Generating Tesla code
42, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
How am I supposed to use dynamic parallelism in OpenACC?
How am I supposed to use dynamic parallelism in OpenACC?
Although nested regions (which would presumably use dynamic parallelism) are a new feature in the OpenACC 2.0 specification, I don't believe they are implemented yet in PGI 15.7. PGI 15.7 represents a partial implementation of the OpenACC 2.0 specification.
This limitation is documented in the PGI 15.7 release notes that should ship with your PGI 15.7 compiler (pgirn157.pdf) in section 2.7 (those release notes are currently available here):
OpenACC 2.0 Missing Features
‣ The declare link directive for global data is not implemented.
‣ Nested parallelism (parallel and kernels constructs within a parallel or kernels region) is not implemented.
Based on the comments, there is some concern about #pragma acc routine worker, so here is a fully worked example with PGI 15.7 of that:
$ cat t1.c
#include <stdio.h>
#include <stdlib.h>
#define D1 4096
#define D2 4096
#define OFFS 2
#pragma acc routine worker
void my_set(int *d, int len, int val){
    int i;
    for (i = 0; i < len; i++) d[i] += val+OFFS;
}
int main(){
    int i,*data;
    data = (int *)malloc(D1*D2*sizeof(int));
    for (i = 0; i < D1*D2; i++) data[i] = 1;
    #pragma acc kernels copy(data[0:D1*D2])
    for (i = 0; i < D1; i++)
        my_set(data+(i*D2), D2, 1);
    printf("%d\n", data[0]);
    return 0;
}
$ pgcc -acc -ta=tesla -Minfo=accel t1.c -o t1
my_set:
8, Generating acc routine worker
Generating Tesla code
10, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
main:
20, Generating copy(data[:16777216])
21, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
21, #pragma acc loop gang /* blockIdx.x */
$ ./t1
4
$
Note that the gang parallelism has been performed at the outer loop, and the worker parallelism has been performed in the inner (routine) loop.
This method does not depend on dynamic parallelism (instead, it relies on a partitioning of parallelism between worker at the routine level and gang at the caller level) and will not invoke dynamic parallelism.
The native use of dynamic parallelism (CDP) is not currently supported in PGI 15.7. It should be possible to call (i.e. interoperate with) other functions (e.g. CUDA, or libraries) that make use of CDP from OpenACC code, but natively it is not used (and not supported) in PGI 15.7.
Try replacing "#pragma acc parallel loop" with "#pragma acc loop".
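Applied to your routine version, that would look roughly like this (a sketch; ss and ee would need to be passed in or otherwise visible inside foo):
#pragma acc routine worker
void foo(int ss, int ee)
{
    #pragma acc loop
    for (int j = ss; j < (ee + ss); j++)
    {
        // << computation >>
    }
}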

Shared memory, branching performance and register count

I came across some peculiar performance behaviour when trying out the CUDA shuffle instruction. The test kernel below is based on an image processing algorithm which adds input-dependent values to all neighbouring pixels within a square of side rad. The output for each block is accumulated in shared memory. If only one thread per warp adds its result to shared memory, the performance is poor (Option 1), whereas if all threads add to shared memory (one thread adds the desired value, the rest just add 0), the execution time drops by a factor of 2-3 (Option 2).
#include <iostream>
#include "cuda_runtime.h"

#define warpSz 32
#define tileY 32
#define rad 32

__global__ void test(float *out, int pitch)
{
    // Set shared mem to 0
    __shared__ float tile[(warpSz + 2*rad) * (tileY + 2*rad)];
    for (int i = threadIdx.y*blockDim.x+threadIdx.x; i<(tileY+2*rad)*(warpSz+2*rad); i+=blockDim.x*blockDim.y) {
        tile[i] = 0.0f;
    }
    __syncthreads();

    for (int row=threadIdx.y; row<tileY; row += blockDim.y) {
        // Loop over pixels in neighbourhood
        for (int i=0; i<2*rad+1; ++i) {
            float res = 0.0f;
            int rowStartIdx = (row+i)*(warpSz+2*rad);
            for (int j=0; j<2*rad+1; ++j) {
                res += float(threadIdx.x+row); // Substitute for real calculation

                // Option 1: one thread writes to shared mem
                if (threadIdx.x == 0) {
                    tile[rowStartIdx + j] += res;
                    res = 0.0f;
                }

                //// Option 2: all threads write to shared mem
                //float tmp = 0.0f;
                //if (threadIdx.x == 0) {
                //    tmp = res;
                //    res = 0.0f;
                //}
                //tile[rowStartIdx + threadIdx.x+j] += tmp;

                res = __shfl(res, (threadIdx.x+1) % warpSz);
            }
            res += float(threadIdx.x+row);
            tile[rowStartIdx + threadIdx.x+2*rad] += res;
            __syncthreads();
        }
    }

    // Add result back to global mem
    for (int row=threadIdx.y; row<tileY+2*rad; row+=blockDim.y) {
        for (int col=threadIdx.x; col<warpSz+2*rad; col+=warpSz) {
            int idx = (blockIdx.y*tileY + row)*pitch + blockIdx.x*warpSz + col;
            atomicAdd(out+idx, tile[row*(warpSz+2*rad) + col]);
        }
    }
}

int main(void)
{
    int2 dim = make_int2(512, 512);
    int pitchOut = (((dim.x+2*rad)+warpSz-1) / warpSz) * warpSz;
    int sizeOut = pitchOut*(dim.y+2*rad);
    dim3 gridDim((dim.x+warpSz-1)/warpSz, (dim.y+tileY-1)/tileY, 1);

    float *devOut;
    cudaMalloc((void**)&devOut, sizeOut*sizeof(float));

    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaFree(0);

    cudaEventRecord(start, 0);
    test<<<gridDim, dim3(warpSz, 8)>>>(devOut, pitchOut);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);

    cudaFree(devOut);
    cudaDeviceReset();
    std::cout << "Elapsed time: " << elapsedTime << " ms.\n";
    std::cin.ignore();
}
Is this expected behaviour, and can anyone explain why it happens?
One thing I have noted is that Option 1 uses only 15 registers, whereas Option 2 uses 37, which seems like a big difference to me.
Another is that the if-statement in the innermost loop is converted to explicit bra instructions in the PTX code for Option 1, whereas for Option 2 it is converted to two selp instructions. Could it be that the explicit branching is behind the 2-3x slowdown, similar to what's suspected in this question?
There are two reasons why I am reluctant to go for Option 2. First, when profiling the original application it seems to be limited by shared memory bandwidth, which indicates that there is potential to increase performance by having fewer threads accessing it. Second, unless we use the volatile keyword, writes to shared memory can be optimised into registers. Since we are only interested in the contribution from the last thread to access each memory location (threadIdx.x == 0), and all others add 0, this is not a problem as long as all changes temporarily held in registers are guaranteed to be written back to shared memory in the same order they were issued. Is this the case, though? (So far, both options have produced exactly the same result.)
Any thoughts or ideas are much appreciated!
PS. I compile for compute capability 3.0. (However, the shuffle instruction is not necessary to demonstrate the behaviour and can be commented out.)