verifying NVIDIA __shared__ memory is used when cache directive is present - cuda

I'm experimenting with OpenACC's cache clause using PGI 14.10. I've got a simple loop based on the one in the slides at [1]:
#include <stdlib.h>
int main(int argc, char **argv) {
int N = 1024;
int *restrict x = (int *)malloc(sizeof(int) * N);
int *restrict y = (int *)malloc(sizeof(int) * N);
#pragma acc parallel loop copy(x[0:N], y[0:N])
for (int i = 1; i < N - 1; i++) {
#pragma acc cache(x[i-1:2])
y[i] = (x[i - 1] + x[i + 1]) / 2.0;
}
return 0;
}
When I run this under nvprof with --metrics shared_load_transactions,shared_store_transactions it reports no loads or stores. So is the cache directive not having the effect I want (and if so why isn't it working)? Or is using nvprof to measure shared transactions incorrect?
Minfo output is below.
[1] http://www.pgroup.com/lit/presentations/cea-3.pdf
main:
6, Generating copy(x[:N])
Generating copy(y[:N])
Accelerator kernel generated
9, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
6, Generating Tesla code

Answered on the PGI Forums: http://www.pgroup.com/userforum/viewtopic.php?t=4611&start=0&postdays=0&postorder=asc&highlight=
Apparently, the cache directive was almost entirely disabled on PGI 14.x compilers, but will be available in the 2015 version.

Related

-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory

I'm new to OpenACC. I like it very much so far as I'm familiar with OpenMP.
I have 2 1080Ti cards each with 9GB and I've 128GB of RAM. I'm trying a very basic test to allocate an array, initialize it, then sum it up in parallel. This works for 8 GB but when I increase to 10 GB I get out-of-memory error. My understanding was that with unified memory of Pascal (which these card are) and CUDA 8, I could allocate an array larger than the GPU's memory and the hardware will page in and page out on demand.
Here's my full C code test :
$ cat firstAcc.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 10
int main()
{
float *a;
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
float sum=0.0;
#pragma acc loop reduction (+:sum)
for (int i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
As per the "Enable Unified Memory" section of this article I compile it with :
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
20, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
28, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
I need to understand those messages but for now I don't think they are relevant. Then I run it :
$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)
This works fine if I change GB to 8. I expected 10GB to work (despite the GPU card having 9GB) thanks to Pascal 1080Ti and CUDA 8.
Have I misunderstand, or what am I doing wrong? Thanks in advance.
$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
$ cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
Besides what Bob mentioned, I made a few more fixes.
First, you're not actually generating an OpenACC compute region since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages where it's only showing host code optimizations.
Second, the "i" index should be declared as a "long". Otherwise, you'll overflow the index.
Finally, you need to add "cc60" to your target accelerator options to tell the compiler to target a Pascal based GPU.
% cat mi.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
float *a;
size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
double sum=0.0;
#pragma acc parallel loop reduction (+:sum)
for (long i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
21, Accelerator kernel generated
Generating Tesla code
21, Generating reduction(+:sum)
22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
I believe a problem is here:
size_t n = GB*1024*1024*1024/sizeof(float);
when I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler is not warning, but the same badness is occurring under the hood. After the declarations of s, and n, if I add a printout like this:
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s); // add this line
and compile with PGI 17.04, and run (on a P100, with 16GB) I get output like this:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$
so it's evident that n and s are not what you intended.
We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:
$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
float *a;
size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
double sum=0.0;
#pragma acc loop reduction (+:sum)
for (int i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$
Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
And, as #MatColgrove pointed out in his answer, I missed a few other things as well.

Nested Directives in OpenACC

I'm trying to use nested feature of OpenACC to active dynamic parallelism of my gpu card. I've Tesla 40c and my OpenACC compiler is PGI version 15.7.
My code is so simple. When I try to compile following code compiler returns me these messages
PGCC-S-0155-Illegal context for pragma: acc parallel loop (test.cpp: 158)
PGCC/x86 Linux 15.7-0: compilation completed with severe errors
My code structure:
#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
// << computation >>
int ss = A[tid].start;
int ee = A[tid].end;
#pragma acc parallel loop
for(j = ss; j< ( ee + ss); j++)
{
// << computation >>
}
I've also tried to change my code to use routine directives. But I couldn't compile again
#pragma acc routine workers
foo(...)
{
#pragma acc parallel loop
for(j = ss; j< ( ee + ss); j++)
{
// << computation >>
}
}
#pragma acc parallel loop
for( i = 0; i < N; i++ )
{
// << computation >>
int ss = A[tid].start;
int ee = A[tid].end;
foo(...);
}
I've tried of course only with routine (seq,worker,gang) without inner parallel loop directive. It has been compiler but dynamic parallelism hasn't been activated.
37, Generating acc routine worker
Generating Tesla code
42, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
How am I supposed to use dynamic parallelism in OpenACC?
How am I supposed to use dynamic parallelism in OpenACC?
Although nested regions (which would presumably use dynamic parallelism) is a new feature in the OpenACC 2.0 specification, I don't believe it is implemented yet in PGI 15.7. PGI 15.7 represents a partial implementation of the OpenACC 2.0 specification.
This limitation is documented in the PGI 15.7 release notes that should ship with your PGI 15.7 compiler (pgirn157.pdf) in section 2.7 (those release notes are currently available here):
OpenACC 2.0 Missing Features
‣ The declare link directive for global data is not implemented.
‣ Nested parallelism (parallel and kernels constructs within a parallel or kernels region) is not
implemented.
Based on the comments, there is some concern about #pragma acc routine worker, so here is a fully worked example with PGI 15.7 of that:
$ cat t1.c
#include <stdio.h>
#include <stdlib.h>
#define D1 4096
#define D2 4096
#define OFFS 2
#pragma acc routine worker
void my_set(int *d, int len, int val){
int i;
for (i = 0; i < len; i++) d[i] += val+OFFS;
}
int main(){
int i,*data;
data = (int *)malloc(D1*D2*sizeof(int));
for (i = 0; i < D1*D2; i++) data[i] = 1;
#pragma acc kernels copy(data[0:D1*D2])
for (i = 0; i < D1; i++)
my_set(data+(i*D2), D2, 1);
printf("%d\n", data[0]);
return 0;
}
$ pgcc -acc -ta=tesla -Minfo=accel t1.c -o t1
my_set:
8, Generating acc routine worker
Generating Tesla code
10, #pragma acc loop vector, worker /* threadIdx.x threadIdx.y */
Loop is parallelizable
main:
20, Generating copy(data[:16777216])
21, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
21, #pragma acc loop gang /* blockIdx.x */
$ ./t1
4
$
Note that the gang parallelism has been performed at the outer loop, and the worker parallelism has been performed in the inner (routine) loop.
This method does not depend on dynamic parallelism (instead, it relies on a partitioning of parallelism between worker at the routine level and gang at the caller level) and will not invoke dynamic parallelism.
The native use of dynamic parallelism (CDP) is not supported currently in PGI 15.7. It should be possible to call (i.e. interoperate with) other functions (e.g. CUDA, or libraries) that make use of CDP from OpenACC code, but currently, natively it is not used (and not supported) in PGI 15.7
try replacing "#pragma acc parallel loop" with #pragma acc loop"

Use of shared memory with OpenACC

I'm trying to use shared memory to cache things with OpenACC.
Basically what I'm working on is a matrix multiplication, and what I have is this:
typedef float ff;
// Multiplies two square row-major matrices a and b, puts the result in c.
void mmul(const restrict ff* a,
const restrict ff* b,
restrict ff* c,
const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
{
#pragma acc region
{
#pragma acc loop independent vector(16)
for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
for (int j = 0; j < n; ++j) {
ff sum = 0;
for (int k = 0; k < n; ++k) {
sum += a[i + n * k] * b[k + n * j];
}
c[i + n * j] = sum;
}
}
}
}
}
What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.
Basically on CUDA I would know the exact size of my blocks, and would be able to:
declare a shared memory with the size of the block
copy the 'relevant' part of the data to the block
use this data
I understand I can use the
#pragma acc cached
directive, and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that's gonna be mapped to the CUDA architecture.
Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cached directive or on how to map some of the power of shared memory from CUDA to OpenACC?
If you are using PGI Accelerator Compiler, you can dump out the generated PTX file and see what is going on in underling of execution:
pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult
The generated PTX will be stored in the current directory.
EDIT: You may prefer to see the high-level code (CUDA for C or Fortran). So use following -ta=nvidia,cc13,keepptx,keepgpu .

Simplest Possible Example to Show GPU Outperform CPU Using CUDA

I am looking for the most concise amount of code possible that can be coded both for a CPU (using g++) and a GPU (using nvcc) for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.
To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc) for which the GPU outperforms. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.
First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it take some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.
Use something like this in C++:
#define N (1024*1024)
#define M (1000000)
int main()
{
float data[N]; int count = 0;
for(int i = 0; i < N; i++)
{
data[i] = 1.0f * i / N;
for(int j = 0; j < M; j++)
{
data[i] = data[i] * data[i] - 0.25f;
}
}
int sel;
printf("Enter an index: ");
scanf("%d", &sel);
printf("data[%d] = %f\n", sel, data[sel]);
}
Use something like this in CUDA/C:
#define N (1024*1024)
#define M (1000000)
__global__ void cudakernel(float *buf)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
buf[i] = 1.0f * i / N;
for(int j = 0; j < M; j++)
buf[i] = buf[i] * buf[i] - 0.25f;
}
int main()
{
float data[N]; int count = 0;
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
cudakernel<<<N/256, 256>>>(d_data);
cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
int sel;
printf("Enter an index: ");
scanf("%d", &sel);
printf("data[%d] = %f\n", sel, data[sel]);
}
If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.
A very, very simple method would be to calculate the squares for, say, the first 100,000 integers, or a large matrix operation. Ita easy to implement and lends itself to the the GPUs strengths by avoiding branching, not requiring a stack, etc. I did this with OpenCL vs C++ awhile back and got some pretty astonishing results. (A 2GB GTX460 achieved about 40x the performance of a dual core CPU.)
Are you looking for example code, or just ideas?
Edit
The 40x was vs a dual core CPU, not a quad core.
Some pointers:
Make sure you're not running, say, Crysis while running your benchmarks.
Shot down all unnecessary apps and services that might be stealing CPU time.
Make sure your kid doesn't start watching a movie on your PC while the benchmarks are running. Hardware MPEG decoding tends to influence the outcome. (Autoplay let my two year old start Despicable Me by inserting the disk. Yay.)
As I said in my comment response to #Paul R, consider using OpenCL as it'll easily let you run the same code on the GPU and CPU without having to reimplement it.
(These are probably pretty obvious in retrospect.)
For reference, I made a similar example with time measurements. With GTX 660, the GPU speedup was 24X where its operation includes data transfers in addition to actual computation.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <time.h>
#define N (1024*1024)
#define M (10000)
#define THREADS_PER_BLOCK 1024
void serial_add(double *a, double *b, double *c, int n, int m)
{
for(int index=0;index<n;index++)
{
for(int j=0;j<m;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
}
__global__ void vector_add(double *a, double *b, double *c)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
for(int j=0;j<M;j++)
{
c[index] = a[index]*a[index] + b[index]*b[index];
}
}
int main()
{
clock_t start,end;
double *a, *b, *c;
int size = N * sizeof( double );
a = (double *)malloc( size );
b = (double *)malloc( size );
c = (double *)malloc( size );
for( int i = 0; i < N; i++ )
{
a[i] = b[i] = i;
c[i] = 0;
}
start = clock();
serial_add(a, b, c, N, M);
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
end = clock();
float time1 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("Serial: %f seconds\n",time1);
start = clock();
double *d_a, *d_b, *d_c;
cudaMalloc( (void **) &d_a, size );
cudaMalloc( (void **) &d_b, size );
cudaMalloc( (void **) &d_c, size );
cudaMemcpy( d_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( d_b, b, size, cudaMemcpyHostToDevice );
vector_add<<< (N + (THREADS_PER_BLOCK-1)) / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( d_a, d_b, d_c );
cudaMemcpy( c, d_c, size, cudaMemcpyDeviceToHost );
printf( "c[0] = %d\n",0,c[0] );
printf( "c[%d] = %d\n",N-1, c[N-1] );
free(a);
free(b);
free(c);
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
end = clock();
float time2 = ((float)(end-start))/CLOCKS_PER_SEC;
printf("CUDA: %f seconds, Speedup: %f\n",time2, time1/time2);
return 0;
}
I agree with David's comments about OpenCL being a great way to test this, because of how easy it is to switch between running code on the CPU vs. GPU. If you're able to work on a Mac, Apple has a nice bit of sample code that does an N-body simulation using OpenCL, with kernels running on the CPU, GPU, or both. You can switch between them in real time, and the FPS count is displayed onscreen.
For a much simpler case, they have a "hello world" OpenCL command line application that calculates squares in a manner similar to what David describes. That could probably be ported to non-Mac platforms without much effort. To switch between GPU and CPU usage, I believe you just need to change the
int gpu = 1;
line in the hello.c source file to 0 for CPU, 1 for GPU.
Apple has some more OpenCL example code in their main Mac source code listing.
Dr. David Gohara had an example of OpenCL's GPU speedup when performing molecular dynamics calculations at the very end of this introductory video session on the topic (about around minute 34). In his calculation, he sees a roughly 27X speedup by going from a parallel implementation running on 8 CPU cores to a single GPU. Again, it's not the simplest of examples, but it shows a real-world application and the advantage of running certain calculations on the GPU.
I've also done some tinkering in the mobile space using OpenGL ES shaders to perform rudimentary calculations. I found that a simple color thresholding shader run across an image was roughly 14-28X faster when run as a shader on the GPU than the same calculation performed on the CPU for this particular device.

cuda programming problem

I'm very new to cuda .I'm using cuda on my ubuntu 10.04 in device emulation mode.
I write a code to compute the square of array which is following :
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x + threadIdx.x;
if (idx<=N)
a[idx] = a[idx] * a[idx];
}
int main(void)
{
float *a_h, *a_d;
const int N = 10;
size_t size = N * sizeof(float);
a_h = (float *)malloc(size);
cudaMalloc((void **) &a_d, size);
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
square_array <<< 1,10>>> (a_d, N);
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf(" %f\n", a_h[i]);
free(a_h);
cudaFree(a_d);
return 0;
}
When I run this code it show no problem it give me proper output.
Now my problem is that when i use <<<2,5>>> or<<<5,2>>> the result is same . what is happening on gpu ?
All I understand is that I just launch cuda kernel with 5 blocks containing 2 thread.
Can anyone explain me how Gpu handle this or implement the launch(kernel call)?
Now my real problem is that when i call the kernel with <<<1,10>>> It is ok . It shows the perfect result.
but when i call the kernel with <<<1,5>> the result is following:
0.000000
1.000000
4.000000
9.000000
16.000000
5.000000
6.000000
7.000000
8.000000
9.000000
similarly when i reduce or increase the second parameter in kernel call it show different result for example when i change it to <<1,4>> it shows following result:
0.000000
1.000000
4.000000
9.000000
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000
Why this result is coming ?
Can any body explain the working of kernel launch call ?
what is blockdim type variable contain ?
Please help me to understand the concept of kernel call launching and working ?
I searched the programming guide but they didn't explain it very well.
The calculation of idx in your kernel code is incorrect. If you change it to:
int idx = blockDim.x * blockIdx.x + threadIdx.x;
You might find the results a little easier to understand.
EDIT: For any given kernel launch
square_array<<<gridDim,blockDim>>>(...)
in the GPU, the automatic variable blockDim will contain the x,y, and z components of the blockDim argument passed in the host side kernel launch. Similarly gridDim will contain the x and y components of the gridDim argument passed in the launch.
Apart from what talonmies has said, you may need to do the following to have better performance in real world applications.
if (idx < N) {
tmp = a[idx];
a[idx] = tmp * tmp;
}
The way kernels are invoked in CUDA is like so:
kernel<<<numBlocks,numThreads>>>(Kernel arguments);
This means that there will be numBlocks blocks with numThreads threads running in each block. For example, if you call
kernel<<<1,5>>>(Kernel args);
then 1 block will run with 5 threads running in parallel. and if you call
kernel<<<2,5>>>(Kernel args);
then there well be 2 blocks with 5 threads running in each. Unless you alter your device code, the maximum dimension of the array that you are "squaring" is the product numBlocks*numThreads. This explains why not all of the values in your original array were squared.
I suggest you read through the CUDA_C_Programming_Guide.pdf that comes with the CUDA toolkit.