I am trying to use !$acc cache for a specific loop inside a 2D Laplace solver. When I analyze the code with -Mcuda=ptxinfo, it shows no shared memory (smem) usage, yet the code runs slower than the baseline version without the cache directive.
Here is a part of the code:
!$acc parallel loop reduction(max:error) num_gangs(n/THREADS) vector_length(THREADS)
do j=2,m-1
  do i=2,n-1
#ifdef SHARED
!$acc cache(A(i-1:i+1,j),A(i,j-1:j+1))
#endif
    Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
    error = max( error, abs( Anew(i,j) - A(i,j) ) )
  end do
end do
!$acc end parallel
This is the output when using !$acc cache:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 96 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 96 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 64 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 352 bytes cmem[0]
This is the output without cache:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 88 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 88 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 64 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 352 bytes cmem[0]
Also, -Minfo=accel shows that some amount of memory has been cached:
acc_lap2d:
17, Generating copy(a(:4096,:4096))
Generating create(anew(:4096,:4096))
39, Accelerator kernel generated
Generating Tesla code
39, Max reduction generated for error
40, !$acc loop gang(256) ! blockidx%x
41, !$acc loop vector(16) ! threadidx%x
Cached references to size [(x)x3] block of a
Loop is parallelizable
58, Accelerator kernel generated
Generating Tesla code
59, !$acc loop gang ! blockidx%x
60, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
I am wondering how to use the cache directive (shared memory in the CUDA sense) efficiently in OpenACC.
Thank you so much for your help.
Behzad
The compiler should be flagging this as an error: you can't list the same variable twice in the same cache directive. Since I work for PGI, I've added a technical problem report (TPR#21898) requesting that we detect this. Although it isn't specifically illegal in the current OpenACC specification, we'll bring it up with the standards committee. The problem is that the compiler won't be able to tell which of the two cached copies to use for a given reference.
The fix would be to combine the two references:
!$acc cache(A(i-1:i+1,j-1:j+1))
Note that the ptxas info won't show the shared memory usage, since it only reports statically-sized shared memory. We adjust the shared memory size dynamically when the CUDA kernel is launched. Looking through the generated CUDA C code (-ta=tesla:nollvm,keep), I can see that the shared memory references are being generated.
Also note that using shared memory does not guarantee better performance. There is overhead in populating the shared array, and the generated kernel needs to synchronize threads. Unless there is a lot of data reuse, "cache" may not be beneficial.
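To make that trade-off concrete, here is a rough, hand-written CUDA sketch of the kind of shared-memory tiling the cache clause implies for this stencil. It is not the code the compiler actually generates; TILE, the row-major indexing, and the 1D block shape are my own assumptions, and the max-error reduction from the original loop is omitted.

#define TILE 128  // assumed blockDim.x / vector length

// A and Anew are stored row-major here (A[j*n + i]) purely for simplicity.
// Each block handles a TILE-wide strip of one interior row j and stages the
// three rows it touches (j-1, j, j+1) into shared memory before computing.
__global__ void laplace_smem(const double *A, double *Anew, int n, int m)
{
    __shared__ double sA[3][TILE + 2];                  // +2 for the i-1 / i+1 halo

    int j = blockIdx.y + 1;                             // interior row
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // interior column
    int t = threadIdx.x + 1;                            // position inside the tile

    // Populate the shared tile: this is the extra traffic the cache directive
    // introduces, and it must be fenced with __syncthreads().
    if (i < n) {
        for (int r = 0; r < 3; ++r) {
            sA[r][t] = A[(j - 1 + r) * n + i];
            if (threadIdx.x == 0)                           sA[r][0]     = A[(j - 1 + r) * n + i - 1];
            if (threadIdx.x == blockDim.x - 1 && i + 1 < n) sA[r][t + 1] = A[(j - 1 + r) * n + i + 1];
        }
    }
    __syncthreads();   // every thread waits before reading its neighbours

    if (i < n - 1 && j < m - 1)
        Anew[j * n + i] = 0.25 * (sA[1][t + 1] + sA[1][t - 1] + sA[0][t] + sA[2][t]);
}

// Example launch covering the interior points:
//   dim3 block(TILE), grid((n - 2 + TILE - 1) / TILE, m - 2);
//   laplace_smem<<<grid, block>>>(d_A, d_Anew, n, m);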
If the PGI compiler can determine that an array is read-only, either via analysis or because it is declared with INTENT(IN), and we're targeting a device with compute capability 3.5 or greater, then we will try to route its loads through texture (read-only) memory. In this case, putting "A" through the texture path may be more beneficial than caching it in shared memory.
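For comparison, in hand-written CUDA the equivalent hint is to declare the input pointer const and __restrict__ (or to use __ldg()), which lets the compiler route loads through the read-only/texture data cache on compute capability 3.5 and later. This is only a sketch of the idea, not what the OpenACC compiler emits:

// Compile with -arch=sm_35 or later for __ldg.
// Marking A as const __restrict__ tells nvcc the data is read-only and not
// aliased, so it can use the read-only (texture) cache -- no shared memory,
// no extra synchronization.
__global__ void laplace_ldg(const double * __restrict__ A,
                            double * __restrict__ Anew, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y + 1;
    if (i < n - 1 && j < m - 1) {
        Anew[j * n + i] = 0.25 * (__ldg(&A[j * n + i + 1]) + __ldg(&A[j * n + i - 1]) +
                                  __ldg(&A[(j - 1) * n + i]) + __ldg(&A[(j + 1) * n + i]));
    }
}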
Hope this helps,
Mat
Related
I have a problem with:
cudaError_t cudaStatus = cudaSetDevice(gpuNo);   // select the GPU
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, gpuNo);

int ileWatkow = 512;                  // threads per block
int ileSM = prop.multiProcessorCount; // number of SMs on the device
int ileBlokow;                        // active blocks per SM (output)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ileBlokow,
                                              KernelTransformataPET_optymalizacja2,
                                              ileWatkow, 0 /* dynamic smem */);
On the 1080 Ti this returns ileBlokow=1, but if I compile and run on the 2080 Ti it returns ileBlokow=0.
Sometimes, if I copy-paste the kernel into a new project, the 1080 Ti also returns ileBlokow=0. I do not know what is going on...
The kernel is fairly big and contains a lot of lines of code.
1> ptxas info : Compiling entry function '_Z36KernelTransformataPET_optymalizacja2P6float2S0_S0_PfS1_S1_S1_PiS2_S2_' for 'sm_61'
1> ptxas info : Function properties for _Z36KernelTransformataPET_optymalizacja2P6float2S0_S0_PfS1_S1_S1_PiS2_S2_
1> 408 bytes stack frame, 490 bytes spill stores, 904 bytes spill loads
1> ptxas info : Used 255 registers, 424 bytes cumulative stack size, 16384 bytes smem, 400 bytes cmem[0]
1> ptxas info : Function properties for _ZnwyPv
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Function properties for __brev
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
This is what ptxas info returns.
Whether I compile with CC 6.1 for the 1080 or CC 7.5 for the 2080, there is no difference.
There is also no difference between CUDA Toolkit 9.1 and 10.1; both show the same problem.
Can someone give me some tips or advice on where I should look for the problem?
Platform: Windows + VS 2015
I resolved this problem.
There were two things:
1. I sent the iteration information to the kernel as an INT, but I copied it as a SHORT. This was a big problem on the newer GPU, where the highest bits ended up random, whereas on the 1080 Ti they happened to be zero, so it worked normally. This was my mistake, though I also suspect the host-to-device copy behaves differently on cards newer than the 1080 Ti. A sketch of this kind of mismatch is shown below.
2. I reduced the register usage from 255 to 230... and after recompiling with CUDA Toolkit 12, the register count dropped to 83 :) That was amazing to me :)
So my problem is resolved. Thanks, everyone, for the help :) Special thanks to Robert Crovella.
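For anyone hitting something similar, here is a minimal sketch of the kind of size mismatch described in point 1 (the names and values are made up for illustration, not the original code):

#include <cstdio>

__global__ void useIteration(const int *iter)
{
    // The kernel reads a full 4-byte int...
    printf("iteration seen by kernel: %d\n", *iter);
}

int main()
{
    int h_iter = 7;
    int *d_iter;
    cudaMalloc(&d_iter, sizeof(int));

    // BUG (as in the post): only sizeof(short) == 2 bytes are copied, so the
    // upper 2 bytes of *d_iter keep whatever happened to be in device memory.
    cudaMemcpy(d_iter, &h_iter, sizeof(short), cudaMemcpyHostToDevice);

    // FIX: copy the full int.
    // cudaMemcpy(d_iter, &h_iter, sizeof(int), cudaMemcpyHostToDevice);

    useIteration<<<1, 1>>>(d_iter);
    cudaDeviceSynchronize();
    cudaFree(d_iter);
    return 0;
}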
This question is a continuation of Interpreting the verbose output of ptxas, part I.
When we compile a kernel .ptx file with ptxas -v, or compile it from a .cu file with -ptxas-options=-v, we get a few lines of output such as:
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
(same example as in the linked-to question; but with name demangling)
This question regards the last line. A few more examples from other kernels:
ptxas info : Used 19 registers, 336 bytes cmem[0], 4 bytes cmem[2]
...
ptxas info : Used 19 registers, 336 bytes cmem[0]
...
ptxas info : Used 6 registers, 16 bytes smem, 328 bytes cmem[0]
How do we interpret the information on this line, other than the number of registers used? Specifically:
Is cmem short for constant memory?
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
smem probably stands for shared memory; is it only static shared memory?
Under which conditions does each kind of entry appear on this line?
Is cmem short for constant memory?
Yes
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
They represent different constant memory banks. cmem[0] is the reserved bank for kernel arguments and statically sized constant values.
smem probably stands for shared memory; is it only static shared memory?
It is; only statically-allocated shared memory can be known at compile time, since dynamic shared memory is specified at kernel launch.
Under which conditions does each kind of entry appear on this line?
Mostly answered here.
Collected and reformatted...
Resources on the last ptxas info line:
registers - in the register file on every SM (multiprocessor)
gmem - Global memory
smem - Static Shared memory
cmem[N] - Constant memory bank with index N.
cmem[0] - Bank reserved for kernel arguments and statically-sized constant values
cmem[2] - ???
cmem[4] - ???
cmem[14] - ???
Each of these categories is shown only if the kernel uses that kind of memory (registers are essentially always shown); thus it is no surprise that all the examples show some cmem[0] usage, since the kernel arguments live there.
You can read a bit more on the CUDA memory hierarchy in Section 2.3 of the Programming Guide and the links there. Also, there's this blog post about static vs dynamic shared memory.
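To tie these categories back to source code, here is a small example of my own (the byte counts will vary by architecture and toolkit) showing how each kind of declaration surfaces in ptxas -v:

// Compile with: nvcc -arch=sm_50 --ptxas-options=-v demo.cu
__constant__ float coeffs[64];           // 256 bytes of user __constant__ data (a cmem bank)
__device__   float lookup[100];          // 400 bytes -> the "bytes gmem" line

__global__ void demo(float *out, int n)  // kernel arguments themselves occupy cmem[0]
{
    __shared__ float tile[256];          // 1024 bytes -> "bytes smem" (static only)
    extern __shared__ float dyn[];       // dynamic smem: not counted by ptxas; its
                                         // size is the 3rd <<< >>> launch parameter
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dyn[threadIdx.x]  = lookup[i % 100];
    tile[threadIdx.x] = coeffs[i % 64];
    if (i < n)
        out[i] = tile[threadIdx.x] + dyn[threadIdx.x];
}

// Example launch: demo<<<(n + 255) / 256, 256, 256 * sizeof(float)>>>(d_out, n);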
I compiled my program with "nvcc -ccbin=icpc source/* -Iinclude -arch=sm_35 --ptxas-options=-v". The output is below:
ptxas info : 0 bytes gmem
ptxas info : 0 bytes gmem
ptxas info : 450 bytes gmem
ptxas info : Compiling entry function '_Z21process_full_instancePiPViS1_S_' for 'sm_35'
ptxas info : Function properties for _Z21process_full_instancePiPViS1_S_
408 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 174 registers, 9748 bytes smem, 352 bytes cmem[0]
I think gmem refers to global memory, but why do the first and third lines have different values (0 vs 450) for gmem?
smem is shared memory; what about cmem?
Is the memory usage per block or per SM (streaming multiprocessor)? Blocks are dynamically assigned to SMs. Can we infer how many blocks will run concurrently on an SM?
My GPU is a K20.
smem is shared memory; what about cmem?
cmem stands for constant memory
gmem stands for global memory
smem stands for shared memory
lmem stands for local memory
stack frame is part of local memory
spill loads and store use part of the stack frame
Is the memory usage per block or per SM (streaming multiprocessor)?
No, the number of registers is per thread while the shared memory is per block.
Can we infer how many blocks will run concurrently on an SM?
No. The ptxas output does not tell you the number of threads per block, so from it alone you cannot calculate the resources each block requires.
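That said, once you do pick a block size yourself, the runtime occupancy API will do the resource arithmetic for you. A small sketch (kernel name, block size, and device 0 are placeholders):

#include <cstdio>

__global__ void myKernel(float *out) { /* ... */ }

int main()
{
    int blockSize = 256;        // chosen threads per block
    int blocksPerSM = 0;

    // Accounts for the kernel's registers and static shared memory, plus the
    // dynamic shared memory given here (0 bytes in this sketch).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d blocks per SM, %d SMs -> up to %d resident blocks\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}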
I used --ptxas-options=-v while compiling my .cu code, and it gave the following:
ptxas info: Used 74 registers, 124 bytes smem, 16 bytes cmem[1]
deviceQuery for my card returns the following:
rev: 2.0
name: tesla c2050
total shared memory per block: 49152
total reg. per block: 32768
Now, I enter these data into the CUDA occupancy calculator as follows:
1.) 2.0
1.b) 49152
2.) threads per block: x
registers per thread: 74
shared memory per block (bytes): 124
I was varying x (threads per block) so that x*74 <= 32768; for example, I enter 128 (or 256) in place of x. Am I entering all the values required by the occupancy calculator correctly? Thanks.
--ptxas-options=--verbose (or -v) produces output in the following format:
ptxas : info : Compiling entry function '_Z13matrixMulCUDAILi16EEvPfS0_S0_ii' for 'sm_10'
ptxas : info : Used 15 registers, 2084 bytes smem, 12 bytes cmem[1]
The critical information is
1st line has the target architecture
2nd line has <Registers Per Thread>, <Static Shared Memory Per Block>, <Constant Memory Per Kernel>
When you fill in the occupancy calculator
Set field 1.) Select Compute Capability to the target architecture from the 1st line ('sm_10', i.e. compute capability 1.0, in the above example)
Set field 2.) Registers Per Thread to <Registers Per Thread> from the 2nd line
Set field 2.) Shared Memory Per Block to <Static Shared Memory Per Block> + the DynamicSharedMemoryPerBlock passed as the 3rd parameter to <<<GridDim, BlockDim, DynamicSharedMemoryPerBlock, Stream>>>
The Occupancy Calculator Help tab contains additional information.
In your example, I believe you are not setting field 1 correctly, as the Fermi architecture is limited to 63 registers per thread; sm_1* supports a limit of 124 registers per thread.
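As a concrete illustration of that shared-memory field (the kernel and the sizes here are hypothetical, not taken from the question):

__global__ void tiledKernel(const float *in, float *out, int n)
{
    __shared__ float staticTile[512];   // 2048 bytes -> shows up in ptxas -v smem
    extern __shared__ float dynTile[];  // size set at launch, invisible to ptxas

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    staticTile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    dynTile[threadIdx.x]    = staticTile[threadIdx.x] * 2.0f;
    if (i < n) out[i] = dynTile[threadIdx.x];
}

// Launched with 512 threads and 512*sizeof(float) = 2048 bytes of dynamic smem:
//   tiledKernel<<<grid, 512, 512 * sizeof(float)>>>(d_in, d_out, n);
// Occupancy calculator "Shared Memory Per Block" = 2048 (static, from ptxas -v)
//                                                + 2048 (dynamic, from the launch)
//                                                = 4096 bytes.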
The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launch with: kernelA<<<20,512>>>(float parmA, int paramB); and it will run fine.
Launch with: kernelA<<<20,513>>>(float parmA, int paramB); and I get an out-of-resources error ("too many resources requested for launch").
The Fermi device properties: 48 KB of shared memory per SM, 64 KB of constant memory, 32K registers per SM, 1024 maximum threads per block, compute capability 2.1 (sm_21).
I'm using all my shared mem space.
I'll run out of block register space around 700 threads/block. The kernel will not launch if I ask for more than half the number of MAX_threads/block. It may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the reg count, I commented out the printf's. Reg count= 45
When it was actually running, the printf's were compiled in. Reg count = 63, with plenty of spill "regs".
I suspect each thread really has 64 reg's, with only 63 available to the program.
64 reg's * 512 threads = 32K - The maximum available to a single block.
So I suggest the number of available "code" registers for a block = cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of registers per thread to 63 (beyond that they spill over to lmem). I suspect this limit of 63 is a HW addressing limitation.
So it looks like I'm running out of register space.
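If the goal is to launch a full 1024-thread block anyway, one general option (not necessarily the right trade-off for this kernel; the signature is just copied from the post) is to cap registers per thread with __launch_bounds__ or -maxrregcount and accept some spilling:

// Asking the compiler to guarantee that blocks of up to 1024 threads can launch:
// with a 32K-register file that forces <= 32 registers per thread, and anything
// beyond that spills to local memory (check the ptxas -v spill counts afterwards).
__global__ void __launch_bounds__(1024) kernelA(float parmA, int paramB)
{
    // ... kernel body ...
}

// Alternatively, cap registers for the whole file at compile time:
//   nvcc -maxrregcount=32 ...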