cudaOccupancyMaxActiveBlocksPerMultiprocessor == 0?

I have a problem with the following code:
cudaStatus = cudaSetDevice(gpuNo); // GPU select
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, gpuNo);
int ileWatkow = 512;                  // threads per block
int ileSM = prop.multiProcessorCount; // number of SMs
int ileBlokow;                        // blocks per SM (output)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ileBlokow, KernelTransformataPET_optymalizacja2, ileWatkow, 0);
On a 1080 Ti this returns ileBlokow=1, but when I compile and run it on a 2080 Ti it returns ileBlokow=0.
Sometimes, if I copy-paste the kernel into a new project, even the 1080 Ti returns ileBlokow=0. I do not know what is going on...
The kernel is fairly large and contains a lot of lines of code...
1> ptxas info : Compiling entry function '_Z36KernelTransformataPET_optymalizacja2P6float2S0_S0_PfS1_S1_S1_PiS2_S2_' for 'sm_61'
1> ptxas info : Function properties for _Z36KernelTransformataPET_optymalizacja2P6float2S0_S0_PfS1_S1_S1_PiS2_S2_
1> 408 bytes stack frame, 490 bytes spill stores, 904 bytes spill loads
1> ptxas info : Used 255 registers, 424 bytes cumulative stack size, 16384 bytes smem, 400 bytes cmem[0]
1> ptxas info : Function properties for _ZnwyPv
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Function properties for __brev
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
This is what the ptxas info reports.
Whether I compile with CC 6.1 for the 1080 Ti or CC 7.5 for the 2080 Ti makes no difference.
There is also no difference between CUDA Toolkit 9.1 and 10.1; both show the same problem.
Can someone help, or give me some tips or advice on where I should look for the problem?
platform: Windows + VS 2015
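(For reference, a minimal, error-checked sketch of the query above; the kernel body is a placeholder, since the real one is large, and device 0 is assumed:)
#include <cstdio>

// Placeholder for the asker's kernel; the real one is much larger.
__global__ void KernelTransformataPET_optymalizacja2(float *out)
{
    out[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    cudaSetDevice(0);                 // assumed device index
    int ileWatkow = 512;              // threads per block
    int ileBlokow = -1;               // blocks per SM (output)
    cudaError_t st = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &ileBlokow, KernelTransformataPET_optymalizacja2, ileWatkow, 0);
    // A result of 0 is only meaningful if the call itself succeeded,
    // so always inspect the returned status too.
    printf("status=%s, blocks per SM=%d\n", cudaGetErrorString(st), ileBlokow);
    return 0;
}
Lowering ileWatkow (e.g. to 256) also shows whether a zero comes from per-block resource limits rather than from an API error.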

I resolved this problem.
There were two things.
First, I passed iteration information to the kernel as an INT, but I copied it as a SHORT.
This was a big problem on the "newer" GPU, where the upper bits ended up random; on the 1080 Ti they happened to be zero, so everything worked normally.
→ This was my mistake, although I also suspect the host-to-device copy behaves differently on cards newer than the 1080 Ti. In code, the bug looked roughly like the sketch below.
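(A hedged reconstruction of that first bug; the variable names are invented:)
int iterations = 100000;  // the kernel reads this as a full int
int *d_iterations;
cudaMalloc(&d_iterations, sizeof(int));

// BUG: only sizeof(short) == 2 of the 4 bytes are copied, so the upper
// half of the device-side int keeps whatever happened to be in memory.
cudaMemcpy(d_iterations, &iterations, sizeof(short), cudaMemcpyHostToDevice);

// FIX: copy the full width of the type the kernel actually reads.
cudaMemcpy(d_iterations, &iterations, sizeof(int), cudaMemcpyHostToDevice);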
Second, I reduced register usage from 255 to 230... and after recompiling with CUDA Toolkit 12 the register count dropped to 83 :) That was amazing to me :)
So my problem is resolved, thanks everyone for the help :) Special thanks to @Robert Crovella.
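(For anyone chasing the same register pressure: two standard knobs for capping registers per thread, shown as a sketch with example limits rather than the values used above:)
// 1. Per kernel, in source: ask the compiler to keep the kernel
//    launchable with 512 threads per block.
__global__ void __launch_bounds__(512)
KernelTransformataPET_optymalizacja2(float *out)
{
    // ... kernel body ...
}

// 2. Per compilation unit, on the command line (applies to all kernels):
//    nvcc -maxrregcount=128 ...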

Related

Interpreting the verbose output of ptxas, part II

This question is a continuation of Interpreting the verbose output of ptxas, part I .
When we compile a kernel .ptx file with ptxas -v, or compile it from a .cu file with --ptxas-options=-v, we get a few lines of output such as:
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
(the same example as in the linked-to question, but with name demangling)
This question regards the last line. A few more examples from other kernels:
ptxas info : Used 19 registers, 336 bytes cmem[0], 4 bytes cmem[2]
...
ptxas info : Used 19 registers, 336 bytes cmem[0]
...
ptxas info : Used 6 registers, 16 bytes smem, 328 bytes cmem[0]
How do we interpret the information on this line, other than the number of registers used? Specifically:
Is cmem short for constant memory?
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
smem probably stands for shared memory; is it only static shared memory?
Under which conditions does each kind of entry appear on this line?
Is cmem short for constant memory?
Yes
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
They represent different constant memory banks. cmem[0] is the reserved bank for kernel arguments and statically sized constant values.
smem probably stands for shared memory; is it only static shared memory?
It is. It could hardly be otherwise: the dynamic shared memory size is only supplied at launch time, so ptxas cannot know it at compile time (see the sketch below).
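(A minimal illustration of the distinction; the kernel and names are ours:)
__global__ void smemDemo(float *out)
{
    __shared__ float fixed[256];        // static: ptxas counts this as smem
    extern __shared__ float flexible[]; // dynamic: size unknown to ptxas

    fixed[threadIdx.x] = (float)threadIdx.x;
    flexible[threadIdx.x] = fixed[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = flexible[threadIdx.x];
}

// The dynamic size is chosen per launch, as the 3rd <<<>>> parameter:
// smemDemo<<<1, 256, 256 * sizeof(float)>>>(d_out);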
Under which conditions does each kind of entry appear on this line?
Mostly answered in the points above. Collected and reformatted:
Resources on the last ptxas info line:
registers - in the register file on every SM (multiprocessor)
gmem - Global memory
smem - Static Shared memory
cmem[N] - Constant memory bank with index N.
cmem[0] - Bank reserved for kernel arguments and statically-sized constant values
cmem[2] - ???
cmem[4] - ???
cmem[14] - ???
Each of these categories is shown only if the kernel uses that kind of memory (registers are probably always shown); and since kernel arguments always go through cmem[0], it is no surprise that all the examples show some cmem[0] usage.
You can read a bit more on the CUDA memory hierarchy in Section 2.3 of the Programming Guide and the links there. Also, there's this blog post about static vs dynamic shared memory.
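(To watch an extra cmem bank appear in practice, a small hedged experiment — the bank index that __constant__ data lands in varies across architectures:)
__constant__ float lookupTable[64];   // user-defined constant memory

__global__ void cmemDemo(float *out)
{
    // The kernel argument 'out' is passed through cmem[0]; lookupTable
    // shows up in a separate cmem bank in the ptxas -v report.
    out[threadIdx.x] = lookupTable[threadIdx.x % 64];
}

// Compile with: nvcc -Xptxas -v -c cmem_demo.cu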

NVCC --ptxas-options=-v output

I compiled my program with "nvcc -ccbin=icpc source/* -Iinclude -arch=sm_35 --ptxas-options=-v". The output is below:
ptxas info : 0 bytes gmem
ptxas info : 0 bytes gmem
ptxas info : 450 bytes gmem
ptxas info : Compiling entry function '_Z21process_full_instancePiPViS1_S_' for 'sm_35'
ptxas info : Function properties for _Z21process_full_instancePiPViS1_S_
408 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 174 registers, 9748 bytes smem, 352 bytes cmem[0]
I think gmem refers to global memory, but why do the first and third lines report different values (0 vs 450) for gmem?
smem is shared memory; what about cmem?
Is the memory usage per block or per SM (streaming multiprocessor)? Blocks are dynamically assigned to SMs. Can we infer how many blocks will run concurrently on an SM?
My GPU is K20.
smem is shared memory; what about cmem?
cmem stands for constant memory
gmem stands for global memory
smem stands for shared memory
lmem stands for local memory
stack frame is part of local memory
spill loads and stores use part of the stack frame
Is the memory usage per block or per SM (streaming multiprocessor)?
No, the number of registers is per thread while the shared memory is per block.
Can we infer how many blocks will concurrently run on a SM?
No. Since you can't determine the number of threads per block from this output, you cannot calculate the resources each block requires. (See the sketch below for how to get the answer once a block size is chosen.)
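(Once you do fix a block size, the runtime can compute blocks-per-SM for you. A sketch; the kernel and the 256-thread choice are illustrative:)
#include <cstdio>

__global__ void processDemo(int *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1;
}

int main()
{
    int blocksPerSM = 0;
    int threadsPerBlock = 256;  // an assumed choice
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, processDemo, threadsPerBlock, 0 /* dynamic smem */);
    printf("blocks per SM at %d threads/block: %d\n",
           threadsPerBlock, blocksPerSM);
    return 0;
}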

cuda - loop unrolling issue

I have a kernel with #pragma unroll 80, and I'm running it on an NVIDIA GT 285, compute capability 1.3,
with grid architecture dim3 thread_block(16, 16) and dim3 grid(40, 30), and it works fine.
When I tried running it on an NVIDIA GT 580, compute capability 2.0, with the above grid architecture, it also works fine.
When I change the grid architecture on the GT 580 to
dim3 thread_block(32, 32) and dim3 grid(20, 15), thus producing the same total number of threads as above, I get incorrect results.
If I remove #pragma unroll 80 or replace it with #pragma unroll 1 on the GT 580, it works fine. If I don't, the kernel crashes.
Would anyone know why this happens? Thank you in advance.
EDIT: I checked for kernel errors on both devices and got "invalid argument".
Searching for the causes of this error, I found that it happens when the dimensions of the grid and the block exceed their limits.
But that is not the case for me, since I use 16x16=256 threads per block and 40x30=1200 total blocks. As far as I know these values are within the grid limits for compute capability 1.3.
I would like to know if this could have anything to do with my loop unrolling issue.
I figured out what the problem was.
After some bug fixes I got the "too many resources requested for launch" error.
Loop unrolling needs extra registers per thread, and I was running out of registers, hence the error and the kernel failure.
I needed 22 registers per thread, and I have 1024 threads per block.
Entering my data into the CUDA Occupancy Calculator showed that 1 block per SM is scheduled, leaving me 32768 registers for a whole block on the compute capability 2.0 device.
22 registers * 1024 threads = 22528 registers < 32768, which should have worked.
But I was compiling with nvcc -arch=sm_13, which applies the CC 1.3 limit of 16384 registers per SM.
Compiling with nvcc -arch=sm_20 takes advantage of all 32768 registers, more than enough for the 22528 needed, and it works fine now.
Thanks to everyone; I learned about kernel errors. The checks below would have surfaced the problem early.
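(A sketch of those two checks; unrollKernel is an illustrative stand-in for the real kernel:)
#include <cstdio>

__global__ void unrollKernel(float *out)
{
    float acc = 0.0f;
    #pragma unroll 80
    for (int i = 0; i < 80; ++i)
        acc += i * 0.5f;
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int thread = threadIdx.y * blockDim.x + threadIdx.x;
    out[block * blockDim.x * blockDim.y + thread] = acc;
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 20 * 15 * 32 * 32 * sizeof(float));

    // Check 1: how many registers does the kernel actually need per thread?
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, unrollKernel);
    printf("registers per thread: %d\n", attr.numRegs);

    dim3 thread_block(32, 32);   // the failing 1024-thread configuration
    dim3 grid(20, 15);
    unrollKernel<<<grid, thread_block>>>(d_out);

    // Check 2: launch failures such as "too many resources requested for
    // launch" are reported asynchronously; query them explicitly.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}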

cuda occupancy calculator

I used --ptxas-options=-v while compiling my .cu code, and it gave the following:
ptxas info: Used 74 registers, 124 bytes smem, 16 bytes cmem[1]
deviceQuery for my card returns the following:
rev: 2.0
name: tesla c2050
total shared memory per block: 49152
total reg. per block: 32768
Now I input these data into the CUDA Occupancy Calculator as follows:
1.) 2.0
1.b) 49152
2.) threads per block: x
registers per thread: 74
shared memory per block (bytes): 124
I was varying x (threads per block) so that x*74 <= 32768; for example, I enter 128 (or 256) in place of x. Am I entering all the values required by the occupancy calculator correctly? Thanks.
--ptxas-options=--verbose (or -v) produces output of the format:
ptxas : info : Compiling entry function '_Z13matrixMulCUDAILi16EEvPfS0_S0_ii' for 'sm_10'
ptxas : info : Used 15 registers, 2084 bytes smem, 12 bytes cmem[1]
The critical information is
1st line has the target architecture
2nd line has <Registers Per Thread>, <Static Shared Memory Per Block>, <Constant Memory Per Kernel>
When you fill in the occupancy calculator
Set field 1.) Select Compute Capability to 1.0 for 'sm_10' in the above example
Set field 2.) Registers Per Thread to <Registers Per Thread>
Set field 2.) Shared Memory Per Block to <Static Shared Memory Per Block> + the DynamicSharedMemoryPerBlock passed as the 3rd parameter to <<<GridDim, BlockDim, DynamicSharedMemoryPerBlock, Stream>>>
The Occupancy Calculator Help tab contains additional information.
In your example I believe you are not setting field 1 correctly: the Fermi architecture is limited to 63 registers per thread, while sm_1* supports up to 124 registers per thread, so your 74-register kernel must have been compiled for sm_1*.
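(The same three inputs can also be gathered at runtime instead of being read off the ptxas log; a sketch, with myKernel standing in for any real kernel:)
#include <cstdio>

__global__ void myKernel(float *out) { out[threadIdx.x] = 0.0f; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("field 1, compute capability: %d.%d\n", prop.major, prop.minor);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("field 2, registers per thread: %d\n", attr.numRegs);
    printf("field 2, static smem per block: %zu bytes\n", attr.sharedSizeBytes);
    // Remember to add any dynamic shared memory passed at launch
    // to the shared-memory field.
    return 0;
}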

Cuda: Chasing an insufficient resource issue

The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launching with kernelA<<<20,512>>>(parmA, paramB); runs fine.
Launching with kernelA<<<20,513>>>(parmA, paramB); gets the out-of-resources error (too many resources requested for launch).
The Fermi device properties: 48KB of shared memory per SM, 64KB of constant memory, 32K registers per SM, 1024 maximum threads per block, compute capability 2.1 (sm_21).
I'm using all of my shared memory space.
By my register arithmetic I should only run out of per-block register space around 700 threads/block, yet the kernel will not launch if I ask for more than half of the max threads/block. It may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the data/state of stalled threads goes between warps. What resource holds it?
When I did the register count, I had commented out the printf's: reg count = 45.
When it was actually running, the printf's were compiled in: reg count = 63, with plenty of spilled "registers".
I suspect each thread really has 64 registers, with only 63 available to the program:
64 registers * 512 threads = 32K, the maximum available to a single block.
So I suggest the number of "code" registers available to a block = cudaDeviceProp::regsPerBlock - blockDim; i.e., the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of registers per thread to 63 (otherwise they spill over to lmem). I suspect this 63 is a HW addressing limitation.
So it looks like I'm running out of register space. (The sketch below asks the runtime directly for a block size the kernel can actually launch with.)
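(A hedged sketch of that runtime query; the kernel body is a stand-in that reproduces the 49152-byte static shared memory footprint:)
#include <cstdio>

__global__ void kernelA(float parmA, int paramB)
{
    __shared__ float buf[12288];   // 12288 * 4 = 49152 bytes, as above
    buf[threadIdx.x] = parmA * paramB;
}

int main()
{
    int minGridSize = 0, bestBlockSize = 0;
    // Ask the runtime for an occupancy-optimal block size that respects
    // the kernel's register and shared-memory footprint.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlockSize,
                                       kernelA, 0 /* dynamic smem */, 0);
    printf("suggested block size: %d\n", bestBlockSize);
    return 0;
}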