This question is a continuation of Interpreting the verbose output of ptxas, part I.
When we compile a kernel .ptx file with ptxas -v, or compile it from a .cu file with --ptxas-options=-v, we get a few lines of output such as:
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
(the same example as in the linked-to question, but with name demangling)
This question regards the last line. A few more examples from other kernels:
ptxas info : Used 19 registers, 336 bytes cmem[0], 4 bytes cmem[2]
...
ptxas info : Used 19 registers, 336 bytes cmem[0]
...
ptxas info : Used 6 registers, 16 bytes smem, 328 bytes cmem[0]
How do we interpret the information on this line, other than the number of registers used? Specifically:
Is cmem short for constant memory?
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
smem probably stands for shared memory; is it only static shared memory?
Under which conditions does each kind of entry appear on this line?
Is cmem short for constant memory?
Yes
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
They represent different constant memory banks. cmem[0] is the reserved bank for kernel arguments and statically sized constant values.
smem probably stands for shared memory; is it only static shared memory?
It is only static shared memory, and it could hardly be otherwise: the size of dynamic shared memory is specified only at launch time, so ptxas cannot know it when the kernel is compiled.
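To illustrate (a minimal sketch; the kernel names and sizes are made up for the example, and the exact byte counts ptxas prints will vary):

// Static shared memory: the size is known at compile time, so ptxas -v
// counts it (here 256 * sizeof(float) = 1024 bytes of smem).
__global__ void static_smem_kernel(float *out)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[255 - threadIdx.x];
}

// Dynamic shared memory: the size comes from the 3rd <<<...>>> launch
// parameter, so ptxas cannot know it and does not include it in "smem".
__global__ void dynamic_smem_kernel(float *out, int n)
{
    extern __shared__ float buf[];
    if (threadIdx.x < n) buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x < n) out[threadIdx.x] = buf[n - 1 - threadIdx.x];
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    static_smem_kernel<<<1, 256>>>(d_out);
    // 1024 bytes of dynamic shared memory, chosen only here, at launch time:
    dynamic_smem_kernel<<<1, 256, 256 * sizeof(float)>>>(d_out, 256);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}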
Under which conditions does each kind of entry appear on this line?
Mostly answered here.
Collected and reformatted...
Resources on the last ptxas info line:
registers - in the register file on every SM (multiprocessor)
gmem - Global memory
smem - Static Shared memory
cmem[N] - Constant memory bank with index N.
cmem[0] - Bank reserved for kernel arguments and statically-sized constant values
cmem[2] - ???
cmem[4] - ???
cmem[14] - ???
Each of these categories is shown only if the kernel uses that kind of memory (registers are probably always shown); thus it is no surprise that all the examples show some cmem[0] usage.
You can read a bit more on the CUDA memory hierarchy in Section 2.3 of the Programming Guide and the links there. Also, there's this blog post about static vs dynamic shared memory.
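For a concrete (hedged) picture of the cmem[0] entry: the hypothetical kernel below, compiled with --ptxas-options=-v, has its arguments counted in cmem[0]; the reported total is larger than the arguments alone because the toolchain reserves additional space in that bank, and the exact figure varies by architecture.

// On a 64-bit target the three pointers take 8 bytes each and the int
// takes 4, so the arguments contribute 28 bytes to the cmem[0] figure;
// the remainder of that figure is space reserved by the toolchain.
__global__ void saxpy_like(const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * x[i] + y[i];
}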
Related
I have problem with:
cudaStatus = cudaSetDevice(gpuNo);                  // select the GPU
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, gpuNo);
int ileWatkow = 512;                                // threads per block
int ileSM = prop.multiProcessorCount;               // number of SMs
int ileBlokow;                                      // resident blocks per SM (output)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ileBlokow, KernelTransformataPET_optymalizacja2, ileWatkow, 0);
On a 1080 Ti this returns ileBlokow = 1, but if I compile and run on a 2080 Ti it returns ileBlokow = 0.
Sometimes, if I copy and paste the kernel into a new project, the 1080 Ti also returns ileBlokow = 0. I do not know what is going on...
The kernel is fairly large and contains a lot of lines of code.
1> ptxas info : Compiling entry function '_Z36KernelTransformataPET_optymalizacja2P6float2S0_S0_PfS1_S1_S1_PiS2_S2_' for 'sm_61'
1> ptxas info : Function properties for _Z36KernelTransformataPET_optymalizacja2P6float2S0_S0_PfS1_S1_S1_PiS2_S2_
1> 408 bytes stack frame, 490 bytes spill stores, 904 bytes spill loads
1> ptxas info : Used 255 registers, 424 bytes cumulative stack size, 16384 bytes smem, 400 bytes cmem[0]
1> ptxas info : Function properties for _ZnwyPv
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Function properties for __brev
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
This is what the ptxas info returns.
Whether I compile with CC 6.1 for the 1080 or CC 7.5 for the 2080 makes no difference.
There is also no difference between CUDA Toolkit 9.1 and 10.1; both show the same problem.
Can someone help, or give me some tips or advice on where I should look for the problem?
platform: Windows + VS 2015
I resolved the problem.
There were two things:
I passed the iteration count to the kernel as an int, but copied it as a short.
This was a big problem on the newer GPU, where the upper bits were random... on the 1080 Ti they happened to be zero, so it worked normally.
-> This was my mistake, but I also think there is a problem with the host-to-device copy on cards newer than the 1080 Ti.
I reduced register usage from 255 to 230... and after recompiling with CUDA Toolkit 12 the register count dropped to 83 :) That amazed me :)
So my problem is resolved, thanks everyone for the help :) Special thanks to @Robert Crovella.
I compiled my program with "nvcc -ccbin=icpc source/* -Iinclude -arch=sm_35 --ptxas-options=-v". The output is below:
ptxas info : 0 bytes gmem
ptxas info : 0 bytes gmem
ptxas info : 450 bytes gmem
ptxas info : Compiling entry function '_Z21process_full_instancePiPViS1_S_' for 'sm_35'
ptxas info : Function properties for _Z21process_full_instancePiPViS1_S_
408 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 174 registers, 9748 bytes smem, 352 bytes cmem[0]
I think gmem refers to global memory, but why do the first and third lines show different values (0 vs 450) for gmem?
smem is shared memory; what about cmem?
Is the memory usage per block or per SM (streaming multiprocessor)? Blocks are dynamically assigned to SMs. Can we infer how many blocks will run concurrently on an SM?
My GPU is K20.
smem is shared memory, how about cmem?
cmem stands for constant memory
gmem stands for global memory
smem stands for shared memory
lmem stands for local memory
stack frame is part of local memory
spill loads and stores use part of the stack frame
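As a hedged illustration of the local-memory items (the kernel is made up): a per-thread array that the compiler cannot keep in registers is placed in local memory, and its size then shows up as the stack frame in the "Function properties" lines.

// A per-thread array indexed with a runtime value usually cannot be kept
// in registers, so it lands in local memory; ptxas -v then reports a
// non-zero "bytes stack frame" for this kernel (the exact size depends
// on the compiler version and target architecture).
__global__ void local_mem_demo(const int *idx, float *out, int n)
{
    float scratch[64];                    // likely placed in local memory
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 0.5f;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
        out[t] = scratch[idx[t] & 63];    // runtime index forces the use of lmem
}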
Is the memory usage per block or per SM (streaming multiprocessor)?
No, the number of registers is per thread while the shared memory is per block.
Can we infer how many blocks will concurrently run on a SM?
No. Since you can't determine the number of threads per block from this output, you cannot calculate the resources each block requires.
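Once you do pick a block size, the runtime can answer the occupancy question for you. A minimal sketch (my_kernel and the block size of 256 are placeholders, not anything from the question):

#include <cstdio>

__global__ void my_kernel(float *out) { out[threadIdx.x] = 0.0f; }

int main()
{
    int threadsPerBlock = 256;
    int blocksPerSM = 0;

    // How many blocks of 256 threads (with 0 bytes of dynamic shared
    // memory) can be resident on one SM, given my_kernel's register and
    // static shared memory footprint?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, my_kernel, threadsPerBlock, 0);

    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}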
I used --ptxas-options=-v while compiling my .cu code; it gave the following:
ptxas info: Used 74 registers, 124 bytes smem, 16 bytes cmem[1]
deviceQuery for my card returns the following:
rev: 2.0
name: tesla c2050
total shared memory per block: 49152
total reg. per block: 32768
Now, I input these data into the CUDA Occupancy Calculator as follows:
1.) 2.0
1.b) 49152
2.) threads per block: x
registers per thread: 74
shared memory per block (bytes): 124
I was varying x (threads per block) so that x*74 <= 32768; for example, I enter 128 (or 256) in place of x. Am I entering all the values the occupancy calculator requires correctly? Thanks.
--ptxas-options=--verbose (or -v) produces output of the format
ptxas : info : Compiling entry function '_Z13matrixMulCUDAILi16EEvPfS0_S0_ii' for 'sm_10'
ptxas : info : Used 15 registers, 2084 bytes smem, 12 bytes cmem[1]
The critical information is
1st line has the target architecture
2nd line has <Registers Per Thread>, <Static Shared Memory Per Block>, <Constant Memory Per Kernel>
When you fill in the occupancy calculator
Set field 1.) Select Compute Capability to the target architecture (1.0 for 'sm_10' in the above example)
Set field 2.) Registers Per Thread to <Registers Per Thread> from the ptxas output
Set field 2.) Shared Memory Per Block to <Static Shared Memory Per Block> + the DynamicSharedMemoryPerBlock passed as the 3rd parameter of <<<GridDim, BlockDim, DynamicSharedMemoryPerBlock, Stream>>>
The Occupancy Calculator Help tab contains additional information.
In your example I believe you are not setting field 1 correctly, as the Fermi architecture is limited to 63 registers per thread; sm_1x supports a limit of 124 registers per thread.
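As a rough back-of-the-envelope version of what the calculator does with the numbers from the question (this ignores register allocation granularity and the compute-capability mismatch noted above, so treat it only as a sketch):

#include <cstdio>

int main()
{
    // Numbers from the question: 74 registers/thread, 124 bytes of static
    // shared memory per block, and the deviceQuery limits for the card
    // (32768 registers and 49152 bytes of shared memory per block).
    int regsPerThread   = 74;
    int threadsPerBlock = 128;
    int smemPerBlock    = 124;

    int blocksByRegs = 32768 / (regsPerThread * threadsPerBlock); // 32768/9472 = 3
    int blocksBySmem = 49152 / smemPerBlock;                      // 396, clearly not the limiter

    printf("register-limited blocks: %d, shared-memory-limited blocks: %d\n",
           blocksByRegs, blocksBySmem);
    return 0;
}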
The kernel uses: (--ptxas-options=-v)
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launching with kernelA<<<20, 512>>>(parmA, paramB); runs fine.
Launching with kernelA<<<20, 513>>>(parmA, paramB); gives an out-of-resources error ("too many resources requested for launch").
The Fermi device properties: 48 KB of shared memory per SM, 64 KB of constant memory, 32K registers per SM, a maximum of 1024 threads per block, compute capability 2.1 (sm_21).
I'm using all of my shared memory space.
I should only run out of per-block register space at around 700 threads/block, yet the kernel will not launch if I ask for more than half the maximum threads per block. That may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the register count, I had commented out the printf's: register count = 45.
When it was actually running, it had the printf's compiled in: register count = 63, with plenty of spill "reg's".
I suspect each thread really has 64 registers, with only 63 available to the program.
64 registers * 512 threads = 32K, the maximum available to a single block.
So I suggest the number of "code" registers available to a block is cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of registers per thread to 63 (beyond that, they spill over to lmem). I suspect this limit of 63 is a hardware addressing limitation.
So it looks like I'm running out of register space.
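The same reasoning as a hedged sketch in code, querying the per-block register limit instead of hard-coding it (the 64-registers-per-thread allocation granularity is the assumption from above, not something the runtime or ptxas reports):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // ptxas reported 63 registers/thread; assume the hardware allocates
    // them in units of 64 per thread (an assumption, as discussed above).
    int regsPerThreadAllocated = 64;
    int maxThreadsByRegs = prop.regsPerBlock / regsPerThreadAllocated;

    // On this device: 32768 / 64 = 512 threads/block before registers run out.
    printf("regsPerBlock = %d -> at most %d threads/block for this kernel\n",
           prop.regsPerBlock, maxThreadsByRegs);
    return 0;
}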
I'm not finding much useful information on the ptxas info output produced by --ptxas-options=-v.
I found a 2008 NVCC pdf that has a small blurb, but no details.
1) What does 64 bytes cmem[0], 12 bytes cmem[16] mean? I gather it refers to constant memory. I don't use any constant memory in the code, so this must come from the compiler. (What goes into read-only memory?)
2) What does 49152+0 bytes smem mean? Yes, it is shared memory, but what do the two numbers mean?
3) Is there a doc that will help me with this? (What is it called?)
4) Where can I find a doc that will explain the *.ptx file? (I'd like to be able to read and understand the CUDA assembly code.)
cmem is discussed here. In your case it means 64 bytes are used to pass arguments to the kernel, and 12 bytes are occupied by compiler-generated constants.
In the case of smem, the first number is the amount of shared memory your code requests, and the second number (0) indicates how much is used for system purposes.
I don't know of any official information regarding the verbose ptxas output format. E.g. in the CUDA Occupancy Calculator they simply say to sum the two values for smem, without any explanation.
There are several PTX documents on the NVIDIA website. The most fundamental is PTX: Parallel Thread Execution ISA, Version 3.0.
Please see "Miscellaneous NVCC Usage".
They mention that the constant bank allocation is profile-specific.
In the PTX guide, they say that apart from the 64 KB of constant memory, there are 10 additional banks of constant memory. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters.
I guess that the profile given to nvcc takes care of which constants go into which bank. In any case, we don't need to worry as long as each constant memory bank cmem[n] stays below 64 KB, because each bank is 64 KB in size and common to all threads in the grid.
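For completeness, a hedged sketch of the usual way user data ends up in one of those banks: a __constant__ object (all such objects together are limited to 64 KB) filled from the host with cudaMemcpyToSymbol. Which cmem[n] index ptxas reports for it depends on the target architecture.

// This 4 KB table is counted by ptxas -v in one of the cmem[] banks;
// kernel arguments are still counted separately in cmem[0].
__constant__ float table[1024];

__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * table[i & 1023];
}

int main()
{
    float h_table[1024];
    for (int i = 0; i < 1024; ++i) h_table[i] = 1.0f / (i + 1);

    // Initialize the __constant__ symbol from the host before any launch.
    cudaMemcpyToSymbol(table, h_table, sizeof(h_table));

    // ... allocate in/out buffers, launch scale<<<grid, block>>>(...), etc.
    return 0;
}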