The kernel uses (per --ptxas-options=-v):
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launch with kernelA<<<20,512>>>(float paramA, int paramB); and it runs fine.
Launch with kernelA<<<20,513>>>(float paramA, int paramB); and I get the out-of-resources error ("too many resources requested for launch").
The Fermi device properties: 48 KB of shared memory per SM, 64 KB of constant memory, 32K registers per SM, 1024 maximum threads per block, compute capability 2.1 (sm_21).
I'm using all my shared mem space.
By my arithmetic I should run out of per-block register space at around 700 threads/block, yet the kernel will not launch if I ask for more than half of MAX_threads/block. That may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the reg count, I had commented out the printf's: reg count = 45.
When it was running, the printf's were compiled in: reg count = 63, with plenty of spill "registers".
I suspect each thread really has 64 registers, with only 63 available to the program.
64 registers * 512 threads = 32K, the maximum available to a single block.
So I suggest the number of "code" registers available to a block is cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel does not have access to all 32K registers.
The compiler currently limits the number of registers per thread to 63 (beyond that, they spill over to lmem). I suspect this 63 is a hardware addressing limitation.
So it looks like I'm running out of register space.
Related
This question is a continuation of Interpreting the verbose output of ptxas, part I.
When we compile a kernel .ptx file with ptxas -v, or compile it from a .cu file with --ptxas-options=-v, we get a few lines of output such as:
ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
(same example as in the linked-to question; but with name demangling)
This question regards the last line. A few more examples from other kernels:
ptxas info : Used 19 registers, 336 bytes cmem[0], 4 bytes cmem[2]
...
ptxas info : Used 19 registers, 336 bytes cmem[0]
...
ptxas info : Used 6 registers, 16 bytes smem, 328 bytes cmem[0]
How do we interpret the information on this line, other than the number of registers used? Specifically:
Is cmem short for constant memory?
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
smem probably stands for shared memory; is it only static shared memory?
Under which conditions does each kind of entry appear on this line?
Is cmem short for constant memory?
Yes
Why are there different categories of cmem, i.e. cmem[0], cmem[2], cmem[14]?
They represent different constant memory banks. cmem[0] is the reserved bank for kernel arguments and statically sized constant values.
smem probably stands for shared memory; is it only static shared memory?
It is, and it could not be otherwise: the amount of dynamic shared memory is supplied at launch time, so ptxas cannot know about it at compile time.
Under which conditions does each kind of entry appear on this line?
Mostly answered above.
Collected and reformatted:
Resources on the last ptxas info line:
registers - allocated in the register file of every SM (multiprocessor)
gmem - global memory
smem - static shared memory
cmem[N] - constant memory bank with index N
cmem[0] - bank reserved for kernel arguments and statically-sized constant values
cmem[2] - ???
cmem[4] - ???
cmem[14] - ???
Each of these categories will be shown if the kernel uses any such memory (Registers - probably always shown); thus it is no surprise all the examples show some cmem[0] usage.
You can read a bit more on the CUDA memory hierarchy in Section 2.3 of the Programming Guide and the links there. Also, there's this blog post about static vs dynamic shared memory.
I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
__global__ void fun1() {
    short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
    ....
    delete[] arr;
}
fun1<<<4, 64>>>();
// using global memory
__global__ void fun2(short *d_arr) {
    ...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel using 6 blocks I got the error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering, how many registers can I actually use? And how is that related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: in your code, fun1's first line is executed by all threads, hence each thread of each block allocates 10,000 16-bit values, that is 1,280,000 bytes per block. For 4 blocks, that makes 5,120,000 bytes; for 6, it makes 7,680,000 bytes, which for some reason seems to overflow the preallocated limit (the default limit is 8 MB - see Heap memory allocation). This may be why you get this illegal-access error (77).
Using new will make use of some preallocated global memory as malloc would, but not registers - maybe the code you provided is not exactly the one you run. If you want registers, you need to define the data in a fixed array:
__global__ void func1()
{
    short arr[100];
    ...
}
The compiler will then try to fit the array in registers. Note however that this register data is per thread, and the maximum number of 32-bit registers per thread is 255 on your device.
I compiled my program with "nvcc -ccbin=icpc source/* -Iinclude -arch=sm_35 --ptxas-options=-v". The output is below:
ptxas info : 0 bytes gmem
ptxas info : 0 bytes gmem
ptxas info : 450 bytes gmem
ptxas info : Compiling entry function '_Z21process_full_instancePiPViS1_S_' for 'sm_35'
ptxas info : Function properties for _Z21process_full_instancePiPViS1_S_
408 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 174 registers, 9748 bytes smem, 352 bytes cmem[0]
I think gmem refers to global memory, but why do the first and third lines have different values (0 vs 450) for gmem?
smem is shared memory, how about cmem?
Is the memory usage per block or per SM (streaming multiprocessor)? Blocks are dynamically assigned to SMs. Can we infer how many blocks will run concurrently on an SM?
My GPU is K20.
smem is shared memory, how about cmem?
cmem stands for constant memory
gmem stands for global memory
smem stands for shared memory
lmem stands for local memory
stack frame is part of local memory
spill loads and stores use part of the stack frame
Is the memory usage per block or per SM (streaming multiprocessor)?
Neither exclusively: the number of registers is per thread, while the shared memory is per block.
Can we infer how many blocks will concurrently run on a SM?
No. The ptxas output does not tell you the number of threads per block, so you cannot calculate the resources each block requires.
I used --ptxas-options=-v while compiling my .cu code; it gave the following:
ptxas info: Used 74 registers, 124 bytes smem, 16 bytes cmem[1]
devQuery for my card returns the following:
rev: 2.0
name: tesla c2050
total shared memory per block: 49152
total reg. per block: 32768
Now, I input these data into the CUDA Occupancy Calculator as follows:
1.) 2.0
1.b) 49152
2.) threads per block: x
registers per thread: 74
shared memory per block (bytes): 124
I was varying x (threads per block) so that x*74 <= 32768; for example, I entered 128 (or 256) in place of x. Am I entering all the values the occupancy calculator requires correctly? Thanks.
ptxas-options=--verbose (or -v) produces output of the format
ptxas : info : Compiling entry function '_Z13matrixMulCUDAILi16EEvPfS0_S0_ii' for 'sm_10'
ptxas : info : Used 15 registers, 2084 bytes smem, 12 bytes cmem[1]
The critical information is
1st line has the target architecture
2nd line has <Registers Per Thread>, <Static Shared Memory Per Block>, <Constant Memory Per Kernel>
When you fill in the occupancy calculator
Set field 1.) Select Compute Capability to 'sm_10' in the above example
Set field 2.) Registers Per Thread to 15 (the value from the ptxas output above)
Set field 2.) Shared Memory Per Block to 2084 + the DynamicSharedMemoryPerBlock passed as the 3rd parameter to <<<GridDim, BlockDim, DynamicSharedMemoryPerBlock, Stream>>>
The Occupancy Calculator Help tab contains additional information.
In your example I believe you are not correctly setting field 1: the Fermi architecture is limited to 63 registers per thread, whereas sm_1* supports a limit of 124 registers per thread, so a report of 74 registers means the kernel was not compiled for your compute capability 2.0 device.
I am writing some code in CUDA and am a little confused about the what is actually run parallel.
Say I am calling a kernel function like this: kernel_foo<<<A, B>>>. Now, as per my device query below, I can have a maximum of 512 threads per block. So am I guaranteed 512 computations per block every time I run kernel_foo<<<A, 512>>>? But it says here that one thread runs on one CUDA core, so does that mean I can have only 96 threads running concurrently at a time? (See device_query below.)
I also wanted to know about the blocks. Every time I call kernel_foo<<<A, 512>>>, how many computations are done in parallel, and how? I mean, is it done one block after the other, or are blocks parallelized too? If so, how many blocks can run 512 threads each in parallel? It says here that one block runs on one CUDA SM, so is it true that 12 blocks can run concurrently? And if so, how many threads can each block run concurrently (8, 96, or 512) when all 12 blocks are also running? (See device_query below.)
Another question: if A had a value of ~50, would it be better to launch the kernel as kernel_foo<<<A, 512>>> or kernel_foo<<<512, A>>>? Assume no thread synchronization is required.
Sorry, these might be basic questions, but it's kind of complicated... Possible duplicates:
Streaming multiprocessors, Blocks and Threads (CUDA)
How do CUDA blocks/warps/threads map onto CUDA cores?
Thanks
Here's my device_query:
Device 0: "Quadro FX 4600"
CUDA Driver Version / Runtime Version 4.2 / 4.2
CUDA Capability Major/Minor version number: 1.0
Total amount of global memory: 768 MBytes (804978688 bytes)
(12) Multiprocessors x ( 8) CUDA Cores/MP: 96 CUDA Cores
GPU Clock rate: 1200 MHz (1.20 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 384-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: No
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Check out this answer for some first pointers! The answer is a little out of date in that it is talking about older GPUs with compute capability 1.x, but that matches your GPU in any case. Newer GPUs (2.x and 3.x) have different parameters (number of cores per SM and so on), but once you understand the concept of threads and blocks and of oversubscribing to hide latencies the changes are easy to pick up.
Also, you could take this Udacity course or this Coursera course to get going.