Maximum number of threads for a CUDA kernel on Tesla M2050

I am testing what the maximum number of threads for a simple kernel is. I find that the total number of threads cannot exceed 4096. The code is as follows:
#include <stdio.h>

#define N 100

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    double *p;
    size_t size = N * sizeof(double);
    cudaMalloc(&p, size);    // dummy allocation, only there to block the host (see NB below)
    test<<<64,128>>>();
    //test<<<64,128>>>();
    cudaFree(p);             // blocks the host so the kernel output is shown (see NB below)
    return 0;
}
My test environment: CUDA 4.2.9 on Tesla M2050. The code is compiled with
nvcc -arch=sm_20 test.cu
While checking the output, I found that some combinations are missing. Running the command
./a.out|wc -l
I always got 4096. When I check the compute capability 2.0 specifications, I can only find that the maximum number of blocks for the x, y, z dimensions is (1024, 1024, 512) and that the maximum number of threads per block is 1024. The calls to the kernel (either <<<64,128>>> or <<<128,64>>>) are well within those limits. Any idea?
NB: The CUDA memory operations are there to block the host so that the output from the kernel will be shown.

You are abusing kernel printf, and using it to judge how many threads you can run is a completely nonsensical idea. The runtime has a limited buffer size for printf output, and you are simply overflowing it with output when you run enough threads. There is an API to query and set the printf buffer size, using cudaDeviceGetLimit and cudaDeviceSetLimit (thanks to Robert Crovella for the link to the printf documentation in comments).
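As a rough sketch of how that query might look (my illustration, not code from the answer; the 16 MB figure is just an arbitrary example value):
#include <cstdio>
#include <cuda_runtime.h>

int main(void){
    size_t fifoSize = 0;

    // Query the current printf FIFO size.
    cudaDeviceGetLimit(&fifoSize, cudaLimitPrintfFifoSize);
    printf("current printf FIFO: %zu bytes\n", fifoSize);

    // Enlarge the buffer before launching a kernel that prints a lot.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
    return 0;
}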
You can find the maximum number of threads a given kernel can run by looking here in the documentation.

Related

CUDA C: host doesn't send all the threads at once

I'm trying to learn CUDA, so I wrote some silly code (see below) in order to understand how CUDA works. I set the number of blocks to 1024, but when I run my program it seems that the host doesn't send all 1024 threads to the GPU at once. Instead, the GPU processes roughly 350 threads first, then another 350, and so on. Why? Thanks in advance!
PS1: My computer has Ubuntu installed and an NVIDIA GeForce GTX 1660 SUPER
PS2: In my program, each block goes to sleep for a few seconds and nothing else. Also the host creates an array called "H_Arr" and sends it to the GPU, although the device does not use this array. Of course, the latter doesn't make much sense, but I'm just experimenting to understand how CUDA works.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <limits>
#include <iostream>
#include <fstream>
#include <unistd.h>

using namespace std;

__device__ void funcDev(int tid);

int NB = 1024;
int NT = 1;

__global__ void funcGlob(int NB, int NT){
    int tid = blockIdx.x * NT + threadIdx.x;
#if __CUDA_ARCH__ >= 200
    printf("start block %d \n", tid);
    if (tid < NB * NT){
        funcDev(tid);
    }
    printf("end block %d\n", tid);
#endif
}

// Busy-wait on the device clock so each block stays resident for a while.
__device__ void funcDev(int tid){
    clock_t clock_count;
    clock_count = 10000000000;
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
}

int main(void)
{
    int i;
    ushort *D_Arr, *H_Arr;
    H_Arr = new ushort[NB * NT + 1];
    for (i = 1; i <= NB * NT; i++){ H_Arr[i] = 1; }   // keep the index inside the NB*NT+1 elements
    cudaMalloc((void**)&D_Arr, (NB * NT + 1) * sizeof(ushort));
    cudaMemcpy(D_Arr, H_Arr, (NB * NT + 1) * sizeof(ushort), cudaMemcpyHostToDevice);
    funcGlob<<<NB,NT>>>(NB, NT);
    cudaFree(D_Arr);
    delete [] H_Arr;
    return 0;
}
I wrote a program in CUDA C. I set the number of blocks to be 1024. If I understood correctly, in theory 1024 processes should run simultaneously. However, this is not what happens.
GTX 1660 Super seems to have 22 SMs.
It is a compute capability 7.5 device. If you run the deviceQuery CUDA sample code on your GPU, you can confirm that (the compute capability and the number of SMs, called "Multiprocessors"), and also discover that your GPU has a limit of 16 blocks resident per SM at any moment.
So I haven't studied your code at all, really, but since you are launching 1024 blocks (of one thread each), it would be my expectation that the block scheduler would deposit an initial wave of 16x22=352 blocks on the SMs, and then it would wait for some of those blocks to finish/retire before it would be able to deposit any more.
So an "initial wave" of 352 scheduled blocks sounds just right to me.
Throughout your posting, you refer primarily to threads. While it might be correct to say "350 threads are running" (since you are launching one thread per block) it isn't very instructive to think of it that way. The GPU work distributor schedules blocks, not individual threads, and the relevant limit here is the blocks per SM limit.
If you don't really understand the distinction between blocks and threads, or other basic CUDA concepts, you can find many questions here on the SO cuda tag about these concepts, and this online training series, particularly the first 3-4 units, will also be illuminating.
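For reference, a minimal sketch of that kind of query through the runtime API (my illustration, not code from the answer; the maxBlocksPerMultiProcessor field only exists in CUDA 11.0 and newer, which is why it is guarded):
#include <cstdio>
#include <cuda_runtime.h>

int main(void){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors   : %d\n", prop.multiProcessorCount);
#if CUDART_VERSION >= 11000
    // Field added in CUDA 11.0; on older toolkits consult the occupancy
    // tables in the programming guide instead.
    printf("max blocks per SM : %d\n", prop.maxBlocksPerMultiProcessor);
#endif
    return 0;
}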

Why data migrate from Host to Device when CPU try to read a managed memory initialized by GPU?

In the following test code, we initialize data on the GPU and then access the data from the CPU. I have 2 questions about the profiling result from nvprof.
Why is there one data migration from Host To Device? In my understanding it should be Device to Host.
Why is the H->D count 2? I think it should be 1, because the data is in one page.
Thanks in advance!
My environment:
Driver Version: 418.87.00
CUDA Version: 10.1
ubuntu 18.04
#include <cuda.h>
#include <iostream>

using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0); //GPU page fault
    cout << " cpu read " << data[0] << endl;
    cudaFree(data);
    return 0;
}
==28762== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 32.000KB 4.0000KB 60.000KB 64.00000KB 11.74400us Host To Device
1 - - - - 362.9440us Gpu page fault groups
Total CPU Page faults: 1
Why is there one data migration from Host To Device? In my understanding it should be Device to Host.
You are thrashing data between host and device. Because the GPU kernel launch is asynchronous, your host code, issued after the kernel launch, is actually accessing the data before the GPU code does.
Put a cudaDeviceSynchronize() after your kernel call, so that the CPU code does not attempt to read the data until after the kernel is complete.
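For reference, a minimal sketch of the test code with that synchronization added (my rewrite of the snippet above, not code from the original answer):
#include <cuda.h>
#include <iostream>
using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0);
    cudaDeviceSynchronize();  // make sure the GPU write has completed before the CPU reads
    cout << " cpu read " << data[0] << endl;
    cudaFree(data);
    return 0;
}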
I don't have an answer for your other question. The profiler often cannot perfectly resolve very small amounts of activity. It does not necessarily instrument all SMs during a profiling run, and some of its results may be scaled for the size of a GPC, a TPC, and/or the entire GPU. That would be my guess, although it is just speculation. I generally don't expect perfectly accurate results from the profiler when profiling tiny bits of code that do almost nothing.

Maximum number of resident blocks per SM?

It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, `cudaGetDeviceProperties`), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each block takes 1 second to process (via a "sleep" routine), so if I have configured the kernel correctly, the code should take 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 blocks per SM is the maximum allowed.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in `cudaGetDeviceProperties`. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1
#include <stdio.h>
#include <sys/time.h>

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1e-6;
}

#define CLOCK_RATE 1328500 /* Modify from below */

__device__ void sleep(float t) {
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0) / (CLOCK_RATE * 1000.0f) < t)
        t1 = clock64();
}

__global__ void mykernel() {
    sleep(1.0);
}

int main(int argc, char* argv[]) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int mp = prop.multiProcessorCount;
    //clock_t clock_rate = prop.clockRate;

    int num_blocks = atoi(argv[1]);

    dim3 block(1);
    dim3 grid(num_blocks); /* N blocks */

    double start = cpuSecond();
    mykernel<<<grid,block>>>();
    cudaDeviceSynchronize();
    double etime = cpuSecond() - start;

    printf("mp %10d\n",mp);
    printf("blocks/SM %10.2f\n",num_blocks/((double)mp));
    printf("time %10.2f\n",etime);

    cudaDeviceReset();
}
Results :
% srun -p gpuq sm_short 1792
mp 56
blocks/SM 32.00
time 1.16
% srun -p gpuq sm_short 1793
mp 56
blocks/SM 32.02
time 2.16
% srun -p gpuq sm_short 3584
mp 56
blocks/SM 64.00
time 2.16
% srun -p gpuq sm_short 3585
mp 56
blocks/SM 64.02
time 3.16
Yes, there is a limit on the number of blocks per SM. The maximum number of blocks that can be contained in an SM refers to the maximum number of blocks that can be active at the same time. Blocks can be organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension, but the SM of your GPU will be able to accommodate only a certain number of blocks at once. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU allows a maximum number of blocks per SM, regardless of the number of threads each block contains and the amount of resources it uses. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM, while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the best number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads, and each thread uses a certain number of registers: the more registers it uses, the greater the number of resources used by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the amount of resources the block needs to be allocated. Once a certain value is exceeded, the resources needed for a block become so large that the SM will not be able to allocate as many blocks as MAX_BLOCKS allows: this means that the amount of resources needed by each block limits the maximum number of active blocks per SM.
How do I find these limits?
NVIDIA thought about that too. Its site provides the CUDA Occupancy Calculator spreadsheet, with which you can look up the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and useful information about the number of active blocks.
The first tab of the linked file allows you to calculate the actual SM usage based on the resources used. If you want to know how many registers per thread your kernel uses, add the -Xptxas -v option so that the compiler reports the register usage when it builds the device code.
In the last tab of the file you will find the hardware limits grouped by compute capability.
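If you prefer to query this programmatically rather than from the spreadsheet, the runtime occupancy API reports how many blocks of a given kernel can be resident on one SM, taking registers and shared memory into account. A minimal sketch (my illustration; the kernel here is an empty stand-in, and block size 1 with no dynamic shared memory just mirrors the question's configuration):
#include <stdio.h>

__global__ void mykernel() { }   // empty stand-in; occupancy is still capped by the hardware limit

int main() {
    int maxActiveBlocksPerSM = 0;

    // How many copies of mykernel fit on one SM with 1 thread per block
    // and no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocksPerSM,
                                                  mykernel,
                                                  1 /* block size */,
                                                  0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("blocks per SM : %d\n", maxActiveBlocksPerSM);
    printf("whole device  : %d\n", maxActiveBlocksPerSM * prop.multiProcessorCount);
    return 0;
}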

CUDA summation reduction puzzle

Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. The keyword extern is used in C when a declaration is made but the allocation takes place "elsewhere" (e.g., in another C file or compilation unit). What is the relevance of extern here? Why don't we use:
__shared__ float temp[N/2];
for instance? Or why don't we declare temp to be a global variable, e.g.
#define N 1024
__shared__ float temp[N/2];
__global__ void sum(float *sum, float *data){ ... }
int main(){
...
sum<<<M,L>>>(sum, data);
}
I have yet another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
Note: You can find information about my devices here.
The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). Declaring shared memory using extern tells the compiler that the shared memory size will be determined at kernel launch, passed in bytes as an argument to the <<< >>> syntax (or equivalently via an API function), something like:
sum<<< gridsize, blocksize, sharedmem_size >>>(....);
The answer to the second question is that you normally launch enough blocks to completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels will accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two, which usually gives you 32, 64, 128, 256, or 512 (or 1024 if you have a Fermi or Kepler GPU). It is a very finite search space; just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
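To make the extern __shared__ pattern concrete, here is a minimal sketch of a block-sum kernel with dynamically sized shared memory (my own illustration of the idea, not code from either tutorial; the names blockSum and partial and the sizes are made up):
#include <stdio.h>

// Each block reduces blockDim.x values of in[] into one partial sum.
// The shared array size is not known at compile time; it is supplied
// as the third <<< >>> argument at launch.
__global__ void blockSum(const float *in, float *partial, int n)
{
    extern __shared__ float temp[];          // sized at launch time

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    temp[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory; blockDim.x must be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            temp[tid] += temp[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = temp[0];
}

int main()
{
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *in, *partial;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;

    // Third launch parameter = shared memory bytes per block.
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(in, partial, n);
    cudaDeviceSynchronize();

    float sum = 0.0f;
    for (int b = 0; b < blocks; b++) sum += partial[b];
    printf("sum = %f (expected %d)\n", sum, n);

    cudaFree(in);
    cudaFree(partial);
    return 0;
}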

thrust functor: "too many resources requested for launch"

I'm trying to implement something like this in CUDA:
for each element
    p = { p  if p >= floor
        { z  if p <  floor
Where floor and z are constants configured at the start of the test.
I have attempted to implement it like so, but I get the error "too many resources requested for launch"
A functor:
struct floor_functor : thrust::unary_function<float, float>
{
    const float floorLevel, floorVal;

    floor_functor(float _floorLevel, float _floorVal)
        : floorLevel(_floorLevel), floorVal(_floorVal) {}

    __host__ __device__
    float operator()(float& x) const
    {
        if (x >= floorLevel)
            return x;
        else
            return floorVal;
    }
};
Used by a transform:
thrust::transform(input->begin(), input->end(), output.begin(), floor_functor(floorLevel, floorVal));
If I remove one of the members of my functor, say floorVal, and use a functor with only one member variable, it works fine.
Does anyone know why this might be, and how I could fix it?
Additional info:
My array is 786432 elements long.
My GPU is a GeForce GTX590
I am building with the command:
`nvcc -c -g -arch sm_11 -Xcompiler -fPIC -Xcompiler -Wall -DTHRUST_DEBUG -I <my_include_dir> -o <my_output> <my_source>`
My CUDA version is 4.0:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:45_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
And my maximum number of threads per block is 1024 (reported by deviceQuery):
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
UPDATE:
I have stumbled upon a fix for my problem, but do not understand it. If I rename my functor from "floor_functor" to basically anything else, it works! I have no idea why this is the case, and would be interested to hear anyone's ideas about this.
For an easier CUDA implementation, you could do this with ArrayFire in one line of code:
p(p < floor) = z;
Just declare your variables as af::array's.
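A slightly fuller sketch of that usage might look like the following (my illustration; af::randu, p.eval(), and the floorLevel/z values are just example choices, and the array length matches the question):
#include <arrayfire.h>

int main()
{
    const float floorLevel = 0.5f;   // example values, not from the question
    const float z          = 0.0f;

    af::array p = af::randu(786432); // same length as the question's array

    // One-line conditional replacement: every element below the floor
    // becomes z, everything else is left untouched.
    p(p < floorLevel) = z;

    p.eval();
    return 0;
}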
Good luck!
Disclaimer: I work on all sorts of CUDA projects, including ArrayFire.