CUDA C: host doesn't send all the threads at once

I'm trying to learn CUDA, so I wrote some toy code (see below) to understand how it works. I set the number of blocks to 1024, but when I run my program the host doesn't seem to send all 1024 threads to the GPU at once. Instead, the GPU processes roughly 350 threads first, then another 350, and so on. Why? Thanks in advance!
PS1: My computer has Ubuntu installed and an NVIDIA GeForce GTX 1660 SUPER
PS2: In my program, each block just busy-waits for a few seconds and does nothing else. The host also creates an array called "H_Arr" and sends it to the GPU, although the device never uses it. Of course, the latter doesn't make much sense; I'm just experimenting to understand how CUDA works.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <limits>
#include <iostream>
#include <fstream>
#include <unistd.h>
using namespace std;

__device__ void funcDev(int tid);

int NB = 1024;
int NT = 1;

__global__
void funcGlob(int NB, int NT){
    int tid = blockIdx.x * NT + threadIdx.x;
#if __CUDA_ARCH__ >= 200
    printf("start block %d\n", tid);
    if (tid < NB * NT){
        funcDev(tid);
    }
    printf("end block %d\n", tid);
#endif
}

__device__ void funcDev(int tid){
    // Busy-wait for a fixed number of cycles. clock64() is used because
    // 10^10 does not fit in the 32-bit value returned by clock().
    long long clock_count = 10000000000LL;
    long long start_clock = clock64();
    long long clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock64() - start_clock;
    }
}

int main(void)
{
    int i;
    ushort *D_Arr, *H_Arr;
    H_Arr = new ushort[NB*NT+1];
    for (i = 0; i < NB*NT+1; i++){ H_Arr[i] = 1; }  // fill all NB*NT+1 elements, staying in bounds
    cudaMalloc((void**)&D_Arr, (NB*NT+1)*sizeof(ushort));
    cudaMemcpy(D_Arr, H_Arr, (NB*NT+1)*sizeof(ushort), cudaMemcpyHostToDevice);
    funcGlob<<<NB,NT>>>(NB, NT);
    cudaDeviceSynchronize();  // wait for the kernel before freeing device memory
    cudaFree(D_Arr);
    delete [] H_Arr;
    return 0;
}
I wrote a program in CUDA C and set the number of blocks to 1024. If I understood correctly, in theory all 1024 blocks should run simultaneously, but that is not what happens.

GTX 1660 Super seems to have 22 SMs.
It is a compute capability 7.5 device. If you run the deviceQuery CUDA sample code on your GPU, you can confirm that (the compute capability and the number of SMs, called "Multiprocessors"), and also discover that your GPU has a limit of 16 blocks resident per SM at any moment.
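If you prefer to query this programmatically, here is a minimal sketch using the runtime API (the maxBlocksPerMultiProcessor field assumes CUDA 11 or newer; on older toolkits, consult the occupancy tables in the programming guide):
#include <cstdio>
#include <cuda_runtime.h>

int main(void){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors:    %d\n", prop.multiProcessorCount);
    printf("max blocks per SM:  %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}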
So I haven't studied your code at all, really, but since you are launching 1024 blocks (of one thread each), it would be my expectation that the block scheduler would deposit an initial wave of 16x22=352 blocks on the SMs, and then it would wait for some of those blocks to finish/retire before it would be able to deposit any more.
So an "initial wave" of 352 scheduled blocks sounds just right to me.
Throughout your posting, you refer primarily to threads. While it might be correct to say "350 threads are running" (since you are launching one thread per block), it isn't very instructive to think of it that way. The GPU work distributor schedules blocks, not individual threads, and the relevant limit here is the blocks-per-SM limit.
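You can also ask the occupancy API how many blocks of this particular kernel can be resident per SM; a minimal sketch to drop into the host code above, before the launch:
int blocksPerSM = 0;
// For funcGlob launched with 1 thread per block; on a cc 7.5 part this
// should report the 16-blocks-per-SM limit.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, funcGlob, 1, 0);
printf("resident blocks per SM: %d\n", blocksPerSM);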
If you don't really understand the distinction between blocks and threads, or other basic CUDA concepts, you can find many questions here on the SO cuda tag about these concepts, and this online training series, particularly the first 3-4 units, will also be illuminating.

Related

Why data migrate from Host to Device when CPU try to read a managed memory initialized by GPU?

In the following test code, we initialize data on the GPU and then access it from the CPU. I have two questions about the profiling result from nvprof.
Why is there one data migration from Host to Device? In my understanding it should be Device to Host.
Why is the H->D count 2? I think it should be 1, because the data fits in one page.
Thanks in advance!
My environment:
Driver Version: 418.87.00
CUDA Version: 10.1
ubuntu 18.04
#include <cuda.h>
#include <iostream>
using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0); // GPU page fault
    cout << " cpu read " << data[0] << endl;
    cudaFree(data);
    return 0;
}
==28762== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       2  32.000KB  4.0000KB  60.000KB  64.00000KB  11.74400us  Host To Device
       1         -         -         -           -  362.9440us  Gpu page fault groups
Total CPU Page faults: 1
Why is there one data migration from Host to Device? In my understanding it should be Device to Host.
You are thrashing data between host and device. Because the GPU kernel launch is asynchronous, your host code, issued after the kernel launch, is actually accessing the data before the GPU code does. The CPU touch places the page on the host, and the kernel's subsequent access migrates it back to the device, which is the Host To Device transfer you see.
Put a cudaDeviceSynchronize() after your kernel call, so that the CPU code does not attempt to read the data until after the kernel is complete.
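A minimal sketch of the fix, leaving the rest of the program unchanged:
setVal<<<1,1>>>(data, 0);  // asynchronous launch
cudaDeviceSynchronize();   // block until the kernel has written data[0]
cout << " cpu read " << data[0] << endl;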
I don't have an answer for your other question. The profiler is often not able to perfectly resolve very small amounts of activity. It does not necessarily instrument all SMs during a profiling run, and some of its results may be scaled for the size of a GPC, a TPC, and/or the entire GPU. That would be my guess, although it is just speculation. I generally don't expect perfectly accurate results from the profiler when profiling tiny bits of code that do almost nothing.

Getting an unexpected value in global device memory when multiple threads write to it

Here is a problem with CUDA threads and memory management: the code below returns a single thread's result, "100", but I would expect the 9 threads' result, "900".
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>

__global__
void test(int in1, int *ptr){
    int e = 0;
    for (int i = 0; i < 100; i++){
        e++;
    }
    *ptr += e;
}

int main(int argc, char **argv)
{
    int devID = 0;
    cudaError_t error;
    error = cudaGetDevice(&devID);
    if (error == cudaSuccess)
    {
        printf("GPU Device fine\n");
    }
    else {
        printf("GPU Device problem, aborting");
        abort();
    }
    int *d_A;
    cudaMalloc(&d_A, sizeof(int));
    int res = 0;
    //cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);
    test<<<3, 3>>>(0, d_A);
    cudaDeviceSynchronize();
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    printf("res is : %i", res);
    Sleep(10000);  // Windows API (requires <windows.h>); keeps the console open
    return 0;
}
It returns:
GPU Device fine
res is : 100
I would expect it to return a higher number because of the 3x3 (blocks, threads) launch, instead of just one thread's result.
What is done wrong, and where do the numbers get lost?
You can't write your sum to global memory in this way.
You have to use an atomic function to ensure that the store is atomic.
In general, when multiple device threads write to the same location in global memory, you have to use either atomic functions:
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
or thread synchronization:
Throughput for __syncthreads() is 16 operations per clock cycle for devices of compute capability 2.x, 128 operations per clock cycle for devices of compute capability 3.x, 32 operations per clock cycle for devices of compute capability 6.0, and 64 operations per clock cycle for devices of compute capability 5.x, 6.1 and 6.2.
Note that __syncthreads() can impact performance by forcing the multiprocessor to idle, as detailed in Device Memory Accesses.
(adapting another answer of mine:)
You are experiencing the effects of the increment operator not being atomic (C++-oriented description of what that means). What's happening, chronologically, is the following sequence of events (not necessarily in this thread order):
...(other work)...
block 0 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 1 issues a LOAD instruction with address ptr into register r
...
block 2 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 0 completes the LOAD, now having 0 in register r
...
block 2 thread 2 completes the LOAD, now having 0 in register r
block 0 thread 0 adds 100 to r
...
block 2 thread 2 adds 100 to r
block 0 thread 0 issues a STORE instruction from register r to address ptr
...
block 2 thread 2 issues a STORE instruction from register r to address ptr
Thus every thread sees the initial value of *ptr, which is 0; adds 100; and stores 0+100=100 back. The order of the stores doesn't matter here, as long as all of the threads store the same incorrect value.
What you need to do is one of the following:
Use atomic operations - the least amount of modification to your code (see the sketch after this list), but very inefficient, since it serializes your work to a great extent, or
Use a block-level reduction primitive. This will ensure some partial ordering of the computational activity vis-a-vis shared block memory - using __syncthreads() or other mechanisms. Thus it might first have each thread add its own two elements up; then synchronize block threads; then have fewer threads add up pairs of pair-sums, and so on. Here's an nVIDIA blog post on implementing fast reductions on their more modern GPU architectures.
Use block-local or warp-local and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them only after having done a lot of work on them.
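For the first option, a minimal sketch of the kernel above with the non-atomic update replaced by atomicAdd:
__global__
void test(int in1, int *ptr){
    int e = 0;
    for (int i = 0; i < 100; i++){
        e++;
    }
    atomicAdd(ptr, e);  // atomic read-modify-write, so no thread's update is lost
}
With the <<<3, 3>>> launch this prints 900, since all nine updates now land.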

How to avoid memcpy if number of blocks depends on device variable?

I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?
I see two possibilities:
Use dynamic parallelism (if feasible). Rather than copying the result back to determine the execution parameters of the next launch, just have the device perform the next launch itself.
Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer.
Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA compatible device ever made.
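For comparison, a minimal zero-copy sketch (illustrative only; the pointer names and the constant 42 standing in for ceil(X / 1024) are mine, not from the question):
#include <cstdio>
#include <cuda_runtime.h>

// kernel1 computes the block count for kernel2 and stores it through a
// mapped (zero-copy) pointer, so no explicit cudaMemcpy is needed.
__global__ void kernel1(int *blocks){
    if (threadIdx.x == 0){
        *blocks = 42;  // stands in for the computed ceil(X / 1024)
    }
}

__global__ void kernel2(){
    if (threadIdx.x == 0){
        printf("block: %d\n", blockIdx.x);
    }
}

int main(){
    int *h_blocks, *d_blocks;
    // Page-locked host memory mapped into the device address space.
    cudaHostAlloc((void**)&h_blocks, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_blocks, h_blocks, 0);
    kernel1<<<1, 1024>>>(d_blocks);
    cudaDeviceSynchronize();         // make the device write visible to the host
    kernel2<<<*h_blocks, 1024>>>();  // read the count straight from host memory
    cudaDeviceSynchronize();
    cudaFreeHost(h_blocks);
    return 0;
}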
An example of using managed memory, as outlined by @talonmies, allowing kernel1 to determine the number of blocks for kernel2 without an explicit memcpy:
#include <stdio.h>
#include <cuda.h>
using namespace std;

__device__ __managed__ int kernel2_blocks;

__global__ void kernel1() {
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;
    }
}

__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main() {
    kernel1<<<1, 1024>>>();
    cudaDeviceSynchronize();
    kernel2<<<kernel2_blocks, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}

Maximum number of threads for a CUDA kernel on Tesla M2050

I am testing the maximum number of threads for a simple kernel. I find that the total number of threads cannot exceed 4096. The code is as follows:
#include <stdio.h>
#define N 100

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    double *p;
    size_t size = N * sizeof(double);
    cudaMalloc(&p, size);
    test<<<64,128>>>();
    //test<<<64,128>>>();
    cudaFree(p);
    return 0;
}
My test environment: CUDA 4.2.9 on Tesla M2050. The code is compiled with
nvcc -arch=sm_20 test.cu
While checking the output, I found that some combinations are missing. Running the command
./a.out|wc -l
I always get 4096. When I check cc 2.0, I can only find that the maximum number of blocks for the x, y, z dimensions is (1024, 1024, 512), and the maximum number of threads per block is 1024. The kernel calls (either <<<64,128>>> or <<<128,64>>>) are well within these limits. Any idea?
NB: The CUDA memory operations are there to block host execution so that the output from the kernel is flushed and shown.
You are abusing kernel printf, and using it to judge how many threads you can run is a completely nonsensical idea. The runtime has a limited buffer size for printf output, and you are simply overflowing it with output when you run enough threads. There is an API to query and set the printf buffer size, using cudaDeviceGetLimit and cudaDeviceSetLimit (thanks to Robert Crovella for the link to the printf documentation in comments).
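If you do want all 64*128 lines of output, you can grow the FIFO before the launch; a minimal sketch (the 32x factor is an arbitrary choice):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    size_t sz;
    cudaDeviceGetLimit(&sz, cudaLimitPrintfFifoSize);      // query the current buffer size
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 32 * sz);  // enlarge before launching
    test<<<64,128>>>();
    cudaDeviceSynchronize();  // flush the printf buffer
    return 0;
}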
You can find the maximum number of threads a given kernel can run by looking here in the documentation.

CUDA performance: branching and shared memory

I wish to ask two questions on performance. I have been unable to create simple code to illustrate.
Question 1: How expensive is non-divergent branching? In my code it seems that it even goes up to more than the equivalent of 4 non-fma FLOPS. Note that I am speaking of the BRA PTX instruction, where the predicate has already been calculated.
Question 2: I have been reading a lot about the performance of shared memory, and some articles (such as one in Dr. Dobb's) even state that it can be as fast as registers (provided it is accessed well). In my code, all threads within the warps within the block access the same shared variable. I believe in this case shared memory is accessed in broadcast mode, isn't it? Should it reach the performance of registers in this way? Are there any special things that should be considered to make it work?
EDIT: I have been able to construct some simple code that gives more insight into my query.
Here it is:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;

__global__ void test()
{
    __shared__ int t[1024];
    int v = t[0];
    bool b = (v == -1);
    bool c = (v == -2);
    int myValue = 0;
    for (int i = 0; i < 800; i++)
    {
#if 1
        v = i;
#else
        v = t[i];
#endif
#if 0
        if (b) {
            printf("abs");
        }
#endif
        if (c)
        {
            printf("IT HAPPENED");
            v = 8;
        }
        myValue += v;
    }
    if (myValue == 1000)
        printf("IT HAPPENED");
}

int main(int argc, char *argv[])
{
    cudaEvent_t event_start, event_stop;
    float timestamp;
    float4 *data;
    // Initialise
    cudaDeviceReset();
    cudaSetDevice(0);
    dim3 threadsPerBlock;
    dim3 blocks;
    threadsPerBlock.x = 32;
    threadsPerBlock.y = 32;
    threadsPerBlock.z = 1;
    blocks.x = 1;
    blocks.y = 1000;
    blocks.z = 1;
    cudaEventCreate(&event_start);
    cudaEventCreate(&event_stop);
    cudaEventRecord(event_start, 0);
    test<<<blocks, threadsPerBlock, 0>>>();
    cudaEventRecord(event_stop, 0);
    cudaEventSynchronize(event_stop);
    cudaEventElapsedTime(&timestamp, event_start, event_stop);
    printf("Calculated in %f", timestamp);
}
I am running this code on a GTX 680.
The results are as follows:
Run as-is, it takes 5.44 ms.
If I change the first #if conditional to 0 (which enables reading from shared memory), it takes 6.02 ms. Not much more, but still not enough for me.
If I enable the second #if conditional (which inserts a branch that will never evaluate to true), it runs in 9.647040 ms. The performance reduction is very large. What is the cause, and what can be done?
I have also changed the code slightly to make further checks with shared memory.
Instead of
__shared__ int t[1024]
I used
__shared__ int2 t[1024]
and wherever I access t[], I just access t[].x. I got a further drop in performance, to 10 ms (another 400 microseconds). Why should this happen?
Regards
Daniel
Have you determined whether your kernel is compute bound or memory bound? Your first question would be most relevant if your kernel is compute bound, while the second would be most relevant if it is memory bound. You might be getting results that are confusing or hard to reproduce if you're assuming one while it is actually the other.
(1) I don't think the cost of a branch has been published. You may have to determine it experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
(2) You're right that when you access a single 32-bit value in shared memory from all the threads in a warp, the value is broadcast. But my guess would be that accessing a single value from all threads would have the same cost as accessing any combination of values as long as you don't incur any bank conflicts. So you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published. It is short enough that it is normally easily hidden.
You need to keep in mind that the compiler is highly optimizing. So if you comment out the branch, you also eliminate the evaluation of the conditional, whether or not you leave it in the source code. Thus a difference of four instructions seems very plausible for your example:
load -1,
compare v to it (and store result in b),
test b,
branch,
although I have not compiled your example and looked at the code (which is what you should do: run cuobjdump -sass on your binaries and look at the actual differences in machine code).
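For example (assuming the source file is test.cu and an sm_30 target for the GTX 680):
nvcc -arch=sm_30 -cubin -o test.cubin test.cu
cuobjdump -sass test.cubin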
Using only the .x component of an int2 changes the layout in shared memory so that you go from bank-conflict-free access to a 2-way bank conflict, which causes the slight further slowdown in your example. IIRC the latency of a shared memory access is on the order of 30 cycles, which is usually easily hidden by other threads (as Roger has already mentioned).
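To illustrate the layout point (an illustrative sketch with a per-thread access pattern, not code from the question): shared memory has 32 banks of 32-bit words, so reading only the .x component means consecutive threads touch every other word, and threads 16 apart in a warp collide on the same bank.
__shared__ int  t1[1024];  // t1[tid]  : thread tid reads word tid;
                           //            32 consecutive words fall in 32 distinct banks
__shared__ int2 t2[1024];  // t2[tid].x: thread tid reads word 2*tid; threads tid and
                           //            tid+16 map to the same bank -> 2-way conflict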