Cooperative groups in CUDA - cuda

since the CUDA 9 release apparently it is possible to group different threads and blocks into the same group so you can manage them together. That`s very useful for me because I need to launch a kernel with several blocks and wait until all of them are synchronized (cudaThreadSynchronize() is not worthy for me because after the threads are synchronized I have to continue working in my kernel).
What I have thought is to include these blocks of threads into the same group and wait until all of them are synchronized, as the examples of Nvdia main page suggest.
They do something like this:
__device__ int reduce_sum(thread_group g, int *temp, int val)
{
int lane = g.thread_rank();
// Each iteration halves the number of active threads
// Each thread adds its partial sum[i] to sum[lane+i]
for (int i = g.size() / 2; i > 0; i /= 2)
{
temp[lane] = val;
g.sync(); // wait for all threads to store
if(lane<i) val += temp[lane + i];
g.sync(); // wait for all threads to load
}
My problem is how to group these blocks into the g group.
This is how I originally launched my kernel:
asap << <5, 1000 >> > (cuda_E2, cuda_A2, cuda_temp, Nb, *binM, Nspb);
Any time that I try to use thread_group the compiler says that it is undefied. I'm using the cooperative_groups.h header.
Does anyone know how to deal with this? Thanks in advance.

Quote from the documentation:
Cooperative Groups requires CUDA 9.0 or later. To use Cooperative
Groups, include the header file:
#include <cooperative_groups.h>
and use the Cooperative Groups namespace:
using namespace cooperative_groups;
Then code containing any
intra-block Cooperative Groups functionality can be compiled in the
normal way using nvcc.
The namespace is what you are missing.

Related

how to sync threads in this cuda example

I have the following rough code outline:
run a loop, millions of times
in that loop, compute values 'I's - see example of such functions below
After all 'I's have been computed, compute other values 'V's
repeat the loop
Each computation of an I or V could involve up to 20ish mathematical operations, (e.g. I1 = A + B/C * D + 1/exp(V1) - E + F + V2 etc).
There are roughly:
50 'I's
10 'V's
10 values in each I and V, i.e. they are vectors of length 10
At first I tried running a simple loop in C, with kernel calls for each time step but this was really slow. It seems like I can get the code to run faster if the main loop is in a kernel that calls other kernels. However, I'm worried about kernel call overhead (maybe I shouldn't be) so I came up with something like the following, where each I and V loop independently, with syncing between the kernels as necessary.
For reference, the variables below are hardcoded as __device__ values, but eventually I will pass some values into specific kernels to make the system interesting.
__global__ void compute_IL1()
{
int id = threadIdx.x;
//n_t = 1e6;
for (int i = 0; i < n_t; i++){
IL1[id] = gl_1*(V1[id] - El_1);
//atomic, sync, event????,
}
}
__global__ void compute_IK1()
{
int id = threadIdx.x;
for (int i = 0; i < n_t; i++){
Ik1[id] = gk_1*powf(0.75*(1-H1[id]),4)*(V1[id]-Ek_1);
//atomic, sync, event?
}
}
__global__ void compute_V1()
{
int id = threadIdx.x;
for (int i = 0; i < n_t; i++){
//wait for IL1 and Ik1 and others, but how????
V1[id] = Ik1[id]+IL1[id] + ....
//trigger the I's again
}
}
//main function
compute_IL1<<<1,10,0,s0>>>();
compute_IK1<<<1,10,0,s1>>>();
//repeat this for many 50 - 70 more kernels (Is and Vs)
So the question is, how would I sync these kernels? Is an event approach best? Is there a better paradigm to use here?
There is no sane mechanism I can think of to have multiple resident kernels synchronize without resorting to hacky atomic tricks which may well not work reliably.
If you are running blocks with 10 threads and these kernels cannot execute concurrently for correctness reasons, you are (in the best possible case) using 1/64 of the computational capacity of your device. This problem as you have described it sounds completely Ill suited to a GPU.
So, I tried a couple of approaches.
A loop with a few kernel calls, where the last kernel call is dependent on the previous ones. This can be done with cudaStreamWaitEvent which can wait for multiple events. I found this on: http://cedric-augonnet.com/declaring-dependencies-with-cudastreamwaitevent/ . Unfortunately, the kernel calls were too expensive.
Global variables between concurrent streams. The logic was pretty simple, having one thread pause until a global variable equaled the loop variable, indicating that all threads could proceed. This was then followed by a sync-threads call. Unfortunately, this did not work well.
Ultimately, I think I've settled on a nested loop, where the outer loop represents time, and the inner loop indicates which of a set instructions to run, based on dependencies. I also launched the maximum number of threads per block (1024) and broke up the vectors that needed to be processed into warps. The rough psuedocode is:
run_main<<<1,1024>>>();
__global__ void run_main(){
int warp = threadIdx.x/32;
int id = threadIdx.x - warp*32;
if (id < 10){
for (int i = 0; i < n_t; i++){
for(int j = 0; j < n_j; j++){
switch (j){
case 0:
switch(warp){
case 0:
I1[id] = a + b + c*d ...
break;
case 1:
I2[id] = f*g/h
break;
}
break;
//These things depend on case 0 OR
//we've run out of space in the first pass
//32 cases max [0 ... 31]
case 1:
switch(warp){
case 0:
V1[ID] = I1*I2+ ...
break;
case 1:
V2[ID] = ...
//syncs across the block
__syncthreads();
This design is based on my impression that each set of 32 threads runs independently but should run the same code, otherwise things can slow done significantly.
So at the end, I'm running roughly 32*10 instructions simultaneously.
Where 32 is the number of warps, and it depends on how many different values I can compute at the same time (due to dependencies) and 10 is the # of elements in each vector. This is slowed down by any imbalances in the # of computations in each warp case, since all warps need to merge before moving onto the next step (due to the syncthreads call). I'm running different parameters (parameter sweep) on top of this, so I could potentially run 3 at a time in the block, multiplied by the # of streaming processors (or whatever the official name is) on the card.
One thing I need to change is that I'm currently testing on a video card that is attached to a monitor as well. Apparently Windows will kill the kernel if it lasts for more than 5 seconds, so I need to call the kernel in chunked time steps, like once every 1e5 time steps (in my case).

Memory access in CUDA kernel functions (simple example)

I am novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVidia "CUDA by examples" book.
And I do not understand properly how thread access and change variables in such a simple example (dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
__shared__ float cache[threadsPerBlock];
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;
float temp = 0;
while (tid < N) {
temp += a[tid] * b[tid];
tid += blockDim.x * gridDim.x;
}
// set the cache values
cache[cacheIndex] = temp;
I do not understand three things.
What is the sequence of execution of this function? Is there any sequence between threads? For example, the first are the thread from the first block, then threads from the second block come into play and so on (this is connected to the question why this is necessary to divide threads into blocks).
Do all threads have their own copy of the "temp" variable or not (if not, why is there no race condition?)
How is it operated? What exactly goes to the variable temp in the while loop? The array cache stores values of temp for different threads. How does the summation go on? It seems that temp already contains all sums necessary for dot product because variable tid goes from 0 to N-1 in the while loop.
Despite the code you provide is incomplete, here are some clarifications about what you are asking :
The kernel code will be executed by all the threads in all the blocks. The way to "split the jobs" is to make threads work only on one or a few elements.
For instance, if you have to treat 100 integers with a specific algorithm, you probably want 100 threads to treat 1 element each.
In CUDA the amount of blocks and threads is defined at the kernel launch on host side :
myKernel<<<grid, threads>>>(...);
Where grids and threads are dim3, which define the size on three dimensions.
There is no specific order in the execution of threads and blocks. As you can read there :
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is defined in the kernel in no specific way, it is not distributed and each thread will have this value stored in a register.
This is equivalent of what is done on CPU side. So yes, this means each threads has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using access to device arrays.
Again, this is equivalent of what is done on CPU side.
I think you should probably check if you are used enough to C/C++ programming on CPU side before going further into GPU programming. Meaning no offense, it seems you have a lack in several main topics.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specificities of the hardware.

Empirically determining how many threads are in a warp

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000
__device__ volatile int test2 = 0;
__device__ int test3 = 32767;
__global__ void kernel(){
for (int i = 0; i < LOOPS; i++){
unsigned long time = clock64();
// while (clock64() < (time + (threadIdx.x * 1000)));
int start = test2;
atomicAdd((int *)&test2, 1);
int end = test2;
int diff = end - start;
atomicMin(&test3, diff);
}
}
int main() {
kernel<<<1, 1024>>>();
int result;
cudaMemcpyFromSymbol(&result, test3, sizeof(int));
printf("result = %d threads\n", result);
return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requres a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx and write values to memory. Group data by the three variables to find the warp size. This is even easier if you limit the launch to a single block then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as a unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch warps on the same SM but different warp schedulers cannot issue the first instruction of a warp on the same cycle
All threads in the block that have the same clock time on CC1.0 - 3.5 devices (may change in the future) will have the same clock time.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of of 1 block with increasing thread count per launch. Determine when the thread count causing warps_launched to go from 1 to 2.

How to share global memory within multiple kernels and multiple GPUs?

----------------a.c---------------------
variable *XX;
func1(){
for(...){
for(i = 0; i < 4; i++)
cutStartThread(func2,args)
}
}
---------------b.cu-------------------
func2(args){
cudaSetDevice(i);
xx = cudaMalloc();
mykernel<<<...>>>(xx);
}
--------------------------------------
Recently, I want to use multiple GPU device for my program. There are four Tesla C2075 cards on my node. I use four threads to manage the four GPUs. What's more, the kernel in each thread is launched several times. A simple pseudo code as above. I have two questions:
Variable XX is a very long string, and is read only in the kernel. I want to preserve it during the multiple launches of mykernel. Is it ok to call cudaMalloc and pass the pointer to mykernel only when mykernel is first launched? Or should I use __device__ qualifier?
XX is used in four threads, so I declare it as a global variable in file a.c. Are multiple cudaMalloc of XX correct or should I use an array such as variable *xx[4]?
For usage by kernels running on a single device, you can call cudaMalloc once to create your variable XX holding the string, then pass the pointer created by cudaMalloc (i.e. XX) to whichever kernels need it.
#define xx_length 20
char *XX;
cudaMalloc((void **)&XX, xx_length * sizeof(char));
...
kernel1<<<...>>>(XX, ...);
...
kernel2<<<...>>>(XX, ...);
etc.
Create a separate XX variable for each thread, assuming that each thread is being used to access a different device. How exactly you do this will depend on the scope of XX. But an array of:
char *XX[num_devices];
at global scope, should be OK.
The CUDA OpenMP sample may be of interest as an example of how to use multiple threads to manage multiple GPUs.

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided be the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But as opposed to using malloc/free, when l_SrcIntegral is defined as a local array (as it appears in the commented line), it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
if (l_ThreadIndex >= d_MaxParallelRows)
return;
int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
if (l_Row <= d_MaxTreatedRow) {
//float l_SrcIntegral[1700];
float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
for (int x=185; x<1407; x++) {
for (int i=0; i<1700; i++)
l_SrcIntegral[i] = i;
}
free(l_SrcIntegral);
}
}
int main()
{
cudaError_t cudaStatus;
cudaStatus = cudaSetDevice(0);
int l_ThreadsPerBlock = k_ThreadsPerBlock;
int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
int l_FirstRow = d_MinTreatedRow;
while (l_FirstRow <= d_MaxTreatedRow) {
printf("CUDA: FirstRow=%d\n", l_FirstRow);
fflush(stdout);
myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
cudaDeviceSynchronize();
l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
}
printf("CUDA: Done\n");
return 0;
}
1.
As #aland said, you will maybe even encounter worse performance calculating just one row in each kernel call.
You have to think about processing the whole input, just to theoretically use the power of the massive parallel processing.
Why start multiple kernels with just 320 threads just to calculate one row?
How about using as many blocks you have rows and let the threads per block process one row.
(320 threads per block is not a good choice, check out how to reach better occupancy)
2.
If your fast resources as registers and shared memory are not enough, you have to use a tile apporach which is one of the basics using GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in a column vector in an arbitrary sized matrix.
Each block processes one column and the threads of that block store in a tile loop their elements in a shared memory array. When finished they calculate the sum using parallel reduction, just to start the next iteration.
At the end each block calculated the sum of its vector.
You can still use dynamic array sizes using shared memory. Just pass a third argument in the <<<...>>> of the kernel call. That'd be the size of your shared memory per block.
Once you're there, just bring all relevant data into your shared array (you should still try to keep coalesced accesses) bringing one or several (if it's relevant to keep coalesced accesses) elements per thread. Sync threads after it's been brought (only if you need to stop race conditions, to make sure the whole array is in shared memory before any computation is done) and you're good to go.
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but still, it could be done tessellating through blocks/threads and not nested for loops (which are VERY bad for performance!) I hope you're running your sample code using just 1 block and 1 thread, otherwise it wouldn't make much sense.