how to avoid thread divergence in this CUDA kernel? - cuda

For the CUDA kernel shown below, I get branch divergence. How can I optimize it?
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
if (gx % 4 == 0)
    val = op1(val);
else if (gx % 4 == 1)
    val = op2(val);
else if (gx % 4 == 2)
    val = op3(val);
else if (gx % 4 == 3)
    val = op4(val);
g_data[gx] = val;

If I were programming in CUDA, I certainly wouldn't do any of this. However, to answer your question:
how to avoid thread divergence in this CUDA kernel?
You could do something like this:
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
int gx_bit_0 = gx & 1;
int gx_bit_1 = (gx & 2) >> 1;
val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val)
    + (1-gx_bit_1)*(  gx_bit_0)*op2(val)
    + (  gx_bit_1)*(1-gx_bit_0)*op3(val)
    + (  gx_bit_1)*(  gx_bit_0)*op4(val);
g_data[gx] = val;
Here is a full test case:
$ cat t1914.cu
#include <iostream>
__device__ float op1(float val) { return val + 1.0f;}
__device__ float op2(float val) { return val + 2.0f;}
__device__ float op3(float val) { return val + 3.0f;}
__device__ float op4(float val) { return val + 4.0f;}
__global__ void k(float *g_data){
    int gx = threadIdx.x + blockDim.x * blockIdx.x;
    float val = g_data[gx];
    int gx_bit_0 = gx & 1;
    int gx_bit_1 = (gx & 2) >> 1;
    val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val)
        + (1-gx_bit_1)*(  gx_bit_0)*op2(val)
        + (  gx_bit_1)*(1-gx_bit_0)*op3(val)
        + (  gx_bit_1)*(  gx_bit_0)*op4(val);
    g_data[gx] = val;
}
const int N = 32;
int main(){
    float *data;
    cudaMallocManaged(&data, N*sizeof(float));
    for (int i = 0; i < N; i++) data[i] = 1.0f;
    k<<<1,N>>>(data);
    cudaDeviceSynchronize();
    for (int i = 0; i < N; i++) std::cout << data[i] << std::endl;
}
$ nvcc -o t1914 t1914.cu
$ compute-sanitizer ./t1914
========= COMPUTE-SANITIZER
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
========= ERROR SUMMARY: 0 errors
$
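Note that this arithmetic selection never branches, but it does evaluate all four ops for every thread and discards three of the results. An equivalent, more readable variant (my sketch, not part of the original answer) expresses the same selection with ternaries, which the compiler typically lowers to predicated select instructions rather than divergent branches:
int sel = gx & 3; // same as gx % 4 for non-negative gx
float v1 = op1(val), v2 = op2(val), v3 = op3(val), v4 = op4(val);
val = (sel == 0) ? v1 : (sel == 1) ? v2 : (sel == 2) ? v3 : v4;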

Solution by changing the work per thread
The best solution with the existing data layout is to let every thread compute 4 consecutive values. It's better to have fewer threads that can work properly than have more that can't.
float* g_data;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
g_data[4 * gx] = op1(g_data[4 * gx]);
g_data[4 * gx + 1] = op2(g_data[4 * gx + 1]);
g_data[4 * gx + 2] = op3(g_data[4 * gx + 2]);
g_data[4 * gx + 3] = op4(g_data[4 * gx + 3]);
If the size of g_data is not a multiple of 4, put an if around the index operations. If it is always a multiple of 4 and properly aligned, load and store 4 values as a float4 for better performance.
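As a sketch of that float4 suggestion (my addition, assuming g_data is 16-byte aligned and its length is a multiple of 4), the kernel could look like this, launched with a quarter as many threads:
__global__ void k4(float4 *g_data4){
    int gx = threadIdx.x + blockDim.x * blockIdx.x;
    float4 v = g_data4[gx]; // one coalesced 16-byte load per thread
    v.x = op1(v.x);
    v.y = op2(v.y);
    v.z = op3(v.z);
    v.w = op4(v.w);
    g_data4[gx] = v; // one 16-byte store
}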
Solution by reordering the work
As all my talk about float4 may have suggested, your input data appears to be some form of 2D structure in which elements whose indices are equal modulo 4 share the same function. Maybe it is an array of structs, or an array of vectors -- in other words, a matrix.
For the purpose of explaining what I mean, I consider it an Nx4 matrix. If you transpose this into a 4xN matrix and apply a kernel to that, most of your problems disappear: entries that need the same operation are then placed next to each other in memory, which makes writing an efficient kernel easier. Something like this:
float* g_data;
int rows_in_g;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
int gy = threadIdx.y;
float& own_g = g_data[gx + rows_in_g * gy];
switch(gy) {
    case 0: own_g = op1(own_g); break;
    case 1: own_g = op2(own_g); break;
    case 2: own_g = op3(own_g); break;
    case 3: own_g = op4(own_g); break;
    default: break;
}
Start this as a 2D kernel with blocksize x=32, y=4 and gridsize x=N/32, y=1.
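For concreteness, that launch could look like this (my sketch; the kernel name is hypothetical, and N is assumed to be a multiple of 32):
dim3 block(32, 4);
dim3 grid(N / 32, 1);
transposed_k<<<grid, block>>>(g_data, rows_in_g);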
Now your kernel is still divergent, but all threads within a warp will execute the same case and access consecutive floats in memory. That's the best you can achieve. Of course this all depends on whether you can change the data layout.

Related

PyCUDA how to get the number of used registers per thread when launching the kernels?

I have a kernel; how can I get the number of registers used per thread when launching it? I mean in a PyCUDA way.
A simple example will be:
__global__
void
make_blobs(float* matrix, float2 *pts, int num_pts, float sigma, int rows, int cols) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < cols && y < rows) {
        int idx = y*cols + x;
        float temp = 0.f;
        for (int i = 0; i < num_pts; i++) {
            float x_0 = pts[i].x;
            float y_0 = pts[i].y;
            temp += exp(-(pow(x - x_0, 2) + pow(y - y_0, 2)) / (2 * sigma*sigma));
        }
        matrix[idx] = temp;
    }
}
Is there any way to get the number without crashing the program if the actual number used exceeds the max?
The above is OK; it does not exceed the max on my machine. I just want to get the number in a convenient way. Thanks!
PyCUDA already provides this as part of the CUDA function object. The property is called pycuda.driver.Function.num_regs.
Below is a small example that shows how to use it:
import pycuda.autoinit
from pycuda.compiler import SourceModule
kernel_src = """
__global__ void
make_blobs(float* matrix, float2 *pts, int num_pts, float sigma, int rows, int cols) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < cols && y < rows) {
        int idx = y*cols + x;
        float temp = 0.f;
        for (int i = 0; i < num_pts; i++) {
            float x_0 = pts[i].x;
            float y_0 = pts[i].y;
            temp += exp(-(pow(x - x_0, 2) + pow(y - y_0, 2)) / (2 * sigma*sigma));
        }
        matrix[idx] = temp;
    }
}"""
compiledKernel = SourceModule(kernel_src)
make_blobs = compiledKernel.get_function("make_blobs")
print(make_blobs.num_regs)
Note that you don't need to use SourceModule. You can also load the module from e.g. a cubin file. More details can be found in the documentation.
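For example, loading the same kernel from a cubin file instead (my sketch; the file name is hypothetical):
import pycuda.autoinit
import pycuda.driver as drv

mod = drv.module_from_file("make_blobs.cubin")
make_blobs = mod.get_function("make_blobs")
print(make_blobs.num_regs) # registers per thread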

CUDA threads for inner loop

I've got this kernel
__global__ void kernel1(int keep, int include, int width, int* d_Xco,
                        int* d_Xnum, bool* d_Xvalid, float* d_Xblas)
{
    int i, k;
    i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < keep){
        for(k = 0; k < include; k++){
            int val = (d_Xblas[i*include + k] >= 1e5);
            int aux = d_Xnum[i];
            d_Xblas[i*include + k] *= (!val);
            d_Xco[i*width + aux] = k;
            d_Xnum[i] += val;
            d_Xvalid[i*include + k] = (!val);
        }
    }
}
launched with
int keep = 9000;
int include = 23000;
int width = 0.2*include;
int threads = 192;
int blocks = (keep+threads-1)/threads;
kernel1 <<< blocks,threads >>>( keep, include, width,
d_Xco, d_Xnum, d_Xvalid, d_Xblas );
This kernel1 works fine, but it is obviously not fully optimized. I thought it would be straightforward to eliminate the inner loop over k, but for some reason it doesn't work.
My first idea was:
__global__ void kernel2(int keep, int include, int width,
                        int* d_Xco, int* d_Xnum, bool* d_Xvalid,
                        float* d_Xblas)
{
    int i, k;
    i = threadIdx.x + blockIdx.x * blockDim.x;
    k = threadIdx.y + blockIdx.y * blockDim.y;
    if((i < keep) && (k < include)) {
        int val = (d_Xblas[i*include + k] >= 1e5);
        int aux = d_Xnum[i];
        d_Xblas[i*include + k] *= (float)(!val);
        d_Xco[i*width + aux] = k;
        atomicAdd(&d_Xnum[i], val);
        d_Xvalid[i*include + k] = (!val);
    }
}
launched with a 2D grid:
int keep = 9000;
int include = 23000;
int width = 0.2*include;
int th = 32;
dim3 threads(th,th);
dim3 blocks ((keep+threads.x-1)/threads.x, (include+threads.y-1)/threads.y);
kernel2 <<< blocks,threads >>>( keep, include, width, d_Xco, d_Xnum,
d_Xvalid, d_Xblas );
Although I believe the idea is fine, it does not work, and I am running out of ideas. Could you please help me out? I also think the problem could be in d_Xco, which stores the positions k in a smaller array, pushing them to the beginning of the array, so the order matters.
d_Xco
-------------------------------
| 2|3 |15 |4 |5 |5 | | | | | | .......
-------------------------------
In the original code, you have
for(k = 0; k < include; k++){
    ...
    int aux = d_Xnum[i];
    ...
    d_Xco[i*width + aux] = k;
    ...
}
The index to the d_Xco array is not dependent on k and therefore writing to it each iteration is redundant. The final value will always be include-1. So, replace these two lines inside the k loop with one line outside the k loop:
d_Xco[i*width + d_Xnum[i]] = include - 1;
Once you do that, when you parallelize the k loop you will no longer have the race condition you currently have when many k threads assign different values to the same location in d_Xco concurrently (no guarantee of ordering).
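Applying that suggestion to the original kernel1 gives something like this (my sketch, reusing the question's names; the single d_Xco write happens after d_Xnum[i] has reached its final value):
__global__ void kernel1_fixed(int keep, int include, int width, int* d_Xco,
                              int* d_Xnum, bool* d_Xvalid, float* d_Xblas)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < keep){
        for(int k = 0; k < include; k++){
            int val = (d_Xblas[i*include + k] >= 1e5);
            d_Xblas[i*include + k] *= (!val);
            d_Xnum[i] += val;
            d_Xvalid[i*include + k] = (!val);
        }
        // one write, outside the k loop
        d_Xco[i*width + d_Xnum[i]] = include - 1;
    }
}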

Get statistics for a list of numbers using GPU

I have several lists of numbers on a file . For example,
.333, .324, .123 , .543, .00054
.2243, .333, .53343 , .4434
Now, I want to get the number of times each number occurs using the GPU. I believe this will be faster to do on the GPU than the CPU because each thread can process one list. What data structure should I use on the GPU to easily get the above counts. For example , for the above, the answer will look as follows:
.333 = 2 times in entire file
.324 = 1 time
etc..
I'm looking for a general solution, not one that works only on devices with a specific compute capability.
Just writing out the kernel suggested by Pavan, to see if I have implemented it efficiently:
int uniqueEle = new_end.first - d_Values.begin(); // number of unique elements
int* count;
cudaMalloc((void**)&count, uniqueEle * sizeof(int)); // stores the count of each unique element
int TPB = 256;
int blocks = (uniqueEle + TPB - 1) / TPB;
// Cast d_I to a raw pointer called d_rawI
launch<<<blocks,TPB>>>(d_rawI, count, uniqueEle);
__global__ void launch(int *i, int* count, int n){
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int indexes[256];
    if(id < n){
        indexes[threadIdx.x] = i[id];
        // handle the boundary that occurs between two blocks
        if(id % 255 == 0){
            count[id] = i[id+1] - i[id];
        }
    }
    __syncthreads();
    if(id < n - 1){
        if(threadIdx.x < 255)
            count[id] = indexes[threadIdx.x+1] - indexes[threadIdx.x];
    }
}
Question: how to modify this kernel so that it handles arrays of arbitrary size. I.e , handle the condition when the total number of threads < number of elements
Here is how I would do it in MATLAB:
A = [.333, .324, .123, .543, .00054, .2243, .333, .53343, .4434];
[values, locations] = unique(A); % Find unique values and their locations
counts = diff([0, locations]); % Find the count based on their locations
There is no easy way to do this in plain CUDA, but you can use existing libraries to do it.
1) Thrust
It ships with the CUDA toolkit as of CUDA 4.0.
The MATLAB code can be roughly translated into Thrust using the following functions. I am not too proficient with Thrust, but I am just trying to give you an idea of which routines to look at.
float _A[] = {.333, .324, .123, .543, .00054, .2243, .333, .53343, .4434};
int _I[] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
float *A; int *I;
// Allocate memory on device and cudaMemcpy values from _A to A and _I to I
int num = 9;
// Values vector
thrust::device_vector<float> d_A(A, A+num);
// Need to sort to get same values together
thrust::stable_sort(d_A.begin(), d_A.end());
// Vector containing 0 to num-1
thrust::device_vector<int> d_I(I, I+num);
// Output vectors for the unique values and their first locations
thrust::device_vector<float> d_Values(num);
thrust::device_vector<int> d_Locations(num);
// Find unique elements, copying each unique value and the index of its first occurrence
thrust::pair<thrust::device_vector<float>::iterator,
             thrust::device_vector<int>::iterator> new_end;
new_end = thrust::unique_by_key_copy(d_A.begin(), d_A.end(), d_I.begin(),
                                     d_Values.begin(), d_Locations.begin());
You now have the locations of the first instance of each unique value. You can now launch a kernel to find the differences between adjacent elements from 0 to new_end in d_Locations. Subtract the final value from num to get the count for final location.
EDIT (Adding code that was provided over chat)
Here is how the difference code needs to be done
#define MAX_BLOCKS 65535
#define roundup(A, B) (((A) + (B) - 1) / (B))
int uniqueEle = new_end.first - d_Values.begin();
int* count;
cudaMalloc((void**)&count, uniqueEle * sizeof(int));
int TPB = 256;
int num_blocks = roundup(uniqueEle, TPB);
int blocks_y = roundup(num_blocks, MAX_BLOCKS);
int blocks_x = roundup(num_blocks, blocks_y);
dim3 blocks(blocks_x, blocks_y);
kernel<<<blocks,TPB>>>(d_rawI, count, uniqueEle);
__global__ void kernel(int *i, int* count, int n)
{
    int tx = threadIdx.x;
    int bid = blockIdx.y * gridDim.x + blockIdx.x;
    int id = blockDim.x * bid + tx;
    __shared__ int indexes[256];
    if (id < n) indexes[tx] = i[id];
    __syncthreads();
    if (id < n - 1) {
        if (tx < 255) count[id] = indexes[tx + 1] - indexes[tx];
        else count[id] = i[id + 1] - indexes[tx];
    }
    if (id == n - 1) count[id] = n - indexes[tx];
    return;
}
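As an aside (my addition, not from the original answers): with Thrust the counts can also be obtained without a hand-written diff kernel, by sorting and then summing a constant 1 per element of each run of equal values with reduce_by_key. A minimal sketch:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// d_A holds the input values (copied from the host as before)
thrust::device_vector<float> d_A(A, A+num);
thrust::sort(d_A.begin(), d_A.end());
thrust::device_vector<float> d_vals(num);
thrust::device_vector<int> d_counts(num);
// reduce_by_key sums the constant 1s over each run of equal keys
thrust::pair<thrust::device_vector<float>::iterator,
             thrust::device_vector<int>::iterator> ends;
ends = thrust::reduce_by_key(d_A.begin(), d_A.end(),
                             thrust::constant_iterator<int>(1),
                             d_vals.begin(), d_counts.begin());
int n_unique = ends.first - d_vals.begin();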
2) ArrayFire
This is an easy-to-use, free, array-based library.
You can do the following in ArrayFire.
using namespace af;
float h_A[] = {.333, .324, .123, .543, .00054, .2243, .333, .53343, .4434};
int num = 9;
// Transfer data to device
array A(num, 1, h_A);
array values, locations, original;
// Find the unique values and locations
setunique(values, locations, original, A);
// Locations are 0-based, so add 1.
// Append *num* at the end to find the count of the last value.
array counts = diff1(join(locations + 1, num));
Disclosure: I work for AccelerEyes, which develops this software.
To answer the latest addendum to this question: the diff kernel which would complete the Thrust method proposed by Pavan could look something like this:
template<int blcksz>
__global__ void diffkernel(const int *i, int* count, const int n) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    int strd = blockDim.x * gridDim.x;
    int nmax = blcksz * ((n/blcksz) + ((n%blcksz>0) ? 1 : 0));
    __shared__ int indices[blcksz+1];
    for(; id<nmax; id+=strd) {
        // Data load
        indices[threadIdx.x] = (id < n) ? i[id] : n;
        if (threadIdx.x == (blcksz-1))
            indices[blcksz] = ((id+1) < n) ? i[id+1] : n;
        __syncthreads();
        // Differencing calculation
        int diff = indices[threadIdx.x+1] - indices[threadIdx.x];
        // Store
        if (id < n) count[id] = diff;
        __syncthreads();
    }
}
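A launch sketch for this template (my addition; the pointer names are illustrative, and any grid size works because the kernel strides):
const int TPB = 256; // must match the blcksz template parameter
int nblocks = (n + TPB - 1) / TPB;
if (nblocks > 65535) nblocks = 65535; // 1D grid limit; the stride loop covers the rest
diffkernel<TPB><<<nblocks, TPB>>>(d_locations, d_count, n);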
Here is a brute-force solution (each thread scans the entire array, so the total work is O(N^2)):
__global__ void counter(float* a, int* b, int N)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    if(idx < N)
    {
        float my = a[idx];
        int count = 0;
        for(int i=0; i < N; i++)
        {
            if(my == a[i])
                count++;
        }
        b[idx] = count;
    }
}
int main()
{
    int threads = 9;
    int blocks = 1;
    int N = blocks*threads;
    float* h_a;
    int* h_b;
    float* d_a;
    int* d_b;
    h_a = (float*)malloc(N*sizeof(float));
    h_b = (int*)malloc(N*sizeof(int));
    cudaMalloc((void**)&d_a,N*sizeof(float));
    cudaMalloc((void**)&d_b,N*sizeof(int));
    h_a[0] = .333f;
    h_a[1] = .324f;
    h_a[2] = .123f;
    h_a[3] = .543f;
    h_a[4] = .00054f;
    h_a[5] = .2243f;
    h_a[6] = .333f;
    h_a[7] = .53343f;
    h_a[8] = .4434f;
    cudaMemcpy(d_a,h_a,N*sizeof(float),cudaMemcpyHostToDevice);
    counter<<<blocks,threads>>>(d_a,d_b,N);
    cudaMemcpy(h_b,d_b,N*sizeof(int),cudaMemcpyDeviceToHost);
    for(int i=0; i < N; i++)
    {
        printf("%f = %d times\n",h_a[i],h_b[i]);
    }
    cudaFree(d_a);
    cudaFree(d_b);
    free(h_a);
    free(h_b);
    getchar();
    return 0;
}

Calculating differences between consecutive indices fast

Given that I have the array
dintptr = { 0, 2, 8, 11, 13, 15 }
and let Sum be 16, I want to compute the difference between consecutive indices using the GPU. So the final array should be as follows:
count = { 2, 6, 3, 2, 2, 1 }
Below is my kernel:
// for this function n is 6
__global__ void kernel(int *dintptr, int * count, int n){
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int indexes[256];
    int need = (n % 256 == 0) ? 0 : 1;
    int allow = 256 * (n/256 + need);
    while(id < allow){
        if(id < n){
            indexes[threadIdx.x] = dintptr[id];
        }
        __syncthreads();
        if(id < n - 1){
            if(threadIdx.x % 255 == 0){
                count[id] = indexes[threadIdx.x + 1] - indexes[threadIdx.x];
            }else{
                count[id] = dintptr[id+1] - dintptr[id];
            }
        } // end if id < n-1
        __syncthreads();
        id += (gridDim.x * blockDim.x);
    } // end while
} // end kernel
// For the last element, explicitly set count[n-1] = Sum - dintptr[n-1]
Two questions:
Is this kernel fast? Can you suggest a faster implementation?
Does this kernel handle arrays of arbitrary size? (I think it does.)
I'll bite.
__global__ void kernel(int *dintptr, int * count, int n)
{
    for (int id = blockDim.x * blockIdx.x + threadIdx.x;
         id < n-1;
         id += gridDim.x * blockDim.x)
        count[id] = dintptr[id+1] - dintptr[id];
}
(Since you said you "explicitly" set the value of the last element, and you didn't in your kernel, I didn't bother to set it here either.)
I don't see a lot of advantage to using shared memory in this kernel as you do: the L1 cache on Fermi should give you nearly the same advantage since your locality is high and reuse is low.
Both your kernel and mine appear to handle arbitrary-sized arrays. Yours however appears to assume blockDim.x == 256.
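A usage sketch for the grid-stride kernel above (my addition; sizes are illustrative):
int threads = 256;
int blocks = (n + threads - 1) / threads; // any grid size works, since the loop strides
kernel<<<blocks, threads>>>(dintptr, count, n);
// then set the last element separately, as the question's comment says:
// count[n-1] = Sum - dintptr[n-1]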

CUDA kernel - nested for loop

Hello
I'm trying to write a CUDA kernel to perform the following piece of code.
for (n = 0; n < (total-1); n++)
{
    a = values[n];
    for (i = n+1; i < total; i++)
    {
        b = values[i] - a;
        c = b*b;
        if (c < 10)
            newvalues[i] = c;
    }
}
This is what I have currently, but it does not seem to be giving the correct results. Does anyone know what I'm doing wrong? Cheers
__global__ void calc(int total, float *values, float *newvalues){
    float a,b,c;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x){
        a = values[n];
        for(int i = n+1; i < total; i++){
            b = values[i] - a;
            c = b*b;
            if(c < 10)
                newvalues[i] = c;
        }
    }
}
Realize this problem in 2D and launch your kernel with 2D thread blocks. The total number of threads in the x and y dimensions should each equal total. The kernel code should look like this:
__global__ void calc(float *values, float *newvalues, int total){
    float a,b,c;
    int n = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= total || i >= total)
        return;
    a = values[n];
    b = values[i] - a;
    c = b*b;
    if (c < 10)
        newvalues[i] = c;
    // I don't know your problem statement, but I think it should be: newvalues[n*total+i] = c;
}
Update:
This is how you should call the kernel
dim3 block(16,16);
dim3 grid((total+15)/16, (total+15)/16);
calc<<<grid,block>>>(values, newvalues, total);
Also make sure you add this check to the kernel (see the updated kernel):
if (n>=total || i>=total)
return;
Update 2:
fixed a typo: blockIdy.y should be blockIdx.y
I'll probably be way off, but the n < (total-1) check in
for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x)
seems different from the original version.
Why don't you just remove the outer loop and start the kernel with as many threads as you need for it? It's a bit unusual to have a loop that depends on your block ID; normally you try to avoid such loops.
Secondly, it seems to me that newvalues[i] can be overwritten by different threads.