Will other threads block in this code with CUDA?

I am new to CUDA programming and am seeing some strange behaviour.
I have a kernel like this:
__global__ void myKernel(uint64_t *input, int numOfBlocks, uint64_t *state) {
    int const t = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    for (i = 0; i < numOfBlocks; i++) {
        if (t < 32) {
            if (t < 8) {
                state[t] = state[t] ^ input[t];
            }
            if (t < 25) {
                deviceFunc(state); /* will use some printf() */
            }
        }
    }
}
I launch this kernel with these parameters:
myKernel<<<1, 32>>>(input, numOfBlocks, state);
If 'numOfBlocks' is equal to 1, everything works fine: I get the result I expect, and the printf() calls inside deviceFunc() appear in the correct order.
If 'numOfBlocks' is equal to 2, it does not work! The result is not what I expected, and the printf() output is not in the correct order (I only use printf() from thread 0)!
So my question is: will the remaining threads (t = 25 to 31), which are NOT calling deviceFunc(), wait and block at this position, or will they run ahead and start the next for-loop iteration? I always thought that every line in a kernel is executed in lockstep within the same block.

I worked on this the whole day and finally found a solution. First, you are right that my deviceFunc() contained many RAW hazards. I started to put __syncthreads() after every WRITE operation, but I think this slows down my program, and I don't think __syncthreads() is the usual way to resolve them. Funnily enough, the result is the same with and without __syncthreads().
But the real problem in my code above is that I used
input[t]
which was wrong, because the index has to advance with the loop over 'numOfBlocks':
input[(NUM_OF_XOR_THREADS * i) + t]
Now, the result was correct and my problem is solved.
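For reference, here is a minimal sketch of the corrected kernel. This is an illustration only: it assumes NUM_OF_XOR_THREADS is the width of the XOR step (8 in the code above), and deviceFunc() stays as in the question.

__global__ void myKernel(uint64_t *input, int numOfBlocks, uint64_t *state) {
    int const t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < numOfBlocks; i++) {
        if (t < 8) {
            // Advance the input index by NUM_OF_XOR_THREADS words per
            // iteration instead of re-reading input[t] every time.
            state[t] = state[t] ^ input[(NUM_OF_XOR_THREADS * i) + t];
        }
        if (t < 25) {
            deviceFunc(state); /* will use some printf() */
        }
        // With a <<<1, 32>>> launch all threads belong to one warp, but
        // divergent branches are not guaranteed to stay in lockstep, so an
        // explicit barrier keeps loop iterations from overlapping.
        __syncthreads();
    }
}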

Related

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas; // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;

static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Do stuff with the line indicated by index i
    // ...
    // Save result into array, gets reduced to single number in setDist
    phs_deltas[i] = phs_delta;
}

static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
    // Final step; does it need to be its own kernel if it only runs once??
    d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);
    // Save some other stuff to refer to next frame
    // ...
}

void fftExec(unsigned __int32 *host_data)
{
    // Copy image to device, do FFT, etc
    // ...
    // Last parallel analysis step, sets d_phs_deltas
    getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
    // Should this be a serial part at the end of getDists somehow?
    setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}

// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
I would recommend benchmarking both ways and choosing which is faster. On most platforms kernel launch latency is only a few microseconds, so a short running kernel to finish an action after a long running kernel can be the most efficient way to get this done.
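If you do benchmark, a minimal timing sketch with CUDA events might look like the following. bigKernel and finishKernel are placeholders standing in for the long-running kernel and the <<<1, 1>>> finisher; they are not names from the question.

#include <cstdio>

// Placeholder kernels, illustrative only.
__global__ void bigKernel()    { /* long-running parallel work */ }
__global__ void finishKernel() { /* serial finishing step */ }

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the long kernel plus the tiny serial finisher together.
    cudaEventRecord(start);
    bigKernel<<<256, 256>>>();
    finishKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}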

cudaMemcpyFromSymbol errors when calling a CUDA kernel multiple times

I am running a loop on a GPU such that after every iteration, I check whether the convergence condition is satisfied. If it is, I exit the while loop.
__device__ int converged = 0; // this line before the kernel
inside the kernel:
__global__ void convergence_kernel()
{
    if (convergence condition is true)
    {
        atomicAdd(&converged, 1);
    }
}
On CPU I am calling the kernel within the loop:
int *convc = (int*) calloc(1, sizeof(int));
//converged = 0; // commented out, as this is not correct per Robert's suggestion
while (convc[0] < 1)
{
    foo_bar1<<<num_blocks, threads>>>(err, count);
    cudaDeviceSynchronize();
    count += 1;
    cudaMemcpyFromSymbol(convc, converged, sizeof(int));
}
So here, if the condition is true, convc[0] should be 1; however, when I print this value, I always see a random value, e.g. conv = 3104, conv = 17280, conv = 17408, etc.
Can someone tell me what's missing in my cudaMemcpyFromSymbol operation? Am I missing something? Thanks in advance.
My best guess as to why you are getting garbage when you read the converged value into convc is that you have not initialized converged anywhere. It can't be done in host code like this:
converged = 0;
You could change your declaration to be like this:
__device__ int converged = 0; // this line before the kernel
or you could also use the cudaMemcpyToSymbol function, which is effectively the reverse of the cudaMemcpyFromSymbol function that you already seem to be aware of.
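For example, here is a minimal sketch of initializing the flag from the host before entering the loop. It reuses the names from your snippet and assumes the __device__ int converged declaration above.

// Reset the device-side flag from the host before the loop starts,
// so a stale value from a previous run cannot terminate it early.
int zero = 0;
cudaMemcpyToSymbol(converged, &zero, sizeof(int));

while (convc[0] < 1)
{
    foo_bar1<<<num_blocks, threads>>>(err, count);
    cudaDeviceSynchronize();
    count += 1;
    cudaMemcpyFromSymbol(convc, converged, sizeof(int));
}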

Thrust - accessing neighbors

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        // etc.
    }
    return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
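For concreteness, here is a minimal sketch of that work-around. The grid geometry, the names, and the 2D row-major layout are illustrative assumptions, not taken from my actual code:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>

// Marking kernel: neighbor access is straightforward here. The grid is
// width x height, stored row-major; the predicate (value < 0 and no
// neighbor > 1) mirrors the functor above, with bounds checks added.
__global__ void markCells(const float* x, int width, int height, bool* marked)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width * height) return;

    bool mark = x[i] < 0.0f;
    int col = i % width;
    if (mark) {
        if (col > 0         && x[i - 1]     > 1.0f) mark = false; // left
        if (col < width - 1 && x[i + 1]     > 1.0f) mark = false; // right
        if (i >= width      && x[i - width] > 1.0f) mark = false; // top
        // ... remaining neighbors handled analogously
    }
    marked[i] = mark;
}

// Host side: use the mark array as a stencil for copy_if, compacting
// the indices of the marked elements.
void compactIndices(const thrust::device_vector<float>& x, int width, int height,
                    thrust::device_vector<int>& indicesOut)
{
    int n = width * height;
    thrust::device_vector<bool> marked(n);
    markCells<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(x.data()),
                                        width, height,
                                        thrust::raw_pointer_cast(marked.data()));
    indicesOut.resize(n);
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>(n),
                        marked.begin(),      // stencil
                        indicesOut.begin(),
                        thrust::identity<bool>());
    indicesOut.resize(end - indicesOut.begin());
}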
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but compiling it gives a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix it?
struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            // etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
@phoad: you're right about the shared mem; it struck me after I had already posted my reply, and I then figured the cache would probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have a negligible impact on performance.
Tuples only support 10 values, so that would mean I would need tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that: I get some inexplicable results and sometimes end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>

struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;
    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.

CUDA: Unspecified Launch Failure

I was using CUDA-GDB to find out what the problem was with my kernel execution. The program would always output: Cuda error: kernel execution: unspecified launch failure. That's probably the worst error anyone could get, because it gives no indication whatsoever of what is going on!
Back to CUDA-GDB... When I was using the debugger, it would arrive at the kernel and output:
Breakpoint 1, myKernel (__cuda_0=0x200300000, __cuda_1=0x200400000, __cuda_2=320, __cuda_3=7872, __cuda_4=0xe805c0, __cuda_5=0xea05e0, __cuda_6=0x96dfa0, __cuda_7=0x955680, __cuda_8=0.056646065580379823, __cuda_9=-0.0045986640087569072, __cuda_10=0.125,
__cuda_11=18.598229033761132, __cuda_12=0.00048828125, __cuda_13=5.9604644775390625e-08)
at myFunction.cu:60
Then I would type next, and the output was:
0x00007ffff7f7a790 in __device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd ()
from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
The notable part of that section is the tag referring to a typedef'd datatype; COMPLEX16 is defined as: typedef double complex COMPLEX16
Then I would type next again, and the output was:
Single stepping until exit from function Z84_device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_ddddddPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd#plt,
which has no line number information.
0x00007ffff7f79560 in ?? () from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
Typing next once more gave:
Cannot find bounds of current function
Typing continue then produced:
Cuda error: kernel execution: unspecified launch failure.
This is the error I get without debugging. I have seen forum topics on something similar, where the debugger cannot find the bounds of the current function, possibly because the library is somehow not linked or something along those lines. The ?? was said to appear because the debugger is somewhere in a shell for some reason, and not in any function.
I believe the problem lies deeper, in the fact that I have these interesting data types in my code: COMPLEX16 and REAL8.
Here is my kernel...
__global__ void chisquared_LogLikelihood_Kernel(REAL8 *d_temp, double *d_sum, int lower, int dataSize,
                                                COMPLEX16 *freqModelhPlus_Data,
                                                COMPLEX16 *freqModelhCross_Data,
                                                COMPLEX16 *freqData_Data,
                                                REAL8 *oneSidedNoisePowerSpectrum_Data,
                                                double FplusScaled,
                                                double FcrossScaled,
                                                double deltaF,
                                                double twopit,
                                                double deltaT,
                                                double TwoDeltaToverN)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ REAL8 ssum[MAX_THREADS];

    if (idx < dataSize)
    {
        idx += lower; // accounts for the shift that was made in the original loop

        memset(ssum, 0, MAX_THREADS * sizeof(*ssum));

        int tid = threadIdx.x;
        int bid = blockIdx.x;

        REAL8 plainTemplateReal = FplusScaled * freqModelhPlus_Data[idx].re
                                  + freqModelhCross_Data[idx].re;
        REAL8 plainTemplateImag = FplusScaled * freqModelhPlus_Data[idx].im
                                  + freqModelhCross_Data[idx].im;

        /* do time-shifting... */
        /* (also un-do 1/deltaT scaling): */
        double f = ((double) idx) * deltaF;
        /* real & imag parts of exp(-2*pi*i*f*deltaT): */
        double re = cos(twopit * f);
        double im = -sin(twopit * f);

        REAL8 templateReal = (plainTemplateReal*re - plainTemplateImag*im) / deltaT;
        REAL8 templateImag = (plainTemplateReal*im + plainTemplateImag*re) / deltaT;
        double dataReal = freqData_Data[idx].re / deltaT;
        double dataImag = freqData_Data[idx].im / deltaT;

        /* compute squared difference & 'chi-squared': */
        double diffRe = dataReal - templateReal;            // Difference in real parts...
        double diffIm = dataImag - templateImag;            // ...and imaginary parts, and...
        double diffSquared = diffRe*diffRe + diffIm*diffIm; // ...squared difference of the 2 complex figures.

        //d_temp[idx - lower] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);
        //ssum[tid] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);

        /***** REDUCTION *****/
        //__syncthreads(); //all the temps should have data before we add them up
        //for (int i = blockDim.x / 2; i > 0; i >>= 1) { /* per block */
        //    if (tid < i)
        //        ssum[tid] += ssum[tid + i];
        //    __syncthreads();
        //}
        //d_sum[bid] = ssum[0];
    }
}
When I'm not debugging (-g -G not included in the command), the kernel only runs fine if I don't include the line(s) that begin with d_temp[idx - lower] and ssum[tid]. I only tried d_temp to make sure it wasn't a shared memory error, and it ran fine. I also tried running with ssum[tid] = 20.0 and various other values to make sure it wasn't that sort of problem; that ran fine too. When I run with either of them included, the kernel exits with the CUDA error above.
Please ask me if something is unclear or confusing.
There was a lack of context in my question. The assumption was probably that I had done cudaMalloc and other such preliminary things before the kernel execution for ALL the pointers involved. However, I had only done it for d_temp and d_sum (I was making tons of changes and barely realized I had introduced the other four pointers). Once I did cudaMalloc and cudaMemcpy for all the data needed, everything ran perfectly.
Thanks for the insight.
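For completeness, a rough sketch of the setup that was missing. dataSize and the h_* host arrays are illustrative placeholders, not names from my actual code:

// Allocate and populate device buffers for the four pointers that were
// never set up; d_temp and d_sum were already handled.
COMPLEX16 *d_freqModelhPlus, *d_freqModelhCross, *d_freqData;
REAL8 *d_noisePSD;

cudaMalloc((void**)&d_freqModelhPlus,  dataSize * sizeof(COMPLEX16));
cudaMalloc((void**)&d_freqModelhCross, dataSize * sizeof(COMPLEX16));
cudaMalloc((void**)&d_freqData,        dataSize * sizeof(COMPLEX16));
cudaMalloc((void**)&d_noisePSD,        dataSize * sizeof(REAL8));

cudaMemcpy(d_freqModelhPlus,  h_freqModelhPlus,  dataSize * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_freqModelhCross, h_freqModelhCross, dataSize * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_freqData,        h_freqData,        dataSize * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_noisePSD,        h_noisePSD,        dataSize * sizeof(REAL8),     cudaMemcpyHostToDevice);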

Parallel reduction and find index on CUDA

I have an array of 20K values and I am reducing it over 50 blocks with 400 threads each (num_blocks = 50 and block_size = 400).
My code looks like this:
getmax<<<num_blocks, block_size>>>(d_in, d_out1, d_indices);

__global__ void getmax(float *in1, float *out1, int *index)
{
    // Declare arrays to be in shared memory.
    // ('threads' is a compile-time constant equal to the block size.)
    __shared__ float max[threads];

    int nTotalThreads = blockDim.x; // Total number of active threads
    float temp;
    float max_val;
    int max_index;
    int arrayIndex;

    // Calculate which element this thread reads from memory
    arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    max[threadIdx.x] = in1[arrayIndex];
    max_val = max[threadIdx.x];
    max_index = blockDim.x*blockIdx.x + threadIdx.x;
    __syncthreads();

    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1);
        if (threadIdx.x < halfPoint)
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
                max_val = max[threadIdx.x];
            }
        }
        __syncthreads();
        nTotalThreads = (nTotalThreads >> 1); // divide by two.
    }

    if (threadIdx.x == 0)
    {
        out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
    }

    if (max[blockIdx.x] == max_val)
    {
        index[blockIdx.x] = max_index;
    }
}
The problem here is that at some point nTotalThreads is no longer exactly a power of 2, resulting in a garbage value for the index. The array out1 gives me the maximum value in each block, which is correct and validated. But the value of the index is wrong. For example: the max value in the first block occurs at index 40, but the kernel reports the index as 15. Similarly, the max in the second block is at 440, but the kernel reports 416.
Any suggestions?
It should be easy to ensure that nTotalThreads is always a power of 2.
Make the first reduction a special case that gets nTotalThreads to a power of 2. E.g., since you start with 400 values per block, fold down to 256 in the first step: threads 0-143 each merge in a second value (element t + 256), while threads 144-255 simply keep their single value. From then on you'd be fine; a sketch follows below.
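A minimal sketch of this idea, which also carries the winning index along with the value (the kernel name and the index bookkeeping are illustrative, not your original code):

#define THREADS 400 // block size from the question

__global__ void getmax_pow2(const float *in1, float *out1, int *index)
{
    __shared__ float smax[THREADS];
    __shared__ int   sidx[THREADS];

    int t = threadIdx.x;
    int g = blockDim.x * blockIdx.x + threadIdx.x;
    smax[t] = in1[g];
    sidx[t] = g;
    __syncthreads();

    // Special-cased first step: fold 400 values down to 256, the largest
    // power of two not exceeding 400. Threads 0-143 merge in element
    // t + 256; threads 144-255 keep their single value.
    if (t + 256 < THREADS && smax[t + 256] > smax[t]) {
        smax[t] = smax[t + 256];
        sidx[t] = sidx[t + 256];
    }
    __syncthreads();

    // Standard halving loop over a power-of-two count; the index
    // travels with the value at every step.
    for (int n = 128; n > 0; n >>= 1) {
        if (t < n && smax[t + n] > smax[t]) {
            smax[t] = smax[t + n];
            sidx[t] = sidx[t + n];
        }
        __syncthreads();
    }

    if (t == 0) {
        out1[blockIdx.x]  = smax[0];
        index[blockIdx.x] = sidx[0];
    }
}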
Are you sure you really need to handle the case where nTotalThreads is not exactly a power of 2?
It makes the code less readable, and I think it can hurt performance too.
Anyway, if you substitute
nTotalThreads = (nTotalThreads >> 1);
with
nTotalThreads = (nTotalThreads +1 ) >> 1;
it should solve one bug concerning this 'issue'.
Francesco
Seconding Jeff's suggestion.
Take a look at the CUDA Thrust library's reduce function. It is demonstrated to reach 95+% efficiency compared with heavily hand-tuned kernels, and it is pretty flexible and easy to use.
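For what it's worth, a minimal sketch of the Thrust route, using thrust::max_element rather than plain reduce since the question also needs the index (names here are illustrative):

#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// d_in holds the 20K input values; max_element returns an iterator
// pointing at the largest one, which yields both value and index.
thrust::device_vector<float> d_in(20000);
// ... fill d_in ...

thrust::device_vector<float>::iterator it =
    thrust::max_element(d_in.begin(), d_in.end());

float max_val   = *it;               // maximum value
int   max_index = it - d_in.begin(); // its index in the array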
Check my kernel: you can write your block results to an array (which can be in global memory) and then read the final result from global memory.
This is how I call it in host code:
sumSeries<<<dim3(blockCount), dim3(threadsPerBlock)>>>(deviceSum, threadsPerBlock * blockCount);