The code does not work. But when I comment atomicAdd in the following code, the code works.
What is the reason for that?
Where can I get histogram code for float array?
__global__ void calculateHistogram(float *devD, int* retHis)
{
int globalLi = getCurrentThread(); //get the thread ID
if(globalLi>=0 && globalLi<Rd*Cd*Dd)
{
int r=0,c=0,d=0;
GetInd2Sub(globalLi, Rd, Cd, r, c, d); //some calculations to get r,c,d
if(r>=stYd && r<edYd && c>=stXd && c<edXd && d>=stZd && d<edZd)
{
//calculate the histogram
int indexInHis = GetBinNo(devD[globalLi]); //get the bin number in the histogram
atomicAdd(&retHis[indexInHis],1); //when I comment this line the code works
}
}
}
Take a look at chapter 9 of CUDA by Example by Jason Sanders and Edward Kandrot. It covers atomics and goes through a simple example computing histograms of 8-bit integers. The first version uses an atomic add for each value, which works but is very slow. The refined version of the example computes a histogram for each block in shared memory, then merges all the histograms together into global memory to get the final result. Your code is like the first version, once you get it working you will want to make it more like the fast refined version.
You can download the examples from the book to see both versions: CUDA by Example downloads
You don't appear to give complete code or error messages, so I can't say exactly what is going wrong in your code. Here are some thoughts:
You need to compile with an architecture that supports atomics (i.e. greater than the default 1.0 architecture target)
The indexing and index limits appear somewhat complicated, I would double-check those
Your bin calculation might be giving bin numbers outside a valid range for retHis, I would add some checks before using the return value, at least for debugging
Related
I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0,32)>;
// Point being, some threads in the warp get removed by i < stop
for(int i = k; i < stop; i+=skip)
{
// Point being, potentially more threads don't see the shuffle below.
if(mem[threadIdx.x + i/2] == foo)
{
// Pre CUDA 9.0.
f = __shfl(f, warpLeader);
}
}
Maybe that's not the best example (real code too complex to post), but the two things accomplished easily with the old intrinsics were:
Shuffle/broadcast to whatever threads happen to be here at this time.
Still get to use the warp-relative thread index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.
(Also, based on the truncated comment conversation in the linked question, it seems unclear if coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA > 9.0?
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question doesn't show any such usage. Furthermore, the canonical blog specifically says that usage is the one that satisfies this:
Shuffle/broadcast to whatever threads happen to be here at this time.
The blog states that this is incorrect usage:
//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
unsigned mask = __activemask();
val = input[threadIdx.x];
for (int offset = 16; offset > 0; offset /= 2)
val += __shfl_down_sync(mask, val, offset);
(which is conceptually similar to the usage given in your linked question.)
But for "opportunistic" usage, as defined in that blog, it actually gives an example of usage in listing 9 that is similar to the one that you state "won't work". It certainly does work following exactly the definition you gave:
Shuffle/broadcast to whatever threads happen to be here at this time.
If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.
Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
Note that steps 1 and 2 are not contradictory to other comments. If you know for certain that you intend the entire warp to participate (not typically known in a "opportunistic" setting) then it is perfectly fine to use a hardcoded full mask.
If you really do intend the opportunistic definition you gave, there is nothing wrong with the usage of __activemask() to supply the mask, and in fact the blog gives a usage example of that, and step 4 also confirms that usage, for that case.
I am trying to parallelize this for loop inside a function using OpenMP, but when I compile the code I still have an error =(
Error 1 error C3010: 'return' : jump out of OpenMP structured block not allowed.
I am using Visual studio 2010 C++ compiler. Can anyone help me? I appreciate any advice.
int match(char* pattern, int patternSize, char* string, int startFrom, unsigned int &comparisons) {
comparisons = 0;
#pragma omp for
for (int i = 0; i < patternSize; i++){
comparisons++;
if (pattern[i] != string[i + startFrom])
return 0;
}
return 1;
}
As #Hristo has already mentioned, you are not allowed to branch out of a parallel region in OpenMP. Among other reasons, this is not allowed because the compiler cannot know a priori how many iterations each thread should work on when it splits up a for loop like the one that you have written among the different threads.
Furthermore, even if you could branch out of your loop, you should be able to see that comparisons would be computed incorrectly. As is, you have an inherently serial algorithm that breaks at the first different character. How could you split up this work such that throwing more threads at this algorithm possibly makes it faster?
Finally, note that there is very little work being done in this loop anyway. You would be very unlikely to see any benefit from OpenMP even if you could rewrite this algorithm into a parallel algorithm. My suggestion: drop OpenMP from this loop and look to implement it somewhere else (either at a higher level - maybe you call this method on different strings? - or in a section of your code that does more work).
I've recently tested the algorithm for reduction using CUDA (the one you can find for example at http://www.cuvilib.com/Reduction.pdf, page 16). But at the end of it, I ran into trouble not using atomicity. So basically I do the sum of each block and store it into shared array. Then I get it back to the global array x (tdx is threadIndex.x, and i is global index).
if(i==0){
*sum = 0.; // Initialize to 0
}
__syncthreads();
if (tdx == 0){
x[blockIdx.x] = s_x[tdx]; //get the shared sums in global memory
}
__syncthreads();
Then I want to sum the first x elements (as many as I have blocks).
When doing with atomicity it works fine (same result as the cpu), however when I use the commented line below it does not work and often yields "nan":
if(i == 0){
for(int k = 0; k < gridDim.x; k++){
atomicAdd(sum, x[k]); //Works good
//sum[0] += x[k]; //or *sum += x[k]; //Does not work, often results in nan
}
}
Now in fact I use atomicadd directly to sum the shared sums, but I would like to understand why this does not work. An atomic add is quite of nonsense when restricting the operation to a single thread. And the simple sum should work fine!
__syncthreads() only synchronizes threads in the same block, not across different blocks and CUDA has no safe synchronization mechanism across blocks.
The incorrect result is due to a synchronization problem. The operands x[k] are the outcomes of the computations from different blocks: x[0] is the result from block 0, x[1] is the result from block 1, etc. Thread 0 could start adding them up before some blocks have really finished their computations.
You should put the second code snippet in a different kernel, so that synchronization is enforced, and the line sum[0] += x[k]; can now work.
As has been pointed out, your problem is due to missing synchronisation after the first pass since you cannot synchronise between blocks. There is a good walkthrough on reduction in the sample codes provided with the toolkit.
Having said that, I would strongly recommend that people don't write reduction kernels (or other primitives such as scan) where such primitives exist in library code. Much better to invest your effort elsewhere and reuse existing optimised code where it is available. This doesn't apply if you're doing this to learn of course!
I recommend you take a look at Thrust and CUB.
I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot understand this sample totally due to following codes:
// Now compute my own personal offset within this. I need to know how many
// threads with a lane ID less than mine are going to write to the same buffer
// as me. We can use popc to implement a single-operation warp scan in this case.
unsigned lane_mask_lt;
asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );
unsigned int my_mask = greater ? gt_mask : lt_mask;
unsigned int my_offset = __popc(my_mask & lane_mask_lt);
which is in the __global__ void qsort_warp function, especially for this assemble language in the codes. Can anyone help me to explain the meaning of this assemble language?
%lanemask_lt is a special, read-only register in PTX assembly which is initialized with a 32-bit mask with bits set in positions less than the thread’s lane number in the warp. The inline PTX you have posted is simply reading the value of that register and storing it in a variable where it can be used in the subsequent C++ code you posted.
Every version of the CUDA toolkit ships with a PTX assembly lanugage reference guide you can use to look up things like this.
Recently I've been doing string comparing jobs on CUDA, and i wonder how can a __global__ function return a value when it finds the exact string that I'm looking for.
I mean, i need the __global__ function which contains a great amount of threads to find a certain string among a big big string-pool simultaneously, and i hope that once the exact string is caught, the __global__ function can stop all the threads and return back to the main function, and tells me "he did it"!
I'm using CUDA C. How can I possibly achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have immediate exit of the kernel as soon as a result is found, it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global___ void kernel(volatile bool *found, ...)
{
while (!(*found) && workLeftToDo()) {
bool iFoundIt = do_some_work(...); // see notes below
if (iFoundIt) *found = true;
}
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global___ void kernel(volatile bool *found, ...)
{
volatile __shared__ bool someoneFoundIt;
// initialize shared status
if (threadIdx.x == 0) someoneFoundIt = *found;
__syncthreads();
while(!someoneFoundIt && workLeftToDo()) {
bool iFoundIt = do_some_work(...);
// if I found it, tell everyone they can exit
if (iFoundIt) { someoneFoundIt = true; *found = true; }
// if someone in another block found it, tell
// everyone in my block they can exit
if (threadIdx.x == 0 && *found) someoneFoundIt = true;
__syncthreads();
}
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The global function doesn't really contain a great amount of threads like you think it does. It is simply a kernel, function that runs on device, that is called by passing paramaters that specify the thread model. The model that CUDA employs is a 2D grid model and then a 3D thread model inside of each block on the grid.
With the type of problem you have it is not really necessary to use anything besides a 1D grid with 1D of threads on in each block because the string pool doesn't really make sense to split into 2D like other problems (e.g. matrix multiplication)
I'll walk through a simple example of say 100 strings in the string pool and you want them all to be checked in a parallelized fashion instead of sequentially.
//main
//Should cudamalloc and cudacopy to device up before this code
dim3 dimGrid(10, 1); // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); //1D Blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(, Height)
//cudaMemCpy answerIdx back to integer on host
//kernel (Not positive on these types as my CUDA is very rusty
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
int idx = blockIdx.x * 10 + threadIdx.x;
//Obviously use whatever function you've been using for string comparison
//I'm just using == for example's sake
if(strings[idx] == stringToMatch)
{
*answerIdx = idx
}
}
This is obviously not the most efficient and is most likely not the exact way to pass paramaters and work with memory w/ CUDA, but I hope it gets the point across of splitting the workload and that the 'global' functions get executed on many different cores so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speed up you will get by just dividing the workload onto the device (in a sensible fashion of course) will already give you tremendous performance improvements. To get a sense of the thread model I highly recommend reading up on the documents on Nvidia's site for CUDA. They will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.