Let's suppose we want to call a global function with the code that follows. Every single thread will have a curandState generator and an array of ints (both properly initialized) that we'll use in order to execute the following code sample:
#define NUMTHREADS 200
int main(){
int * result;
curandState * randState;
if (cudaMalloc(&randState, NUMTHREADS * sizeof(curandState)) == cudaErrorMemoryAllocation ||
cudaMalloc(&result, NUMTHREADS * sizeof(int)) == cudaErrorMemoryAllocation){
cudaDeviceReset();
exit(-1);
}
setup_cuRand <<<1, NUMTHREADS>>> (randState, unsigned(time(NULL)));
method <<<1, NUMTHREADS>>> (state,result);
return 1;
}
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
__global__ void generic method(curandState* state, int * result){
curandState localState = state[threadIdx.x];
int num = curand(&localState) % 100;
if(num > 50)
result[threadIdx.x] = threadIdx.x;
else
result[threadIdx.x] = -1;
}
What would be our execution? I mean, do the threads split into both codes magically and re-join later or how it works? are all 1024 threads in execution at once? this last question is because when i'm debugging on Visual Studio 2013, using Cuda Debugger, when i'm going forward, threadIdx.x allways has a value like n*32 and until now i tought that 1024 threads could be executed at the same time and now i'm doubtfull
The test is likely to be transformed into a predicate that will mean conditional assignment of some value in your region of memory. Should your if be more complex, the threads of a warp would magically join after the second part of the if clause. Depending on predicate for each thread of a warp, a branch might not even get visited.
When entering a breakpoint, the data will be shown for a specific thread/block id. Which thread/block is followed is given by the CUDA Debug Focus setting in NSIGHT for Visual Studio (While debugging with CUDA, enter the NSIGHT menu entry, and select Windows, then CUDA Debug Focus...) By default, thread 0,0,0 will be focused.
Threads are logically executed at the same time, but in practice, you have less than 1024 CUDA-cores per SM. The threads are organized into warps of 32, and warps are scheduled on different execution units by the instruction scheduler.
For 1024 threads, that is 32 warps, the first and last warp are not necessarily executed at the same time precisely.
See Memory Fence function in cuda documentation for more details, as well as Synchronization Functions.
Related
I am working on a program in which there are two main kernels.
Due to the impact on performances, each kernel has its own dimensions. Thus I have 2 different block and grid sizes (whose values cannot be known at compile time).
Both kernels need to use the cuRAND library, so before a third kernel is launched to initialize the cuRAND state on the device.
My question comes when I need to choose the dimensions of this kernel.
Let's say I have for kernel 1 and 2:
block_size_1 = 256
grid_size_1 = 10
block_size_2 = 512
grid_size_2 = 2
For the cuRAND initialization kernel, should I use the largest sizes (10*512), or the highest number of threads (10*256)?
Pick the biggest kernel size, because that is the maximum number of cuRand generators that you'll use. You can easyly evaluate the size you need using something like
__host__ void fun(){
curandState * randState;
int myCurandSize = ((block_size1 * grid_size1) > (block_size2 * grid_size2))? Block_size1 * Grid_size1 : Block_size2 * Grid_size2);
error = cudaMalloc((void **)&randState, myCurandSize * sizeof(curandState));
if (error == cudaErrorMemoryAllocation){
cudaDeviceReset();
return 1;
}
setup_cuRand <<<1, myCurandSize>>> (randState, unsigned(time(NULL)));
//Don't forget to free the space
cudaFree(randState);
}
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
Edit: I was asumming that block_size * grid_size will not go over the maximum thread limit, otherwise, you can do the same but keeping aswell the grid and block dimension and launching that number of threads setup_curand<<<x, y>>>(...);
I have a CUDA card with : Cuda Compute capability (3.5) If i have a call such as <<<2000,512>>> , what are the number of iterations that occur within the kernel? I thought it was (2000*512), but testing isn't proving this? I also want to confirm that the way I'm calculating the the variable is correct.
The situation is, within the kernel I am incrementing a passed global memory number based on the thread number :
int thr = blockDim.x * blockIdx.x + threadIdx.x;
worknumber = globalnumber + thr;
So, when I return back to the CPU, I want to know exactly how many increments there were so I can keep track so I don't repeat or skip numbers when I recall the kernel GPU to process my next set of numbers.
Edit :
__global__ void allin(uint64_t *lkey, const unsigned char *d_patfile)
{
uint64_t kkey;
int tmp;
int thr = blockDim.x * blockIdx.x + threadIdx.x;
kkey = *lkey + thr;
if (thr > tmp) {
tmp = thr;
printf("%u \n", thr);
}
}
If you launch a kernel with the configuration <<<X,Y>>>, and you have not violated any rules of CUDA usage, then the number of threads launched will, in fact, be X*Y (or a suitable modification of that if we are talking about 2 or 3 dimensional threadblocks and/or grids, i.e. X.x*X.y*X.z*Y.x*Y.y*Y.z ).
printf from a CUDA kernel has various limitations. Therefore, generating a large amount of printf output from a CUDA kernel is generally unwise and probably not useful for validating the number of threads launched in a large grid.
If you want to keep track of the number of threads that actually get launched, you could use a global variable and have each thread atomically update it. Something like this:
$ cat t848.cu
#include <stdio.h>
__device__ unsigned long long totThr = 0;
__global__ void mykernel(){
atomicAdd(&totThr, 1);
}
int main(){
mykernel<<<2000,512>>>();
unsigned long long total;
cudaMemcpyFromSymbol(&total, totThr, sizeof(unsigned long long));
printf("Total threads counted: %lu\n", total);
}
$ nvcc -o t848 t848.cu
$ cuda-memcheck ./t848
========= CUDA-MEMCHECK
Total threads counted: 1024000
========= ERROR SUMMARY: 0 errors
$
Note that atomic operations may be relatively slow. I wouldn't recommend making regular use of such a code for performance reasons. But if you want to convince yourself of the number of threads launched, it should give the correct answer.
In my Kernel, the threads are processing a small part of an array in global memory.
After processing I would also like to set a flag indicating that the result of the calculation is zero for all threads within a block:
__global__ void kernel( int *a, bool *blockIsNull) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int result = 0;
// {...} Here calculate result
a[tid] = result;
// some code here, but I don't know, that's my question...
if (condition)
blockIsNull[blockIdx.x] = true; // if all threads have returned result==0
}
Each individual thread owns the information. But I don't find an efficient way to collect it.
For example, I could have a counter in shared memory that is atomically incremented by each thread when result==0. So when the counter reaches blockDim.x it means that all threads have returned zero. Althought not tested, I am afraid that this solution will have a negative impact on performance (atomic functions are slow).
A zero result does not occur very often, so it is very unlikely to have zeros for all threads within a block. I would like to find a solution that has little impact on the performance in the general case.
What would be your recommendation ?
It sounds like you want to perform a block level reduction of the condition value across a block. Just about all CUDA hardware supports a set of very useful warp voting primitives. You could use the __all() warp vote to determine that each warp of threads satisfied the condition, and then use __all() again to check whether all warps satisfy the condition. In code, it might look like this:
__global__ void kernel( int *a, bool *blockIsNull) {
// assume that threads per block is <= 1024
__shared__ volatile int blockcondition[32];
int laneid = threadIdx.x % 32;
int warpid = threadIdx.x / 32;
// Set each condition value to non zero to begin
if (warpid == 0) {
blockcondition[threadIdx.x] = 1;
}
__syncthreads();
//
// your code goes here
//
// warpcondition holds the vote from each warp
int warpcondition = __all(condition);
// First thread in each warp loads the warp vote to shared memory
if (laneid == 0) {
blockcondition[warpid] = warpcondition;
}
__syncthreads();
// First warp reduces all the votes in shared memory
if (warpid == 0) {
int result = __all(blockcondition[threadIdx.x] != 0);
// first thread stores the block result to global memory
if (laneid == 0) {
blockIsNull[blockIdx.x] = (result !=0);
}
}
}
[ Huge disclaimer: written in browser, never compiled or tested, use at own risk ]
This code should (I think) work for any number of threads per block up to 1024. You could, if required, adjust the size of blockcondition to a smaller value if you were confident of an upper block size limit less than 1024. Probably the smartest way would be to use C++ templating and make the warp count limit a template parameter.
I understand the purpose of __syncthreads(), but I sometimes find it overused in some codes.
For instance, in the code below taken from NVIDIA notes, each thread calculates mainly s_data[tx]-s_data[tx-1]. Each thread needs the data it reads from the global memory and the data read by its neighboring thread. Both threads will be in the same warp and hence should complete retrieval of their data from the global memory and are scheduled for execution simultaneously.
I believe the code will still work without __syncthread(), but obviously the NVIDIA notes say otherwise. Any comment, please?
// Example â shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
// shorthand for threadIdx.x
int tx = threadIdx.x;
// allocate a __shared__ array, one element per thread
__shared__ int s_data[BLOCK_SIZE];
// each thread reads one element to s_data
unsigned int i = blockDim.x * blockIdx.x + tx;
s_data[tx] = input[i];
// avoid race condition: ensure all loads
// complete before continuing
__syncthreads();
if(tx > 0)
result[i] = s_data[tx] â s_data[txâ1];
else if(i > 0)
{
// handle thread block boundary
result[i] = s_data[tx] â input[i-1];
}
}
It would be nice if you included a link to where, in the "Nvidia notes", this appeared.
both threads will be in the same warp
No, they won't, at least not in all cases. What happens when tx = 32? Then the thread corresponding to tx belongs to warp 1 in the block, and the thread corresponding to tx-1 belongs to warp 0 in the block.
There's no guarantee that warp 0 has executed before warp 1, so the code could fail without the call to __synchtreads() (since, without it, the value of s_data[tx-1] could be invalid, since warp 0 hasn't run and therefore hasn't loaded it yet.)
A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".