Generating Uniform Double random numbers on device in CUDA

I would like to generate uniform random numbers on the device, to be used inside of a device function. Each thread should generate a different uniform random number. I have this code, but I get a segmentation fault.
int main(){
curandStateMtgp32 *devMTGPStates;
mtgp32_kernel_params *devKernelParams;
cudaMalloc((void **)&devMTGPStates, NUM_THREADS*NUM_BLOCKS * sizeof(curandStateMtgp32));
cudaMalloc((void**)&devKernelParams,sizeof(mtgp32_kernel_params));
curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
curandMakeMTGP32KernelState(devMTGPStates,
mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);
doHenry <<<NUM_BLOCKS,NUM_THREADS>>> (devMTGPStates);
}
Inside my global function doHenry, evaluated on the device, I put:
double rand1 = curand_uniform_double(&state[threadIdx.x+NUM_THREADS*blockIdx.x]);
Is this the best way to generate a random number per thread? I don't understand what the devKernelParams is doing, but I know I need one state per thread, right?

I think you're getting the seg fault on this line:
curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);
I believe the seg fault occurs because the n parameter, for which you are passing NUM_BLOCKS*NUM_THREADS, exceeds 200. I tried a version of your code, and I was able to reproduce the seg fault at around n=540.
The MT generator limits the number of states it can set up when using the pre-generated kernel parameters (mtgp32dc_params_fast_11213). You may wish to read the relevant section of the documentation (Bit Generation with the MTGP32 generator).
I'm not really an expert on CURAND, but other generators (such as XORWOW) don't have this type of limitation, so if you want to generate a large amount of independent thread state easily, consider one of the other generators. With the particular approach you have outlined, the MTGP32 generator seems to be limited to about 200*256 threads (200 states, each serving a block of up to 256 threads). Contrary to what I said in the comments (which is true for other generator types), one MTGP32 state seems to be sufficient for a block of up to 256 threads. The example given in the documentation (refer to the second example) uses that type of state generation and threadblock hierarchy.
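As a rough illustration of the limits described above, here is a host-side sketch that checks a launch configuration before any MTGP32 state is created. The two constants come from this answer (200 pre-generated parameter sets, up to 256 threads per state); the helper names are hypothetical and nothing here calls into CURAND.

```cpp
// Limits as described in the answer above (taken from the text, not queried
// from CURAND): mtgp32dc_params_fast_11213 covers at most 200 states, and one
// state can serve a block of up to 256 threads.
constexpr int kMaxMtgp32States = 200;
constexpr int kMaxThreadsPerState = 256;

// Hypothetical helper: number of MTGP32 states a launch needs (one per block).
constexpr int mtgp32_states_needed(int num_blocks) { return num_blocks; }

// Hypothetical helper: true if the launch stays within the limits above.
constexpr bool mtgp32_launch_fits(int num_blocks, int threads_per_block) {
    return num_blocks >= 1 && threads_per_block >= 1 &&
           mtgp32_states_needed(num_blocks) <= kMaxMtgp32States &&
           threads_per_block <= kMaxThreadsPerState;
}
```

Under these assumptions, mtgp32_launch_fits(64, 256) holds, while asking for 540 states (the n at which the seg fault was reproduced) does not.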

Related

In CUDA, how can I get this warp's thread mask in conditionally executed code (in order to execute e.g., __shfl_sync or <cg>.shfl)?

I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0,32)>;
// Point being, some threads in the warp get removed by i < stop
for(int i = k; i < stop; i+=skip)
{
    // Point being, potentially more threads don't see the shuffle below.
    if(mem[threadIdx.x + i/2] == foo)
    {
        // Pre CUDA 9.0.
        f = __shfl(f, warpLeader);
    }
}
Maybe that's not the best example (real code too complex to post), but the two things accomplished easily with the old intrinsics were:
Shuffle/broadcast to whatever threads happen to be here at this time.
Still get to use the warp-relative thread index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.
(Also, based on the truncated comment conversation in the linked question, it seems unclear if coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA > 9.0?
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question doesn't show any such usage. Furthermore, the canonical blog specifically says that usage is the one that satisfies this:
Shuffle/broadcast to whatever threads happen to be here at this time.
The blog states that this is incorrect usage:
//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
    unsigned mask = __activemask();
    val = input[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
}
(which is conceptually similar to the usage given in your linked question.)
But for "opportunistic" usage, as defined in that blog, it actually gives an example of usage in listing 9 that is similar to the one that you state "won't work". It certainly does work following exactly the definition you gave:
Shuffle/broadcast to whatever threads happen to be here at this time.
If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:
1. Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
2. Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
3. Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
4. If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.
5. Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
Note that steps 1 and 2 are not contradictory to other comments. If you know for certain that you intend the entire warp to participate (not typically known in an "opportunistic" setting), then it is perfectly fine to use a hardcoded full mask.
If you really do intend the opportunistic definition you gave, there is nothing wrong with the usage of __activemask() to supply the mask, and in fact the blog gives a usage example of that, and step 4 also confirms that usage, for that case.
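On the question's worry about locating warpLeader once some lanes have dropped out, here is a host-side sketch of the bit arithmetic involved. It assumes that a coalesced group ranks its members in increasing lane order (my understanding, not something stated in this thread), and the helper names are hypothetical. As far as I know, __shfl_sync itself still takes a warp-relative source lane, so with a plain mask the thing to check is that the leader's bit is set.

```cpp
#include <cstdint>

// Count set bits (portable host stand-in for the device intrinsic __popc).
constexpr int popcount32(std::uint32_t v) {
    int n = 0;
    while (v != 0) { v &= v - 1; ++n; }
    return n;
}

// True if warp-relative `lane` (0..31) is a member of `mask`.
constexpr bool lane_in_mask(std::uint32_t mask, int lane) {
    return ((mask >> lane) & 1u) != 0;
}

// Rank of `lane` among the members of `mask`: the number of member lanes
// with a lower lane index (assumed ordering, see the note above).
constexpr int rank_in_mask(std::uint32_t mask, int lane) {
    return popcount32(mask & ((std::uint32_t{1} << lane) - 1u));
}
```

For a mask with lanes 2, 4, 5 and 7 set (0xB4), lane 5 has two members below it, so its rank under this assumption is 2; lane 3 is not a member at all.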

Cuda _sync functions, how to handle unknown thread mask?

This question is about adapting to the change in semantics from lock step to independent program counters. Essentially, what can I change calls like int __all(int predicate); into for Volta?
For example, int __all_sync(unsigned mask, int predicate);
with semantics:
Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
The docs assume that the caller knows which threads are active and can therefore populate mask accurately.
a mask must be passed that specifies the threads participating in the call
I don't know which threads are active. This is in a function that is inlined into various places in user code. That makes one of the following attractive:
__all_sync(UINT32_MAX, predicate);
__all_sync(__activemask(), predicate);
The first is analogous to a case declared illegal at https://forums.developer.nvidia.com/t/what-does-mask-mean-in-warp-shuffle-functions-shfl-sync/67697, quoting from there:
For example, this is illegal (will result in undefined behavior for warp 0):
if (threadIdx.x > 3) __shfl_down_sync(0xFFFFFFFF, v, offset, 8);
The second choice, this time quoting from __activemask() vs __ballot_sync():
The __activemask() operation has no such reconvergence behavior. It simply reports the threads that are currently converged. If some threads are diverged, for whatever reason, they will not be reported in the return value.
The operating semantics appear to be:
There is a warp of N threads
M (M <= N) threads are enabled by compile time control flow
D (D subset of M) threads are converged, as a runtime property
__activemask returns which threads happen to be converged
That suggests synchronising threads then using activemask,
__syncwarp();
__all_sync(__activemask(), predicate);
An NVIDIA blog post says that is also undefined: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
Calling the new __syncwarp() primitive at line 10 before __ballot(), as illustrated in Listing 11, does not fix the problem either. This is again implicit warp-synchronous programming. It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model.
That marks the end of my ideas. That same blog concludes with some guidance on choosing a value for mask:
1. Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
2. Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
3. Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
However, I can't compute what the mask should be. It depends on the control flow at the call site that the code containing __all_sync was inlined into, which I don't know. I don't want to change every function to take an unsigned mask parameter.
How do I retrieve semantically correct behaviour without that global transform?
TL;DR: The correct programming approach will most likely be to do the thing you stated you don't want to do.
Longer:
This blog specifically suggests an opportunistic method for handling an unknown thread mask: precede the desired operation with __activemask() and use that for the desired operation. To wit (excerpting verbatim from the blog):
int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
That should be perfectly legal.
You might ask "what about item 2 mentioned at the end of the blog?" I think if you read that carefully, taking into account the usage I just excerpted, it's suggesting "don't just use __activemask()" if you intend something different. That reading seems evident from the full text there. That doesn't abrogate the legality of the previous construct.
You might ask "what about incidental or enforced divergence along the way?" (i.e. during the processing of my function, which is called from elsewhere)
I think you have only 2 options:
1. Grab the value of __activemask() at entry to the function. Use it later when you call the sync operation you desire. That is your best guess as to the intent of the calling environment. CUDA doesn't guarantee that this will be correct; however, this should certainly be legal if you don't have enforced divergence at the point of your sync function call.
2. Make the intent of the calling environment clear: add a mask parameter to your function and rewrite the code everywhere (which you've stated you don't want to do).
There is no way to deduce the intent of the calling environment from within your function, if you permit the possibility of warp divergence prior to entry to your function, which obscures the calling environment intent. To be clear, CUDA with the Volta execution model permits the possibility of warp divergence at any time. Therefore, the correct approach is to rewrite the code to make the intent at the call site explicit, rather than trying to deduce it from within the called function.
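To make the "best guess" caveat concrete, here is a small host-side model of the __all_sync semantics quoted in the question (ignoring exited threads). It is only a bit-level sketch with a hypothetical function name, not device code: it shows that the vote's result depends on which lanes are counted as members, so a mask taken from whatever happens to be converged can give a different answer than the mask the caller had in mind.

```cpp
#include <cstdint>

// Host-side model of the quoted semantics: the vote is non-zero only if the
// predicate is non-zero for every lane in `mask`.
// `pred_bits` has bit i set when lane i's predicate is non-zero.
constexpr bool all_sync_model(std::uint32_t mask, std::uint32_t pred_bits) {
    return (pred_bits & mask) == mask;
}

// Illustrative scenario (assumed values): the caller intends lanes 0..3 to
// vote, lane 3's predicate is zero, and lane 3 happens to have diverged.
constexpr std::uint32_t kIntendedMask  = 0xFu;  // lanes 0..3
constexpr std::uint32_t kConvergedMask = 0x7u;  // lanes 0..2 only
constexpr std::uint32_t kPredBits      = 0x7u;  // lane 3 votes "false"
```

With these values the intended vote fails, because lane 3 says no, while a vote over the converged lanes alone succeeds. That difference is exactly the information the called function cannot recover on its own.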

In CUDA PTX, what does %warpid mean, really?

In CUDA PTX, there's a special register which holds a thread's warp's index: %warpid. Now, the spec says:
Note that %warpid is volatile and returns the location of a thread
at the moment when read, but its value may change during execution,
e.g., due to rescheduling of threads following preemption.
Umm, what location is that? Shouldn't it be the location within the block, e.g. for a 1-dimensional grid %tid.x / warpSize? Is it some slot-for-a-warp within the SM (e.g. warp scheduler or some internal queue)? I'm confused.
Motivation: I wanted to spare myself the trouble of calculating %tid.x / warpSize as well as free up a register, by using this special register. However, in retrospect this is a false motivation, because reading a special register is expensive; see: What's the most efficient way to calculate the warp id / lane id in a 1-D grid?
You need to read the next 25 words of the documentation, which directly follow the quotation you posted in your question:
For this reason, %ctaid and %tid should be used to compute a virtual
warp index if such a value is needed in kernel code;
and then
%warpid is intended mainly to enable profiling and diagnostic code to
sample and log information such as work place mapping and load
distribution.
So no, you can't use it for what you want. %warpid is effectively a scheduler slot ID rather than a constant, unique warp index within a block.
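The "virtual warp index" the documentation recommends can be sketched with plain integer arithmetic. The version below is host-side C++ with hypothetical helper names; it assumes a warp size of 32 and CUDA's usual x-fastest linearisation of the thread index within a block. In a kernel the inputs would come from threadIdx and blockDim.

```cpp
// Assumption: warp size of 32, as on current NVIDIA GPUs.
constexpr unsigned kWarpSize = 32;

// Linear thread index within a block (x varies fastest, then y, then z).
constexpr unsigned linear_tid(unsigned tx, unsigned ty, unsigned tz,
                              unsigned dim_x, unsigned dim_y) {
    return tx + ty * dim_x + tz * dim_x * dim_y;
}

// Virtual warp index within the block: stable for the life of the thread,
// unlike %warpid.
constexpr unsigned virtual_warp_index(unsigned linear) { return linear / kWarpSize; }

// Lane index within that warp.
constexpr unsigned lane_index(unsigned linear) { return linear % kWarpSize; }
```

For a 1-dimensional block this reduces to the %tid.x / warpSize expression from the question.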

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered
memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the data is observed being
written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the read instructions appear in
the program for instructions that are independent of each other
However, do we have a guarantee that the (dependent) memory operations as seen from the single thread are actually consistent? If I do - say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen --- for example --- that store instruction is non-blocking and the load instruction that follows somehow is managed by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which the threads are executed; every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
The point is that if you have several threads doing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
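__device__ void sum_all(float *x, float *result, int N){
    // Each thread writes its own index into x ...
    x[threadIdx.x] = threadIdx.x;
    result[threadIdx.x] = 0;
    // ... then sums whatever x currently holds; x[i] may or may not have been written by thread i yet.
    for(int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];
}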
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers from m ... n (read: from one number to another number), and then sum up the numbers already put into the array and store the result in another array.
Given that your lowest indexed thread is enumerated thread #0, you would expect that the first time this code runs, x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You don't even know whether the arrays are changed by some other thread while you are executing the code.
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically, what you are asking is whether running your code on a GPU will change the functionality of your code.
The cores of a GPU are generally the same as those of a CPU, just with less control logic, a smaller instruction set and, typically, much lower double-precision than single-precision throughput.
In a CUDA GPU (before Volta) there is one program counter for each warp (a group of 32 threads executing in lockstep). Like a CPU, the program counter advances by one instruction after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler.
So in short: within a single thread, your code will always behave as if it were executed in program order, no matter if it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understand, you're basically asking whether memory dependencies and alias analysis information are respected by the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes: as Robert noted, the CUDA compiler uses LLVM under the hood, and two of its basic modules (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even out of the block scope) to avoid dangerous optimizations (e.g. you can't write to a live variable before its next read, since the data may still be useful).
I don't know the compiler internals, but assuming (as with any other reasonably trusted compiler) that it does its best to be bug-free, the analyses that take place in there should not bother you at all, and should assure you that, at least in theory, what you presented as an example (i.e. the dependent load completing before the store) cannot happen.
What guarantees that? Nothing but the fact that the company is giving you a compiler to use, and there are disclaimers in case it doesn't in exceptional cases :)
Also: aside from the compiler topic, instruction execution also depends on the hardware specification, in this case a SIMT hardware instruction issuing unit;
cf. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information

CURAND properties of generators

CURAND comes with an array of random number generators, but I have failed to find any comparison of the performance (and randomness) properties of each of them; mostly, I'd be interested in which generator to use for which application to gain maximum performance. I'd be happy if someone could quickly outline the differences between them or link me a resource that does so.
Thanks in advance.
This picture shows the performance for different RNGs.
Randomness should depend only on the RNG type/algorithm, so you can refer to the Intel MKL documentation, which contains detailed information and research papers. The type names in CURAND and MKL are very similar.
http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-3D7D2650-A414-4C95-AF33-BE291BAB2AC3.htm
The first difference is efficiency. XORWOW is the default generator, but it isn't always the most efficient. For instance, Philox is faster for generating normally distributed floats.
Another difference is that, in practice, you can generate more than one float per call with some generators.
For example, with Philox you can generate up to 4 normally or uniformly distributed floats with each call, while with XORWOW you can generate at most two normally or uniformly distributed floats.
__device__ float4
curand_normal4 (curandStatePhilox4_32_10_t *state)
The next difference is the period of the pseudorandom sequence (the total state space of the PRNG before you start to see repeats). XORWOW has a period of about 2^190 (with the state set up after 2^67 for the same seed)*. For Philox, subsequence and offset together define the offset in a sequence with period 2^128.
Note that if you run millions of threads with the same seed, you could run out of state space per thread and start seeing repeats: ((2^190) / (10^6)) / (2^67) = 1.0633824 × 10^31.
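The arithmetic in that estimate can be checked with a few lines of host-side C++. The exponents (190 and 67) and the thread count (10^6) are the figures quoted above; the function names are hypothetical and the code only reproduces the quotient, it does not query CURAND.

```cpp
// 2^e as a double (exact for the exponents used here).
constexpr double pow2(int e) {
    double r = 1.0;
    for (int i = 0; i < e; ++i) r *= 2.0;
    return r;
}

// Figures quoted in the answer: period of about 2^190, 2^67 skip per seed.
constexpr int kPeriodLog2 = 190;
constexpr int kSkipLog2 = 67;

// The answer's estimate: (2^190 / threads) / 2^67.
constexpr double estimate(double threads) {
    return pow2(kPeriodLog2) / threads / pow2(kSkipLog2);
}
```

For 10^6 threads this evaluates to roughly 1.06 × 10^31, matching the figure given in the answer.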
One more difference is the size of the states. For XORWOW, sizeof(curandState_t) is 48 bytes, and sizeof(curandStatePhilox4_32_10_t) is 64 bytes.
When you run millions of threads (each thread has its own curand state), you can run out of device memory: 1024^2*64 bytes = 64 megabytes per million threads.
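The memory estimate is easy to redo for other thread counts. The sketch below hard-codes the two state sizes quoted in this answer (48 and 64 bytes) rather than taking sizeof on the real CURAND types, so treat them as assumptions; the helper name is hypothetical.

```cpp
#include <cstddef>

// State sizes as quoted in the answer (not measured here).
constexpr std::size_t kXorwowStateBytes = 48;  // curandState_t
constexpr std::size_t kPhiloxStateBytes = 64;  // curandStatePhilox4_32_10_t

constexpr std::size_t kMiB = 1024 * 1024;

// Device memory needed when every thread keeps its own generator state.
constexpr std::size_t state_bytes(std::size_t num_threads, std::size_t bytes_per_state) {
    return num_threads * bytes_per_state;
}
```

With 1024^2 threads this gives 64 MiB for Philox and 48 MiB for XORWOW, before counting any other per-thread data.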
XORWOW, Philox, MRG32k3a and MTGP32 are pseudo-random generators, while the Sobol generators are quasi-random generators.
*When calling curand_init with a seed, it scrambles that seed and then skips ahead 2^67 numbers (this is kind of expensive but has some nice properties)
sources:
https://developer.nvidia.com/cuRAND
http://cs.brown.edu/courses/cs195v/lecture/week11.pdf