CURAND properties of generators

CURAND comes with an array of random number generators, but I have failed to find any comparison of the performance (and randomness) properties of each of them; mostly, I'd be interested in which generator to use for which application to gain maximum performance. I'd be happy if someone could quickly outline the differences between them or link me a resource that does so.
Thanks in advance.

[Figure: a throughput comparison of the different CURAND generators accompanied this answer; the image is not reproduced here.]
For randomness, the statistical quality depends only on the RNG type/algorithm, so you can refer to the Intel MKL documentation, which contains detailed information and links to the underlying research papers. The type names in CURAND and MKL are very similar.
http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-3D7D2650-A414-4C95-AF33-BE291BAB2AC3.htm

The first difference is efficiency. XORWOW is the default generator, but it isn't always the most efficient; for instance, Philox is faster at generating normally distributed floats.
Another difference is that, with some generators, in practice you can generate more than one float per call.
For example, with Philox you can generate up to four normally or uniformly distributed floats per call, while with XORWOW you can generate at most two:
__device__ float4 curand_normal4(curandStatePhilox4_32_10_t *state)
The next difference is the period of the pseudorandom sequence (the total state space of the PRNG before you start to see repeats). XORWOW has a period of about 2^190 (with the state skipped ahead by 2^67 for a given seed)*. For Philox, subsequence and offset together define the offset within a sequence of period 2^128.
Note that if you run millions of threads with the same seed, you could run out of state space per thread and start seeing repeats: ((2^190) / (10^6)) / (2^67) ≈ 1.0633824 × 10^31 draws per thread before that happens.
One more difference is the size of the states. For XORWOW, sizeof(curandState_t) is 48 bytes, while sizeof(curandStatePhilox4_32_10_t) is 64 bytes.
When you run millions of threads (each thread has its own curand state) you can run out of device memory: 1024^2 * 64 bytes ≈ 64 megabytes per million threads.
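For illustration, here is a minimal sketch of the per-thread Philox setup and the 4-wide generation call; the kernel names and launch configuration are hypothetical, not from the original question:

#include <curand_kernel.h>

__global__ void setup(curandStatePhilox4_32_10_t *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]); // same seed, distinct subsequence per thread
}

__global__ void generate(curandStatePhilox4_32_10_t *states, float *out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandStatePhilox4_32_10_t local = states[id]; // work on a register copy
    float4 v = curand_normal4(&local);             // four normal floats in one call
    out[4 * id + 0] = v.x;
    out[4 * id + 1] = v.y;
    out[4 * id + 2] = v.z;
    out[4 * id + 3] = v.w;
    states[id] = local;                            // persist state for the next launch
}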
XORWOW, Philox, MRG32k3a and MTGP32 are pseudorandom generators, while both Sobol variants are quasi-random generators.
*When calling curand_init with a seed, it scrambles that seed and then skips ahead 2^67 numbers (this is somewhat expensive but has some nice properties).
sources:
https://developer.nvidia.com/cuRAND
http://cs.brown.edu/courses/cs195v/lecture/week11.pdf

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array();
    T value = calculate_value();
    write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is easy-peasy: just place all of thread i's data at shared memory addresses i, 32+i, 64+i, 96+i etc. This puts all of thread i's data in the same bank, distinct from the other lanes' banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler microarchitectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides at most 32 bits to each lane).
This is achievable in practice with the placement pattern analogous to the one OP described for 32-bit data: for a T* arr, have lane i read the idx'th element as arr[idx * 32 + i]. This will compile so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes of the warp's 256-byte span (utilizing all banks)
The higher 16 lanes obtain their data from the successive 32*4 bytes (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately; it can do better than the simplistic "break up T into halves" idea proposed in the other answer below.
(This answer is based on @RobertCrovella's comments.)
On Kepler GPUs, this had a simple solution: just change the bank size! Kepler supported setting the shared memory bank size to 8 bytes instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
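To make the layout concrete, here is a minimal sketch of the interleaved 64-bit placement; N, the kernel name, the one-warp-per-block launch, and the dummy values are assumptions for illustration only:

#include <cstdint>

constexpr int N = 4; // elements per lane (hypothetical)

__global__ void demo(uint64_t *out) // assumes one warp per block
{
    __shared__ uint64_t arr[N * 32];
    int lane = threadIdx.x % 32;

    // write phase: element idx of lane i goes to arr[idx * 32 + i]
    for (int idx = 0; idx < N; idx++)
        arr[idx * 32 + lane] = (uint64_t)lane * idx;

    // read phase: a full-warp read of one idx spans 256 contiguous bytes,
    // which compiles to two conflict-free transactions (lower/upper 16 lanes)
    uint64_t acc = 0;
    for (int idx = 0; idx < N; idx++)
        acc += arr[idx * 32 + lane];
    out[threadIdx.x] = acc;
}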
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit value, two half-T accesses are made, and the T is recomposed with something like:
uint64_t joined =
    (uint64_t{ reinterpret_cast<uint32_t&>(upper_half) } << 32) |
     uint64_t{ reinterpret_cast<uint32_t&>(lower_half) };
auto& my_t_value = reinterpret_cast<T&>(joined);
and the same in reverse when writing.
As the comments suggest, it is better to make a single 64-bit access, as described in the other answer.

Accelerate vDSP FFT resulting in NaN under demanding scenario

I'm using the vDSP framework for a real-time audio application based on FFT computation.
After having lots of problems trying to figure out why the algorithm was producing incorrect results, I found the following comment in the official vDSP FFT sample code (DemonstrateFFT.c, lines 242, 416, 548):
/* Zero the signal before timing because repeated FFTs on non-zero
data can cause abnormalities such as infinities, NaNs, and
subnormal numbers.
*/
In order to reproduce the error, just comment out line 247 (so the signal is not zeroed) and add something similar to the following at line 273 (just after the vDSP_fft_zrip call):
if (isnan(Observed.realp[0])) printf("Iteration %lu: NaN\n",i); // it would work with any of the components of Observed
It is interesting to observe that reducing N (i.e. increasing the number of FFTs per unit time) makes the zrip algorithm fail sooner, which kind of makes sense since the comment warns about performing repeated FFTs.
The behavior is also observed with the vDSP_fft_zrop algorithm.
I'm really wondering what the point is of performing FFTs on "zero data" as advised in the comment. Either I'm missing something important, or the vDSP framework is simply not suited for real-time audio processing.
Normal 16- and 24-bit "real time" audio samples will not see this issue.
But benchmarks can create bigger and smaller numbers that can exceed the range of double-precision floats when iterated enough times, and this applies to many functions, not just FFTs. Try iterating exp() fed back to itself; that will blow up even faster. It's a problem one encounters using any finite-precision computer arithmetic (not just the ARM and x86 CPUs that vDSP uses).
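To see how quickly iterated feedback can leave the representable range, here is a toy host-side loop of my own (not from the vDSP sample code):

#include <cmath>
#include <cstdio>

int main()
{
    double x = 1.0;
    for (int i = 0; i < 10; i++) {
        x = std::exp(x); // feed the result back into itself
        std::printf("iteration %d: x = %g\n", i, x);
        if (std::isinf(x) || std::isnan(x)) break; // leaves double range after ~4 iterations
    }
    return 0;
}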

In CUDA PTX, what does %warpid mean, really?

In CUDA PTX, there's a special register which holds a thread's warp's index: %warpid. Now, the spec says:
Note that %warpid is volatile and returns the location of a thread
at the moment when read, but its value may change during execution,
e.g., due to rescheduling of threads following preemption.
Umm, what location is that? Shouldn't it be the location within the block, e.g. for a 1-dimensional grid %tid.x / warpSize? Is it some slot-for-a-warp within the SM (e.g. warp scheduler or some internal queue)? I'm confused.
Motivation: I wanted to spare myself the trouble of calculating %tid.x / warpSize as well as free up a register, by using this special register. However, in retrospect this is a false motivation, because reading a special register is expensive; see: What's the most efficient way to calculate the warp id / lane id in a 1-D grid?
You need to read the next 25 words of the documentation, which directly follow the quotation you posted in your question:
For this reason, %ctaid and %tid should be used to compute a virtual
warp index if such a value is needed in kernel code;
and then
%warpid is intended mainly to enable profiling and diagnostic code to
sample and log information such as work place mapping and load
distribution.
So no, you can't use it for what you want. %warpid is effectively a scheduler slot ID rather than a constant, unique warp index within a block.
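A minimal sketch of the "virtual warp index" the documentation recommends, computed from the thread/block indices; the 1-D-grid formula and the output array are assumptions for illustration:

__global__ void virtual_warp_index(unsigned *out)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned warp_in_block = threadIdx.x / warpSize; // stable for the thread's lifetime, unlike %warpid
    unsigned global_warp   = tid / warpSize;         // grid-wide variant (1-D grid assumed)
    out[tid] = global_warp;
    (void)warp_in_block; // silence unused-variable warning in this sketch
}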

Generating Uniform Double random numbers on device in CUDA

I would like to generate uniform random numbers on the device, to be used inside of a device function. Each thread should generate a different uniform random number. I have this code, but I get a segmentation fault.
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>

int main() {
    curandStateMtgp32 *devMTGPStates;
    mtgp32_kernel_params *devKernelParams;
    cudaMalloc((void **)&devMTGPStates, NUM_THREADS * NUM_BLOCKS * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devKernelParams, sizeof(mtgp32_kernel_params));
    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);
    curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213,
                                devKernelParams, NUM_BLOCKS * NUM_THREADS, 1234);
    doHenry<<<NUM_BLOCKS, NUM_THREADS>>>(devMTGPStates);
}
Inside my global function doHenry, evaluated on the device, I put:
double rand1 = curand_uniform_double(&state[threadIdx.x+NUM_THREADS*blockIdx.x]);
Is this the best way to generate a random number per thread? I don't understand what devKernelParams does, but I know I need one state per thread, right?
I think you're getting the seg fault on this line:
curandMakeMTGP32KernelState(devMTGPStates, mtgp32dc_params_fast_11213, devKernelParams,NUM_BLOCKS*NUM_THREADS, 1234);
I believe the reason for the seg fault is that you have exceeded 200 for the n parameter, to which you are passing NUM_BLOCKS*NUM_THREADS. I tried a version of your code and was able to reproduce the seg fault at around n=540.
The MTGP32 generator has a limitation on the number of states it can set up when using pre-generated kernel parameters (mtgp32dc_params_fast_11213). You may wish to read the relevant section of the documentation ("Bit Generation with the MTGP32 generator").
I'm not really an expert on CURAND, but other generators (such as XORWOW) don't have this type of limitation, so if you want to set up a large number of independent thread states easily, consider one of the other generators. With the approach you have outlined, the MTGP32 generator seems to be limited to about 200 blocks * 256 threads of independent generation. Contrary to what I said in the comments (which is true for other generator types), a single MTGP32 state appears to be sufficient for a whole block of up to 256 threads. The example given in the documentation (refer to the second example) uses that kind of state setup and threadblock hierarchy.
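As an illustration of the XORWOW route, here is a minimal sketch that sidesteps the MTGP32 limit by giving each thread its own state; the block/thread counts, setup kernel, and the reuse of the doHenry name are assumptions for illustration:

#include <curand_kernel.h>

#define NUM_BLOCKS  64
#define NUM_THREADS 256

__global__ void setup(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]); // one independent state per thread
}

__global__ void doHenry(curandState *states)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    double rand1 = curand_uniform_double(&states[id]); // one uniform double per thread
    // ... use rand1 ...
}

int main()
{
    curandState *devStates;
    cudaMalloc((void **)&devStates, NUM_BLOCKS * NUM_THREADS * sizeof(curandState));
    setup<<<NUM_BLOCKS, NUM_THREADS>>>(devStates, 1234ULL);
    doHenry<<<NUM_BLOCKS, NUM_THREADS>>>(devStates);
    cudaDeviceSynchronize();
}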

How to convert a sparse histogram into dense histogram in CUDA?

I am implementing an algorithm using raw CUDA kernels, in which every threadblock needs the dense histogram of the data available to that threadblock. Now the question is: do I have to calculate the dense histogram from scratch? (Is it worth calculating the dense histogram at all, given that I already have the sparse histogram, implemented in shared memory?)
I have come up with this idea for the conversion; I will try to elaborate with an example (temp and hist are both in shared memory):
0,1,2,3,4,5,6... //array indexes
4,3,0,2,1,0,5... //contents of hist[]
0,0,2,0,0,5,0... //contents of temp[]: if(hist[x]==0) temp[x]=x;
for_every_element //this is the sequential part :(
    if(temp[x]>0)
        shift elements from index x to 256
4,3,2,1,0,5... //pass 1 of the for loop
4,3,2,1,5... //pass 2 of the for loop
//this goes on until all the 0s are compacted
Now I know the above is sequential in nature, but the shifting can be done in constant time (and in parallel) because threads_per_block is already set to 256, so shifting is not the main issue. The main issue is how to improve this (any other suggestion is welcome).
Edit: I am thinking of another idea, which is as follows:
Assuming threads_per_block = 256, I can first count which of the histogram bins are non-zero (this operation is parallel because each thread is assigned to one bin; I can atomicAdd the values generated by each thread). I can then start a new shared index variable sindex = 0, and each time a thread wants to store a value into d_hist[] it takes the latest value of sindex, stores its value with d_hist[sindex] = hist[threadIdx.x], and afterwards atomicAdds sindex.
Now there is only one problem: there is going to be a race condition on reading the value of sindex, so I may have to set up a flag which can be locked or unlocked when a thread is adding a value to d_hist (but I think there could be a deadlock situation here).
Will this technique work? And is there any better technique?
Converting a sparse histogram to a dense histogram is a scatter operation. If the sparse histogram is composed of s_index[S_N] and s_hist[S_N], then first we create a dense histogram d_hist[N] composed of all zeroes (you can do this from host code, perhaps). Then we populate the dense histogram with d_hist[s_index[i]] = s_hist[i]; This can be done in parallel and uses as many threads as there are valid indices in your sparse histogram (i < S_N). Assuming your histogram is sorted, you'll get whatever coalescing benefit may be possible based on the distribution of your sparse histogram indices.
It may not make sense for your case where each threadblock is doing a separate histogram, but you may also be interested in thrust scatter.
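A minimal sketch of that scatter as a kernel; the array names and S_N follow the notation above, and the launch configuration is an assumption:

__global__ void sparse_to_dense(const int *s_index, const int *s_hist, int *d_hist, int S_N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < S_N)
        d_hist[s_index[i]] = s_hist[i]; // d_hist must be zeroed beforehand
}

// launched e.g. as: sparse_to_dense<<<(S_N + 255) / 256, 256>>>(s_index, s_hist, d_hist, S_N);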
Well, I guess the simplest method is to find out which bins are > 0; after that, an inclusive scan can be done (in order to calculate the target indexes, let's say sum_array[]), and then for all bins > 0 we move: d_hist[sum_array[threadIdx.x]-1] = s_hist[threadIdx.x]
0,1,2,3,4,5,6... //s_indexes[]
4,3,0,2,1,0,5... //contents of s_hist[]
1,1,0,1,1,0,1... //all bins which are > 0 = sum_array[]
1,2,2,3,4,4,5... //inclusive_scan of sum_array[]
//after the moving part
0,1,3,4,6... //s_indexes[]
4,3,2,1,5... //d_hist[]
0,1,2,3,4... //d_indexes[]
The reason why I am inclined to use this pattern is that it takes log2(256) steps to calculate sum_array; other than that, the moving and checking parts are just constant-time operations. If anyone has a different idea, please share.
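A minimal single-block sketch of this scan-and-move idea; the naive Hillis-Steele scan, the kernel name, and passing the per-block histogram through a global pointer (rather than shared memory) are simplifying assumptions, with 256 bins and 256 threads per block:

__global__ void compact_histogram(const int *hist, int *d_hist, int *d_indexes)
{
    __shared__ int sum_array[256];
    int x = threadIdx.x;

    sum_array[x] = (hist[x] > 0) ? 1 : 0; // flag non-empty bins
    __syncthreads();

    // inclusive scan in log2(256) = 8 steps
    for (int stride = 1; stride < 256; stride *= 2) {
        int v = (x >= stride) ? sum_array[x - stride] : 0;
        __syncthreads();
        sum_array[x] += v;
        __syncthreads();
    }

    if (hist[x] > 0) { // move: the target index is the scan value minus one
        d_hist[sum_array[x] - 1]    = hist[x];
        d_indexes[sum_array[x] - 1] = x;
    }
}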