Generating random numbers from a Gaussian distribution in CUDA

I've searched a lot over the internet to find a way to generate random numbers on my CUDA device, within a kernel. The numbers must come from a Gaussian distribution.
The best thing I found was from NVIDIA itself. It is the Wallace algorithm, which uses a uniform distribution to build a Gaussian one. But the code samples they give lack explanation and I really need to understand how the algorithm goes, especially on the device. For example, they give:
__device__ void generateRandomNumbers_wallace(
    unsigned seed,          // Initialization seed
    float *chi2Corrections, // Set of correction values
    float *globalPool,      // Input random number pool
    float *output )         // Output random numbers
{
    unsigned tid = threadIdx.x;
    // Load global pool into shared memory.
    unsigned offset = __mul24(POOL_SIZE, blockIdx.x);
    for( int i = 0; i < 4; i++ )
        pool[tid + THREADS*i] = globalPool[offset + TOTAL_THREADS*i + tid];
    __syncthreads();

    const unsigned lcg_a = 241;
    const unsigned lcg_c = 59;
    const unsigned lcg_m = 256;
    const unsigned mod_mask = lcg_m - 1;
    seed = (seed + tid) & mod_mask;

    // Loop generating outputs repeatedly
    for( int loop = 0; loop < OUTPUTS_PER_RUN; loop++ )
    {
        Transform();
        unsigned intermediate_address;
        i_a = __mul24(loop, 8*TOTAL_THREADS) + 8*THREADS * blockIdx.x + threadIdx.x;
        float chi2CorrAndScale = chi2Corrections[blockIdx.x * OUTPUTS_PER_RUN + loop];
        for( i = 0; i < 4; i++ )
            output[i_a + i*THREADS] = chi2CorrAndScale * pool[tid + THREADS*i];
    }
}
First of all, many of the variables declared aren't even used in the function! And I really don't get what the "8" is for in the second loop. I understand the "4" in the other loops has something to do with the 4x4 orthogonal matrix block, am I right? Could anyone give me a better idea of what is going on here?
Anyway, does anyone have any good code samples I could use? Or does anyone have another way of generating random gaussian numbers in a CUDA kernel? Code samples will be much appreciated.
Thanks!

You could use CURAND, which is included with the CUDA Toolkit (version 3.2 and later). It'd be far simpler!
A few notes on the code you posted:
The Wallace generator transforms Gaussian to Gaussian (i.e. not Uniform to Gaussian)
CUDA code has two implicit variables: blockIdx and threadIdx - these define the block index and the thread index within a block; see the CUDA Programming Guide for more information
The code uses __mul24; on sm_20 and later this is actually slower than an ordinary 32-bit multiplication, so I would avoid it (even on older architectures, for simplicity)
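For reference, a minimal sketch of what the CURAND device API looks like for Gaussian numbers (the kernel name, seed, and launch configuration here are just placeholders):

#include <curand_kernel.h>

// Each thread initializes its own generator state, then draws
// normally distributed (mean 0, stddev 1) values with curand_normal().
__global__ void gaussianKernel(float *out, int n, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    curandState state;
    curand_init(seed, id, 0, &state);   // seed, sequence number, offset
    out[id] = curand_normal(&state);    // one standard normal sample
}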

The Box-Muller method is also good.
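For what it's worth, here is a minimal Box-Muller sketch; it assumes you already have two independent uniform samples u1, u2 in (0, 1], e.g. from curand_uniform:

// Basic Box-Muller transform: two uniforms in (0,1] -> one standard normal.
__device__ float boxMuller(float u1, float u2)
{
    const float PI = 3.14159265358979f;
    float r = sqrtf(-2.0f * logf(u1));   // radius
    return r * cosf(2.0f * PI * u2);     // the matching sinf() term gives a second sample
}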

The fast Walsh-Hadamard transform is done purely by patterns of addition and subtraction, so the central limit theorem applies: an array of uniform random numbers that undergoes a Walsh-Hadamard transform will end up with an approximately Gaussian/normal distribution (there are some minor technical details about that). The algorithm was not discovered by Wallace; I first published it in Servo Magazine around 1993/1994.
I have code for the Walsh-Hadamard transform at www.code.google.com/p/lemontree
Regards,
Sean O'Connor
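A minimal, illustrative sketch of an in-place fast Walsh-Hadamard transform (this is not the lemontree code itself; the orthonormal 1/sqrt(n) scaling is added here so the output variance matches the input):

#include <math.h>

// In-place fast Walsh-Hadamard transform on n = 2^k values.
// Only additions and subtractions are used, which is why the CLT argument applies.
void fwht(float *data, int n)
{
    for (int len = 1; len < n; len <<= 1)
        for (int i = 0; i < n; i += 2 * len)
            for (int j = i; j < i + len; j++) {
                float a = data[j], b = data[j + len];
                data[j]       = a + b;
                data[j + len] = a - b;
            }

    float scale = 1.0f / sqrtf((float)n);  // make the transform orthonormal
    for (int i = 0; i < n; i++)
        data[i] *= scale;
}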

Related

Memory access in CUDA kernel functions (simple example)

I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not properly understand how threads access and change variables in such a simple example (the dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;
I do not understand three things.
1. What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads from the first block run first, and then the threads from the second block come into play, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
2. Does each thread have its own copy of the "temp" variable or not (and if not, why is there no race condition)?
3. How does it operate? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How does the summation proceed? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.
Although the code you provided is incomplete, here are some clarifications about what you are asking:
The kernel code will be executed by all the threads in all the blocks. The way to "split the job" is to make each thread work on only one or a few elements.
For instance, if you have to process 100 integers with a specific algorithm, you probably want 100 threads, each treating 1 element.
In CUDA, the number of blocks and threads is defined at kernel launch on the host side:
myKernel<<<grid, threads>>>(...);
where grid and threads are dim3 values, which define the sizes in three dimensions.
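As a concrete (hypothetical) example for the dot kernel above, assuming dev_a, dev_b, and dev_c are device pointers already allocated with cudaMalloc:

dim3 grid(32);       // 32 blocks
dim3 threads(256);   // 256 threads per block, i.e. 32 * 256 threads in total

dot<<<grid, threads>>>(dev_a, dev_b, dev_c);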
There is no specific order in the execution of threads and blocks. As you can read here:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is declared in the kernel without any qualifier, it is not shared, and each thread will have its own copy stored in a register.
This is equivalent to what is done on the CPU side. So yes, each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using accesses to the device arrays.
Again, this is equivalent to what is done on the CPU side.
I think you should probably make sure you are comfortable enough with C/C++ programming on the CPU side before going further into GPU programming. No offense meant, but it seems you have gaps in several fundamental topics.
Since CUDA allows you to drive your GPU with C code, the difficulty is not in the syntax, but in the specificities of the hardware.

CUDA5.0 Samples AdvancedQuickSort

I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot fully understand this sample because of the following code:
// Now compute my own personal offset within this. I need to know how many
// threads with a lane ID less than mine are going to write to the same buffer
// as me. We can use popc to implement a single-operation warp scan in this case.
unsigned lane_mask_lt;
asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );
unsigned int my_mask = greater ? gt_mask : lt_mask;
unsigned int my_offset = __popc(my_mask & lane_mask_lt);
which is in the __global__ void qsort_warp function; I especially don't understand the inline assembly. Can anyone explain to me what this assembly does?
%lanemask_lt is a special, read-only register in PTX assembly which is initialized with a 32-bit mask with bits set in positions less than the thread’s lane number in the warp. The inline PTX you have posted is simply reading the value of that register and storing it in a variable where it can be used in the subsequent C++ code you posted.
Every version of the CUDA toolkit ships with a PTX assembly language reference guide you can use to look up things like this.
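As an illustration only (not taken from the sample), the same mask could be computed without inline PTX, assuming a warp size of 32:

unsigned int lane = threadIdx.x & 31;                      // lane ID within the warp
unsigned int lane_mask_lt = (1u << lane) - 1u;             // bits set for all lanes below mine
unsigned int my_offset = __popc(my_mask & lane_mask_lt);   // count of lower lanes writing to the same buffer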

Interpolation with CUDA Texture memory

I would like to use texture memory for interpolation of data. I have 2 arrays (namely A[i] and B[i]) and I want to interpolate data between them. I thought I could bind them to texture memory and set up the interpolation, but I am not sure how to do that.
The examples that come with CUDA use the A[i-1] and A[i+1] for the interpolation.
Is there any way to do what I planned? I'm trying this because I think I can get a good speedup.
Yes, you can do this with texture memory, and it is fast. I personally use ArrayFire to accomplish these kinds of operations, because it is faster than I can hope to code by hand.
If you want to code by hand yourself in CUDA, something like this is what you want:
// outside kernel
texture<float,1> A;   // declare a second texture reference B the same way
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr = NULL;
cudaError_t e = cudaMallocArray(&arr, &desc, length, 1);
A.filterMode = cudaFilterModePoint;
A.addressMode[0] = cudaAddressModeClamp;
cudaBindTextureToArray(A, arr, desc);
...
// inside kernel
float valA = tex1D(A, idx);
float valB = tex1D(B, idx);
float f = 0.5f;
output = f * valA + (1 - f) * valB;
If you want to just plug in ArrayFire (which in my experience is faster than what I try to code by hand, not to mention way simpler to use), then you'll want:
// in arrayfire
array A = randu(10,1);
array B = randu(10,1);
float f = 0.5;
array C = (f)*A + (1-f)*B;
The above assumes you want to interpolate between corresponding indices of 2 different arrays or matrices. There are other interpolation functions available too.
If you're not used to developing with CUDA, using texture memory is not the easiest thing to start with.
I'd suggest first writing a parallel version of your algorithm in CUDA with no optimization. Then use the NVIDIA Visual Profiler on your application to figure out whether you need to set up texture memory to optimize your memory accesses.
Remember that the earlier you optimize, the trickier it is to debug.
Last but not least, the latest CUDA version (CUDA 5, still in release candidate) is able to automatically serve your data through the texture path as long as you declare the input buffers passed as parameters to your kernel as const __restrict__ pointers.
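For illustration, a kernel signature of that form might look like this (the kernel itself is just a made-up example):

// Marking 'in' as const __restrict__ lets the compiler route its loads
// through the read-only (texture) cache path on hardware that supports it.
__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out,
                      float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = alpha * in[i];
}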

CUDA - what is this loop doing

Hey
I've seen on a website this example kernel
__global__ void loop1( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;

    for(i=i0;i<N;i+=blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}
To compute this function in C
for(i=0;i<N;i++) {
    y[i] = alpha*x[i] + y[i];
}
Surely the for loop inside the kernel isn't necessary? Can't you just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether?
I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>>, so presumably gridDim.x = 1
You need the for loop in the kernel if your vector has more entries than you have launched threads. If possible, it is of course more efficient to launch enough threads, as in the sketch below.
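A minimal sketch of that alternative (the kernel name is just illustrative; rounding the block count up is the usual idiom):

// One element per thread; only valid when the grid covers all N elements.
__global__ void loop1_noloop( int N, float alpha, float* x, float* y ) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N)
        y[i] = alpha*x[i] + y[i];
}

// Host side: round the block count up so every element is covered.
// loop1_noloop<<<(N + 255) / 256, 256>>>(N, alpha, x, y);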
Interesting kernel. The loop inside the kernel is necessary because N is greater than the total number of threads, which is 16384 (blockDim.x*gridDim.x), but I think it's not good practice to do it (the whole point of CUDA is to use the SIMT concept). According to the CUDA Programming Guide you can have at most 65535 thread blocks in one grid dimension. Furthermore, starting from Compute Capability 2.x (Fermi) you can have at most 1024 threads per block (512 before Fermi). You can also (if possible) separate the code into multiple (sequential) kernels.
Much as we would like to believe that CUDA GPUs have infinite execution resources, they do not, and authors of highly optimized code are finding that unrolled for loops, often with fixed numbers of blocks, give the best performance. Makes for painful coding, but optimized CPU code is also pretty painful.
By the way, a commenter mentioned that this code would have coalescing problems, and I don't see why. If the base addresses are correctly aligned (64B since those are floats), all of the memory transactions by this code will be coalesced, provided the threads/block is also divisible by 64.

Coding a CUDA Kernel that has many threads writing to the same index?

I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain it a bit clearer with the variables.
__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if(index_in < cLength)
    {
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}
First off, the number of connections in the network is cLength. For every connection, there is a source neuron and a target neuron, as well as a weight for that connection. sourceTargetArray contains that information: element i of sourceTargetArray holds the source neuron index and the target neuron index of connection i. weightArray contains the weight information (so index i of weightArray corresponds to connection i).
As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively, for all the incoming connections to a neuron, sum all the weights. That's really all I'm trying to do with this kernel. Eventually, I'll normalize the weights using this sum.
The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by about 12-15x (so the right answer would be 700.0 and what I'm getting is something in the 50s range).
You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?
EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5 Memory Fence Functions
+= is not atomic, so it is not thread-safe. Use atomicAdd.
Also, you should avoid writing to the same memory cell. The problem is that these calls will be serialized: threads will stand in line and wait for each other. If you can't avoid this operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.
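For illustration, the guarded update in your kernel could be patched along these lines (a sketch, keeping your variable names; float atomicAdd requires compute capability 2.0 or later):

if(index_in < cLength)
{
    // atomicAdd serializes conflicting updates to the same address,
    // so concurrent increments are no longer lost.
    atomicAdd(&sumArray[sourceTargetArray[index_in].y],
              fabsf(weightArray[index_in]));
}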
You need to do a reduction.
Sum the elements assigned to each thread and place the result in an array, cache[threadsPerBlock], then call __syncthreads().
Now reduce the resulting sub totals by adding successive neighboring subtotals:
int cacheIndex = threadIdx.x;
int i = blockDim.x / 2;
while (i != 0)
{
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}
The following deck explains this in some detail:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Sample code for this is here:
http://www.nvidia.com/object/cuda_sample_data-parallel.html
It's also very well explained in "CUDA by Example" (which is where the code fragment comes from).
There is one big caveat with this approach. The additions will not occur in the same order they would with serial code. Addition of floats is not associative, so rounding errors may lead to slightly different results.