Funnel shift - what is it? (CUDA)

While reading through the CUDA 5.0 Programming Guide I stumbled on a feature called "funnel shift", which is present on compute capability 3.5 devices but not on 3.0. It carries the annotation "see reference manual", but when I search for the term "funnel shift" in the manual, I don't find anything.
I tried googling for it, but only found a mention on http://www.cudahandbook.com, in chapter 8:
8.2.3 Funnel Shift (SM 3.5)
GK110 added a 64-bit “funnel shift” instruction that may be accessed with the following intrinsics:
__funnelshift_lc(): returns most significant 32 bits of a left funnel shift.
__funnelshift_rc(): returns least significant 32 bits of a right funnel shift.
These intrinsics are implemented as inline device functions (using inline PTX assembler) in sm_35_intrinsics.h.
...but it still does not explain what the "left funnel shift" or "right funnel shift" is.
So, what is it and where does one need it?

In the case of CUDA, two 32-bit registers are concatenated together into a 64-bit value; that value is shifted left or right; and the most significant (for a left shift) or least significant (for a right shift) 32 bits are returned.
The intrinsics from sm_35_intrinsics.h are as follows:
unsigned int __funnelshift_lc(unsigned int lo, unsigned int hi, unsigned int shift);
unsigned int __funnelshift_rc(unsigned int lo, unsigned int hi, unsigned int shift);
According to Andy Glew (dead link removed), applications for funnel shift include fast misaligned memcpy; and as njuffa mentions in the comments above, it can be used to implement rotate if the two input words are the same.
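For instance, a 32-bit rotate falls out directly by passing the same word as both halves. Here is a minimal sketch of that idea, assuming a compute capability 3.5 device and a shift amount in [0, 31]; the helper name is mine, not part of the toolkit:
// Hypothetical helper: rotate x left by n bits (0 <= n <= 31) using the
// funnel-shift intrinsic. With lo == hi == x, the bits shifted out of the
// high word are refilled from the identical low word, i.e. a rotation.
__device__ unsigned int rotl32(unsigned int x, unsigned int n)
{
    return __funnelshift_lc(x, x, n);   // high 32 bits of (x:x) << n
}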

Cheap approximate integer division on a GPU

So, I want to divide me some 32-bit unsigned integers on a GPU, and I don't care about getting an exact result. In fact, let's be lenient and suppose I'm willing to accept a multiplicative error factor of up to 2, i.e. if q = x/y I'm willing to accept anything between 0.5*q and 2*q.
I haven't yet measured anything, but it seems to me that something like this (CUDA code) should be useful:
__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    // 31 - __clz(x) is floor(log2(x)), so the difference approximates log2(dividend/divisor)
    return 1u << (__clz(divisor) - __clz(dividend));
}
It uses the count-leading-zeros integer intrinsic __clz() as a cheap base-2 logarithm: 31 - __clz(x) is floor(log2(x)) for nonzero x.
Notes: I could make this non-32-bit-specific, but then I'd have to complicate the code with templates, wrapping __clz() with a templated function that dispatches to __clzll() and so on.
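For concreteness, such a wrapper might look something like this (just a sketch; the helper names are made up, and overloads stand in for the templated dispatch; the 64-bit path also needs a wider result type):
// Hypothetical width-dispatching wrappers around the CUDA count-leading-zeros intrinsics.
__device__ __forceinline__ int count_leading_zeros(unsigned int x)       { return __clz((int)x); }
__device__ __forceinline__ int count_leading_zeros(unsigned long long x) { return __clzll((long long)x); }

template <typename T>
__device__ T cheap_approximate_division(T dividend, T divisor)
{
    // Same trick as above, but the shifted 1 must be as wide as T.
    return (T)1 << (count_leading_zeros(divisor) - count_leading_zeros(dividend));
}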
Questions:
Is there a better method for such approximate division, in terms of clock cycles? Perhaps with slightly different constraints?
If I want better accuracy, should I stay with integers or should I just go through floating-point arithmetic?
Going via floating point gives you a much more precise result, slightly lower instruction count on most architectures, and potentially a higher throughput:
__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    return (unsigned)(__fdividef(dividend, divisor) /*+0.5f*/ );
}
The +0.5f in the comment is meant to indicate that you can also turn the float->int conversion into proper rounding at essentially no cost other than higher energy consumption (it turns an fmul into an fmad, with the constant coming straight from the constant cache). Rounding would take you further away from the exact integer result though.
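Spelled out, the rounded variant hinted at in the comment would look something like this (just a sketch; the function name is illustrative):
__device__ unsigned rounded_approximate_division(unsigned dividend, unsigned divisor)
{
    // Adding 0.5f before the truncating float->unsigned conversion gives
    // round-to-nearest for the (non-negative) approximate quotient.
    return (unsigned)(__fdividef((float)dividend, (float)divisor) + 0.5f);
}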

Why are there no simple atomic increment, decrement operations in CUDA?

From the CUDA Programming Guide:
unsigned int atomicInc(unsigned int* address, unsigned int val);
reads the 32-bit word old located at the address address in global or
shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the
result back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.
This is nice and dandy. But where's
unsigned int atomicInc(unsigned int* address);
which simply increments the value at address and returns the old value? And
void atomicInc(unsigned int* address);
which simply increments the value at address and returns nothing?
Note: Of course I can 'roll my own' by wrapping the actual API call, but I would think the hardware has such a simpler operation, which could be cheaper.
They haven't implemented the simple increase and decrease operations because the operations wouldn't have better performance. Each machine code instruction in current architectures takes the same amount of space, which is 64 bits. In other words, there's room in the instruction for a full 32-bit immediate value and since they have atomic instructions that support adding a full 32-bit value, they have already spent the transistors.
I think that dedicated inc and dec instructions on old processors are now just artifacts from a time when transistors were much more expensive and the instruction cache was small, thereby making it worth the effort to code instructions into as few bits as possible. My guess is that, on new CPUs, the inc and dec instructions are implemented in terms of a more general addition function internally and are mostly there for backwards compatibility.
You should be able to accomplish "simple" atomic increment using:
__global__ void mykernel(int *value){
    int my_old_val = atomicAdd(value, 1);
}
Likewise if you don't care about the return value, this is perfectly acceptable:
__global__ void mykernel(int *value){
    atomicAdd(value, 1);
}
You can do atomic decrement similarly with:
atomicSub(value, 1);
or even
atomicAdd(value, -1);
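Put together as a complete (illustrative) kernel, a fire-and-forget decrement might look like this (the kernel name is mine):
__global__ void decrement_kernel(int *value){
    atomicSub(value, 1);   // same effect as atomicAdd(value, -1); return value ignored
}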
The atomic functions section of the CUDA Programming Guide covers support by compute capability for the various functions and data types.

CUDA 5.0 Samples AdvancedQuickSort

I am reading the CUDA 5.0 samples (AdvancedQuickSort) now. However, I cannot fully understand this sample because of the following code:
// Now compute my own personal offset within this. I need to know how many
// threads with a lane ID less than mine are going to write to the same buffer
// as me. We can use popc to implement a single-operation warp scan in this case.
unsigned lane_mask_lt;
asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );
unsigned int my_mask = greater ? gt_mask : lt_mask;
unsigned int my_offset = __popc(my_mask & lane_mask_lt);
which is in the __global__ void qsort_warp function. Can anyone explain the meaning of this inline assembly?
%lanemask_lt is a special, read-only register in PTX assembly which is initialized with a 32-bit mask with bits set in positions less than the thread’s lane number in the warp. The inline PTX you have posted is simply reading the value of that register and storing it in a variable where it can be used in the subsequent C++ code you posted.
Every version of the CUDA toolkit ships with a PTX assembly language reference guide you can use to look up things like this.
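To illustrate how the register is typically used here, the following is a sketch (not code from the sample), assuming a 32-thread warp and the pre-sm_70 __ballot() intrinsic that this CUDA 5.0-era code relies on:
// Each thread asks: how many lanes below me also voted "true"?
// That count is the thread's exclusive offset into the shared output buffer.
__device__ unsigned int exclusive_offset(bool predicate)
{
    unsigned int lane_mask_lt;
    asm( "mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt) );   // bits set below my lane id
    unsigned int vote = __ballot(predicate);                    // one bit per lane that voted true
    return __popc(vote & lane_mask_lt);                         // lanes below me that voted true
}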

Generating random numbers from a gaussian distribution in CUDA

I've searched a lot over the internet to find a way to generate random numbers on my CUDA device, within a kernel. The numbers must come from a gaussian distribution.
The best thing I found was from NVIDIA itself. It is the Wallace algorithm, which uses a uniform distribution to build a Gaussian one. But the code samples they give lack explanation, and I really need to understand how the algorithm works, especially on the device. For example, they give:
__device__ void generateRandomNumbers_wallace(
    unsigned seed,            // Initialization seed
    float *chi2Corrections,   // Set of correction values
    float *globalPool,        // Input random number pool
    float *output )           // Output random numbers
{
    unsigned tid = threadIdx.x;
    // Load global pool into shared memory.
    unsigned offset = __mul24(POOL_SIZE, blockIdx.x);
    for( int i = 0; i < 4; i++ )
        pool[tid + THREADS*i] = globalPool[offset + TOTAL_THREADS*i + tid];
    __syncthreads();

    const unsigned lcg_a = 241;
    const unsigned lcg_c = 59;
    const unsigned lcg_m = 256;
    const unsigned mod_mask = lcg_m - 1;
    seed = (seed + tid) & mod_mask;

    // Loop generating outputs repeatedly
    for( int loop = 0; loop < OUTPUTS_PER_RUN; loop++ )
    {
        Transform();
        unsigned intermediate_address;
        i_a = __mul24(loop, 8*TOTAL_THREADS) + 8*THREADS*blockIdx.x + threadIdx.x;
        float chi2CorrAndScale = chi2Corrections[blockIdx.x * OUTPUTS_PER_RUN + loop];
        for( i = 0; i < 4; i++ )
            output[i_a + i*THREADS] = chi2CorrAndScale * pool[tid + THREADS*i];
    }
}
First of all, many of the variables declared aren't even used in the function! And I really don't get what the "8" is for in the second loop. I understand the "4" in the other loops has something to do with the 4x4 orthogonal matrix block, am I right? Could anyone give me a better idea of what is going on here?
Anyway, does anyone have any good code samples I could use? Or does anyone have another way of generating random gaussian numbers in a CUDA kernel? Code samples will be much appreciated.
Thanks!
You could use CURAND, which is included with the CUDA Toolkit (version 3.2 and later). It'd be far simpler!
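For instance, with cuRAND's device API you can draw normally distributed values directly inside a kernel; here is a minimal sketch (kernel names and the seeding scheme are illustrative):
#include <curand_kernel.h>

// One generator state per thread; initialise once, then reuse across launches.
__global__ void setup_states(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}

__global__ void generate_normals(curandState *states, float *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        curandState local = states[id];   // work on a register copy of the state
        out[id] = curand_normal(&local);  // one sample from a standard normal
        states[id] = local;               // store the advanced state back
    }
}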
A few notes on the code you posted:
The Wallace generator transforms Gaussian to Gaussian (i.e. not Uniform to Gaussian)
CUDA code has two built-in variables: blockIdx and threadIdx - these give the block index and the thread index within a block; see the CUDA Programming Guide for more information
The code uses __mul24; on sm_20 and later this is actually slower than "ordinary" 32-bit multiplication, so I would avoid it (even on older architectures, for simplicity)
The Box-Muller method is also good.
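As an illustration of that last point, a Box-Muller step can be written as a small device helper (a sketch assuming u1 and u2 are uniform samples in (0, 1], produced by whatever uniform generator you already have):
// Turn two uniforms into two independent standard-normal samples.
__device__ float2 box_muller(float u1, float u2)
{
    float r = sqrtf(-2.0f * logf(u1));   // radius; requires u1 > 0
    float s, c;
    sincosf(6.2831853f * u2, &s, &c);    // sin and cos of 2*pi*u2
    return make_float2(r * c, r * s);
}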
The Fast Walsh Hadamard transform is done by patterns of addition and subtraction. Hence the central limit theorem applies. An array of uniform random numbers that undergoes a Walsh Hadamard transformation will have a Gaussian/Normal distribution. There are some slight technical details about that. The algorithm was not discovered by Wallace. It was first published in Servo Magazine around 1993/1994 by myself.
I have code about the Walsh Hadamard transform at www.code.google.com/p/lemontree
Regards,
Sean O'Connor

Why do MIPS operations on unsigned numbers give signed results?

When I try working on unsigned integers in MIPS, the result of every operation I do remains signed (that is, the integers are all in 2's complement), even though every operation I perform is an unsigned one: addu, multu and so forth...
When I print numbers in the range [2^31, 2^32 - 1] I get their 'overflowed' negative value as if they were signed (I guess they are).
Though, when I try something like this:
li $v0, 1
li $a0, 2147483648 # or any bigger number
syscall
the printed number is always 2147483647 (2^31 - 1)
I'm confused... What am I missing?
P.S.: I haven't included my code as it isn't very readable (such is assembly code), and putting this problem aside, it seems to be working fine. If anyone feels it is necessary I shall include it right away!
From Wikipedia:
The MIPS32 Instruction Set states that the word unsigned as part of Add and Subtract instructions, is a misnomer. The difference between signed and unsigned versions of commands is not a sign extension (or lack thereof) of the operands, but controls whether a trap is executed on overflow (e.g. Add) or an overflow is ignored (Add unsigned). An immediate operand CONST to these instructions is always sign-extended.
From the MIPS Instruction Reference:
ALL arithmetic immediate values are sign-extended [...] The only difference between signed and unsigned instructions is that signed instructions can generate an overflow exception and unsigned instructions can not.
It looks to me like the real problem is the syscall you are using to print numbers. It appears to always interpret what you pass as a signed value, and possibly to clamp it to the signed range as well, which would explain the 2147483647 you see.