Cheap approximate integer division on a GPU - cuda

So, I want to divide some 32-bit unsigned integers on a GPU, and I don't care about getting an exact result. In fact, let's be lenient and suppose I'm willing to accept a multiplicative error factor of up to 2, i.e. if q = x/y I'm willing to accept anything between 0.5*q and 2*q.
I haven't yet measured anything, but it seems to me that something like this (CUDA code) should be useful:
__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    // floor(log2(x)) == 31 - __clz(x), so the quotient's exponent is
    // approximately __clz(divisor) - __clz(dividend).
    return 1u << (__clz(divisor) - __clz(dividend));
}
It uses the __clz() (count leading zeros) integer intrinsic as a cheap base-2 logarithm: for a non-zero 32-bit x, floor(log2(x)) equals 31 - __clz(x).
Notes: I could make this non-32-bit-specific, but then I'd have to complicate the code with templates, wrap __clz() and __clzll() in a templated function, etc.
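As a quick host-side sanity check of the idea (my own sketch, not part of the question; it uses GCC/Clang's __builtin_clz, which mirrors the device intrinsic):
#include <cstdio>

// floor(log2(x)) == 31 - clz(x) for non-zero 32-bit x, so the quotient is
// approximated by 2^(clz(divisor) - clz(dividend)). Requires divisor != 0
// and dividend >= divisor so the shift amount is non-negative.
static unsigned approx_div(unsigned dividend, unsigned divisor)
{
    return 1u << (__builtin_clz(divisor) - __builtin_clz(dividend));
}

int main()
{
    printf("%u\n", approx_div(1000u, 3u));     // 256    vs. exact 333
    printf("%u\n", approx_div(1u << 20, 7u));  // 262144 vs. exact 149796
    return 0;
}
Both results are within the factor-of-2 tolerance stated in the question.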
Questions:
Is there a better method for such approximate division, in terms of clock cycles? Perhaps with slightly different constraints?
If I want better accuracy, should I stay with integers or should I just go through floating-point arithmetic?

Going via floating point gives you a much more precise result, slightly lower instruction count on most architectures, and potentially a higher throughput:
__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    return (unsigned)(__fdividef(dividend, divisor) /*+0.5f*/ );
}
The +0.5f in the comment indicates that you can also turn the float->int conversion into proper rounding at essentially no cost beyond higher energy consumption (it turns an fmul into an fmad, with the constant coming straight from the constant cache). Rounding would take you further away from the exact integer result, though.
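Spelled out, the rounded variant would look something like this (a minimal sketch; the function name is mine):
__device__ unsigned approximate_division_rounded(unsigned dividend, unsigned divisor)
{
    // Biasing by 0.5f before the float->unsigned conversion rounds to
    // nearest instead of truncating; per the answer above, the addition can
    // be folded into an fmad, so it costs essentially nothing extra.
    return (unsigned)(__fdividef((float)dividend, (float)divisor) + 0.5f);
}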

Range of unsigned FLOAT values in MySQL

According to the MySQL numeric types documentation, among other things, FLOAT types have these properties:
FLOAT[(M,D)] [UNSIGNED] [ZEROFILL]
A small (single-precision) floating-point number. Permissible values are -3.402823466E+38 to -1.175494351E-38, 0, and 1.175494351E-38 to 3.402823466E+38. These are the theoretical limits, based on the IEEE standard. The actual range might be slightly smaller depending on your hardware or operating system.
M is the total number of digits and D is the number of digits following the decimal point. If M and D are omitted, values are stored to the limits permitted by the hardware. A single-precision floating-point number is accurate to approximately 7 decimal places.
UNSIGNED, if specified, disallows negative values.
I want to determine if the behavior of unsigned float is similar to the behavior of unsigned int, where the full bit depth of the decimal and mantissa is used to represent positive values, or if unsigned floats only prevent negative values.
I want to determine if the behavior of unsigned float is similar to the behavior of unsigned int where the full bit depth of the decimal and mantissa is used to represent positive values, or
No. MySQL chose to use IEEE-754 single-precision float for its FLOAT. An IEEE float is "sign-magnitude", so the sign bit occupies a dedicated location of its own. INT, on the other hand, uses 2's complement encoding, so the bit that would hold the sign can be interpreted either as a sign or as an extension of the value. (Caveat: virtually all current computer hardware works the way described in this paragraph, but there may be some that do not. An antique example: the CDC 6600 used 1's complement.)
if unsigned floats only prevent negative values.
Yes.
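To illustrate the two encodings described above, here is a small host-side C++ check (my own sketch, not MySQL code; it assumes a 32-bit two's-complement int and an IEEE-754 single-precision float, which is what virtually all current hardware provides):
#include <cfloat>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // Two's complement INT: reinterpreting the sign bit as a value bit
    // doubles the positive range, which is what UNSIGNED buys you.
    printf("INT_MAX  = %d\n", INT_MAX);    // 2147483647
    printf("UINT_MAX = %u\n", UINT_MAX);   // 4294967295, roughly 2x

    // IEEE-754 float is sign-magnitude: the sign bit is a dedicated bit,
    // so disallowing negatives frees no extra magnitude; FLT_MAX is the
    // ceiling either way.
    float f = -1.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                        // raw encoding
    printf("sign bit of -1.5f: %u\n", (unsigned)(bits >> 31)); // a separate bit
    printf("FLT_MAX  = %g\n", FLT_MAX);
    return 0;
}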
It has to be said: if you're using the inherently imprecise FLOAT data type, you should think through why your problem space prohibits all negative numbers. Do those numbers faithfully represent the problem space? Is FLOAT's imprecision acceptable for what you're representing?
The idea that there can be no computation on floating-point numbers is illusory. Just generating them from integers involves computation. Yes, it's done directly on the computer chip, but it's still computation.
And, the example of packet loss is
precisely represented by two integers: number of packets lost / number of packets (you could call this a rational number if you wanted), or
approximately represented by a FLOAT fraction resulting from dividing one of those integers by the other.
With respect, you're overthinking this.
FLOAT always occupies 4 bytes, regardless of the options tacked onto it. DECIMAL(m,n), on the other hand, occupies approximately m/2 bytes.
FLOAT(m,n) is probably useless. The n says to round to n decimal places and then store into a binary number (IEEE float), thereby causing another rounding.

Why does cuFFT performance suffer with overlapping inputs?

I'm experimenting with using cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first doing an explicit conversion of the input buffer to float). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs might overlap by 1/4 to 1/8 of the FFT length.
cuFFT, with its FFTW-like interface, explicitly supports this via the idist parameter of the cufftPlanMany() function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I would set idist = 32768 - 4096. This does work properly in the sense that it yields the correct output.
However, I'm seeing strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways:
Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap as I described above. Install a load callback function that just does the conversion from int8_t to float as needed on the buffer index provided to the callback.
Don't tell cuFFT about the overlapping nature of the input; lie to it and set idist = nfft. Then, let the callback function handle the overlapping by calculating the correct index that should be read for each FFT input.
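For reference, the plan for the first approach might be created along these lines (a minimal sketch under my own naming, not code from the gist; the numbers mirror the question's nfft = 32768, overlap = 4096, batch = 1024):
#include <cufft.h>

// Create a batched 1D real-to-complex plan whose consecutive inputs overlap.
// Passing non-NULL inembed/onembed makes cuFFT honor the custom idist.
cufftHandle make_overlapped_plan(int nfft, int overlap, int batch_count)
{
    cufftHandle plan;
    int n[1]       = { nfft };
    int inembed[1] = { nfft };
    int onembed[1] = { nfft / 2 + 1 };
    // idist = nfft - overlap: each transform starts (nfft - overlap) real
    // samples after the previous one, i.e. the inputs overlap by `overlap`.
    cufftPlanMany(&plan, 1, n,
                  inembed, 1, nfft - overlap,
                  onembed, 1, nfft / 2 + 1,
                  CUFFT_R2C, batch_count);
    return plan;
}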
A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. I didn't reproduce it all here for brevity. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (with a Geforce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result:
executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec
Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:
Method 1:
template <typename T>
__device__ cufftReal convert_callback(void *inbuf, size_t fft_index,
                                      void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}
Method 2:
template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf,
                                                  size_t fft_index,
                                                  void *, void *)
{
    // fft_index is the index of the sample that we need, not taking
    // the overlap into account. Convert it to the appropriate sample
    // index, considering the overlap structure. First, grab the FFT
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;
    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer
    // arithmetic here for speed, knowing that this would cause a problem
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;
    // For each transform past the first one, we need to slide "overlap"
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;
    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
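For completeness, either callback gets attached to the plan through the standard cuFFT callback mechanism, roughly as sketched below (my own illustration, not code from the gist; d_callback_ptr and attach_load_callback are assumed names):
#include <cuda_runtime.h>
#include <cufftXt.h>

// Device-side copy of the callback pointer so the host can retrieve it.
__device__ cufftCallbackLoadR d_callback_ptr = convert_callback<int8_t>;

void attach_load_callback(cufftHandle plan)
{
    cufftCallbackLoadR h_callback_ptr;
    // Fetch the device function pointer...
    cudaMemcpyFromSymbol(&h_callback_ptr, d_callback_ptr, sizeof(h_callback_ptr));
    // ...and register it as the real-input load callback for this plan.
    // Callbacks require relocatable device code and the static cuFFT library.
    cufftXtSetCallback(plan, (void **)&h_callback_ptr, CUFFT_CB_LD_REAL, NULL);
}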
Method 2 has a significantly more complex callback function, one that even involves integer division by a value not known at compile time! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing much differently when the input overlaps, thus resulting in the degraded performance?
It seems like I should be able to achieve performance that is quite a bit faster than method 2 if the index calculations could be removed from the callback (but that would require the overlapping to be specified to cuFFT).
Edit: After running my test program under nvvp, I can see that cuFFT definitely seems to be structuring its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this:
Method 1:
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.75 msec (yes, it gets invoked twice)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 msec
Method 2:
spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 msec
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 msec
Interestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs using method 1 (when cuFFT knows about the overlapping), but with method 2 (where it doesn't know that the FFTs are overlapped), it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters between methods 1 and 2.
I don't see why it should have to use a different implementation here, especially since the input stride istride == 1. It should just use a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.
Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles appropriately, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results:
executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec
So the performance seems to change depending upon whether there are other cuFFT plans in existence when creating a plan for the test case! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels just all seem to execute faster. I have no reasonable explanation for this effect either.
If you specify non-standard strides (it doesn't matter whether for the batch or the transform), cuFFT uses a different path internally.
Regarding edit 2:
This is likely GPU Boost adjusting the clocks on the GPU. cuFFT plans do not have an impact on one another.
Ways to get more stable results:
run a warmup kernel (anything that makes the whole GPU do work is good) and then your problem; a sketch of one is shown below
increase the batch size
run the test several times and take the average
lock the GPU's clocks (not really possible on GeForce; Tesla cards can do it)
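To make the warmup suggestion concrete, here is a minimal sketch (my own illustration; the kernel, buffer, and launch sizes are arbitrary assumptions):
// Trivial busy-work kernel: touches a large buffer so the GPU ramps its
// clocks up before the timed cuFFT runs.
__global__ void warmup_kernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = buf[i] * 1.000001f + 1.0f;
}

void run_warmup(float *d_buf, int n)
{
    // Launch enough blocks to occupy the whole device, then wait so the
    // boosted clocks are in effect before the real measurement starts.
    warmup_kernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();
}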
At the suggestion of @llukas, I filed a bug report with NVIDIA regarding the issue (https://partners.nvidia.com/bug/viewbug/1821802 if you're registered as a developer). They acknowledged the poorer performance with overlapped plans. They actually indicated that the kernel configuration used in both cases is suboptimal and they plan to improve it eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0, there is no workaround to make cuFFT use a more efficient method with strided inputs.

as3 Number type - Logic issues with large numbers

I'm curious about an issue spotted in our team with a very large number:
var n:Number = 64336512942563914;
trace(n < Number.MAX_VALUE); // true
trace(n); // 64336512942563910
var a1:Number = n +4;
var a2:Number = a1 - n;
trace(a2); // 8 Expect to see 4
trace(n + 4 - n); // 8
var a3:Number = parseInt("64336512942563914");
trace(a3); // 64336512942563920
n++;
trace(n); //64336512942563910
trace(64336512942563914 == 64336512942563910); // true
What's going on here?
Although n is large, it's smaller than Number.MAX_VALUE, so why am I seeing such odd behaviour?
I thought that perhaps it was an issue with formatting large numbers when being trace'd out, but that doesn't explain n + 4 - n == 8
Is this some weird floating point number issue?
Yes, it is a floating point issue, but it is not a weird one. It is all expected behavior.
The Number data type in AS3 is actually a "64-bit double-precision format as specified by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE-754)" (source). Because the number you assigned to n has too many digits to fit into those 64 bits, it gets rounded off, and that's the reason for all the "weird" results.
If you need to do some exact big integer arithmetic, you will have to use a custom big integer class, e.g. this one.
Numbers in Flash are double-precision floating-point numbers, meaning they store a number of significant digits and an exponent. This format favors a larger range of expressible numbers over precision, due to memory constraints. Once a number has too many significant digits, rounding occurs, which is what you are seeing: the least significant digits are being rounded. Google "double-precision floating-point numbers" and you'll find plenty of technical information on why.
It is the nature of the data type. If you need precise numbers you should really stick to uint- or int-based integers. Other languages have fixed-point or big-integer processing libraries (sometimes called BigInt or Decimal) which are wrappers around ints and longs to express much larger numbers at the cost of memory consumption.
We have an as3 implementation of BigDecimal copied from Java that we use for ALL calculations. In a trading app the floating point errors were not acceptable.
To be safe with integer math when using Numbers, I checked the AS3 documentation for Number and it states:
The Number class can be used to represent integer values well beyond the valid range of the int and uint data types. The Number data type can use up to 53 bits to represent integer values, compared to the 32 bits available to int and uint.
A 53-bit integer gets you to 2^53 - 1 which, if I'm not mistaken, is 9007199254740991, or about 9 quadrillion. (Of the 64 bits in a Number, 52 explicitly store the significand, 11 hold the exponent, and 1 is the sign; the 53rd significand bit is implied.) The number used in the question is about 64.3 quadrillion. Going past that point (9 quadrillion) requires more bits for the significand (the mantissa) than are allotted, and so rounding occurs. A helpful video explaining why this makes sense is the one by PBS Studios' Infinite Series.
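Since AS3's Number is the same IEEE-754 double format, the effect can be reproduced in any language with 64-bit doubles. Here is a small, hypothetical C++ check of my own (printf shows the exact stored values, whereas AS3's trace prints a shortest-round-trip decimal, hence the ...910 in the question):
#include <cstdio>

int main() {
    // 64336512942563914 lies between 2^55 and 2^56, where consecutive
    // doubles are 8 apart, so it cannot be stored exactly.
    double n  = 64336512942563914.0;   // actually stored as ...912
    double a1 = n + 4;                 // ...916 is halfway; ties round to ...920
    printf("n      = %.0f\n", n);      // 64336512942563912
    printf("a1     = %.0f\n", a1);     // 64336512942563920
    printf("a1 - n = %.0f\n", a1 - n); // 8, matching the AS3 trace
    return 0;
}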
So yeah, one must search for outside resources such as a BigInt library. Hopefully, the resources I linked to are useful to someone.
This is, indeed, a floating point number approximation issue.
I guess n is too large compared to 4, so it has to stick with children of its own age:
trace(n - n + 4) is fine, since it evaluates n - n = 0 first, then 0 + 4 = 4.
Actually, Number is not the type to use for large integers; it is for floating-point numbers. If you want exact integer computation you have to stay within the limit of uint.MAX_VALUE.
Cheers!

CUDA, float precision

I am using CUDA 4.0 on a GeForce GTX 580 (Fermi). I have numbers as small as 7.721155e-43, and I want to multiply them with each other just once; or, to put it better, I want to calculate 7.721155e-43 * 7.721155e-43.
My experience showed me I can't do it straightforwardly. Could you please give me a suggestion? Do I need to use double precision? How?
The magnitude of the smallest normal IEEE single-precision number is about 1.18e-38; the smallest denormal gets you down to about 1.40e-45. As a consequence, an operand of magnitude 7.72e-43 will comprise only about 9 non-zero bits, which in itself may already be a problem, even before you get to the multiplication (whose result will underflow to zero in single precision). So you may also want to look at any upstream computation that produces these tiny numbers.
If these small numbers are intermediate terms in a mathematical expression, rewriting that expression into a mathematically equivalent one that does not involve tiny intermediates would be one way of addressing the issue. Or you could scale some operands by factors that are powers of two (so as to not incur additional round-off due to the scaling). For example, scale by 2^24 = 16777216.
Lastly, you can switch part of the computation to double precision. To do so, simply introduce temporary variables of type double, perform the computation on them, then convert the final result back to float:
float r, f = 7.721155e-43f;
double d, t;
d = (double)f; // explicit cast is not necessary, since converting to wider type
t = d * d;
[... more intermediate computation, leaving result in 't' ...]
r = (float)t; // since conversion is to narrower type, cast will avoid warnings
In statistics we often have to work with likelihoods that end up being very small numbers, and the standard technique is to use logs for everything. Multiplication on a log scale is then just addition, and all intermediate numbers are stored as logs. Indeed, it can take a bit of getting used to, but the alternative will often fail even for relatively modest computations. In R (for my convenience!), which uses doubles and prints 7 significant figures by default, by the way:
> 7.721155e-43 * 7.721155e-43
[1] 5.961623e-85
> exp(log(7.721155e-43) + log(7.721155e-43))
[1] 5.961623e-85
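For reference, here is a minimal host-side C++ sketch of the same log-domain idea (my own illustration, using float to show why it matters in single precision; note that the log of a denormal input may itself misbehave on a GPU compiled with flush-to-zero):
#include <cmath>
#include <cstdio>

int main() {
    float a = 7.721155e-43f, b = 7.721155e-43f;

    // Direct multiplication underflows: the true product ~5.96e-85 is far
    // below the smallest single-precision denormal (~1.4e-45).
    float direct = a * b;
    printf("direct product: %g\n", direct);        // 0

    // In the log domain the product is just a sum, and ~-193.9 is easily
    // representable; keep intermediates as logs and only exponentiate
    // (in double, or after rescaling) if you really need the linear value.
    float log_product = std::log(a) + std::log(b);
    printf("log of product: %g\n", log_product);   // about -193.9
    printf("recovered (double): %g\n", std::exp((double)log_product));
    return 0;
}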

When to use unsigned values over signed ones?

When is it appropriate to use an unsigned variable over a signed one? What about in a for loop?
I hear a lot of opinions about this and I wanted to see if there was anything resembling a consensus.
for (unsigned int i = 0; i < someThing.length(); i++) {
    SomeThing var = someThing.at(i);
    // You get the idea.
}
I know Java doesn't have unsigned values, and that must have been a conscious decision on Sun Microsystems' part.
I was glad to find a good conversation on this subject, as I hadn't really given it much thought before.
In summary, signed is a good general choice, even when you're dead sure all the numbers are positive, if you're going to do arithmetic on the variable (as in a typical for-loop case).
unsigned starts to make more sense when:
You're going to do bitwise things like masks, or
You're desperate to take advantage of that extra bit of positive range.
Personally, I like signed because I don't trust myself to stay consistent and avoid mixing the two types (like the article warns against).
In your example above, when 'i' will always be positive and a higher range would be beneficial, unsigned would be useful. Like if you're using 'declare' statements, such as:
#declare BIT1 (unsigned int 1)
#declare BIT32 (unsigned int reallybignumber)
Especially when these values will never change.
However, if you're doing an accounting program where the people are irresponsible with their money and are constantly in the red, you will most definitely want to use 'signed'.
I do agree with saint though that a good rule of thumb is to use signed, which C actually defaults to, so you're covered.
I would think that if your business case dictates that a negative number is invalid, you would want to have an error shown or thrown.
With that in mind, I only just recently found out about unsigned integers while working on a project processing data in a binary file and storing the data into a database. I was purposely "corrupting" the binary data, and I ended up getting negative values instead of an expected error. I found that even though the value converted without complaint, it was not valid for my business case.
My program did not error, and I ended up getting wrong data into the database. It would have been better if I had used uint and had the program fail.
C and C++ compilers will generate a warning when you compare signed and unsigned types; in your example code, you couldn't make your loop variable unsigned and have the compiler generate code without warnings (assuming said warnings were turned on).
Naturally, you're compiling with warnings turned all the way up, right?
And, have you considered compiling with "treat warnings as errors" to take it that one step further?
The downside of using signed numbers is that there's a temptation to overload them so that, for example, the values 0..n are the menu selection and -1 means nothing is selected, rather than creating a class that has two variables: one to indicate whether something is selected and another to store what that selection is. Before you know it, you're testing for negative one all over the place, and the compiler complains when you want to compare the menu selection against the number of menu selections you have, because they're different types. So don't do that.
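A minimal C++ sketch of that "class with two variables" idea (the names are my own): with no -1 sentinel, the index can stay unsigned and any comparison against a container size involves only one type.
#include <cstddef>

struct MenuSelection {
    bool selected;        // is anything selected at all?
    std::size_t index;    // which entry; meaningful only when selected is true
};

MenuSelection select_item(std::size_t i) { return {true, i}; }
MenuSelection no_selection()             { return {false, 0}; }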
size_t is often a good choice for this, or size_type if you're using an STL class.