What are vector division and multiplication as in CUDA __half2 arithmetic?

__device__ __half2 __h2div ( const __half2 a, const __half2 b )
Description:
Divides half2 input vector a by input vector b in round-to-nearest mode.
__device__ __half2 __hmul2 ( const __half2 a, const __half2 b )
Description:
Performs half2 vector multiplication of inputs a and b, in round-to-nearest-even mode.
Can someone explain to me what exact operations are happening for both of these?

Both are elementwise operations. A __half2 is a vector type, meaning it has multiple elements (2) of a simpler type, namely half (i.e. a 16-bit floating-point quantity). These vector types are basically structures where the individual elements are accessed using the structure references .x, .y, .z, and .w (for vector types of up to 4 elements).
If we have two items (a, b) that are each of __half2 type:
the division operation:
__half2 a,b;
__half2 result = __h2div(a, b);
will create a result where the first element of result is equal to the first element of a divided by the first element of b, and likewise for the second element.
This means when complete, the following statements should "approximately" be correct:
result.x == a.x/b.x;
result.y == a.y/b.y;
The multiplication operation:
__half2 a,b;
__half2 result = __hmul2(a, b);
will create a result where the first element of result is equal to the first element of a multiplied by the first element of b, and likewise for the second element.
This means when complete, the following statements should "approximately" be correct:
result.x == a.x*b.x;
result.y == a.y*b.y;
("approximately" means there may be rounding differences, depending on your exact code and possibly other factors, like compile switches)
Regarding rounding, it's no different than when these terms are applied in other (non-CUDA) contexts. Roughly speaking:
"round to nearest" is what I would consider the usual form of rounding. When an arithmetic result is not exactly representable in the type, the nearest representable value is chosen, so that:
if the exact result is closer to the representable value nearer to zero, the value nearer to zero is chosen
if the exact result is closer to the representable value nearer to positive or negative infinity, the value nearer to positive or negative infinity is chosen
if the exact result is exactly at the midpoint between the two closest representable values, the value nearer to positive or negative infinity is chosen.
"round to nearest even" modifies the above in the exact-midpoint case only: the tie goes to the representable value whose least significant digit is even. For example, with two significant decimal digits, 1.25 rounds to 1.2 under round-to-nearest-even, whereas the midpoint rule above would give 1.3.


Explanation on determining if a decimal number has a finite representation in a base

When trying to find the answer I came across this and was wondering if this is true and why it is.
https://stackoverflow.com/a/489870/5712298
If anyone can explain it to me or link me to a page explaining it that would be great.
Stack Overflow markup does not support mathematical notation well, and most readers of this will be programmers, so I am going to use common programming expression syntax:
* multiplication
^ exponentiation
/ division
x[i] Element i of an array x
== equality
PROD product
This deals with the question of whether, given a radix r terminating fraction a/(r^n), there is a terminating radix s fraction b/(s^m) with exactly the same value, a, b integers, r and s positive integers, n and m non-negative integers.
a/(r^n)==b/(s^m) is equivalent to b==a*(s^m)/(r^n). a/(r^n) is exactly equal to some radix s terminating fraction if, and only if, there exists a non-negative integer m such that a*(s^m)/(r^n) is an integer.
Consider the prime factorization of r, PROD(p[i]^k[i]). If, for some i, p[i]^k[i] is a term in the prime factorization of r, then p[i]^(n*k[i]) is a term in the prime factorization of r^n.
a*(s^m)/(r^n) is an integer if, and only if, every p[i]^(n*k[i]) in the prime factorization of r^n is also a factor of a*(s^m)
First suppose p[i] is also a factor of s. Then for sufficiently large m, p[i]^(n*k[i]) is a factor of s^m.
Now suppose p[i] is not a factor of s. p[i]^(n*k[i]) is a factor of a*(s^m) if, and only if, it is a factor of a.
The necessary and sufficient condition for the existence of a non-negative integer m such that a*(s^m)/(r^n) is an integer b is that, for each p[i]^k[i] in the prime factorization of r, either p[i] is a factor of s or p[i]^(n*k[i]) is a factor of a.
Applying this to the common case of r=10 and s=2, the prime factorization of r is (2^1)*(5^1). 2 is a factor of 2, so we can ignore it. 5 is not, so we need 5^n to be a factor of a.
Consider some specific cases:
Decimal 0.1 is 1/10. 5 is not a factor of 1, so there is no exact binary fraction equivalent.
Decimal 0.625, 625/(10^3). 5^3 is 125, which is a factor of 625, so there is an exact binary fraction equivalent. (It is binary 0.101).
The method in the referenced answer https://stackoverflow.com/a/489870/5712298 is equivalent to this for decimal to binary. It would need some work to extend to the general case, to allow for prime factors whose exponent is not 1.
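If it helps to see the criterion as code, here is a small C sketch (the function name terminates and the integer types are my own choices; it assumes the inputs are small enough not to overflow):

#include <stdbool.h>
#include <stdio.h>

/* Does a/(r^n) have a terminating radix-s expansion?
   Criterion from above: for each prime power p^k in the factorization
   of r, either p is a factor of s, or p^(n*k) is a factor of a. */
bool terminates(long long a, long long r, int n, long long s)
{
    for (long long p = 2; r > 1; p++) {
        if (r % p != 0)
            continue;
        int k = 0;
        while (r % p == 0) { r /= p; k++; }  /* p^k divides r */
        if (s % p == 0)
            continue;                        /* s^m absorbs p for large enough m */
        for (int i = 0; i < n * k; i++) {    /* otherwise p^(n*k) must divide a */
            if (a % p != 0)
                return false;
            a /= p;
        }
    }
    return true;
}

int main(void)
{
    printf("%d\n", terminates(1, 10, 1, 2));   /* 0.1:   prints 0 (no exact binary) */
    printf("%d\n", terminates(625, 10, 3, 2)); /* 0.625: prints 1 (binary 0.101)    */
    return 0;
}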

How to map number in a range to another in the same range with no collisions?

Effectively what I'm looking for is a function f(x) that outputs into a range that is pre-defined. Calling f(f(x)) should be valid as well. The function should be cyclical, so calling f(f(...(x))) where the number of calls is equal to the size of the range should give you the original number, and f(x) should not be time dependent and will always give the same output.
While I can see that taking a list of all possible values and shuffling it would give me something close to what I want, I'd much prefer it if I could simply plug values into the function one at a time so that I do not have to compute the entire range all at once.
I've looked into Minimal Perfect Hash Functions but haven't been able to find one that doesn't use external libraries. I'm okay with using them, but would prefer to not do so.
If an actual range is necessary to help answer my question, I don't think it would need to be bigger than [0, 2^24-1], but the starting and ending values don't matter too much.
You might want to take a look at a linear congruential generator (LCG). You should be looking at a full-period generator (say, m = 2^24), which means the parameters must satisfy the Hull-Dobell theorem.
Calling f(f(x)) should be valid as well.
should work
the number of calls is equal to the size of the range should give you the original number
yes, for an LCG with parameters satisfying the Hull-Dobell theorem you'll get the full period covered once, and the (m+1)-th call will put you back where you started.
Period of such LCG is exactly equal to m
should not be time dependent and will always give the same output
An LCG is an O(1) algorithm and it is 100% reproducible.
An LCG is reversible as well, via the extended Euclidean algorithm; check Reversible pseudo-random sequence generator for details.
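A minimal sketch of such a full-period LCG for m = 2^24 (the multiplier and increment below are the familiar Numerical Recipes constants; any pair meeting the Hull-Dobell conditions for a power-of-two modulus works, i.e. c odd and a - 1 divisible by 4):

#include <stdio.h>
#include <stdint.h>

#define M ((uint32_t)1 << 24)    /* range size m = 2^24 */

/* Full-period LCG: the period is exactly M by the Hull-Dobell theorem. */
static uint32_t f(uint32_t x)
{
    return (1664525u * x + 1013904223u) & (M - 1);
}

int main(void)
{
    /* the (m+1)-th value equals the first: apply f exactly M times */
    uint32_t x0 = 12345, x = x0;
    for (uint32_t i = 0; i < M; i++)
        x = f(x);
    printf("%s\n", x == x0 ? "back where we started" : "bug");
    return 0;
}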
Minimal perfect hash functions are overkill; all you've asked for is a function f that is:
bijective, and
"cyclical" (i.e. f^N = identity, where N is the size of the range)
For a permutation to be cyclical in that way, its order must divide N (being exactly N is just a special case of dividing N). That in turn means the LCM of the orders of its sub-cycles must divide N. One way to achieve that is to have a single cycle of order N. For power-of-two N it is also really easy to have lots of small cycles of some other power-of-two order. General permutations do not necessarily satisfy the cycle requirement: they are of course bijective, but the LCM of the orders of their sub-cycles may fail to divide N.
In the following I will leave all reduction modulo N implicit. Without loss of generality I will assume the range starts at 0 and goes up to N-1, where N is the size of the range.
The only thing I can immediately think of for general N is f(x) = x + c where gcd(c, N) == 1. The GCD condition ensures there is only one cycle, which necessarily has order N.
For power-of-two N I have more inspiration:
f(x) = c*x where c is odd. Bijective because gcd(c, N) == 1, so c has a modular multiplicative inverse. Also c^N == 1 (mod N): phi(N) = N/2 (since N is a power of two), so c^phi(N) == 1 (Euler's theorem), and phi(N) divides N.
f(x) = x XOR c where c < N. Trivially bijective and trivially cycles with a period of 2, which divides N.
f(x) = clmul(x, c) where c is odd and clmul is carry-less multiplication. Bijective because any odd c has a carry-less multiplicative inverse. Has some power-of-two cycle length (at most N), so it divides N; I don't know why, though. This is a weird one, but it has decent special cases such as x ^ (x << k). By symmetry, the "mirrored" version also works, e.g. x ^ (x >> k).
f(x) = x >>> k where >>> is bit-rotation. Obviously bijective, and f^N(x) rotates x by N*k bits in total; as long as the word width is itself a power of two (and hence divides the power-of-two N), N*k is a multiple of the width, so everything rotates back to the unrotated position regardless of k.
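To see the order-divides-N property concretely, here is a small C sketch that measures the cycle length of two of these constructions on a 24-bit range (the constant 0x9E3779 and the shift amount 5 are arbitrary picks):

#include <stdio.h>
#include <stdint.h>

#define MASK 0xFFFFFFu   /* N = 2^24, keep values in [0, N) */

static uint32_t mul_odd(uint32_t x)  { return (x * 0x9E3779u) & MASK; } /* f(x) = c*x, c odd  */
static uint32_t xorshift(uint32_t x) { return (x ^ (x << 5)) & MASK; }  /* clmul special case */

/* Length of the cycle containing x0 under repeated application of f. */
static uint32_t cycle_length(uint32_t (*f)(uint32_t), uint32_t x0)
{
    uint32_t x = f(x0), n = 1;
    while (x != x0) { x = f(x); n++; }
    return n;
}

int main(void)
{
    printf("c*x:      cycle %u\n", cycle_length(mul_odd, 1));  /* a power of two */
    printf("x^(x<<5): cycle %u\n", cycle_length(xorshift, 1)); /* 8, which divides N */
    return 0;
}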

Reverse function

I have been trying to reverse a quite simple-looking function.
the function is presented in assembly:
(Argument is loaded into AX)
AND AX, 0xFFFE (round down to even number)
MUL AX (Multiply AX by AX ; the result is represented as DX:AX)
XOR AX,DX
The function can be described as: H(X) = F(X & 0xFFFE); F(X) = ((X * X) mod 2^16) xor ((X * X) div 2^16)
I calculated all of the values from 1 to 2^16 and plotted them in MATLAB in order to "see" the function.
Can anyone help me find an answer to this? (when given y what is the argument x).
It might be that for some values there is more than one answer, so narrowing it down is my goal.
Thanks,
Or.
It's a hash function.
You can't reverse a hash function, because the whole point of it is that it's a one-way function.
The multiply is clearly reversible; it's the xor that's not. By combining the low and high parts of the multiplication you lose information.
As you can see in the plot there are some white spaces; because there are 2^16 positions in that plot, that means different input values must hash to the same output value.
This is common in a hash function.
The only way to 'reverse' it is to build a lookup table that translates output values into possible input values. However, you will find that for every output value there may be 1 or more input values.
An even number times an even number is always a multiple of 4.
So the low 2 bits are always 0, ergo the low 2 bits of the result are bits 16+17 of the multiplication.
Bits 2..15 are a mix of bits 2..15 xor bits 18..31.
A quick simulation shows 24350 unique outputs, ergo on average 1.34 inputs per output value (i.e. 0.34 duplicates); not bad.
The maximum number of collisions is 6, but most numbers don't collide.
For all those numbers that don't collide you can uniquely look up your input value in the table (all this disregarding odd input values, obviously).
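Here is a small C sketch of that simulation, counting preimages per output value (the function and array names are mine; it should reproduce the figures above):

#include <stdio.h>
#include <stdint.h>

/* H(X) = F(X & 0xFFFE); F(X) = ((X*X) mod 2^16) xor ((X*X) div 2^16) */
static uint16_t H(uint16_t x)
{
    x &= 0xFFFE;                        /* AND AX, 0xFFFE */
    uint32_t p = (uint32_t)x * x;       /* MUL AX -> DX:AX */
    return (uint16_t)(p ^ (p >> 16));   /* XOR AX, DX */
}

int main(void)
{
    static uint8_t count[65536];        /* preimages per output value */
    for (uint32_t x = 0; x < 65536; x += 2)   /* even inputs only */
        count[H((uint16_t)x)]++;

    uint32_t unique = 0, worst = 0;
    for (uint32_t y = 0; y < 65536; y++) {
        if (count[y]) unique++;
        if (count[y] > worst) worst = count[y];
    }
    printf("unique outputs: %u, worst collision count: %u\n", unique, worst);
    return 0;
}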

How to interpret the result from KissFFT's kiss_fftr (FFT for a real signal) function

I'm using KissFFT's real function to transform some real audio signals. I'm confused, since I input a real signal with nfft samples, but the result is nfft/2+1 complex frequency bins.
From KissFFT's README:
The real (i.e. not complex) optimization code only works for even length ffts. It does two half-length FFTs in parallel (packed into real&imag), and then combines them via twiddling. The result is nfft/2+1 complex frequency bins from DC to Nyquist.
So I have no concrete knowledge of how to interpret the result. My assumption is the data is packed like r[0]i[0]r[1]i[1]...r[nfft/2]i[nfft/2], where r[0] would be DC, i[0] is the first frequency bin, r[1] the second, and so on. Is this the case?
Yes. The reason kiss_fftr makes only Nfft/2+1 bins is that the DFT of a real signal is conjugate-symmetric. The coefficients corresponding to negative frequencies (-pi:0 or pi:2pi, whichever way you like to think about it) are the conjugated coefficients from [0:pi).
Note the out[0] and out[Nfft/2] bins (DC and Nyquist) have zero in the imaginary part. I've seen some libraries pack these two real parts together in the first complex, but I view that as a breach of contract that leads to difficult-to-diagnose, nearly-right bugs.
Tip: If you are using float for your data type (the default), you can cast the output array to float complex* (C99) or std::complex<float>* (C++). The memory layout of the kiss_fft_cpx struct is compatible. The reason it doesn't use these types by default is that kiss_fft works with other types besides float and double, and on older ANSI C compilers that lack these features.
Here's a contrived example (assuming a C99 compiler and type == float):
#include <stdlib.h>
#include <complex.h>
#include "kiss_fftr.h"

float get_nth_bin_phase(const float * in, int nfft, int whichbin)
{
    kiss_fftr_cfg st = kiss_fftr_alloc(nfft, 0, 0, 0);
    float complex * out = malloc(sizeof(float complex) * (nfft/2 + 1));
    float ph;
    kiss_fftr(st, in, (kiss_fft_cpx*)out);          /* transform the real input */
    whichbin %= nfft;
    if (whichbin <= nfft/2)
        ph = cargf(out[whichbin]);                   /* bin in [0, Nyquist] */
    else
        ph = cargf(conjf(out[nfft - whichbin]));     /* negative frequency: use conjugate symmetry */
    free(out);
    kiss_fft_free(st);
    return ph;
}
r[1] and i[1] of the fftr result constitute a complex value. Together they give you the magnitude (the square root of the sum of the squares of the two components) and the phase (via atan2()) of the first frequency bin.
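For instance, a sketch of that computation (the helper name is mine; kiss_fft_cpx has members .r and .i when the scalar type is float):

#include <math.h>
#include "kiss_fft.h"

/* Magnitude and phase of bin k of a kiss_fftr output array. */
static void bin_mag_phase(const kiss_fft_cpx *out, int k,
                          float *mag, float *phase)
{
    *mag   = sqrtf(out[k].r * out[k].r + out[k].i * out[k].i);
    *phase = atan2f(out[k].i, out[k].r);
}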

For any finite floating point value, is it guaranteed that x - x == 0?

Floating point values are inexact, which is why we should rarely use strict numerical equality in comparisons. For example, in Java this prints false (as seen on ideone.com):
System.out.println(.1 + .2 == .3);
// false
Usually the correct way to compare results of floating point calculations is to see if the absolute difference against some expected value is less than some tolerated epsilon.
System.out.println(Math.abs(.1 + .2 - .3) < .00000000000001);
// true
The question is about whether or not some operations can yield exact result. We know that for any non-finite floating point value x (i.e. either NaN or an infinity), x - x is ALWAYS NaN.
But if x is finite, is any of this guaranteed?
x * -1 == -x
x - x == 0
(In particular I'm most interested in Java behavior, but discussions for other languages are also welcome.)
For what it's worth, I think (and I may be wrong here) the answer is YES! I think it boils down to whether or not, for any finite IEEE-754 floating point value, its additive inverse is always computable exactly. Since e.g. float and double have one dedicated bit just for the sign, this seems to be the case, since finding the additive inverse only requires flipping the sign bit (i.e. the significand is left intact).
Related questions
Correct Way to Obtain The Most Negative Double
How many double numbers are there between 0.0 and 1.0?
Both equalities are guaranteed with IEEE 754 floating-point, because the results of both x-x and x * -1 are representable exactly as floating-point numbers of the same precision as x. In this case, regardless of the rounding mode, the exact values have to be returned by a compliant implementation.
EDIT: Comparing to the .1 + .2 example.
You can't add .1 and .2 in IEEE 754 because you can't represent them to pass to +. Addition, subtraction, multiplication, division and square root return the unique floating-point value which, depending on the rounding mode, is immediately below, immediately above, nearest with a rule to handle ties, ..., the result of the operation on the same arguments in R. Consequently, when the result (in R) happens to be representable as a floating-point number, this number is automatically the result regardless of the rounding mode.
The fact that your compiler lets you write 0.1 as shorthand for a different, representable number without a warning is orthogonal to the definition of these operations. When you write - (0.1) for instance, the - is exact: it returns exactly the opposite of its argument. On the other hand, its argument is not 0.1, but the double that your compiler uses in its place.
In short, another part of the reason why the operation x * (-1) is exact is that -1 can be represented as a double.
Although x - x may give you -0 rather than true 0, -0 compares as equal to 0, so you will be safe with your assumption that any finite number minus itself will compare equal to zero.
See Is there a floating point value of x, for which x-x == 0 is false? for more details.
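A quick sketch exercising both identities on a few finite values, written in C for consistency with the other examples here (the same holds in Java; the test values are arbitrary and include the largest normal double and a subnormal):

#include <stdio.h>

int main(void)
{
    double xs[] = { 0.1, -2.5, 1.7976931348623157e308, 4.9e-324 };
    for (int i = 0; i < 4; i++) {
        double x = xs[i];
        printf("x = %g: x - x == 0 is %d, x * -1 == -x is %d\n",
               x, x - x == 0.0, x * -1.0 == -x);
    }
    return 0;   /* under IEEE 754 every line prints 1 and 1 */
}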