FLT_MAX for half floats - cuda

I am using CUDA with half floats, or __half as they are called in CUDA.
What is the half-float equivalent of FLT_MAX?
The cuda_fp16.h header does not seem to have a macro that resembles this.
$ grep MAX /usr/local/cuda-11.1/targets/x86_64-linux/include/cuda_fp16.h
$

I needed similar macros once before (not in CUDA though) and found some constants in this C++ fp16 proposal for short floats.
The "S" prefix comes from the proposed "short" in short float.
// Smallest positive short float
#define SFLT_MIN 5.96046448e-08
// Smallest positive
// normalized short float
#define SFLT_NRM_MIN 6.10351562e-05
// Largest positive short float
#define SFLT_MAX 65504.0
// Smallest positive e
// for which (1.0 + e) != (1.0)
#define SFLT_EPSILON 0.00097656
// Number of digits in mantissa
// (significand + hidden leading 1)
#define SFLT_MANT_DIG 11
// Number of base 10 digits that
// can be represented without change
#define SFLT_DIG 2
// Base of the exponent
#define SFLT_RADIX 2
// Minimum negative integer such that
// SFLT_RADIX raised to the power of
// one less than that integer is a
// normalized short float
#define SFLT_MIN_EXP -13
// Maximum positive integer such that
// SFLT_RADIX raised to the power of
// one less than that integer is a
// normalized short float
#define SFLT_MAX_EXP 16
// Minimum negative integer such
// that 10 raised to that power is
// a normalized short float
#define SFLT_MIN_10_EXP -4
// Maximum positive integer such
// that 10 raised to that power is
// a normalized short float
#define SFLT_MAX_10_EXP 4
You can also find similar constants in the half.hpp library.
NOTE: I am not sure what the CUDA compiler supports regarding fp16 literals, so you may need to express these constants as hex bit patterns and reinterpret the bits as __half (note: reinterpret, not convert/cast).
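For instance, here is a minimal sketch of that bit-reinterpretation approach using the __half_raw type from cuda_fp16.h, compiled with nvcc (the helper names are mine; the bit patterns encode the binary16 limits from the table above):
#include <cuda_fp16.h>
#include <cstdio>

// binary16 bit patterns: sign | 5-bit exponent | 10-bit mantissa
// 0x7BFF = 0 11110 1111111111 -> largest finite half, 65504.0
// 0x0400 = 0 00001 0000000000 -> smallest positive normal, ~6.10352e-05
// 0x0001 = 0 00000 0000000001 -> smallest positive subnormal, ~5.96046e-08
__half half_max()     { __half_raw r; r.x = 0x7BFF; return __half(r); }
__half half_nrm_min() { __half_raw r; r.x = 0x0400; return __half(r); }
__half half_min()     { __half_raw r; r.x = 0x0001; return __half(r); }

int main() {
    printf("SFLT_MAX as __half: %f\n", __half2float(half_max())); // prints 65504.000000
    return 0;
}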
None of this is ideal and if someone can point you to some cuda_fp16_limits.h file, then favor that answer over this one.

Related


GSL Fast-Fourier Transform - Non-zero Imaginary for Transformed Gaussian?

As an extension to this question that I asked. The Fourier transform of a real Gaussian is a real Gaussian. Now of course a DFT of a set of points that only resemble a Gaussian will not always be a perfect Gaussian, but it should certainly be close. In the code below I'm taking this [discrete] Fourier transform using GSL. Aside from the issue of the returned/transformed real components (outlined in the linked question), I'm getting a weird result for the imaginary component (which should be identically zero). Granted, it's very small in magnitude, but it's still weird. What is the cause of this asymmetric and funky output?
#include <gsl/gsl_fft_complex.h>
#include <gsl/gsl_errno.h>
#include <cmath> // for pow() and exp()
#include <fstream>
#include <iostream>
#include <iomanip>
#define REAL(z,i) ((z)[2*(i)]) //complex arrays stored as [Re(z0),Im(z0),Re(z1),Im(z1),...]
#define IMAG(z,i) ((z)[2*(i)+1])
#define MODU(z,i) ((z)[2*(i)])*((z)[2*(i)])+((z)[2*(i)+1])*((z)[2*(i)+1])
#define PI 3.14159265359
using namespace std;
int main(){
int n = pow(2,9);
double data[2*n]; // variable-length array: a GCC extension, not standard C++
double N = (double) n;
ofstream file_out("out.txt");
double xmin=-10.;
double xmax=10.;
double dx=(xmax-xmin)/N;
double x=xmin;
for (int i=0; i<n; ++i){
REAL(data,i)=exp(-100.*x*x);
IMAG(data,i)=0.;
x+=dx;
}
gsl_fft_complex_radix2_forward(data, 1, n);
for (int i=0; i<n; ++i){
file_out<<(i-n/2)<<" "<<IMAG(data,((i+n/2)%n))<<'\n';
}
file_out.close();
}
Your result for the imaginary part is correct and expected.
The deviation from zero (~10^-15) is smaller than the accuracy of the value you give for pi (12 digits; pi is used inside the FFT, though I can't tell whether your PI macro overrides the one inside the routine).
The FFT of a real function is not in general a real function. When you do the math analytically you integrate over the following expression:
f(t) e^{i w t} = f(t) cos wt + i f(t) sin wt,
so only if the function f(t) is real and even will the imaginary part (which is otherwise odd) vanish during integration. This has little meaning though, since the real part and imaginary part have physical meaning only in special cases.
Direct physical meaning is in the abs value (magnitude spectrum), the abs. value squared (intensity spectrum) and the phase or angle (phase spectrum).
A more significant offset from zero in the imaginary part would appear if the Gaussian weren't centered in your time window. Try shifting the x vector by some fraction of dx.
See below how the shift of the input by dx/2 (right column) affects the imaginary part, but not the magnitude (example written in Python, Numpy).
from __future__ import division
import numpy as np
import matplotlib.pyplot as p
%matplotlib inline
n=512 # number of samples 2**9
x0,x1=-10,10
dx=(x1-x0)/n
x= np.arange(-10,10,dx) # even number, asymmetric range [-10, 10-dx]
#make signal
s1= np.exp(-100*x**2)
s2= np.exp(-100*(x+dx/2 )**2)
#make ffts
f1=np.fft.fftshift(np.fft.fft(s1))
f2=np.fft.fftshift(np.fft.fft(s2))
#plots
p.figure(figsize=(16,12))
p.subplot(421)
p.title('gaussian (just ctr shown)')
p.plot(s1[250:262])
p.subplot(422)
p.title('same, shifted by dx/2')
p.plot(s2[250:262])
p.subplot(423)
p.plot(np.imag(f1))
p.title('imaginary part of FFT')
p.subplot(424)
p.plot(np.imag(f2))
p.subplot(425)
p.plot(np.real(f1))
p.title('real part of FFT')
p.subplot(426)
p.plot(np.real(f2))
p.subplot(427)
p.plot(np.abs(f1))
p.title('abs. value of FFT')
p.subplot(428)
p.plot(np.abs(f2))

How to look at a certain bit in C programming?

I'm having trouble finding a function to look at a certain bit. If, for example, I had the binary number 1111 1111 1111 1011 and I wanted to look at just the most significant bit (the bit all the way to the left, in this case 1), what function could I use to look at just that bit?
The program is to test whether a binary number is positive or negative. I started off with the hex number 0x0005 and then used a two's complement function to make it negative. But now I need a way to check whether the first bit is 1 or 0 and to return a value based on that. The integer n would be equal to 1 or 0 depending on whether the number is negative or positive. My code is as follows:
#include <msp430.h>
signed long x=0x0005;
int y,i,n;
void main(void)
{
y=~x;
i=y+1;
}
There are two main ways I have done something like this in the past. The first is a bit mask, which you would use if you are always checking the exact same bit(s). For example:
#define MASK 0x80000000u
// A return value of 0 means the bit wasn't set; nonzero means it was.
// You can check as many bits as you want with this call.
unsigned ApplyMask(unsigned number) {
    return number & MASK;
}
Second is a bit shift, then a mask (for getting an arbitrary bit):
unsigned CheckBit(unsigned number, int bitIndex) {
    return number & (1u << bitIndex); // 1u avoids undefined behavior when bitIndex is 31
}
One or the other of these should do what you are looking for. Best of luck!
#include <assert.h>
#include <stdbool.h>

bool isSetBit (signed long number, int bit)
{
    assert ((bit >= 0) && (bit < (int)(sizeof (signed long) * 8)));
    /* Shift an unsigned 1: shifting a 1 into the sign bit of a signed type is undefined. */
    return ((unsigned long) number & (((unsigned long) 1) << bit)) != 0;
}
To check the sign bit:
if (isSetBit (y, sizeof (y) * 8 - 1))
...
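Putting this together with the two's-complement setup from the question, here is a small self-contained sketch (ordinary host C rather than MSP430 code; variable names follow the question):
#include <stdio.h>

int main (void)
{
    signed long x = 0x0005;
    signed long y = ~x + 1;  /* two's complement negation: y == -5 */
    int n;

    /* Shift-based test of the sign (most significant) bit. */
    n = (int)(((unsigned long) y >> (sizeof (y) * 8 - 1)) & 1);
    printf ("sign bit of y: %d\n", n);  /* prints 1 */

    /* Equivalent and simpler: let the compiler do it. */
    n = (y < 0) ? 1 : 0;
    printf ("y is negative: %d\n", n);  /* prints 1 */
    return 0;
}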

Translation from Complex-FFT to Finite-Field-FFT

Good afternoon!
I am trying to develop an NTT algorithm based on the naive recursive FFT implementation I already have.
Consider the following code (coefficients' length, let it be m, is an exact power of two):
/// <summary>
/// Calculates the result of the recursive Number Theoretic Transform.
/// </summary>
/// <param name="coefficients"></param>
/// <returns></returns>
private static BigInteger[] Recursive_NTT_Skeleton(
IList<BigInteger> coefficients,
IList<BigInteger> rootsOfUnity,
int step,
int offset)
{
// Calculate the length of vectors at the current step of recursion.
// -
int n = coefficients.Count / step - offset / step;
if (n == 1)
{
return new BigInteger[] { coefficients[offset] };
}
BigInteger[] results = new BigInteger[n];
IList<BigInteger> resultEvens =
Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset);
IList<BigInteger> resultOdds =
Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset + step);
for (int k = 0; k < n / 2; k++)
{
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
}
return results;
}
It worked for complex FFT (replace BigInteger with a complex numeric type (I had my own)). It doesn't work here even though I changed the procedure of finding the primitive roots of unity appropriately.
Supposedly, the problem is this: rootsOfUnity parameter passed originally contained only the first half of m-th complex roots of unity in this order:
omega^0 = 1, omega^1, omega^2, ..., omega^(n/2)
It was enough, because on these three lines of code:
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
I originally made use of the fact, that at any level of recursion (for any n and i), the complex root of unity -omega^(i) = omega^(i + n/2).
However, that property obviously doesn't hold in finite fields. But is there any analogue of it which would allow me to still compute only the first half of the roots?
Or should I extend the cycle from n/2 to n and pre-compute all the m-th roots of unity?
Maybe there are other problems with this code?..
Thank you very much in advance!
I recently wanted to implement NTT for fast multiplication instead of DFFT too. I read a lot of confusing things, with different letters everywhere and no simple solution, and my finite-fields knowledge is rusty, but today I finally got it right (after 2 days of trying and drawing analogies with DFT coefficients), so here are my insights for NTT:
Computation
X(i) = sum(j=0..n-1) of ( Wn^(i*j) * x(j) );
where X[] is the NTT-transformed x[] of size n and Wn is the NTT basis. All computations are integer modular arithmetic mod p; no complex numbers anywhere.
Important values
Wn = r ^ L mod p is the basis for NTT
Wn = r ^ (p-1-L) mod p is the basis for INTT
Rn = n ^ (p-2) mod p is the scaling multiplicative constant for INTT, ~(1/n)
p is a prime such that p mod n == 1 and p > max'
max is the max value of x[i] for NTT or X[i] for INTT
r is from the interval <1,p), i.e. 1 <= r < p
L is from the interval <1,p) and also divides p-1
r, L must be chosen so that r^(L*i) mod p == 1 if i = 0 or i = n
r, L must be chosen so that r^(L*i) mod p != 1 if 0 < i < n
max' is the sub-result max value and depends on n and the type of computation. For a single (I)NTT it is max' = n*max, but for convolution of two n-sized vectors it is max' = n*max*max, etc. See Implementing FFT over finite fields for more info about it.
a working combination of r, L, p is different for different n
This is important: you have to recompute or select the parameters from a table before each NTT layer (n is always half of the previous recursion level's).
Here is my C++ code that finds the r, L, p parameters (it needs modular arithmetic, which is not included; you can replace the calls with (a+b)%c, (a-b)%c, (a*b)%c, ..., but in that case beware of overflows, especially in modpow and modmul). The code is not optimized yet; there are ways to speed it up considerably. Also, the prime table is fairly limited, so either use a Sieve of Eratosthenes or any other algorithm to obtain primes up to max' in order to work safely.
DWORD _arithmetics_primes[]=
{
2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,
179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409,
419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,
661,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883,887,907,911,919,929,937,941,
947,953,967,971,977,983,991,997,1009,1013,1019,1021,1031,1033,1039,1049,1051,1061,1063,1069,1087,1091,1093,1097,1103,1109,1117,1123,1129,1151,
0}; // end of table is 0, the more primes are there the bigger numbers and n can be used
// compute NTT consts W=r^L%p for n
int i,j,k,n=16;
long w,W,iW,p,r,L,l,e;
long max=81*n; // edit1: max num for NTT for my multiplication purposes
for (e=1,j=0;e;j++) // find prime p that p%n=1 AND p>max ... 9*9=81
{
p=_arithmetics_primes[j];
if (!p) break;
if ((p>max)&&(p%n==1))
for (r=2;r<p;r++) // check all r
{
for (l=1;l<p;l++)// all l that divide p-1
{
L=(p-1);
if (L%l!=0) continue;
L/=l;
W=modpow(r,L,p);
e=0;
for (w=1,i=0;i<=n;i++,w=modmul(w,W,p))
{
if ((i==0) &&(w!=1)) { e=1; break; }
if ((i==n) &&(w!=1)) { e=1; break; }
if ((i>0)&&(i<n)&&(w==1)) { e=1; break; }
}
if (!e) break;
}
if (!e) break;
}
}
if (e) { /* error: no combination r,L,p found for this n */ }
W=modpow(r, L,p); // Wn for NTT
iW=modpow(r,p-1-L,p); // Wn for INTT
And here are my slow NTT and INTT implementations (I haven't gotten to fast NTT/INTT yet); both are tested successfully with Schönhage–Strassen multiplication.
//---------------------------------------------------------------------------
void NTT(long *dst,long *src,long n,long m,long w)
{
long i,j,wj,wi,a,n2=n>>1;
for (wj=1,j=0;j<n;j++)
{
a=0;
for (wi=1,i=0;i<n;i++)
{
a=modadd(a,modmul(wi,src[i],m),m);
wi=modmul(wi,wj,m);
}
dst[j]=a;
wj=modmul(wj,w,m);
}
}
//---------------------------------------------------------------------------
void INTT(long *dst,long *src,long n,long m,long w)
{
long i,j,wi=1,wj=1,rN,a,n2=n>>1;
rN=modpow(n,m-2,m);
for (wj=1,j=0;j<n;j++)
{
a=0;
for (wi=1,i=0;i<n;i++)
{
a=modadd(a,modmul(wi,src[i],m),m);
wi=modmul(wi,wj,m);
}
dst[j]=modmul(a,rN,m);
wj=modmul(wj,w,m);
}
}
//---------------------------------------------------------------------------
dst is destination array
src is source array
n is array size
m is modulus (p)
w is basis (Wn)
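For example, a round-trip sketch (assuming the modular helpers exist and W, iW, p were found by the parameter search above):
// Round-trip test: INTT(NTT(x)) should reproduce x.
long x[16], X[16], y[16];      // n = 16 as in the parameter search above
for (int i = 0; i < 16; i++) x[i] = i % 10;
NTT (X, x, 16, p, W);          // forward transform with basis W
INTT(y, X, 16, p, iW);         // inverse transform with basis iW
// now y[i] == x[i] for all i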
Hope this helps someone. If I forgot something, please write...
[edit1: fast NTT/INTT]
Finally I managed to get fast NTT/INTT to work. It was a little trickier than the normal FFT:
//---------------------------------------------------------------------------
void _NFTT(long *dst,long *src,long n,long m,long w)
{
if (n<=1) { if (n==1) dst[0]=src[0]; return; }
long i,j,a0,a1,n2=n>>1,w2=modmul(w,w,m);
// reorder even,odd
for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
for ( j=1;i<n ;i++,j+=2) dst[i]=src[j];
// recursion
_NFTT(src ,dst ,n2,m,w2); // even
_NFTT(src+n2,dst+n2,n2,m,w2); // odd
// restore results
for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w,m))
{
a0=src[i];
a1=modmul(src[j],w2,m);
dst[i]=modadd(a0,a1,m);
dst[j]=modsub(a0,a1,m);
}
}
//---------------------------------------------------------------------------
void _INFTT(long *dst,long *src,long n,long m,long w)
{
long i,rN;
rN=modpow(n,m-2,m);
_NFTT(dst,src,n,m,w);
for (i=0;i<n;i++) dst[i]=modmul(dst[i],rN,m);
}
//---------------------------------------------------------------------------
[edit3]
I have optimized my code (3x faster than the code above), but I was still not satisfied with it, so I started a new question about it. There I optimized my code even further (about 40x faster than the code above), so it is now almost the same speed as a floating-point FFT of the same bit size. The link to it is here:
Modular arithmetics and NTT (finite field DFT) optimizations
To turn the Cooley-Tukey (complex) FFT into a modular-arithmetic approach, i.e. NTT, you must replace the complex definition of omega. For the approach to be purely recursive, you also need to recalculate omega at each level based on the current signal size. This is possible because the minimum suitable modulus decreases as we move down the call tree, so the modulus used for the root is suitable for the lower layers. Additionally, since we use the same modulus, the same generator may be used as we move down the call tree. For the inverse transform, you take an additional step: instead of the recalculated omega a, use b = a^(-1), computed via modular inversion. Specifically, b = invMod(a, N) such that b * a == 1 (mod N), where N is the chosen prime modulus.
Rewriting an expression involving omega by exploiting periodicity still works in the modular-arithmetic realm. You also need a way to determine the modulus (a prime) for the problem, and a valid generator.
Your code works, by the way, though it is not an MWE; we extended it using common sense and got a correct result for a polynomial-multiplication application. You just have to provide the correct values of omega raised to the appropriate powers.
While your code works, it, like code from many other sources, doubles the spacing at each level. That does not lead to recursion that is as clean; it turns out to be identical to recalculating omega based on the current signal size, because the power in the omega definition is inversely proportional to the signal size. To reiterate: halving the signal size is like squaring omega, which is like doubling the powers of omega (which is what doubling the spacing does). The nice thing about recalculating omega instead is that each subproblem is more cleanly complete in its own right.
There is a paper by Baktir and Sunar from 2006 that shows some of the math for the modular approach; see the reference at the end of this post.
You do not need to extend the cycle from n / 2 to n.
So, yes, sources that say to just drop in a different omega definition for the modular-arithmetic approach are sweeping many details under the rug.
Another issue is that the signal size must be large enough if the resulting time-domain signal is not to overflow when you perform convolution. Additionally, fast implementations of modular exponentiation exist and are useful here, since the powers involved can be quite large.
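As a concrete illustration of the recalculated-omega idea, here is a minimal recursive sketch (hypothetical helper names modpow/invMod; it assumes p is prime, omega is a primitive n-th root of unity mod p, n is a power of two, and p < 2^32 so that products fit in 64 bits). Note the + p before the subtraction, which also sidesteps the negative remainders that C#'s % operator can produce:
#include <cstdint>
#include <vector>

// Hypothetical helpers (shown for self-containment).
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1; b %= m;
    for (; e; e >>= 1, b = (b * b) % m)
        if (e & 1) r = (r * b) % m;
    return r;
}
static uint64_t invMod(uint64_t a, uint64_t p) { return modpow(a, p - 2, p); } // p prime

// Recursive NTT: omega must be a primitive n-th root of unity mod p.
// Each level passes omega^2 down, i.e. omega is recalculated per level.
static std::vector<uint64_t> ntt(const std::vector<uint64_t>& a, uint64_t omega, uint64_t p) {
    size_t n = a.size();
    if (n == 1) return a;
    std::vector<uint64_t> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
    uint64_t omega2 = (omega * omega) % p;
    std::vector<uint64_t> e = ntt(even, omega2, p), o = ntt(odd, omega2, p);
    std::vector<uint64_t> out(n);
    uint64_t w = 1;
    for (size_t k = 0; k < n / 2; ++k, w = (w * omega) % p) {
        uint64_t t = (w * o[k]) % p;
        out[k] = (e[k] + t) % p;
        out[k + n / 2] = (e[k] + p - t) % p; // + p keeps the subtraction non-negative
    }
    return out;
}

// Inverse: forward transform with omega^(-1), then scale by n^(-1) mod p.
static std::vector<uint64_t> intt(const std::vector<uint64_t>& A, uint64_t omega, uint64_t p) {
    std::vector<uint64_t> a = ntt(A, invMod(omega, p), p);
    uint64_t nInv = invMod(A.size() % p, p);
    for (uint64_t& v : a) v = (v * nInv) % p;
    return a;
}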
References
Baktir and Sunar - Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform (2006)
You must make sure that roots of unity actually exist. In R there are only 2 roots of unity, 1 and -1, since only for them can x^n = 1 hold.
In C you have infinitely many roots of unity: w = exp(2*pi*i/N) is a primitive N-th root of unity, and all w^k for 0 <= k < N are N-th roots of unity.
Now to your problem: you have to make sure the ring you're working in offers the same property: enough roots of unity.
Schönhage and Strassen (http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) use integers modulo 2^N+1. This ring has enough roots of unity: 2^N == -1 is a 2nd root of unity, 2^(N/2) is a 4th root of unity, and so on. Furthermore, these roots of unity have the advantage that they are powers of two, so multiplication by them can be implemented as binary shifts (with a modulo operation afterwards, which comes down to an add/subtract).
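To sketch why that modulo step reduces to an add/subtract (a hedged illustration, with N <= 31 so that intermediates fit in 64 bits): since 2^N ≡ -1 (mod 2^N + 1), a shifted value can be split at bit N and recombined with a single subtraction.
#include <cstdint>
#include <cassert>

// Compute (x << k) mod (2^N + 1) without any division.
// Assumes 0 <= x <= 2^N and 0 <= k < N, with N <= 31.
int64_t mulPow2Mod(int64_t x, int k, int N) {
    assert(N <= 31 && k >= 0 && k < N && x >= 0 && x <= (int64_t(1) << N));
    int64_t m  = (int64_t(1) << N) + 1;        // modulus 2^N + 1
    int64_t t  = x << k;                       // exact product x * 2^k
    int64_t lo = t & ((int64_t(1) << N) - 1);  // t mod 2^N
    int64_t hi = t >> N;                       // t div 2^N
    // t = hi*2^N + lo ≡ lo - hi (mod 2^N + 1), because 2^N ≡ -1.
    int64_t r = lo - hi;
    return (r < 0) ? r + m : r;
}
For example, with N = 3 (modulus 9), mulPow2Mod(5, 2, 3) splits 20 into hi = 2 and lo = 4 and returns 4 - 2 = 2, which is indeed 20 mod 9.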
I think QuickMul (http://www.cs.nyu.edu/exact/doc/qmul.ps) works modulo 2^N-1.

How many distinct floating-point numbers in a specific range?

How many representable floats are there between 0.0 and 0.5? And how many representable floats are there between 0.5 and 1.0? I'm more interested in the math behind it, and I need the answer for floats and doubles.
For IEEE754 floats, this is fairly straightforward. Fire up the Online Float Calculator and read on.
All pure powers of 2 are represented by a mantissa of 0, which actually means 1.0 due to the implied leading 1. The exponent is corrected by a bias, so 1 and 0.5 are respectively 1.0 × 2^0 and 1.0 × 2^(-1), or in binary:
         S   Ex + 127    Mantissa - 1                 Hex
1:       0   01111111    00000000000000000000000      0x3F800000
        (+)  (0 + 127)   (1.0)
0.5:     0   01111110    00000000000000000000000      0x3F000000
        (+)  (-1 + 127)  (1.0)
Since the floating point numbers represented in this form are ordered the same way as their binary representations, we only need to take the difference of the integral values of the two binary representations and conclude that there are 0x800000 = 2^23, i.e. 8,388,608, single-precision floating point values in the interval [0.5, 1.0).
Similarly, the answer is 2^52 for double and 2^63 for long double.
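As a quick sanity check of that bit-difference argument, here is a sketch (using memcpy for the type pun, to stay clear of aliasing rules):
#include <cstdio>
#include <cstring>
#include <cstdint>

// Reinterpret a float's bits as a 32-bit unsigned integer.
static uint32_t bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    // Non-negative IEEE754 floats are ordered like their bit patterns,
    // so the count of floats in [a, b) is simply bits(b) - bits(a).
    printf("floats in [0.5, 1.0): %u\n", bits(1.0f) - bits(0.5f)); // 8388608 = 2^23
    printf("floats in [0.0, 0.5): %u\n", bits(0.5f) - bits(0.0f)); // 1056964608 = 126 * 2^23
    return 0;
}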
A floating point number in IEEE754 format is between 0.0 (inclusive) and 0.5 (exclusive) if and only if the sign bit is 0 and the exponent is < -1. The mantissa bits can be arbitrary. For float, that makes 2^23 numbers per admissible exponent, for double 2^52. How many admissible exponents are there? For float, the minimal exponent for normalised numbers is -126, giving 125 exponents from -126 to -2; the subnormals and zero (exponent field 0) contribute a further 2^23 values, so there are
126*2^23 = 1056964608
float values in [0, 0.5) and, with the minimal exponent -1022 for double,
1022*2^52 = 4602678819172646912
double values.
Kerrek gave the best explanation :)
Just in case, here is the code to play with other intervals too
http://coliru.stacked-crooked.com/a/7a75ba5eceb49f84
#include <iostream>
#include <cmath>
template<typename T>
unsigned long long int floatCount(T a, T b)
{
if (a > b)
return 0;
if (a == b)
return 1;
unsigned long long int count = 1;
while(a < b) {
a = std::nextafter(a, b);
++count;
}
return count;
}
int main()
{
std::cout << "number of floats in [0.5..1.0] interval are " << floatCount(0.5f, 1.0f);
}
prints
number of floats in [0.5..1.0] interval are 8388609
For 0.0..0.5: you need to worry about exponents from -2 down to as low as possible, and then multiply the count of exponents by the number of distinct values representable in the mantissa.
For every value in that range, if you double it, you get a value in the range 0.5..1.0, and doubling just means bumping up the exponent.
You also need to worry about denormalized numbers, where the mantissa isn't used to represent 1.x but 0.x; these all fall in your lower range, but can't be doubled by bumping up the exponent (since a special value of the exponent field is used to indicate that the value is denormalized).
This isn't an answer per se, but you might get some mileage out of the nextafter function. Something like this ought to help you answer your question, though you'll have to work out the math yourself:
#include <stdio.h>
#include <math.h>

float f = 0;
while (f < 0.5f)
{
    printf("%f (repr: 0x%x)\n", f, *(unsigned *)&f);
    f = nextafterf(f, 0.5f);
}