How many distinct floating-point numbers in a specific range? - language-agnostic

How many representable floats are there between 0.0 and 0.5? And how many representable floats are there between 0.5 and 1.0? I'm more interested in the math behind it, and I need the answer for floats and doubles.

For IEEE754 floats, this is fairly straightforward. Fire up the Online Float Calculator and read on.
All pure powers of 2 are represented by a mantissa of 0, which actually means 1.0 due to the implied leading 1. The exponent is corrected by a bias, so 1 and 0.5 are respectively 1.0 × 2^0 and 1.0 × 2^-1, or in binary:
        S   Ex + 127    Mantissa - 1               Hex
 1:     0   01111111    00000000000000000000000    0x3F800000
            (0 + 127)   (1.0)
 0.5:   0   01111110    00000000000000000000000    0x3F000000
            (-1 + 127)  (1.0)
Since floating point numbers in this form are ordered the same way as their binary representations, we only need to take the difference of the two bit patterns interpreted as integers: 0x3F800000 - 0x3F000000 = 0x800000 = 2^23, i.e. there are 8,388,608 single-precision floating point values in the interval [0.5, 1.0).
Similarly, the answer is 2^52 for double and 2^63 for long double (assuming the x87 80-bit extended format with its 64-bit significand).
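If you want to check this without the calculator, here is a minimal C++ sketch (the helper name bits is my own choice) that copies the float's bits into an integer and subtracts:
#include <cstdint>
#include <cstring>
#include <iostream>
// Copy a float's bits into a 32-bit unsigned integer (memcpy avoids
// strict-aliasing problems).
std::uint32_t bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
int main()
{
    // Positive IEEE 754 floats are ordered like their bit patterns, so the
    // number of representable floats in [0.5, 1.0) is just the difference.
    std::cout << bits(1.0f) - bits(0.5f) << "\n";   // 8388608 = 2^23
}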

A floating point number in IEEE754 format is between 0.0 (inclusive) and 0.5 (exclusive) if and only if the sign bit is 0 and the exponent is < -1. The mantissa bits can be arbitrary, which gives 2^23 numbers per admissible exponent for float and 2^52 for double. How many admissible exponents are there? For float, the minimal exponent for normalised numbers is -126: the normalised exponents -126 through -2 give 125 blocks of values, and the all-zero exponent field (zero plus the subnormals) gives one more, for 126 blocks in total; the same argument gives 1022 blocks for double. So there are
126*2^23 = 1056964608
float values in [0, 0.5) and
1022*2^52 = 4602678819172646912
double values.
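The bit-pattern trick from the first answer verifies this count: 0.0f has bit pattern 0, and non-negative floats are ordered like their bit patterns, so the number of floats in [0, 0.5) is simply the bit pattern of 0.5f. A small C++ sketch, assuming a 32-bit IEEE 754 float:
#include <cstdint>
#include <cstring>
#include <iostream>
int main()
{
    float half = 0.5f;
    std::uint32_t u;
    std::memcpy(&u, &half, sizeof u);
    // u counts every representable float in [0.0, 0.5),
    // zero and the subnormals included.
    std::cout << u << "\n";   // 1056964608 = 126 * 2^23
}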

Kerrek gave the best explanation :)
Just in case, here is some code to play with other intervals too:
http://coliru.stacked-crooked.com/a/7a75ba5eceb49f84
#include <iostream>
#include <cmath>
template<typename T>
unsigned long long int floatCount(T a, T b)
{
    if (a > b)
        return 0;
    if (a == b)
        return 1;
    unsigned long long int count = 1;
    while (a < b) {
        a = std::nextafter(a, b);
        ++count;
    }
    return count;
}
int main()
{
    std::cout << "number of floats in [0.5..1.0] interval are " << floatCount(0.5f, 1.0f);
}
prints
number of floats in [0.5..1.0] interval are 8388609

For 0.0..0.5: you need to worry about exponents from -2 down to as low as possible, and then multiply how many you get times the number of distinct values the mantissa can represent.
For every normal value in that range, doubling it enough times lands it in the range 0.5..1.0, and each doubling just bumps up the exponent.
You also need to worry about unnormalized (subnormal) numbers, where the mantissa isn't used to represent 1.x but 0.x; these all fall in your lower range, and they can't be doubled just by bumping up the exponent (since a particular exponent field value is reserved to indicate that the value is unnormalized).
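To see the "bumping the exponent" behaviour concretely, here is a small C++ sketch (the helper name bits and the sample values are my own choices); it prints the bit patterns before and after doubling a normal and a subnormal float:
#include <cstdint>
#include <cstdio>
#include <cstring>
static std::uint32_t bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
int main()
{
    float normal = 0.375f;   // 1.1 (binary) * 2^-2, a normal number
    float sub    = 1e-41f;   // below the smallest normal float, so subnormal
    // Doubling a normal number only increments the 8-bit exponent field.
    std::printf("%08x -> %08x\n", (unsigned)bits(normal), (unsigned)bits(2.0f * normal));
    // Doubling a subnormal shifts the mantissa instead, because the exponent
    // field is already at its reserved all-zeros value.
    std::printf("%08x -> %08x\n", (unsigned)bits(sub), (unsigned)bits(2.0f * sub));
}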

This isn't an answer per se, but you might get some mileage out of the nextafter function. Something like this ought to help you answer your question, though you'll have to work out the math yourself:
/* needs <stdio.h>, <string.h> and <math.h> */
float f = 0;
while (f < 0.5f)
{
    unsigned repr;
    memcpy(&repr, &f, sizeof repr);   /* read the bit pattern without aliasing trouble */
    printf("%f (repr: 0x%x)\n", f, repr);
    f = nextafterf(f, 0.5f);
}

Related

Calculate __half version of FLT_MAX [duplicate]

I am using CUDA with half floats, or __half as they are called in CUDA.
What is the half-float equivalent of FLT_MAX?
The cuda_fp16.h header does not seem to have a macro that resembles this.
$ grep MAX /usr/local/cuda-11.1/targets/x86_64-linux/include/cuda_fp16.h
$
I needed similar macros once before (not in CUDA though) and found some constants in this C++ fp16 proposal for short floats.
The "S" prefix comes from the proposed "short" in short float.
// Smallest positive short float
#define SFLT_MIN 5.96046448e-08
// Smallest positive
// normalized short float
#define SFLT_NRM_MIN 6.10351562e-05
// Largest positive short float
#define SFLT_MAX 65504.0
// Smallest positive e
// for which (1.0 + e) != (1.0)
#define SFLT_EPSILON 0.00097656
// Number of digits in mantissa
// (significand + hidden leading 1)
#define SFLT_MANT_DIG 11
// Number of base 10 digits that
// can be represented without change
#define SFLT_DIG 2
// Base of the exponent
#define SFLT_RADIX 2
// Minimum negative integer such that
// HALF_RADIX raised to the power of
// one less than that integer is a
// normalized short float
#define SFLT_MIN_EXP -13
// Maximum positive integer such that
// HALF_RADIX raised to the power of
// one less than that integer is a
// normalized short float
#define SFLT_MAX_EXP 16
// Minimum positive integer such
// that 10 raised to that power is
// a normalized short float
#define SFLT_MIN_10_EXP -4
// Maximum positive integer such
// that 10 raised to that power is
// a normalized short float
#define SFLT_MAX_10_EXP 4
You can also find similar constants from the half.hpp library.
NOTE: I am not sure what the CUDA compiler supports regarding fp16 literals, so you might need to convert these to hex and reinterpret the bits as __half (note: reinterpret, not convert/cast).
None of this is ideal and if someone can point you to some cuda_fp16_limits.h file, then favor that answer over this one.
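For what it's worth, the SFLT_MAX value above can be confirmed from the binary16 bit layout. The host-side C++ sketch below just decodes the bit pattern 0x7BFF by hand; on the device, if your CUDA version provides the __ushort_as_half() reinterpret intrinsic, passing it 0x7BFF should give the same value (I have not verified the intrinsic's availability across versions):
#include <cmath>
#include <cstdio>
int main()
{
    // IEEE 754 binary16: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
    // The largest finite value has exponent field 0b11110 and all mantissa bits
    // set, i.e. the bit pattern 0x7BFF.
    unsigned bits     = 0x7BFF;
    int      exponent = (int)((bits >> 10) & 0x1F) - 15;   // remove the bias
    double   mantissa = 1.0 + (bits & 0x3FF) / 1024.0;     // implicit leading 1
    std::printf("%g\n", std::ldexp(mantissa, exponent));   // prints 65504
}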

How to look at a certain bit in C programming?

I'm having trouble trying to find a function to look at a certain bit. If, for example, I had a binary number of 1111 1111 1111 1011, and I wanted to just look at the most significant bit ( the bit all the way to the left, in this case 1) what function could I use to just look at that bit?
The program is to test if a binary number is positive or negative. I started off by using the hex number 0x0005, and then using a two's complement function to make it negative. But now, I need a way to check if the first bit is 1 or 0 and to return a value out of that. The integer n would be equal to 1 or 0 depending on whether it is negative or positive. My code is as follows:
#include <msp430.h>
signed long x = 0x0005;
int y, i, n;
void main(void)
{
    y = ~x;
    i = y + 1;
}
There are two main ways I have done something like this in the past. The first is a bit mask, which you would use if you are always checking the exact same bit(s). For example:
#define MASK 0x80000000
// A return value of 0 means none of the masked bits were set; any non-zero
// value means at least one of them was. You can check as many bits as you
// want with this call.
int ApplyMask(int number) {
    return number & MASK;
}
Second is a bit shift, then a mask (for getting an arbitrary bit):
// Non-zero result means the bit at bitIndex is set.
int CheckBit(int number, int bitIndex) {
    return number & (1 << bitIndex);
}
One or the other of these should do what you are looking for. Best of luck!
bool isSetBit (signed long number, int bit)
{
    assert ((bit >= 0) && (bit < (int)(sizeof (signed long) * 8)));
    /* Shift an unsigned 1 so the test also works for the sign bit. */
    return (number & (((unsigned long) 1) << bit)) != 0;
}
To check the sign bit:
if (isSetBit (y, sizeof (y) * 8 - 1))
    ...
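Putting that together for the case in the question, here is a minimal C++ sketch (the variable names follow the question, and the usual two's complement representation is assumed):
#include <cstdio>
int main()
{
    long x = 0x0005;
    long y = ~x + 1;   // two's complement negation, so y == -5
    // Shift the sign bit down to position 0. Casting to unsigned first avoids
    // the implementation-defined right shift of a negative value.
    int n = (int)((unsigned long)y >> (sizeof(long) * 8 - 1));
    std::printf("y = %ld, n = %d\n", y, n);   // y = -5, n = 1
}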

Howto convert decimal (xx.xx) to binary

This isn't necessarily a programming question, but I'm sure you folks know how to do it. How would I convert floating point numbers into binary?
The number I am looking at is 27.625.
27 would be 11011, but what do I do with the .625?
On paper, a good algorithm to convert the fractional part of a decimal number is the "repeated multiplication by 2" algorithm (see details at http://www.exploringbinary.com/base-conversion-in-php-using-bcmath/, under the heading "dec2bin_f()"). For example, 0.8125 converts to binary as follows:
1. 0.8125 * 2 = 1.625
2. 0.625 * 2 = 1.25
3. 0.25 * 2 = 0.5
4. 0.5 * 2 = 1.0
The integer parts are stripped off and saved at each step, forming the binary result: 0.1101.
If you want a tool to do these kinds of conversions automatically, see my decimal/binary converter.
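If you want the same algorithm in code, here is a small C++ sketch (the function name and the 20-bit cap are my own choices):
#include <iostream>
#include <string>
// Convert the fractional part of x (0 <= x < 1) to a binary string by
// repeated multiplication by 2, as in the worked example above.
std::string fractionToBinary(double x, int maxBits = 20)
{
    std::string out = ".";
    for (int i = 0; i < maxBits && x > 0.0; ++i) {
        x *= 2.0;
        if (x >= 1.0) { out += '1'; x -= 1.0; }
        else          { out += '0'; }
    }
    return out;
}
int main()
{
    std::cout << fractionToBinary(0.625) << "\n";    // .101
    std::cout << fractionToBinary(0.8125) << "\n";   // .1101
}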
Assuming you are not thinking about the inside of a PC, just about binary vs decimal as physically written on a piece of paper:
You know .1 in binary is .5 in decimal, so the .1's place is worth .5 (1/2),
the .01 place is worth .25 (1/4) (half of the previous one),
the .001 place is worth .125 (1/8) (half of 1/4).
Notice how the denominator is progressing just like the whole numbers to the left of the point do (the standard powers-of-2 pattern)? The next is worth 1/16...
So you start with your .625: is it at least .5? Yes, so set the first bit and subtract the .5:
.1 binary, with a decimal remainder of .125.
Now the next spot is worth .25 decimal. Is that more than your current remainder of .125? Yes, so you don't have enough decimal "money" to buy that second spot; it has to be a 0:
.10 binary, still .125 remainder.
Now go to the third position, etc. (Hint: I don't think there will be too much etc.)
There are several different ways to encode a non-integral number in binary. By far the most common type are floating point representations, especially the one codified in IEEE 754.
The code below works for me; you can use it to convert any kind of double value:
private static String doubleToBinaryString( double n ) {
    String val = Integer.toBinaryString((int)n) + ".";   // integer part, then the binary point
    String newN = "0" + ("" + n).substring(("" + n).indexOf("."));
    n = Double.parseDouble(newN);                         // keep only the fractional part
    while ( n > 0 ) {            // while the fraction is greater than zero
        double r = n * 2;        // multiply current fraction (n) by 2
        if ( r >= 1 ) {          // if the ones-place digit >= 1
            val += "1";          // concat a "1" to the end of the result string (val)
            n = r - 1;           // remove the 1 from the current fraction (n)
        } else {                 // if the ones-place digit == 0
            val += "0";          // concat a "0" to the end of the result string (val)
            n = r;               // set the current fraction (n) to the new fraction
        }
    }
    return val;                  // return the string with all appended binary digits
}

How do I fix my output for floating-point imprecision?

I am doing some float manipulation and end up with the following numbers:
-0.5
-0.4
-0.3000000000000000004
-0.2000000000000000004
-0.1000000000000000003
1.10E-16
0.1
0.2
0.30000000000000000004
0.4
0.5
The algorithm is the following:
var inc:Number = nextMultiple(min, stepSize);
trace(String(inc));
private function nextMultiple(x:Number, y:Number) {
    return Math.ceil(x/y)*y;
}
I understand that floats cannot always represent a value exactly in binary, e.g. 1/3. I also know my step size is 0.1. Given the step size, how could I get a proper output?
The strange thing is that it's the first time I've encountered this type of problem.
Maybe I don't play with floats enough.
A language agnostic solution would be to store your numbers as an integer number of steps, given that you know your step size, instead of as floats.
A less language-agnostic solution would be to find out what your language's equivalent of printf is:
printf ("float: %.1f\n", number);
The limited floating point precision of binary numbers is your problem, as you recognize. One way around this is not to do floating point math. Translate the problem to integers, then translate back for the output.
Either use integers instead of a floating point type, or use a floating point type where the "point" is a decimal point (e.g. System.Decimal in .NET).
If you're using a language with a round function, you can use that.
Edit
In response to comments about rounding, here's a sample in c#:
float value = 1.0F;
for (int i = 0; i < 20; i++)
{
    value -= 0.1F;
    Console.WriteLine(Math.Round(value, 1).ToString() + " : " + value.ToString());
}
The results are:
0.9 : 0.9
0.8 : 0.8
0.7 : 0.6999999
0.6 : 0.5999999
(etc)
The rounding does resolve the precision problem. I'm not arguing that it's better than doing integer math and then dividing by 10, just that it works.
With your specific problem, count from -5 to 5 and divide by 10 before actually using the value for something.
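A minimal C++ sketch of that idea (the step range matches the listing in the question; the same approach carries over to ActionScript):
#include <cstdio>
int main()
{
    // Work in whole steps and only divide when the value is needed for
    // output, so no error accumulates from repeatedly adding 0.1.
    for (int step = -5; step <= 5; ++step)
        std::printf("%.1f\n", step / 10.0);   // -0.5, -0.4, ..., 0.4, 0.5
}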
I did the following,
var digitsNbr:Number = Math.abs(Math.ceil(((Math.log(stepSize) / Math.log(10))) + 1));
tickTxt.text = String(inc.toPrecision(digitsNbr));
It's not efficient, but I don't have many steps.
Edit:
I should just get the number of steps as an int and multiply by the step size ...
If you don't have printf, or if the steps are not just powers of 10 (e.g. if you want to round to the nearest 0.2) then it sounds like you want a quantizer:
q(x,u) = u*floor(x/u + 0.5);
"u" is the step size (0.1 in your case), floor() finds the greatest integer not greater than its input, and the "+ 0.5" is to round to the nearest integer.
So basically, you divide by the step size, round to the nearest integer, and multiply by the step size.
edit: oh, never mind, you're basically doing that anyway & the step where it's multiplying by u is introducing rounding error.
Simply scale the numbers to obtain integers, then do the maths and scale them back to floats for display:
//this will round to 3 decimal places
var NUM_SCALE = 1000;
function scaleUpNumber(n) {
    return (Math.floor(n * NUM_SCALE));
}
function scaleDnNumber(n) {
    return (n / NUM_SCALE);
}
var answ = scaleUpNumber(2.1) - scaleUpNumber(3.001);
alert(scaleDnNumber(answ)); // displays: -0.901
Change NUM_SCALE to increase/decrease decimal places.
Your best bet is to use a Decimal data type if your language supports it. Decimals were added to a number of languages to combat this exact problem.
This is a bit counter-intuitive, but I tested it and it works (example in AS3):
var inc:Number = nextMultiple(min, stepSize);
trace(String(inc));
private function nextMultiple(x:Number, y:Number) {
    return Math.ceil(x/y)*(y*10)/10;
}
So the only thing I added is multiplying y by 10, then dividing by 10. Not a universal solution, but it works with your stepSize.
[edit:] The logic here seems to be that you multiply by a big enough number so as for the last decimal digits to "drop off the scale", then divide again to get a rounded number. That said, the example above which uses Math.round() is more readable and better in the sense that the code explicitly says what will happen to the numbers passed in.