idl making big numbers = 0.0 - astronomy

I'm trying to get the mass of the black hole at the center of this galaxy. I have the mass in solar masses but need it in kg. However, when I try to convert (1 Msolar = 1.989*10^30 kg), IDL just gives me 0.0000. I have no idea what I'm doing wrong. I tried telling IDL to print both 1.989*10^30 and 1989000000000000000000000000000, and the outputs are 0.00000 and -1 respectively. Can someone please explain why this is happening?

This is a type conversion error/overflow issue. When you use large numbers, you need to define their type explicitly: long or long64 (i.e., 64-bit long integer) for integers, and float or double for real numbers. For real numbers, the easiest way is the following:
msun = 1.989d30
which is equivalent to 1.989 x 10^30 as a double-precision floating point number. If you want single precision, then just do the following:
msun = 1.989e30
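To make a 32- or 64-bit long integer, use the L or LL suffix. Note, however, that neither integer type can hold a value this large (a signed 32-bit integer tops out near 2.1 x 10^9 and a signed 64-bit integer near 9.2 x 10^18), so the two expressions below only illustrate the syntax and would still overflow:
msun = 1989L * 10L^(27)
or for 64-bit
msun = 1989LL * 10LL^(27)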
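The failure in the question can be reproduced outside IDL. The sketch below is in Python rather than IDL, and it assumes that an unsuffixed IDL integer literal such as 10 is a signed 16-bit value whose arithmetic wraps silently on overflow. `wrap_signed` is a hypothetical helper written for this illustration, not an IDL routine.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Reduce an exact integer to what a signed two's-complement field of `bits` bits would hold."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half


exact = 10 ** 30  # Python integers are arbitrary precision, so this is exact

# 10^30 = 2^30 * 5^30 is a multiple of 2^16, so a wrapping 16-bit result is 0,
# and 1.989 * 0 then prints as 0.00000.
int16_result = wrap_signed(exact, 16)

# Neither a 32-bit nor a 64-bit signed integer can hold 1989 * 10^27.
fits_in_int32 = 1989 * 10 ** 27 <= 2 ** 31 - 1
fits_in_int64 = 1989 * 10 ** 27 <= 2 ** 63 - 1

# A double-precision float holds the value comfortably.
msun_kg = 1.989e30

print(int16_result, fits_in_int32, fits_in_int64, msun_kg)
```

If that assumption about 16-bit literals holds, it explains the 0.00000 in the question, and it is why the floating point literals (1.989d30 or 1.989e30) are the practical fix here.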

I agree with @honeste_vivere's answer about overflow and data types, but I would add that I often change units to avoid this. I frequently have densities that are of order 1e19/m^3, so I cast density in units of 1e19/m^3 and then deal with numbers that are of order 1. This prevents math errors during least squares fits and other operations that might do things like squaring my data.

Related

NaN and +-INF in floating point number system following IEEE754

In the standard, representation of NaN and INF is like this:
For NaN: exponent = emax+1 & mantissa != 0;
For INF: exponent = emax+1 & mantissa = 0;
There are many calculations that result in these two values.
But what ACTUALLY is NaN (or INF)?
And HOW does the system "decide" or "judge" that a value should be stored as one of them?
Here is a case that seems odd to me:
a = b = 1*2^(emax);
then calculating c = a+b, the actual result is 1*2^(emax+1);
Now, c is not an available FP value according to the standard;
then how does the system store c?
Is it NaN?
If yes, how can this even be reasonable?
I mean, 1*2^(emax+1) IS (should be) a number... in the common sense...?
If this is the case, then what does the standard ACTUALLY consider a NaN to be?
If not, then how do we deal with this???
I'm considering an approach like this:
let eM = emax+1;
then 1d.d...d * 2^(eM-1) = 1d.d...d * 2^(emax)
with 1d.d...d having the number of digits the system allows.
This is similar to the way denormalized numbers are handled.
The real question is this:
Is the judgement made after or before the calculation is completed?
If it is made after, is the case above a problem or not?
If it is made before, the task seems impossible...
Has anyone ever thought about this issue?
Thanks for considering it!!
Note: the same questions apply to +-INF.
From Wikipedia:
The five possible exceptions are:
Invalid operation (e.g., square root of a negative number) (returns qNaN by default).
Division by zero (an operation on finite operands gives an exact infinite result, e.g., 1/0 or log(0)) (returns ±infinity by default).
Overflow (a result is too large to be represented correctly) (returns ±infinity by default (for round-to-nearest mode)).
Underflow (a result is very small (outside the normal range) and is inexact) (returns a denormalized value by default).
...
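The a + b case from the question falls under the overflow exception above, so the default result is an infinity rather than a NaN. A minimal sketch in Python, assuming its `float` is IEEE 754 binary64 with the default round-to-nearest mode (which holds on common platforms):

```python
import math
import sys

emax = sys.float_info.max_exp - 1   # 1023 for IEEE 754 binary64
a = b = 2.0 ** emax                 # largest power of two a double can hold

c = a + b                           # exact result 2^(emax+1) is not representable
d = c - c                           # inf - inf is an invalid operation

print(c, math.isinf(c))             # overflow is reported as +infinity
print(d, math.isnan(d))             # invalid operation is reported as NaN
```

So the value is "judged" after the exact result has been rounded: a result that is too large becomes an infinity, and NaN is reserved for operations with no meaningful numeric answer.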

Is there a "native" way to convert from numbers to dB in Tcl

dB, or decibel, is a unit used to express a ratio on a logarithmic scale. Specifically, the definition of dB that I'm interested in is X(dB) = 20log(x), where x is the "normal" value and X(dB) is the value in dB. When I wrote code to convert between mil and mm, I noticed that if I used the direct approach, i.e., multiplying by the ratio between the units, I got small errors on the reverse conversion: to_mil [to_mm val_in_mil] wasn't equal to val_in_mil, and the same with mm. The units library solved this problem, as its conversions do not have that calculation error. But the library doesn't offer (or I didn't find) an option to convert a number to dB.
Is there another library / command that can transform numbers to dB and dB to numbers without calculation errors?
I did an experiment using the direct math conversion, and what I got is:
>> set a 0.005
0.005
>> set b [expr {20*log10($a)}]
-46.0205999133
>> expr {pow(10,($b/20))}
0.00499999999999
It's all a matter of precision. We often tend to forget that floating point numbers are not real numbers (in the mathematical sense of ℝ).
How many decimal digits do you need?
If you, for example, would only need 5 decimal digits, rounding 0.00499999999999 will give you 0.00500 which is what you wanted.
Since rounding fp numbers is not an easy task and may generate even more troubles, you might just change the way you determine if two numbers are equal:
>> set a 0.005
0.005
>> set b [expr {20*log10($a)}]
-46.0205999133
>> set c [expr {pow(10,($b/20))}]
0.00499999999999
>> expr {abs($a - $c) < 1E-10}
1
>> expr {abs($a - $c) < 1E-20}
0
>> expr {$a - $c}
8.673617379884035e-19
The numbers in your example can be considered "equal" up to an error of about 10^-18. Note that this is just a rough estimate, not a full solution.
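The same tolerance-based comparison can be sketched in Python (not Tcl). `to_db` and `from_db` are hypothetical helper names, and the relative tolerance of 1e-12 is an arbitrary choice that should be adapted to the precision actually needed.

```python
import math


def to_db(x: float) -> float:
    """Convert a plain ratio to decibels using X(dB) = 20*log10(x)."""
    return 20.0 * math.log10(x)


def from_db(x_db: float) -> float:
    """Convert decibels back to a plain ratio."""
    return 10.0 ** (x_db / 20.0)


a = 0.005
b = to_db(a)
c = from_db(b)

# Compare with a relative tolerance instead of testing for exact equality.
round_trip_ok = math.isclose(a, c, rel_tol=1e-12)
print(b, c, round_trip_ok)
```

A relative tolerance is usually a safer default than a fixed absolute one such as 1E-10, because it scales with the magnitude of the values being compared.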
If you're really dealing with problems that are sensitive to numerical errors propagation you might look deeper into "numerical analysis". The article What Every Computer Scientist Should Know About Floating-Point Arithmetic or, even better, this site: http://floating-point-gui.de might be a start.
In case you need a larger precision you should drop your "native" requirement.
You may use the BigFloat package offered by tcllib (http://tcllib.sourceforge.net/doc/bigfloat.html) or even use GMP (the GNU multiple precision arithmetic library) through ffidl (http://elf.org/ffidl). There's an interface already defined for it: gmp.tcl
Because of the way floating point numbers are stored, not every log10(...) result can map back to exactly one pow(10, ...) input. So you lose precision, just as the integer divisions 89/7 and 88/7 both give 12.
When you put a value into floating point format, you should forget about knowing its exact value unless you keep the old, exact value too. If you want exactly 1/200, store it as the integer 1 and the integer 200. If you want exactly the base-ten logarithm of 1/200, store it as 1, 200 and the information that a base-ten logarithm has been applied to it.
You can fill your entire memory with the first x decimal digits of the square root of 2, but it still won't be the square root of 2 you store.
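The "store the integer 1 and the integer 200" idea can be sketched in Python with the standard `fractions` module. This is an illustration of the principle only, not something Tcl offers natively.

```python
from fractions import Fraction

exact = Fraction(1, 200)   # stored as the integer pair (1, 200)
approx = 1 / 200           # nearest binary64 value to 0.005

print(exact)                      # the rational 1/200, kept exactly
print(Fraction(approx))           # the exact rational the double really holds
print(exact * 200 == 1)           # exact arithmetic round-trips
print(Fraction(approx) == exact)  # the double is close to 1/200 but not equal
```

Rational arithmetic only stays exact for operations that map rationals to rationals. A logarithm or square root still has to be kept symbolically, as the answer describes.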

CUDA, float precision

I am using CUDA 4.0 on a GeForce GTX 580 (Fermi). I have numbers as small as 7.721155e-43. I want to multiply them with each other just once, or rather, I want to calculate 7.721155e-43 * 7.721155e-43.
My experience shows that I can't do it in a straightforward way. Could you please give me a suggestion? Do I need to use double precision? How?
The magnitude of the smallest normal IEEE single-precision number is about 1.18e-38, and the smallest denormal gets you down to about 1.40e-45. As a consequence, an operand of magnitude 7.72e-43 will comprise only about 9 non-zero bits, which in itself may already be a problem, even before you get to the multiplication (whose result will underflow to zero in single precision). So you may also want to look at any upstream computation that produces these tiny numbers.
If these small numbers are intermediate terms in a mathematical expression, rewriting that expression into a mathematically equivalent one that does not involve tiny intermediates would be one way of addressing the issue. Or you could scale some operands by factors that are powers of two (so as to not incur additional round-off due to the scaling). For example, scale by 2^24 = 16777216.
Lastly, you can switch part of the computation to double precision. To do so, simply introduce temporary variables of type double, perform the computation on them, then convert the final result back to float:
float r, f = 7.721155e-43f;
double d, t;
d = (double)f; // explicit cast is not necessary, since converting to wider type
t = d * d;
// ... more intermediate computation, leaving result in 't' ...
r = (float)t; // since conversion is to narrower type, cast will avoid warnings
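The underflow can be observed without a GPU. The sketch below is in Python rather than CUDA and simulates single precision by rounding through the standard `struct` module. `to_f32` is a hypothetical helper, and the sketch assumes the platform uses IEEE 754 binary32 for the `"f"` format.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


f = to_f32(7.721155e-43)       # a denormal in single precision

# Number of denormal steps of size 2^-149 that make up f.
steps = f / 2.0 ** -149

product_f32 = to_f32(f * f)    # the product rounded to single precision
product_f64 = f * f            # the same product kept in double precision

print(steps, product_f32, product_f64)
```

The product survives in double precision but rounds to zero in single precision, so any conversion back to float has to wait until later computation has brought the value back into the single-precision range.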
In statistics we often have to work with likelihoods that end up being very small numbers and the standard technique is to use logs for everything. Then multiplication on a log scale is just addition. All intermediate numbers are stored as logs. Indeed it can take a bit of getting used to - but the alternative will often fail even when doing relatively modest computations. In R (for my convenience!) which uses doubles and prints 7 significant figures by default btw:
> 7.721155e-43 * 7.721155e-43
[1] 5.961623e-85
> exp(log(7.721155e-43) + log(7.721155e-43))
[1] 5.961623e-85
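The same log-scale technique in Python, as a minimal sketch. The value 1e-200 is an arbitrary stand-in for a tiny likelihood, chosen so that the direct product underflows even in double precision.

```python
import math

p = 1e-200                                # stand-in for a very small likelihood
direct = p * p                            # 1e-400 underflows to 0.0 in a double
log_product = math.log(p) + math.log(p)   # the same product, kept on the log scale

small = 7.721155e-43
via_logs = math.exp(math.log(small) + math.log(small))

print(direct, log_product, via_logs)
```

The sum of logs stays a perfectly ordinary number (about -921) where the direct product is lost, which is why intermediate results are kept as logs and only exponentiated at the end, if at all.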

Storing decimal number with MySQL

What's the best type to store values such:
48.89384 and -2.34910
Currently I'm using float.
Use decimal for exact values.
Notes:
ABS (Latitude) <= 90
ABS (Longitude) <= 180
So you can use 2 different types
Latitude = decimal (x+2, x)
Longitude = decimal (y+3, y)
x and y will be the desired precision. Given a metre is 1/40,000,000 of the earth's circumference, something like 6-8 will be enough depending on whether you're going for street or full stop accuracy in location.
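The arithmetic behind "6-8 will be enough" can be worked through as follows. This is a rough sketch in Python that takes the 40,000,000 m circumference from the answer and measures along a meridian. `resolution_m` is a hypothetical helper name.

```python
EARTH_CIRCUMFERENCE_M = 40_000_000                # figure used in the answer
METRES_PER_DEGREE = EARTH_CIRCUMFERENCE_M / 360   # about 111 km per degree


def resolution_m(decimal_places: int) -> float:
    """Distance on the ground covered by one unit in the last decimal place."""
    return METRES_PER_DEGREE * 10 ** -decimal_places


for places in range(4, 9):
    print(places, resolution_m(places))
```

Five decimal places resolve roughly a metre and six roughly ten centimetres. For longitude the distance per degree shrinks away from the equator, so these figures are an upper bound there.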
If you want exact representation, and you know the scale that applies, then you can use the decimal data type.
If you work with money, use the DECIMAL type. It has no floating-point inaccuracy.
@MiniNamin If you are using SQL, then it will also work with the data type NUMERIC(18,4).
The benefit of floating point is that it can scale from very small to very large numbers. The cost of this, however, is that you can encounter rounding errors.
If you know exactly what level of accuracy you need and will work within, the numeric / decimal types available to you are often more appropriate. As long as you work within the level of accuracy you specify on creation, they will not encounter any rounding errors.
That depends on what you're using the numbers for. If these numbers are latitude and longitude, and if you don't need exact representations, then FLOAT will work. Use DOUBLE PRECISION for more accuracy.
But for exact representations, use DECIMAL or NUMERIC. Or INT or one of its different sizes, if you'll never have fractions.
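The difference between binary floating point and a fixed-scale decimal type can be illustrated outside MySQL. The sketch below uses Python's standard `decimal` module as a stand-in for DECIMAL. It does not show MySQL's own behaviour, and the five-digit scale is just the one implied by the values in the question.

```python
from decimal import Decimal

# Ten additions of 0.1: binary floating point drifts, decimal stays exact.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10)

# Emulate a column with five digits after the decimal point.
latitude = Decimal("48.89384")
stored = latitude.quantize(Decimal("0.00001"))

print(float_total == 1.0, decimal_total == 1)
print(stored)
```

Values that already fit the declared scale, such as 48.89384, are stored without any change, which is the property the answers above rely on.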

What is an integer overflow error?

What is an integer overflow error?
Why do I care about such an error?
What are some methods of avoiding or preventing it?
Integer overflow occurs when you try to express a number that is larger than the largest number the integer type can handle.
If you try to express the number 300 in one byte, you have an integer overflow (maximum is 255). 100,000 in two bytes is also an integer overflow (65,535 is the maximum).
You need to care about it because mathematical operations won't behave as you expect. A + B doesn't actually equal the sum of A and B if you have an integer overflow.
You avoid it by not creating the condition in the first place (usually either by choosing your integer type to be large enough that you won't overflow, or by limiting user input so that an overflow doesn't occur).
The easiest way to explain it is with a trivial example. Imagine we have a 4 bit unsigned integer. 0 would be 0000 and 1111 would be 15. So if you increment 15 instead of getting 16 you'll circle back around to 0000 as 16 is actually 10000 and we can not represent that with less than 5 bits. Ergo overflow...
In practice the numbers are much bigger and it circles to a large negative number on overflow if the int is signed but the above is basically what happens.
Another way of looking at it is to consider it as largely the same thing that happens when the odometer in your car rolls over to zero again after hitting 999999 km/mi.
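The 4-bit example can be simulated in a few lines of Python. Python integers do not overflow by themselves, so the sketch emulates the fixed width by masking. `add_u4` is a hypothetical helper name.

```python
BITS = 4
MASK = (1 << BITS) - 1   # 0b1111, the largest 4-bit unsigned value (15)


def add_u4(a: int, b: int) -> int:
    """Add two values the way a 4-bit unsigned register would, dropping the carry."""
    return (a + b) & MASK


wrapped = add_u4(15, 1)          # 16 is 0b10000, the fifth bit is lost
print(format(wrapped, "04b"))    # the four bits that remain
```

Like the odometer, the register simply keeps the digits it has room for and discards the rest.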
When you store an integer in memory, the computer stores it as a series of bytes. These can be represented as a series of ones and zeros.
For example, zero will be represented as 00000000 (8 bit integers), and often, 127 will be represented as 01111111. If you add one to 127, this would "flip" the bits, and swap it to 10000000, but in a standard two's complement representation, this is actually used to represent -128. This "overflows" the value.
With unsigned numbers, the same thing happens: 255 (11111111) plus 1 would become 100000000, but since there are only 8 "bits", this ends up as 00000000, which is 0.
You can avoid this by doing proper range checking for your correct integer size, or using a language that does proper exception handling for you.
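Both the 8-bit wraparound and the range check can be sketched in Python. The helper names `wrap_int8`, `wrap_uint8` and `checked_add_int8` are hypothetical, and raising `OverflowError` is just one possible way to report the problem.

```python
INT8_MIN, INT8_MAX = -128, 127


def wrap_int8(value: int) -> int:
    """Interpret the low 8 bits of value as a signed two's-complement integer."""
    return (value + 128) % 256 - 128


def wrap_uint8(value: int) -> int:
    """Keep only the low 8 bits of value, as an unsigned byte would."""
    return value & 0xFF


def checked_add_int8(a: int, b: int) -> int:
    """Add two values, refusing any result that does not fit in a signed byte."""
    result = a + b
    if not INT8_MIN <= result <= INT8_MAX:
        raise OverflowError(f"{a} + {b} does not fit in a signed 8-bit integer")
    return result


print(wrap_int8(127 + 1))    # signed wraparound
print(wrap_uint8(255 + 1))   # unsigned wraparound
```

The checked version trades a comparison on every operation for an explicit error instead of a silently wrong value.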
An integer overflow error occurs when an operation makes an integer value greater than its maximum.
For example, if the maximum value you can have is 100000, and your current value is 99999, then adding 2 will make it 'overflow'.
You should care about integer overflows because data can be changed or lost inadvertently, and you can avoid them with either a larger integer type (see long int in most languages) or with a scheme that converts long strings of digits to very large integers.
Overflow is when the result of an arithmetic operation doesn't fit in the data type of the operation. You can have overflow with a byte-sized unsigned integer if you add 255 + 1, because the result (256) does not fit in the 8 bits of a byte.
You can have overflow with a floating point number if the result of a floating point operation is too large to represent in the floating point data type's exponent or mantissa.
You can also have underflow with floating point types when the result of a floating point operation is too small to represent in the given floating point data type. For example, if the floating point data type can handle exponents in the range of -100 to +100, and you square a value with an exponent of -80, the result will have an exponent around -160, which won't fit in the given floating point data type.
You need to be concerned about overflows and underflows in your code because it can be a silent killer: your code produces incorrect results but might not signal an error.
Whether you can safely ignore overflows depends a great deal on the nature of your program - rendering screen pixels from 3D data has a much greater tolerance for numerical errors than say, financial calculations.
Overflow checking is often turned off in default compiler settings. Why? Because the additional code to check for overflow after every operation takes time and space, which can degrade the runtime performance of your code.
Do yourself a favor and at least develop and test your code with overflow checking turned on.
From wikipedia:
In computer programming, an integer overflow occurs when an arithmetic operation attempts to create a numeric value that is larger than can be represented within the available storage space. For instance, adding 1 to the largest value that can be represented constitutes an integer overflow. The most common result in these cases is for the least significant representable bits of the result to be stored (the result is said to wrap).
You should care about it especially when choosing the appropriate data types for your program or you might get very subtle bugs.
From http://www.first.org/conference/2006/papers/seacord-robert-slides.pdf :
An integer overflow occurs when an integer is increased beyond its maximum value or decreased beyond its minimum value. Overflows can be signed or unsigned.
P.S.: The PDF has detailed explanation on overflows and other integer error conditions, and also how to tackle/avoid them.
I'd like to be a bit contrarian to all the other answers so far, which somehow accept crappy broken math as a given. The question is tagged language-agnostic and in a vast number of languages, integers simply never overflow, so here's my kind-of sarcastic answer:
What is an integer overflow error?
An obsolete artifact from the dark ages of computing.
why do i care about it?
You don't.
how can it be avoided?
Use a modern programming language in which integers don't overflow. (Lisp, Scheme, Smalltalk, Self, Ruby, Newspeak, Ioke, Haskell, take your pick ...)
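Python is not on the list above, but its built-in `int` behaves the same way and serves as a quick illustration: values grow past any fixed width instead of wrapping. A minimal sketch:

```python
INT32_MAX = 2 ** 31 - 1

bigger = INT32_MAX + 1   # no wraparound, the integer just gets wider
huge = 2 ** 200          # far beyond any 64-bit register

print(bigger)
print(huge.bit_length())
print(huge + 1 - huge)   # arithmetic stays exact at this size
```

The cost is that such integers are no longer a single machine word, so arithmetic on very large values is slower and uses more memory than fixed-width integers.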
I find showing the Two’s Complement representation on a disc very helpful.
Here is a representation for 4-bit integers. The maximum value is 2^3-1 = 7.
For 32 bit integers, we will see the maximum value is 2^31-1.
When we add 1 to 2^31-1, we move clockwise by one step and land on -2^31. This is called integer overflow.
Ref : https://courses.cs.washington.edu/courses/cse351/17wi/sections/03/CSE351-S03-2cfp_17wi.pdf
This happens when you attempt to use an integer for a value that is higher than the internal structure of the integer can support due to the number of bytes used. For example, if the maximum integer size is 2,147,483,647 and you attempt to store 3,000,000,000 you will get an integer overflow error.