To start out I just want to state that I have read this discussion.
Are floating points uniformly inaccurate across all possible values? Or does inaccuracy increase as the values get farther and farther away from 0?
To understand this, you need to clearly determine what kind of accuracy you are talking about. It is usually a measure of errors occurring in calculation, and I suspect you are not thinking about calculations in only the relevant floating point format.
These are all answers to your question:
The precision - expressed as a number of significant bits - of floating point numbers is constant over most of the range. (Only for denormal (subnormal) numbers does the precision decrease as the number gets smaller.)
The accuracy of floating point operations is typically limited by the precision, so mostly constant over the range. See the previous point.
The accuracy with which you can convert decimal numbers to binary floating point is higher for integers than for numbers with a fractional component. This is because integers can be represented exactly as sums of powers of two, while most decimal fractions can't be represented exactly as sums of negative powers of two. (The typical example is that 0.1 becomes a repeating fraction in binary floating point.)
The consequence of the last point is that a moderately large decimal number written in scientific notation, e.g. 1.123*10^4, has the same value as an integer and can therefore be converted exactly to binary floating point.
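A quick C sketch makes the last two points concrete (assuming IEEE 754 doubles, which is what double is on essentially every current platform):

    #include <stdio.h>

    int main(void) {
        /* 0.1 has no finite binary representation, so the stored
           double is only the nearest representable value */
        printf("0.1     -> %.20f\n", 0.1);
        /* 1.123 * 10^4 = 11230 is an integer; integers of this size
           are exact sums of powers of two, so conversion is exact */
        printf("1.123e4 -> %.20f\n", 1.123e4);
        return 0;
    }

The first line prints something like 0.10000000000000000555, while the second prints 11230 followed by nothing but zeros.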
Related
I am looking for a numerical algorithm to calculate the maximum data length for a given CRC polynomial and a given Hamming Distance.
E.g. let's say I have an 8-bit CRC with full polynomial 0x19b. I want to achieve a Hamming distance of 4. Now how many bits of data can be guarded under these conditions?
Is there some numerical algorithm (ideally C or C++ code) that can be used to solve this problem?
Not a complete answer, but my spoof code can be adapted to this problem.
To determine that you have not met the requirement of a Hamming distance of 4 for a given message length, you need only find a single codeword with a Hamming distance of 3. If you give spoof a set of bit locations in a message, it will determine which of those bits to invert in order to leave the CRC unchanged. Spoof simply solves a set of linear equations over GF(2) to find the bit locations to invert.
That will quickly narrow down the message lengths that will work. Once you have a candidate length, n, for which you have not been able to find a codeword of distance 3, proving that there are no such codewords will be a little more work. You would need to generate all possible 3-bit patterns, of which there are n(n-1)(n-2)/6, and look to see if any of them have a CRC of zero. Depending on n, that might not be too daunting. A fast way to do this is to generate the CRCs of all messages with a single bit set, and then exclusive-or all choices of three CRCs from that set to see if any of those come out zero.
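A sketch of that last step in C, assuming a plain CRC with zero initial value and no final XOR (which doesn't change the distance properties, since those depend only on the polynomial); the pair check covers 2-bit patterns as well:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define POLY 0x19b  /* full 8-bit CRC polynomial from the question */

    /* Return 1 if some codeword of weight 2 or 3 exists for an n-bit
       message, i.e. the Hamming distance 4 requirement fails at n.
       crc1[i] holds the CRC of a message whose only set bit is i bits
       away from the CRC field; by linearity over GF(2), the CRC of an
       error pattern is just the XOR of the per-bit CRCs. */
    static int distance_4_fails(int n) {
        uint8_t *crc1 = malloc(n);        /* no NULL check -- a sketch */
        uint8_t c = POLY & 0xff;          /* CRC of bit 0: x^8 mod p */
        int found = 0;
        for (int i = 0; i < n; i++) {
            crc1[i] = c;
            /* multiply by x modulo the polynomial for the next bit */
            c = (c & 0x80) ? (uint8_t)((c << 1) ^ POLY) : (uint8_t)(c << 1);
        }
        for (int i = 0; i < n - 1 && !found; i++)
            for (int j = i + 1; j < n && !found; j++) {
                uint8_t cij = crc1[i] ^ crc1[j];
                if (cij == 0)
                    found = 1;            /* weight-2 codeword */
                for (int k = j + 1; k < n && !found; k++)
                    if ((cij ^ crc1[k]) == 0)
                        found = 1;        /* weight-3 codeword */
            }
        free(crc1);
        return found;
    }

    int main(void) {
        /* scan upward; the last n before the first failure is the
           data-length limit for Hamming distance 4 */
        for (int n = 3; n <= 2048; n++)
            if (distance_4_fails(n)) {
                printf("HD 4 holds through %d data bits\n", n - 1);
                return 0;
            }
        return 0;
    }

This is the O(n^3) brute force the answer describes; the spoof-based search above is what makes it practical by pruning candidate lengths first.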
I conjecture that there is a faster way to do that last step by intelligently culling the rows used in the linear equation solver, allowing for all bit positions. However the margin here is not sufficient for me to express the proof.
I understand how integer and floating point data types are stored, and I am guessing that the variable length of decimal data types means it is stored more like a string.
Does that imply a performance overhead when using a decimal data type and searching against them?
Pavel has it quite right; I'd just like to explain a little.
Presuming that you mean a performance impact as compared to floating point, or fixed-point-offset integer (i.e. storing thousandths of a cent as an integer): yes, there is very much a performance impact. PostgreSQL, and by the sounds of things MySQL, store DECIMAL / NUMERIC in binary-coded decimal. This format is more compact than storing the digits as text, but it's still not very efficient to work with.
If you're not doing many calculations in the database, the impact is limited to the greater storage space required for BCD as compared to integer or floating point, and thus the wider rows and slower scans, bigger indexes, etc. Comparison operations in b-tree index searches are also slower, but not enough to matter unless you're already CPU-bound for some other reason.
If you're doing lots of calculations with the DECIMAL / NUMERIC values in the database, then performance can really suffer. This is particularly noticeable, at least in PostgreSQL, because Pg can't use more than one CPU for any given query. If you're doing a huge bunch of division and multiplication, more complex maths, aggregation, etc. on numerics, you can start to find yourself CPU-bound in situations where you would never be when using a float or integer data type. This is particularly noticeable in OLAP-like (analytics) workloads, and in reporting or data transformation during loading or extraction (ETL).
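To see where the cost comes from, here's a toy sketch of decimal addition on packed BCD (a simplified model: PostgreSQL's numeric actually works in base-10000 digit groups internally, but the digit-group-at-a-time principle is the same). Every digit pair needs unpacking, a decimal carry check, and repacking, whereas a native integer add is a single instruction:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Add two packed-BCD numbers, two decimal digits per byte,
       least significant byte first. */
    static void bcd_add(const uint8_t *a, const uint8_t *b,
                        uint8_t *sum, size_t n) {
        unsigned carry = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned lo = (a[i] & 0x0f) + (b[i] & 0x0f) + carry;
            carry = lo > 9; if (carry) lo -= 10;
            unsigned hi = (a[i] >> 4) + (b[i] >> 4) + carry;
            carry = hi > 9; if (carry) hi -= 10;
            sum[i] = (uint8_t)(hi << 4 | lo);
        }
    }

    int main(void) {
        uint8_t a[2] = {0x99, 0x19};       /* 1999 */
        uint8_t b[2] = {0x01, 0x00};       /*    1 */
        uint8_t s[2];
        bcd_add(a, b, s, 2);
        printf("%02x%02x\n", s[1], s[0]);  /* prints 2000 */
        return 0;
    }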
Despite the fact that there is a performance impact (which varies based on workload from negligible to quite big), you should generally use numeric / decimal when it is the most appropriate type for your task - i.e. when values with a very large range or precision must be stored and/or rounding error isn't acceptable.
Occasionally it's worth the hassle of using a bigint and fixed-point offset, but that is clumsy and inflexible. Using floating point instead is very rarely the right answer due to all the challenges of working reliably with floating point values for things like currency.
(BTW, I'm quite excited that some new Intel CPUs, and IBM's Power 7 range of CPUs, include hardware support for IEEE 754 decimal floating point. If this ever becomes available in lower end CPUs it'll be a huge win for databases.)
The impact of the decimal type (the numeric type in Postgres) depends on usage. For typical OLTP the impact is not significant - for OLAP it can be relatively high. In our application, an aggregation over large columns of numeric values is several times slower than for the type double precision.
Although current CPUs are strong, the rule still holds - you should use numeric only when you need exact numbers or very large numbers. Elsewhere, use the float or double precision types.
You are correct: fixed-point data is stored as a (packed BCD) string.
To what extent this impacts performance depends on a range of factors, which include:
Do queries utilise an index upon the column?
Can the CPU perform BCD operations in hardware, such as through Intel's BCD opcodes?
Does the OS harness hardware support through library functions?
Overall, any performance impact is likely to be pretty negligible relative to other factors that you may face: so don't worry about it. Remember Knuth's maxim, "premature optimisation is the root of all evil".
I am guessing that the variable length of decimal data types means it is stored more like a string.
Taken from the MySQL documentation. The document says that, as of MySQL 5.0.3:
Values for DECIMAL columns no longer are represented as strings that require 1 byte per digit or sign character. Instead, a binary format is used that packs nine decimal digits into 4 bytes. This change to DECIMAL storage format changes the storage requirements as well. The storage requirements for the integer and fractional parts of each value are determined separately. Each multiple of nine digits requires 4 bytes, and any remaining digits require some fraction of 4 bytes.
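Those storage rules are easy to encode. Here's a small sketch in C; the leftover-digit byte counts follow the table in the MySQL manual (1-2 leftover digits cost 1 byte, 3-4 cost 2, 5-6 cost 3, 7-9 cost 4):

    #include <stdio.h>

    /* Bytes needed for one part (integer or fractional) of a MySQL
       DECIMAL: each full group of nine digits takes 4 bytes, plus
       the leftover-digit cost from the manual's table. */
    static int part_bytes(int digits) {
        static const int leftover[9] = {0, 1, 1, 2, 2, 3, 3, 4, 4};
        return (digits / 9) * 4 + leftover[digits % 9];
    }

    int main(void) {
        /* e.g. DECIMAL(18,9): 9 integer digits + 9 fractional digits */
        int m = 18, d = 9;
        printf("DECIMAL(%d,%d) needs %d bytes\n",
               m, d, part_bytes(m - d) + part_bytes(d));
        return 0;
    }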
What is the adequate MySQL field type for weight (in kilograms) and height (meters) data?
That all depends. There are advantages and disadvantages to the different numeric types:
If you absolutely need the height/weight to be exact to a certain number of decimal places and not subject to the issues that floating point numbers can cause, use decimal(). decimal takes two parameters: the precision (the total number of digits) and the scale (the number of digits after the decimal point), so decimal(5,2) can store numbers from -999.99 up to 999.99. Fixed-precision math is a bit slower, and such values are more costly to store because of the algorithms involved. It's absolutely mandatory to store currency values, or any value you intend to do currency math with, as numeric(). If you're, for instance, paying people per pound or per meter for a baby, it would need to be numeric().
If you're willing to accept that some numeric values will be less precise than others, use either FLOAT or DOUBLE, depending on the precision and the size of the numbers you're storing. Floating point numbers are approximations arising from the way computers do math: we write fractions in powers of 10, computers work in base 2, and the two don't map onto each other exactly. To get around that, algorithms were developed for cases in which 100% precision is not required but more speed is necessary than can be achieved by modelling values as collections of integers.
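As a quick illustration of that trade-off, here's a sketch assuming IEEE 754 arithmetic (MySQL's FLOAT is a 4-byte single-precision value and DOUBLE an 8-byte double-precision value on typical platforms):

    #include <stdio.h>

    int main(void) {
        /* a weight in kg, as a 4-byte FLOAT column would hold it */
        float  w_float  = 72.57f;
        /* the same value in an 8-byte DOUBLE */
        double w_double = 72.57;
        printf("FLOAT : %.10f\n", w_float);   /* 72.5699996948 */
        printf("DOUBLE: %.10f\n", w_double);  /* 72.5700000000, closer
                                                 but still not exact */
        return 0;
    }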
For more info:
http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
http://dev.mysql.com/doc/refman/5.0/en/fixed-point-types.html
http://dev.mysql.com/doc/refman/5.0/en/floating-point-types.html
I am being specific about handling large numbers of money values. Each value is precise only up to 2 decimal places. But the values will be passed around by a database and one or more web frameworks, and there will be arithmetic operations.
Should I insist on decimal datatypes for numbers that need only 2 places of precision? Or are modern floating point implementations robust and standardized to avoid it?
Hell no, absolutely, and the issues are orthogonal, in that order. :-)
Floating point numbers, especially in binary, are never the right choice for fixed-point quantities, least of all those that expect precise fractions, like money values. First of all, they can't express all values of cents (or whatever the fractional component is) exactly, just like fixed-length decimal numbers can't express 1/3 exactly. Secondly, adding or subtracting small and very large floating point numbers doesn't always produce the result you expect, because of differences in "significance".
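Both failure modes are easy to demonstrate in a few lines of C (a sketch assuming IEEE 754 doubles):

    #include <stdio.h>

    int main(void) {
        /* 0.01 has no exact binary representation, so summing a
           million cents drifts away from the exact answer */
        double total = 0.0;
        for (int i = 0; i < 1000000; i++)
            total += 0.01;
        printf("sum  = %.10f (exact answer is 10000)\n", total);

        /* a cent added to ten quadrillion vanishes entirely: the two
           magnitudes differ too much in significance */
        double big = 1e16;
        printf("lost = %.2f\n", (big + 0.01) - big);   /* prints 0.00 */
        return 0;
    }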
Decimal numbers are the way to go for currency calculations. If you absolutely must use binary numbers, use scaled fixed-point binary numbers - for example, compute everything in 1/100ths of your currency unit, and use binary integers to do it.
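Here is a minimal sketch of that scaled fixed-point approach, holding money as a count of cents in a 64-bit integer (the type and helper names are just for illustration):

    #include <stdio.h>
    #include <stdint.h>

    /* Money as cents in a 64-bit integer: every addition and
       subtraction is exact; only conversions in and out need care. */
    typedef int64_t cents_t;

    static cents_t cents_from(long units, int hundredths) {
        return (cents_t)units * 100 + hundredths;
    }

    int main(void) {
        cents_t price = cents_from(19, 99);   /* 19.99 */
        cents_t total = 0;
        for (int i = 0; i < 1000000; i++)
            total += price;                   /* exact, unlike the float loop */
        printf("total = %lld.%02lld\n",
               (long long)(total / 100), (long long)(total % 100));
        return 0;
    }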
Lastly, this has nothing to do with "robustness" or "standardization" - it's got everything to do with picking a datatype that matches your data.
No, they are not precise enough. See the floating point guide for details.
I understand basic binary logic and how to do basic addition, subtraction, etc. I get that each of the characters in this text is just a binary number representing a number in a charset, and that the numbers don't really mean anything to the computer. I'm confused, however, as to how a computer works out that one number is greater than another. What does it do at the bit level?
If you have two numbers, you can compare them bit by bit, from most significant to least significant, using a 1-bit comparator gate.
Of course, n-bit comparator gates also exist; they are built by cascading these 1-bit stages.
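A little C sketch of that cascade logic (software standing in for hardware; the gt/lt flags play the role of the signals passed between stages):

    #include <stdio.h>

    /* One stage of a magnitude comparator: only if all more
       significant bits were equal does the current bit decide. */
    static void compare_bit(int a, int b, int *gt, int *lt) {
        if (!*gt && !*lt) {
            *gt = a & ~b & 1;     /* a=1, b=0  ->  a is greater */
            *lt = ~a & b & 1;     /* a=0, b=1  ->  a is smaller */
        }
    }

    int main(void) {
        unsigned a = 0x2d, b = 0x2a;      /* 0010 1101 vs 0010 1010 */
        int gt = 0, lt = 0;
        for (int i = 7; i >= 0; i--)      /* most significant bit first */
            compare_bit((a >> i) & 1, (b >> i) & 1, &gt, &lt);
        printf("a %s b\n", gt ? ">" : lt ? "<" : "==");
        return 0;
    }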
It subtracts one from the other and sees if the result is less than 0 (by checking the highest-order bit, which is 1 on a number less than 0 since computers use 2's complement notation).
http://academic.evergreen.edu/projects/biophysics/technotes/program/2s_comp.htm
It subtracts the two numbers and checks if the result is positive, negative (highest bit - aka "the minus bit" - is set), or zero.
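In C, that subtract-and-test-the-sign-bit idea looks roughly like this (a simplified sketch: a real CPU also consults the overflow flag so that signed comparisons survive wrap-around):

    #include <stdio.h>
    #include <stdint.h>

    /* Compare by subtracting and testing the sign bit, the way the
       CPU's flags work after a SUB/CMP instruction. */
    static int is_less(int32_t a, int32_t b) {
        uint32_t diff = (uint32_t)a - (uint32_t)b;  /* wraps, like hardware */
        return (diff >> 31) & 1;      /* 1 if the result is negative */
    }

    int main(void) {
        printf("%d\n", is_less(3, 7));    /* 1: 3 - 7 = -4, sign bit set */
        printf("%d\n", is_less(7, 3));    /* 0 */
        return 0;
    }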
Within the processor, often there will be microcode to do operations, using hardwired options, such as add/subtract, that is already there.
So, to do a comparison of an integer the microcode can just do a subtraction, and based on the result determine if one is greater than the other.
Microcode is basically just low-level programs that will be called by assembly, to make it look like there are more instructions available than are actually hardwired on the processor.
You may find this useful:
http://www.osdata.com/topic/language/asm/intarith.htm
I guess it does a bitwise comparison of two numbers from the most significant bit to the least significant bit, and when they differ, the number with the bit set to "1" is the greater.
In a big-endian architecture, the comparison of the following bytes:
A: 0010 1101
B: 0010 1010
would result in A being greater than B, since its 6th bit (from the left) is set to one while the preceding bits are equal to those of B.
But this is just a quick theoretical answer, with no concern for floating point numbers or negative numbers.
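That idea translates directly into C (a sketch for unsigned bytes only, per the caveat above):

    #include <stdio.h>

    /* Find the most significant bit where a and b differ; whoever
       holds a 1 there is the larger. Valid for unsigned values only. */
    static int compare_unsigned(unsigned char a, unsigned char b) {
        for (int i = 7; i >= 0; i--) {
            int abit = (a >> i) & 1, bbit = (b >> i) & 1;
            if (abit != bbit)
                return abit ? 1 : -1;   /* 1: a > b, -1: a < b */
        }
        return 0;                       /* all bits equal */
    }

    int main(void) {
        /* the example above: A = 0010 1101, B = 0010 1010 */
        printf("%d\n", compare_unsigned(0x2d, 0x2a));   /* prints 1 */
        return 0;
    }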