How can you handle absurdly large numbers? - language-agnostic

There are some scenarios where programmers need or want to find grossly large numbers. These are often so large that they defy the programmer's comprehension. I'm talking about things like the largest known prime number (with 12978189 digits) and the recently calculated 10 trillion digits of pi.
How can you create a program that handles these? This far exceeds an integer, a long, a double, a BigInteger, a BigDecimal, or anything of the sort. How do these kinds of programs for discovering these numbers get created? How can you even store them in memory when no appropriate datatypes exist, and they would likely consume gigabytes of data each?

To address your specific examples:
A 12 million digit integer isn't terribly large for a typical "large integer" class to handle. Stored in binary it takes about 5 MB, so it fits comfortably in memory.
To store 10 trillion digits of π, you could use a disk file and memory-map it. You'll need a 64-bit OS and application, but you can simply create a 10 terabyte file on disk (you'll probably need a few disks and a filesystem like ZFS that can store it across disks), and map it into the process's address space. The algorithms that calculate π (such as BBP) conveniently calculate one hex digit at a time, which fits well into half a byte of memory.
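As a rough illustration of the "half a byte per hex digit" idea, here is a minimal Python sketch that packs two hex digits into each byte. The helper names set_hex_digit and get_hex_digit are made up for this example, and a bytearray stands in for the memory-mapped file (Python's mmap objects support the same byte indexing).

```python
def set_hex_digit(buf, index, digit):
    """Store one hex digit (0..15) at position `index`, two digits per byte."""
    if not 0 <= digit <= 15:
        raise ValueError("digit must be in 0..15")
    byte_pos, is_low = divmod(index, 2)
    if is_low:
        # Odd positions live in the low nibble.
        buf[byte_pos] = (buf[byte_pos] & 0xF0) | digit
    else:
        # Even positions live in the high nibble.
        buf[byte_pos] = (buf[byte_pos] & 0x0F) | (digit << 4)


def get_hex_digit(buf, index):
    """Read back the hex digit stored at position `index`."""
    byte_pos, is_low = divmod(index, 2)
    byte = buf[byte_pos]
    return byte & 0x0F if is_low else byte >> 4


# The first eight hex digits of the fractional part of pi (3.243F6A88...).
digits = [0x2, 0x4, 0x3, 0xF, 0x6, 0xA, 0x8, 0x8]
buf = bytearray(len(digits) // 2)  # stand-in for a memory-mapped file
for i, d in enumerate(digits):
    set_hex_digit(buf, i, d)
```

Eight digits occupy four bytes here, so the same layout halves the file size compared with one digit per byte.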

The (abstract) answer is to write algorithms using the machine's native types that produce the results you want. For instance, when you add two very large integers by hand on paper, the biggest actual calculation you need is only 9+9+1 (nine plus nine plus one for the carry). Of course you need paper large enough to write the two numbers down in the first place, and the answer as well. So as long as the two numbers and the answer can be stored on the computer's hard disk (the paper), an algorithm can be written that does it with variables that only need to hold values up to 19; even a char variable is more than capable of handling this, let alone an int variable.
The (concrete) answer is that really good programmers have already done this and there are even FOSS libraries for it. One good one is the GNU Project's GMP library, which has loads of functions to handle arbitrary-size integer arithmetic and arbitrary-precision floating point arithmetic. So as long as your computer can store the information needed during the calculation, it can be done. You'll need to invest the time to read the documentation, of course.
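The pencil-and-paper addition described above can be sketched in a few lines of Python. This is only an illustration of the idea (the function name add_decimal_strings is made up here, and real libraries such as GMP use much larger "digits" and far faster algorithms): the numbers are kept as decimal strings and no intermediate value ever exceeds 19.

```python
def add_decimal_strings(a, b):
    """Add two non-negative integers given as decimal digit strings."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk both numbers from the rightmost digit, as you would on paper.
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0
        db = int(b[j]) if j >= 0 else 0
        total = da + db + carry      # never larger than 9 + 9 + 1 = 19
        result.append(str(total % 10))
        carry = total // 10
        i -= 1
        j -= 1
    return "".join(reversed(result)) or "0"


# Two numbers far beyond the range of a 64-bit integer.
x = "9" * 50
y = "1"
total = add_decimal_strings(x, y)
```

Python's own int type is already arbitrary-precision, so in practice you would just write int(x) + int(y); the loop is spelled out only to show that small native values are enough.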

Related

Does Actionscript have a math specification?

This Flash game has a lot of players including me and some friends. We noticed the same thing can run differently for different people. The math in the simulation is definitely to blame. Whether the cause is in hardware, OS, browser, 32-bit/64-bit, etc. is not really known. But with the combinations we have to test with, we've gotten 5 distinct end results from the same simulation starting conditions, and can likely get more.
This makes me wonder, does Actionscript have a floating point math specification? If so, what does it say about the accuracy and determinism of the computations?
I compare to Java, which differentiates between regular floating point math with the Math class and deterministic floating point with the StrictMath class and strictfp keyword. Both are always within 1 ulp of the exact result; this also implies that regular math and strict math always give results within 1 ulp of each other for a single operation or function call. The docs are very clear about this. I'd expect other respectable languages to have something similar, saying how accurate their floating point computations are and whether they give the same results everywhere.
Update since some people have been saying the game is dishonest:
Some others have taken apart the swf and even made mods for it; they've seen the game engine and can confirm there is no randomness. Box2d is used for its physics. If a design ever does run differently on subsequent runs, it has actually changed due to some bug. Usually this is a visible difference, but if not, you can check the raw data with this tool and see it is different. Different starting conditions, as expected, get different end results.
As for what we know so far, these are results on a test level:
For example, if I am running 32-bit Chrome on my desktop (AMD A10-5700 as CPU), I will always get that result of "946 ticks". But if I run on Firefox or Internet Explorer instead I always get the result of "794 ticks".
Actionscript doesn't really have a math specification in that sense. This is the closest you'll get:
https://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/Math.html
It says at the bottom of the top section:
The Math functions acos, asin, atan, atan2, cos, exp, log, pow, sin, and sqrt may result in slightly different values depending on the algorithms used by the CPU or operating system. Flash runtimes call on the CPU (or operating system if the CPU doesn't support floating point calculations) when performing the calculations for the listed functions, and results have shown slight variations depending upon the CPU or operating system in use.
So to answer our two questions:
What does it say about accuracy? Nothing, actually. At no point does it mention a limit to how inaccurate a result can be.
What does it say about determinism? Hardware and operating system are definitely factors, so it is platform-dependent. No confirmation for other factors.
If you want to look any deeper, you're on your own.
According to the docs, Actionscript has a catch-all Number data type in addition to int and uint types:
The Number data type uses the 64-bit double-precision format as specified by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE-754). This standard dictates how floating-point numbers are stored using the 64 available bits. One bit is used to designate whether the number is positive or negative. Eleven bits are used for the exponent, which is stored as base 2. The remaining 52 bits are used to store the significand (also called mantissa), the number that is raised to the power indicated by the exponent.
By using some of its bits to store an exponent, the Number data type can store floating-point numbers significantly larger than if it used all of its bits for the significand. For example, if the Number data type used all 64 bits to store the significand, it could store a number as large as 2^65 – 1. By using 11 bits to store an exponent, the Number data type can raise its significand to a power of 2^1023.
Although this range of numbers is enormous, it comes at the cost of precision. Because the Number data type uses 52 bits to store the significand, numbers that require more than 52 bits for accurate representation, such as the fraction 1/3, are only approximations. If your application requires absolute precision with decimal numbers, use software that implements decimal floating-point arithmetic as opposed to binary floating-point arithmetic.
This could account for the varying results you're seeing.
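To make the quoted bit layout concrete, here is a small Python sketch (Python floats use the same IEEE-754 double format; the helper name double_fields is made up for this example) that splits a double into its sign, exponent and significand fields and shows that 1/3 is only stored approximately. It illustrates the representation only and says nothing about how a particular Flash runtime evaluates its Math functions.

```python
import struct
from fractions import Fraction


def double_fields(x):
    """Split a 64-bit IEEE-754 double into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)      # 52 bits
    return sign, exponent, fraction


sign, exponent, fraction = double_fields(1.0 / 3.0)

# The stored value is a nearby binary fraction, not exactly one third.
stored = Fraction(1.0 / 3.0)
error = abs(stored - Fraction(1, 3))
```

The error is tiny but not zero, which is why results that pass through long chains of operations can drift apart once any single step differs between platforms.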

Choosing a magic byte least likely to appear in real data

I hope this isn't too opinionated for SO; it may not have a good answer.
In a portion of a library I'm writing, I have a byte array that gets populated with values supplied by the user. These values might be of type Float, Double, Int (of different sizes), etc. with binary representations you might expect from C, say. This is all we can say about the values.
I have an opportunity for an optimization: I can initialize my byte array with the byte MAGIC, and then whenever no byte of the user-supplied value is equal to MAGIC I can take a fast path, otherwise I need to take the slow path.
So my question is: what is a principled way to go about choosing my magic byte, such that it will be reasonably likely not to appear in the (variously-encoded and distributed) data I receive?
Part of my question, I suppose, is whether there's something like a Benford's law that can tell me something about the distribution of bytes in many sorts of data.
Capture real-world data from a diverse set of inputs that would be used by applications of your library.
Write a quick and dirty program to analyze the dataset. It sounds like what you want to know is which bytes are most frequently totally excluded. So the output of the program would say, for each byte value, how many inputs do not contain it.
This is not the same as least frequent byte. In data analysis you need to be careful to mind exactly what you're measuring!
Use the analysis to define your architecture. If no byte value is absent from a large share of the inputs, you can abandon the optimization entirely.
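The analysis described in the steps above could look something like this quick Python sketch. The three sample values are made up purely for illustration; you would feed in encoded values captured from real users of the library.

```python
import struct
from collections import Counter


def count_absent(inputs):
    """For each byte value 0..255, count how many inputs do NOT contain it."""
    absent = Counter()
    for data in inputs:
        present = set(data)          # iterating bytes yields ints 0..255
        for value in range(256):
            if value not in present:
                absent[value] += 1
    return absent


# Hypothetical sample inputs: one double and two 32-bit ints.
samples = [
    struct.pack("<d", 1.5),
    struct.pack("<i", -1),
    struct.pack("<i", 255),
]
absent = count_absent(samples)

# Candidate magic bytes, best first: those missing from the most inputs.
ranking = sorted(range(256), key=lambda v: absent[v], reverse=True)
```

Note that this counts inputs that exclude a byte entirely, which is the quantity the fast path depends on, rather than overall byte frequency.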
I was inclined to use byte 255 but I discovered that is also prevalent in MSWord files. So I use byte 254 now, for EOF code to terminate a file.

Is there a performance hit using decimal data types (MySQL / Postgres)

I understand how integer and floating point data types are stored, and I am guessing that the variable length of decimal data types means it is stored more like a string.
Does that imply a performance overhead when using a decimal data type and searching against them?
Pavel has it quite right, I'd just like to explain a little.
Presuming that you mean a performance impact as compared to floating point, or fixed-point-offset integer (i.e. storing thousandths of a cent as an integer): Yes, there is very much a performance impact. PostgreSQL, and by the sounds of things MySQL, store DECIMAL / NUMERIC in binary-coded decimal. This format is more compact than storing the digits as text, but it's still not very efficient to work with.
If you're not doing many calculations in the database, the impact is limited to the greater storage space required for BCD as compared to integer or floating point, and thus the wider rows and slower scans, bigger indexes, etc. Comparison operations in b-tree index searches are also slower, but not enough to matter unless you're already CPU-bound for some other reason.
If you're doing lots of calculations with the DECIMAL / NUMERIC values in the database, then performance can really suffer. This is particularly noticeable, at least in PostgreSQL, because Pg can't use more than one CPU for any given query. If you're doing a huge bunch of division & multiplication, more complex maths, aggregation, etc on numerics you can start to find yourself CPU-bound in situations where you would never be when using a float or integer data type. This is particularly noticeable in OLAP-like (analytics) workloads, and in reporting or data transformation during loading or extraction (ETL).
Despite the fact that there is a performance impact (which varies based on workload from negligible to quite big) you should generally use numeric / decimal when it is the most appropriate type for your task - i.e. when very high range values must be stored and/or rounding error isn't acceptable.
Occasionally it's worth the hassle of using a bigint and fixed-point offset, but that is clumsy and inflexible. Using floating point instead is very rarely the right answer due to all the challenges of working reliably with floating point values for things like currency.
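To make the trade-off concrete, here is a small Python sketch of the three options for a currency amount: binary floating point, an exact decimal type, and an integer with a fixed-point offset (whole cents). The amounts and the helper name format_cents are made up for illustration; it shows the rounding issue, not database performance.

```python
from decimal import Decimal

# Binary floating point: 0.10 and 0.20 are not exactly representable.
float_total = 0.1 + 0.2
float_is_exact = (float_total == 0.3)

# Exact decimal arithmetic, analogous to NUMERIC / DECIMAL columns.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_is_exact = (decimal_total == Decimal("0.30"))

# Fixed-point offset: store whole cents in an integer.
cents_total = 10 + 20


def format_cents(cents):
    """Render a non-negative amount of cents as a decimal string."""
    return f"{cents // 100}.{cents % 100:02d}"


display = format_cents(cents_total)
```

The integer version is exact and fast, but every piece of code that touches the value has to know about the implied scale, which is the clumsiness mentioned above.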
(BTW, I'm quite excited that some new Intel CPUs, and IBM's Power 7 range of CPUs, include hardware support for IEEE 754 decimal floating point. If this ever becomes available in lower end CPUs it'll be a huge win for databases.)
The impact of the decimal type (NUMERIC in Postgres) depends on usage. For typical OLTP the impact is unlikely to be significant; for OLAP it can be relatively high. In our application, an aggregation over large numeric columns is several times slower than over double precision columns.
Although current CPUs are fast, the rule still holds: use NUMERIC only when you need exact numbers or very large numbers. Otherwise use the float or double precision type.
You are correct: fixed-point data is stored as a (packed BCD) string.
To what extent this impacts performance depends on a range of factors, which include:
Do queries utilise an index upon the column?
Can the CPU perform BCD operations in hardware, such as through Intel's BCD opcodes?
Does the OS harness hardware support through library functions?
Overall, any performance impact is likely to be pretty negligible relative to other factors that you may face, so don't worry about it. Remember Knuth's maxim, "premature optimisation is the root of all evil".
I am guessing that the variable length of decimal data types means it
is stored more like a string.
From the MySQL documentation, as of MySQL 5.0.3:
Values for DECIMAL columns no longer are represented as strings that
require 1 byte per digit or sign character. Instead, a binary format
is used that packs nine decimal digits into 4 bytes. This change to
DECIMAL storage format changes the storage requirements as well. The
storage requirements for the integer and fractional parts of each
value are determined separately. Each multiple of nine digits requires
4 bytes, and any remaining digits require some fraction of 4 bytes.
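The storage rule quoted above can be turned into a small calculator. This Python sketch is an illustration only: the function names are made up, and the table for the leftover digits (the "some fraction of 4 bytes" part) is taken from the storage requirements section of the MySQL manual rather than from the excerpt itself.

```python
# Bytes needed for the digits left over after full groups of nine
# (per the MySQL manual's DECIMAL storage requirements table).
LEFTOVER_BYTES = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4}


def part_bytes(digits):
    """Bytes used by one part (integer or fractional) with `digits` digits."""
    full_groups, leftover = divmod(digits, 9)
    return full_groups * 4 + LEFTOVER_BYTES[leftover]


def decimal_storage_bytes(precision, scale):
    """Bytes used by a DECIMAL(precision, scale) value.

    The integer and fractional parts are sized separately.
    """
    return part_bytes(precision - scale) + part_bytes(scale)


size_18_9 = decimal_storage_bytes(18, 9)
size_20_6 = decimal_storage_bytes(20, 6)
```

For comparison, a DOUBLE always takes 8 bytes and a BIGINT 8 bytes, so a DECIMAL(18,9) is no wider than either of them.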

Sorting an AS3 ByteArray

I am designing an Air application that needs to store thousands of records in memory, and needs to sort them efficiently, by various keys.
I thought of using a ByteArray, since that would avoid all the overhead of normal AS3 objects, and would allow me to use memory more efficiently.
However, the challenge is how to sort the records inside the ByteArray. I have thought of two possibilities:
1- Implement quick-sort or heap-sort in AS3, and sort the array this way. However, I am not sure this will be performant enough. For example, ByteArrays do not have methods to copy chunks of memory around; it has to be done byte-by-byte.
2- Create an Air Native Extension (ANE) that takes the ByteArray and sorts it, using C. The drawback of this is that it will be harder to implement for all the platforms it needs to run on.
What would you recommend? Do you have any previous experience doing something similar?
I'd say use Array or Vector objects. You can sort Arrays on whatever key you want via sortOn(), and Vectors via sort(), which accepts a comparison function as its parameter, so you can achieve whatever behavior you need. And I believe you won't get anywhere with ByteArrays: sorting objects only rearranges the references to them, while a ByteArray contains the actual data, which would have to be moved.
You should never design anything that HAS to have hundreds of thousands of anything in memory at once. Offload stuff while you don't need it. Do you know how much 100,000 is? Taking a single byte and multiplying by 100,000 gives you 100 KB. For every 10 bytes of data in a record, you will use about 1 MB of memory. Recording 100,000 ints takes 400 KB.
If your records have two 20-character strings (a first and last name), and each String character takes 8 bytes, that is 320 bytes per record, so you have just filled 32 MB of memory with nothing more than first and last names. Even if you managed to truncate this down to ByteArray level with superhuman uber bit-shifting, you're still only talking about reducing the data by a factor of 8. So now you have 4 MB for just first and last names and no other data. You could survive on that, except I suspect your records have more data than 2 strings. 20 strings? You're eating 320 MB as objects. Most 'bad' computers have like, what... 2 GB of memory? Good job taking up a sixth of that. Offload everything but 100 records at a time, and you're down to 32 KB of memory for those names. And yes, you can load and offload while sorting.
Chunks of memory don't copy faster than bytes. It's all the same. The reason Vectors of Objects are performant when swapping is that they swap references/pointers (one single 32-bit or 64-bit number) instead of copying chunks of memory.
It's not clear what you're sorting. Bytes only go up to a value of 255, so clearly you're using more than 1 byte for each record. You want to evaluate each set of... like 2000+ bytes against each other set of 2000+ bytes? Like "Ah, last name is bytes 32-83, so extract those bytes, for every group of 4 bytes, bit shift them 0, 8, 16, 24 bits respectively, add them together, concatenate their integer values into a String, do a comparison, now compare bytes 84-124 against the bytes in the next option, now transfer bytes 0-2000 to location 443530-441530 and......." Do these records have variable length strings or arrays in them? Oh lord.
Flash is not the place to write in assembly!
Use objects and test the speed & memory consumption. If either makes you cry, use more conventional methods of reducing load, like offloading materials temporarily into text files. The ugliest you should be getting is avoiding objects by storing each individual property in a different Vector and having the same index refer to one record across the board.
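The "one Vector per property" layout can be sketched as follows. Python lists stand in for AS3 Vectors here, and the records are made up for illustration. The point is that sorting only has to rearrange a list of indices; the property data itself never moves.

```python
# One list per property; index i refers to the same record in every list.
first_names = ["Ada", "Alan", "Grace", "Edsger"]
last_names = ["Lovelace", "Turing", "Hopper", "Dijkstra"]
ages = [36, 41, 85, 72]

# Sorting by a key produces a permutation of record indices.
by_last_name = sorted(range(len(last_names)), key=lambda i: last_names[i])
by_age = sorted(range(len(ages)), key=lambda i: ages[i])


def record(i):
    """Assemble one record from the parallel lists."""
    return (first_names[i], last_names[i], ages[i])


youngest = record(by_age[0])
first_alphabetically = record(by_last_name[0])
```

Keeping one index list per sort key also means you can hold several orderings of the same records at once without duplicating any of the data.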

Does using binary numbers in code improve performance?

I've seen quite a few examples where binary numbers are used in code, like 32, 64, 128 and so on (a very well known example is Minecraft).
I want to ask, does using binary numbers in such high level languages as Java / C++ help anything?
I know assembly and that you would always rather use these because in low level language it overcomplicates things if you go above register limit.
Will programs run any faster/save up more memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because it's so fast to do multiply and divide logic with powers of two, the OS/hardware/etc. will use cache line sizes, page sizes, etc. that are powers of two, so you'd do well to have nice power-of-two sizes for your important data structures).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
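A minimal sketch of the bitmask idea, written in Python for brevity (the same operators exist in C++ and Java). The flag names come from the example above; their bit positions are arbitrary choices made for this illustration.

```python
# Each option gets its own bit, so every constant is a power of two.
RESIZABLE = 1 << 0   # 1
REMOVABLE = 1 << 1   # 2
EDITABLE = 1 << 2    # 4

# Combine options with bitwise OR.
options = RESIZABLE | EDITABLE

# Test an option with bitwise AND.
is_editable = bool(options & EDITABLE)
is_removable = bool(options & REMOVABLE)

# Clear an option by ANDing with its complement.
without_resizable = options & ~RESIZABLE
```

Up to 32 such options fit in a single 32-bit integer, and setting, testing or clearing any of them is one bitwise operation.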
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean, "use constants that are round number in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
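The rewrite mentioned above relies on simple identities, shown here in Python only to make them explicit (the function names are made up for this example). In an interpreted language like Python neither form is meaningfully faster; the point is what an optimizing C++ or Java compiler is allowed to substitute behind your back.

```python
def times_eight_multiply(x):
    return x * 8


def times_eight_shift(x):
    # 8 == 2**3, so multiplying by 8 equals shifting left by 3 bits.
    return x << 3


def wrap_index_modulo(i, size):
    return i % size


def wrap_index_mask(i, size):
    # Only valid when size is a power of two: size - 1 is then all ones
    # in the low bits, so the AND keeps exactly the remainder.
    return i & (size - 1)


same_product = all(
    times_eight_multiply(x) == times_eight_shift(x) for x in range(-100, 100)
)
same_index = all(
    wrap_index_modulo(i, 64) == wrap_index_mask(i, 64) for i in range(1000)
)
```

Because the compiler applies these substitutions itself when a constant happens to be a power of two, writing the shift or mask by hand rarely buys anything.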
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change with its attendant ramifications would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but usually memory is allocated by power of 2. E.g. if you wrote char x[100], you'd probably get 128 allocated bytes.
No, your code will run the same way, no matter what number you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024..., they are common mainly because of space optimization. For example, an 8-bit pointer is capable of pointing to 256 addresses (a power of 2), so if you use fewer than 256 you are wasting your pointer; so normally you allocate a 256-entry buffer. The same works for all other powers of 2.
In most cases the answer is no; there is no noticeable performance difference.
However, there are certain cases (very few) when NOT using binary numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases when you're looping over a structure that fills the cache in such a way that you have cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be preoptimized unless you're having problems with your code performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.