Does using binary numbers in code improve performance?

I've seen quite a few examples where binary numbers are used in code, like 32, 64, 128 and so on (a very well-known example is Minecraft).
I want to ask: does using binary numbers in high-level languages such as Java / C++ help at all?
I know assembly, and there you would always rather use these, because in a low-level language things get overcomplicated if you go above the register limit.
Will programs run any faster or use less memory if you use binary numbers?

As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be replaced with different, faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly (e.g. multiplying by a power of two is cheap).
Other times, algorithms are suited to representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because it's so fast to do multiply and divide logic with powers of two, the OS/hardware/etc. will use cache line sizes, page sizes, etc. that are powers of two, so you'd do well to have nice power-of-two sizes for your important data structures).
And then, on top of that, other times programmers are just so used to using powers of two that they simply do it because it seems like a nice number.

There are some benefits to using power-of-two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc.) are incredibly fast.
In C++ and Java, this is done a fair bit, especially in GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc.) packed into a single integer, and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
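To make the bitmask idea concrete, here is a minimal sketch. The flag names come from the menu options mentioned above, but the helper functions are made up for illustration and are not taken from any GUI toolkit. Each option gets its own power of two, so a whole set of options fits in one integer:

```python
# Hypothetical option flags: each one is a distinct power of two (a single bit).
RESIZABLE = 1 << 0  # 0b001
REMOVABLE = 1 << 1  # 0b010
EDITABLE = 1 << 2   # 0b100


def set_option(options: int, flag: int) -> int:
    """Return options with the given flag switched on."""
    return options | flag


def clear_option(options: int, flag: int) -> int:
    """Return options with the given flag switched off."""
    return options & ~flag


def has_option(options: int, flag: int) -> bool:
    """Check whether the given flag is switched on."""
    return (options & flag) != 0


window = set_option(set_option(0, RESIZABLE), EDITABLE)
```

The same pattern works with `|`, `&` and `~` in Java or C++; only the syntax around it changes.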

From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean, "use constants that are round numbers in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
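As a rough illustration of the identities such a compiler relies on (the function names are mine, and in Python this shows only that the results agree, not that anything runs faster):

```python
def times_pow2(x: int, k: int) -> int:
    """Multiply x by 2**k using a left shift."""
    return x << k


def div_pow2(x: int, k: int) -> int:
    """Floor-divide a non-negative x by 2**k using a right shift."""
    return x >> k


def mod_pow2(x: int, k: int) -> int:
    """Remainder of a non-negative x modulo 2**k using a bit mask."""
    return x & ((1 << k) - 1)
```

None of these rewrites exist for a constant like 100, which is why only powers of two get this treatment.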
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change, with its attendant ramifications, would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.

The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but memory is often allocated in power-of-2 sizes. E.g. if you asked a heap allocator for 100 bytes, it might well round the request up and give you 128 bytes.

No, your code will run the same way, no matter what number you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024, they are common mostly because they make good use of space. For example, an 8-bit pointer can point to 256 addresses (256 is a power of 2), so if you use fewer than 256 you are wasting part of your pointer's range, and so normally you allocate a 256-entry buffer. The same goes for all other powers of 2.

In most cases the answer is no: there is no noticeable performance difference.
However, there are certain (very few) cases where NOT using binary numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases where you're looping over a structure that fills the cache in such a way that you get cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be optimized for in advance unless your code is performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

Related

How can you handle absurdly large numbers?

There are some scenarios where programmers need or want to find grossly large numbers. These are often so large that they defy the programmer's comprehension. I'm talking about things like the largest known prime number (with 12978189 digits) and the recently calculated 10 trillion digits of pi.
How can you create a program that handles these? This far exceeds an integer, a long, a double, a BigInteger, a BigDecimal, or anything of the sort. How do these kinds of programs for discovering these numbers get created? How can you even store them in memory when no appropriate datatypes exist, and they would likely consume gigabytes of data each?

To address your specific examples:
A 12 million digit integer isn't terribly large for a typical "large integer" class to handle. This should be able to be stored in memory.
To store 10 trillion digits of π, you could use a disk file and memory-map it. You'll need a 64 bit OS and application, but you can simply create a 10 terabyte file on disk (you'll probably need a few disks and a filesystem like ZFS that can store it across disks), and map it into CPU address space. The algorithms that calculate π (such as BBP) conveniently calculate one hex digit at a time which fits well into half a byte of memory.

The (abstract) answer is to write algorithms using the machine's native types that produce the results you want. For instance, when you add two very large integers by hand on paper, the biggest actual calculation you need is only 9+9+1 (nine plus nine plus one for the carry). Of course you need paper large enough to write the two numbers down in the first place, and the answer as well. So as long as the two numbers and the answer can be stored on a computer's hard disk (the paper), an algorithm can be written that does it with variables that only need to hold a value up to 19; so even a char variable is more than capable of handling this, let alone an int variable.
The (concrete) answer is that really good programmers have already done this, and there are even FOSS libraries for it. One good one is the GNU Project's GMP library, which has loads of functions to handle arbitrary-size integer arithmetic and arbitrary-precision floating point arithmetic. So as long as your computer can store the information needed during the calculation, it can be done. You'll need to invest the time to read the documentation, of course.
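The pencil-and-paper method above can be sketched in a few lines. This is only an illustration of the idea, with a function name of my own; real libraries work on much larger "digits" (whole machine words) and use far faster multiplication algorithms. The numbers are held as decimal strings, and no intermediate value ever exceeds 19:

```python
def add_decimal(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal digit strings."""
    digits = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk both numbers from the least significant digit, like on paper.
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0
        db = int(b[j]) if j >= 0 else 0
        total = da + db + carry  # at most 9 + 9 + 1 = 19
        digits.append(str(total % 10))
        carry = total // 10
        i -= 1
        j -= 1
    return "".join(reversed(digits))
```

For example, `add_decimal("999", "1")` carries through every column and returns `"1000"`.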

speed up ideas -- can CUDA help here?

I'm working on an algorithm that has to do a small number
of operations on a large number of small arrays, somewhat independently.
To give an idea:
1k sorting of arrays of length typically of 0.5k-1k elements.
1k of LU-solve of matrices that have rank 10-20.
everything is in floats.
Then, there is some horizontality to this problem: the above
operations have to be carried independently on 10k arrays.
Also, the intermediate results need not be stored: for example, I don't
need to keep the sorted arrays, only the sum of the smallest m elements.
The whole thing has been programmed in c++ and runs. My question is:
would you expect a problem like this to enjoy significant speed ups
(factor 2 or more) with CUDA?

You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X with this over the CPU. I'm getting speedups of ~4X with this over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run ArrayFire Free version.
array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently
array a = randu(15,15,1000,f32), b; // separate names so both snippets can share a scope
gfor (array i, a.dim(2))
b(span,span,i) = lu(a(span,span,i)); // LU-decomposition of each 15x15 matrix
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.

If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.

Just a few pointers, which you may already have incorporated:
1) If you just need the m smallest elements, you are probably better off just searching for the smallest element, removing it, and repeating m times.
2) Did you already parallelize the code on the CPU? OpenMP or so ...
3) Did you think about buying better hardware? (I know it's not the nice thing to do, but if you want to reach performance goals for a specific application it's sometimes the cheapest possibility ...)
If you want to do it on CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the Thrust library for the sorting; hopefully someone else can suggest a good LU-decomposition algorithm.
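As a CPU-side sketch of point 1 (not CUDA, and the function name is mine): instead of literally removing the minimum m times, a bounded selection gives the same sum without fully sorting the array. `heapq.nsmallest` from the Python standard library does this in roughly O(n log m):

```python
import heapq


def sum_of_smallest(values, m):
    """Sum of the m smallest elements, without sorting the whole array."""
    return sum(heapq.nsmallest(m, values))
```

With 0.5k-1k elements per array, whether this beats a plain sort depends on m, so it is worth measuring both.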

Comparison and addition of two integers: in detail

I would like to know how many machine cycles it takes to compare two integers, how many it takes to add them, and which one is easier.
Basically, I'm looking to see which one is generally more expensive.
I also need an answer from a C, C++ and Java perspective.
Help is appreciated, thanks!

The answer is yes. And no. And maybe.
There are machines that can compare two values in their spare time between cycles, and others that need several cycles. On the old PDP-8 you first had to negate one operand, do an add, and then test the result to do a compare.
But other machines can compare much faster than add, because no register needs to be modified.
And on still other machines the operations take the same time, but it takes several cycles for the result of the compare to make it to a place where one can test it. If you can use those cycles, the compare is almost free; if you have no other operations to shove into those cycles, it is fairly expensive.

The simple answer is one cycle, both operations are equally easy.
A totally generic answer is difficult to give, since processor architectures are amazingly complex when you get down into the details.
All modern processors are pipelined. That is, there are no instructions where the operands go in on cycle c, and the result is available on cycle c+1. Instead, the instruction is broken down into multiple steps.
The instructions are read into the front end of the processor, which decodes the instruction. This may include breaking it down into multiple micro-ops. The operands are then read into registers, and then the execution units handle the actual operation. Eventually the answer is returned back to a register.
The instructions go through one pipeline stage each cycle, and modern CPUs have 10-20 pipeline stages. So it could take up to 20 processor cycles to add or compare two numbers. However, once one instruction has been through one stage of the pipeline, another instruction can be read into that stage. The ideal is that each clock cycle, one instruction goes into the front end, while one set of results comes out the other.
There is massive complexity involved in getting all this to work. If you want to do a+b+c, you need to add a+b before you can add c. So a lot of the work in the front end of the processor involves scheduling. Modern processors employ out-of-order execution, so that the processor will examine the incoming instructions, and re-order them such that it does a+b, then gets on with some other work, and then does result+c once the result is available.
Which all brings us back to the original question of which is easier. Because usually, if you're comparing two integers, it is to make a decision on what to do next. Which means you won't know your next instruction until you've got the result of the last one. Because the instructions are pipelined, this means you can lose 20 clock cycles of work if you wait.
So modern CPUs have a branch predictor which makes a guess at what the result will be, and continues executing the instructions. If it guesses wrong, the pipeline has to be thrown out, and work restarted on the other branch. The branch predictor helps enormously, but still, if the comparison is a decision point in the code, that is far more difficult for the CPU to deal with than the addition.

Comparison is done via subtraction, which is almost the same as addition, except that the carry and subtrahend are complemented, so a - b - c becomes a + ~b + ~c. This is already accounted for in the CPU and basically takes the same amount of time either way.
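Here is a small sketch of that identity, simulating a fixed-width register in Python (the 32-bit default and the function name are my assumptions for illustration). The subtrahend is complemented bitwise, and the borrow is complemented into the adder's carry-in:

```python
def sub_via_add(a: int, b: int, borrow: int = 0, bits: int = 32) -> int:
    """Compute (a - b - borrow) mod 2**bits using only addition."""
    mask = (1 << bits) - 1
    not_b = ~b & mask      # bitwise complement of the subtrahend
    carry_in = 1 - borrow  # complement of the one-bit borrow
    return (a + not_b + carry_in) & mask
```

With no borrow the carry-in is 1, which gives the familiar two's-complement rule a - b = a + ~b + 1.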

High-level/semantic optimization

I'm writing a compiler, and I'm looking for resources on optimization. I'm compiling to machine code, so anything at runtime is out of the question.
What I've been looking for lately is less code optimization and more semantic/high-level optimization. For example:
free(malloc(400)); // should be completely optimized away
Even if these functions were completely inlined, they could eventually call OS memory functions which can never be inlined. I'd love to be able to eliminate that statement completely without building special-case rules into the compiler (after all, malloc is just another function).
Another example:
string Parenthesize(string str) {
    StringBuilder b; // similar to C#'s class of the same name
    foreach(str : ["(", str, ")"])
        b.Append(str);
    return b.Render();
}
In this situation I'd love to be able to initialize b's capacity to str.Length + 2 (enough to exactly hold the result, without wasting memory).
To be completely honest, I have no idea where to begin in tackling this problem, so I was hoping for somewhere to get started. Has there been any work done in similar areas? Are there any compilers that have implemented anything like this in a general sense?

To do an optimization across 2 or more operations, you have to understand the
algebraic relationship of those two operations. If you view operations
in their problem domain, they often have such relationships.
Your free(malloc(400)) is possible because free and malloc are inverses
in the storage allocation domain.
Lots of operations have inverses, and teaching the compiler that they are inverses,
and demonstrating that the result of one flows unconditionally into the other,
is what is needed. You have to make sure that your inverses really are inverses
and there isn't a surprise somewhere; a/x*x looks like just the value a,
but if x is zero you get a trap. If you don't care about the trap, it is an inverse;
if you do care about the trap then the optimization is more complex:
(if (x==0) then trap() else a)
which is still a good optimization if you think divide is expensive.
Other "algebraic" relationships are possible. For instance, there are
many idempotent operations: zeroing a variable (setting anything to the same
value repeatedly), etc. There are operations where one operand acts
like an identity element; X+0 ==> X for any 0. If X and 0 are matrices,
this is still true and a big time savings.
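A toy sketch of identity-element rewriting, under my own assumptions: expressions are nested tuples like `("add", left, right)`, variables are strings, and only the two identity rules are implemented. A real compiler would apply rules like these to its own intermediate representation:

```python
def simplify(expr):
    """Remove identity elements: X + 0 ==> X and X * 1 ==> X."""
    if not isinstance(expr, tuple):
        return expr  # a variable name or a constant
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "add":
        if right == 0:
            return left
        if left == 0:
            return right
    if op == "mul":
        if right == 1:
            return left
        if left == 1:
            return right
    return (op, left, right)
```

For example, `simplify(("add", ("mul", "x", 1), 0))` reduces to just `"x"`.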
Other optimizations can occur when you can reason abstractly about what the code
is doing. "Abstract interpretation" is a set of techniques for reasoning about
values by classifying results into various interesting bins (e.g., this integer
is unknown, zero, negative, or positive). To do this you need to decide what
bins are helpful, and then compute the abstract value at each point. This is useful
when there are tests on categories (e.g., "if (x<0) { ... ") and you know
abstractly that x is less than zero; you can then optimize away the conditional.
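A minimal sketch of the sign bins described above, assuming mathematical integers with no overflow (all names here are mine). Each operation is evaluated on bins instead of concrete values, and a test like "x < 0" can sometimes be decided from the bin alone:

```python
NEG, ZERO, POS, UNKNOWN = "neg", "zero", "pos", "unknown"


def abstract_mul(a, b):
    """Sign of a product, given only the signs of the factors."""
    if a == ZERO or b == ZERO:
        return ZERO
    if UNKNOWN in (a, b):
        return UNKNOWN
    return POS if a == b else NEG


def abstract_add(a, b):
    """Sign of a sum, given only the signs of the operands."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b:
        return a  # pos + pos is pos, neg + neg is neg
    return UNKNOWN  # e.g. pos + neg could be anything


def less_than_zero(sign):
    """Decide the test x < 0: True, False, or None when undecidable."""
    if sign == NEG:
        return True
    if sign in (ZERO, POS):
        return False
    return None
```

If `less_than_zero` returns True or False for the value reaching a conditional, one branch is dead and the test can be removed; None means the compiler has to keep it.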
Another way is to define what a computation is doing symbolically, and simulate the computation to see the outcome. That is how you computed the effective size of the required buffer: you computed the buffer size symbolically before the loop started,
and simulated the effect of executing the loop for all iterations.
For this you need to be able to construct symbolic formulas
representing program properties, compose such formulas, and often simplify
such formulas when they get unusably complex (this kind of fades into the abstract
interpretation scheme). You also want such symbolic computation to take into
account the algebraic properties I described above. Tools that do this well are good at constructing formulas, and program transformation systems are often good foundations for this. One source-to-source program transformation system that can be used to do this
is the DMS Software Reengineering Toolkit.
What's hard is to decide which optimizations are worth doing, because you can end
up keeping track of vast amounts of stuff, which may not pay off. Computer cycles
are getting cheaper, and so it makes sense to track more properties of the code in the compiler.

The Broadway framework might be in the vein of what you're looking for. Papers on "source-to-source transformation" will probably also be enlightening.

How do you interpret the results from shootout.alioth.debian.org?

A lot of people talk about the performance comparison of some languages by referring to the tests on shootout.alioth.debian.org. The thing is, I don't know how to read the results. The image seems incomprehensible, as I can't seem to find a NORMAL legend. Can you explain one of the tests, with an image? Choose whatever languages you want.

All results are ratios between the speed / memory usage / source code size of the given programs in the two chosen languages.
Take Perl vs. Ruby, for example. Every benchmark is expressed in a ratio Perl / Ruby. For the mandelbrot program, the Perl implementation completed 8 times faster than the Ruby implementation. The result therefore is 1/8. This is then marked in the graph at the 1/8 point. The memory usage is actually better in Ruby, with a factor of 191.
The result of this is that the line marked by 1 indicates that the two chosen languages are equal in performance / memory usage / source code size, with the given implementations. Every value below 1 (downwards) means that the first mentioned language is faster / consumes less memory / is smaller. Everything above 1 (upwards) means that the latter language is faster, etc.
The vertical scale is logarithmic, meaning that small bars mean little difference, while long bars mean enormous difference.
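To see why a logarithmic scale suits ratios, here is a small sketch. The base-10 logarithm and the function name are my assumptions; I don't know which base the site actually uses. A ratio of 1 lands on the zero line, and "8 times faster" and "8 times slower" give bars of equal length in opposite directions:

```python
import math


def bar_length(ratio: float) -> float:
    """Signed bar length for a ratio plotted on a log scale."""
    return math.log10(ratio)


faster = bar_length(1 / 8)  # e.g. the Perl / Ruby mandelbrot result
slower = bar_length(8)      # the same gap, the other way round
```

On a linear scale the 1/8 bar would be squashed between 0 and 1 while the 8 bar would tower over it, even though both represent the same factor.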
All the vertical bars per measurement unit represent all the benchmarks that exist for this comparison, ordered from good to bad.
I hope this helps.
