What's the term to describe an algorithm that has the same complexity as the underlying prroblem? - terminology

Whilst researching data structures for a project a few months back, I came across a term that I quite liked, that could be used as follows:
This [Algorithm/Solution/Data structure] is ?????ally-optimal
Meaning that the time (or space, depending on context) complexity of the solution being referred to is the same as the fundamental complexity as the problem it solves.
For example, if we ignore quantum computation and accept that problem of sorting is O(n log n) time in the general case, then with respect to time complexity heap sort is ?????ally-optimal because its complexity is also O(n log n), whereas bubble sort is not ?????ally-optimal because O(n^2) is worse than O(n log n).
I have no idea where I read it, I've so far failed to find it with google, and not being able to remember it has been bothering me ever since!

Maybe you are talking about Asymptotically optimal algorithm:
In computer science, an algorithm is said to be asymptotically optimal if, roughly speaking, for large inputs it performs at worst a constant factor (independent of the input size) worse than the best possible algorithm.

Are you thinking computationally optimal? probably "asymptotically optimal," like another answer said. It seems what you're describing is big-theta:
If a problem has been proven to take at least f(x), it is called Omega(f(x)); an algorithm's worst case is big-O(g(x)). When f(x) == g(x), that is to say the worst case for the solution is the best case for the problem, the algorithm is big-theta(f(x)). So heapsort, e.g. is theta(n*log(n)).

Related

What is a modifying function with amortized complexity requirement vs. one with non amortized?

What is the official theoretical definition of amortized vs. non amortized complexity of a modification function? The question is especially relevant inside the C++ standard for the STL, but it's a more general question.
Can "constant" (non mutating) functions (observers, or "getters") have amortized complexity?
EDIT: clarification of apparently confusing "observers" (which can either mean observing function or external observer, according to context).
There are at least three (indeed more) independent and orthogonal axes along which asymptotic analysis of algorithms may occur: the case (best, average, worst); the bound (big-O, little-omega, Theta); and whether amortization is to be considered.
The case describes the class of input under consideration. It is literally a subset of the inputs to the algorithm. When performing asymptotic complexity analysis, it is natural to partition the input state into subsets based on the algorithm's performance w.r.t. elements of the subset. So, you might have a best case for which the algorithm does asymptotically as well as possible; a subset where it does asymptotically as poorly as possible; or, in the case of average complexity, a set of inputs along with their relative weight or probability of occurring. In the absence of a specifically described case, it's natural to assume that all inputs are included and have equal weight.
The bound describes the how the algorithm's complexity behaves asymptotically for inputs of a certain class. The complexity can be bounded from above, from below, or both; the bounds can be tight or not; and while which bound you choose might in practice be informed by the case under consideration (lower bound on best case, upper bound on worst case, etc.) in theory the choice is completely independent.
Amortized analysis is performed on top of the underlying complexity analysis and contemplates not single inputs but sequences of inputs. Amortized analysis seeks to explain how the aggregate time complexity of a sequence of operations behaves. For instance, consider the simple case of inserting a new element into an array-backed vector structure. If we have enough capacity, the operation is O(1). if we lack capacity, the operation is O(n). If we increase the capacity of the vector arithmetically (adding, e.g., k new spots each time), then we will have O(1) accesses (k-1)/k of the time, and O(n) accesses 1/k of the time, for O(n) amortized complexity. However, if we geometrically increase the capacity each time we need more capacity, then we find a sequence of adds will have O(1) amortized complexity.
Truly constant functions can have amortized analysis performed but there is really no reason to do so. Amortized analysis only makes sense when a potentially small number of repeated requests have poor individual performance, while the majority (asymptotically speaking) of requests has asymptotically better performance.

Does using binary numbers in code improves performance?

I've seen quite a few examples where binary numbers are being used in code, like 32,64,128 and so on (for instance, very well known example - minecraft)
I want to ask, does using binary numbers in such high level languages as Java / C++ help anything?
I know assembly and that you would always rather use these because in low level language it overcomplicates things if you go above register limit.
Will programs run any faster/save up more memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficiency for memory purposes (typically because it's so fast do to multiply and divide logic with powers of two, the OS/hardware/etc will use cache line / page sizes / etc that are powers of two, so you'd do well to have nice power of two sizes for your important data structures).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean, "use constants that are round number in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunck of your data to the hard drive and then read it back when it needs it. One extra hard drive right takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happenning at the low level, so it would often be hard to predict whether a change with its intendent ramifications would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but usually memory is allocated by power of 2. E.g. if you wrote char x[100], you'd probably get 128 allocated bytes.
No, your code will ran the same way, no matter what is the number you use.
If by binary numbers you mean numbers that are power of 2, like: 2, 4, 8, 16, 1024.... they are common due to optimization of space, normally. Example, if you have a 8 bit pointer it is capable of point to 256 (that is a power of 2), addresses, so if you use less than 256 you are wasting your pointer.... so normally you allocate a 256 buffer... this same works for all other power of 2 numbers....
In most cases the answer is almost always no, there is no noticeable performance difference.
However, there are certain cases (very few) when NOT using binary numbers for array/structure sizes/length will give noticeable performance benefits. These are cases when you're filling the cache and because you're looping over a structure that fills the cache in a such a way that you have cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be preoptimized unless you're having problems with your code performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

segmented reduction with scattered segments

I got to solve a pretty standard problem on the GPU, but I'm quite new to practical GPGPU, so I'm looking for ideas to approach this problem.
I have many points in 3-space which are assigned to a very small number of groups (each point belongs to one group), specifically 15 in this case (doesn't ever change). Now I want to compute the mean and covariance matrix of all the groups. So on the CPU it's roughly the same as:
for each point p
{
mean[p.group] += p.pos;
covariance[p.group] += p.pos * p.pos;
++count[p.group];
}
for each group g
{
mean[g] /= count[g];
covariance[g] = covariance[g]/count[g] - mean[g]*mean[g];
}
Since the number of groups is extremely small, the last step can be done on the CPU (I need those values on the CPU, anyway). The first step is actually just a segmented reduction, but with the segments scattered around.
So the first idea I came up with, was to first sort the points by their groups. I thought about a simple bucket sort using atomic_inc to compute bucket sizes and per-point relocation indices (got a better idea for sorting?, atomics may not be the best idea). After that they're sorted by groups and I could possibly come up with an adaption of the segmented scan algorithms presented here.
But in this special case, I got a very large amount of data per point (9-10 floats, maybe even doubles if the need arises), so the standard algorithms using a shared memory element per thread and a thread per point might make problems regarding per-multiprocessor resources as shared memory or registers (Ok, much more on compute capability 1.x than 2.x, but still).
Due to the very small and constant number of groups I thought there might be better approaches. Maybe there are already existing ideas suited for these specific properties of such a standard problem. Or maybe my general approach isn't that bad and you got ideas for improving the individual steps, like a good sorting algorithm suited for a very small number of keys or some segmented reduction algorithm minimizing shared memory/register usage.
I'm looking for general approaches and don't want to use external libraries. FWIW I'm using OpenCL, but it shouldn't really matter as the general concepts of GPU computing don't really differ over the major frameworks.
Even though there are few groups, I don't think you will be able to avoid the initial sorting into groups while still keeping the reduction step efficient. You will probably also want to perform the full sort, not just sorting indexes, because that will help keep memory access efficient in the reduction step.
For sorting, read about general strategies here:
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html
For reduction (old but still good):
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
For an example implementation of parallel reduction:
http://developer.nvidia.com/cuda-cc-sdk-code-samples#reduction

Counting FLOPS/GFLOPS in program - CUDA

Already finished my application which multiplies CRS matrix and vector (SpMV) and the only thing to do now is to count FLOPS my application did. In my opinion it's really hard to estimate number of floating point operation in case of sparse matrix - vector multiplication, because the number of multiplies in one row is really "jumpy" or fluent.
I only tried to measure time using "cudaprof" ( available in ./CUDA/bin directory) - it works fine.
Any sugestions and instruction pastes appreciated !
That's not just your opinion; it's simple fact that the number of operations in the case of a sparse matrix is data-dependent, and so you can't get a reasonable answer without knowing something about the data. That makes it impossible to have a one-number-fits-all-data estimate.
This is probably one of the sorts of situations where you could think hard about it for many hours (and do lots of research) to make a maybe-accurate estimate, or you could spend a few minutes writing a variant of your existing implementation that increments a counter each time it does an operation. Sure, that's going to take quite a while to run (especially if you don't do it in a CUDA-enabled form), but probably a lot less time than it would take to do the thinking, and when the answer comes out, you don't have to do a lot of work to convince yourself that it's right.

High-level/semantic optimization

I'm writing a compiler, and I'm looking for resources on optimization. I'm compiling to machine code, so anything at runtime is out of the question.
What I've been looking for lately is less code optimization and more semantic/high-level optimization. For example:
free(malloc(400)); // should be completely optimized away
Even if these functions were completely inlined, they could eventually call OS memory functions which can never be inlined. I'd love to be able to eliminate that statement completely without building special-case rules into the compiler (after all, malloc is just another function).
Another example:
string Parenthesize(string str) {
StringBuilder b; // similar to C#'s class of the same name
foreach(str : ["(", str, ")"])
b.Append(str);
return b.Render();
}
In this situation I'd love to be able to initialize b's capacity to str.Length + 2 (enough to exactly hold the result, without wasting memory).
To be completely honest, I have no idea where to begin in tackling this problem, so I was hoping for somewhere to get started. Has there been any work done in similar areas? Are there any compilers that have implemented anything like this in a general sense?
To do an optimization across 2 or more operations, you have to understand the
algebraic relationship of those two operations. If you view operations
in their problem domain, they often have such relationships.
Your free(malloc(400)) is possible because free and malloc are inverses
in the storage allocation domain.
Lots of operations have inverses and teaching the compiler that they are inverses,
and demonstrating that the results of one dataflow unconditionally into the other,
is what is needed. You have to make sure that your inverses really are inverses
and there isn't a surprise somewhere; a/x*x looks like just the value a,
but if x is zero you get a trap. If you don't care about the trap, it is an inverse;
if you do care about the trap then the optimization is more complex:
(if (x==0) then trap() else a)
which is still a good optimization if you think divide is expensive.
Other "algebraic" relationships are possible. For instance, there are
may idempotent operations: zeroing a variable (setting anything to the same
value repeatedly), etc. There are operations where one operand acts
like an identity element; X+0 ==> X for any 0. If X and 0 are matrices,
this is still true and a big time savings.
Other optimizations can occur when you can reason abstractly about what the code
is doing. "Abstract interpretation" is a set of techniques for reasoning about
values by classifying results into various interesting bins (e.g., this integer
is unknown, zero, negative, or positive). To do this you need to decide what
bins are helpful, and then compute the abstract value at each point. This is useful
when there are tests on categories (e.g., "if (x<0) { ... " and you know
abstractly that x is less than zero; you can them optimize away the conditional.
Another way is to define what a computation is doing symbolically, and simulate the computation to see the outcome. That is how you computed the effective size of the required buffer; you computed the buffer size symbolically before the loop started,
and simulated the effect of executing the loop for all iterations.
For this you need to be able to construct symbolic formulas
representing program properties, compose such formulas, and often simplify
such formulas when they get unusably complex (kinds of fades into the abstract
interpretation scheme). You also want such symbolic computation to take into
account the algebraic properties I described above. Tools that do this well are good at constructing formulas, and program transformation systems are often good foundations for this. One source-to-source program transformation system that can be used to do this
is the DMS Software Reengineering Toolkit.
What's hard is to decide which optimizations are worth doing, because you can end
of keeping track of vast amounts of stuff, which may not pay off. Computer cycles
are getting cheaper, and so it makes sense to track more properties of the code in the compiler.
The Broadway framework might be in the vein of what you're looking for. Papers on "source-to-source transformation" will probably also be enlightening.