Do different arithmetic operations have different processing times? - language-agnostic

Are the basic arithmetic operations the same with respect to processor usage? For example, if I do an addition vs. a division in a loop, will the calculation time for the addition be less than that for the division?
I am not sure if this question belongs here or on Computer Science SE.

Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
Those are the microcode and instruction timings for a massively old architecture, the 8086. It is a fairly simple place to start.
Of relevant note, the timings are measured in cycles, or clocks, and everything moves at the speed of the CPU (instructions are synchronized on the main clock, or frequency, of the microprocessor).
If you scroll down that table you'll see a division taking anywhere from 80 to 150 cycles.
Also note that operation speed is affected by which area of memory the operands reside in.
Note that on a modern processor you can have multiple instructions executing concurrently (even if the CPU is single-threaded), and some of them are executed out of order; vector instructions muddy the question even more,
e.g. an SSE multiplication can multiply several numbers in a single instruction (which may itself take multiple cycles).
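For instance, here is a minimal SSE sketch (C++ with x86 intrinsics; the array contents are just illustrative) showing four floats multiplied by a single instruction:

    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        alignas(16) float c[4];

        // One MULPS instruction multiplies all four lanes at once.
        __m128 va = _mm_load_ps(a);
        __m128 vb = _mm_load_ps(b);
        _mm_store_ps(c, _mm_mul_ps(va, vb));

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }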

Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.
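If you want to measure it yourself, a rough micro-benchmark along these lines can expose the gap (a sketch only: results vary by CPU and compiler flags, and an optimizing compiler may transform or remove naive loops, so treat the numbers as illustrative):

    #include <chrono>
    #include <cstdio>

    // Time a dependent chain of additions vs. a chain of divisions.
    // 'volatile' discourages the compiler from deleting the loops.
    int main() {
        const int N = 100000000;
        volatile double acc = 1.0;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) acc = acc + 1.000001;   // addition chain
        auto t1 = std::chrono::steady_clock::now();

        for (int i = 0; i < N; ++i) acc = acc / 1.000001;   // division chain
        auto t2 = std::chrono::steady_clock::now();

        std::printf("add: %.3fs  div: %.3fs\n",
                    std::chrono::duration<double>(t1 - t0).count(),
                    std::chrono::duration<double>(t2 - t1).count());
        return 0;
    }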

Related

Most efficient computational method to numerically minimize an 8-variable constrained system

I've been working for quite some time on finding a numerical instance of a solution to an 8-variable system of 7 very complicated inequalities plus a region specification. Unfortunately I cannot produce an MWE or anything of the sort, since the inputs are really long.
My current method is Mathematica's NMinimize routine, minimizing one of the 7 inequalities subject to every other condition as a constraint; the FindInstance command simply quits the kernel without being able to finish running.
NMinimize is able to produce output, but besides being slower than would be optimal, it produces results that do not obey every constraint.
The thing is that I need to be certain, for each benchmark I run, that if the output doesn't satisfy every constraint it is because such a set of real numbers doesn't exist; with my current method, experience tells me I can't be.
So: is there a foolproof, as-efficient-as-possible computational method for me to find a single numerical solution to 7 complicated inequalities (involving trigonometric functions) in 8 variables, or to be sure that such a set doesn't exist?
It could be a Mathematica/Python/Fortran package, a genetic algorithm, or anything else, as long as the documentation is clear enough.
You need to give importance multipliers to the constraints, and the optimization method should not be greedy.
A genetic algorithm combined with multiple starting points (or simulated annealing with diminishing mutations) tends to converge to the global minimum (hence not greedy) as you give it more time, but there is no guarantee the heuristic will complete X work in Y time. The more time you give it, the closer it converges to the global minimum.
In a genetic algorithm, you can add big constraint penalties like this:

    fitness_minima = some_function_output_between_1_and_10 +
                     (constraints_breached ? 1000.0f : 0.0f);

so that the DNAs with no constraint violations are favored in the crossover part of the GA.
"As efficient as possible" depends on your algorithm. If you can parallelize the algorithm and run it on multiple GPUs, it should give substantial speedup over CPU. Compared to some hours of Mona-Lisa painting by CPU, a parallelized version running on 3 low-end GPUs complete within 10 minutes (https://www.youtube.com/watch?v=QRZqBLJ6brQ). At least some OpenCL/CUDA supporting libraries/frameworks (like Tensorflow) should be able to accelerate your algorithm if you don't want to do the work distribution yourself.

CUDA Lookup Table vs. Algorithm

I know this can be tested, but I am interested in the theory: on paper, which should be faster?
I'm trying to work out what would be theoretically faster: a random look-up from a table in shared memory (so bank conflicts are possible) vs. an algorithm with, say, 'n' fp multiplications.
Best case, the shared-memory look-up has no bank conflicts and so takes 20-40 clock cycles; worst case is 32-way bank conflicts and 640-1280 clock cycles. The multiplications will take 'n' * cycles-per-instruction. Is this proper reasoning?
Do the fp multiplications each take 1 cycle? 5 cycles? At what number of multiplications does it make sense to use a shared-memory look-up table?
The multiplications will be 'n' x cycles per instruction. Is this proper reasoning?
Doing 'n' fp multiplications keeps the cores busy with those operations. It probably isn't just 'mul' instructions either; there will be others such as 'mov' in between, so it might be around n*3 instructions in total. When you fetch a cached value from shared memory, during those (20-40) * 5 (a guess at the average maximum bank conflicts) = ~150 clocks, the cores are free to do other things. If the kernel is compute-bound (limited), then using shared memory might be more efficient. If the kernel has limited shared memory, or using more shared memory would result in fewer in-flight warps, then recalculating the value would be faster.
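To make the trade-off concrete, a kernel might either read a staged value from a shared-memory table or recompute it with a chain of multiplications. A minimal sketch (the table size, index source, and recompute formula are invented for illustration):

    #define TABLE_SIZE 256

    // Variant A: random look-up in a shared-memory table (bank conflicts possible).
    __global__ void lookupKernel(const float* table, const int* idx,
                                 float* out, int n) {
        __shared__ float lut[TABLE_SIZE];
        for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
            lut[i] = table[i];                  // stage the table once per block
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) out[tid] = lut[idx[tid]];  // random access: 1- to 32-way conflicts
    }

    // Variant B: recompute the value with 'muls' multiplications instead.
    __global__ void computeKernel(const float* in, float* out, int n, int muls) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float v = in[tid];
        for (int k = 0; k < muls; ++k)
            v *= 1.000001f;                     // stand-in for the real formula
        out[tid] = v;
    }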
Do the fp multiplications each take 1 cycle? 5 cycles?
When I wrote this it was 6 cycles, but that was 7 years ago. It might (or might not) be faster now. That figure is only for a particular core, though, not the entire SM.
At what number of multiplications does it make sense to use a shared-memory look-up table? It's really hard to say. There are a lot of variables here, like the GPU generation, what the rest of the kernel is doing, the setup time for the shared memory, and so on.
A problem with generating random numbers inside a kernel is the additional register requirement. That can slow down the rest of the kernel, because higher register usage means lower occupancy.
Another solution (again, depending on the problem) would be to use a GPU RNG to fill a global-memory array with random numbers, and then have your kernel read from that array. A global-memory access takes 300-500 clock cycles, but there would not be any bank conflicts. Also, with Pascal (not yet released at the time of writing) there will be HBM2, which will likely lower global-memory access times even further.
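A minimal sketch of that approach using the cuRAND host API (error handling omitted; the buffer size is arbitrary):

    #include <curand.h>
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 20;
        float* devRandoms = nullptr;
        cudaMalloc(&devRandoms, n * sizeof(float));

        // Fill a global-memory array with uniform random numbers up front;
        // kernels then just read it (no per-thread RNG state or extra registers).
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
        curandGenerateUniform(gen, devRandoms, n);

        // ... launch kernels that consume devRandoms here ...

        curandDestroyGenerator(gen);
        cudaFree(devRandoms);
        return 0;
    }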
Hope this helps. Hopefully some other experts can chime in and give you better information.

Why does the order of dimensions make a big difference in performance?

To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is up to the user; for example, it could mean (width, height) or (rows, cols), which have the meaning reversed.
So I did an experiment with the CUDA sample in the SDK, 3_Imaging/convolutionSeparable: simply exchange .x and .y in the kernel function and reverse the dimensions of the blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same; however, the performance decreases: the original takes about 26 ms, while the modified one takes about 40 ms on my machine (SM 3.0).
My question is: what makes the difference? Is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, not convolutionColumns.
EDIT: The change can be found here.
There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to main memory, so the accesses are not as coalesced as in the original case.
You should think about GPU main memory the same way you would think about "CPU" memory: prefetching, blocking, sequential accesses... are the techniques to apply in order to get performance.
If you want to know more about this topic, the paper "What Every Programmer Should Know About Memory" is mandatory reading. You'll find an example there comparing row-major and column-major access to the elements of a matrix.
To get an idea of how important this is, consider that most, if not all, high-performance GPU codes perform a matrix transposition before any computation in order to achieve more coalesced memory access, and that additional step is still worth it in terms of performance (sparse matrix operations, for instance).
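The difference is easy to see in a pair of toy copy kernels (a sketch; width and height are assumed row-major dimensions):

    // Coalesced: the fast-varying thread index (threadIdx.x) walks along a
    // row, so consecutive threads in a warp touch consecutive addresses.
    __global__ void copyCoalesced(const float* in, float* out,
                                  int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            out[row * width + col] = in[row * width + col];
    }

    // Strided: with the roles swapped, threadIdx.x walks down a column, so
    // consecutive threads touch addresses 'width' floats apart and each
    // warp access splits into many memory transactions.
    __global__ void copyStrided(const float* in, float* out,
                                int width, int height) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        int col = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            out[row * width + col] = in[row * width + col];
    }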
Second. This is more subtle, but in some scenarios it has a deep impact on the performance of a kernel: the launch configuration. Launching 20 blocks of 10 threads is not the same as launching 10 blocks of 20 threads. There is a big difference in the amount of resources a thread needs (shared memory, number of registers, ...). The more resources a thread needs, the fewer warps can be mapped onto a single SM, so the lower the occupancy... and, most of the time, the lower the performance.
This does not apply to your question, though, since the numbers of blocks and threads stay the same.
When programming for GPUs you must be aware of the architecture in order to understand how such changes modify performance. Of course, I am not familiar with the code, so there will be other factors besides these two.

Comparison and addition of two integers: in detail

I would like to know how many machine cycles it takes to compare two integers, how many it takes to add them, and which one is easier.
Basically, I'm looking to see which one is generally more expensive.
Also, I need an answer from a C, C++, and Java perspective.
Help is appreciated, thanks!
The answer is yes. And no. And maybe.
There are machines that can compare two values in their spare time between cycles, and others that need several cycles. On the old PDP-8 you first had to negate one operand, do an add, and then test the result to do a compare.
But other machines can compare much faster than they add, because no register needs to be modified.
And on still other machines the operations take the same time, but it takes several cycles for the result of the compare to reach a place where one can test it. So if you can fill those cycles, the compare is almost free; but it is fairly expensive if you have no other operations to shove into them.
The simple answer is one cycle; both operations are equally easy.
A totally generic answer is difficult to give, since processor architectures are amazingly complex when you get down into the details.
All modern processors are pipelined. That is, there are no instructions where the operands go in on cycle c, and the result is available on cycle c+1. Instead, the instruction is broken down into multiple steps.
The instructions are read into the front end of the processor, which decodes the instruction. This may include breaking it down into multiple micro-ops. The operands are then read into registers, and then the execution units handle the actual operation. Eventually the answer is returned back to a register.
The instructions go through one pipeline stage each cycle, and modern CPUs have 10-20 pipeline stages. So it could take up to 20 processor cycles to add or compare two numbers. However, once one instruction has been through one stage of the pipeline, another instruction can be read into that stage. The ideal is that each clock cycle, one instruction goes into the front end while one set of results comes out the other.
There is massive complexity involved in getting all this to work. If you want to do a+b+c, you need to add a+b before you can add c. So a lot of the work in the front end of the processor involves scheduling. Modern processors employ out-of-order execution, so that the processor will examine the incoming instructions, and re-order them such that it does a+b, then gets on with some other work, and then does result+c once the result is available.
Which all brings us back to the original question of which is easier. Because usually, if you're comparing two integers, it is to make a decision on what to do next. Which means you won't know your next instruction until you've got the result of the last one. Because the instructions are pipelined, this means you can lose 20 clock cycles of work if you wait.
So modern CPUs have a branch predictor which guesses what the result will be and continues executing instructions. If it guesses wrong, the pipeline has to be thrown out and work restarted on the other branch. The branch predictor helps enormously, but still, if the comparison is a decision point in the code, it is far more difficult for the CPU to deal with than the addition.
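You can see that cost yourself with a classic micro-benchmark: the same loop runs much faster when the branch is predictable (sorted data) than when it is random. A sketch (exact timings depend on the CPU and compiler):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    // Sum only the elements >= 128; the branch outcome depends on the data.
    static long long sumAbove(const std::vector<int>& v) {
        long long s = 0;
        for (int x : v)
            if (x >= 128) s += x;   // data-dependent branch
        return s;
    }

    static double secs(std::chrono::steady_clock::time_point a,
                       std::chrono::steady_clock::time_point b) {
        return std::chrono::duration<double>(b - a).count();
    }

    int main() {
        std::vector<int> data(1 << 24);
        for (int& x : data) x = std::rand() % 256;

        auto t0 = std::chrono::steady_clock::now();
        long long r1 = sumAbove(data);        // random order: frequent mispredictions
        auto t1 = std::chrono::steady_clock::now();

        std::sort(data.begin(), data.end());  // sorted: the branch becomes predictable
        auto t2 = std::chrono::steady_clock::now();
        long long r2 = sumAbove(data);
        auto t3 = std::chrono::steady_clock::now();

        std::printf("random: %.3fs  sorted: %.3fs  (sums %lld %lld)\n",
                    secs(t0, t1), secs(t2, t3), r1, r2);
        return 0;
    }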
Comparison is done via subtraction, which is almost the same as addition, except that the carry-in and the subtrahend are complemented, so a - b - c (where c is the borrow bit) becomes a + ~b + ~c. This is already accounted for in the CPU, and it basically takes the same amount of time either way.
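As a quick sanity check of the underlying two's-complement identity, a - b == a + ~b + 1 (a tiny sketch):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t a = 1000, b = 42;
        uint32_t diff   = a - b;
        uint32_t viaAdd = a + ~b + 1u;         // subtraction expressed as an addition
        std::printf("%u %u\n", diff, viaAdd);  // prints the same value twice
        return 0;
    }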

Matrix multiplication in CUDA

Say I want to multiply two matrices together, 50 by 50. I have two ways to arrange threads and blocks.
a) One thread to calculate each element of the result matrix, so each thread has a loop that multiplies one row by one column.
b) One thread to do each multiplication. Each element of the result matrix requires 50 threads. After the multiplications are done, I can use a binary reduction to sum the results.
I wasn't sure which way to take, so I took (b). It wasn't ideal; in fact it was slow. Any idea why? My guess would be that there are just too many threads and they are waiting for resources most of the time. Is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, and then do some logarithmic number of adds. That's three memory accesses for a multiply, an add, and a bit more: the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which needs only a tiny bit of memory/registers, which is good for occupancy; but the memory-access-to-work ratio is poor.
The simple one-thread-per-dot-product approach has the same sort of problem: each multiplication still requires two memory accesses to load its operands. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are far fewer threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there's N^3 worth of work in the multiplication but only 3*N^2 elements, so you should be able to find a way to do much more than one computation per two-ish memory accesses.
The approach taken in the CUDA SDK is the best way: the matrices are broken into tiles, and one thread per output element (your (a) approach) is used. But the key is in how the threads are arranged. By pulling entire small sub-matrices from slow global memory into shared memory and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful one in lots of applications, because getting data (whether over a network, from main memory for a CPU, or from off-chip memory for a GPU) often takes much longer than processing it.
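A stripped-down version of that tiled scheme looks roughly like this (a sketch assuming square row-major matrices whose dimension is a multiple of the tile size; the SDK sample handles the general case):

    #define TILE 16

    // C = A * B for N x N row-major matrices, with N a multiple of TILE.
    // Launch with dim3(TILE, TILE) threads and dim3(N/TILE, N/TILE) blocks.
    __global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Each thread stages one element of each tile into shared memory;
            // the whole block then reuses those TILE*TILE values TILE times each.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }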
There are documents on NVIDIA's CUDA pages (especially http://developer.nvidia.com/object/cuda_training.html) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: CUDA Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library: CUBLAS, MAGMA, etc. provide tuned matrix multiplication implementations.
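For the library route, a call via cuBLAS looks roughly like this (a sketch with error checking omitted; note that cuBLAS assumes column-major storage):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Computes C = alpha*A*B + beta*C for 50x50 column-major matrices
    // already resident in device memory (dA, dB, dC).
    void gemm50(const float* dA, const float* dB, float* dC) {
        const int n = 50;
        const float alpha = 1.0f, beta = 0.0f;
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
        cublasDestroy(handle);
    }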