Counting FLOPS/GFLOPS in a program - CUDA

I've already finished my application, which multiplies a CRS matrix by a vector (SpMV), and the only thing left to do is count the FLOPS it performed. In my opinion it's really hard to estimate the number of floating point operations in sparse matrix-vector multiplication, because the number of multiplies in each row is really "jumpy" - it fluctuates.
So far I've only measured time using "cudaprof" (available in the ./CUDA/bin directory), which works fine.
Any suggestions and pasted instructions are appreciated!

That's not just your opinion; it's a simple fact that the number of operations in the case of a sparse matrix is data-dependent, so you can't get a reasonable answer without knowing something about the data. That makes it impossible to have a one-number-fits-all-data estimate.
This is probably one of the sorts of situations where you could think hard about it for many hours (and do lots of research) to make a maybe-accurate estimate, or you could spend a few minutes writing a variant of your existing implementation that increments a counter each time it does an operation. Sure, that's going to take quite a while to run (especially if you don't do it in a CUDA-enabled form), but probably a lot less time than it would take to do the thinking, and when the answer comes out, you don't have to do a lot of work to convince yourself that it's right.
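For CRS SpMV in particular, one multiply and one add per stored nonzero means the count comes out to essentially 2 * nnz, but the counter approach generalizes to other kernels. A minimal CPU sketch of the counting variant (not your CUDA kernel - the structure and names are only illustrative):
#include <cstddef>
#include <vector>

struct CrsMatrix {
    std::vector<int>   row_ptr;  // length: rows + 1
    std::vector<int>   col_idx;  // length: nnz
    std::vector<float> values;   // length: nnz
};

// y = A * x; returns the number of floating point operations performed.
long long spmv_count_flops(const CrsMatrix& A, const std::vector<float>& x,
                           std::vector<float>& y) {
    long long flops = 0;
    const std::size_t rows = A.row_ptr.size() - 1;
    y.assign(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k) {
            sum += A.values[k] * x[A.col_idx[k]];
            flops += 2;  // one multiply + one add per stored nonzero
        }
        y[r] = sum;
    }
    return flops;  // equal to 2 * nnz for this formulation
}
Divide that count by the kernel time reported by cudaprof (in seconds) and by 1e9 to get GFLOPS.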

Related

gnuRadio Dual Tone detection

I am trying to come up with an efficient way to characterize two narrowband tones separated by about 900 kHz (one at around 100 kHz and one at around 1 MHz once translated to baseband). They don't move much in frequency over time but may have amplitude variations we want to monitor.
Each tone is roughly 100 Hz wide and we are required to characterize these two beasts over long periods of time down to a resolution of about 0.1 Hz. The samples are coming in at over 2 MSamples/sec (TBD) to adequately acquire the higher tone.
I'm trying to avoid (if possible) doing brute force >2 MSample FFTs on the data once a second to extract frequency domain data. Is there a more efficient approach? Something akin to performing two (much) smaller FFTs around the bands of interest? I've looked at Goertzel and chirp-z methods but I am not certain they help save processing.
Something akin to performing two (much) smaller FFTs around the bands of interest
There is - it's called Goertzel, it's kind of an FFT for single bins, and you've already looked at it. It will save you CPU time.
Anyway, there's no reason to do a 2M-point FFT; first of all, you only want a resolution of about 1/20 of the sampling rate, hence a 20-point FFT would totally do, and that should be pretty doable for your CPU at these low rates. Since you don't seem to care about the phase of your tones, FFT -> complex_to_mag.
However, there's one thing that you should always do: look at your signal of interest, and decimate down to the rate that fits exactly that. Since GNU Radio's filters are implemented cleverly, the filter itself will only run at the decimated rate, and you can spend the CPU cycles saved on a better filter.
Because a direct decimation from 2 MHz to 100 Hz (decimation: 20000) would need a really ugly filter length, you should do this in multiple stages (multirate):
I'd try first decimating by 100, and then in a second step by 100 again, leaving you with 200 Hz of observable spectrum. The Frequency Xlating FIR Filter blocks will let you use a simple low-pass filter (use the "Low-Pass Filter Taps" block to define a variable that contains such taps) as a band selector.
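Just to make the Goertzel idea concrete, here is a minimal single-bin sketch in plain C++ (not GNU Radio code; sample_rate and target_freq are placeholders you would fill in):
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the squared magnitude of the DFT bin nearest to target_freq.
double goertzel_power(const std::vector<float>& samples,
                      double sample_rate, double target_freq) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = samples.size();
    const double k = std::floor(0.5 + n * target_freq / sample_rate);
    const double w = 2.0 * pi * k / n;
    const double coeff = 2.0 * std::cos(w);
    double s_prev = 0.0, s_prev2 = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double s = samples[i] + coeff * s_prev - s_prev2;
        s_prev2 = s_prev;
        s_prev = s;
    }
    // Bin power; phase is discarded, matching the FFT -> complex_to_mag idea.
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2;
}
Running one of these per tone on the decimated streams is usually much cheaper than a full FFT when you only care about a couple of bins.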

speed up ideas -- can CUDA help here?

I'm working on an algorithm that has to do a small number
of operations on a large number of small arrays, somewhat independently.
To give an idea:
~1k sorts of arrays, each typically 0.5k-1k elements long.
~1k LU solves of matrices of rank 10-20.
Everything is in floats.
Then, there is some horizontality to this problem: the above
operations have to be carried out independently on 10k arrays.
Also, the intermediate results need not be stored: for example, I don't
need to keep the sorted arrays, only the sum of the smallest m elements.
The whole thing has been programmed in c++ and runs. My question is:
would you expect a problem like this to enjoy significant speed ups
(factor 2 or more) with CUDA?
You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X over the CPU and ~4X over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run the free version of ArrayFire.
array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently
array x = randu(15,15,1000,f32), y;
gfor (array i, x.dim(2))
y(span,span,i) = lu(x(span,span,i)); // LU-decomposition of each 15x15 matrix
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.
If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.
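To illustrate the kind of SSE rewrite meant above, here is a minimal 4-wide float sketch (not the poster's code; the function and its arguments are made up for illustration):
#include <xmmintrin.h>  // SSE intrinsics

// out[i] = a[i] * b[i] + c[i], processed 4 floats at a time.
// Assumes n is a multiple of 4 and all pointers are 16-byte aligned.
void mul_add4(const float* a, const float* b, const float* c, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        const __m128 va = _mm_load_ps(a + i);
        const __m128 vb = _mm_load_ps(b + i);
        const __m128 vc = _mm_load_ps(c + i);
        _mm_store_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
}
In practice you would usually let the compiler auto-vectorise first and only drop to intrinsics for the hot loops the profiler points at.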
Just a few pointers, you maybe already incorporated:
1) If you just need the m smallest elements, you are probably better off just searching for the smallest element, removing it, and repeating m times (a selection-based alternative is sketched after this answer).
2) Did you already parallelize the code on the CPU? OpenMP or so...
3) Did you think about buying better hardware? (I know it's not the nice thing to do, but if you want to reach performance goals for a specific application it's sometimes the cheapest option...)
If you want to do it on CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the Thrust library for the sorting part; hopefully someone else can suggest a good LU-decomposition algorithm.
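On point 1: rather than repeatedly extracting the minimum, a partial selection with std::nth_element gets the sum of the smallest m elements in roughly linear time on average. A minimal sketch (the function name is made up, and the vector is taken by value so the caller's data is left untouched):
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

float sum_smallest_m(std::vector<float> v, std::size_t m) {
    // Partition so the m smallest elements end up in v[0..m); O(n) on average.
    std::nth_element(v.begin(), v.begin() + m, v.end());
    return std::accumulate(v.begin(), v.begin() + m, 0.0f);
}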

Does using binary numbers in code improve performance?

I've seen quite a few examples where binary numbers are used in code, like 32, 64, 128 and so on (Minecraft being a very well known example).
I want to ask: does using binary numbers in high level languages such as Java or C++ help anything?
I know assembly, and there you would always rather use these, because in a low level language things get overcomplicated if you go above the register limit.
Will programs run any faster or use less memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficient for memory purposes (typically because it's so fast to do multiply and divide logic with powers of two, the OS/hardware/etc will use cache line / page sizes / etc that are powers of two, so you'd do well to have nice power-of-two sizes for your important data structures).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
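As a minimal sketch of the bitmask idiom described above (the flag names here are invented for illustration):
#include <cstdint>
#include <cstdio>

enum WindowFlags : std::uint32_t {
    RESIZABLE = 1u << 0,
    REMOVABLE = 1u << 1,
    EDITABLE  = 1u << 2,
};

int main() {
    std::uint32_t flags = RESIZABLE | EDITABLE;    // set two options at once
    if (flags & EDITABLE) std::puts("editable");   // test a single option
    flags &= ~RESIZABLE;                           // clear an option
    return 0;
}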
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean "use constants that are round numbers in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
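A tiny illustration of that strength reduction (with optimisation enabled, most compilers will emit a shift for the power-of-two case and a real multiply, or address-arithmetic tricks, for the other):
unsigned times8(unsigned x)  { return x * 8;  }  // typically compiled as x << 3
unsigned times10(unsigned x) { return x * 10; }  // needs a multiply (or lea tricks)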
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunk of your data to the hard drive and then read it back when it needs it. One extra hard drive write takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happening at the low level, so it would often be hard to predict whether a change, with its attendant ramifications, would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but memory is often allocated in power-of-two-sized chunks. E.g. if you allocate a 100-byte char buffer, the allocator may actually reserve 128 bytes.
No, your code will run the same way no matter what numbers you use.
If by binary numbers you mean numbers that are powers of 2, like 2, 4, 8, 16, 1024..., they are normally common for reasons of space. For example, an 8-bit pointer can address 256 locations (a power of 2), so if you use fewer than 256 of them you are wasting part of the pointer's range; that's why you would normally allocate a 256-entry buffer. The same reasoning applies to the other powers of 2.
In most cases the answer is no: there is no noticeable performance difference.
However, there are certain (very few) cases when NOT using power-of-two numbers for array/structure sizes/lengths will give noticeable performance benefits. These are cases where you're looping over a structure that fills the cache in such a way that you get cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be pre-optimized for unless your code is performing much more slowly than the theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

comparison and addition of two integers: in detail

I would like to know how many machine cycles it takes to compare two integers, how many it takes to add them, and which one is easier.
Basically I'm looking to see which one is generally more expensive.
Also, I need an answer from a C, C++, and Java perspective...
Help is appreciated, thanks!
The answer is yes. And no. And maybe.
There are machines that can compare two values in their spare time between cycles, and others that need several cycles. On the old PDP8 you first had to negate one operand, do an add, and then test the result to do a compare.
But other machines can compare much faster than add, because no register needs to be modified.
But on still other machines the operations take the same time, but it takes several cycles for the result of the compare to make it to a place where one can test it, so, if you can use those cycles the compare is almost free, but fairly expensive if you have no other operations to shove into those cycles.
The simple answer is one cycle, both operations are equally easy.
A totally generic answer is difficult to give, since processor architectures are amazingly complex when you get down into the details.
All modern processors are pipelined. That is, there are no instructions where the operands go in on cycle c, and the result is available on cycle c+1. Instead, the instruction is broken down into multiple steps.
The instructions are read into the front end of the processor, which decodes the instruction. This may include breaking it down into multiple micro-ops. The operands are then read into registers, and then the execution units handle the actual operation. Eventually the answer is returned back to a register.
The instructions go through one pipeline stage each cycle, and modern CPUs have 10-20 pipeline stages. So it could be up to 20 processor cycles to add or compare two numbers. However, once one instruction has been through one stage of the pipeline, another instruction can be read into that stage. The ideal is that each clock cycle, one instruction goes into the front end while one set of results comes out the other end.
There is massive complexity involved in getting all this to work. If you want to do a+b+c, you need to add a+b before you can add c. So a lot of the work in the front end of the processor involves scheduling. Modern processors employ out-of-order execution, so that the processor will examine the incoming instructions, and re-order them such that it does a+b, then gets on with some other work, and then does result+c once the result is available.
Which all brings us back to the original question of which is easier. Because usually, if you're comparing two integers, it is to make a decision on what to do next. Which means you won't know your next instruction until you've got the result of the last one. Because the instructions are pipelined, this means you can lose 20 clock cycles of work if you wait.
So modern CPUs have a branch predictor which makes a guess at what the result will be and continues executing the instructions. If it guesses wrong, the pipeline has to be thrown out, and work restarted on the other branch. The branch predictor helps enormously, but still, if the comparison is a decision point in the code, it is far more difficult for the CPU to deal with than the addition.
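A small, hedged way to see that effect yourself: time the same comparison-heavy loop on data the predictor can guess (sorted) versus data it can't (random). The numbers vary by machine; the sketch below only illustrates the idea.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sums only the values >= 128, so each element triggers a data-dependent branch.
long long branchy_sum(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128) sum += x;
    return sum;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int& x : data) x = static_cast<int>(rng() & 255u);

    auto time_ms = [&]() {
        const auto t0 = std::chrono::steady_clock::now();
        volatile long long s = branchy_sum(data);
        (void)s;
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    std::printf("unsorted (unpredictable branches): %.1f ms\n", time_ms());
    std::sort(data.begin(), data.end());
    std::printf("sorted   (predictable branches):   %.1f ms\n", time_ms());
    return 0;
}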
Comparison is done via subtraction, which is almost the same as addition, except that the carry and subtrahend are complemented, so a - b - c becomes a + ~b + ~c. This is already accounted for in the CPU and basically takes the same amount of time either way.

Benchmarking: When can I stop making measurements?

I have a series of functions that are all designed to do the same thing. The same inputs produce the same outputs, but the time that it takes to do them varies by function. I want to determine which one is 'fastest', and I want to have some confidence that my measurement is 'statistically significant'.
Perusing Wikipedia and the interwebs tells me that statistical significance means that a measurement or group of measurements is different from a null hypothesis by a p-value threshold. How would that apply here? What is the null hypothesis between function A being faster than function B?
Once I've got that whole setup defined, how do I figure out when to stop measuring? I'll typically see that a benchmark is run three times, and then the average is reported; why three times and not five or seven? According to this page on Statistical Significance (which I freely admit I do not understand fully), Fisher used 8 as the number of samples that he needed to measure something with 98% confidence; why 8?
I would not bother applying statistics principles to benchmarking results. In general, the term "statistical significance" refers to the likelihood that your results were achieved accidentally, and do not represent an accurate assessment of the true values. In statistics, as a result of simple probability, the likelihood of a result being achieved by chance decreases as the number of measurements increases. In the benchmarking of computer code, it is a trivial matter to increase the number of trials (the "n" in statistics) so that the likelihood of an accidental result is below any arbitrary threshold you care to define (the "alpha" or level of statistical significance).
To simplify: benchmark by running your code a huge number of times, and don't worry about statistical measurements.
Note to potential down-voters of this answer: this answer is somewhat of a simplification of the matter, designed to illustrate the concepts in an accessible way. Comments like "you clearly don't understand statistics" will result in a savage beat-down. Remember to be polite.
You are asking two questions:
How do you perform a test of statistical significance that the mean time of function A is greater than the mean time of function B?
If you want a certain confidence in your answer, how many samples should you take?
The most common answer to the first question is that you either want to compute a confidence interval or perform a t-test. It's no different from any other scientific experiment with random variation. To compute the 95% confidence interval of the mean response time for function A, simply take the mean and add 1.96 times the standard error to either side. The standard error is the square root of the variance divided by N. That is,
95% CI = mean +/- 1.96 * sqrt(sigma^2 / N)
where sigma^2 is the variance of speed for function A and N is the number of runs you used to calculate the mean and variance.
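As a minimal sketch, assuming the timing samples for one function are already collected in a vector (the names are illustrative):
#include <cmath>
#include <vector>

struct Interval { double lo, hi; };

// 95% confidence interval for the mean of the given timing samples.
Interval ci95(const std::vector<double>& samples) {
    const double n = static_cast<double>(samples.size());
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= n;
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= (n - 1.0);                               // sample variance
    const double half = 1.96 * std::sqrt(var / n);  // 1.96 * standard error
    return { mean - half, mean + half };
}
If the intervals for functions A and B don't overlap, you can be fairly confident the difference is real; with the large N that's easy to get in benchmarking, they usually won't.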
Your second question relates to statistical power analysis and the design of experiments. You describe a sequential setup where you are asking whether to continue sampling. The design of sequential experiments is actually a very tricky problem in statistics, since in general you are not allowed to calculate confidence intervals or p-values and then draw additional samples conditional on not reaching your desired significance. If you wish to do this, it would be wiser to set up a Bayesian model and calculate your posterior probability that speed A is greater than speed B. This, however, is massive overkill.
In a computing environment it is generally pretty trivial to achieve a very small confidence interval both because drawing large N is easy and because the variance is generally small -- one function obviously wins.
Given that Wikipedia and most online sources are still horrible when it comes to statistics, I recommend buying Introductory Statistics with R. You will learn both the statistics and the tools to apply what you learn.
The research you cite sounds like it comes from a highly controlled environment. This is a purely practical answer that has proven itself time and again to be effective for performance testing.
If you are benchmarking code in a modern, multi-tasking, multi-core, computing environment, the number of iterations required to achieve a useful benchmark goes up as the length of time of the operation to be measured goes down.
So, if you have an operation that takes ~5 seconds, you'll want, typically, 10 to 20 iterations. As long as the deviation across the iterations remains fairly constant, then your data is sound enough to draw conclusions. You'll often want to throw out the first iteration or two because the system is typically warming up caches, etc...
If you are testing something in the millisecond range, you'll want 10s of thousands of iterations. This will eliminate noise caused by other processes, etc, firing up.
Once you hit the sub-millisecond range -- 10s of nanoseconds -- you'll want millions of iterations.
Not exactly scientific, but neither is testing "in the real world" on a modern computing system.
When comparing the results, consider the difference in execution speed as percentage, not absolute. Anything less than about 5% difference is pretty close to noise.
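A minimal harness along those lines, assuming some operation_under_test() exists (the name and iteration counts are placeholders):
#include <chrono>

void operation_under_test();  // assumed to be defined elsewhere

// Returns the mean time per iteration in milliseconds.
double benchmark_ms(int warmup_iters, int timed_iters) {
    for (int i = 0; i < warmup_iters; ++i)
        operation_under_test();               // discard warm-up iterations
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_iters; ++i)
        operation_under_test();
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> total = t1 - t0;
    return total.count() / timed_iters;
}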
Do you really care about statistical significance or plain old significance? Ultimately you're likely to have to form a judgement about readability vs performance - and statistical significance isn't really going to help you there.
A couple of rules of thumb I use:
Where possible, test for enough time to make you confident that little blips (like something else interrupting your test for a short time) won't make much difference. Usually I reckon 30 seconds is enough for this, although it depends on your app. The longer you test for, the more reliable the test will be - but obviously your results will be delayed :)
Running a test multiple times can be useful, but if you're timing for long enough then it's not as important IMO. It would alleviate other forms of error which made a whole test take longer than it should. If a test result looks suspicious, certainly run it again. If you see significantly different results for different runs, run it several more times and try to spot a pattern.
The fundamental question you're trying to answer is: how likely is it that what you observe could have happened by chance? Is this coin fair? Throw it once: HEADS. No, it's not fair, it always comes down heads. Bad conclusion! Throw it 10 times and get 7 heads - now what do you conclude? 1000 times and 700 heads?
For simple cases we can imagine how to figure out when to stop testing. But you have a slightly different situation - are you really doing a statistical analysis?
How much control do you have over your tests? Does repeating them add any value? Your computer is deterministic (maybe). Einstein's definition of insanity is to repeat something and expect a different outcome. So when you run your tests, do you get repeatable answers? I'm not sure that statistical analyses help if you are doing good enough tests.
For what you're doing I would say that the first key thing is to make sure that you really are measuring what you think. Run every test for long enough that any startup or shutdown effects are hidden. Useful performance tests tend to run for quite extended periods for that reason. Make sure that you are not actually measuring the time spent in your test harness rather than the time in your code.
You have two primary variables: how many iterations of your method to run in one test? How many tests to run?
Wikipedia says this:
"In addition to expressing the variability of a population, standard deviation is commonly used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times. The reported margin of error is typically about twice the standard deviation."
Hence, if your objective is to be sure that one function is faster than another, you could run a number of tests of each and compute the means and standard deviations. My expectation is that if your number of iterations within any one test is high, then the standard deviation is going to be low.
If we accept that definition of margin of error, you can see whether the two means are further apart than their total margins of error.