Benchmarking: When can I stop making measurements? - language-agnostic

I have a series of functions that are all designed to do the same thing. The same inputs produce the same outputs, but the time that it takes to do them varies by function. I want to determine which one is 'fastest', and I want to have some confidence that my measurement is 'statistically significant'.
Perusing Wikipedia and the interwebs tells me that statistical significance means that a measurement or group of measurements differs from a null hypothesis by a p-value threshold. How would that apply here? What would the null hypothesis be when comparing function A and function B?
Once I've got that whole setup defined, how do I figure out when to stop measuring? I'll typically see that a benchmark is run three times, and then the average is reported; why three times and not five or seven? According to this page on Statistical Significance (which I freely admit I do not understand fully), Fisher used 8 as the number of samples that he needed to measure something with 98% confidence; why 8?

I would not bother applying statistics principles to benchmarking results. In general, the term "statistical significance" refers to the likelihood that your results were achieved accidentally, and do not represent an accurate assessment of the true values. In statistics, as a result of simple probability, the likelihood of a result being achieved by chance decreases as the number of measurements increases. In the benchmarking of computer code, it is a trivial matter to increase the number of trials (the "n" in statistics) so that the likelihood of an accidental result is below any arbitrary threshold you care to define (the "alpha" or level of statistical significance).
To simplify: benchmark by running your code a huge number of times, and don't worry about statistical measurements.
Note to potential down-voters of this answer: this answer is somewhat of a simplification of the matter, designed to illustrate the concepts in an accessible way. Comments like "you clearly don't understand statistics" will result in a savage beat-down. Remember to be polite.

You are asking two questions:
How do you perform a test of statistical significance that the mean time of function A is greater than the mean time of function B?
If you want a certain confidence in your answer, how many samples should you take?
The most common answer to the first question is that you either want to compute a confidence interval or perform a t-test. It's no different from any other scientific experiment with random variation. To compute the 95% confidence interval of the mean response time for function A, simply take the mean and add 1.96 times the standard error to either side. The standard error is the square root of the variance divided by N. That is,
95% CI = mean +/- 1.96 * sqrt(sigma^2 / N)
where sigma^2 is the variance of speed for function A and N is the number of runs you used to calculate the mean and variance.
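As a concrete illustration, here is a minimal Python sketch of that calculation; the helper names and the run count of 1000 are my own choices, not part of the answer.
import time
import statistics

def confidence_interval_95(samples):
    # mean +/- 1.96 * sqrt(variance / N), as in the formula above
    mean = statistics.mean(samples)
    stderr = (statistics.variance(samples) / len(samples)) ** 0.5
    return mean - 1.96 * stderr, mean + 1.96 * stderr

def time_function(fn, n_runs=1000):
    # time each call with a monotonic clock; fn and n_runs are placeholders
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), confidence_interval_95(samples)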
Your second question relates to statistical power analysis and the design of experiments. You describe a sequential setup where you are asking whether to continue sampling. The design of sequential experiments is actually a very tricky problem in statistics, since in general you are not allowed to calculate confidence intervals or p-values and then draw additional samples conditional on not reaching your desired significance. If you wish to do this, it would be wiser to set up a Bayesian model and calculate your posterior probability that speed A is greater than speed B. This, however, is massive overkill.
In a computing environment it is generally pretty trivial to achieve a very small confidence interval both because drawing large N is easy and because the variance is generally small -- one function obviously wins.
Given that Wikipedia and most online sources are still horrible when it comes to statistics, I recommend buying Introductory Statistics with R. You will learn both the statistics and the tools to apply what you learn.

The research you cite sounds more like a highly controlled environment. This is purely a practical answer that has proven itself time and again to be effective for performance testing.
If you are benchmarking code in a modern, multi-tasking, multi-core, computing environment, the number of iterations required to achieve a useful benchmark goes up as the length of time of the operation to be measured goes down.
So, if you have an operation that takes ~5 seconds, you'll want, typically, 10 to 20 iterations. As long as the deviation across the iterations remains fairly constant, then your data is sound enough to draw conclusions. You'll often want to throw out the first iteration or two because the system is typically warming up caches, etc...
If you are testing something in the millisecond range, you'll want 10s of thousands of iterations. This will eliminate noise caused by other processes, etc, firing up.
Once you hit the sub-millisecond range -- down to 10s of nanoseconds -- you'll want millions of iterations.
Not exactly scientific, but neither is testing "in the real world" on a modern computing system.
When comparing the results, consider the difference in execution speed as a percentage, not an absolute. Anything less than about a 5% difference is pretty close to noise.
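For illustration only, a rough Python harness along these lines; the iteration and warm-up counts and the helper names are my own assumptions, and the 5% rule of thumb is the one quoted above.
import time

def time_once(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def run_benchmark(fn, iterations=20, warmup=2):
    for _ in range(warmup):                    # throw away warm-up runs (cold caches, etc.)
        fn()
    return [time_once(fn) for _ in range(iterations)]

def percent_difference(times_a, times_b):
    mean_a = sum(times_a) / len(times_a)
    mean_b = sum(times_b) / len(times_b)
    return 100.0 * abs(mean_a - mean_b) / min(mean_a, mean_b)
# Rule of thumb from above: ~10-20 iterations for multi-second operations, tens of
# thousands for millisecond-range work, millions below that; treat < ~5% as noise.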

Do you really care about statistical significance or plain old significance? Ultimately you're likely to have to form a judgement about readability vs performance - and statistical significance isn't really going to help you there.
A couple of rules of thumb I use:
Where possible, test for enough time to make you confident that little blips (like something else interrupting your test for a short time) won't make much difference. Usually I reckon 30 seconds is enough for this, although it depends on your app. The longer you test for, the more reliable the test will be - but obviously your results will be delayed :)
Running a test multiple times can be useful, but if you're timing for long enough then it's not as important IMO. It would alleviate other forms of error which made a whole test take longer than it should. If a test result looks suspicious, certainly run it again. If you see significantly different results for different runs, run it several more times and try to spot a pattern.

The fundamental question you're trying to answer is: how likely is it that what you observe could have happened by chance? Is this coin fair? Throw it once: HEADS. No, it's not fair, it always comes down heads. Bad conclusion! Throw it 10 times and get 7 heads, now what do you conclude? 1000 times and 700 heads?
For simple cases we can imagine how to figure out when to stop testing. But you have a slightly different situation - are you really doing a statistical analysis?
How much control do you have over your tests? Does repeating them add any value? Your computer is deterministic (maybe). Einstein's definition of insanity is to repeat something and expect a different outcome. So when you run your tests, do you get repeatable answers? I'm not sure that statistical analyses help if you are doing good enough tests.
For what you're doing I would say that the first key thing is to make sure that you really are measuring what you think. Run every test for long enough that any startup or shutdown effects are hidden. Useful performance tests tend to run for quite extended periods for that reason. Make sure that you are not actually measuring the time spent in your test harness rather than the time spent in your code.
You have two primary variables: how many iterations of your method to run in one test? How many tests to run?
Wikipedia says this:
"In addition to expressing the variability of a population, standard deviation is commonly used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times. The reported margin of error is typically about twice the standard deviation."
Hence if your objective is to be sure that one function is faster than another you could run a number of tests of each, compute the means and standard deviations. My expectation is that if your number of iterations within any one test is high then the standard deviation is going to be low.
If we accept that definition of margin of error, you can see whether the two means are further apart than their combined margins of error.
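A small Python sketch of that comparison, assuming each function has been run as several independent tests; the helper names are mine, and the factor of two is the one from the quote above.
import statistics

def summarize(test_means):
    # test_means: the mean time from each independent test of one function
    mean = statistics.mean(test_means)
    margin = 2 * statistics.stdev(test_means)   # ~ margin of error, per the quote above
    return mean, margin

def clearly_different(test_means_a, test_means_b):
    mean_a, err_a = summarize(test_means_a)
    mean_b, err_b = summarize(test_means_b)
    return abs(mean_a - mean_b) > (err_a + err_b)   # further apart than combined margins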

Related

Simulating a matrix of variables with predefined correlation structure

For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, most binary and some continuous, that are not associated with the outcome (that is, are completely independent from the binary dependent variable, but that can still be correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of their correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test your algorithm to see how it will perform in reality then your tests should produce such phenomena in a manner similar to the real world -- you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors, and you're targeting a risk level of α = 0.05, you can expect about 50 of the unrelated factors to show up as significant purely by chance. To avoid this, you need to adjust your testing threshold using something along the lines of a Bonferroni correction. Recall that statistical discriminating power is based on standard error, which is inversely proportional to the square root of the sample size. Bonferroni says that 1000 simultaneous tests need their individual test threshold adjusted by a factor of 1000, which in turn means you need a substantially larger sample than when performing a single test for significance.
So in summary I'd say that you shouldn't attempt to ensure a lack of correlation; it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating massive amounts of data. In practice there will be non-predictors that leak through unless you can obtain enough data, so I'd suggest that your testing should address the rates of occurrence as a function of the number of candidate factors and the sample size.
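To make the point concrete, here is a hedged simulation sketch in Python; the sample size, the seed, and the use of a plain correlation test are my own choices for illustration, not part of the answer above.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_factors, alpha = 500, 1000, 0.05
outcome = rng.integers(0, 2, n_samples)               # binary outcome
factors = rng.integers(0, 2, (n_factors, n_samples))  # candidate factors, all independent of it
p_values = []
for f in factors:
    r, p = pearsonr(f, outcome)                       # association of an unrelated factor with the outcome
    p_values.append(p)
print("apparently significant at alpha:   ", sum(p < alpha for p in p_values))
print("still significant after Bonferroni:", sum(p < alpha / n_factors for p in p_values))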

Most efficient computational method to numerically minimize a 8 variables constrained system

I've been working for quite some time on finding a numerical instance that solves a system of 7 very complicated inequalities in 8 variables, plus a region specification. Unfortunately I cannot produce an MWE or anything of the sort, since the inputs are really long.
My current method is Mathematica's NMinimize routine, minimizing one of the 7 inequalities subject to every other condition as a constraint -- the FindInstance command simply quits the kernel without being able to finish running.
NMinimize is able to produce output, but besides being slower than would be optimal, it produces results that do not obey every constraint.
The thing is that I need to be certain, for each benchmark I run, that if the output doesn't satisfy every constraint it is because such a set of real numbers doesn't exist -- and from experience, with my current method I can't be.
So: is there a foolproof, as efficient as possible, computational method for me to find a single instance of numerical solution to 7 complicated inequalities (involving trigonometric functions) of 8 variables or be sure that such a set doesn't exist?
It could be a Mathematica/python/fortran package, genetic algorithm or anything -- as long as there is clear enough documentation.
You need to give the constraints an importance multiplier, and the optimization method should not be greedy.
A genetic algorithm combined with multiple starting points (or simulated annealing, for diminishing mutations) tends to converge to the global minimum (hence not greedy) the more time you give it, but there is no guarantee that the heuristic will complete function X in time Y. The more time given to it, the better it converges to the global minimum.
In a genetic algorithm, you can add big constraint penalties like this:
fitness_minima = some_function_output_between_1_and_10 + (constraints_breached ? 1000.0f : 0);
so that the DNAs with no constraint violations will be favored in the crossover part of the GA.
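The same penalty idea as a minimal Python sketch; objective and constraints are placeholders standing in for your inequalities, and the penalty weight is illustrative.
def fitness(candidate, objective, constraints, penalty=1000.0):
    # objective: the quantity being minimized; constraints: callables returning True when satisfied
    value = objective(candidate)
    violations = sum(1 for c in constraints if not c(candidate))
    return value + penalty * violations   # lower is better; violators are heavily penalized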
"As efficient as possible" depends on your algorithm. If you can parallelize the algorithm and run it on multiple GPUs, it should give substantial speedup over CPU. Compared to some hours of Mona-Lisa painting by CPU, a parallelized version running on 3 low-end GPUs complete within 10 minutes (https://www.youtube.com/watch?v=QRZqBLJ6brQ). At least some OpenCL/CUDA supporting libraries/frameworks (like Tensorflow) should be able to accelerate your algorithm if you don't want to do the work distribution yourself.

Is standard deviation (STDDEV) the right function for the job?

We wrote a monitoring system. This monitor is made up of agents. Each agent runs on a different server and monitors that specific server's resources (RAM, CPU, SQL Server Status, Replication Status, Free Disk Space, Internet Access, specific business metrics, etc.).
The agents report every measure they take to a central database where these "observations" are stored.
For example, every few seconds an agent would store in the central database a specific business metric called "unprocessed_files" with its corresponding value:
(unprocessed_files, 41)
That value is constantly being written to our DB (among many others, as explained above).
We are now implementing a client application, a screen that displays the status of everything we monitor. So, how can we calculate what's a "normal" value and what's a wrong value?
For example, we know that if our servers are working correctly, unprocessed_files would always be close to 0, but maybe (we don't know yet) 45 is an acceptable value.
So the question is, should we use the Standard Deviation in order to know what the acceptable range of values is?
ACCEPTABLE_RANGE = AVG(value) +- STDDEV(value) ?
We would like to notify with a red color when something is not going well.
For your backlog (unprocessed file) metric, using a standard deviation to know when to sound an alarm (turn something red) is going to drive you crazy with false alarms.
Why? Most of the time your backlog will be zero. So, the standard deviation will also be very close to zero. Standard deviation tells you how much your metric varies. Therefore, whenever you get a nonzero backlog, it will be outside the avg + stdev range.
For a backlog, you may want to turn stuff yellow when the value is > 1 and red when the value is > 10.
If you have a "how long did it take" metric, standard deviation might be a valid way to identify alarm conditions. For example, you might have a web request that usually takes about half a second, but typically varies from 0.25 to 0.8 second. If they suddenly start taking 2.5 seconds, then you know something has gone wrong.
Standard deviation is a measurement that makes most sense for a normal distribution (bell curve distribution). When you handle your measurements as if they fit a bell curve, you're implicitly making the assumption that each measurement is entirely independent of the others. That assumption works poorly for typical metrics of a computing system (backlog, transaction time, load average, etc). So, using stdev is OK, but not great. You'll probably struggle to make sense of stdev numbers: that's because they don't actually make much sense.
You'd be better off, as #duffymo suggested, looking at the 95th percentile (the worst-performing operations). But MySQL doesn't compute those kinds of distributions natively. PostgreSQL does. So does Oracle Standard Edition and higher.
How do you determine an out-of-bounds metric? It depends on the metric, and on what you're trying to do. If it's a backlog measurement, and it grows from minute to minute, you have a problem to investigate. If it's a transaction time, and it's far longer than average (avg + 3 x stdev, for example), you have a problem. The open source monitoring system Nagios has worked this out for various kinds of metrics.
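For the "how long did it take" case, a minimal Python sketch of the avg + 3 x stdev idea; the thresholds and color names are illustrative assumptions, not recommendations from the answer.
import statistics

def alarm_state(history, latest, stdev_factor=3.0):
    # history: recent readings taken while things were known to be OK
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if latest > mean + stdev_factor * stdev:
        return "red"
    if latest > mean + stdev:
        return "yellow"
    return "green"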
Read a book by N. N. Taleb called "The Black Swan" if you want to know how assuming the real world fits normal distributions can crash the global economy.
Standard deviation is just a way of characterizing how much a set of values spreads away from its average (i.e. mean). In a sense, it's an "average deviation from average", though a little more complicated than that. It is true that values which differ from the mean by many times the standard deviation tend to be rare, but that doesn't mean the standard deviation is a good benchmark for identifying anomalous values that might indicate something is wrong.
For one thing, if you set your acceptable range at the average plus or minus one standard deviation, you're probably going to get very frequent results outside that range! You could use the average plus or minus two standard deviations, or three, or however many you want to push the number of notifications/error conditions as low as you like, but there's no telling whether any of this actually helps you identify error conditions.
I think your main problem is not statistics. Your problem is that you don't know what kinds of results actually indicate an error. So before you program in any acceptable range, just let the system run for a while and collect some calibration data showing what kinds of values you see when it's running normally, and what kinds of values you see when it's not running normally. Make sure you have some way to tell which are which. Once you have a good amount of data for both conditions, you can analyze it (start with a simple histogram) and see what kinds of values are characteristic of normal operation and what kinds are characteristic of error conditions. Then you can set your acceptable range based on that.
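A tiny sketch of that calibration approach, assuming you already have readings collected during known-good operation; the 99th-percentile cutoff is an arbitrary example, not a recommendation.
import numpy as np

def build_threshold(normal_values, percentile=99):
    # normal_values: readings collected while the system was known to be healthy
    return np.percentile(normal_values, percentile)

def looks_abnormal(value, threshold):
    return value > threshold
# A first look at the calibration data: counts, edges = np.histogram(normal_values, bins=50)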
If you want to get fancy, there is a statistical technique called likelihood ratio testing that can help you evaluate just how likely it is that your system is working properly. But I think it's probably overkill. Monitoring systems don't need to be super-precise about this stuff; just show a cautionary notice whenever the readings start to seem abnormal.

Counting FLOPS/GFLOPS in program - CUDA

I've already finished my application, which multiplies a CRS matrix by a vector (SpMV), and the only thing left to do now is count the FLOPS my application did. In my opinion it's really hard to estimate the number of floating point operations in the case of sparse matrix-vector multiplication, because the number of multiplies per row jumps around so much.
I only tried to measure time using "cudaprof" (available in the ./CUDA/bin directory) - it works fine.
Any suggestions and pasted examples are appreciated!
That's not just your opinion; it's a simple fact that the number of operations in the case of a sparse matrix is data-dependent, and so you can't get a reasonable answer without knowing something about the data. That makes it impossible to have a one-number-fits-all-data estimate.
This is probably one of the sorts of situations where you could think hard about it for many hours (and do lots of research) to make a maybe-accurate estimate, or you could spend a few minutes writing a variant of your existing implementation that increments a counter each time it does an operation. Sure, that's going to take quite a while to run (especially if you don't do it in a CUDA-enabled form), but probably a lot less time than it would take to do the thinking, and when the answer comes out, you don't have to do a lot of work to convince yourself that it's right.
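A counter-instrumented CPU reference for CRS (CSR) SpMV might look like this Python sketch (names are mine); note that for this particular kernel the count also falls out analytically as one multiply plus one add per stored nonzero, i.e. roughly 2 * nnz floating point operations.
def spmv_with_flop_count(row_ptr, col_idx, values, x):
    # plain-Python CRS/CSR SpMV with an explicit floating point operation counter
    y = [0.0] * (len(row_ptr) - 1)
    flops = 0
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # one multiply + one add
            flops += 2
    return y, flops
# GFLOPS = flops / (measured_kernel_seconds * 1e9), using the kernel time from cudaprof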

Use of LOC to determine project size

How many lines of code (LOC) does it take to be considered a large project? How about for just one person writing it?
I know this metric is questionable, but there is a significant difference, for a single developer, between 1k and 10k LOC. I typically use whitespace for readability, especially for SQL statements, and I try to reduce the amount of LOC for maintenance purposes and to follow as many best practices as I can.
For example, I created a unified diff of the code I modified today, and it was over 1k LOC (including comments and blank lines). Is "modified LOC" a better metric? I have ~2k LOC, so it's surprising I modified 1k. I guess rewriting counts as both a deletion and an addition, which doubles the stats.
A slightly less useless metric - time of compilation.
If your project takes more than... say, 30 minutes to compile, it's large :)
Using Steve Yegge as the benchmark at the upper range of the scale, let's say that 500k lines of code is (over?) the maximum a single developer can maintain.
More seriously though; I think once you hit 100k LOC you are probably going to want to start looking for re-factorings before extensions to the code.
Note however that one way around this limit is obviously to compartmentalise the code more. If the sum-total of all code consists of two or three large libraries and an application, then combined this may well be more than you could maintain as a single code-base, but as long as each library is nicely self-contained you aren't going to exceed the capacity to understand each part of the solution.
Maybe another measurement for this would be the COCOMO measure - even though it is probably as useless as LOC.
A single developer could only do organic projects - "small" teams with "good" experience working with "less than rigid" requirements.
In this case, the effort applied, in man-months, is calculated as
2.4 * (kLOC)^1.05
That said, 1 kLOC would need about 2.4 man-months. You can use several factors to refine that, based on product, hardware, personnel, and project attributes.
But all we have done now is project LOC onto a time measurement. Here you again have to decide whether a 2-month or a 20-month project is considered large.
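For reference, the quoted organic-mode formula as a small Python helper; the example figures are just the formula evaluated at a few sizes.
def cocomo_organic_effort(kloc):
    # effort in man-months for an organic-mode project, per the formula above
    return 2.4 * kloc ** 1.05
# e.g. 1 kLOC -> 2.4 man-months, 10 kLOC -> ~26.9, 100 kLOC -> ~302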
But as you said, LOC probably is not the right measure to use. Keywords: software metrics, function points, evidence based scheduling, the planning game.
In my opinion it also depends on the design of your code - I've worked on projects in the 1-10k LOC range that were so poorly designed that they felt like really large projects.
But is LOC really an interesting measure for code? ;-)