Related
I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each of the xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence interval, you'd like to now whether n is large enough so that, with high probability the true mean is within the interval of your mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑i [xi] / n. Also calculate the normalized sample variance s2 = ∑i [(xi - μ)2] / (n - 1)
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± zα / 2(s / √(n)), where, if necessary, you can find here an explanation on the z and α.
If n < 30, the confidence interval is approximated as μ ± tα / 2(s / √(n)); see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values. For example last X values or so. According to this values, I would calculate rate of change (derivative of your throughput). If your derivative is close to zero, then your test is stable. I will stop test.
How to find X? I think instead of constant value, such as 10, choosing a value according to maximum number of test can be more suitable, for example:
X = max(10,max_test_count * 0.01)
The documentation for both of these methods are both very generic wherever I look. I would like to know what exactly I'm looking at with the returned arrays I'm getting from each method.
For getByteTimeDomainData, what time period is covered with each pass? I believe most oscopes cover a 32 millisecond span for each pass. Is that what is covered here as well? For the actual element values themselves, the range seems to be 0 - 255. Is this equivalent to -1 - +1 volts?
For getByteFrequencyData the frequencies covered is based on the sampling rate, so each index is an actual frequency, but what about the actual element values themselves? Is there a dB range that is equivalent to the values returned in the returned array?
getByteTimeDomainData (and the newer getFloatTimeDomainData) return an array of the size you requested - its frequencyBinCount, which is calculated as half of the requested fftSize. That array is, of course, at the current sampleRate exposed on the AudioContext, so if it's the default 2048 fftSize, frequencyBinCount will be 1024, and if your device is running at 44.1kHz, that will equate to around 23ms of data.
The byte values do range between 0-255, and yes, that maps to -1 to +1, so 128 is zero. (It's not volts, but full-range unitless values.)
If you use getFloatFrequencyData, the values returned are in dB; if you use the Byte version, the values are mapped based on minDecibels/maxDecibels (see the minDecibels/maxDecibels description).
Mozilla 's documentation describes the difference between getFloatTimeDomainData and getFloatFrequencyData, which I summarize below. Mozilla docs reference the Web Audio
experiment ; the voice-change-o-matic. The voice-change-o-matic illustrates the conceptual difference to me (it only works in my Firefox browser; it does not work in my Chrome browser).
TimeDomain/getFloatTimeDomainData
TimeDomain functions are over some span of time.
We often visualize TimeDomain data using oscilloscopes.
In other words:
we visualize TimeDomain data with a line chart,
where the x-axis (aka the "original domain") is time
and the y axis is a measure of a signal (aka the "amplitude").
Change the voice-change-o-matic "visualizer setting" to Sinewave to
see getFloatTimeDomainData(...)
Frequency/getFloatFrequencyData
Frequency functions (GetByteFrequencyData) are at a point in time; i.e. right now; "the current frequency data"
We sometimes see these in mp3 players/ "winamp bargraph style" music players (aka "equalizer" visualizations).
In other words:
we visualize Frequency data with a bar graph
where the x-axis (aka "domain") are frequencies or frequency bands
and the y-axis is the strength of each frequency band
Change the voice-change-o-matic "visualizer setting" to Frequency bars to see getFloatFrequencyData(...)
Fourier Transform (aka Fast Fourier Transform/FFT)
Another way to think about "time domain vs frequency" is shown the diagram below, from Fast Fourier Transform wikipedia
getFloatTimeDomainData gives you the chart on on the top (x-axis is Time)
getFloatFrequencyData gives you the chart on the bottom (x-axis is Frequency)
a Fast Fourier Transform (FFT) converts the Time Domain data into Frequency data, in other words, FFT converts the first chart to the second chart.
cwilso has it backwards.
the time data array is the longer one (fftSize), and the frequency data array is the shorter one (half that, frequencyBinCount).
fftSize of 2048 at the usual sample rate of 44.1kHz means each sample has 1/44100 duration, you have 2048 samples at hand, and thus are covering a duration of 2048/44100 seconds, which 46 milliseconds, not 23 milliseconds. The frequencyBinCount is indeed 1024, but that refers to the frequency domain (as the name suggests), not the time domain, and it the computation 1024/44100, in this context, is about as meaningful as adding your birth date to the fftSize.
A little math illustrating what's happening: Fourier transform is a 'vector space isomorphism', that is, a mapping going bijectively (i.e., reversible) between 2 vector spaces of the same dimension; the 'time domain' and the 'frequency domain.' The vector space dimension we have here (in both cases) is fftSize.
So where does the 'half' come from? The frequency domain coefficients 'count double'. Either because they 'actually are' complex numbers, or because you have the 'sin' and the 'cos' flavor. Or, because you have a 'magnitude' and a 'phase', which you'll understand if you know how complex numbers work. (Those are 3 ways to say the same in a different jargon, so to speak.)
I don't know why the API only gives us half of the relevant numbers when it comes to frequency - I can only guess. And my guess is that those are the 'magnitude' numbers, and the 'phase' numbers are thrown out. The reason that this is my guess is that in applications, magnitude is far more important than phase. Still, I'm quite surprised that the API throws out information, and I'd be glad if some expert who actually knows (and isn't guessing) can confirm that it's indeed the magnitude. Or - even better (I love to learn) - correct me.
I have two codes that theoretically should return the exact same output. However, this does not happen. The issue is that the two codes handle very small numbers (doubles) to the order of 1e-100 or so. I suspect that there could be some numerical issues which are related to that, and lead to the two outputs being different even though they should be theoretically the same.
Does it indeed make sense that handling numbers on the order of 1e-100 cause such problems? I don't mind the difference in output, if I could safely assume that the source is numerical issues. Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
Thanks.
Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
The first reference that comes to mind is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It covers floating-point maths in general.
As far as numerical stability is concerned, the best references probably depend on the numerical algorithm in question. Two wide-ranging works that come to mind are:
Numerical Recipes by Press et al;
Matrix Computations by Golub and Van Loan.
It is not necessarily the small numbers that are causing the problem.
How do you check whether the outputs are the "exact same"?
I would check equality with tolerance. You may consider the floating point numbers x and y equal if either fabs(x-y) < 1.0e-6 or fabs(x-y) < fabs(x)*1.0e-6 holds.
Usually, there is a HUGE difference between the two algorithms if there are numerical issues. Often, a small change in the input may result in extreme changes in the output, if the algorithm suffers from numerical issues.
What makes you think that there are "numerical issues"?
If possible, change your algorithm to use Kahan Summation (aka compensated summation). From Wikipedia:
function KahanSum(input)
var sum = 0.0
var c = 0.0 //A running compensation for lost low-order bits.
for i = 1 to input.length do
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
//Next time around, the lost low part will be added to y in a fresh attempt.
return sum
This works by keeping a second running total of the cumulative error, similar to the Bresenham line drawing algorithm. The end result is that you get precision that is nearly double the data type's advertised precision.
Another technique I use is to sort my numbers from small to large (by manitude, ignoring sign) and add or subtract the small numbers first, then the larger ones. This has the virtue that if you add and subtract the same value multiple times, such numbers may cancel exactly and can be removed from the list.
I have a list of documents each having a relevance score for a search query. I need older documents to have their relevance score dampened, to try to introduce their date in the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function is too discriminating for close recent dates.
I was thinking maybe a mathematical function with range (0..1) and domain(0..x) to amplify their score, where the x-axis is the age of a document. It's best to explain what I further need from the function by an image:
Decaying behavior is often modeled well by an exponentional function (many decaying processes in nature also follow it). You would use 2 positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range [0,1] set A=1. Larger B give slower decays.
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Let the current date be somewhere in the negative numbers for such a function, and you can get a value that is current for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note the x is as per your diagram and is the "threshold". If the age of the document is greater than x the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. Domain = (0,10) with a maximum score of 10: 10*(log(11-x))/log(11)
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2 - You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
What is the best way to constrain the values of a PRNG to a smaller range? If you use modulus and the old max number is not evenly divisible by the new max number you bias toward the 0 through (old_max - new_max - 1). I assume the best way would be something like this (this is floating point, not integer math)
random_num = PRNG() / max_orginal_range * max_smaller_range
But something in my gut makes me question that method (maybe floating point implementation and representation differences?).
The random number generator will produce consistent results across hardware and software platforms, and the constraint needs to as well.
I was right to doubt the pseudocode above (but not for the reasons I was thinking). MichaelGG's answer got me thinking about the problem in a different way. I can model it using smaller numbers and test every outcome. So, let's assume we have a PRNG that produces a random number between 0 and 31 and you want the smaller range to be 0 to 9. If you use modulus you bias toward 0, 1, 2, and 3. If you use the pseudocode above you bias toward 0, 2, 5, and 7. I don't think there can be a good way to map one set into the other. The best that I have come up with so far is to regenerate the random numbers that are greater than old_max/new_max, but that has deep problems as well (reducing the period, time to generate new numbers until one is in the right range, etc.).
I think I may have naively approached this problem. It may be time to start some serious research into the literature (someone has to have tackled this before).
I know this might not be a particularly helpful answer, but I think the best way would be to conceive of a few different methods, then trying them out a few million times, and check the result sets.
When in doubt, try it yourself.
EDIT
It should be noted that many languages (like C#) have built in limiting in their functions
int maximumvalue = 20;
Random rand = new Random();
rand.Next(maximumvalue);
And whenever possible, you should use those rather than any code you would write yourself. Don't Reinvent The Wheel.
This problem is akin to rolling a k-sided die given only a p-sided die, without wasting randomness.
In this sense, by Lemma 3 in "Simulating a dice with a dice" by B. Kloeckner, this waste is inevitable unless "every prime number dividing k also divides p". Thus, for example, if p is a power of 2 (and any block of random bits is the same as rolling a die with a power of 2 number of faces) and k has prime factors other than 2, the best you can do is get arbitrarily close to no waste of randomness, such as by batching multiple rolls of the p-sided die until p^n is "close enough" to a power of k.
Let me also go over some of your concerns about regenerating random numbers:
"Reducing the period": Besides batching of bits, this concern can be dealt with in several ways:
Use a PRNG with a bigger "period" (maximum cycle length).
Add a Bays–Durham shuffle to the PRNG's implementation.
Use a "true" random number generator; this is not trivial.
Employ randomness extraction, which is discussed in Devroye and Gravel 2015-2020 and in my Note on Randomness Extraction. However, randomness extraction is pretty involved.
Ignore the problem, especially if it isn't a security application or serious simulation.
"Time to generate new numbers until one is in the right range": If you want unbiased random numbers, then any algorithm that does so will generally have to run forever in the worst case. Again, by Lemma 3, the algorithm will run forever in the worst case unless "every prime number dividing k also divides p", which is not the case if, say, k is 10 and p is 32.
See also the question: How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?, especially my answer there.
If PRNG() is generating uniformly distributed random numbers then the above looks good. In fact (if you want to scale the mean etc.) the above should be fine for all purposes. I guess you need to ask what the error associated with the original PRNG() is, and whether further manipulating will add to that substantially.
If in doubt, generate an appropriately sized sample set, and look at the results in Excel or similar (to check your mean / std.dev etc. for what you'd expect)
If you have access to a PRNG function (say, random()) that'll generate numbers in the range 0 <= x < 1, can you not just do:
random_num = (int) (random() * max_range);
to give you numbers in the range 0 to max_range?
Here's how the CLR's Random class works when limited (as per Reflector):
long num = maxValue - minValue;
if (num <= 0x7fffffffL) {
return (((int) (this.Sample() * num)) + minValue);
}
return (((int) ((long) (this.GetSampleForLargeRange() * num))) + minValue);
Even if you're given a positive int, it's not hard to get it to a double. Just multiply the random int by (1/maxint). Going from a 32-bit int to a double should provide adequate precision. (I haven't actually tested a PRNG like this, so I might be missing something with floats.)
Psuedo random number generators are essentially producing a random series of 1s and 0s, which when appended to each other, are an infinitely large number in base two. each time you consume a bit from you're prng, you are dividing that number by two and keeping the modulus. You can do this forever without wasting a single bit.
If you need a number in the range [0, N), then you need the same, but instead of base two, you need base N. It's basically trivial to convert the bases. Consume the number of bits you need, return the remainder of those bits back to your prng to be used next time a number is needed.