Testing large input range scenarios with JUnit

I am wondering how best to write a JUnit test for a function that calculates a number of points and values in time based on a number of inputs. The purpose of the method is to calculate a series of points in time given a series of gradient value pairs, i.e.
Gradient 1 to Value 1, Gradient 2 to Value 2, Gradient 3 to Value 3, and so on...
Given a starting point in time and a starting value, the function calculates the point in time at which each Value (in the gradient value pairs) is reached, up until a target value is reached. This is essentially to plot a line on a graph with the x-axis having date values and the y-axis having numeric values.
The method to test takes the following inputs:
StartTime (Date)
StartValue (Double)
TargetValue (Double)
GradientValuePairs (ArrayList)
EnsurePointEvery5Minutes (Boolean)
Where GradientValuePair is like:
class GradientValuePair {
Double gradient; // Gradient up to Target
Double target;
...
}
The output from this method is essentially ArrayList - a profile - with:
class DatePoint {
Date date;
Double value;
...
}
The EnsurePointEvery5Minutes parameter basically adds a date point every 5 minutes to the calculated profile, which is then returned by the method.
To ensure the test has worked I will need to check that each date and value matches what is expected, by either:
Iterating through the array with an array of what is expected.
Store minute/second offsets from the StartTime with the expected value in some sort of structure.
Now the difficult part for me is deciding on how to write the TestCase. I want to test a broad/diverse range of inputs so that:
StartTime will cover 30 minutes i.e. in range of 2012-03-08 00:00 to 2012-03-08 00:30.
StartValue will be in the range of 0 to 1000.
TargetValue will be in the range of StartValue to 1000.
GradientValuePairs will require around 10 different arrays to be tested.
EnsurePointEvery5Minutes will be tested with both true and false.
Now given the number of different input sets will be something like:
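30 * 500,500 * 10 * 2 = 300,300,000 different test input sets, where 500,500 is the number of integer StartValue/TargetValue combinations in 0 to 1000 with TargetValue above StartValue, and the factor of 10 covers the GradientValuePairs arrays.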
Or call us crazy for wanting to do this. Maybe the tests are too diverse for this instance.
I am wondering if anybody has any advice for testing such scenarios like this. I can't think of any other way to do this than implement my own algorithm for calculating the output before each call to the method I am testing - then who is to say that the algorithm I implement to test it is correct.

If I understand correctly, you are proposing to test every possible combination of numeric inputs. That is almost never required of unit tests, as it would be essentially equivalent to testing whether the Java math library works for all numbers for all operations. Generally what you will do is try to identify edge conditions and write tests for those. These would include things like zeros, negatives, numeric overflow, and combinations of inputs whose intermediate computations produce those same conditions. Then, of course, you would also want to test a handful of normal vanilla cases that are not edge cases.
So short answer: no you should not need to test 300M+ input sets.
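As a rough illustration of how small an edge-case-driven input set can be, here is a minimal sketch. It is written in Python for brevity; the same idea would map onto a JUnit parameterized test. The helper name boundary_values and the choice of "low, middle, high" as representative values are assumptions made for this example, not something taken from the question.

```python
from itertools import product

def boundary_values(low, high):
    """Return the two edges of a range plus one interior value, de-duplicated."""
    return sorted({float(low), (low + high) / 2.0, float(high)})

# Representative values instead of every value in each range (assumed choices).
start_offsets_min = [0, 15, 30]           # minutes after 2012-03-08 00:00
start_values = boundary_values(0, 1000)   # 0, 500, 1000
ensure_flags = [True, False]

# TargetValue depends on StartValue, so its boundaries are derived per case.
cases = [
    (offset, start, target, flag)
    for offset, start, flag in product(start_offsets_min, start_values, ensure_flags)
    for target in boundary_values(start, 1000)
]

print(len(cases), "cases per GradientValuePairs array")
```

Each tuple would then be paired with a hand-computed expected profile, so the expected values do not come from a second implementation of the same algorithm.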

Related

Function that will not return 0

I am writing a formula to use as a decay multiplier on a given value.
The problem is the following: I have a processing window of, say, 10 days, and this window is computed anew every day. I need to decay a certain parameter by a factor reflecting the days that an id is present. Currently what I do is (previousWinSize - (start of the current window - start of the previous window)) / previousWinSize
In this case, if my previous window size is 10 and the difference in the days of processing is two, I get (10-2)/10 = 0.8 to multiply my variable by, which decays it by 0.2.
However, if I have a 3 day window and again 2 days of difference, I get (3-2)/3, roughly 0.33, which cuts more than I would like.
I am looking for a formula that would scale better when the numbers are small and would not produce a huge decay factor.
Thank you in advance.
I recommend making use of a sigmoid function.
You can take the output of your function (i.e. the number between 0 and 1 that it returns based on the difference in days of processing) and feed it into the sigmoid. If you set up the a (slope) and b (inflection point) parameters properly you can, for example, ensure that the lowest decay multiplier you get is ~0.5 when your original equation returns a number close to 0.
I've graphed the example I stated above here:
https://www.desmos.com/calculator/nqemuexjhg
(This is based on: https://www.desmos.com/calculator/rna4aqta0c)
I think you do have two edge cases with this method though. When your equation returns 0 the sigmoid isn't exactly going to give you 0.5 (which you may not even want to begin with), you'll end up getting something that's close to 0.5. In this scenario what you may start to see is your values drifting if you keep applying the sigmoid. The same is true for when your equation returns 1. After putting it through the sigmoid you won't get 1, you'll get something close to 1.
What I think I'd do in such a scenario is have some sort of check before the sigmoid gets applied
e.g.
if(x == 0)
    y = 0;
else if(x == 1)
    y = 1;
else
    y = sigmoid(x);
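A minimal runnable sketch of this guarded sigmoid follows. The logistic form and the parameter values a=6 and b=0 are assumptions chosen so that inputs near 0 map to about 0.5 and inputs near 1 map to almost 1; they are not the parameters from the linked graphs.

```python
import math

def sigmoid(x, a=6.0, b=0.0):
    """Logistic curve with slope a and inflection point b (assumed values)."""
    return 1.0 / (1.0 + math.exp(-a * (x - b)))

def decay_multiplier(x):
    """Apply the sigmoid to the raw factor x in [0, 1], pinning the end points."""
    if x == 0:
        return 0.0
    if x == 1:
        return 1.0
    return sigmoid(x)

# Raw factor from the question: (3 - 2) / 3 for a 3 day window, 2 days apart.
print(decay_multiplier((3 - 2) / 3))
```

With these assumed parameters the 3 day window case is softened from about 0.33 to roughly 0.88. Note that pinning x == 0 to 0 creates a jump from 0 to just above 0.5 for tiny positive x, which may or may not be what you want.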
Sources / Possible further reading:
https://en.wikipedia.org/wiki/Sigmoid_function

understanding getByteTimeDomainData and getByteFrequencyData in web audio

The documentation for both of these methods is very generic wherever I look. I would like to know what exactly I'm looking at with the returned arrays I'm getting from each method.
For getByteTimeDomainData, what time period is covered with each pass? I believe most oscopes cover a 32 millisecond span for each pass. Is that what is covered here as well? For the actual element values themselves, the range seems to be 0 - 255. Is this equivalent to -1 - +1 volts?
For getByteFrequencyData the frequencies covered are based on the sampling rate, so each index is an actual frequency, but what about the actual element values themselves? Is there a dB range that is equivalent to the values returned in the array?
getByteTimeDomainData (and the newer getFloatTimeDomainData) return an array of the size you requested - its frequencyBinCount, which is calculated as half of the requested fftSize. That array is, of course, at the current sampleRate exposed on the AudioContext, so if it's the default 2048 fftSize, frequencyBinCount will be 1024, and if your device is running at 44.1kHz, that will equate to around 23ms of data.
The byte values do range between 0-255, and yes, that maps to -1 to +1, so 128 is zero. (It's not volts, but full-range unitless values.)
If you use getFloatFrequencyData, the values returned are in dB; if you use the Byte version, the values are mapped based on minDecibels/maxDecibels (see the minDecibels/maxDecibels description).
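The two byte mappings can be sketched as plain arithmetic. This is an approximate inverse of the mappings described above, shown in Python rather than in browser code; the function names are made up for this example, and the defaults of -100 and -30 are assumed to match the AnalyserNode defaults for minDecibels and maxDecibels.

```python
def byte_to_sample(b):
    """Approximate unitless sample for a time-domain byte; 128 maps to zero."""
    return (b - 128) / 128.0

def byte_to_db(b, min_db=-100.0, max_db=-30.0):
    """Approximate dB value for a frequency byte: 0 is min_db, 255 is max_db."""
    return min_db + (b / 255.0) * (max_db - min_db)

for b in (0, 128, 255):
    print(b, byte_to_sample(b), byte_to_db(b))
```

Because the byte values are quantized and clipped, these conversions only recover the original values to within one step, and anything outside the configured dB range is lost.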
Mozilla's documentation describes the difference between getFloatTimeDomainData and getFloatFrequencyData, which I summarize below. The Mozilla docs reference a Web Audio experiment, the voice-change-o-matic. The voice-change-o-matic illustrates the conceptual difference to me (it only works in my Firefox browser; it does not work in my Chrome browser).
TimeDomain/getFloatTimeDomainData
TimeDomain functions are over some span of time.
We often visualize TimeDomain data using oscilloscopes.
In other words:
we visualize TimeDomain data with a line chart,
where the x-axis (aka the "original domain") is time
and the y axis is a measure of a signal (aka the "amplitude").
Change the voice-change-o-matic "visualizer setting" to Sinewave to see getFloatTimeDomainData(...)
Frequency/getFloatFrequencyData
Frequency functions (getByteFrequencyData) are at a point in time; i.e. right now; "the current frequency data"
We sometimes see these in mp3 players/ "winamp bargraph style" music players (aka "equalizer" visualizations).
In other words:
we visualize Frequency data with a bar graph
where the x-axis (aka "domain") are frequencies or frequency bands
and the y-axis is the strength of each frequency band
Change the voice-change-o-matic "visualizer setting" to Frequency bars to see getFloatFrequencyData(...)
Fourier Transform (aka Fast Fourier Transform/FFT)
Another way to think about "time domain vs frequency" is shown in the diagram from the Fast Fourier Transform Wikipedia article:
getFloatTimeDomainData gives you the chart on the top (x-axis is Time)
getFloatFrequencyData gives you the chart on the bottom (x-axis is Frequency)
A Fast Fourier Transform (FFT) converts the Time Domain data into Frequency data; in other words, the FFT converts the first chart to the second chart.
cwilso has it backwards.
the time data array is the longer one (fftSize), and the frequency data array is the shorter one (half that, frequencyBinCount).
fftSize of 2048 at the usual sample rate of 44.1kHz means each sample has 1/44100 seconds duration, you have 2048 samples at hand, and thus are covering a duration of 2048/44100 seconds, which is about 46 milliseconds, not 23 milliseconds. The frequencyBinCount is indeed 1024, but that refers to the frequency domain (as the name suggests), not the time domain, and the computation 1024/44100, in this context, is about as meaningful as adding your birth date to the fftSize.
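The arithmetic can be checked with a small sketch. The function names are illustrative, and the bin-frequency formula k * sampleRate / fftSize is the standard FFT relationship rather than something quoted from the API documentation.

```python
def window_duration_ms(fft_size, sample_rate):
    """Time span covered by fft_size consecutive samples, in milliseconds."""
    return 1000.0 * fft_size / sample_rate

def bin_frequency_hz(k, fft_size, sample_rate):
    """Frequency at the start of frequency bin k."""
    return k * sample_rate / fft_size

print(window_duration_ms(2048, 44100))     # time-domain window length in ms
print(bin_frequency_hz(1, 2048, 44100))    # width of one frequency bin in Hz
print(bin_frequency_hz(1024, 2048, 44100)) # upper edge: half the sample rate
```

The 1024 frequency bins therefore span 0 Hz up to half the sample rate, while the time-domain data spans all 2048 samples.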
A little math illustrating what's happening: Fourier transform is a 'vector space isomorphism', that is, a mapping going bijectively (i.e., reversible) between 2 vector spaces of the same dimension; the 'time domain' and the 'frequency domain.' The vector space dimension we have here (in both cases) is fftSize.
So where does the 'half' come from? The frequency domain coefficients 'count double'. Either because they 'actually are' complex numbers, or because you have the 'sin' and the 'cos' flavor. Or, because you have a 'magnitude' and a 'phase', which you'll understand if you know how complex numbers work. (Those are 3 ways to say the same in a different jargon, so to speak.)
I don't know why the API only gives us half of the relevant numbers when it comes to frequency - I can only guess. And my guess is that those are the 'magnitude' numbers, and the 'phase' numbers are thrown out. The reason that this is my guess is that in applications, magnitude is far more important than phase. Still, I'm quite surprised that the API throws out information, and I'd be glad if some expert who actually knows (and isn't guessing) can confirm that it's indeed the magnitude. Or - even better (I love to learn) - correct me.

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's how the data magnitude relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less than 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperature and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradient (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.
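One way to see why the magnitude of tK matters relative to delta is to tabulate the acceptance probability from the question's formula, which depends only on the ratio delta/tK. This is a small sketch with made-up sample values (a delta of 10 and a handful of temperatures); it is not taken from OptaPlanner.

```python
import math

def acceptance_probability(delta, temperature):
    """Metropolis-style rule: improvements always pass, worse moves sometimes."""
    if delta < 0:
        return 1.0
    return math.exp(-delta / temperature)

delta = 10.0  # a typical worsening move when deltas are below 20
for t_k in (20.0, 5.0, 1.0):
    unscaled = acceptance_probability(delta, t_k)
    scaled = acceptance_probability(delta, t_k * 0.001)
    print(f"tK={t_k:5.1f}  unscaled={unscaled:.4f}  scaled={scaled:.3e}")
```

Under these assumed numbers, an unscaled temperature of the same order as delta accepts a large share of worsening moves for most of the ramp, which may be why the system does not settle, while the 0.001 scale makes the search almost purely greedy. A starting temperature tied to the largest single-move delta, as suggested above, sits between those extremes.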

Function to dampen a value

I have a list of documents each having a relevance score for a search query. I need older documents to have their relevance score dampened, to try to introduce their date in the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function is too discriminating for close recent dates.
I was thinking maybe a mathematical function with range (0..1) and domain(0..x) to amplify their score, where the x-axis is the age of a document. It's best to explain what I further need from the function by an image:
Decaying behavior is often modeled well by an exponential function (many decaying processes in nature also follow it). You would use 2 positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range of [0,1], set A=1. A larger B gives a faster decay.
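A minimal sketch of this exponential dampening, with A fixed at 1. The half_life helper is an extra convenience added for this example (it follows from solving exp(-B x) = 0.5) and the sample ages are arbitrary.

```python
import math

def exp_dampen(age, b):
    """Multiplier y(x) = exp(-b * x) for a document of the given age."""
    return math.exp(-b * age)

def rate_for_half_life(half_life):
    """B such that the multiplier drops to 0.5 at age == half_life."""
    return math.log(2.0) / half_life

b = rate_for_half_life(30.0)  # e.g. score halves every 30 days (assumed)
for age in (0, 7, 30, 90):
    print(age, round(exp_dampen(age, b), 3))
```

Choosing B through a half-life can make the parameter easier to reason about than picking B directly.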
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Let the current date be somewhere in the negative numbers for such a function, and you can get a value that is current for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note the x is as per your diagram and is the "threshold". If the age of the document is greater than x the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. Domain = (0,10) with a maximum score of 10: 10*(log(11-x))/log(11)
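The logarithmic variant can be sketched as follows; the function and parameter names are chosen for this example only.

```python
import math

def log_dampen(age, threshold, max_score):
    """max_score * log base (threshold + 1) of ((threshold + 1) - age).

    Valid only for age < threshold + 1; the result is negative once
    age exceeds threshold.
    """
    base = threshold + 1.0
    return max_score * math.log(base - age) / math.log(base)

for age in (0, 5, 9, 10):
    print(age, round(log_dampen(age, threshold=10, max_score=10), 3))
```

This curve stays close to the maximum score for young documents and falls off steeply only near the threshold, reaching exactly 0 at age == threshold.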
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2 - You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
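A sketch of this floored sigmoid with the constants exposed as parameters. The parameter names and the overflow guard for very old documents are additions made for this example.

```python
def floored_sigmoid(age, base=5.0, midpoint=3.0, floor=0.2):
    """floor + (1 - floor) / (1 + base ** (age - midpoint))."""
    try:
        growth = base ** (age - midpoint)
    except OverflowError:
        # For very large ages the sigmoid term is effectively zero.
        return floor
    return floor + (1.0 - floor) / (1.0 + growth)

for age in (0, 3, 6, 1000):
    print(age, round(floored_sigmoid(age), 4))
```

With the default constants this reproduces 0.8/(1+5^(x-3)) + 0.2: the multiplier is close to 1 for new documents, 0.6 at the midpoint, and approaches the 0.2 floor for old ones.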

Generate random variable in real-time without state

I want a function which takes, as input, the number of seconds elapsed since the last time it was called, and returns true or false for whether an event should have happened in that time period. I want it such that it will fire, on average, once per X time passed, say 5 seconds. I also am interested if it's possible to do without any state, which the answer from this question used.
I guess to be fully accurate it would have to return an integer for the number of events that should've happened, in the case of it being called once every 10*X times or something like that, so bonus points for that!
It sounds like you're describing a Poisson process, in which the number of events in a time interval of length t follows the Poisson distribution with rate parameter lambda=1/X, so the mean number of events in that interval is lambda*t.
The way to use the expression on the latter page is as follows, for a given value of lambda, and the parameter value of t:
Calculate a random number between zero and one; call this p
Calculate Pr(k=0) (ie, exp(-lambda*t) * (lambda*t)**0 / factorial(0))
If this number is bigger than p, then the number of simulated events is 0. END
Otherwise, calculate Pr(k=1) and add it to Pr(k=0).
If this number is bigger than p, then the answer is 1. END
...and so on.
Note that, yes, this can end up with more than one event in a time period, if t is large compared with 1/lambda (ie X). If t is always going to be small compared to 1/lambda, then you are very unlikely to get more than one event in the period, and so the algorithm is simplified considerably (if p < exp(-lambda*t), then 0, else 1).
Note 2: there is no guarantee that you will get at least one event per interval X. It's just that it'll average out to that.
(the above is rather off the top of my head; test your implementation carefully)
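The steps above can be sketched as a stateless function. The names events_in_interval and mean_interval are made up for this example, and the optional u argument exists only so the function can be checked with a fixed "random" number.

```python
import math
import random

def events_in_interval(t, mean_interval, u=None):
    """Number of simulated events in t seconds, averaging one per mean_interval.

    Walks the cumulative Poisson probabilities until they exceed u.
    Intended for modest t / mean_interval; for ratios in the hundreds
    exp() underflows and the result is no longer meaningful.
    """
    lam_t = t / mean_interval          # lambda * t, with lambda = 1 / X
    if u is None:
        u = random.random()            # the number called p in the steps
    k = 0
    term = math.exp(-lam_t)            # Pr(k = 0)
    cumulative = term
    while cumulative <= u and term > 0.0:
        k += 1
        term *= lam_t / k              # Pr(k) from Pr(k - 1)
        cumulative += term
    return k

# Called once per second with X = 5: usually 0, occasionally 1 or more.
print([events_in_interval(1.0, 5.0) for _ in range(20)])
```

No state is carried between calls because the events of a Poisson process in disjoint intervals are independent; the caller only passes in the elapsed time.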
Assume some event type happens on average once per 10 seconds, and you want to print a simulated list of timestamps on which the events happened.
A good method would be to generate a random integer in the range [0,9] each 1 second. If it is 0 - fire the event for this second. Of course you can control the resolution: You can generate a random integer in the range [0,99] each 0.1 second, and if it comes 0 - fire the event for this DeciSecond.
Assuming there is no dependency between events, there is no need to keep state.
To find out how many times the event happens in a given timeslice - just generate enough random integers - according to the required resolution.
Edit
You should use high resolution (at least 20 randoms per period of one event) for the simulation to be valid.
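A minimal sketch of this slice-based approach, using 20 slices per event period as the default resolution. The function name and parameters are assumptions for this example, and it draws uniform floats instead of random integers, which gives the same per-slice firing probability.

```python
import random

def count_events_by_slices(elapsed, mean_interval, slices_per_interval=20):
    """Count events in `elapsed` seconds by testing many short time slices.

    Each slice lasts mean_interval / slices_per_interval seconds and fires
    with probability 1 / slices_per_interval, independently of the others.
    """
    slice_length = mean_interval / slices_per_interval
    n_slices = int(round(elapsed / slice_length))
    p = 1.0 / slices_per_interval
    return sum(1 for _ in range(n_slices) if random.random() < p)

# One event per 10 seconds on average, evaluated over 60 second calls.
print([count_events_by_slices(60.0, 10.0) for _ in range(10)])
```

The cost grows with the number of slices, so for long gaps between calls a direct draw from the Poisson distribution is likely cheaper.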