I have a coordinate time series for 2 hours and 1 second interval (observation=7200). How can I compute the hourly amplitude from this data?
If your data does not contain noise (You did not say nothing about it), then you can calculate amplitude very easy by using definition of amplitude:
A = (max(x) - min(x))/2, where x - your data
P.S.
If you are not satisfied with the answer, then you need to clarify your question. In particular - what signal and what you mean by amplitude and how you going to use this number
Related
I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each of the xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence interval, you'd like to now whether n is large enough so that, with high probability the true mean is within the interval of your mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑i [xi] / n. Also calculate the normalized sample variance s2 = ∑i [(xi - μ)2] / (n - 1)
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± zα / 2(s / √(n)), where, if necessary, you can find here an explanation on the z and α.
If n < 30, the confidence interval is approximated as μ ± tα / 2(s / √(n)); see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values. For example last X values or so. According to this values, I would calculate rate of change (derivative of your throughput). If your derivative is close to zero, then your test is stable. I will stop test.
How to find X? I think instead of constant value, such as 10, choosing a value according to maximum number of test can be more suitable, for example:
X = max(10,max_test_count * 0.01)
First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's how the data magnitude relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less then 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperate and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradiant (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.
I am trying to fully understand the sampling frequency concept. I got a useful answer from PaxRomana99 (I need help setting up the correct frequency vector) and it seems like the way to get the correct frequency vector is when you set up the time interval to be related to fs:
t = 0:1/fs:10; % 10 seconds of data sampled at fs
y = sin(2*pi*100.*t);
However, in my case, I have no control on how my time interval is set up, I am given 2 arrays, one is the time interval and one is the sinusoidal wave resulting from that time interval. So should I got through the time array and extract fs? which should be the spacing between the time steps? am I in the wrong path here?
Thanks for any hints and any help!
Yes, as you suggest, you get the sampling frequency from the time array. It's just 1/(time sample interval).
I have a list of documents each having a relevance score for a search query. I need older documents to have their relevance score dampened, to try to introduce their date in the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function is too discriminating for close recent dates.
I was thinking maybe a mathematical function with range (0..1) and domain(0..x) to amplify their score, where the x-axis is the age of a document. It's best to explain what I further need from the function by an image:
Decaying behavior is often modeled well by an exponentional function (many decaying processes in nature also follow it). You would use 2 positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range [0,1] set A=1. Larger B give slower decays.
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Let the current date be somewhere in the negative numbers for such a function, and you can get a value that is current for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note the x is as per your diagram and is the "threshold". If the age of the document is greater than x the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. Domain = (0,10) with a maximum score of 10: 10*(log(11-x))/log(11)
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2 - You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
For a particular project, we acquire data for a number of events and collect variables about those events at the same time. After the data has been collected, we perform a user-customizable analysis on said data to determine whatever it is that the user is interested in.
The data is collected in a form similar to this:
Timestamp Event
0 x = 0
0 y = 1
3 Event A occurred
3 x = 1
4 Event A occurred
4 x = 2
9 Event B occurred
9 y = 2
9 x = 0
To understand the entire state at any time, the most straightforward approach is to walk over the entire set of data. For example, if I start at time 0, and "analyze" until timestamp 5, I know that at that point x = 2, y = 1, and Event A has occurred twice. That's a really simple example. The user might be (and often is) interested in the time between events, say from A to B, and they might specify the first occurrence of A, then B, or the last occurrence of A, then B (respectively, 9-3 = 6 or 9-4 = 5). Like I said, this is easy to analyze when you're walking over the entire set.
Now, we need to adapt the model to analyze an arbitrary window of time. If we look at 0-N, that's the easy case. But if I look at 1-5, for instance, I have no notion of y unless I begin at 0 and know that y was initially 1 and did not change in the window 1-5.
Our approach is to essentially create a dictionary of variables, and run callbacks on events. If one analysis was "What is x when Event A occurs and time is > 3" then we would run that callback on the first Event A, and it would immediately return because time is not greater than 3. It would run again at 4, and and it would report that x was 1 at t=4.
To adapt to the "time-windowing", I think I am going to (in the background) tack on additional conditions to the analysis. If their analysis is just "What is x when Event A occurs", and the current window is 1-5, then I will change it to "What is x when Event A occurs and time >= 1 and time <= 5". Then if the next window is 6-10, I can readjust the condition as necessary.
My main question is: what pattern does this fit? We are obviously not the first people to approach a problem like this, but I have not been able to find how others have approached it. I probably just don't know what exactly to search on Google. Is there any other approach besides keeping a dictionary of the entire global state for looking up a single state at a given time? Note also that the data could have several, maybe tens of thousands of records, so the fewer iterations over the data set, the better.
I think your best approach here would be to take periodic "snapshots" of the full state data, say every 1000 samples (for example), along with recording the deltas. When you're storing your data as offsets from some original value (aka deltas), you don't have any choice but to reconstruct the full data starting with the original values. Storing periodic snapshots will lessen the amount of reconstruction you have to do - the design tradeoff is between low storage requirements but long reconstruction time on the one hand, and higher storage requirements but shorter reconstruction time on the other.
MPEGs, for example, store each frame as the differences between the current frame and the previous frame. Ordinarily, this would force an MPEG to be viewed from the beginning, but the format also periodically stores full frames so that the decoder doesn't have to backtrack all the way to the beginning of the file.
You can search by time in Log(N), and you can have a feeling about how many updates ares acceptable... hence here's my solution:
Pick a number, N, of updates that are acceptable in order to return a result. 256 might be good, given the scales you've mentioned so far.
Every N records, commit an entry of all state to a dictionary, with a timestamp.
Now, you have a tradeoff, dictionary size against speed. N->\infty is regular searching. N<-1 is your current solution, N anywhere else will require less memory, but be slower.
Your implementation is now (for time X):
Log(n) search of subsampled global dictionary to timestamp before X, (timestamped as Y).
Log(n) search of eventlist to timestamp Y, and perform less than N updates.
Picking N as a power of two will even allow you to do some nice shift tricks to do a rounded-down integer divide nice and fast.