I have a histogram in which I count how often a function takes particular values in the range 0.8 to 2.2.
I would like to get the cumulative distribution function for the set of values. Is it correct to just count the total number of occurrences up to each particular value?
For example, the cdf at 0.9 will be the sum of all the occurrences from 0.8 to 0.9?
Is it correct?
Thank you
The sum normalised by the number of entries will give you an estimate of the cdf, yes. It will be as accurate as the histogram is an accurate representation of the pdf. If you want to evaluate the cdf anywhere except the bin endpoints, it makes sense to include a fraction of the counts, so that if you have break points b_i and b_j, then to evaluate the cdf at some point b_i < p < b_j you should add the fraction of counts (p - b_i) / (b_j-b_i) from the relevant cell. Essentially this assumes uniform density within the cells.
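The interpolation described above can be sketched in Python as follows. This is only a minimal illustration: the function name `histogram_cdf` and its arguments (`edges` for the break points, `counts` for the bin counts) are made up for this example, and it assumes uniform density within each bin, as stated above.

```python
from bisect import bisect_right

def histogram_cdf(edges, counts, p):
    """Estimate the CDF at p from histogram break points and bin counts.

    Assumes uniform density inside each bin (hypothetical helper).
    """
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    if p <= edges[0]:
        return 0.0          # below the observed range
    if p >= edges[-1]:
        return 1.0          # above the observed range
    i = bisect_right(edges, p) - 1            # bin with edges[i] <= p < edges[i+1]
    below = sum(counts[:i])                   # all counts in the bins left of bin i
    fraction = (p - edges[i]) / (edges[i + 1] - edges[i])
    return (below + fraction * counts[i]) / total
```

At a break point the fraction is zero, so the result reduces to the plain normalised running sum of the counts.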
You can get an estimate of the cdf from the underlying values, too (based on your question I'm not quite sure what you have access to, whether it's the bin counts in the histogram or the actual values). Beware that doing so will give your CDF discontinuities (steps) at each data point, so think about whether you have enough data, and what you're using the CDF for, to determine whether this is appropriate.
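If the raw values are available, the step-function estimate could look roughly like this. The name `empirical_cdf` is invented for the sketch, and it assumes the values fit in memory as a plain list.

```python
from bisect import bisect_right

def empirical_cdf(values):
    """Return a function x -> fraction of observed values that are <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")

    def cdf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return cdf
```

The returned function jumps by 1/n at each distinct data point (more where values repeat), which is the step behaviour mentioned above.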
As a final note of warning, beware that evaluating the cdf outside of the range of observed values will give you an estimated probability of zero or one (zero for x<0.8, one for x>2.2). You should consider whether the function is truly bounded to that interval, and if not, employ some smoothing to ensure small amounts of probability mass outside the range of observed values.
I need to find a way to redistribute the number ranges when exporting from MySQL in parallel:
Example output (SQL queries results):
what is the best way to redistribute the number ranges after getting the initial results, so the results will be more evenly distributed?
(estimated) desired output:
It seems that originally you believed your data was uniformly distributed, and now you have a view of the number of entries in every evenly spaced bin. You can now update your belief about the distribution of your data: within every bin, the data is uniformly distributed, but bins with a higher result have a larger concentration.
Your updated distribution tells you that the estimated number of results below a value q is equal to the sum of all the results in the buckets whose upper bound is below q, plus
(q-min(q))/(max(q)-min(q))*size(q)
where min(q) and max(q) give you the lower and upper bounds of the bucket that q belongs to, and size(q) is the number of results in that bucket. This is a piecewise linear function whose slope in bucket i is proportional to that bucket's share of the total. Now divide by the total number of results to get a cumulative probability distribution. To get the places where you should query, find the ten values of x where this piecewise function is equal to .1, .2, .3, ..., 1.0. This is a lot easier than inverting an arbitrary function if you exploit the piecewise linear property. For example, if you are trying to find the x associated with .2, first find the bucket i such that (sum of the buckets below i)/total <= .2 <= (sum of the buckets up to and including i)/total. This gives you min(.2), max(.2) and size(.2), that is, the bounds and the size of bucket i.
Then you just have to invert the linear function associated to that bucket
x=(.2-sum_of_buckets_lessthan_i/total)*(max(.2)-min(.2))/(size(.2)/total)+min(.2)
Note that size should not be 0 since you are dividing by it, which makes sense (if the bucket is empty, you should skip it). Finally, if you want the places you are querying to be integers, you can round the values using your preferred criteria. What we are doing here is a Bayesian update of our belief about where the 10 deciles will be, based on observations of the distribution in the 10 places you already queried. You can refine this further once you see the result of this query, and you will eventually reach convergence.
For the example in your table, to find the updated upper limit of bucket 1, you check that
2569/264118<0.1 (first ten percent),
then you check that
(2569+14023)/264118<0.1
and finally you check that ((2569+14023+123762)/264118)>0.1
so your new estimate for the decile should be in between 1014640 and 1141470.
Your new estimate for the upper threshold of the first bucket is
decile_1=(.1-(2569+14023)/264118)*(1141470-1014640)/(123762/264118)+1014640=1024703
similarly, your estimate for the upper bound for the second bucket is:
(.2-(2569+14023)/264118)*(1141470-1014640)/(123762/264118)+1014640=1051770. Note that this linear interpolation will work until the update for the upper limit of bucket 6, since ((2569+14023+123762)/264118)<.6 and you will now need to use the limits for the old bucket ten when updating the buckets 6 and higher.
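The whole procedure can be sketched in Python as below. This is a rough illustration under stated assumptions: `rebalanced_bounds`, `edges` and `counts` are names invented here, the bucket counts are taken to be non-negative with at least one non-empty bucket, and the sample data in the usage lines is a toy example, not the table from the question.

```python
def rebalanced_bounds(edges, counts, parts):
    """Invert the piecewise-linear CDF at 1/parts, 2/parts, ..., 1.

    edges  : bucket boundaries, len(edges) == len(counts) + 1
    counts : number of rows observed in each old bucket
    parts  : number of new, roughly equal-sized buckets wanted
    Returns the upper bound of each new bucket (hypothetical helper).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one non-empty bucket")
    bounds = []
    i = 0        # index of the old bucket currently being consumed
    below = 0    # rows in the old buckets strictly before bucket i
    for k in range(1, parts + 1):
        target = total * k / parts
        # Move right until bucket i contains the target row count;
        # empty buckets are skipped automatically.
        while i < len(counts) - 1 and below + counts[i] < target:
            below += counts[i]
            i += 1
        lo, hi = edges[i], edges[i + 1]
        # Linear interpolation inside bucket i (counts[i] > 0 here).
        bounds.append(lo + (target - below) * (hi - lo) / counts[i])
    return bounds

# Toy data: four old buckets of width 10 with skewed row counts.
new_bounds = rebalanced_bounds([0, 10, 20, 30, 40], [10, 30, 40, 20], 4)
print(new_bounds)
```

Working with raw counts rather than fractions avoids dividing by the total twice; the interpolation is otherwise the same as the formula above. Rounding the results to integers is left to the caller.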
I'd like to know how to plot power series (whose variable is x), but I don't even know where to start.
I know it might not be possible to plot an infinite series, but plotting the sum of the first n terms would do as well.
Gnuplot has a sum function, which can be used inside the using statement to sum up several columns or terms. Together with the special file name + you can implement power series.
Consider the exponential function, which has the power series
\sum_{n=0}^\infty x^n/n!
So, we define a term as
term(x, n) = x**n/n!
Now we can plot the power series up to the n=5 term with
set xrange [0:4]
term(x, n) = x**n/n!
set samples 20
plot '+' using 1:(sum [n=0:5] term($1, n))
To plot the results when using 2 to 7 terms and compare it with the actual exp function, use
term(x, n) = x**n/n!
set xrange [-2:2]
set samples 41
set key left
plot exp(x), for [i=1:6] '+' using 1:(sum[t=0:i] term($1, t)) title sprintf('%d terms', i+1)
The easiest way that I can think of is to generate a file that has a column of x-values and a column of f(x) values, then just plot the table like you would any other data. A power series is continuous, so you can just connect the dots and have a fairly accurate representation (provided your dots are close enough together). Also, when evaluating f(x), you just sum up the first N terms (where N is big enough). Big enough means that the sum of the rest of the terms is smaller than whatever error you allow. (*If you want 3 good digits, then N needs to be large enough that the remaining sum is smaller than .001.)
You can pull out a calc II textbook to determine how to bound the error on the tail of the sum. A lot of calc classes briefly cover it, but students tend to feel like the error estimates are pointless (I know because I've taught the course a few times). As an example, if you have an alternating series (whose terms are decreasing in absolute value), then the absolute value of the first term you omit (don't sum) is an upper bound on your error.
*This statement is not 100% true, it is slightly over simplified, but is correct for most practical purposes.
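As a rough sketch of the table approach, the following Python snippet tabulates a partial sum of the exponential series so that the rows can be written to a file and plotted like any other data. The function names and the choice of the exponential series are assumptions made for illustration only.

```python
import math

def exp_partial_sum(x, n_terms):
    """Sum the first n_terms terms of sum_{n>=0} x^n / n!."""
    total = 0.0
    term = 1.0                      # x^0 / 0!
    for n in range(n_terms):
        total += term
        term *= x / (n + 1)         # turns x^n/n! into x^(n+1)/(n+1)!
    return total

def tabulate(f, x_min, x_max, samples):
    """Return (x, f(x)) rows on an evenly spaced grid of `samples` points."""
    step = (x_max - x_min) / (samples - 1)
    return [(x_min + i * step, f(x_min + i * step)) for i in range(samples)]

# Two whitespace-separated columns, one row per line.
rows = tabulate(lambda x: exp_partial_sum(x, 6), 0.0, 4.0, 21)
table_text = "\n".join(f"{x:.6f} {y:.6f}" for x, y in rows)
print(table_text)
```

Saving `table_text` to a file gives something most plotting tools can read directly; increasing the number of terms shrinks the truncation error as described above.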
I have a working detection and tracking process (pixel image in rows and columns) which does not give perfectly repeatable results because its use of atomicAdd means that data points can be accumulated in different orders leading to round off errors in the calculation of centroids and other track statistics.
In the main there are few clashes for the atomicAdd, so most results are identical. However for verification and validation I need to be able to make the atomicAdd add these clashing data points in a consistent order, such that say thread 3 will beat thread 10 when both want to use the atomicAdd to add a pixel on the row N that they are processing.
Is there a mechanism that allows the atomicAdd to be deterministic in its thread order, or have I missed something?
Check out the "Fast Reproducible Atomic Summations" paper from Berkeley.
http://www.eecs.berkeley.edu/~hdnguyen/public/papers/ARITH21_Fast_Sum.pdf
But basically you could try something like finding the sum of the absolute values along with your original sum, multiplying it by O(N^2), and then subtracting it from and adding it back to your original sum (sum = (sum - sumAbs * N^2) + sumAbs * N^2) to cancel out the lowest bits (the ones that are nondeterministic). As you can see, the upper bound grows proportionally to N^2, so the lower the N (number of elements in the sum), the better your error bound.
You could also try Kahan summation to reduce the error bound in conjunction with the above.
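For reference, Kahan (compensated) summation looks roughly like the following. It is written in Python purely to show the arithmetic; a CUDA version would apply the same steps per accumulator. Note that it only shrinks the rounding error, it does not by itself make the result independent of the order in which values arrive.

```python
def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan compensated summation: carries the rounding error forward."""
    total = 0.0
    compensation = 0.0              # running estimate of the lost low-order bits
    for v in values:
        y = v - compensation        # apply the correction from the previous step
        t = total + y               # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was just lost
        total = t
    return total

# Many tiny values added to 1.0: the naive loop drops every one of them.
data = [1.0] + [1e-16] * 100_000
print(naive_sum(data), kahan_sum(data))
```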
I have a list of documents each having a relevance score for a search query. I need older documents to have their relevance score dampened, to try to introduce their date in the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function is too discriminating for close recent dates.
I was thinking maybe a mathematical function with range (0..1) and domain(0..x) to amplify their score, where the x-axis is the age of a document. It's best to explain what I further need from the function by an image:
Decaying behavior is often modeled well by an exponential function (many decaying processes in nature also follow it). You would use 2 positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range of [0,1], set A=1. Larger B gives faster decay.
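A minimal Python sketch of such a damping factor follows. Expressing B through a half-life is an assumption added here for convenience (B = ln 2 / half_life), and the function names are hypothetical.

```python
import math

def exp_decay(age, rate):
    """Damping factor y(x) = exp(-B x) with A = 1; age and rate must be >= 0."""
    return math.exp(-rate * age)

def rate_from_half_life(half_life):
    """Choose B so that the factor drops to 0.5 after `half_life` time units."""
    return math.log(2) / half_life

# Example: a document's relevance halves every 30 days (illustrative value).
B = rate_from_half_life(30.0)
damped_score = 0.9 * exp_decay(45.0, B)   # original score 0.9, document is 45 days old
print(damped_score)
```

Multiplying the original relevance score by the factor keeps brand-new documents unchanged (factor 1 at age 0) and never makes a score negative.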
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Let the current date be somewhere in the negative numbers for such a function, and you can get a value that is current for some configurable time and then decreases towards a base value.
log((x+1)-age_of_document)
Where the base of the logarithm is (x+1). Note the x is as per your diagram and is the "threshold". If the age of the document is greater than x the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. with a domain of (0,10) and a maximum score of 10: 10*log(11-age)/log(11), where age is the age of the document.
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long tail data points. E.g.:
0.8/(1+5^(x-3)) + 0.2 - You can adjust the constants 5 and 3 to control the slope of the curve. The 0.2 is where the floor will be.
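The sigmoid with a floor can be sketched in Python as below. The parameter names `floor`, `base` and `midpoint` are labels invented for the constants 0.2, 5 and 3 in the formula above, and the overflow guard is an addition for very old documents.

```python
import math

def sigmoid_with_floor(age, floor=0.2, base=5.0, midpoint=3.0):
    """Compute (1 - floor) / (1 + base**(age - midpoint)) + floor."""
    exponent = (age - midpoint) * math.log(base)
    if exponent > 700.0:
        # base**(age - midpoint) would overflow a float; the curve is at its floor
        return floor
    return (1.0 - floor) / (1.0 + math.exp(exponent)) + floor

for age in (0, 3, 6, 1000):
    print(age, sigmoid_with_floor(age))
```

With the default constants the value is halfway between 1 and the floor at age 3 and settles at 0.2 for old documents.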
I'd like to generate uniformly distributed random integers over a given range. The interpreted language I'm using has a builtin fast random number generator that returns a floating point number in the range 0 (inclusive) to 1 (inclusive). Unfortunately this means that I can't use the standard solution seen in another SO question (when the RNG returns numbers between 0 (inclusive) to 1 (exclusive) ) for generating uniformly distributed random integers in a given range:
result=Int((highest - lowest + 1) * RNG() + lowest)
The only sane method I can see at the moment is in the rare case that the random number generator returns 1 to just ask for a new number.
But if anyone knows a better method I'd be glad to hear it.
Rob
NB: Converting an existing random number generator to this language would result in something infeasibly slow so I'm afraid that's not a viable solution.
Edit: To link to the actual SO answer.
Presumably you are desperately interested in speed, or else you would just suck up the conditional test with every RNG call. Any other alternative is probably going to be slower than the branch anyway...
...unless you know exactly what the internal structure of the RNG is. Particularly, what are its return values? If they're not IEEE-754 floats or doubles, you have my sympathies. If they are, how many real bits of randomness are in them? You would expect 24 for floats and 53 for doubles (the number of mantissa bits). If those are naively generated, you may be able to use shifts and masks to hack together a plain old random integer generator out of them, and then use that in your function (depending on the size of your range, you may be able to use more shifts and masks to avoid any branching if you have such a generator). If you have a high-quality generator that produces full quality 24- or 53-bit random numbers, then with a single multiply you can convert them from [0,1] to [0,1): just multiply by the largest generatable floating-point number that is less than 1, and your range problem is gone. This trick will still work if the mantissas aren't fully populated with random bits, but you'll need to do a bit more work to find the right multiplier.
You may want to look at the C source to the Mersenne Twister to see their treatment of similar problems.
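The retry approach from the question can be sketched like this. Python's own `random.random()` already returns values in [0, 1), so the generator is passed in as a parameter `rng`, standing in for a hypothetical builtin that can also return exactly 1.

```python
import math

def random_int_inclusive(lowest, highest, rng):
    """Uniform integer in [lowest, highest] from an rng returning floats in [0, 1].

    rng is any zero-argument callable; a return value of exactly 1.0 is
    rejected and a new number is drawn (hypothetical helper).
    """
    span = highest - lowest + 1
    while True:
        r = rng()
        if r < 1.0:                       # the rare r == 1.0 case is simply retried
            return lowest + math.floor(span * r)

# Demonstration with a scripted generator that first returns 1.0, then 0.5.
scripted = iter([1.0, 0.5])
value = random_int_inclusive(1, 6, lambda: next(scripted))
print(value)
```

Because 1.0 is only one of very many possible outputs, the loop almost never runs more than once, so the cost is essentially the single comparison.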
I don't see why the + 1 is needed. If the random number generator delivers a uniform distribution of values in the [0,1] interval then...
result = lowest + (rng() * (highest - lowest))
should give you a uniform distribution of values between lowest
rng() == 0, result = lowest + 0 = lowest
and highest
rng() == 1, result = lowest + highest - lowest = highest
Including + 1 means that the upper bound on the generated number can be above highest
rng() == 1, result = lowest + highest - lowest + 1 = highest + 1.
The resulting distribution of values will be identical to the distribution of the random numbers, so uniformity depends on the quality of your random number generator.
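Following on from your comment below, you are right to point out that Int() will be the source of a lop-sided distribution at the tails. It is better to use Round() to the nearest integer, or whatever equivalent you have in your scripting language, although rounding still leaves lowest and highest each with only half the probability of the values in between.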