Is there a sorted atomicAdd or equivalent - cuda

I have a working detection and tracking process (pixel image in rows and columns) which does not give perfectly repeatable results because its use of atomicAdd means that data points can be accumulated in different orders leading to round off errors in the calculation of centroids and other track statistics.
In the main there are few clashes for the atomicAdd, so most results are identical. However for verification and validation I need to be able to make the atomicAdd add these clashing data points in a consistent order, such that say thread 3 will beat thread 10 when both want to use the atomicAdd to add a pixel on the row N that they are processing.
Is there a mechanism that allows the atomicAdd to be deterministic in its thread order, or have I missed something?

Check out "Fast Reproducible Atomic Summations" paper from Berkeley.
http://www.eecs.berkeley.edu/~hdnguyen/public/papers/ARITH21_Fast_Sum.pdf
But basically you could try something like finding a sum of abs values along with your original sum, multiply it by O(N^2) and then subtract and add it to/from your original sum (sum = (sum - sumAbs * N^2) + sumAbs * N^2) to cancel out the lowest bits (that are indeterministic). As you can see the upper bound grows proportional to N^2... so the lower the N (number of elements in the sum) the better is your error bound.
You could also try Kahan summation to reduce the error bound in conjunction with the above.

Related

Statistical method to know when enough performance test iterations have been performed

I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each of the xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence interval, you'd like to now whether n is large enough so that, with high probability the true mean is within the interval of your mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑i [xi] / n. Also calculate the normalized sample variance s2 = ∑i [(xi - μ)2] / (n - 1)
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± zα / 2(s / √(n)), where, if necessary, you can find here an explanation on the z and α.
If n < 30, the confidence interval is approximated as μ ± tα / 2(s / √(n)); see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values. For example last X values or so. According to this values, I would calculate rate of change (derivative of your throughput). If your derivative is close to zero, then your test is stable. I will stop test.
How to find X? I think instead of constant value, such as 10, choosing a value according to maximum number of test can be more suitable, for example:
X = max(10,max_test_count * 0.01)

CUDA Atomic operation on array in global memory

I have a CUDA program whose kernel basically does the following.
I provide a list of n points in cartesian coordinates e.g. (x_i,y_i) in a plane of dimension dim_x * dim_y. I invoke the kernel accordingly.
For every point on this plane (x_p,y_p) I calculate by a formula the time it would take for each of those n points to reach there; given those n points are moving with a certain velocity.
I order those times in increasing order t_0,t_1,...t_n where the precision of t_i is set to 1. i.e. If t'_i=2.3453 then I would only use t_i=2.3.
Assuming the times are generated from a normal distribution I simulate the 3 quickest times to find the percentage of time those 3 points reached earliest. Hence suppose prob_0 = 0.76,prob_1=0.20 and prob_2=0.04 by a random experiment. Since t_0 reaches first most amongst the three, I also return the original index (before sorting of times) of the point. Say idx_0 = 5 (An integer).
Hence for every point on this plane I get a pair (prob,idx).
Suppose n/2 of those points are of one kind and the rest are of other. A sample image generated looks as follows.
Especially when precision of the time was set to 1 I noticed that the number of unique 3 tuples of time (t_0,t_1,t_2) was just 2.5% of the total data points i.e. number of points on the plane. This meant that most of the times the kernel was uselessly simulating when it could just use the values from previous simulations. Hence I could use a dictionary having key as 3-tuple of times and value as index and prob. Since as far as I know and tested, STL can't be accessed inside a kernel, I constructed an array of floats of size 201000000. This choice was by experimentation since none of the top 3 times exceeded 20 seconds. Hence t_0 could take any value from {0.0,0.1,0.2,...,20.0} thus having 201 choices. I could construct a key for such a dictionary like the following
Key = t_o * 10^6 + t_1 * 10^3 + t_2
As far as the value is concerned I could make it as (prob+idx). Since idx is an integer and 0.0<=prob<=1.0, I could retrieve both of those values later by
prob=dict[key]-floor(dict[key])
idx = floor(dict[key])
So now my kernel looks like the following
__global__ my_kernel(float* points,float* dict,float *p,float *i,size_t w,...){
unsigned int col = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
//Calculate time taken for each of the points to reach a particular point on the plane
//Order the times in increasing order t_0,t_1,...,t_n
//Calculate Key = t_o * 10^6 + t_1 * 10^3 + t_2
if(dict[key]>0.0){
prob=dict[key]-floor(dict[key])
idx = floor(dict[key])
}
else{
//Simulate and find prob and idx
dict[key]=(prob+idx)
}
p[row*width+col]=prob;
i[row*width+col]=idx;
}
The result is quite similar to the original program for most points but for some it is wrong.
I am quite sure that this is due to race condition. Notice that dict was initialized with all zeroes. The basic idea would be to make the data structure "read many write once" in a particular location of the dict.
I am aware that there might be much more optimized ways of solving this problem rather than allocating so much memory. Please let me know in that case. But I would really like to understand why this particular solution is failing. In particular I would like to know how to use atomicAdd in this setting. I have failed to use it.
Unless your simulation in the else branch is very long (~100s of floating-point operations), a lookup table in global memory is likely to be slower than running the computation. Global memory access is very expensive!
In any case, there is no way to save time by "skipping work" using conditional branching. The Single Instruction, Multiple Thread architecture of a GPU means that the instructions for both sides of the branch will be executed serially, unless all of the threads in a block follow the same branch.
edit:
The fact that you are seeing a performance increase as a result of introducing the conditional branch and you didn't have any problems with deadlock suggests that all the threads in each block are always taking the same branch. I suspect that once dict starts getting populated, the performance increase will go away.
Perhaps I have misunderstood something, but if you want to calculate the probability of an event x, assuming a normal distribution and given the mean mu and standard deviation sigma, there is no need to generate a load of random numbers and approximate a Gaussian curve. You can directly calculate the probability:
p = exp(-((x - mu) * (x - mu) / (2.0f * sigma * sigma))) /
(sigma * sqrt(2.0f * M_PI));

How to convert a sparse histogram into dense histogram in CUDA?

I am implementing an algorithm using raw CUDA kernels, in which every threadblock needs the dense histogram of available data to that threadblock, now the question is that do I have to calculate the dense histogram from the scratch? (is it worth calculating the dense histogram at all, provided that i already have the sparse histogram which is implemented using shared memory)
I have come up with this idea of converting, I will try to elaborate my idea with example (temp and hist both are in shared memory)
0,1,2,3,4,5,6... //array indexes
4,3,0,2,1,0,5... //contents of hist[]
0,0,2,0,0,5,0... //contents of temp[] if(hist[x]>0)temp[x]=x;
for_every_element //this is sequential part :(
if(temp[x]>0)
shift elements from index x to 256
4,3,2,1,0,5... //pass 1 of the for loop
4,3,2,1,5... //pass 2 of the for loop
//this goes on until all the 0s are compacted
Now I know above is sequential in nature, but the shifting can be done with constant time (and in parallel) because threads_per_block is already set to 256, so shifting is not the main issue, the main issue is how to improve this (or any other suggestion is welcomed).
Edit: i am thinking of another idea, that is as follows
Assuming threads_per_block=256 if i can count which of histogram bins are non-zeros (this operation is parallel because each thread is assigned to each bin, i can atomicadd the values generated by each thread) let's say that i can then start a new shared index variable sindex=0 and each time a thread wants to store the value into d_hist[] it can take the latest value from sindex and store it's values to d_hist[sindex]=hist[treadIdx.x] after that i can atomicAdd the sindex
Now there is only one problem, there is going to be a race condition to getting the value of sindex, so i may have to setup a flag which can be locked or unlocked when a thread is adding any value to d_hist (but i think there can be a deadlock situation here)
Will this technique work? and is there any other technique better than that?
Converting a sparse histogram to a dense histogram is a scatter operation. If the sparse histogram is composed of s_index[S_N] and s_hist[S_N], then first we create a dense histogram d_hist[N] composed of all zeroes (you can do this from host code, perhaps). Then we populate the dense histogram with d_hist[s_index[i]] = s_hist[i]; This can be done in parallel and uses as many threads as there are valid indices in your sparse histogram (i < S_N). Assuming your histogram is sorted, you'll get whatever coalescing benefit may be possible based on the distribution of your sparse histogram indices.
It may not make sense for your case where each threadblock is doing a separate histogram, but you may also be interested in thrust scatter.
Well I guess the simplest method is to find out which bins>0 and after that, and exclusive scan can be done (in order to calculate the target indexes let's say sum_array[]) after that for allbins>0 move to d_hist[sum_array[threadIdx.x]-1]=s_hist[threadIdx.x]
0,1,2,3,4,5,6... //s_indexes[]
4,3,0,2,1,0,5... //contents of s_hist[]
1,1,0,1,1,0,1... //all bins which are > 0 = sum_array[]
1,2,2,3,4,4,5... //inclusive_scan of summ_array[]
//after the moving part
0,1,3,4,6... //s_indexes[]
4,3,2,1,5... //d_hist[]
0,1,2,3,4... //d_indexes[]
The reason why I am inclined to use this pattern is because it takes log_base_2(256) time in order to calculate the sum_array plus, other than that, moving and checking parts are just constant time operations, if anyone have different idea than this, please share.

Cumulative Distribution Function For a Set of Values

I have a histogram, where I count the number of occurrences that a function takes particular values in the range 0.8 and 2.2.
I would like to get the cumulative distribution function for the set of values. Is it correct to just count the total number of occurrences until each particular value.
For example, the cdf at 0.9 will be the sum of all the occurrences from 0.8 to 0.9?
Is it correct?
Thank you
The sum normalised by the number of entries will give you an estimate of the cdf, yes. It will be as accurate as the histogram is an accurate representation of the pdf. If you want to evaluate the cdf anywhere except the bin endpoints, it makes sense to include a fraction of the counts, so that if you have break points b_i and b_j, then to evaluate the cdf at some point b_i < p < b_j you should add the fraction of counts (p - b_i) / (b_j-b_i) from the relevant cell. Essentially this assumes uniform density within the cells.
You can get an estimate of the cdf from the underlying values, too (based on your question I'm not quite sure what you have access to, whether its bin counts in the histogram or the actual values). Beware that doing so will give your CDF discontinuities (steps) at each data point, so think about whether you have enough, and what you're using the CDF for, to determine whether this is appropriate.
As a final note of warning, beware that evaluating the cdf outside of the range of observed values will give you an estimated probability of zero or one (zero for x<0.8, one for x>2.2). You should consider whether the function is truly bounded to that interval, and if not, employ some smoothing to ensure small amounts of probability mass outside the range of observed values.

numerical issues causing the difference in outputs of two programs?

I have two codes that theoretically should return the exact same output. However, this does not happen. The issue is that the two codes handle very small numbers (doubles) to the order of 1e-100 or so. I suspect that there could be some numerical issues which are related to that, and lead to the two outputs being different even though they should be theoretically the same.
Does it indeed make sense that handling numbers on the order of 1e-100 cause such problems? I don't mind the difference in output, if I could safely assume that the source is numerical issues. Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
Thanks.
Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
The first reference that comes to mind is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It covers floating-point maths in general.
As far as numerical stability is concerned, the best references probably depend on the numerical algorithm in question. Two wide-ranging works that come to mind are:
Numerical Recipes by Press et al;
Matrix Computations by Golub and Van Loan.
It is not necessarily the small numbers that are causing the problem.
How do you check whether the outputs are the "exact same"?
I would check equality with tolerance. You may consider the floating point numbers x and y equal if either fabs(x-y) < 1.0e-6 or fabs(x-y) < fabs(x)*1.0e-6 holds.
Usually, there is a HUGE difference between the two algorithms if there are numerical issues. Often, a small change in the input may result in extreme changes in the output, if the algorithm suffers from numerical issues.
What makes you think that there are "numerical issues"?
If possible, change your algorithm to use Kahan Summation (aka compensated summation). From Wikipedia:
function KahanSum(input)
var sum = 0.0
var c = 0.0 //A running compensation for lost low-order bits.
for i = 1 to input.length do
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
//Next time around, the lost low part will be added to y in a fresh attempt.
return sum
This works by keeping a second running total of the cumulative error, similar to the Bresenham line drawing algorithm. The end result is that you get precision that is nearly double the data type's advertised precision.
Another technique I use is to sort my numbers from small to large (by manitude, ignoring sign) and add or subtract the small numbers first, then the larger ones. This has the virtue that if you add and subtract the same value multiple times, such numbers may cancel exactly and can be removed from the list.