number ranges redistribution based on query results - mysql

I need to find a way to redistribute the number ranges when parallel-exporting from MySQL.
Example output (SQL query results):
What is the best way to redistribute the number ranges after getting the initial results, so that the results will be more evenly distributed?
(estimated) desired output:

It seems that you originally believed your data to be uniformly distributed, and now you have a view of the number of entries in each evenly spaced bin. You can now update your belief about the distribution of your data: within every bin the data is uniformly distributed, but bins with a higher count have a larger concentration.
Your updated distribution says that the cumulative number of results below a value q is equal to the sum of the counts of all the buckets whose upper bound is below q, plus
(q - min(q)) / (max(q) - min(q)) * size(q)
where min(q) and max(q) are the lower and upper bounds of the bucket that q belongs to, and size(q) is the number of results in that bucket. This is a piecewise linear function whose slope in bucket i is that bucket's size relative to the total. Now divide by the total number of results to get a probability distribution. To get the places where you should query, find the ten values of x where this piecewise function equals .1, .2, .3, ..., 1.0. Exploiting the piecewise linear property makes this a lot easier than inverting an arbitrary function. For example, to find the x associated with .2, first find the bucket i whose cumulative fraction straddles .2, i.e.
sum_of_buckets_below_i / total <= .2 <= (sum_of_buckets_below_i + size_of_bucket_i) / total
This gives you min(.2), max(.2) and size(.2) (the bounds and count of bucket i).
Then you just have to invert the linear function associated with that bucket:
x = (.2 - sum_of_buckets_below_i / total) * (max(.2) - min(.2)) / (size(.2) / total) + min(.2)
Note that size(q) should not be 0, since you are dividing by it, which makes sense (if a bucket is empty, you should skip it). Finally, if you want the places you are querying to be integers, you can round the values using your preferred criterion. What we are doing here is using Bayes to update our belief about where the 10 deciles will be, based on the distribution observed at the 10 places you already queried. You can refine this further once you see the result of this query, and you will eventually reach convergence.
For the example in your table, to find the updated upper limit of bucket 1, you check that
2569/264118 < 0.1 (first ten percent),
then you check that
(2569+14023)/264118 < 0.1,
and finally you check that
(2569+14023+123762)/264118 > 0.1,
so your new estimate for the decile should be between 1014640 and 1141470.
Your new estimate for the upper threshold of the first bucket is
decile_1 = (.1 - (2569+14023)/264118) * (1141470-1014640) / (123762/264118) + 1014640 = 1024703
Similarly, your estimate for the upper bound of the second bucket is
decile_2 = (.2 - (2569+14023)/264118) * (1141470-1014640) / (123762/264118) + 1014640 = 1051770
Note that this linear interpolation will work until the update for the upper limit of bucket 6, since (2569+14023+123762)/264118 < .6, and you will then need to use the limits of the old bucket ten when updating buckets 6 and higher.


What algorithm should be used to filter out candidates from higher-dimensional data points?

I have 4-dimensional data points stored in my MySQL database on a server: one time dimension plus three spatial GPS dimensions (lat, lon, alt). GPS data are sampled at a 1-minute interval for thousands of users and are being added to my server 24x7.
A sample REST/POST JSON looks like:
{
    "id": "1005",
    "location": {
        "lat": -87.8788,
        "lon": 37.909090,
        "alt": 0.0
    },
    "datetime": 11882784
}
Now, I need to filter out all the candidates (userIDs) whose positions were within k meters of a given userID's positions during a given time period.
A sample REST/GET query for filtering has params like:
{
    "id": "1001",       // user for whom we need to find candidate IDs
    "maxDistance": 3,   // max distance in meters (Euclidean distance from the user's location to a candidate's location)
    "maxDuration": 14   // offset (in days) back from the current datetime to consider
}
As you can see, thousands of entries are inserted into my database per minute, which results in a huge number of total entries. So I am afraid the trivial naive approach of iterating over all entries for filtering won't be feasible for my current requirement. What algorithm should I implement on the server? I have tried a naive algorithm like:
params ($uid, $mDis, $mDay)
1. Init $candidates = []
2. For all the locations $Li of user with $uid
3.     For all locations $Di in database within $mDay
4.         $dif = EuclidianDis($Li, $Di)
5.         If $dif < $mDis
6.             $candidates += userId for $Di
7. Return $candidates
However, this approach is very slow in practice, and precomputation might not be feasible since it would cost huge space across all userIDs. What other algorithm could improve efficiency?
You could implement a spatial hashing algorithm to efficiently query your database for candidates within a given area/time.
Divide the 3D space into a 3D grid of cubes with width k, and when inserting a data-point into your database, compute which cube the point lies in and compute a hash value based on the cube co-ordinates.
When querying for all data points within k of another data point d, compute the cube that d sits in, and find the 26 adjacent cubes (+/- 1 in each dimension). Compute the hash values of those 27 cubes and query your database for all entries with these hash values within the given time period. You will have a small candidate set which you can then iterate over to find all data points within k of d.
If your value of k can range between 2 and 5 meters, give your cubes a width of 5.
Timestamps can be stored as a separate field, or alternatively you can make your cells 4-dimensional and include the timestamp in the hash, in which case you search 81 cells instead of 27.
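A rough Python sketch of the insert-time hashing and query-time neighbour lookup described above. It assumes positions have already been converted to a metric coordinate system (metres); the hash constants are the commonly used spatial-hashing primes, and all names are illustrative:

import itertools

CELL = 5.0                                # cube width in metres; pick >= the largest k you expect

def cell_of(x, y, z, width=CELL):
    # map a point (already in metres) to integer cube coordinates
    return (int(x // width), int(y // width), int(z // width))

def cell_hash(cell):
    # stable hash of a cube; store it in an indexed column at insert time
    cx, cy, cz = cell
    return (cx * 73856093) ^ (cy * 19349663) ^ (cz * 83492791)

def neighbour_hashes(x, y, z, width=CELL):
    # hashes of the point's own cube plus the 26 adjacent cubes
    cx, cy, cz = cell_of(x, y, z, width)
    return [cell_hash((cx + dx, cy + dy, cz + dz))
            for dx, dy, dz in itertools.product((-1, 0, 1), repeat=3)]

# query idea: SELECT ... WHERE cell_hash IN (<neighbour hashes>)
#             AND datetime BETWEEN <start> AND <end>
# then apply the exact distance check only to the small candidate set.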

Cumulative Distribution Function For a Set of Values

I have a histogram where I count the number of occurrences of particular values of a function in the range 0.8 to 2.2.
I would like to get the cumulative distribution function for the set of values. Is it correct to just count the total number of occurrences up to each particular value?
For example, the cdf at 0.9 would be the sum of all the occurrences from 0.8 to 0.9?
Is that correct?
Thank you
Yes, the sum normalised by the total number of entries will give you an estimate of the cdf. It will be as accurate as the histogram is an accurate representation of the pdf. If you want to evaluate the cdf anywhere other than the bin endpoints, it makes sense to include a fraction of the counts, so that if you have adjacent break points b_i and b_j, then to evaluate the cdf at some point b_i < p < b_j you add the fraction (p - b_i) / (b_j - b_i) of the counts from the relevant cell. Essentially this assumes uniform density within the cells.
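A small Python sketch of that fractional-count estimate (the function name is mine; `breaks` are the bin edges and `counts` the per-bin occurrence counts):

import bisect

def cdf_from_histogram(breaks, counts, p):
    # breaks: bin edges (ascending), len(breaks) == len(counts) + 1
    # counts: occurrences per bin; p: point at which to evaluate the cdf
    total = float(sum(counts))
    if p <= breaks[0]:
        return 0.0
    if p >= breaks[-1]:
        return 1.0
    i = bisect.bisect_right(breaks, p) - 1       # bin containing p
    below = sum(counts[:i])                      # bins entirely to the left of p
    frac = (p - breaks[i]) / (breaks[i + 1] - breaks[i])
    return (below + frac * counts[i]) / total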
You can get an estimate of the cdf from the underlying values, too (based on your question I'm not quite sure what you have access to, whether it's the bin counts of the histogram or the actual values). Beware that doing so will give your cdf discontinuities (steps) at each data point, so think about whether you have enough data, and what you're using the cdf for, to determine whether this is appropriate.
As a final note of warning, beware that evaluating the cdf outside the range of observed values will give you an estimated probability of zero or one (zero for x < 0.8, one for x > 2.2). You should consider whether the function is truly bounded to that interval, and if not, employ some smoothing to ensure small amounts of probability mass outside the range of observed values.

Is there a sorted atomicAdd or equivalent

I have a working detection and tracking process (pixel images in rows and columns) which does not give perfectly repeatable results, because its use of atomicAdd means that data points can be accumulated in different orders, leading to round-off errors in the calculation of centroids and other track statistics.
In the main there are few clashes for the atomicAdd, so most results are identical. However, for verification and validation I need to be able to make the atomicAdd add these clashing data points in a consistent order, such that, say, thread 3 will beat thread 10 when both want to use the atomicAdd to add a pixel on the row N that they are processing.
Is there a mechanism that allows the atomicAdd to be deterministic in its thread order, or have I missed something?
Check out "Fast Reproducible Atomic Summations" paper from Berkeley.
http://www.eecs.berkeley.edu/~hdnguyen/public/papers/ARITH21_Fast_Sum.pdf
But basically you could try something like this: compute a sum of absolute values alongside your original sum, multiply it by N^2, and then subtract and add it to/from your original sum (sum = (sum - sumAbs * N^2) + sumAbs * N^2) to cancel out the lowest-order bits (the ones that are nondeterministic). As you can see, the error bound grows proportionally to N^2, so the smaller N (the number of elements in the sum), the better your error bound.
You could also use Kahan summation in conjunction with the above to reduce the error bound.
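For illustration, here is a plain, serial Python version of compensated (Kahan) summation; it is not CUDA code, just the idea of carrying a correction term so the low-order bits are not silently dropped:

def kahan_sum(values):
    # compensated (Kahan) summation: carry a correction term so the
    # low-order bits lost by each addition are fed back into the total
    total = 0.0
    c = 0.0
    for v in values:
        y = v - c
        t = total + y        # low-order bits of y are lost here ...
        c = (t - total) - y  # ... and recovered into c for the next step
        total = t
    return total

# e.g. kahan_sum([0.1] * 10000) stays much closer to 1000
# than the plain running sum does.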

Function to dampen a value

I have a list of documents, each having a relevance score for a search query. I need older documents to have their relevance score dampened, to introduce their date into the ranking process. I already tried fiddling with functions such as 1/(1+date_difference), but the reciprocal function discriminates too strongly between recent dates that are close together.
I was thinking of a mathematical function with range (0..1) and domain (0..x) to scale their score, where the x-axis is the age of a document. It's easiest to explain what I further need from the function with an image:
Decaying behaviour is often modelled well by an exponential function (many decay processes in nature also follow it). You would use two positive parameters A and B and get
y(x) = A exp(-B x)
Since you want a y-range of [0,1], set A = 1. Smaller values of B give slower decay.
If a simple 1/(1+x) decreases too quickly too soon, a sigmoid function like 1/(1+e^-x) or the error function might be better suited to your purpose. Offset the input so that the current date sits on the flat, near-1 part of the curve and older documents slide down the slope; that way you get a value that stays current for some configurable time and then decreases towards a base value.
log((x+1) - age_of_document)
where the base of the logarithm is (x+1). Note that x here is as per your diagram, i.e. the "threshold": if the age of the document is greater than x, the score goes negative. Multiply by the maximum possible score to introduce scaling.
E.g. for a domain of (0,10) with a maximum score of 10: 10 * log(11 - age_of_document) / log(11)
A bit late, but as thiton says, you might want to use a sigmoid function instead, since it has a "floor" value for your long-tail data points. E.g.:
0.8/(1 + 5^(x-3)) + 0.2 - you can adjust the constants 5 and 3 to control the steepness and midpoint of the curve, respectively. The 0.2 is where the floor will be.
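A small Python sketch of both suggestions (parameter names and default values are just examples to adjust):

import math

def exponential_decay(age, b=0.05):
    # simple exponential dampening: 1.0 for a brand-new document,
    # smaller b -> slower decay towards 0
    return math.exp(-b * age)

def sigmoid_floor(age, steepness=5.0, midpoint=3.0, floor=0.2):
    # sigmoid with a floor, matching the 0.8/(1 + 5^(x-3)) + 0.2 example:
    # stays near 1.0 for recent documents, drops around `midpoint`,
    # and levels off at `floor` for very old documents
    return (1.0 - floor) / (1.0 + steepness ** (age - midpoint)) + floor

# e.g. weight the raw relevance by the dampening factor:
# score = relevance * sigmoid_floor(age_in_days)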

Using google maps API to find average speed at a location

I am trying to get the current traffic conditions at a particular location. The GTrafficOverlay object mentioned here only provides an overlay on an existing map.
Does anyone know how I can get this data from Google using their API?
It is only theoretical, but there is perhaps a way to extract this data using the Distance Matrix API.
Method
1)
Build a topological road network, with nodes and edges, something like this:
Each edge will have four attributes: [EDGE_NUMBER, EDGE_SPEED, EDGE_TIME, EDGE_LENGTH]
You can use OpenStreetMap data to create this network.
At the beginning, each edge will have the same road speed, for example 50 km/h.
You need to use only the drivable links and delete the other edges. Also take into account that some roads are one-way.
2)
Randomly choose two nodes that are not closer than 5 or 10 km.
Use Dijkstra's shortest path algorithm to calculate the shortest path between these two nodes (with cost = EDGE_TIME). Use your topological network to do that. The output will look like:
NODE = [NODE_23,NODE_44] PATH = [EDGE_3,EDGE_130,EDGE_49,EDGE_39]
Calculate the time needed to drive between the two nodes with the Distance Matrix API.
Preallocate a matrix A of size N x number_of_edges filled with zeros (N = number of node pairs you will sample).
Preallocate a column vector B of size N x 1 filled with zeros.
In the first row of matrix A, fill each column (corresponding to an edge) with the length of that edge if the edge is in the path:
[col_1, col_2, col_3, ..., col_39, ..., col_49, ..., col_130]
[0,     0,     len_3, ..., len_39, ..., len_49, ..., len_130]   % row 1
In the first element of B, put the time calculated with the Distance Matrix API.
Then select two new nodes that were not used in the first path and repeat the operation until there are no nodes left (so you will fill row 2, row 3, ...).
Now you can solve the linear system Ax = B, where speed = 1/x (see the sketch after step 3).
Assign the newly calculated speed to each edge.
3)
Repeat step 2) until your calculated speeds start to converge.
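A rough numpy sketch of the estimation in step 2), assuming you have already extracted the per-path edge lengths from your Dijkstra output (all names are illustrative). Solving in the least-squares sense copes with N differing from the number of edges; whether the estimates converge is the open question raised below:

import numpy as np

def estimate_edge_speeds(path_edges, path_times, n_edges):
    # path_edges: one dict per sampled path, {edge_index: edge_length_km}
    # path_times: measured travel time per path, in hours (Distance Matrix API)
    # solves A x = B in the least-squares sense, where x = 1/speed per edge
    A = np.zeros((len(path_edges), n_edges))
    for row, edges in enumerate(path_edges):
        for e, length in edges.items():
            A[row, e] = length
    B = np.asarray(path_times, dtype=float)
    x, *_ = np.linalg.lstsq(A, B, rcond=None)
    speeds = np.full(n_edges, np.nan)
    covered = x > 0                       # edges never driven (or with non-positive estimates) stay NaN
    speeds[covered] = 1.0 / x[covered]
    return speeds                         # km/h per edge; feed back into EDGE_SPEED and EDGE_TIME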
Comment
I'm not sure that the calculated speeds will converge; it would be interesting to test the method. I will try to do that if I get some time.
The Distance Matrix API doesn't provide a travel time more precise than 1 minute, which is why the distance between each pair of nodes needs to be at least 5-10 km or more.
Also, this method does not respect Google's terms of service.
Google does not make a public API available for this data.
Yahoo has a feed (example) with traffic conditions -- construction, accidents, and such. A write-up on how to access it is here.
If you want actual road speeds, you will probably need to work with a commercial provider.