What algorithm should be used to filter higher-dimensional data points? - mysql

I have 4-dimensional data points stored in a MySQL database on my server: one time dimension plus three spatial GPS dimensions (lat, lon, alt). The GPS data are sampled at a 1-minute interval for thousands of users and are being added to my server 24x7.
Sample REST/post json looks like,
{
  "id": "1005",
  "location": {
    "lat": -87.8788,
    "lon": 37.909090,
    "alt": 0.0
  },
  "datetime": 11882784
}
Now, I need to filter out all the candidates (userIDs) whose positions were within k meters of a given user's positions during a given time period.
Sample REST/get query params for filtering looks like,
{
  "id": "1001",        // user for whom we need to find candidate IDs
  "maxDistance": 3,    // max distance in meters (Euclidean distance from the user's location to a candidate's location)
  "maxDuration": 14    // offset (in days) back from the current datetime to consider
}
As you can see, thousands of entries are inserted into my database per minute, which results in a huge total number of entries. I am therefore afraid that a trivial naive approach, iterating over all entries for filtering, won't be feasible for my current requirement. So, what algorithm should I implement on the server? I have tried a naive algorithm like this:
params ($uid, $mDis, $mDay)
1. Init $candidates = []
2. For each location $Li of the user with ID $uid
3.     For each location $Di in the database within the last $mDay days
4.         $dif = EuclideanDist($Li, $Di)
5.         If $dif < $mDis
6.             Add the userID of $Di to $candidates
7. Return $candidates
However, this approach is very slow in practice, and pre-calculation might not be feasible because it would cost a huge amount of space across all userIDs. What other algorithm could improve efficiency?

You could implement a spatial hashing algorithm to efficiently query your database for candidates within a given area/time.
Divide the 3D space into a 3D grid of cubes with width k, and when inserting a data-point into your database, compute which cube the point lies in and compute a hash value based on the cube co-ordinates.
When querying for all data-points within k of another data-point d, compute the cube that d sits in, and find its adjacent cubes (+/- 1 in each dimension, i.e. 26 neighbours). Compute the hash values of these 27 cubes and query your database for all entries with these hash values within the given time period. This gives you a small candidate set over which you can then iterate to find all data-points within k of d.
If your value of k can range between 2-5 meters, give your cubes a width of 5.
Timestamps can be stored as a separate field, or alternatively you can make your cubes 4-dimensional and include the timestamp in the hash, and search 81 cubes instead of 27.
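To make the idea concrete, here is a minimal Python sketch of the cell hashing, assuming the coordinates have already been converted to a local metric frame (metres); the function names and the choice of key format are illustrative, not prescribed by the question:

```python
import math

CELL_SIZE = 5.0  # metres; at least as large as the biggest maxDistance you expect

def cell_of(x, y, z, cell_size=CELL_SIZE):
    # Integer grid-cell coordinates of a point given in metres.
    return (math.floor(x / cell_size),
            math.floor(y / cell_size),
            math.floor(z / cell_size))

def cell_key(cx, cy, cz):
    # Single value you can index in MySQL; storing the three integers in
    # separate indexed columns works just as well.
    return f"{cx}:{cy}:{cz}"

def candidate_keys(x, y, z):
    # Keys of the point's own cell plus its 26 neighbours (27 in total);
    # query WHERE cell_key IN (...) AND datetime BETWEEN ..., then do the
    # exact Euclidean-distance check on the small result set.
    cx, cy, cz = cell_of(x, y, z)
    return [cell_key(cx + dx, cy + dy, cz + dz)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for dz in (-1, 0, 1)]
```

On insert you would compute cell_key(...) once and store it (indexed) alongside the row, so the query only touches the rows whose keys are in the candidate list.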

Related

how do I load length frequency histogram data into mixtools?

I want to use mixtools to separate 1, 2 and 3+ year old cohorts in shellfish length-frequency data. I am totally new to R coding. The example uses the Old Faithful geyser data, but that is merely a list of 272 data points. I have various tables of lengths (size-class midpoints) and frequencies, generally about 15 length classes with counts between 0 and 50 in each. I can create a data frame from my MS Excel table but am not sure how to call normalmixEM() on it. Thanks.

number ranges redistribution based on query results

I need to find a way to redistribute the number ranges when parallel-exporting from MySQL:
Example output (SQL queries results):
What is the best way to redistribute the number ranges after getting the initial results, so that the results will be more evenly distributed?
(estimated) desired output:
It seems that originally you believed your data to be uniformly distributed, and now you have a view of the number of entries in each evenly spaced bin. You can now update your belief about the distribution of your data: within every bin the data is uniformly distributed, but bins with a higher count have a larger concentration.
Your updated distribution says that the estimated number of results below a value q equals the sum of the counts of all buckets whose upper bound is below q, plus
(q - min(q)) / (max(q) - min(q)) * size(q)
where min(q) and max(q) are the lower and upper bounds of the bucket that q belongs to, and size(q) is the number of results in that bucket. This is a piecewise-linear function whose slope at bucket i is that bucket's size relative to the total. Now divide by the total number of results to get a probability distribution. To find the places where you should query, find the ten values of x where this piecewise function equals .1, .2, .3, ..., 1.0. This is a lot easier than inverting an arbitrary function if you exploit the piecewise-linear property. For example, to find the x associated with .2, first find the bucket i whose cumulative fraction straddles .2, i.e. (sum of counts up to lower_bnd_bucket(i))/total <= .2 <= (sum of counts up to upper_bnd_bucket(i))/total. Then min(.2) = lower_bnd_bucket(i), max(.2) = upper_bnd_bucket(i), and size(.2) is that bucket's count.
Then you just have to invert the linear function associated with that bucket:
x = (.2 - sum_of_buckets_below_i/total) * (max(.2) - min(.2)) / (size(.2)/total) + min(.2)
Note that size(.2) must not be 0 since you are dividing by it, which makes sense (if a bucket is empty, you should skip it). Finally, if you want the places you query to be integers, you can round the values using your preferred criterion. What we are doing here is a Bayesian update of our belief about where the 10 deciles lie, based on observations of the distribution at the 10 places you already queried. You can refine this further once you see the results of this query, and you will eventually reach convergence.
For the example in your table, to find the updated upper limit of bucket 1, you check that
2569/264118 < 0.1 (the first ten percent),
then you check that
(2569 + 14023)/264118 < 0.1,
and finally you check that (2569 + 14023 + 123762)/264118 > 0.1,
so your new estimate for the first decile should be between 1014640 and 1141470.
Your new estimate for the upper threshold of the first bucket is
decile_1 = (.1 - (2569 + 14023)/264118) * (1141470 - 1014640) / (123762/264118) + 1014640 = 1024703
Similarly, your estimate for the upper bound of the second bucket is
(.2 - (2569 + 14023)/264118) * (1141470 - 1014640) / (123762/264118) + 1014640 = 1051770. Note that this linear interpolation only works until the update for the upper limit of bucket 6, since (2569 + 14023 + 123762)/264118 < .6, so from bucket 6 onwards you will need to use the limits of the later original buckets.
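For concreteness, here is a small Python sketch of that inversion, reusing only the numbers already quoted above (the function and argument names are made up for illustration):

```python
def invert_bucket(q, cum_below, lo, hi, size, total):
    # Invert the piecewise-linear CDF inside the bucket [lo, hi]:
    # q is the target cumulative fraction (0.1 for the first decile),
    # cum_below is the count of all results in the buckets strictly below lo,
    # size is the count inside this bucket, total the overall number of results.
    return (q - cum_below / total) * (hi - lo) / (size / total) + lo

total = 264118
cum_below = 2569 + 14023            # results in the buckets below 1014640
lo, hi, size = 1014640, 1141470, 123762

print(round(invert_bucket(0.1, cum_below, lo, hi, size, total)))  # ~1024703
print(round(invert_bucket(0.2, cum_below, lo, hi, size, total)))  # ~1051770
```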

Cumulative Frequency Tables and Chart Output

I'm working with some rather large time-series data sets related to futures prices and am in the process of converting some calculations which I previously did in Excel to R. This conversion has been relatively straightforward thus far, but I'm having a bit of trouble replicating my histograms with their cumulative frequency distributions in R as I had them in Excel. If you're familiar with Excel, the Histogram function in the Data Analysis ToolPak automatically creates a Cumulative Frequency Distribution table with the cumulative percentages of each, in this case, Price Level, next to the histogram.
I've had some success creating some basic histograms using ggplot; here is a snippet of that code:
library(ggplot2)

# Histogram of 1-month X7 prices between the low (X7_F_M_L) and high (X7_F_M_H) bounds
ggplot(data = CrudeRaw, aes(x = X7_1_F)) +
  geom_histogram(breaks = seq(X7_F_M_L, X7_F_M_H, by = 0.01),
                 col = "blue",
                 fill = "white",
                 alpha = 0.2) +
  labs(title = "X7 1 Month Price Distribution",
       x = "Price Levels",
       y = "Frequency") +
  xlim(c(X7_F_M_L, X7_F_M_H)) +
  ylim(c(0, 100))
Several questions regarding formatting and usage.
a) CrudeRaw is a data frame which contains roughly 276 rows and no fewer than 50 columns. For the purposes of this project I've chopped the data into 20-period, 60-period, 120-period, 180-period, and 240-period subsets. The data is in chronological order by date.
Question(s): ggplot cannot take numeric data types, only data frames, so I can only feed it the entire df even though I am interested in creating distributions for the aforementioned subsets. Is there a way that I can still do this?
b) How do I get every bin (price) to show up on the x-axis rather than a number marking every 5 bins (-15, -10, -5, 0, 5 ..., 15)?
c) I've successfully created a cumulative frequency table using the following code:
round(cbind(cumsum(table(X7_F)))/NROW(X7_F),2)
But I'd like a way either to output each of these tables (of which there are many) to a CSV file or, ideally, to create a "report" of sorts with R which can be saved to a PDF, or perhaps even embedded within the histogram the table/data is associated with.
d) I've done some searching on how to output data to a CSV file, but it wasn't clear from the examples I went over how I could output multiple arrays to the same sheet or workbook en masse. That is, I would like to output my 20-, 60-, 120-, 180-, and 240-period arrays of prices to the same workbook. I'm thinking that by creating another data frame I could then pass these subsets of the data to the ggplot function, as I mentioned I was having trouble doing in part a).
e) Lastly (for now) how do I overlay the CFD onto my histograms?
Please advise if you require any additional information or colour in order to help me and many thanks in advance for your responses!

Will an MD5 hash keep changing as its input grows?

Does the value returned by MySQL's MD5 hash function continue to change indefinitely as the string given to it grows indefinitely?
E.g., will these continue to return different values:
MD5("A"+"B"+"C")
MD5("A"+"B"+"C"+"D")
MD5("A"+"B"+"C"+"D"+"E")
MD5("A"+"B"+"C"+"D"+"E"+"D")
... and so on until a very long list of values ....
At some point, when we are giving the function very long input strings, will the results stop changing, as if the input were being truncated?
I'm asking because I want to use the MD5 function to compare two records with a large set of fields by storing the MD5 hash of these fields.
======== MADE-UP EXAMPLE (YOU DON'T NEED THIS TO ANSWER THE QUESTION, BUT IT MIGHT INTEREST YOU) ========
I have a database application that periodically grabs data from an external source and uses it to update a MySQL table.
Let's imagine that in month #1, I do my first download:
downloaded data, where the first field is an ID, a key:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
I store this
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
Month #2, I get
1,"A","B","C"
2,"A","D","X"
3,"B","D","E"
4,"B","F","E"
Notice that the record with ID 2 has changed. Record with ID 4 is new. So I store two new records:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
2,"A","D","X"
4,"B","F","E"
This way I have a history of *changes* to the data.
I don't want to have to compare each field of the incoming data with each field of each of the stored records.
E.g., if I'm comparing incoming record x with existing record a, I don't want to have to say:
Add record x to the stored data if there is no record a such that x.ID == a.ID AND x.F1 == a.F1 AND x.F2 == a.F2 AND x.F3 == a.F3 [4 comparisons]
What I want to do is to compute an MD5 hash and store it:
1,"A","B","C",MD5("A"+"B"+"C")
Let's suppose that it is month #3, and I get a record:
1,"A","G","C"
What I want to do is compute the MD5 hash of the new fields: MD5("A"+"G"+"C") and compare the resulting hash with the hashes in the stored data.
If it doesn't match, then I add it as a new record.
I.e., Add record x to the stored data if there is no record a such that x.ID == a.ID AND MD5(x.F1 + x.F2 + x.F3) == a.stored_MD5_value [2 comparisons]
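For illustration, here is a minimal Python sketch of that comparison (hashlib.md5 produces the same digest MySQL's MD5() would for identical text; the helper name and the sample rows are made up):

```python
import hashlib

def row_hash(*fields):
    # MD5 of the concatenated field values, as the same 32-character hex
    # string MySQL's MD5(CONCAT(...)) returns for identical text.
    return hashlib.md5("".join(fields).encode("utf-8")).hexdigest()

# (id, hash) pairs already stored in the history table
stored = {("1", row_hash("A", "B", "C")),
          ("2", row_hash("A", "D", "E")),
          ("3", row_hash("B", "D", "E"))}

incoming = ("1", "A", "G", "C")                    # month #3 download
key = (incoming[0], row_hash(*incoming[1:]))
if key not in stored:
    print("fields changed, store a new record:", incoming)
```

One practical note: because the fields are simply concatenated, ("AB", "C") and ("A", "BC") would produce the same hash, so a delimiter that cannot occur in the data is often inserted between fields before hashing.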
My question is "Can I compare the MD5 hash of, say, 50 fields without increasing the likelihood of clashes?"
Yes, practically, it should keep changing. Due to the pigeonhole principle, if you continue doing that long enough, you will eventually get a collision, but it is extremely unlikely that you'll ever reach that point in practice.
The security of the MD5 hash function is severely compromised. A collision attack exists that can find collisions within seconds on a computer with a 2.6 GHz Pentium 4 processor (complexity of 2^24).
Further, there is also a chosen-prefix collision attack that can produce a collision for two chosen, arbitrarily different inputs within hours, using off-the-shelf computing hardware (complexity 2^39).
The ability to find collisions has been greatly aided by the use of off-the-shelf GPUs. On an NVIDIA GeForce 8400GS graphics processor, 16-18 million hashes per second can be computed. An NVIDIA GeForce 8800 Ultra can calculate more than 200 million hashes per second.
These hash and collision attacks have been demonstrated in public in various situations, including colliding document files and digital certificates.
See http://www.win.tue.nl/hashclash/On%20Collisions%20for%20MD5%20-%20M.M.J.%20Stevens.pdf
A number of projects have published MD5 rainbow tables online, that can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking.

Using google maps API to find average speed at a location

I am trying to get the current traffic conditions at a particular location. The GTrafficOverlay object mentioned here only provides an overlay on an existing map.
Does anyone know how I can get this data from Google using their API?
It is only theoretical, but there is perhaps a way to extract that data using the Distance Matrix API.
Method
1)
Make a topological road network, with nodes and edges, something like this:
Each edge will have four attributes: [EDGE_NUMBER; EDGE_SPEED; EDGE_TIME; EDGE_LENGTH]
You can use OpenStreetMap data to create this network.
At the beginning each edge will have the same road speed, for example 50 km/h.
You need to use only the drivable links and delete the other edges. Also take into account that some roads are one-way.
2)
Randomly choose two nodes that are not closer than 5 or 10 km.
Use Dijkstra's shortest-path algorithm to calculate the shortest path between these two nodes (with cost = EDGE_TIME). Use your topological network to do that. The output will look like:
NODE = [NODE_23,NODE_44] PATH = [EDGE_3,EDGE_130,EDGE_49,EDGE_39]
Calculate the time needed to drive between the two nodes with the Distance Matrix API.
Preallocate a matrix A of size N x number_of_edges, filled with zeros, where N is the number of node pairs you will sample.
Preallocate a vector B of length N, filled with zeros (one travel time per sampled pair).
In the first row of matrix A, fill each column (one per edge) with the length of that edge if it is on the path, and leave 0 otherwise.
[col_1,col_2,col_3,...,col_39,...,col_49,...,col_130]
[0, 0, len_3,...,len_39,...,len_49,...,len_130] %row 1
In the first entry of vector B, put the time calculated with the Distance Matrix API.
Then select two new nodes that were not used in the first path and repeat the operation until there are no nodes left (so you will fill row 2, row 3, ...).
Now you can solve the linear system Ax = B, where each entry of x is 1/speed for the corresponding edge (a small sketch of this step is shown after point 3 below).
Assign the newly calculated speeds to the edges.
3)
Iterate point 2) until your calculated speeds start to converge.
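As a sketch of the Ax = B step in Python/NumPy (the numbers in A and B are purely illustrative, and a least-squares solve is used because A is generally not square):

```python
import numpy as np

# A[i, e] = length (km) of edge e if it lies on sampled path i, else 0
# B[i]    = travel time (hours) returned by the Distance Matrix API for path i
A = np.array([
    [0.0, 0.0, 1.2, 0.0, 3.4],   # path 1 uses edges 3 and 5 (made-up lengths)
    [2.1, 0.0, 0.0, 0.8, 0.0],   # path 2 uses edges 1 and 4
])
B = np.array([0.09, 0.05])

# Least-squares solve of A x = B, where x[e] = 1 / speed of edge e
x, *_ = np.linalg.lstsq(A, B, rcond=None)

# Convert back to km/h; edges that appear on no path stay undetermined (NaN)
speeds = np.divide(1.0, x, out=np.full_like(x, np.nan), where=x > 1e-12)
print(speeds)
```

In practice you would keep sampling node pairs until A has at least as many independent rows as there are edges, and you might also constrain x to be positive (e.g. with a non-negative least-squares solver).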
Comment
I'm not sure that the calculated speeds will converge; it would be interesting to test the method. I will try to do that if I get some time.
The Distance Matrix API doesn't provide travel times more precise than 1 minute, which is why the distance between each pair of nodes needs to be at least 5 or 10 km or more.
Also, this method does not respect Google's terms of service.
Google does not make available public API for this data.
Yahoo has a feed (example) with traffic conditions -- construction, accidents, and such. A write-up on how to access it is here.
If you want actual road speeds, you will probably need to work with a commercial provider.