Finding Closest coordinates in sql database - mysql

I'm looking for a quick way (with a relatively short query) to find the rows whose x, y coordinates are closest to a given point, and return the twenty closest. The database looks like this:
--------------------
| Stuff | x   | y  |
--------------------
| bob   | -21 | 32 |
| Joe   | 23  | 29 |
--------------------
So, a search would be like x=19, y=32, and I would like the twenty closest results, sorted by closeness. X-Y is a completely square grid.
Any help would be appreciated. Thanks!
If this helps, I'm using MariaDB Version 10.3.27 on raspbian/debian

One way of measuring "closest" is to use the Manhattan distance, the sum of the absolute values of the differences:
select t.*
from t
order by abs(x - @x) + abs(y - @y)
limit 20;
Here @x and @y hold the coordinates of the search point.
If you have a different distance metric, you would just plug that formula in.
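For example, a Euclidean (straight-line) variant for the sample search point in the question (x = 19, y = 32) could look like the sketch below; t stands in for your actual table name, and the square root is omitted because it doesn't change the ordering:
select t.*
from t
order by (x - 19) * (x - 19) + (y - 32) * (y - 32)
limit 20;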

Related

MYSQL Random Entry with weight fails

I'm trying to display weighted random results from my database and I'm unable to get results with expected accuracy. I've followed what I learnt here and here.
This would be my table:
+--------+-------+
| weight | image |
+--------+-------+
| 50     | A     |
| 25     | B     |
| 25     | C     |
+--------+-------+
I need image A to appear 50% of the time, image B 25% of the time, and C the remaining 25% of the time.
The SQL statement I'm using goes like this:
SELECT image FROM images WHERE weight > 0 ORDER BY -LOG(1.0 - RAND()) / weight LIMIT 10
So in order to test this properly I made a PHP script that runs this 10,000 times, counting how many times a, b or c is shown, and displays the results as percentages on my test script, like this:
a total: 4976 - 49.76%
b total: 2538 - 25.38%
c total: 2486 - 24.86%
With only 10,000 iterations, and considering that RAND() is just a randomization function, I would consider these results accurate enough. The problem is that I ran this script about 100 times and I realized that 98 out of 100 times b had a higher percentage count than c.
I'm trying to understand what's wrong: both rows (b and c) in the table have the same weight, and I'm not introducing any other ordering factor. I took it up a notch and went for 100,000 iterations of the SQL statement. These are the results:
a total: 50185 - 50.185%
b total: 25201 - 25.201%
c total: 24614 - 24.614%
I ran this last test about 50 times (with long waits between runs). This time b was above c every single time, and the accuracy was worse than at 10,000 iterations. You would expect that the higher the number of iterations, the smaller the percentage variation and the more accurate the results, so it's obvious that either I'm doing something wrong or RAND() is not really random enough.
Mathematically speaking, if it were perfectly random, accuracy should improve with more iterations, not get worse.
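As a rough sanity check (assuming each counted outcome behaves like an independent trial with p = 0.25 for c): the standard deviation of c's count over 100,000 iterations would be sqrt(100000 * 0.25 * 0.75) ≈ 137, so the observed 24,614 sits roughly (25,000 - 24,614) / 137 ≈ 2.8 standard deviations below expectation, which, together with b beating c in almost every run, looks more like a bias than ordinary noise.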
Any explanation/solution is welcome.

Fetch all points from database inside a bounding box of north east and south west coordinates in MySQL

I am currently building an application to show all geo-tagged trees in a particular city.
The main columns I am using to fetch data from the table are as follows,
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| tree_geotags_id | int(11)     | NO   | PRI | None    |       |
| latitude_dl     | double(9,7) | YES  |     | NULL    |       |
| longitude_dl    | double(9,7) | YES  |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
The table has over 158000 rows.
Currently I am using the following query to get my output,
SELECT gt.tree_geotags_id, gt.latitude_dl, gt.longitude_dl,
SQRT(
POW( 69.1 * ( gt.latitude_dl - [ center_lat ] ), 2) +
POW( 69.1 * ( [ center_lon ] - gt.longitude_dl ) * COS( gt.latitude_dl / 57.3 ), 2 )
) AS distance
FROM tree_geotags_t gt
HAVING distance < 0.4 ORDER BY distance
What this does is fetch all records within a radius of 0.4.
Every time the centre coordinate of the map changes (on pans or zooms), I use an AJAX call to fetch the data, convert it into GeoJSON format and then load it onto the map as a layer. The issue I am having with this is that in locations with a very high density of trees it takes a long time for the map to place all the points, and since the query works on a radius it also loads points that are outside the viewport.
I need a query that only loads points inside the viewport, using the northeast and southwest coordinates as boundaries. I searched for quite a while here but couldn't find anything suited to my requirements. Please help me out. Thanks in advance..!!
If anyone's still looking, I got my answer from this post.
Get all records from MySQL database that are within Google Maps .getBounds?
Thanks for the help anyway.
You are very close. Your (otherwise grossly incorrect) distance formula contains the seed of your bounding box check.
Try this
SET @distance_unit := 69.0; /* for miles; use 111.045 for km */
SET @radius := 1.0; /* search radius, in the same unit */
SET @center_lat := target_latitude_in_degrees;
SET @center_lon := target_longitude_in_degrees;
SELECT gt.tree_geotags_id, gt.latitude_dl, gt.longitude_dl
FROM tree_geotags_t gt
WHERE gt.latitude_dl
      BETWEEN @center_lat - (@radius / @distance_unit)                                /* south boundary */
          AND @center_lat + (@radius / @distance_unit)                                /* north boundary */
  AND gt.longitude_dl
      BETWEEN @center_lon - (@radius / (@distance_unit * COS(RADIANS(@center_lat))))  /* west boundary */
          AND @center_lon + (@radius / (@distance_unit * COS(RADIANS(@center_lat))))  /* east boundary */
Suppose you know the east, west, north, and south boundaries of your bounding box instead of its center. That's an easy adaptation of the above code.
SELECT gt.tree_geotags_id, gt.latitude_dl, gt.longitude_dl
FROM tree_geotags_t gt
WHERE gt.latitude_dl BETWEEN @south AND @north
  AND gt.longitude_dl BETWEEN @west AND @east
The question of how to derive the sides of your bounding box from its corners is trivial as long as the bounding box coordinates are in degrees. If they're given in some projection units (like transverse UTM coordinates) there's no way the answer will fit in a Stack Overflow post.
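For the common case where the viewport arrives as a southwest and a northeast corner in degrees (which is what Google Maps .getBounds() hands back), the sides fall straight out of the corners; following the placeholder style above:
SET @south := southwest_corner_latitude;
SET @north := northeast_corner_latitude;
SET @west  := southwest_corner_longitude;
SET @east  := northeast_corner_longitude;
(If the viewport can straddle the 180° meridian the longitude test needs an extra OR, but for a single city that rarely matters.)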
This query can be made fast by a compound index on (latitude_dl, longitude_dl, tree_geotags_id). The latitude search will use an index range scan, and the longitude and id can then be retrieved directly from the index.
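A minimal sketch of that index (the index name is arbitrary):
ALTER TABLE tree_geotags_t
  ADD INDEX idx_lat_lng_id (latitude_dl, longitude_dl, tree_geotags_id);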
What's wrong with your distance formula? It's cartesian, but you need the spherical cosine law formula because you're dealing with spherical coordinates.
This will not work close to either the north or south pole (because cos(latitude) tends towards zero there) but that's OK; you're dealing with trees, and they don't grow there, yet.
Here's a comprehensive writeup on the topic. http://www.plumislandmedia.net/mysql/haversine-mysql-nearest-loc/

Ranking algorithm using likes / dislikes and average views per day

I'm currently ranking videos on a website using a Bayesian ranking algorithm; each video has:
likes
dislikes
views
upload_date
Anyone can like or dislike a video, views is incremented by 1 every time a video is viewed, and all videos have a unique upload_date.
Data Structure
The data is in the following format:
| id | title | likes | dislikes | views | upload_date |
|------|-----------|---------|------------|---------|---------------|
| 1 | Funny Cat | 9 | 2 | 18 | 2014-04-01 |
| 2 | Silly Dog | 9 | 2 | 500 | 2014-04-06 |
| 3 | Epic Fail | 100 | 0 | 200 | 2014-04-07 |
| 4 | Duck Song | 0 | 10000 | 10000 | 2014-04-08 |
| 5 | Trololool | 25 | 30 | 5000 | 2014-04-09 |
Current Weighted Ranking
The following weighted ratio algorithm is used to rank and sort the videos so that the best rated are shown first.
This algorithm takes the Bayesian average into account to give a better overall ranking.
Weighted Rating (WR) = ((AV * AR) + (V * R)) / (AV + V)
AV = Average number of total votes
AR = Average rating
V = This item's number of combined votes (likes + dislikes)
R = This item's current rating (likes - dislikes)
Example current MySQL Query
SELECT id, title, (((avg_vote * avg_rating) + ((likes + dislikes) * (likes / dislikes)) ) / (avg_vote + (likes + dislikes))) AS score
FROM video
INNER JOIN (SELECT ((SUM(likes) + SUM(dislikes)) / COUNT(id)) AS avg_vote FROM video) AS t1
INNER JOIN (SELECT ((SUM(likes) - SUM(dislikes)) / COUNT(id)) AS avg_rating FROM video) AS t2
ORDER BY score DESC
LIMIT 10
Note: views and upload_date are not factored in.
The Issue
The ranking currently works well but it seems we are not making full use of all the data at our disposal.
Having likes, dislikes, views and an upload_date but only using two of them seems a waste, because views and upload_date are not factored in to determine how much weight each like / dislike should carry.
For example, in the Data Structure table above, items 1 and 2 both have the same number of likes / dislikes; however, item 2 was uploaded more recently, so its average daily views are higher.
Since item 2 has gathered its likes and dislikes in a shorter time, shouldn't those likes / dislikes be weighted more strongly?
New Algorithm Result
Ideally the new algorithm with views and upload_date factored in would sort the data into the following result:
Note: avg_views would equal (views / days_since_upload)
| id | title | likes | dislikes | views | upload_date | avg_views |
|------|-----------|---------|------------|---------|---------------|-------------|
| 3 | Epic Fail | 100 | 0 | 200 | 2014-04-07 | 67 |
| 2 | Silly Dog | 9 | 2 | 500 | 2014-04-06 | 125 |
| 1 | Funny Cat | 9 | 2 | 18 | 2014-04-01 | 2 |
| 5 | Trololool | 25 | 30 | 5000 | 2014-04-09 | 5000 |
| 4 | Duck Song | 0 | 10000 | 10000 | 2014-04-08 | 5000 |
The above is a simple representation, with more data it gets a lot more complex.
The question
So to summarise, my question is: how can I factor views and upload_date into my current ranking algorithm so as to improve the way videos are ranked?
I think calculating avg_views as in the example above is a good way to go, but where should I then plug that into the ranking algorithm that I have?
It's possible that better ranking algorithms may exist, if this is the case then please provide an example of a different algorithm that I could use and state the benefits of using it.
Taking a straight percentage of views doesn't give an accurate representation of the item's popularity, either. Although 9 likes out of 18 is "stronger" than 9 likes out of 500, the fact that one video got 500 views and the other got only 18 is a much stronger indication of the video's popularity.
A video that gets a lot of views usually means that it's very popular across a wide range of viewers. That it only gets a small percentage of likes or dislikes is usually a secondary consideration. A video that gets a small number of views and a large number of likes is usually an indication of a video that's very narrowly targeted.
If you want to incorporate views in the equation, I would suggest multiplying the Bayesian average you get from the likes and dislikes by the logarithm of the number of views. That should sort things out pretty well.
Unless you want to go with multi-factor ranking, where likes, dislikes, and views are each counted separately and given individual weights. The math is more involved and it takes some tweaking, but it tends to give better results. Consider, for example, that people will often "like" a video that they find mildly amusing, but they'll only "dislike" if they find it objectionable. A dislike is a much stronger indication than a like.
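A minimal sketch of the log-of-views idea bolted onto the question's query; LOG(views + 1) avoids LOG(0) for unviewed rows, and R is taken as likes - dislikes per the question's own key (both choices are assumptions you may want to tune):
SELECT id, title,
       (((avg_vote * avg_rating) + ((likes + dislikes) * (likes - dislikes)))
         / (avg_vote + (likes + dislikes))) * LOG(views + 1) AS score
FROM video
INNER JOIN (SELECT ((SUM(likes) + SUM(dislikes)) / COUNT(id)) AS avg_vote FROM video) AS t1
INNER JOIN (SELECT ((SUM(likes) - SUM(dislikes)) / COUNT(id)) AS avg_rating FROM video) AS t2
ORDER BY score DESC
LIMIT 10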
I can point you to a non-parametric way to get the best ordering with respect to a weighted linear scoring system without knowing exactly what weights you want to use (just constraints on the weights). First though, note that average daily views might be misleading, because movies are probably downloaded less in later years. So the first thing I would do is fit a polynomial model (degree 10 should be good enough) that predicts the total number of views as a function of how many days the movie has been available. Then, once you have your fit, for each date you get a predicted total number of views, which is what you divide by to get a "relative average number of views": a multiplier that tells you how many times more (or less) likely the movie is to be watched compared to what you would expect on average given the data. So 2 would mean the movie is watched twice as much, and 1/2 would mean it is watched half as much. If you want 2 and 1/2 to be "negatives" of each other, which sort of makes sense from a scoring perspective, then take the log of the multiplier to get the score.
Now, there are several quantities you can compute to include in an overall score, like the (log) "relative average number of views" I mentioned above, and (likes / total views) and (dislikes / total views). US News and World Report ranks universities each year, and they just use a weighted sum of 7 different category scores to get an overall score for each university that they rank by. So using a weighted linear combination of category scores is definitely not a bad way to go. (Note that you may want to do something like a log transform on some categories before taking the linear combination of scores.) The problem is you might not know exactly what weights to use to give the "most desirable" ranking.
The first thing to note is that if you want the weights on the same scale, then you should normalize each category score so that it has standard deviation equal to 1 across all movies. Then, e.g., if you use equal weights, each category is truly weighted equally. So then the question is what kinds of weights you want to use. Clearly the weights for relative number of views and proportion of likes should be positive, and the weight for proportion of dislikes should be negative, so multiply the dislike score by -1 and then you can assume all weights are positive. If you believe each category should contribute at least 20%, then you get that each weight is at least 0.2 times the sum of weights. If you believe that dislikes are more important than likes, then you can say (dislike weight) >= c*(like weight) for some c > 1, or (dislike weight) >= c*(sum of weights) + (like weight) for some c > 0. Similarly you can define other linear constraints on the weights that reflect your beliefs about what the weights should be, without picking exact values for the weights.
Now here comes the fun part, which is the main thrust of my post. If you have linear inequality constraints on the weights, all of the form that a linear combination of the weights is greater than or equal to 0, but you don't know what weights to use, then you can simply compute all possible top-10 or top-20 rankings of movies that you can get for any choice of weights that satisfy your constraints, and then choose the top-k ordering which is supported by the largest VOLUME of weights, where the volume of weights is the solid angle of the polyhedral cone of weights which results in the particular top-k ordering. Then, once you've chosen the "most supported" top-k ranking, you can restrict the scoring parameters to be in the cone that gives you that ranking, remove the top k movies, and compute all possibilities for the next top-10 or top-20 ranking of the remaining movies when the weights are restricted to respect the original top-k movies' ranking.
Computing all obtainable top-k rankings of movies for restricted weights can be done much, much faster than enumerating all n(n-1)...(n-k+1) possible top-k rankings and trying them all out. If you have two or three categories then, using polytope construction methods, the obtainable top-k rankings can be computed in linear time in terms of the output size, i.e. the number of obtainable top-k rankings. The polyhedral computation approach also gives the inequalities that define the cone of scoring weights that give each top-k ranking, also in linear time if you have two or three categories. Then to get the volume of weights that give each ranking, you triangulate the cone, intersect with the unit sphere, and compute the areas of the spherical triangles that you get. (Again linear complexity if the number of categories is 2 or 3.)
Furthermore, if you scale your categories to be in a range like [0, 50] and round to the nearest integer, then you can prove that the number of obtainable top-k rankings is actually quite small if the number of categories is 5 or less (even if you have a lot of movies and k is high). And when you fix the ordering for the current top group of movies and restrict the parameters to be in the cone that yields the fixed top ordering, this will further restrict the output size for the obtainable next best top-k movies. The output size does depend (polynomially) on k, which is why I recommended setting k = 10 or 20: compute the top-k movies, choose the best (largest volume) ordering and fix it, then compute the next best top-k movies that respect the ordering of the original top-k, and so on.
Anyway if this approach sounds appealing to you (iteratively finding successive choices of top-k rankings that are supported by the largest volume of weights that satisfy your weight constraints), let me know and I can produce and post a write-up on the polyhedral computations needed as well as a link to software that will allow you to do it with minimal extra coding on your part. In the meantime here is a paper http://arxiv.org/abs/0805.1026 I wrote on a similar study of 7-category university ranking data where the weights were simply restricted to all be non-negative (generalizing to arbitrary linear constraints on weights is straightforward).
A simple approach would be to come up with a suitable scale factor for each average - and then sum the "weights". The difficult part would be tweaking the scale factors to produce the desired ordering.
From your example data, a starting point might be something like:
Weighted Rating = (AV * (1 / 50)) + (AL * 3) - (AD * 6)
Key & Explanation
AV = Average views per day:
5000 is high so divide by 50 to bring the weight down to 100 in this case.
AL = Average likes per day:
100 in 3 days = 33.33 is high so multiply by 3 to bring the weight up to 100 in this case.
AD = Average dislikes per day:
10,000 seems an extreme value here - would agree with Jim Mischel's point that dislikes may be more significant than likes so am initially going with a negative scale factor of twice the size of the "likes" scale factor.
This gives the following results (see SQL Fiddle Demo):
| ID | TITLE     | SCORE     |
|----|-----------|-----------|
| 3  | Epic Fail | 60.8      |
| 2  | Silly Dog | 4.166866  |
| 1  | Funny Cat | 1.396528  |
| 5  | Trololool | -1.666766 |
| 4  | Duck Song | -14950    |
[Am deliberately keeping this simple to present the idea of a starting point - but with real data you might find linear scaling isn't sufficient - in which case you could consider bandings or logarithmic scaling.]
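For reference, a rough SQL sketch of that starting point; the reference date '2014-04-12' stands in for "today" (it is the value that reproduces the scores above), and the column names follow the question's table:
SELECT id, title,
       (views    / DATEDIFF('2014-04-12', upload_date)) / 50
     + (likes    / DATEDIFF('2014-04-12', upload_date)) * 3
     - (dislikes / DATEDIFF('2014-04-12', upload_date)) * 6 AS score
FROM video
ORDER BY score DESC;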
Every video has:
likes
dislikes
views
upload_date
So we can derive the following parameters from them (a SQL sketch of these follows the list):
like_rate = likes/views
dislike_rate = dislikes/views
view_rate = views/number_of_website_users
video_age = count_days(upload_date, today)
avg_views = views/video_age
avg_likes = likes/video_age
avg_dislikes = dislikes/video_age
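A minimal SQL sketch of those derived columns; the @number_of_website_users variable, the use of CURDATE() for "today", and the GREATEST(..., 1) guard against same-day uploads are all assumptions:
SELECT id, title,
       likes    / views AS like_rate,
       dislikes / views AS dislike_rate,
       views / @number_of_website_users AS view_rate,
       DATEDIFF(CURDATE(), upload_date) AS video_age,
       views    / GREATEST(DATEDIFF(CURDATE(), upload_date), 1) AS avg_views,
       likes    / GREATEST(DATEDIFF(CURDATE(), upload_date), 1) AS avg_likes,
       dislikes / GREATEST(DATEDIFF(CURDATE(), upload_date), 1) AS avg_dislikes
FROM video;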
Before we can set the formula to be used, we need to specify how video popularity should behave; one way is to list, point by point, the properties of a popular video:
A popular video is a recent one in most cases
The older a video gets, the higher avg_views it requires to become popular
A video with a like_rate over like_rate_threshold, or a dislike_rate over dislike_rate_threshold, can offset its age by the margin by which it exceeds that threshold
A high view_rate for a video is a good indicator for suggesting that video to a user who has not watched it before
If avg_likes or avg_dislikes make up most of avg_views, the video is considered currently active; for active videos we don't really need to check how old they are
Conclusion: I don't have a ready-made formula, but one can be constructed by converting one unit onto another's axis, for example discounting a video's age in days based on a calculation made using avg_likes, avg_dislikes and avg_views.
Since no one has pointed it out yet (and I'm a bit surprised), I'll do it. The problem with any ranking algorithm we might come up with is that it's based on our point of view. What you're certainly looking for is an algorithm that accommodates the median user's point of view.
This is no new idea. Netflix had it some time ago, only they personalized it, basing theirs on individual selections. We are looking, as I said, for the median user's best ranking.
So how to achieve it? As others have suggested, you are looking for a function R(L,D,V,U) that returns a real number for the sort key. R() is likely to be quite non-linear.
This is a classical machine learning problem. The "training data" consists of user selections. When a user selects a movie, it's a statement about the goodness of the ranking: selecting a high-ranked one is a vote of confidence. A low-ranked selection is a rebuke. Function R() should revise itself accordingly. Initially, the current ranking system can be used to train the system to mirror its selections. From there it will adapt to user feedback.
There are several schemes and a huge research literature on machine learning for problems like this: regression modeling, neural networks, representation learning, etc. See for example the Wikipedia page for some pointers.
I could suggest some schemes, but won't unless there is interest in this approach. Say "yes" in comments if this is true.
Implementation will be non-trivial - certainly more than just tweaking your SELECT statement. But on the plus side you'll be able to claim your customers are getting what they're asking for in very good conscience!

Order results by proximity (with coordinates & radius)

Given a database of 4 circles, where each circle has a radius and a geolocated centre:
id | radius | latitude | longitude
---+--------+----------+----------
1 | 3 | 40.71 | 100.23
2 | 10 | 50.13 | 100.23
3 | 12 | 39.92 | 100.23
4 | 4 | 80.99 | 100.23
Note: the longitude is the same for each circle, in order to keep things simple.
Assuming that we are on circle 2, I would like to find every nearby circle, according to the latitude/longitude coordinates and the radius of each circle.
For example, according to the latitude/longitude coordinates, we have this order:
circle 1 (because of proximity: 9.42 <- 50.13 - 40.71)
circle 3 (because of proximity: 10.21 <- 50.13 - 39.92)
circle 4 (because of proximity: 30.86 <- 80.99 - 50.13)
But according to the latitude/longitude coordinates and the radius of each circle, we should have:
circle 3 (because of proximity: 1.79 <- 12 - 10.21)
circle 1 (because of proximity: 6.42 <- 9.42 - 3)
circle 4 (because of proximity: 26.86 <- 30.86 - 4)
Is there a simple way to do so in SQL?
The cube and earthdistance extensions provided in PostgreSQL's contrib can handle this, producing at least approximate answers. Specifically, they assume the Earth is a simple sphere, which makes the math a lot easier.
With those extensions you can produce the distance between circle 2 and the others like this:
select circle.id,
earth_distance(ll_to_earth(circle.latitude, circle.longitude),
ll_to_earth(x.latitude, x.longitude))
from circle,
circle x
where x.id = 2 and circle.id <> x.id
order by 2;
Correcting for the circle radius should just involve subtracting x.radius and circle.radius from the distance above, although you need to think about what units the radius is in. By default, earth_distance will calculate a value in metres.
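A sketch of that correction, assuming radius is stored in metres so it is directly comparable with earth_distance's result:
select circle.id,
       earth_distance(ll_to_earth(circle.latitude, circle.longitude),
                      ll_to_earth(x.latitude, x.longitude))
         - x.radius - circle.radius as gap
from circle,
     circle x
where x.id = 2 and circle.id <> x.id
order by gap;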
Now, making the query do something other than scan the entire list of circles and calculate the distance for each one, then sort and limit them, that's much more challenging. There are a couple of approaches:
using cube's ability to be indexed with gist, so you can create indices to search within certain boxes around any circle's centre, and hence cut down the list of circles to consider.
precalculate the distance between each circle and all the others any time a circle is edited, using triggers to maintain this calculation in a separate table.
The second option basically starts with:
create table circle_distance as
select a.id as a_id, b.id as b_id,
earth_distance(ll_to_earth(a.latitude, a.longitude),
ll_to_earth(b.latitude, b.longitude))
from circle a, circle b
where a.id <> b.id;
alter table circle_distance add unique(a_id, b_id);
create index on circle_distance(a_id, earth_distance);
Then some rather tedious functions to delete/insert relevant rows in circle_distance, called by triggers on circle. This means you can do:
select b_id from circle_distance where a_id = $circle_id order by earth_distance limit $n
This query will be able to use that index on (a_id,earth_distance) to do a quick scan.
I'd suggest looking at the PostGIS geography data type and its associated functions (e.g. ST_Distance) rather than reinventing the wheel.
In Neo4j, you can look at Neo4j Spatial; there are tests for the different operations at https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/pipes/GeoPipesTest.java, among them proximity search too, e.g. https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/pipes/GeoPipesTest.java#L150
I would suggest the following:
Create a table for the calculation of relative distances in relation to the start circle,
for instance:
id | calc1 | calc2
---+-------+-------
 1 |  9.42 |  6.42
 3 | 10.21 |  1.79
 4 | 30.86 | 26.86
calc1 being the calculation without the radius
calc2 being the calculation with the radius taken into account
Then create a stored procedure that, when run, first empties the table and then refills it with the correct data; after that you just read the result from the destination table.
Introduction to stored procedures
You will also need a cursor for this.

Mysql Algorithm for great circle distance calculation

I want to calculate the distance between two zip codes before inserting the data into the database. Basically I have these tables:
zip code table
| zipcode | lat     | long  |
|---------|---------|-------|
| 01230   | 60.1756 | 23.12 |
| 01240   | 60.1756 | 25.25 |
customer table
| name | zip code |
|------|----------|
| foo  | 01230    |
sales man table
| name | zip code | workingdistanceinkm |
|------|----------|---------------------|
| foo  | 01240    | 200                 |
| foo1 | 01230    | 100                 |
What I want to do is calculate the distance between the salesmen and the customer, and check whether the customer falls within any salesman's working area, before the customer's data is inserted into the customer table.
My approach was to calculate the distance between the customer and every salesman in the salesman table. But this makes a lot of queries: if I have 1000 salesmen, for example, it means calculating the distance between the new customer to be inserted and every one of the 1000 salesmen.
I am wondering if it's possible to write one query to do the same task.
Have a look at
www.zipcodeworld.com/samples/distance.php.html
Note that distance calculations between zip codes are not always accurate representations of the real distance; this is just the length of an imaginary straight line between the two points, and in reality the distance is longer.
The URL below helped me a lot. Please check the "Finding Locations with MySQL" section. Thanks.
https://developers.google.com/maps/articles/phpsqlsearch_v3
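For what it's worth, here is a minimal sketch of the kind of single query that article describes, adapted to the tables in the question; the table and column names (zipcodes, salesman, lat, lng) and the literal customer zip code are assumptions, 6371 is the Earth's radius in km so the result is comparable with workingdistanceinkm, and LEAST(1, ...) just guards ACOS against floating-point rounding:
SELECT s.name,
       6371 * ACOS(LEAST(1,
                COS(RADIANS(zc.lat)) * COS(RADIANS(zs.lat))
              * COS(RADIANS(zs.lng) - RADIANS(zc.lng))
              + SIN(RADIANS(zc.lat)) * SIN(RADIANS(zs.lat)))) AS distance_km
FROM salesman s
JOIN zipcodes zs ON zs.zipcode = s.zipcode
JOIN zipcodes zc ON zc.zipcode = '01230'  /* the new customer's zip code */
HAVING distance_km <= s.workingdistanceinkm
ORDER BY distance_km;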