MySQL - finding difference between any two integers in same column - mysql

I have data concerning prices of goods at various latitudes and longitudes. I am looking to find profit opportunities by comparing differences in price with the distance needed to travel to obtain the better prices.
The pseudocoded formula looks like this currently:
select (100*diff(avg(price))-50*diff(lon)-70*diff(lat)) as profit
As such, I wish to find the difference between any two latitude values in the data set. I have seen responses that explain how to find the differences between consecutive values or differences in dates, but nothing that seems to address my particular question.
Edit with my current query (this should simply provide me with the two most distant cities latitude wise in descending order):
SELECT lat AS lat2, (lat2 - lat) AS latdistance, city
FROM buying INNER JOIN buying ON (lat2 = lat)
Order by latdistanceasc
The output should list both cities involved as each one contributes a latitude although I am unsure of how to make them both display in the output.
Imagine 3 data points with (Price, Latitude, Longitude):
A (10, 30, 50)
B (15, 50, 60)
C (5, 20, 30)
What is the latitudinal distance between any two points (to keep things simple)?
The output should be:
AB - 20
AC - 10
BC - 30

This query works for your conditions:
SELECT b1.city start,
b2.city finish,
abs(b1.latitude - b2.latitude) latdistance
FROM buying b1, buying b2
WHERE b1.city < b2.city
ORDER BY 3
Be warned, however, that this is expensive: the number of rows in the result grows as O(n^2).
To use a better metric for distance, use this:
SELECT b1.city start,
b2.city finish,
sqrt(pow(b1.latitude - b2.latitude, 2)
+ pow(b1.longitude - b2.longitude, 2)) distance
FROM buying b1, buying b2
WHERE b1.city < b2.city
ORDER BY 3
Both queries at SQLFiddle.
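To illustrate what the self-join produces, here is a minimal Python sketch (not MySQL) that pairs every point with every other point exactly once, using the three sample points from the question. The `points` dictionary and the `pairwise_lat_distance` name are made up for this example.

```python
from itertools import combinations

# Sample points from the question: name -> (price, latitude, longitude)
points = {"A": (10, 30, 50), "B": (15, 50, 60), "C": (5, 20, 30)}

def pairwise_lat_distance(points):
    """Return (pair, |lat1 - lat2|) for each unordered pair of points.

    combinations() over the sorted names plays the role of the
    b1.city < b2.city condition: each pair appears once, never reversed.
    """
    result = []
    for (name1, p1), (name2, p2) in combinations(sorted(points.items()), 2):
        result.append((name1 + name2, abs(p1[1] - p2[1])))
    return result

for pair, dist in pairwise_lat_distance(points):
    print(pair, "-", dist)
```

For n points this yields n*(n-1)/2 pairs, which is the quadratic growth mentioned above.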

To compare the current location to all other cities and show the profit:
SELECT
here.city AS Here,
there.city AS There,
here.price - there.price AS Profit,
ABS(here.lat - there.lat) AS Distance
FROM buying AS here
-- Join to the table itself, but not to the same city
LEFT JOIN buying AS there ON (here.city <> there.city)
-- Remove this to compare all cities to every other city
WHERE here.city = 'Springfield'
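A quick way to try this kind of self-join without a MySQL server is Python's built-in sqlite3 module. The sketch below is only an approximation of the setup: SQLite stands in for MySQL, the table holds the three sample points from the question (with the point names used as city names), and 'A' replaces 'Springfield' as the starting city.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buying (city TEXT, price REAL, lat REAL)")
conn.executemany(
    "INSERT INTO buying VALUES (?, ?, ?)",
    [("A", 10, 30), ("B", 15, 50), ("C", 5, 20)],
)

# Same shape as the query above: join the table to itself,
# skipping the row that would pair a city with itself.
rows = conn.execute(
    """
    SELECT here.city, there.city,
           here.price - there.price AS profit,
           ABS(here.lat - there.lat) AS distance
    FROM buying AS here
    LEFT JOIN buying AS there ON here.city <> there.city
    WHERE here.city = ?
    ORDER BY there.city
    """,
    ("A",),
).fetchall()

for row in rows:
    print(row)
```

Each output row pairs the starting city with one other city, along with the price difference and the latitude distance.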

Related

How do I Understand correlated queries?

I just started an sql exercise-style tutorial BUT I still haven't grasped the concept of correlated queries.
name, area and continent are fields on a table.
The query is to Find the largest country (by area) in each continent, show the continent, the name and the area.
The draft work so far:
SELECT continent, name, population FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent
AND population>0)
I tried reading up on it on a few other blogs.
I need to understand the logic behind correlated queries.
I assume the query you posted works and you just need clarification of what it does.
SELECT continent, name, population
FROM world x
WHERE area >= ALL (
SELECT area FROM world y
WHERE y.continent=x.continent
AND population>0
)
The query translates to
"Get the continent, name, and population of every country whose area is greater than or equal to the area of all countries in the same continent (with population > 0)".
The WHERE clause in the inner query links the two queries (in this case, countries in the same continent). Without that WHERE clause, the query would return the country with the largest area in the world.
You can think of a correlated subquery as a looping mechanism. This is not necessarily how it is implemented, but it describes what it does.
Consider data such as:
row continent area population
1 a 100 19
2 a 200 10
3 a 300 20
4 b 15 2000
The outer query loops through each row. Then it looks at all matching rows. So, it takes record 1:
row continent area population
1 a 100 19
It then runs the subquery:
(SELECT w2.area
FROM world w2
WHERE w2.continent = w1.continent AND
w2.population > 0
)
And substitutes in the values from the outer table:
(SELECT w2.area
FROM world w2
WHERE w2.continent = 'a' AND
w2.population > 0
)
This returns the set (100, 200, 300).
Then it applies the condition:
where w1.area >= all (100, 200, 300)
(This isn't really valid SQL but it conveys the idea.)
Well, we know that w1.area = 100, so this condition is false.
The process is then repeated for each of the rows. For the "a" continent, the only row that meets the condition is the third one -- the one with the largest area.
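The looping picture above can be written out directly. This is a minimal Python sketch of the semantics only, not of how a database actually executes the query; the `world` list reuses the four sample rows, and `largest_per_continent` is a name invented for the example.

```python
# Sample rows from the answer: (row, continent, area, population)
world = [
    (1, "a", 100, 19),
    (2, "a", 200, 10),
    (3, "a", 300, 20),
    (4, "b", 15, 2000),
]

def largest_per_continent(rows):
    """Emulate: WHERE area >= ALL (subquery correlated on continent)."""
    kept = []
    for outer in rows:  # the outer query visits each row in turn
        # The "subquery", re-evaluated with this outer row's continent.
        areas = [r[2] for r in rows if r[1] == outer[1] and r[3] > 0]
        # area >= ALL (...) holds only if it holds for every value returned.
        if all(outer[2] >= a for a in areas):
            kept.append(outer[0])
    return kept

print(largest_per_continent(world))
```

Row 3 survives for continent "a" and row 4 survives for continent "b", since it is the only row in its continent.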

Normalizing price data from 1 to 100

I have a table with just 21.5 million rows, representing properties sold across the UK from 1995. For each entry I've calculated a new price based on inflation of that year and now want to normalize this inflated price to assign a value between 1 and 100.
The average price in the table is 240000. The data is skewed such that 3/4 of the data is below the average. The max is 150 million and the min is 1000.
Normalizing the data using the SQL query below results in 20 million properties assigned the normalized price of 1.
UPDATE properties p
SET inflatedNorm = round(
1 + (
(p.inflatedPrice - MIN_PRICE) * (100 - 1) / (MAX_PRICE- MIN_PRICE)
)
);
What have I done wrong? Surely 20 million 1s is wrong and there should be a more varied spread of values, with most of them being around the average price.
Don't round the result! Let the database store decimal points. So:
UPDATE properties p
SET inflatedNorm = 1 + (p.inflatedPrice - MIN_PRICE) * (100.0 - 1) / (MAX_PRICE - MIN_PRICE);
The other issue is what the prices look like. I would start with:
select max(price), min(price)
from properties p;
If the maximum is far larger than the typical price, you'll see exactly this phenomenon. Only the range matters for your calculation, not the actual distribution within the range. With the maximum of 150 million and minimum of 1000 given in the question, each normalized unit spans about 1.5 million, so every price below roughly 760,000 rounds to 1.
That is, if you consider the net worth of Americans and include Bill Gates in your data, then 99+% of Americans will have a net worth less than 1% of Bill Gates's.
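The effect is easy to check by plugging the question's numbers into the formula. The sketch below is plain Python rather than SQL, and `normalize` is a hypothetical helper that mirrors the expression in the UPDATE statement.

```python
MIN_PRICE = 1_000          # minimum from the question
MAX_PRICE = 150_000_000    # maximum from the question

def normalize(price, lo=MIN_PRICE, hi=MAX_PRICE):
    """Min-max scale a price onto the range 1..100."""
    return 1 + (price - lo) * (100 - 1) / (hi - lo)

average = normalize(240_000)   # the average price from the question
print(round(average, 4))       # unrounded value, just above 1
print(round(average))          # what the original query would store
```

The average price lands only a fraction above 1 on the normalized scale, so rounding collapses it, and everything cheaper, to 1.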

Max latitude for x distance from longitude - Max longitude for x distance from latitude - SQL

Right now I have a table of 100 million rows:
CREATE TABLE o (
id int UNIQUE,
latitude FLOAT(10, 8),
longitude FLOAT(11, 8)
);
On my back end I am receiving a user lat/long and trying to return everything within x distance of that.
Instead of doing the distance formula on every single result I was thinking I could possibly calculate the maximum lat/long for X distance.
So we are sort of creating a square by finding the max lat/min lat, max long/min long.
Once we have these max values we would do the query on this range of values thus making our subset significantly smaller to then do the actual distance formula on (i.e., finding the values within X distance).
So my question to you is:
Which option would run faster?
Option 1)
Distance formula on 100 million entries to get the set.
Option 2)
Instead of doing the distance formula on the set of 100 million entries we calculate the min/max lat/long.
Select the values in that range from the table of 100 million entries
Do the distance formula on our new smaller set.
Option 3)
Something exists already for this in SQL
If option 2 is faster the next issue is actually solving that math problem.
If you want to look at that continue reading:
Lat/Long distance formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2
c = 2 * atan2(sqrt(a), sqrt(1-a))
d = R * c
We can rearrange this because d (assume 1 mile) and R (the radius of the Earth) are fixed values, so we get d/R = c.
The problem then becomes how to solve c/2 = atan2(sqrt(a), sqrt(1-a)).
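For reference, the distance formula in the question translates to Python as shown below. This is a sketch that assumes coordinates in degrees and a mean Earth radius of 6371 km; the `haversine_km` name and the radius constant are choices made for this example.

```python
from math import radians, sin, cos, atan2, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (R in the formula)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return EARTH_RADIUS_KM * c

# One degree of latitude along a meridian
print(haversine_km(0, 0, 1, 0))
```

One degree of latitude comes out at a little over 111 km, a handy figure for sizing a bounding box.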
1 -- 100M rows is a lot to scan and test. It's OK to do once in a while, but it is too slow to do often.
2 -- Using a pseudo-square bounding box and doing
WHERE latitude BETWEEN ...
AND longitude BETWEEN ...
is a good first step. The latitude range is a simple constant times X; the longitude range also divides by cos(latitude).
But the problem comes when you try to find just those rows in the square. Any combination of index on latitude and/or longitude, either separately or together, will only partially filter. That is, it will ignore longitude and give you everything within the latitude range, or vice versa. That might get you down to 100,000 rows to check the distance against. That's a lot better than 100,000,000, but not as good as you would hope for.
3 -- http://mysql.rjweb.org/doc.php/latlng does get down to the square, or very close. It is designed to scale. I have tested only 3M rows, not 100M, but it should work fine.
The main trick is to partition on latitude, then have longitude be the first column in the PRIMARY KEY so that InnoDB clusters nearby rows together within the partition(s). If you look for all rows within X miles (or km), it might examine (and compute the great-circle distance for) about twice as many rows as necessary, not 100K. If you want to find the nearest 100 items, it might touch about 400 (4x).
As for a SPATIAL index, you might want to upgrade to MySQL 5.7.6, which is when ST_Distance_Sphere() and ST_MakeEnvelope() were added. (ST_MakeEnvelope is only marginally more convenient than building a Polygon yourself; it has flat-earth syndrome.)
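The bounding box from point 2 can be computed before the query is issued. The sketch below assumes a spherical Earth of radius 6371 km and a search point away from the poles and the antimeridian; `bounding_box` is a hypothetical helper, not part of any library.

```python
from math import cos, degrees, radians

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def bounding_box(lat, lon, distance_km):
    """Return (min_lat, max_lat, min_lon, max_lon) in degrees.

    The latitude half-width is a constant times the distance; the
    longitude half-width is that same value divided by cos(latitude).
    """
    dlat = degrees(distance_km / EARTH_RADIUS_KM)
    dlon = dlat / cos(radians(lat))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

# The four values would feed the two BETWEEN conditions
min_lat, max_lat, min_lon, max_lon = bounding_box(60.0, 10.0, 1.609)  # about 1 mile
print(min_lat, max_lat, min_lon, max_lon)
```

At 60 degrees latitude the longitude range is about twice as wide as the latitude range, which is why a fixed offset in degrees does not work for both axes.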

mysql contradictory queries to run simultaneously

I'm having a problem working out the logic for building a particular query.
What I have is a form that narrows housing listings down by things like number of bedrooms, sq ft, etc.
My issue is writing a query that includes both city and zip code parameters along with the details of the house.
For example:
SELECT * FROM my_houses
WHERE
BEDROOMS >= 3
AND
SQFT >= 1500
AND
CITY IN ('Gotham', 'Metropolis', 'Central')
VS
SELECT * FROM my_houses
WHERE
BEDROOMS >= 3
AND
SQFT >= 1500
AND
CITY IN ('Gotham', 'Metropolis', 'Central')
OR
ZIP IN ('65656', '65432', '63254')
Now, as I understand it, when I use OR the other parameters are not applied to ZIP, so it will show all entries with those ZIP values, regardless of the number of bedrooms. Also, cities and ZIPs are somewhat mutually exclusive, so there would be a conflict: something meeting a CITY value but not a ZIP would be excluded. But if I can separate them out, that shouldn't matter.
Is there a way to get around this without writing two subqueries?
Logically group your OR with parentheses, something like:
WHERE
BEDROOMS >= 3
AND
SQFT >= 1500
AND
( CITY IN (Gotham, Metropolis, Central)
OR
ZIP IN (65656, 65432, 63254) )
However, your city and ZIP values would be expected within quotes such as
CITY IN ( 'Gotham', 'Metropolis', 'Central')
ZIP IN ('65656', '65432', '63254' )
Use parens to group your OR clause. That way, if either one is true, it will satisfy the AND.
SELECT * FROM my_houses
WHERE
BEDROOMS >= 3
AND
SQFT >= 1500
AND
(
CITY IN ('Gotham', 'Metropolis', 'Central')
OR
ZIP IN ('65656', '65432', '63254')
)
You may be missing the brackets :)
SELECT * FROM my_houses
WHERE
BEDROOMS >= 3
AND
SQFT >= 1500
AND (
CITY IN ('Gotham', 'Metropolis', 'Central')
OR
ZIP IN ('65656', '65432', '63254')
)
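The difference the parentheses make can be seen on a tiny data set. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (both give AND higher precedence than OR); the four sample houses and the `ids` helper are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_houses "
    "(id INTEGER, bedrooms INTEGER, sqft INTEGER, city TEXT, zip TEXT)"
)
conn.executemany(
    "INSERT INTO my_houses VALUES (?, ?, ?, ?, ?)",
    [
        (1, 4, 2000, "Gotham", "11111"),      # passes details, city matches
        (2, 1, 500, "Smallville", "65656"),   # fails details, ZIP matches
        (3, 3, 1600, "Smallville", "65432"),  # passes details, ZIP matches
        (4, 2, 900, "Gotham", "00000"),       # fails details, city matches
    ],
)

details = "bedrooms >= 3 AND sqft >= 1500"
city = "city IN ('Gotham', 'Metropolis', 'Central')"
zip_ = "zip IN ('65656', '65432', '63254')"

def ids(where):
    """Return the ids of the houses matching the given WHERE clause."""
    sql = "SELECT id FROM my_houses WHERE " + where + " ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

ungrouped = ids(f"{details} AND {city} OR {zip_}")    # OR escapes the ANDs
grouped = ids(f"{details} AND ({city} OR {zip_})")    # OR confined to location
print(ungrouped, grouped)
```

Without parentheses, house 2 slips through on its ZIP alone despite failing the bedroom and square-footage filters; with them, only houses 1 and 3 remain.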

select places with nearly same location (duplicates) by latitude/longitude

Let's say I have a table venues with the following columns:
id
user_id
name
latitude
longitude
The latitude and longitude are kept as FLOAT(10,6) values. As different users add venues, there are venue duplicates. How can I select all the duplicates from the table within a range of, let's say, 50 metres? (This might be hard to achieve exactly, as the metre equivalent of a degree of longitude differs at different latitudes, so an approximation is fine.) The query should select all venues involved: VenueA and VenueB (there might be VenueC, VenueD, etc.) so that I can compare them. It should filter out venues that are the only one at their location within the range (I care only about duplicates).
I was looking for an answer but had to settle with answering myself.
SELECT s1.id, s1.name, s2.id, s2.name FROM venues s1, venues s2
WHERE s2.id > s1.id AND
(POW(s1.latitude - s2.latitude, 2) + POW(s1.longitude - s2.longitude, 2) < 0.001)
The first condition selects only half of the matrix, as the order of similar venues is not important. The second one is a simplified distance calculation. As user185631 suggested, the haversine formula should do the trick if you need more precision. I didn't need it, as I was looking for duplicates with the same coordinates but couldn't rely on s1.latitude = s2.latitude AND s1.longitude = s2.longitude due to float/decimal corruption in my DB.
Of course, checking this at insert time would be better, but if you end up with a corrupt DB you need to clean it somehow. Please also note that this query is heavy on the server if your tables are big.
Create a function which computes distances between lat/lons. For small/less accurate distance (which is the case here) you can use the Equirectangular approximation (see section here: http://www.movable-type.co.uk/scripts/latlong.html). If the distance is less than your chosen threshold (50m), then it is a duplicate.
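A minimal sketch of such a function, in Python rather than SQL, might look like this. It assumes coordinates in degrees and a spherical Earth of radius 6,371,000 m; the function names and the 50 m default are illustrative only.

```python
from math import cos, radians, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres

def equirectangular_m(lat1, lon1, lat2, lon2):
    """Approximate distance in metres; adequate for short distances."""
    # Shrink the longitude difference by cos(mean latitude) so that
    # east-west and north-south offsets are on the same scale.
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return EARTH_RADIUS_M * sqrt(x * x + y * y)

def is_duplicate(lat1, lon1, lat2, lon2, threshold_m=50):
    """Treat two venues as duplicates if they lie within threshold_m."""
    return equirectangular_m(lat1, lon1, lat2, lon2) < threshold_m

print(is_duplicate(52.0, 13.0, 52.0001, 13.0))  # about 11 m apart
print(is_duplicate(52.0, 13.0, 52.001, 13.0))   # about 111 m apart
```

By comparison, the 0.001 threshold in the query above is in squared degrees, which corresponds to about 0.03 degrees, or a few kilometres of latitude, so it is much looser than 50 m.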
Determine what 50 meters is in terms of latitude and longitude. Then add and subtract that from your starting location to get a max and min for both latitude and longitude. Then...
SELECT id FROM venues WHERE latitude < (your max latitude) AND latitude > (your min latitude) AND longitude < (your max longitude) AND longitude > (your min longitude);
Converting meters to lat/long is very tricky as it depends on where the starting point is on the globe. See the middle section of the page here: http://www.uwgb.edu/dutchs/usefuldata/utmformulas.htm