Storing millions of 3D coordinates in MySQL - bad idea? - mysql

All-
So I need to store 3D positions (x, y, z) associated with objects in a video game.
I'm curious, is this a terrible idea? The positions are generated quite frequently, and may vary some.
I basically would ONLY like to store the position in my database if it's not within a yard of a position already stored.
I was basically selecting the existing positions for an object in the game (by object_id, object_type, continent and game_version), looping through them, and calculating the distance in PHP. If it was > 1, I would insert it.
Now that I'm at about 7 million rows (obviously not all for the same object), this isn't efficient and the server I'm using is slowing to a crawl.
Does anyone have any ideas on how I could better store this information? I'd prefer it be in MySQL somehow.
Here is the structure of the table:
object_id
object_type (like unit or game object)
x
y
z
continent (an object can be on more than one continent)
game_version (positions can vary based on the game version)
Later when I need to access the data, I basically only query it by object_id, object_type, continent, and game_version (so I have an index on these 4)
Thanks!
Josh

Presumably objects on different continents are considered infinitely far apart. Also you haven't disclosed the units you're using in your table. I'll assume inches (of which there are 36 in a yard).
So, before you insert a point you need to determine whether you're within a yard. To do this you're going to need either the MySQL geo extension (which you can go read about) or separate indexes on at least your x and y columns, and maybe the z column.
Are there any points within a yard? This query will get you whether there are any points within the bounding box of +/- one yard around your new point. A 'nearby' result of one or more means you shouldn't insert the new point.
SELECT COUNT(*) nearby
FROM table t
WHERE t.x between (?xpos - 36) AND (?xpos + 36)
AND t.y between (?ypos - 36) AND (?ypos + 36)
AND t.z between (?zpos - 36) AND (?zpos + 36)
AND t.continent = ?cpos
If you need the query to work with Cartesian distances rather than bounding boxes you can add a sum-of-squares distance computation. But I suspect bounding boxes will work just fine for your app, and be much more efficient than repeatedly fetching 75-row result sets to do proximity testing in your application.
Conceptually it wouldn't be much harder to create a stored procedure for MySQL that would conditionally insert the new row only if it met the proximity criteria. That way you'd have a simple one-way transaction rather than server back-and-forth.
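A minimal sketch of the check-then-insert idea, written in Python against an in-memory SQLite database so that it runs on its own. The `positions` table name, the index, and the `insert_if_not_nearby` helper are assumptions for illustration, and the extra `object_id` filter comes from the question rather than from the query above. With MySQL the same two statements would go through a MySQL driver or into a stored procedure.

```python
import sqlite3

YARD = 36  # assumes coordinates are stored in inches, as the answer does


def insert_if_not_nearby(conn, object_id, continent, x, y, z, radius=YARD):
    """Insert the point unless a stored point for the same object and continent
    lies inside the +/- radius bounding box. Returns True if a row was added."""
    (nearby,) = conn.execute(
        """
        SELECT COUNT(*) FROM positions
        WHERE object_id = ? AND continent = ?
          AND x BETWEEN ? AND ?
          AND y BETWEEN ? AND ?
          AND z BETWEEN ? AND ?
        """,
        (object_id, continent,
         x - radius, x + radius,
         y - radius, y + radius,
         z - radius, z + radius),
    ).fetchone()
    if nearby:
        return False
    conn.execute(
        "INSERT INTO positions (object_id, continent, x, y, z) VALUES (?, ?, ?, ?, ?)",
        (object_id, continent, x, y, z),
    )
    return True


# Hypothetical schema, reduced to the columns the check needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE positions (object_id INTEGER, continent INTEGER, x REAL, y REAL, z REAL)"
)
conn.execute("CREATE INDEX idx_positions_lookup ON positions (object_id, continent, x)")

print(insert_if_not_nearby(conn, 1, 0, 100.0, 200.0, 50.0))  # first point for the object
print(insert_if_not_nearby(conn, 1, 0, 110.0, 205.0, 50.0))  # inside the one-yard box
print(insert_if_not_nearby(conn, 1, 0, 500.0, 200.0, 50.0))  # well outside the box
```

The point of the design is that the database only ever returns a count, so no candidate rows travel back to the application for distance testing.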

It may be killing your server because of the continuous disk activity. That could be fixed by having MySQL work in memory: add ENGINE = MEMORY to your table definition.

Related

MySQL spatial query to find all rows that deliver to set point

I have a table, stores, with thousands of stores that deliver. If I have the lat, lng, and delivery_radius for each store (I can add a point column), what is the most efficient way to query the table to see which stores can deliver to where I currently stand?
I feel that checking if the distance between myself and each row is less than the delivery_radius would be a very long process. Would it be best to add a column to store a polygon calculated from each row's info and see if my current point is in that polygon (point-in-polygon)? Any other suggestions?
You can get the distance in miles between two geo points by using the following expression in a SQL query.
ROUND((3959 * acos(cos(radians(IFNULL(P1.LAT, 0))) * cos(radians(IFNULL(P2.LAT, 0))) * cos(radians(IFNULL(P2.LNG, 0)) - radians(IFNULL(P1.LNG, 0))) + sin(radians(IFNULL(P1.LAT, 0))) * sin(radians(IFNULL(P2.LAT, 0))))),3) AS DISTANCE
However, this is a very costly operation, and you will definitely have performance issues as the data grows. Maintaining a polygon may also be difficult, because you have to compute and store a polygon for every new store, and that update process will slow down as the data grows.
If you don't strictly need to do this in an RDBMS, consider another technology such as Elasticsearch, which natively supports this kind of operation. See https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html

Storing vector coordinates in MySQL

I am creating a database to track the (normalized) coordinates of events within a coordinate system. Think: a basketball shot chart, where coordinates of shot attempts are stored relative to where they were taken on the basketball court, in both positive and negative directions from center court.
I'm not exactly sure the best way to store this information in a database in order to give myself the most flexibility in utilizing the data. My options are:
Store a JSON object in a TEXT/CHAR column with X and Y properties
Store each X and Y coordinate in two DECIMAL columns
Use MySQL's spatial POINT object to store the coordinate
My goal is to store a normalized vector2 (as a percentage of the bounding box), so I can map the positions back out onto a rectangle of any size.
It would be nice to be able to do calculations, like distance from another point, but my understanding is that spatial objects are meant more for geographical coordinates than for normalized vectors. The other options make calculations a bit more difficult, though calculations aren't currently a definite requirement for my project.
Is it possible to use spatial POINT for this and would calculations be similar to that of measuring geographical points?
It is possible to use POINT, but it may be more of a hassle retrieving or modifying the values as it is stored in binary form. You won't be able to view or modify the field directly; you would use an SQL statement to get the components or create a new POINT to replace the old one.
They are stored as numbers and you can do normal mathematical operations on them. Geospatial-type calculations on distance would use other geospatial data types such as LINESTRING.
To insert a point you would have to create a point from two numbers (I think for your case, there would be no issues with the size of the numbers) :
INSERT INTO coordinatetable(testpoint) VALUES (GeomFromText('POINT(-100473882.33 2133151132.13)'));
INSERT INTO coordinatetable(testpoint) VALUES (GeomFromText('POINT(0.3 -0.213318973)'));
To retrieve it you would have to select the X and Y value separately
SELECT X(testpoint), Y(testpoint) from coordinatetable;
For your case, I would store the X and Y coordinates in two DECIMAL columns. They are easier to retrieve and modify, and keeping X and Y separate gives you direct access to each coordinate rather than having to extract the values from data stored in a single field. For larger data sets, it may also speed up your queries.
For example:
Whether the player is past half court only requires Y-coordinate
How much help the player could possibly get from the backboard would rely more on the X-coordinate than the Y-coordinate (X closer to zero => Straighter shot)
Whether the player usually scores from locations close to the long edges of the court would rely more on the X-coordinate than the Y-coordinate (X approaches 1 or -1)
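As a rough illustration of what two plain numeric columns allow on the application side, the sketch below maps a normalized coordinate back onto a rectangle of any size and measures the distance between two shots. It assumes coordinates are normalized to the range -1 to 1 with the origin at center court; the function names and the pixel dimensions are hypothetical.

```python
import math


def to_pixels(nx, ny, width, height):
    """Map normalized coordinates in [-1, 1] (origin at center court) to pixel
    coordinates whose origin is the top-left corner of a width x height rectangle."""
    px = (nx + 1) / 2 * width
    py = (1 - ny) / 2 * height  # flip y because screen y grows downward
    return px, py


def distance(a, b, half_width=1.0, half_length=1.0):
    """Euclidean distance between two normalized points. Passing the real
    half-dimensions of the court converts the result to physical units."""
    return math.hypot((a[0] - b[0]) * half_width, (a[1] - b[1]) * half_length)


print(to_pixels(0.0, 0.0, 940, 500))        # center court on a 940 x 500 canvas
print(distance((0.0, 0.0), (0.3, -0.4)))    # distance in normalized units
```

Because the stored values are independent of any drawing size, the same rows can be rendered on a thumbnail or a full-screen chart by changing only `width` and `height`.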

Dealing with clusters when searching for points on map using mysql

I've found various questions with solutions similar to this problem but nothing quite on the money so far. Very grateful for any help.
I have a mysql (v.5.6.10) database with a single table called POSTS that stores millions upon millions of rows of lat/long points of interest on a map. Each point is classified as one of several different types. Each row is structured as id, type, coords:
id an unsigned bigint + primary key. This is auto incremented for each new row that is inserted.
type an unsigned tinyint used to encode the type of the point of interest.
coords a mysql geospatial POINT datatype representing the lat/long of the point of interest.
There is a SPATIAL index on 'coords'.
I need to find an efficient way to query the table and return up to X of the most recently-inserted points within a radius ("R") of a specific lat/long position ("Position"). The database is very dynamic so please assume that the data is radically different each time the table is queried.
If X is infinite, the problem is trivial. I just need to execute a query something like:
SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords)
Where 'BoundingBox' is a mysql POLYGON datatype that perfectly encloses a circle of radius R from Position. Using a bounding box is, of course, not a perfect solution but this is not important for the particular problem that I'm trying to solve. I can order the results using "ORDER BY ID DESC" to retrieve and process the most-recently-inserted points first.
If X is less than infinite then I just need to modify the above to:
SELECT id, type, AsText(coords) FROM POSTS WHERE MBRContains(GeomFromText(BoundingBox), coords) ORDER BY id DESC LIMIT X
The problem that I am trying to solve is how do I obtain a good representative set of results from a given region on the map when the points in that region are heavily clustered (for example, within cities on the map search region). For example:
In the example above, I am standing at X and searching for the 5 most-recently-inserted points of type black within the black-framed bounding box. If these points were all inserted in the cluster in the bottom right hand corner (let's assume that cluster is London) then my set of results will not include the black point that is near the top right of the search region. This is a problem for my application as I do not want users to be given the impression that there are no points of interest outside any areas where points are clustered.
I have considered a few potential solutions but I can't find one that works efficiently when the number of rows is huge (10s of millions). Approaches that I have tried so far include:
Dividing the search region into S number of squares (i.e., turning it into a grid) and searching for up to x/S points within each square - i.e., executing a separate mysql query for each square in the grid. This works OK for a small number of rows but becomes inefficient when the number of rows is massive as you need to divide the region into a large number of squares for the approach to work effectively. With only a small number of squares, you cannot guarantee that each square won't contain a densely populated cluster. A large number of squares means a large number of mysql searches which causes things to chug.
Adding a column to each row in the table that stores the distance to the nearest neighbour for each point. The nearest neighbour distance for a given point is calculated when the point is inserted into the table. With this structure, I can then order the search results by the nearest neighbour distance column so that any points that are in clusters are returned last. This solution only works when I'm searching for ALL points within the search region. For example, consider the situation in the diagram shown above. If I want to find the 5 most-recently-inserted points of type green, the nearest neighbour distance that is recorded for each point will not be correct. Recalculating these distances for each and every query is going to be far too expensive, even using efficient algorithms like KD trees.
In fact, I can't see any approach that requires pre-processing of data in table rows (or, put another way, 'touching' every point in the relevant search region dataset) to be viable when the number of rows gets large. I have considered algorithms like k-means / DBSCAN, etc. and I can't find anything that will work with sufficient efficiency given the use case explained above.
Any pearls? My intuition tells me this CAN be solved but I'm stumped so far.
Post-processing seems more effective in that case. Fetch the last X points of a given type. Check whether there is clustering, for example too many points too close together relative to the distance from your point of view. Drop the oldest of them (or those that are very close together, since your data may be referencing the same POI). How many to drop is up to you. Then fetch the next X points and see whether any of them fall outside the cluster. Alternatively, calculate a value for each point based on remoteness and recentness, and discard points according to that value.
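One possible shape for that post-processing step, as a sketch only: a greedy pass over points that are already sorted newest first, keeping a point only if it is far enough from every point kept so far. The function name, the planar `(id, x, y)` tuples, and the `min_separation` threshold are assumptions, and real lat/long data would need a proper distance function.

```python
def thin_clusters(points, min_separation, limit):
    """points: list of (id, x, y) sorted newest first.
    Keep a point only if it is at least min_separation away from every point
    already kept, so each cluster is represented by its newest member."""
    kept = []
    min_sq = min_separation ** 2
    for pid, x, y in points:
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_sq for _, kx, ky in kept):
            kept.append((pid, x, y))
            if len(kept) == limit:
                break
    return kept


# Hypothetical batch: four points crowd around the origin, one lies far away.
recent = [(10, 0.0, 0.0), (9, 0.1, 0.1), (8, 0.2, 0.0), (7, 5.0, 5.0), (6, 0.05, 0.05)]
print(thin_clusters(recent, min_separation=1.0, limit=5))
```

If fewer than `limit` points survive, the caller would fetch the next batch of X rows and run the same pass again with the already kept points carried over.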

Return records in chronological order, given a separation factor

I have a massive table in SQL Server 2008 that contains the position reported by technicians every minute. I need to report on this table, but to control the number of records displayed in the report, both time and distance separation factors need to be taken into account.
So, a query may look like
"Return all records with no less than 5 minutes and/or 300 feet between them".
The time part is done, but I'm having a hard time with the distance factor. I have the latitude and longitude for each point, and I have no problem if I need to include a SQLServer 2008 spatial UDT in order to resolve the problem.
Things I have considered:
1. Fetch the records by the time factor and apply the separation constraint in the client, by calculating the distance between adjacent points and discarding those that fall inside the factor. (The easiest, but probably the one that consumes the most resources.)
2. Keep the last record per technician in a cache, pre-calculate the distance between each record and its predecessor, and resolve the constraint in the client. (This should consume fewer resources than 1, since the distance is pre-calculated. However, since the table is BIG, it will increase the size of the dataset, and I'm not sure the space is worth the processing savings.)
3. Use the spatial functions in SQL Server 2008, but honestly I have been reading and couldn't find anything that helps me resolve this type of requirement. Any GIS experts?
I would like to go with the best option possible (maybe not listed above?) and IMO should be the one using the SQLserver features most efficiently.
What Raciel is asking is how to "simplify" a list of points by a distance factor. Suppose you have a list of one hundred spatial points sorted by dateTime, and each point is exactly 150 feet from the previous one. He needs just the points that are 300 feet apart, so the result set should be a list of around 50 points...
I can only imagine doing this with a cursor.
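The cursor-style pass described here can be sketched outside the database as a single loop that remembers the last point it kept. This is only an illustration in Python: `simplify_track` is a made-up name, and the points are planar `(x, y)` positions in feet so that the example stays self-contained. With lat/long data the `math.dist` call would be replaced by a great-circle distance.

```python
import math


def simplify_track(points, min_feet):
    """points: list of (x_feet, y_feet) sorted by timestamp.
    Keep the first point, then keep each later point only if it is at least
    min_feet away from the most recently kept point."""
    kept = []
    for point in points:
        if not kept or math.dist(kept[-1], point) >= min_feet:
            kept.append(point)
    return kept


# One hundred points in a straight line, 150 feet apart.
track = [(i * 150.0, 0.0) for i in range(100)]
print(len(simplify_track(track, 300.0)))
```

Comparing against the last kept point, rather than the immediately preceding row, is what makes slow drift accumulate until the threshold is crossed.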
The formula is:
3949.99 * acos(sin(LAT1) * sin(LAT2) + cos(LAT1) * cos(LAT2) * cos(LONG1 - LONG2))
Here 3949.99 is the radius of the Earth in miles, so the result is in miles; the latitudes and longitudes must be in radians. This is the great-circle distance formula.
Prior to SQL 2008, the most common solution was to use a UDF to calculate the great-circle distance between two points on a sphere. The Haversine formula is probably the most commonly used method.
Of course the Earth is not actually a perfect sphere, but this was considered "good enough" for most uses.
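For readers who have not seen it, a minimal Python sketch of the Haversine formula follows. The function name and the default radius of 3958.8 miles (an approximate mean Earth radius) are assumptions, and the sample coordinates are illustrative. A pre-2008 UDF would carry out the same arithmetic in T-SQL.

```python
import math


def haversine_miles(lat1, lon1, lat2, lon2, radius=3958.8):
    """Great-circle distance between two points given in decimal degrees,
    assuming a spherical Earth of the given radius (miles by default)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


# Two illustrative points a few miles apart (latitude first, then longitude).
print(haversine_miles(41.478156, -81.810194, 41.498227, -81.687771))
```

Compared with the arccosine form, the Haversine form behaves better numerically when the two points are very close together.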
In SQL 2008, as you anticipated, such calculations are simplified and made more accurate by the introduction of the Geography and Geometry data types. Here's a brief sample of how you can use them to simplify distance calculations.
DECLARE @locations TABLE(locname VARCHAR(100), coord geography)
DECLARE @loc1 geography
DECLARE @loc2 geography
INSERT INTO @locations
VALUES('HOME', geography::Point(41.478156, -81.810194, 4326)) --Note: Lat, Long, SRID
--The 4326 is the SRID (spatial reference id) used by SQL as
--a reference to the WGS 84 Standard. This is the same reference
--used by the GPS system
INSERT INTO @locations
VALUES('WORK', geography::Point(41.498227, -81.687771, 4326))
SELECT * FROM @locations
SELECT @loc1 = coord FROM @locations WHERE locname = 'HOME'
SELECT @loc2 = coord FROM @locations WHERE locname = 'WORK'
SELECT @loc1.STDistance(@loc2) * 3.2808399 --STDistance is in meters so we multiply to convert to feet
The SRID is the key to the improved accuracy. The WGS 84 specification to which it refers includes a standardized coordinate system and a reference ellipsoid. In other words, it accounts for the non-spherical nature of the Earth, giving better results than a pure spherical Great Circle calculation.
If GIS accuracy is important to your work, this is the simplest way to implement it in SQL 2008.

What is the most efficient way to work with distances between two coordinates?

I have 2063 locations stored in a mysql table. In one of my processes I need to exclude certain results based on how far away they are from a given point of origin. The problem is, I will need to filter a couple of hundred, maybe a couple of thousand results at a time.
So what would be the best way to do the distance math? Should I do it at run time:
1. Find all points connecting to my point of origin
2. Loop through the connecting points
3. Calculate the distance between the point of origin and the connecting point
4. Exclude the connecting point if the distance is too great
or should I create a look up table with the distances between each and every point already figured out. I can avoid duplicate rows since the distance between p1 and p2 would be the same as the distance between p2 and p1, but that would still result in a couple of million rows in the table.
Or.. is there an even better way of doing it?
You could use MySQL's spatial extensions to calculate the distance, and even create an R-tree index on the data to optimize the lookup of points within some range.
See the docs for MySQL spatial extensions for details:
http://dev.mysql.com/doc/refman/5.1-maria/en/spatial-extensions.html
How about this:
1. Loop through all points:
2. If abs(a.x - b.x) < distance && abs(a.y - b.y) < distance then:
3. Do the fancy distance calculation between a and b.
I.e. assuming most points will be outside the "box" defined by the distance you are interested in, you can filter out most points very quickly with step 2 and only calculate the real distance for a much smaller number of points.
Since your data is in a mysql table, you really want a solution that SQL will be able to help you with.
I will assume that each location has an x and y coordinate. Store these as separate entries in the table.
You can quickly narrow your field of search to a box centered on your point of interest.
eg
WHERE X > (MyPosX - Range) AND X < (MyPosX + Range)
AND Y > (MyPosY - Range) AND Y < (MyPosY + Range)
Once you have a smaller set of items that are likely to be in range, you can use a more iterative approach
Edit: Avoid square root calculations when working out the actual distances though, as these are expensive. eg instead of
sqrt(x*x + y*y) < distance
try
(x*x + y*y) < distance*distance
// distance*distance is a constant and can be calculated once
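Putting the two ideas from these answers together, a short Python sketch might look like the following. It assumes the candidate rows are already in memory as `(name, x, y)` tuples; the function name and sample points are hypothetical. The box test rejects most points with two comparisons, and the survivors are compared on squared distance so that no square root is ever taken.

```python
def points_in_range(points, origin, max_distance):
    """Return the names of points strictly closer than max_distance to origin."""
    ox, oy = origin
    limit_sq = max_distance * max_distance  # constant, computed once
    in_range = []
    for name, x, y in points:
        dx, dy = x - ox, y - oy
        # Step 1: cheap bounding-box rejection.
        if abs(dx) > max_distance or abs(dy) > max_distance:
            continue
        # Step 2: exact test on squared distance, no sqrt needed.
        if dx * dx + dy * dy < limit_sq:
            in_range.append(name)
    return in_range


# Hypothetical candidates around an origin at (0, 0) with a range of 10.
candidates = [("a", 1.0, 1.0), ("b", 9.0, 9.0), ("c", 20.0, 0.0), ("d", 0.0, 9.5)]
print(points_in_range(candidates, (0.0, 0.0), 10.0))
```

Point "b" shows why the second step is still needed: it sits inside the box but outside the circle.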