What database for 3D distance calculations?

All-
I think I've finally outgrown MySQL for one of my solutions. Right now I have 70 million rows that simply store the x, y, z coordinates of objects in 3D space. Unfortunately I don't know how else to optimize my database to handle the inserts/queries anymore. I need to query based on distance (get objects within a given distance).
Does anyone have a suggestion for a good replacement? I don't know if I should be looking at something like HBase or other non-relational databases, as I may run into a similar problem there. I generally insert about 100 rows per minute, and my query looks like:
// get objects within 500 yards
SELECT DISTINCT `object_positions`.`entry` FROM `object_positions` WHERE `object_positions`.`type` = 3 AND `object_positions`.`continent` = '$p->continent' AND SQRT(POW((`object_positions`.`x` - $p->x), 2) + POW((`object_positions`.`y` - $p->y), 2) + POW((`object_positions`.`z` - $p->z), 2)) < 500;
Nothing crazy complicated, but I think the math involved is what is causing MySQL to explode, and I'm wondering if I should be looking at a cloud-based database solution. It could easily have to handle 10-100 queries per second.

It's not MySQL that's giving you trouble; it's the need to apply indexing to your problem. You have a problem that no amount of NoSQL or cloud computing is going to solve by magic.
Here's your query simplified just a bit for clarity.
SELECT DISTINCT entry
FROM object_positions
WHERE type = 3
AND continent = '$p->continent'
AND DIST(x, $p->x, y, $p->y, z, $p->z) < 500
DIST() is shorthand for your Cartesian distance function.
You need to put separate indexes on x, y, and z in your table, then you need to do this:
SELECT DISTINCT entry
FROM object_positions
WHERE type = 3
AND continent = '$p->continent'
AND x BETWEEN ($p->x - 500) AND ($p->x + 500)
AND y BETWEEN ($p->y - 500) AND ($p->y + 500)
AND z BETWEEN ($p->z - 500) AND ($p->z + 500)
AND DIST(x, $p->x, y, $p->y, z, $p->z) < 500
The three BETWEEN conditions in the WHERE clause allow indexes to be used, avoiding a full table scan for each query. They select all the points in a 1000x1000x1000 cube surrounding your candidate point; the DIST computation then tosses out the ones outside the radius you want. You'll get the same batch of points, but much more efficiently.
You don't have to actually create a DIST function; the formula you have in your question is fine.
You do have an index on (type, continent), don't you? If not, you need that too.
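For reference, here is a minimal sketch of the index definitions described above, using the table and column names from the question (the index names themselves are just illustrative):
-- composite index for the equality filters, plus one index per coordinate
ALTER TABLE object_positions
  ADD INDEX idx_type_continent (`type`, continent),
  ADD INDEX idx_x (x),
  ADD INDEX idx_y (y),
  ADD INDEX idx_z (z);
MySQL will typically use whichever of these it estimates to be most selective for a given query.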

Related

Snowflake ST_POLYGON(TO_GEOGRAPHY(...)) Is Inefficient

I have a few queries that use geospatial conditions. These queries are running surprisingly slowly. Initially I thought it was the geospatial calculation itself, but stripping everything down to just ST_POLYGON(TO_GEOGRAPHY(...)), it is still very slow. This would make sense if each row had its own polygon, but the condition uses a static polygon in the query:
SELECT
ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'))
FROM TABLE(GENERATOR(ROWCOUNT=>1000000))
Snowflake should be able to figure out that it only needs to calculate this polygon once for the entire query. Yet the more rows that are added, the slower it gets. On an x-small warehouse this query takes over a minute, whereas this query:
SELECT
'LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
FROM TABLE(GENERATOR(ROWCOUNT=>3000000))
(I added 2 million more rows to match the byte count)
can complete in 2 seconds.
I tried "precomputing" the polygon myself with a WITH statement but SF figures out the WITH is redundant and drops it. I also tried setting a session variable, but you can't set a complex value like this one as a variable.
I believe this is a bug.
Geospatial functions are in preview for now, and the team is working hard on all kinds of optimizations.
For this case I want to note that making the polygon a single-row table would help, but I would still expect better performance as the team gets this feature out of beta.
Let me create a table with one row, the polygon:
create or replace temp table poly1
as
select ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
)) polygon
;
To see if this would help, I tried a one-million-row cross join:
select *
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
It takes 14 seconds, and in the query profiler you can see that most of the time was spent on an internal TO_OBJECT(GET_PATH(POLY1.POLYGON, '_shape')).
What's interesting to note is that the previous operation is mostly concerned with the ASCII representation of the polygon. Running operations over this polygon is much quicker:
select st_area(polygon)
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
This query should have taken longer (finding the area of a polygon sounds more complicated than just selecting it), but it turns out it took only 7 seconds (about half).
Thanks for the report, and the team will continue to optimize cases like this.
For anyone curious about the particular polygon in the question: it draws a nice heart.

How to get the time that a value remains above a limit

I have a thermometer that is storing all of its reading data.
Example:
http://sqlfiddle.com/#!9/c0aab3/9
The idea is to obtain the total time the temperature remained above 85 degrees Fahrenheit.
I have tried everything I know and have not been able to find a solution.
Currently, what I'm doing is getting the time when the temperature went above 85 and then finding the next reading below 85 to calculate the time difference.
If the temperature stays at 85 for 5 consecutive minutes, that data can break the approach.
Please, what would be the right way to calculate this?
In the sqlfiddle example, the results shown are the readings greater than or equal to 85, but in some cases the temperature did not stay there; it dropped back below 85.
Each stretch from the moment the temperature rises above 85 until it drops back below must be taken and its duration calculated in seconds, successively until the end of the data.
Then all the seconds are added up to get the total time.
Base answer (no modification to the table)
I was able to find a workaround with variables and some IF functions that manipulate them. See if this works for you:
SET @currIndex = 1;
SET @indicator = FALSE;
SET @prevIndex = 0;
SELECT Q2.sen,
       MIN(Q2.subTime) AS 'From',
       MAX(Q2.subTime) AS 'To',
       TIMEDIFF(MAX(Q2.subTime), MIN(Q2.subTime)) AS diff
FROM (SELECT IF(Q1.temp < 85,
                IF(@prevIndex = @currIndex,
                   (@currIndex := @currIndex + 1) - 1,
                   @currIndex),
                @prevIndex := @currIndex) AS 'Ind',
             Q1.subTime,
             Q1.temp,
             Q1.sen
      FROM (SELECT IF(sen_temp.temp < 85,
                      (@indicator XOR (@indicator := FALSE)),
                      @indicator := TRUE) AS ind,
                   s_time AS subTime,
                   temp,
                   sen
            FROM sen_temp
           ) AS Q1
      WHERE Q1.ind
     ) AS Q2
GROUP BY Q2.`Ind`
ORDER BY Q2.subTime;
Here's an SQL fiddle based on your own.
The main problem with this query is its performance. Since there is no ID column on the table, the data has to be carried through the subqueries.
Performance answer (table modification required)
After a lot of optimization work, I ended up adding an ID to the table. It allowed me to have only one subquery instead of two and to carry less data through it.
SET @indicator = FALSE;
SET @currentIndex = 0;
SELECT T1.sen, MIN(T1.s_time) AS 'From', MAX(T1.s_time) AS 'To',
       TIMEDIFF(MAX(T1.s_time), MIN(T1.s_time)) AS diff
FROM (SELECT id, (CASE WHEN (temp >= 85) THEN
                          @currentIndex + (@indicator := TRUE)
                       WHEN (@indicator) THEN
                          @currentIndex := @currentIndex + (@indicator := FALSE) + 1
                       ELSE
                          @indicator := FALSE
                  END) AS ind
      FROM sen_temp
      ORDER BY id, s_date, s_time) AS Q1
INNER JOIN sen_temp AS T1 ON Q1.id = T1.id
WHERE Q1.ind > 0
GROUP BY T1.sen, Q1.ind
Please check this fiddle for this more efficient version.
Performance difference explanation
When creating a MySQL query, performance is always key. If it is simple, the query will execute efficiently and you should not have any problems, unless you run into a syntax error or other optimization pitfalls such as filtering or ordering on a column that isn't indexed.
When we create subqueries, it's harder for the database to handle the query. The reason is quite simple: it potentially uses more RAM. When a query containing subqueries is executed, the subqueries are executed first ("obviously!" you might say). The server then needs to keep those intermediate results around for the outer query, so it effectively creates a temporary table in RAM that the outer query can consult when it needs to. Even though RAM is quite fast, this can slow things down a lot when handling a huge amount of data, and it is even worse when the intermediate data no longer fits in RAM, forcing the server to fall back on much slower disk storage.
If we limit the amount of data generated in the subqueries to the minimum and only pull the remaining wanted data in the main query, the amount of RAM the server uses for the subqueries stays small and the query runs faster.
Therefore, the smaller the amount of data in the subqueries, the faster the whole query will execute. This holds for pretty much all SQL-like databases.
Here's a nice read explaining how queries are optimized.
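If you want to check how MySQL is handling a subquery, you can put EXPLAIN in front of the whole statement; here is a minimal sketch against the table from the question (in older MySQL versions a FROM-clause subquery shows up as a DERIVED row, i.e. a materialized temporary table, while newer versions may merge it into the outer query):
EXPLAIN
SELECT sen, MIN(s_time) AS 'From', MAX(s_time) AS 'To'
FROM (SELECT sen, s_time FROM sen_temp WHERE temp >= 85) AS Q1
GROUP BY sen;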

Best way to check for near objects in SQL database according to x and y columns

I have a database of fairly static objects (e.g. buildings) with x and y coordinates for a game, and I will be sending HTTP requests to my server to get all objects around some given x and y coordinates.
Currently I am using this simple SQL on the server, which then returns the data as JSON.
SELECT OBJECTS.id \"id\", POINTS.x, POINTS.y
FROM OBJECT, OBJECTPOINTS, POINTS
WHERE OBJECTPOINTS.OID = OBJECTS.ID AND OBJECTPOINTS.PID = POINTS.ID AND
ABS(POINTS.x -\"+x+") < 0.01 AND ABS(POINTS.y - "+y+") < 0.01;"
Each object is represented by points which will be used to draw on the client.
I am currently seeing response times of about 5 seconds for around 1.5M points and 200k objects.
To me this is fairly reasonable; however, the problem I have is that the DB is blocked by each request. When 10 requests are sent at the same time, serving the map data to all 10 clients takes about 36 seconds, which is way too much.
So my question is: what would be a better way to handle these requests than comparing distances in SQL?
Would it be significantly faster to hold all of those objects in memory and iterate over them on the server?
I have also thought of partitioning the data into some kind of grid, first checking which grid cell the request coordinates fall into, and then running the same query as above against only the objects in that cell. Is there some clever solution I might be overlooking, maybe in the SQL itself?
Your query cannot make use of an index, because you are applying a function to your column data in the WHERE clause (ABS(POINTS.x ...)). If you rewrite the query to compare the raw column values against other values, you can add an index to your table and the query will no longer need to scan the full table to answer it.
Rewrite your WHERE clause to something like this to replace the ABS() function:
(POINTS.x < (x + 0.01) AND POINTS.x > (x - 0.01))
Then add an index to your table like:
alter table POINTS add index position(x, y);
Compare how many rows each query scans, with and without the index, by adding the EXPLAIN keyword in front of your query.
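Putting both changes together, a sketch of the full rewritten query, with @x and @y as placeholders for the request coordinates that the original code concatenates into the string:
EXPLAIN
SELECT OBJECTS.id "id", POINTS.x, POINTS.y
FROM OBJECTS, OBJECTPOINTS, POINTS
WHERE OBJECTPOINTS.OID = OBJECTS.ID
  AND OBJECTPOINTS.PID = POINTS.ID
  AND POINTS.x > (@x - 0.01) AND POINTS.x < (@x + 0.01)
  AND POINTS.y > (@y - 0.01) AND POINTS.y < (@y + 0.01);
With the position(x, y) index in place, the rows estimate in the EXPLAIN output should drop from the full point count to roughly the number of points inside the narrow x range.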

How could I know how much time it takes to query items in a table of MYSQL?

Our website has a problem: one page takes too long to load. We have found that it builds an n*n matrix on that page, and for each item in the matrix it queries three tables in the MySQL database. Every item in the matrix runs almost the same queries.
So I suspect it may be the large number of MySQL queries that causes the problem, and I want to try to fix it. Here is one of the things I'm unsure about, with the two alternatives listed below:
1.
m = store.execute('SELECT X FROM TABLE1 WHERE I=1')
result = store.execute('SELECT Y FROM TABLE2 WHERE X in m')
2.
r = store.execute('SELECT X, Y FROM TABLE2')
result = []
for each in r:
    i = store.execute('SELECT I FROM TABLE1 WHERE X=%s', each[0])
    if i[0][0] == 1:
        result.append(each)
There are about 200 rows in TABLE1 and more than 400 rows in TABLE2. I don't know which part takes the most time, so I can't make a good decision about how to write my SQL statements.
How can I find out how much time an operation takes in MySQL? Thank you!
Rather than installing a bunch of special tools, you could take a dead-simple approach like this (pardon my Ruby):
start = Time.new
# DB query here
puts "Query XYZ took #{Time.now - start} sec"
Hopefully you can translate that to Python. OR... pardon my Ruby again...
QUERY_TIMES = {}

def query(sql)
  start = Time.new
  result = connection.execute(sql)
  elapsed = Time.new - start
  # record elapsed time per SQL string, and still hand back the result
  QUERY_TIMES[sql] ||= []
  QUERY_TIMES[sql] << elapsed
  result
end
Then run all your queries through this custom method. After doing a test run, you can make it print out the number of times each query was run, and the average/total execution times.
For the future, plan to spend some time learning about "profilers" (if you haven't already). Get a good one for your chosen platform, and spend a little time learning how to use it well.
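If you'd rather get timings from the server itself, MySQL's built-in session profiling is one such tool (a minimal sketch; profiling is deprecated in recent MySQL releases in favour of the Performance Schema, but it still works):
SET profiling = 1;                  -- enable profiling for this session
SELECT X FROM TABLE1 WHERE I = 1;   -- run the query you want to measure
SHOW PROFILES;                      -- list recent statements with their durations
SHOW PROFILE FOR QUERY 1;           -- per-stage breakdown of statement #1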
I use MySQL Workbench for SQL development. It shows response times and can connect remotely to MySQL servers, provided you have permission (which in this case will give you a more accurate reading).
http://www.mysql.com/products/workbench/
Also, as you've realized, you have a SQL statement inside a for loop. That can drastically affect performance. You'll want to take a different route to retrieve that data.
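For example, the loop in option 2 can usually be collapsed into a single query so the database does the matching in one round trip (a sketch, assuming the table and column names from the question):
SELECT TABLE2.X, TABLE2.Y
FROM TABLE2
INNER JOIN TABLE1 ON TABLE1.X = TABLE2.X
WHERE TABLE1.I = 1;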

Practical limit to length of SQL query (specifically MySQL)

Is it particularly bad to have a very, very large SQL query with lots of (potentially redundant) WHERE clauses?
For example, here's a query I've generated from my web application with everything turned off, which should be the largest possible query for this program to generate:
SELECT *
FROM 4e_magic_items
INNER JOIN 4e_magic_item_levels
ON 4e_magic_items.id = 4e_magic_item_levels.itemid
INNER JOIN 4e_monster_sources
ON 4e_magic_items.source = 4e_monster_sources.id
WHERE (itemlevel BETWEEN 1 AND 30)
AND source!=16 AND source!=2 AND source!=5
AND source!=13 AND source!=15 AND source!=3
AND source!=4 AND source!=12 AND source!=7
AND source!=14 AND source!=11 AND source!=10
AND source!=8 AND source!=1 AND source!=6
AND source!=9 AND type!='Arms' AND type!='Feet'
AND type!='Hands' AND type!='Head'
AND type!='Neck' AND type!='Orb'
AND type!='Potion' AND type!='Ring'
AND type!='Rod' AND type!='Staff'
AND type!='Symbol' AND type!='Waist'
AND type!='Wand' AND type!='Wondrous Item'
AND type!='Alchemical Item' AND type!='Elixir'
AND type!='Reagent' AND type!='Whetstone'
AND type!='Other Consumable' AND type!='Companion'
AND type!='Mount' AND (type!='Armor' OR (false ))
AND (type!='Weapon' OR (false ))
ORDER BY type ASC, itemlevel ASC, name ASC
It seems to work well enough, but it's also not particularly high traffic (a few hundred hits a day or so), and I wonder if it would be worth the effort to try and optimize the queries to remove redundancies and such.
Reading your query makes me want to play an RPG.
This is definitely not too long. As long as they are well formatted, I'd say a practical limit is about 100 lines. After that, you're better off breaking subqueries into views just to keep your eyes from crossing.
I've worked with some queries that are 1000+ lines, and that's hard to debug.
By the way, may I suggest a reformatted version? This is mostly to demonstrate the importance of formatting; I trust this will be easier to understand.
select *
from
4e_magic_items mi
,4e_magic_item_levels mil
,4e_monster_sources ms
where mi.id = mil.itemid
and mi.source = ms.id
and itemlevel between 1 and 30
and source not in(16,2,5,13,15,3,4,12,7,14,11,10,8,1,6,9)
and type not in(
'Arms' ,'Feet' ,'Hands' ,'Head' ,'Neck' ,'Orb' ,
'Potion' ,'Ring' ,'Rod' ,'Staff' ,'Symbol' ,'Waist' ,
'Wand' ,'Wondrous Item' ,'Alchemical Item' ,'Elixir' ,
'Reagent' ,'Whetstone' ,'Other Consumable' ,'Companion' ,
'Mount'
)
and ((type != 'Armor') or (false))
and ((type != 'Weapon') or (false))
order by
type asc
,itemlevel asc
,name asc
/*
Some thoughts:
==============
0 - Formatting really matters, in SQL even more than most languages.
1 - consider selecting only the columns you need, not "*"
2 - use of table aliases makes it short & clear ("MI", "MIL" in my example)
3 - joins in the WHERE clause will un-clutter your FROM clause
4 - use NOT IN for long lists
5 - logically, the last two lines can be added to the "type not in" section.
I'm not sure why you have the "or false", but I'll assume some good reason
and leave them here.
*/
Default MySQL 5.0 server limitation is "1MB", configurable up to 1GB.
This is configured via the max_allowed_packet setting on both client and server, and the effective limitation is the lesser of the two.
Caveats:
It's likely that this "packet" limitation does not map directly to characters in a SQL statement. Surely you want to take into account character encoding within the client, some packet metadata, etc.
SELECT @@global.max_allowed_packet
This is the only real limit. It's adjustable on the server, so there is no real straight answer.
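If you do need to raise it, a minimal sketch (the 64 MB value is just an example; remember it has to be raised on both the server and the client side to take effect):
-- requires the SUPER (or SYSTEM_VARIABLES_ADMIN) privilege; applies to new connections
SET GLOBAL max_allowed_packet = 67108864;  -- 64 MB
-- or persist it in my.cnf / my.ini under [mysqld]:
-- max_allowed_packet = 64M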
From a practical perspective, I generally consider any SELECT that ends up taking more than 10 lines to write (putting each clause/condition on a separate line) to be too long to easily maintain. At this point, it should probably be done as a stored procedure of some sort, or I should try to find a better way to express the same concept--possibly by creating an intermediate table to capture some relationship I seem to be frequently querying.
Your mileage may vary, and there are some exceptionally long queries that have a good reason to be. But my rule of thumb is 10 lines.
Example (mildly improper SQL):
SELECT x, y, z
FROM a, b
WHERE fiz = 1
AND foo = 2
AND a.x = b.y
AND b.z IN (SELECT q, r, s, t
FROM c, d, e
WHERE c.q = d.r
AND d.s = e.t
AND c.gar IS NOT NULL)
ORDER BY b.gonk
This is probably too large; optimizing, however, would depend largely on context.
Just remember, the longer and more complex the query, the harder it's going to be to maintain.
Most databases support stored procedures to avoid this issue. If your code is fast enough to execute and easy to read, you don't want to have to change it in order to get the compile time down.
An alternative is to use prepared statements, so you take the parsing hit only once per client connection and then pass in just the parameters for each call.
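A minimal sketch of that route using MySQL's SQL-level prepared statement syntax (client libraries usually expose the same idea through their own placeholder API); the filter values here are just examples reusing column names from the question's query:
PREPARE stmt FROM
  'SELECT * FROM 4e_magic_items WHERE source != ? AND type != ?';
SET @src = 16, @t = 'Arms';
EXECUTE stmt USING @src, @t;
DEALLOCATE PREPARE stmt;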
I'm assuming that by "turned off" you mean a field doesn't have a value?
Instead of checking that something is not this, and also not that, etc., can't you just check whether the field is null? Or set the field to 'off' and check whether type (or whatever) equals 'off'.