I've looked around a bit and found quite a few people seeking to order a table of points by distance to a set point, but I'm curious how one would go about efficiently joining two tables on the minimum distance between two points. In my case, consider the tables nodes and centroids.
CREATE TABLE nodes (
node_id VARCHAR(255),
pt POINT
);
CREATE TABLE centroids (
centroid_id MEDIUMINT UNSIGNED,
temperature FLOAT,
pt POINT
);
I have approximately 300k nodes and 15k centroids, and I want to get the closest centroid to each node so I can assign each node a temperature. So far I have created spatial indexes on pt on both tables and tried running the following query:
SELECT
nodes.node_id,
MIN(ST_DISTANCE(nodes.pt, centroids.pt))
FROM nodes
INNER JOIN centroids
ON ST_DISTANCE(nodes.pt, centroids.pt) <= 4810
GROUP BY
nodes.node_id
LIMIT 10;
Clearly, this query is not going to solve my problem; it does not retrieve temperature, assumes that the closest centroid is within 4810, and only evaluates 10 nodes. However, even with these simplifications, this query is very poorly optimized, and is still running as I type this. When I have MySQL give details about the query, it says no indexes are being used and none of the spatial indexes are listed as possible keys.
How could I build a query that can actually return the data I want joined efficiently utilizing spatial indexes?
I think a good approach would be partitioning the data into cells (numerically, not DB partitioning). I don't know how well spatial indexes apply here, but the high-level logic is: bin each node and centroid point into square regions, find matches between all node-centroid pairs in the same square, then make sure there isn't a closer match in an 8-adjacent square (using the same nodes as in the original square). The closest matches can then be used to compute and save the temperature. All subsequent queries should ignore nodes that already have a temperature set.
There will still be nodes whose closest centroid isn't within the same or an 8-adjacent square; for those you would expand the search, perhaps using squares with double the width and height. I can see this working with plain indexes on just the x and y coordinates of the points. I don't know how spatial indexes could further improve this.
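A minimal sketch of that binning idea, assuming planar coordinates, an arbitrary cell width of 5000 units, and MySQL's ST_X()/ST_Y() to pull the coordinates out of the POINT values; in practice you would store the cell indices in indexed (generated) columns rather than computing them inside the join:
-- Compare each node only against centroids in the same or an 8-adjacent cell,
-- then keep the closest distance per node.
SELECT
  n.node_id,
  MIN(ST_DISTANCE(n.pt, c.pt)) AS dist
FROM nodes n
JOIN centroids c
  ON ABS(FLOOR(ST_X(n.pt) / 5000) - FLOOR(ST_X(c.pt) / 5000)) <= 1
 AND ABS(FLOOR(ST_Y(n.pt) / 5000) - FLOOR(ST_Y(c.pt) / 5000)) <= 1
GROUP BY n.node_id;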
There are many ways to solve this least-n-per-group problem.
One method uses a self left join as an anti-join (this allows ties):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
cross join centroids c
left join centroids c1
on c1.centroid_id <> c.centroid_id
and st_distance(n.pt, c1.pt) < st_distance(n.pt, c.pt)
where c1.centroid_id is null
The same logic can be expressed with a not exists condition.
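For reference, a sketch of that not exists form (same tables and columns as above):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
cross join centroids c
where not exists (
  select 1
  from centroids c1
  where c1.centroid_id <> c.centroid_id
    and st_distance(n.pt, c1.pt) < st_distance(n.pt, c.pt)
)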
Another option is to use a correlated subquery for filtering (this does not allow ties):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
inner join centroids c
on c.centroid_id = (
select c1.centroid_id
from centroids c1
order by st_distance(n.pt, c1.pt)
limit 1
)
Finally: if all you want is the temperature of the closest centroid, then a simple subquery should be a good choice:
select
n.node_id,
(
select c1.temperature
from centroids c1
order by st_distance(n.pt, c1.pt)
limit 1
) temperature
from nodes n
The question is about *.FIT files (link to definition) from sports watches, speedometers, and the like (1 to extremely many files, and constantly more),
in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.
Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility of integrating more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything into one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:

1 x tbl_file

id | filename | date
1  | xyz.fit  | 2022-01-01
2  | vwx.fit  | 2022-01-02
.. | ..       | ..

n x tbl_parameter_xy / tbl_parameter_yz / ...

id | timestamp/counter | file_id | value
1  | 0                 | 1       | value
2  | 1                 | 1       | value
3  | 0                 | 2       | value
.. | ..                | ..      | ..
And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.
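Roughly, as DDL, the layout I have in mind would look like this (the data types here are just placeholders):
-- Sketch of the intended layout; types are placeholders.
CREATE TABLE tbl_file (
  id       INT PRIMARY KEY,
  filename VARCHAR(255),
  date     DATE
);

-- one such table per parameter (tbl_parameter_xy, tbl_parameter_yz, ...)
CREATE TABLE tbl_parameter_xy (
  id          INT PRIMARY KEY,
  `timestamp` INT,    -- timestamp or counter
  file_id     INT,    -- links to tbl_file.id
  value       FLOAT
);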
I then used a test server, set up a MySQL DB to test this, and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
Took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an error somewhere, and that's the reason why it takes 30 seconds.
In comparison, fetching from a "large complete" table was done in 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution.
I created an index on (file_id) for each of the parameter tables, but the result was just as slow, or even slower.
Right now I'm thinking about building that big "one in all" table, which makes it easier to handle and faster to select from, but I would have to update it frequently to insert new columns for new parameters. And I'm afraid it will grow so big it kills itself.
I have 2 questions:
Which table setup is recommended, primarily with a focus on SELECT speed and secondarily on the size of the DB?
Do I have a basic bug in my SELECT that makes it so slow?
You're getting a combinatorial explosion in your JOIN: for each file_id, your result set contains one output row for every pair of matching rows from your two parameter tables.
If you say
SELECT * FROM a LEFT JOIN b
with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this
SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id
which gives you a similarly bloated result set: with roughly 1964 rows per parameter table for file_id 999, the join produces on the order of 1964 x 1964 ≈ 3.86 million rows, which matches what you observed.
You need another ON condition... possibly try this.
SELECT *
FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
AND x.timestamp = y.timestamp
if the timestamps in the two tables are somehow in sync.
But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you almost certainly have misdesigned your database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will go into a table like this.
file_id       - part of your primary key
parameter_id  - part of your primary key
timestamp     - part of your primary key
value
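As a sketch (the exact names and types are only suggestions), that observation table might be declared like this:
CREATE TABLE observation_table (
  file_id      INT UNSIGNED      NOT NULL,
  parameter_id SMALLINT UNSIGNED NOT NULL,
  `timestamp`  INT UNSIGNED      NOT NULL,
  value        FLOAT,
  PRIMARY KEY (file_id, parameter_id, `timestamp`)
);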
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do
SELECT timestamp, parameter_id, value
FROM observation_table
WHERE file_id = xxxx
AND parameter_id IN (2,3)
ORDER BY timestamp, parameter_id
The multicolumn primary key I suggested will optimize this particular query.
Once you have this working, read about denormalization.
I was playing around with SQLite and I ran into an odd performance issue with CROSS JOINs on very small data sets. For example, any cross join I do in SQLite takes about 3x as long or longer than the same cross join in MySQL. Here is an example for 3,000 rows in MySQL:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
Does SQLite use a different algorithm or something than other client-server databases for doing cross joins or other types of joins? I have had a lot of luck using SQLite on a single table/database, but whenever joining tables, it seems to become a bit more problematic.
Does SQLite use a different algorithm or something than other client-server databases for doing cross joins or other types of joins?
Yes. The algorithm used by SQLite is very simple. In SQLite, joins are executed as nested loop joins. The database goes through one table, and for each row, searches matching rows from the other table.
SQLite is unable to figure out how to use an index to speed up the join, and without indices a k-way join takes time proportional to N^k. MySQL, for example, creates some "ghostly" indexes which help the iteration process run faster.
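As a generic illustration (the tables a and b and the column x here are hypothetical), an explicit index on the inner table's join column lets SQLite's inner loop do an index lookup instead of a full scan:
-- Without this index, SQLite scans all of b for every row of a (~ N^2 work).
CREATE INDEX idx_b_x ON b(x);
SELECT * FROM a JOIN b ON a.x = b.x;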
It has been commented already by Shawn that this question would need much more details in order to get a really accurate answer.
However, as a general answer, please be aware that this note in the SQLite documentation states that the algorithm used to perform CROSS JOINs may be suboptimal (by design!), and that their usage is generally discouraged:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite. The "CROSS JOIN" join operator produces the same result as the "INNER JOIN", "JOIN" and "," operators, but is handled differently by the query optimizer in that it prevents the query optimizer from reordering the tables in the join. An application programmer can use the CROSS JOIN operator to directly influence the algorithm that is chosen to implement the SELECT statement. Avoid using CROSS JOIN except in specific situations where manual control of the query optimizer is desired. Avoid using CROSS JOIN early in the development of an application as doing so is a premature optimization. The special handling of CROSS JOIN is an SQLite-specific feature and is not a part of standard SQL.
This clearly indicates that the SQLite query planner handles CROSS JOINs differently than other RDBMS.
Note: nevertheless, I am unsure that this really applies to your use case, where both derived tables being joined have the same number of records.
Why MySQL might be faster: It uses the optimization that it calls "Using join buffer (Block Nested Loop)".
But... There are many things that are "wrong" with the query. I would hate for you to draw a conclusion on comparing DB engines based on your findings.
It could be that one DB will create an index to help with join, even if none were already there.
SELECT * probably hauls around all the columns, unless the Optimizer is smart enough to toss all the columns except for territory.
A LIMIT without an ORDER BY gives you arbitrary rows. You might think that the result set is necessarily 3000 rows with the value "3000" in each, but it is perfectly valid to come up with other results. (Depending on what you ORDER BY, it still may not be deterministic.)
Having a COUNT(*) without a column saying what it is counting (territory) seems unrealistic.
You have the same subquery twice. Some engine may be smart enough to evaluate it only once. Or you could reformulate it with WITH to (possibly) give the Optimizer a big hint of such. (I think the example below shows how it would be reformulated in MySQL 8.0 or MariaDB 10.2; I don't know about SQLite).
If you are pitting one DB against the other, please use multiple queries that relate to your application.
This is not necessarily a "small" dataset, since the intermediate table (unless optimized away) has 9,000,000 rows.
I doubt if I have written more than one cross join in a hundred queries, maybe a thousand. Its performance is hardly worth worrying about.
WITH w AS ( SELECT territory FROM main_s LIMIT 3000 )
SELECT COUNT(*)
FROM w AS x1
JOIN w AS x2
GROUP BY x1.territory;
As noted above, using CROSS JOIN in SQLite prevents the optimiser from reordering tables, so you can influence the order of the nested loops that perform the join.
However, that's a red herring here: you are limiting both sub selects to 3000 rows, and it's the same table, so there is no optimisation to be had there anyway.
Let's see what your query actually does:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You say: produce an intermediate result set of 9 million rows (3000 x 3000), group them on x.territory, and return the count of each group.
So let's say the row size of your table is 100 bytes.
You say, for each of 3000 rows of 100 bytes, give me 3000 rows of 100 bytes.
Hence you get 9 million rows of 200 bytes each, an intermediate result set of 1.8GB.
So here are some optimisations you could make.
select COUNT(*) from (
select territory from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You don't use anything other than territory from x, so select just that. Let's assume it is 8 bytes, so now you create an intermediate result set of:
9M x 108 = 972MB
So we nearly halve the amount of data. Let's try the same for x2.
But wait, you are not using any data fields from x2. You are just using it to multiply the result set by 3000. If we do this directly we get:
select COUNT(*) * 3000 from (
select territory from main_s limit 3000
) group by territory
The intermediate result set is now:
3000 x 8 = 24KB which is now 0.001% of the original.
Further, now that SELECT * is not being used, it's possible the optimiser will be able to use an index on main_s that includes territory as a covering index (meaning it doesn't need to look up the row to get territory).
This is done when there is a WHERE clause: the optimiser will try to choose a covering index that satisfies the query without row lookups, but the documentation is not explicit about whether this also happens when no WHERE clause is used.
If you determined that a covering index was not being used (assuming one exists), then, counterintuitively (because sorting takes time), you could add ORDER BY territory in the sub select to cause the covering index to be used.
select COUNT(*) * 3000 from (
select territory from main_s order by territory limit 3000
) group by territory
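For completeness, the covering index assumed in the discussion above would be something like this (the index name is arbitrary):
-- Lets the sub select be answered from the index alone, without row lookups.
CREATE INDEX idx_main_s_territory ON main_s(territory);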
Check the optimiser documentation here:
https://www.sqlite.org/draft/optoverview.html
To summarise:
The optimiser uses the structure of your query to look for hints and clues about how the query may be optimised to run quicker.
These clues take the form of keywords such as WHERE clauses, ORDER BY, JOIN (ON), etc.
Your query as written provides none of these clues.
If I understand your question correctly, you are interested in why other SQL systems are able to optimise your query as written.
The most likely reasons seem to be:
Ability to eliminate unused columns from sub selects (likely)
Ability to use covering indexes without WHERE or ORDER BY (likely)
Ability to eliminate unused sub selects (unlikely)
But this is a theory that would need testing.
SQLite uses CROSS JOIN as a flag to the query planner to disable its table-reordering optimization. The docs are quite clear:
Programmers can force SQLite to use a particular loop nesting order for a join by using the CROSS JOIN operator instead of just JOIN, INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are commutative in theory, SQLite chooses to never reorder the tables in a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be in an outer loop relative to the right table.
https://www.sqlite.org/optoverview.html#crossjoin
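A small illustration of what that means (the tables a and b here are hypothetical):
-- CROSS JOIN pins the loop order: a is scanned in the outer loop, b in the inner.
SELECT * FROM a CROSS JOIN b WHERE a.x = b.x;
-- With a plain JOIN (or ","), the planner is free to reorder the tables.
SELECT * FROM a JOIN b ON a.x = b.x;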
I'm trying to get this query working, but unfortunately it's pretty slow, so I'm guessing there could be a better query for getting the result I'm looking for.
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on SQRT(POW(samples.X - lerhalter.x , 2) + POW(samples.Y - lerhalter.y, 2)) < 100
where samples.customer_id = 900417
group by samples.provnr
I have the table samples, and I'd like to get all of a customer's samples and then join the lerhalter table. There can be more than one matching row for each sample when I do the join, therefore I'd like to get the average value of the column lerhalt.
I think I get the result that I'm after, but the query can take up to 10s for a customer with only 100 samples. There are customers with 2000 samples.
So I have to get a better query time.
Any suggestions?
A small speed-up would be to leave out the SQRT function. SQRT() is expensive in terms of computing time, and you can simply adjust the right side of your comparison to 100 x 100 = 10000:
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on (POW(samples.X - lerhalter.x , 2) + POW(samples.Y - lerhalter.y, 2)) < 10000
where samples.customer_id = 900417
group by samples.provnr
Also, are you sure you need a LEFT OUTER JOIN? Could an INNER JOIN be used instead?
Next question: are the X and Y coordinates integer values? If not, can they be converted to integers? Integer calculations are usually a lot faster than floating-point operations.
Finally, you are clearly doing a Euclidean distance measure. Is that really needed? Can another distance measure do a sufficiently good job? Maybe city-block distance is good enough for your needs? This would further speed things up a lot.
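If city-block distance turns out to be acceptable, the join condition in the query above could be reduced to something like this (a sketch only; note it is not equivalent to the Euclidean version, just cheaper to evaluate):
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on (ABS(samples.X - lerhalter.x) + ABS(samples.Y - lerhalter.y)) < 100
where samples.customer_id = 900417
group by samples.provnr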
I am working with a table in MySQL that defines a tree hierarchy using the "adjacency list" method, which should be fine for my purposes.
I need to compute the maximum of some value over all children using the (fast) query
SELECT MAX(N.someValue) AS rate
FROM `nodes` N
WHERE N.parent = <some node id>;
Sometimes I am not as lucky and have to work with the descendants of the children (the descendant column is always defined and references some leaf node in that branch of the tree).
SELECT MAX(N.someValue) AS rate
FROM `nodes` N
WHERE N.id IN (SELECT N2.descendant FROM `nodes` N2 WHERE N2.parent = <some node id>);
This second query is quite slow. The number of children for a given parent is quite low, rarely more than 10, never more than 20. It does not appear to be a correlated subquery by my eye, though EXPLAIN says the subquery is dependent. I am testing in MySQL 5.1. nodes.id is the primary key and there is a BTREE index on nodes.parent. Is there a way I can improve the speed of this query?
I don't see anything that specifically explains why this query is slow, but I do see that it could be restructured using a JOIN instead of a subquery. Something like this (I've reversed the table aliases because that's how my brain works):
SELECT MAX(n2.someValue) AS `rate`
FROM `nodes` n1
JOIN `nodes` n2 ON n1.descendant = n2.id
WHERE n1.parent = '<some static id>'
I don't know why your example query was slow, but this structure shouldn't pose performance issues as long as the proper fields are indexed and of course depending on how many rows are being fetched.
This query is inefficient and never finishes executing. The track and desiredspeed tables each have almost a million records. After this we want to self-join the track table for further processing. Any efficient approach to executing the query below is appreciated.
select
t_id,
route_id,
t.timestamp,
s_lat,
s_long,
longitude,
latitude,
SQRT(POW((latitude - d_lat),2) + POW((longitude - d_long),2)) as dst,
SUM(speed*18/5)/count(*) as speed,
'20' as actual_speed,
((20-(speed*18/5))/(speed*18/5))*100 as speed_variation
from
track t,
desiredspeed s
WHERE
LEFT(s_lat,6) = LEFT(latitude,6)
AND LEFT(s_long,6)=LEFT(longitude,6)
AND t_id > 53445
group by
route_id,
s_lat,
s_long
order by
t_id asc
Firstly, you are using old Sybase-style comma join syntax; I would change that to explicit JOIN ... ON syntax.
You are also performing two computations per join comparison across large data sets, which is likely to be inefficient.
The join will not be able to use an index because you are performing a computation on the columns. Either store the data precomputed, or alternatively add a computed column based on the rule applied above, and index it accordingly.
Finally, it may be quicker if you used temporary tables or common table expressions (although I do not know MySQL too well here).
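To make that concrete, here is a sketch of the rewrite; the lat_key and long_key columns are hypothetical precomputed (and indexed) copies of LEFT(s_lat,6)/LEFT(s_long,6) on desiredspeed and LEFT(latitude,6)/LEFT(longitude,6) on track:
-- Assumes precomputed, indexed columns on both tables:
--   desiredspeed: lat_key = LEFT(s_lat, 6),    long_key = LEFT(s_long, 6)
--   track:        lat_key = LEFT(latitude, 6), long_key = LEFT(longitude, 6)
select
t_id,
route_id,
t.timestamp,
s_lat,
s_long,
longitude,
latitude,
SQRT(POW((latitude - d_lat),2) + POW((longitude - d_long),2)) as dst,
SUM(speed*18/5)/count(*) as speed,
'20' as actual_speed,
((20-(speed*18/5))/(speed*18/5))*100 as speed_variation
from
track t
inner join desiredspeed s
on s.lat_key = t.lat_key
and s.long_key = t.long_key
where
t_id > 53445
group by
route_id,
s_lat,
s_long
order by
t_id asc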