I'm trying to get this query working, but unfortunately it's pretty slow, so I'm guessing there is a better query for getting the result I'm looking for.
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on SQRT(POW(samples.X - lerhalter.x , 2) + POW(samples.Y - lerhalter.y, 2)) < 100
where samples.customer_id = 900417
group by samples.provnr
I have the table samples, and I'd like to get all of a customer's samples and then join the lerhalter table. The join can match more than one lerhalter row per sample, so I'd like to get the average value of the lerhalt column.
I think I get the result that I'm after, but the query can take up to 10 s for a customer with only 100 samples. There are customers with 2000 samples.
So I have to get a better query time.
Any suggestions?
A small speed-up would be to leave out the SQRT function. SQRT() is expensive in terms of computing time, and you can simply square the right side of your comparison instead (100 * 100 = 10000):
Select samples.X, samples.Y, samples.id, samples.Provnr, samples.customer_id, avg(lerhalter.lerhalt) from samples
left outer join lerhalter
on (POW(samples.X - lerhalter.x , 2) + POW(samples.Y - lerhalter.y, 2)) < 10000
where samples.customer_id = 900417
group by samples.provnr
Also, are you sure you need a LEFT OUTER JOIN? Could an INNER JOIN be used instead?
Next question: Are the X and Y coordinates integer values? If not, can they be converted to integers? Integer calculations are usually a lot faster than floating point operations.
Finally, you are clearly using a Euclidean distance measure. Is that really needed? Can another distance measure do a sufficiently good job? Maybe city-block distance is good enough for your needs? This would speed things up further.
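As a rough illustration of the squared-distance rewrite, here is a minimal sketch that uses Python's built-in sqlite3 module instead of MySQL. The column types and all sample rows are made up for the example; only the join condition mirrors the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (id INTEGER PRIMARY KEY, customer_id INTEGER, x REAL, y REAL);
CREATE TABLE lerhalter (x REAL, y REAL, lerhalt REAL);
-- Hypothetical test data, not taken from the question.
INSERT INTO samples VALUES
  (1, 900417, 0, 0), (2, 900417, 500, 500), (3, 1, 0, 0), (4, 900417, 5000, 5000);
INSERT INTO lerhalter VALUES
  (30, 40, 10.0), (60, 80, 20.0), (0, 99, 30.0), (500, 560, 40.0), (900, 900, 50.0);
""")

# Compare squared distances against 100 * 100, so no SQRT is needed.
rows = conn.execute("""
    SELECT s.id, AVG(l.lerhalt)
    FROM samples s
    LEFT JOIN lerhalter l
      ON (s.x - l.x) * (s.x - l.x) + (s.y - l.y) * (s.y - l.y) < 10000
    WHERE s.customer_id = ?
    GROUP BY s.id
    ORDER BY s.id
""", (900417,)).fetchall()
print(rows)
```

The lerhalter row at (60, 80) lies exactly 100 units from sample 1 and is excluded, just as it would be with SQRT(...) < 100. Sample 4 has no neighbours and still appears, with a NULL average, because of the LEFT JOIN.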
I've looked around a bit and found quite a few people seeking to order a table of points by distance to a set point, but I'm curious how one would go about efficiently joining two tables on the minimum distance between two points. In my case, consider the table nodes and centroids.
CREATE TABLE nodes (
node_id VARCHAR(255),
pt POINT
);
CREATE TABLE centroids (
centroid_id MEDIUMINT UNSIGNED,
temperature FLOAT,
pt POINT
);
I have approximately 300k nodes and 15k centroids, and I want to get the closest centroid to each node so I can assign each node a temperature. So far I have created spatial indexes on pt on both tables and tried running the following query:
SELECT
nodes.node_id,
MIN(ST_DISTANCE(nodes.pt, centroids.pt))
FROM nodes
INNER JOIN centroids
ON ST_DISTANCE(nodes.pt, centroids.pt) <= 4810
GROUP BY
nodes.node_id
LIMIT 10;
Clearly, this query is not going to solve my problem; it does not retrieve temperature, assumes that the closest centroid is within 4810, and only evaluates 10 nodes. However, even with these simplifications, this query is very poorly optimized, and is still running as I type this. When I have MySQL give details about the query, it says no indexes are being used and none of the spatial indexes are listed as possible keys.
How could I build a query that can actually return the data I want joined efficiently utilizing spatial indexes?
I think a good approach would be to partition the data into cells (numerically, not database partitioning). I don't know how well spatial indexes apply here, but the high-level logic is to bin each node and centroid point into square regions and find matches among all node-centroid pairs in the same square, then make sure that there isn't a closer match in one of the 8 adjacent squares (e.g. using the same nodes as in the original square). The closest matches can then be used to compute and save the temperature. All subsequent queries should ignore nodes that already have a temperature set.
There will still be nodes whose centroids aren't within the same or the 8 adjacent squares; for those you would expand the search, perhaps using squares with double the width and height. I can see this working with plain indexes on just the x and y coordinates of the points. I don't know how spatial indexes can further improve this.
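The binning idea can be sketched outside the database in a few lines of Python. This is only an illustration under assumptions: the function names and the cell size are hypothetical, and instead of doubling the square size it widens the search one ring of cells at a time, stopping once no unvisited cell can hold a closer centroid.

```python
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bin {id: (x, y)} points into square cells with side length `cell`."""
    grid = defaultdict(list)
    for pid, (x, y) in points.items():
        grid[(math.floor(x / cell), math.floor(y / cell))].append(pid)
    return grid

def nearest(node, centroids, grid, cell):
    """Return (centroid_id, distance) of the centroid closest to `node`."""
    if not centroids:
        return None
    nx, ny = node
    cx, cy = math.floor(nx / cell), math.floor(ny / cell)
    best_id, best_d = None, math.inf
    ring = 0
    while True:
        # Visit only the cells on the border of the (2 * ring + 1)-wide square.
        for i in range(cx - ring, cx + ring + 1):
            for j in range(cy - ring, cy + ring + 1):
                if max(abs(i - cx), abs(j - cy)) != ring:
                    continue
                for cid in grid.get((i, j), ()):
                    px, py = centroids[cid]
                    d = math.hypot(px - nx, py - ny)
                    if d < best_d:
                        best_id, best_d = cid, d
        # Every unvisited cell is more than ring * cell away, so stop once
        # the best match found so far is at least that close.
        if best_d <= ring * cell:
            return best_id, best_d
        ring += 1

# Hypothetical data: three centroids, one node to look up.
centroids = {1: (0.0, 0.0), 2: (10.0, 10.0), 3: (95.0, 5.0)}
grid = build_grid(centroids, cell=10.0)
print(nearest((12.0, 9.0), centroids, grid, cell=10.0))
```

For the node at (12, 9) this picks centroid 2. In SQL the same cell keys could be stored as ordinary indexed integer columns, which is the "plain indexes" variant mentioned above.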
There are many ways to solve this least-n-per-group problem.
One method uses a self left join as an anti-join (this allows ties):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
cross join centroids c
left join centroids c1
on c1.centroid_id <> c.centroid_id
and st_distance(n.pt, c1.pt) < st_distance(n.pt, c.pt)
where c1.centroid_id is null
The same logic can be expressed with a not exists condition.
Another option is to use a correlated subquery for filtering (this does not allow ties):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
inner join centroids c
on c.centroid_id = (
select c1.centroid_id
from centroids c1
order by st_distance(n.pt, c1.pt)
limit 1
)
Finally: if all you want is the temperature of the closest centroid, then a simple subquery should be a good choice:
select
n.node_id,
(
select c1.temperature
from centroids c1
order by st_distance(n.pt, c1.pt)
limit 1
) temperature
from nodes n
I was playing around with SQLite and ran into an odd performance issue with CROSS JOINs on very small data sets. Any cross join I do in SQLite takes about 3x as long, or longer, than the same cross join in MySQL. For example, here is a query over 3,000 rows:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
Does SQLite use a different algorithm than client-server databases do for cross joins or other types of joins? I have had a lot of luck using SQLite on a single table/database, but whenever I join tables, it seems to become a bit more problematic.
Does SQLite use a different algorithm than client-server databases do for cross joins or other types of joins?
Yes. The algorithm used by SQLite is very simple. In SQLite, joins are executed as nested loop joins. The database goes through one table, and for each row, searches matching rows from the other table.
When SQLite cannot use an index to speed up the join, a k-way join takes time proportional to N^k. MySQL, for example, creates some "ghostly" indexes that help the iteration process run faster.
It has been commented already by Shawn that this question would need much more details in order to get a really accurate answer.
However, as a general answer, please be aware that this note in the SQLite documentation states that the algorithm used to perform CROSS JOINs may be suboptimal (by design!), and that their usage is generally discouraged:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite. The "CROSS JOIN" join operator produces the same result as the "INNER JOIN", "JOIN" and "," operators, but is handled differently by the query optimizer in that it prevents the query optimizer from reordering the tables in the join. An application programmer can use the CROSS JOIN operator to directly influence the algorithm that is chosen to implement the SELECT statement. Avoid using CROSS JOIN except in specific situations where manual control of the query optimizer is desired. Avoid using CROSS JOIN early in the development of an application as doing so is a premature optimization. The special handling of CROSS JOIN is an SQLite-specific feature and is not a part of standard SQL.
This clearly indicates that the SQLite query planner handles CROSS JOINs differently than other RDBMS.
Note: nevertheless, I am unsure that this really applies to your use case, where both derived tables being joined have the same number of records.
Why MySQL might be faster: It uses the optimization that it calls "Using join buffer (Block Nested Loop)".
But... There are many things that are "wrong" with the query. I would hate for you to draw a conclusion on comparing DB engines based on your findings.
It could be that one DB will create an index to help with join, even if none were already there.
SELECT * probably hauls around all the columns, unless the Optimizer is smart enough to toss all the columns except for territory.
A LIMIT without an ORDER BY gives you an arbitrary set of rows. You might think that the resultset is necessarily 3000 rows of the value "3000" in each, but it is perfectly valid to come up with other results. (Depending on what you ORDER BY, it still may not be deterministic.)
Having a COUNT(*) without a column saying what it is counting (territory) seems unrealistic.
You have the same subquery twice. Some engines may be smart enough to evaluate it only once. Or you could reformulate it with WITH to (possibly) give the Optimizer a big hint of such. (I think the example below shows how it would be reformulated in MySQL 8.0 or MariaDB 10.2; I don't know about SQLite).
If you are pitting one DB against the other, please use multiple queries that relate to your application.
This is not necessarily a "small" dataset, since the intermediate table (unless optimized away) has 9,000,000 rows.
I doubt if I have written more than one cross join in a hundred queries, maybe a thousand. Its performance is hardly worth worrying about.
WITH w AS ( SELECT territory FROM main_s LIMIT 3000 )
SELECT COUNT(*)
FROM w AS x1
JOIN w AS x2
GROUP BY x1.territory;
As noted above, using CROSS JOIN in SQLite restricts the optimiser from reordering tables so that you can influence the order the nested loops that perform the join will take.
However, that's a red herring here, as you are limiting rows in both sub selects to 3000 rows, and it's the same table, so there is no optimisation to be had there anyway.
Let's see what your query actually does:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You are saying: produce an intermediate result set of 9 million rows (3000 x 3000), group them on x.territory, and return the size of each group.
So let's say the row size of your table is 100 bytes.
You say, for each of 3000 rows of 100 bytes, give me 3000 rows of 100 bytes.
Hence you get 9 million rows of 200 bytes length, an intermediate result set of 1.8GB.
So here are some optimisations you could make.
select COUNT(*) from (
select territory from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You don't use anything other than territory from x, so select just that. Let's assume it is 8 bytes, so now you create an intermediate result set of:
9M x 108 = 972MB
So we nearly halve the amount of data. Let's try the same for x2.
But wait, you are not using any data fields from x2. You are just using it to multiply the result set by 3000. If we do this directly we get:
select COUNT(*) * 3000 from (
select territory from main_s limit 3000
) group by territory
The intermediate result set is now:
3000 x 8 = 24KB which is now 0.001% of the original.
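A quick way to check that the two formulations agree is to run both against a toy table. The sketch below uses Python's sqlite3 module; the main_s columns and rows are invented, N = 30 stands in for the 3000 in the question, and an ORDER BY rowid is added to the sub selects purely so that both LIMITs pick the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_s (territory TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO main_s VALUES (?, ?)",
    [("US" if i % 3 else "FR", f"title {i}") for i in range(50)],
)

N = 30  # stands in for the 3000 in the question

# Original shape: build N x N rows, then count per territory.
cross = conn.execute(f"""
    SELECT x.territory, COUNT(*)
    FROM (SELECT * FROM main_s ORDER BY rowid LIMIT {N}) x
    CROSS JOIN (SELECT * FROM main_s ORDER BY rowid LIMIT {N}) x2
    GROUP BY x.territory
    ORDER BY x.territory
""").fetchall()

# Rewritten shape: count N rows per territory, then multiply by N.
direct = conn.execute(f"""
    SELECT territory, COUNT(*) * {N}
    FROM (SELECT territory FROM main_s ORDER BY rowid LIMIT {N})
    GROUP BY territory
    ORDER BY territory
""").fetchall()
print(cross, direct)
```

The multiplication is only equivalent when the second sub select really returns N rows, that is, when the table holds at least N rows.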
Further, now that SELECT * is not being used, it's possible the optimiser will be able to use an index on main_s that includes territory as a covering index (meaning it doesn't need to lookup the row to get territory).
This is done when there is a WHERE clause: the optimiser will try to choose a covering index that also satisfies the query without row lookups. The documentation is not explicit about whether this is also done when WHERE is not used.
If you determined that a covering index was not being used (assuming one exists), then counterintuitively (because sorting takes time), you could use ORDER BY territory in the sub select to cause the covering index to be used.
select COUNT(*) * 3000 from (
select territory from main_s order by territory limit 3000
) group by territory
Check the optimiser documentation here:
https://www.sqlite.org/draft/optoverview.html
To summarise:
The optimiser uses the structure of your query to look for hints and clues about how the query may be optimised to run quicker.
These clues take the form of keywords and clauses such as WHERE, ORDER BY, JOIN (ON), etc.
Your query as written provides none of these clues.
If I understand your question correctly, you are interested in why other SQL systems are able to optimise your query as written.
The most likely reasons seem to be:
Ability to eliminate unused columns from sub selects (likely)
Ability to use covering indexes without WHERE or ORDER BY (likely)
Ability to eliminate unused sub selects (unlikely)
But this is a theory that would need testing.
SQLite uses CROSS JOIN as a flag to the query planner that disables join reordering. The docs are quite clear:
Programmers can force SQLite to use a particular loop nesting order for a join by using the CROSS JOIN operator instead of just JOIN, INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are commutative in theory, SQLite chooses to never reorder the tables in a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be in an outer loop relative to the right table.
https://www.sqlite.org/optoverview.html#crossjoin
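The fixed loop order can be observed with EXPLAIN QUERY PLAN. The following is a small sketch using Python's sqlite3 module; the tables small and big are hypothetical, and the exact wording of the plan text differs between SQLite versions, so only the first plan row (the outermost loop) is inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE small (id INTEGER PRIMARY KEY, v TEXT);
CREATE TABLE big (id INTEGER PRIMARY KEY, small_id INTEGER);
""")

def outer_loop(sql):
    """Return the text of the first EXPLAIN QUERY PLAN row (the outermost loop)."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return plan[0][-1]  # the last column holds the human-readable detail

# With CROSS JOIN, the table written on the left is always the outer loop.
first = outer_loop("SELECT * FROM big CROSS JOIN small ON small.id = big.small_id")
second = outer_loop("SELECT * FROM small CROSS JOIN big ON small.id = big.small_id")
print(first)
print(second)
```

With a plain JOIN the planner would be free to pick either order; with CROSS JOIN the first plan row names whichever table appears on the left.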
My front-end (SourcePawn) currently does the following:
float fPoints = 0.0;
float fWeight = 1.0;
while(results.FetchRow())
{
fPoints += (results.FetchFloat(0) * fWeight);
fWeight *= 0.95;
}
In case you don't understand this code, it goes through the resultset of this query:
SELECT points FROM table WHERE auth = 'authentication_id' AND points > 0.0 ORDER BY points DESC;
The resultset is floating numbers, sorted by points from high to low.
My front-end takes the 100% of the first row, then 95% of the second one, and it drops by 5% every time. It all adds up to fPoints that is my 'sum' variable.
What I'm looking for, is a solution of how to replicate this code in pure SQL and receive the sum which is called fPoints in my front-end, so I will be able to run it for a table that has over 10,000 rows, in one query instead of 10,000.
I'm very lost. I don't know where to start and guidance of any kind would be very nice.
You can do this using variables:
SELECT points,
(points * (@f := 0.95 * @f) / 0.95) as fPoints
FROM table t CROSS JOIN
(SELECT @f := 1.0) params
WHERE auth = 'authentication_id' AND points > 0.0
ORDER BY points DESC;
A note about the calculation. The value of @f starts at 1. Because we are dealing with variables, the assignment and the use of the variable need to be in the same expression, since MySQL does not guarantee the order of evaluation of expressions.
So, the 0.95 * @f reduces the value by 5%. However, that is for the next iteration. The / 0.95 undoes that to get the right value for this iteration.
While I'm glad the answer Gordon Linoff provides works for you, you should understand it's quite specific. ORDER BY, per the SQL standard, has no effect on how a query is processed, and SQL does not recognize "iteration" in a SELECT statement. So the idea of "reducing a variable on each iteration", where the iteration order is governed by ORDER BY has no basis in standard SQL. You might want to check if it's guaranteed by MySQL, just for your own edification.
To achieve the effect you want in a standard way, proceed as follows.
Create a table Percentiles( Percentile int not null, Factor float not null )
Populate that table with your factors (20 rows).
Write a view or CTE that ranks your points in descending order. Let us call the rank column rank.
Then join your view to Percentiles:
SELECT auth, sum(points * factor) as weight
FROM "your view" as t join percentiles as p
ON t.rank = p.percentile
WHERE points > 0.0
GROUP BY auth
That query is simple, and its intent obvious. It might even be faster. Most important, it will definitely work, and doesn't depend on any idiosyncrasies of your current DBMS.
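Here is one way the rank-and-join approach could look end to end, sketched with Python's sqlite3 module. The table names scores and factors and the column rnk are hypothetical (rank is a reserved word in MySQL 8.0), the data is invented, and ROW_NUMBER() needs window-function support (SQLite 3.25+ or MySQL 8.0+).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (auth TEXT, points REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("u1", 100.0), ("u1", 50.0), ("u1", 80.0), ("u1", 0.0), ("u2", 10.0)],
)

# Factor table: rank 1 -> 1.0, rank 2 -> 0.95, rank 3 -> 0.95^2, ...
conn.execute("CREATE TABLE factors (rnk INTEGER PRIMARY KEY, factor REAL NOT NULL)")
conn.executemany(
    "INSERT INTO factors VALUES (?, ?)",
    [(r, 0.95 ** (r - 1)) for r in range(1, 101)],
)

rows = conn.execute("""
    WITH ranked AS (
        SELECT auth, points,
               ROW_NUMBER() OVER (PARTITION BY auth ORDER BY points DESC) AS rnk
        FROM scores
        WHERE points > 0.0
    )
    SELECT r.auth, SUM(r.points * f.factor) AS weight
    FROM ranked r
    JOIN factors f ON f.rnk = r.rnk
    GROUP BY r.auth
    ORDER BY r.auth
""").fetchall()
print(rows)
```

For u1 this adds 100 + 80 * 0.95 + 50 * 0.95^2, the same total the front-end loop produces. The inner join silently drops any rank that has no row in the factor table, so that table must cover the largest number of rows any single auth can have.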
My SQL query with all the filters applied returns 10 lakhs (one million) records. Getting all the records takes 76.28 seconds, which is not acceptable. How can I optimize my SQL query so that it takes less time?
The query I am using is:
SELECT cDistName , cTlkName, cGpName, cVlgName ,
cMmbName , dSrvyOn
FROM sspk.villages
LEFT JOIN gps ON nVlgGpID = nGpID
LEFT JOIN TALUKS ON nGpTlkID = nTlkID
left JOIN dists ON nTlkDistID = nDistID
LEFT JOIN HHINFO ON nHLstGpID = nGpID
LEFT JOIN MEMBERS ON nHLstID = nMmbHhiID
LEFT JOIN BNFTSTTS ON nMmbID = nBStsMmbID
LEFT JOIN STATUS ON nBStsSttsID = nSttsID
LEFT JOIN SCHEMES ON nBStsSchID = nSchID
WHERE (
(nMmbGndrID = 1 and nMmbAge between 18 and 60)
or (nMmbGndrID = 2 and nMmbAge between 18 and 55)
)
AND cSttsDesc like 'No, Eligible'
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
GROUP BY cDistName , cTlkName, cGpName, cVlgName ,
DATE_FORMAT(dSrvyOn , '%m-%Y')
I have searched on the forum and elsewhere and tried some of the tips given, but they hardly make any difference. The joins in the query above are all LEFT JOINs on primary key and foreign key columns. Can anyone suggest how I can modify this SQL to get a shorter execution time?
You are, sir, a very demanding user of MySQL! A million records retrieved from a massively joined result set at the speed you mentioned is 76 microseconds per record. Many would consider this to be acceptable performance. Keep in mind that your client software may be a limiting factor with a result set of that size: it has to consume the enormous result set and do something with it.
That being said, I see a couple of problems.
First, rewrite your query so every column name is qualified by a table name. You'll do this for yourself and the next person who maintains it. You can see at a glance what your WHERE criteria need to do.
Second, consider this search criterion. It requires TWO searches, because of the OR.
WHERE (
(MEMBERS.nMmbGndrID = 1 and MEMBERS.nMmbAge between 18 and 60)
or (MEMBERS.nMmbGndrID = 2 and MEMBERS.nMmbAge between 18 and 55)
)
I'm guessing that these criteria match most of your population -- females 18-60 and males 18-55 (a guess). Can you put the MEMBERS table first in your list of LEFT JOINs? Or can you put a derived column (MEMBERS.working_age = 1 or some such) in your table?
Also try a compound index on (nMmbGndrID,nMmbAge) on MEMBERS to speed this up. It may or may not work.
Third, consider this criterion.
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
You've applied a function to the dSrvyOn column. This defeats the use of an index for that search. Instead, try this.
AND dSrvyOn >= '2012-08-01'
AND dSrvyOn < '2012-08-01' + INTERVAL 1 MONTH
This will, if you have an index on dSrvyOn, do a range search on that index. My remark also applies to the function in your GROUP BY clause.
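The effect of wrapping an indexed column in a function can be seen in a query plan. The sketch below uses Python's sqlite3 module rather than MySQL, so strftime stands in for DATE_FORMAT and the survey table is hypothetical. It compares a wrapped predicate with a bare upper-bound comparison, assuming the intent is "surveyed before August 2012".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE survey (id INTEGER PRIMARY KEY, dSrvyOn TEXT);
CREATE INDEX survey_on ON survey (dSrvyOn);
""")

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail strings for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Function applied to the column: the index cannot narrow the search.
wrapped = plan("SELECT id FROM survey WHERE strftime('%m-%Y', dSrvyOn) < '08-2012'")
# Bare column compared with a constant: eligible for an index range search.
bare = plan("SELECT id FROM survey WHERE dSrvyOn < '2012-08-01'")
print(wrapped)
print(bare)
```

SQLite reports a full SCAN for the wrapped form and a SEARCH on the index for the bare comparison. In MySQL, EXPLAIN shows the same distinction in its own format.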
Finally, as somebody else mentioned, don't use LIKE to search where = will do. And NEVER use column LIKE '%something%' if you want acceptable performance.
You claim yourself you base your joins on good and unique indexes. So there is little to be optimized. Maybe a few hints:
try to optimize your table layout, maybe you can reduce the number of joins required. That probably brings more performance optimization than anything else.
check your hardware (available memory and things) and the server configuration.
use MySQL's EXPLAIN feature to find bottlenecks.
maybe you can create an auxiliary table especially for this query, which is filled by a background process. That way the query itself runs faster, since the work is done before the query in the background. That usually works if the query retrieves data that does not necessarily have to be in sync with every single change in the database.
check if an RDBMS is really the right type of database. For many purposes graph databases are much more efficient and offer better performance.
Try adding an index to nMmbGndrID, nMmbAge, and cSttsDesc and see if that helps your queries out.
Additionally you can use the "Explain" command before your select statement to give you some hints on what you might do better. See the MySQL Reference for more details on explain.
If the tables used in the joins are rarely the target of update queries, then you could consider changing the engine type from InnoDB to MyISAM.
SELECT queries on MyISAM can run up to 2x faster than on InnoDB, but update and insert queries are much slower on MyISAM.
You can create Views in order to avoid long queries and time.
Your like operator could be holding you up -- full-text search with like is not MySQL's strong point.
Consider setting a fulltext index on cSttsDesc (make sure it is a CHAR, VARCHAR, or TEXT column first).
ALTER TABLE articles ADD FULLTEXT(cSttsDesc);
SELECT
*
FROM
table_name
WHERE MATCH(cSttsDesc) AGAINST('No, Eligible')
Alternatively, you can set a boolean flag instead of cSttsDesc like 'No, Eligible'.
Source: http://devzone.zend.com/26/using-mysql-full-text-searching/
This SQL has many things that are redundant that may not show up in an explain.
If you require a field, it shouldn't be in a table that's in a LEFT JOIN - left join is for when data might be in the joined table, not when it has to be.
If all the required fields are in the same table, that table should be the first one in your FROM clause.
If your text search is predictable (not from user input) and relates to a single known ID, use the ID not the text search (props to Patricia for spotting the LIKE bottleneck).
Your query is hard to read because the columns are not qualified with table names, but there does seem to be a pattern to your field names.
You require nMmbGndrID and nMmbAge to have a value, but these are probably in MEMBERS, which is 5 left joins down. That's a redundancy.
Remember that you can do a simple join like this:
FROM sspk.villages, gps, TALUKS, dists, HHINFO, MEMBERS [...] WHERE [...] nVlgGpID = nGpID
AND nGpTlkID = nTlkID
AND nTlkDistID = nDistID
AND nHLstGpID = nGpID
AND nHLstID = nMmbHhiID
It looks like cSttsDesc comes from STATUS. But if the text 'No, Eligible' matches exactly one nBStsSttsID in BNFTSTTS then find out the value and use that! If it is 7, take out LEFT JOIN STATUS ON nBStsSttsID = nSttsID and replace AND cSttsDesc like 'No, Eligible' with AND nBStsSttsID = '7'. This would see a massive speed improvement.
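The point about required fields sitting behind a LEFT JOIN can be checked on a toy schema. In this sketch (Python's sqlite3 module, a hypothetical two-table cut-down of the question's schema with invented rows), a WHERE condition on the joined table discards the NULL-extended rows, so the LEFT JOIN returns exactly what an INNER JOIN returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical two-table cut-down of the schema in the question.
CREATE TABLE hhinfo (nHLstID INTEGER PRIMARY KEY);
CREATE TABLE members (nMmbID INTEGER PRIMARY KEY, nMmbHhiID INTEGER, nMmbAge INTEGER);
INSERT INTO hhinfo VALUES (1), (2), (3);
INSERT INTO members VALUES (10, 1, 30), (11, 1, 70), (12, 2, 45);
""")

query = """
    SELECT h.nHLstID, m.nMmbID
    FROM hhinfo h
    {join} members m ON m.nMmbHhiID = h.nHLstID
    WHERE m.nMmbAge BETWEEN 18 AND 60
    ORDER BY h.nHLstID, m.nMmbID
"""
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
print(left_rows == inner_rows, left_rows)
```

Household 3 has no members, so the LEFT JOIN produces a row with a NULL age for it, and the WHERE clause then removes that row again.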
So I'm working on a data mining project where we're looking at code elements and their relationships and changes to these things over time. What we want is to ask some questions about how often related elements are changed. I've set it up as a view, but it's taking like 10 min to run. I believe the problem is that I'm having to do a lot of subtraction, concatenation, and string comparisons to compare entries (for our window size), but I don't know a good way to fix this. The query looks like
select aw.same
, rw.k
, count(distinct concat_ws(',', r1.id, r2.id)) as num
from deltamethoddeclaration dmd1
join revision r1
on r1.id = dmd1.FKrevID
join methodinvocation mi
on mi.FKcallerID = dmd1.FKMDID
join deltamethoddeclaration dmd2
on mi.FKcalleeID = dmd2.FKMDID
join revision r2
on r2.id = dmd2.FKrevID
join revisionwindow rw
join authorwindow aw
where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
and (dmd2.FKrevID - dmd1.FKrevID) < rw.k
and case aw.same
when 1 then
r1.author = r2.author
when 0 then
r1.author <> r2.author
else
1=1
end
group by aw.same
, rw.k
;
OK, so revisionwindow stores the revision windows we're interested in (10, 20, 50, 100) and authorwindow stores which author types we want (same, different, and don't care). Part of the problem is that we could have the same revision pair with different elements matching, so the only hack I could come up with was that ugly count(distinct concat()) thing. This should return a table with 12 rows, one for each combination of the author and revision windows. The entries under 'num' are the unique pairs of revisions related in the manner specified (in this case, both change methods and one of the methods calls the other). It works perfectly, it's just crazy slow (~10 min running time). I'm basically looking for any advice or help to make this work better without sacrificing accuracy.
where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
The most damaging part of this statement is the less-than operator <, not the arithmetic. B-tree indexes cannot use it here, which forces a full table scan every time. Gory details on why this is true: http://explainextended.com/2010/05/19/things-sql-needs-determining-range-cardinality/
I doubt your CASE statement can be optimized by the backend, and the <> operator suffers from the same problem as above. I would think about ways to join with = operators, perhaps breaking up the query and using UNION statements so you can always use indexes.
You're not using EXPLAIN. You need to start using it to optimize queries. You have no idea which indexes are being used and which are not, or whether your condition is selective enough for them to even be helpful (if it's not very selective, see the last point) http://dev.mysql.com/doc/refman/5.0/en/explain.html
Since this is a data mining application, you have a great opportunity to use temp tables of intermediate values. Since the data is probably dumped at periodic intervals (or maybe even only once!), it is easy to rebuild the long-running temp table every so often without running the risk of data corruption. (Or it may just not matter, since you are looking for aggregate patterns.)
I have taken queries that were running over 60 minutes and reduced them to less than 100 ms (instant) by building temp tables that cached the hard stuff. If you are not able to use any of the ideas above, this is probably the lowest-hanging fruit. Take all the 'hard stuff' (case joins and non-equality joins) and do it in one place. Then add an index to your temp table :-) The trick is to make it general enough that you can query the temp table so you still have flexibility to ask different questions.
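As a sketch of what such a temp table might look like for the query in the question, the example below uses Python's sqlite3 module with invented rows. The table rev_pair, its columns, and the encoding of authorwindow.same (1 = same, 0 = different, 2 = either) are assumptions made for the illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (id INTEGER PRIMARY KEY, author TEXT);
CREATE TABLE deltamethoddeclaration (FKMDID TEXT, FKrevID INTEGER);
CREATE TABLE methodinvocation (FKcallerID TEXT, FKcalleeID TEXT);
CREATE TABLE revisionwindow (k INTEGER);
CREATE TABLE authorwindow (same INTEGER);  -- 1 same, 0 different, 2 either (assumed)
INSERT INTO revision VALUES (1, 'ann'), (2, 'bob'), (3, 'ann'), (30, 'bob');
INSERT INTO deltamethoddeclaration VALUES
  ('m1', 1), ('m2', 2), ('m2', 3), ('m1', 30), ('m3', 3), ('m4', 1), ('m5', 2);
INSERT INTO methodinvocation VALUES ('m1', 'm2'), ('m2', 'm3'), ('m4', 'm5');
INSERT INTO revisionwindow VALUES (10), (50);
INSERT INTO authorwindow VALUES (1), (0), (2);

-- Do the expensive five-way join once and keep one row per revision pair.
CREATE TEMP TABLE rev_pair AS
SELECT DISTINCT r1.id AS rev1, r2.id AS rev2,
       ABS(r1.id - r2.id) AS gap,
       (r1.author = r2.author) AS same_author
FROM deltamethoddeclaration dmd1
JOIN revision r1 ON r1.id = dmd1.FKrevID
JOIN methodinvocation mi ON mi.FKcallerID = dmd1.FKMDID
JOIN deltamethoddeclaration dmd2 ON mi.FKcalleeID = dmd2.FKMDID
JOIN revision r2 ON r2.id = dmd2.FKrevID;
CREATE INDEX rev_pair_gap ON rev_pair (gap);
""")

# The window questions now run against the small cached table.
rows = conn.execute("""
    SELECT aw.same, rw.k, COUNT(*) AS num
    FROM rev_pair p
    JOIN revisionwindow rw ON p.gap < rw.k
    JOIN authorwindow aw ON aw.same = 2 OR aw.same = p.same_author
    GROUP BY aw.same, rw.k
    ORDER BY rw.k, aw.same
""").fetchall()
print(rows)
```

Because rev_pair already holds distinct pairs, COUNT(*) replaces the count(distinct concat_ws(...)) workaround, and the two one-sided comparisons collapse into a single gap < k test on a stored, indexed column.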
I suspect the two joins (join revisionwindow rw) and (join authorwindow aw) that do not have an ON condition but use the WHERE, cause this.
How many records do these two tables have? MySQL probably first does a CROSS JOIN on these and only later checks the complex (WHERE) conditions.
But please post the results of EXPLAIN.
--EDIT--
Oops, I missed your last paragraph which explains that the two tables have 4 and 3 rows.
Can you try this:
(where the concat has been replaced
and the where clauses have been moved as JOIN ON ...)
select aw.same
, rw.k
, count(distinct r1.id, r2.id) as num
from deltamethoddeclaration dmd1
join revision r1
on r1.id = dmd1.FKrevID
join methodinvocation mi
on mi.FKcallerID = dmd1.FKMDID
join deltamethoddeclaration dmd2
on mi.FKcalleeID = dmd2.FKMDID
join revision r2
on r2.id = dmd2.FKrevID
join revisionwindow rw
on (dmd1.FKrevID - dmd2.FKrevID) < rw.k
and (dmd2.FKrevID - dmd1.FKrevID) < rw.k
join authorwindow aw
on case aw.same
when 1 then
r1.author = r2.author
when 0 then
r1.author <> r2.author
else
1=1
end
group by aw.same
, rw.k
;