Optimizing the SQL Query to reduce execution time - mysql

My SQL Query with all the filters applied is returning 10 lakhs (one million) records . To get all the records it is taking 76.28 seconds .. which is not acceptable . How can I optimize my SQL Query which should take less time.
The Query I am using is :
SELECT cDistName , cTlkName, cGpName, cVlgName ,
cMmbName , dSrvyOn
FROM sspk.villages
LEFT JOIN gps ON nVlgGpID = nGpID
LEFT JOIN TALUKS ON nGpTlkID = nTlkID
left JOIN dists ON nTlkDistID = nDistID
LEFT JOIN HHINFO ON nHLstGpID = nGpID
LEFT JOIN MEMBERS ON nHLstID = nMmbHhiID
LEFT JOIN BNFTSTTS ON nMmbID = nBStsMmbID
LEFT JOIN STATUS ON nBStsSttsID = nSttsID
LEFT JOIN SCHEMES ON nBStsSchID = nSchID
WHERE (
(nMmbGndrID = 1 and nMmbAge between 18 and 60)
or (nMmbGndrID = 2 and nMmbAge between 18 and 55)
)
AND cSttsDesc like 'No, Eligible'
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
GROUP BY cDistName , cTlkName, cGpName, cVlgName ,
DATE_FORMAT(dSrvyOn , '%m-%Y')
I have searched on the forum and outside and used some of the tips given but it hardly makes any difference . The joins that i have used in above query is left join all on Primary Key and Foreign key . Can any one suggest me how can I modify this sql to get less execution time ....

You are, sir, a very demanding user of MySQL! A million records retrieved from a massively joined result set at the speed you mentioned is 76 microseconds per record. Many would consider this to be acceptable performance. Keep in mind that your client software may be a limiting factor with a result set of that size: it has to consume the enormous result set and do something with it.
That being said, I see a couple of problems.
First, rewrite your query so every column name is qualified by a table name. You'll do this for yourself and the next person who maintains it. You can see at a glance what your WHERE criteria need to do.
Second, consider this search criterion. It requires TWO searches, because of the OR.
WHERE (
(MEMBERS.nMmbGndrID = 1 and MEMBERS.nMmbAge between 18 and 60)
or (MEMBERS.nMmbGndrID = 2 and MEMBERS.nMmbAge between 18 and 55)
)
I'm guessing that these criteria match most of your population -- females 18-60 and males 18-55 (a guess). Can you put the MEMBERS table first in your list of LEFT JOINs? Or can you put a derived column (MEMBERS.working_age = 1 or some such) in your table?
Also try a compound index on (nMmbGndrID,nMmbAge) on MEMBERS to speed this up. It may or may not work.
Third, consider this criterion.
AND DATE_FORMAT(dSrvyOn , '%m-%Y') < DATE_FORMAT('2012-08-01' , '%m-%Y' )
You've applied a function to the dSrvyOn column. This defeats the use of an index for that search. Instead, try this.
AND dSrvyOn >= '2102-08-01'
AND dSrvyOn < '2012-08-01' + INTERVAL 1 MONTH
This will, if you have an index on dSrvyOn, do a range search on that index. My remark also applies to the function in your ORDER BY clause.
Finally, as somebody else mentioned, don't use LIKE to search where = will do. And NEVER use column LIKE '%something%' if you want acceptable performance.

You claim yourself you base your joins on good and unique indexes. So there is little to be optimized. Maybe a few hints:
try to optimize your table layout, maybe you can reduce the number of joins required. That probably brings more performance optimization than anything else.
check your hardware (available memory and things) and the server configuration.
use mysqls explain feature to find bottle necks.
maybe you can create an auxilliary table especially for this query, which is filled by a background process. That way the query itself runs faster, since the work is done before the query in background. That usually works if the query retrieves data that must not neccessarily be synchronous with every single change in the database.
check if an RDBMS is really the right type of database. For many purposes graph databases are much more efficient and offer better performance.

Try adding an index to nMmbGndrID, nMmbAge, and cSttsDesc and see if that helps your queries out.
Additionally you can use the "Explain" command before your select statement to give you some hints on what you might do better. See the MySQL Reference for more details on explain.

If the tables used in joins are least use for updates queries, then you can probably change the engine type from INNODB to MyISAM.
Select queries in MyISAM runs 2x faster then in INNODB, but the updates and insert queries are much slower in MyISAM.

You can create Views in order to avoid long queries and time.

Your like operator could be holding you up -- full-text search with like is not MySQL's strong point.
Consider setting a fulltext index on cSttsDesc (make sure it is a TEXT field first).
ALTER TABLE articles ADD FULLTEXT(cSttsDesc);
SELECT
*
FROM
table_name
WHERE MATCH(cSttsDesc) AGAINST('No, Eligible')
Alternatively, you can set a boolean flag instead of cSttsDesc like 'No, Eligible'.
Source: http://devzone.zend.com/26/using-mysql-full-text-searching/

This SQL has many things that are redundant that may not show up in an explain.
If you require a field, it shouldn't be in a table that's in a LEFT JOIN - left join is for when data might be in the joined table, not when it has to be.
If all the required fields are in the same table, it should be the in your first FROM.
If your text search is predictable (not from user input) and relates to a single known ID, use the ID not the text search (props to Patricia for spotting the LIKE bottleneck).
Your query is hard to read because of the lack of table hinting, but there does seem to be a pattern to your field names.
You require nMmbGndrID and nMmbAge to have a value, but these are probably in MEMBERS, which is 5 left joins down. That's a redundancy.
Remember that you can do a simple join like this:
FROM sspk.villages, gps, TALUKS, dists, HHINFO, MEMBERS [...] WHERE [...] nVlgGpID = nGpID
AND nGpTlkID = nTlkID
AND nTlkDistID = nDistID
AND nHLstGpID = nGpID
AND nHLstID = nMmbHhiID
It looks like cSttsDesc comes from STATUS. But if the text 'No, Eligible' matches exactly one nBStsSttsID in BNFTSTTS then find out the value and use that! If it is 7, take out LEFT JOIN STATUS ON nBStsSttsID = nSttsID and replace AND cSttsDesc like 'No, Eligible' with AND nBStsSttsID = '7'. This would see a massive speed improvement.

Related

how can I improve the performance of this slow query in mysql

I have a mysql query which combines data from 3 tables, which I'm calling "first_table", "second_table", and "third_table" as shown below.
This query consistently shows up in the MySQL slow query log, even though all fields referenced in the query are indexed, and the actual amount of data in these tables is not large (< 1000 records, except for "third_table" which has more like 10,000 records).
I'm trying to determine if there is a better way to structure this query to achieve better performance, and what part of this query is likely to be the most likely culprit for causing the slowdown.
Please note that "third_table.placements" is a JSON field type. All "label" fields are varchar(255), "id" fields are primary key integer fields, "sample_img" is an integer, "guid" is a string, "deleted" is an integer, and "timestamp" is a datetime.
SELECT DISTINCT first_table.id,
first_table.label,
(SELECT guid
FROM second_table
WHERE second_table.id = first_table.sample_img) AS guid,
Count(third_table.id) AS
related_count,
Sum(Json_length(third_table.placements)) AS
placements_count
FROM first_table
LEFT JOIN third_table
ON Json_overlaps(third_table.placements,
Cast(first_table.id AS CHAR))
WHERE first_table.deleted IS NULL
AND third_table.deleted IS NULL
AND Unix_timestamp(third_table.timestamp) >= 1647586800
AND Unix_timestamp(third_table.timestamp) < 1648191600
GROUP BY first_table.id
ORDER BY Lower(first_table.label) ASC
LIMIT 0, 1000
The biggest problem is that these are not sargable:
WHERE ... Unix_timestamp(third_table.timestamp) < 1648191600
ORDER BY Lower(first_table.label)
That is, don't hide a potentially indexed column inside a function call. Instead:
WHERE ... third_table.timestamp < FROM_UNIXTIME(1648191600)
and use a case insensitive COLLATION for first_table.label. That is any collation ending in _ci. (Please provide SHOW CREATE TABLE so I can point that out, and to check the vague "all fields are indexed" -- That usually indicates not knowing the benefits of "composite" indexes.)
Json_overlaps(...) is probably also not sargable. But it gets trickier to fix. Please explain the structure of the json and the types of id and placements.
Do you really need 1000 rows in the output? That is quite large for "pagination".
How big are the tables? UUIDs/GUIDs are notorious when the tables are too big to be cached in RAM.
It is possibly never useful to have both SELECT DISTINCT and GROUP BY. Removing the DISTINCT may speed up the query by avoiding an extra sort.
Do you really want LEFT JOIN, not just JOIN? (I don't understand the query enough to make a guess.)
After you have fixed most of those, and if you still need help, I may have a way to get rid of the GROUP BY by adding a 'derived' table. Later. (Then I may be able to address the "json_overlaps" discussion.)
Please provide EXPLAIN SELECT ...

Temp tables vs subqueries in inner join

Both SQL, return the same results. The first my joins are on the subqueries the second the final queryis a join with a temporary that previously I create/populate them
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN (
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa
) hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
DROP TEMPORARY TABLE IF EXISTS hasRemittance;
CREATE TEMPORARY TABLE hasRemittance
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa;
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
Which will have better performance for a few thousand records?
The two formulations are identical except that your explicit temp table version is 3 sql statements instead of just 1. That is, the overhead of the back and forth to the server makes it slower. But...
Since the implicit temp table is in a LEFT JOIN, that subquery may be evaluated in one of two ways...
Older versions of MySQL were 'dump' and re-evaluated it. Hence slow.
Newer versions automatically create an index. Hence fast.
Meanwhile, you could speed up the explicit temp table version by adding a suitable index. It would be PRIMARY KEY(collegiate_id). If there is a chance of that INNER JOIN producing dups, then say SELECT DISTINCT.
For "a few thousand" rows, you usually don't need to worry about performance.
Oracle has a zillion options for everything. MySQL has very few, with the default being (usually) the best. So ignore the answer that discussed various options that you could use in MySQL.
There are issues with
AND IF(notCollegiate,
c.collegiate_id NOT IN (notCollegiate),
'1=1')
I can't tell which table notCollegiate is in. notCollegiate cannot be a list, so why use IN? Instead simply use !=. Finally, '1=1' is a 3-character string; did you really want that?
For performance (of either version)
remittances needs INDEX(type_id, name, remittance_id) with remittance_id specifically last.
collegiateRemittances needs INDEX(remittance_id) (unless it is the PK).
collegiates needs INDEX(typePayment, active, exentFee , approvedBoard) in any order.
Bottom line: Worry more about indexes than how you formulate the query.
Ouch. Another wrinkle. What is getFee()? If it is a Stored Function, maybe we need to worry about optimizing it?? And what is dateS?
It depends actually. You'll have to test performance of every option. On my website I had 2 tables with articles and comments to them. It turned out it's faster to call comment counts 20 times for each article, than using a single union query. MySQL (like other DBs) caches queries, so small simple queries can run amazingly fast.
I did not saw that you have tagged the question as mysql so I initialy aswered for Oracle. Here is what I think about mySQL.
MySQL
There are two options when it comes to temporary tables Memory or Disk. And for Disk you can have MyIsam - non transactional and InnoDB transactional. Of course you can expect better performance for non transactional type of storage.
Additionaly you need to figure out how big resultset are you dealing with. For small resultset the memory option would be faster for large resultset the disk option would be faster.
Again at the end as in my original answer you need to figure out what performance is good enough and go for the most descriptive and easy to read option.
Oracle
It depends on what kind of temporary table you are dealing with.
You can have session based temporary tables - data is held until logout, or transaction based - data is held until commit . On top of this they can support transaction logging or not support it. Depending on configuration you can get better performance from a temporary table.
As everything in the world performance is relative therm. Most probably for few thousand records it will not do significant difference between the two queries. In which case I would go not for the most performant on but for the most easier to read and understand one.

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL Query. I'm developing a php website, and to avoid making too much queries, I prefer to make a big one looking like :
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here is the approximate rows in each table :
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in dev environment) by deleting old data; but the query still needs 10~15 seconds to execute (which is way too long on a website )
Thanks for your help.
Let's take a look.
The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL can't exploit multiple indexes in a single query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_jouer, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third column in ON conditions.
I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_jouer and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to tables in the mercato_encheres table. The mismatching rows get NULL values for the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.

mysql like operator on integer giving me much better results than = operator

this is the indexes on tblNewsToCity:
I have this query:
SELECT *
FROM `tblnews`
INNER JOIN `tblnewstocity`
ON `tblnews`.`id` = `tblnewstocity`.`fkid`
WHERE `tblnewstocity`.`city` = '233'
AND `tblnews`.`id` != '1771'
ORDER BY `tblnews`.`id` DESC
LIMIT 3 offset 0
which is running slow
but if i am changing tblnewstocitycity = 233 to: tblnewstocitycity like 233
I am getting this good results:
would love to understand why? what Am i missing? why the first query not running good as the second when the second uses the like operator on integers when it should be even slower
MySQL has several options to execute that query:
look for ALL rows with tblnewstocity.city = '233' by using the index, do the join and order by the id. It has to check all rows given by the index, because the first 3 don't have to have the largest tblnews.id.
go through tblnews in the order by-order (from the end), do the join, and look for the FIRST 3 rows that happens to have the right city-value. It can stop after having found 3 rows, as there cannot be larger tblnews.id.
It depends on your data which way is faster. If you have e.g. only 2 rows that fit your index-conditions (and the join), the first one will be faster, because it just have to check a handful of rows, while the second query would have to check the whole table to realize there are only 2. If e.g. all rows would have city = 233, the first query would have to find all (by index, but it will still be all), order them and take the first 3, while the second query will only have to test the first 3 rows, because the are already ordered.
A realistic distribution will lie somewhere in between these possibilities. MySQL has to guess. It guessed that the index (for =) will give you only a little number of rows, so it took option 1. like will make MySQL trust the index less, so it will prefer the second option, which, luckily, was the faster one. But it could have gone the other way too, try e.g. like and = with a value that has no city (well, depending on your mysql server version, the optimizer might check it and might not fall for that, so maybe test it with a value that would only give a handfull of rows without the limit.
Long story short: there are two solutions for that:
either replace INNER JOIN tblnewstocity with straight_join tblnewstocity, this will force mysql to take the second way (but of course risking slow execution for the counter examples)
or add a proper index: tblnewstocity (city, fkid) should take care of the problem. You might have to change ORDER BY tblnews.id DESC to ORDER BY tblnewstocity.fkid DESC, which is equivalent because tblnewstocity.fkid = tblnews.id according to the join-condition, and mysql should realize this on its own, but you never know...
Not sure regarding the LIKE issue, but you should include the filter in the JOIN
SELECT *
FROM `tblnews`
INNER JOIN `tblnewstocity`
ON `tblnews`.`id` = `tblnewstocity`.`fkid`
AND `tblnews`.`id` != '1771'
AND `tblnewstocity`.`city` = '233'
ORDER BY `tblnews`.`id` DESC
LIMIT 3 offset 0

Specific SQL Query Optimization Help Needed

So I'm working on a data mining project where we're looking at code elements and their relationships and changes to these things over time. What we want is to ask some questions about how often related elements are changed. I've set it up as a view, but it's taking like 10 min to run. I believe the problem is that I'm having to do a lot of subtraction, concatenation, and string comparisons to compare entries (for our window size), but I don't know a good way to fix this. The query looks like
select aw.same
, rw.k
, count(distint concat_ws(',', r1.id, r2.id)) as num
from deltamethoddeclaration dmd1
join revision r1
on r1.id=FKrevID
join methodinvocation mi
on mi.FKcallerID = dmd1.FKMDID
join deltamethoddeclaration dmd2
on mi.FKcalleeID = dmd2.FKMDID
join revision r2
on r2.id = dmd2.FKrevID
join revisionwindow rw
join authorwindow aw
where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
and (dmd2.FKrevID - dmd1.FKrevID) < rw.k
and case aw.same
when 1 then
r1.author = r2.author
when 0 then
r1.author <> r2.author
else
1=1
end
group by aw.same
, rw.k
;
Ok, so revisionwindow stores the revision windows we're interested in (10, 20, 50, 100) and authorwindow stores which author types we want (same, different, and don't care). Part of the problem is, we could have the same revision pair with different elements matching, so the only hack i could come up with was that ugly count(distinct concat()) thing. This should return a table with 12 rows, one for each combination of the author and revision windows. The entries under 'num' are the unique pairs of revisions related in the manner specified (in this case, both change methods and one of the methods calls the other). It works perfectly, it's just crazy slow (~10 min running time). I'm basically looking for any advice or help to make this work better without sacrificing accuracy.
where (dmd1.FKrevID - dmd2.FKrevID) < rw.k
The most damaging about this statement is the less than operator < not the arithmetic. B-trees cannot use this and forces a full table scan every time, any time. Gory details why this true: http://explainextended.com/2010/05/19/things-sql-needs-determining-range-cardinality/
I doubt your CASE statement can be optimized by the backend and <> operator suffers from the same problem as above. I would think about ways to join with = operators, perhaps breaking up the query and using UNION statements so you can always use indexes.
Your not using EXPLAIN. You need to start using it to optimize queries. You have no idea what indexes are being used and what are not, or if your condition is selective enough where they would even be helpful (if its not very selective see the last point) http://dev.mysql.com/doc/refman/5.0/en/explain.html
Since this a data mining application you have a great opportunity to use temp tables of intermediate values. Since the data is probably dumped at periodic intervals (or maybe even only once!) it is easy to rebuild the long running temp table every so often without running the risk of data corruption (Or it may just not matter since you looking for aggregate patterns.)
I have taken queries that were running over 60 minutes and reduced them to less than 100 ms (instant) by building temp tables that cached the hard stuff. If you are not able to use any of the ideas above, this is probably the lowest lying fruit. Take all the 'hard stuff' - case joins and non equality joins and do it one place. Then add an index to your temp table :-) The trick is to make it general enough that you can query the temp table so you still have flexibility to ask different questions.
I suspect the two joins (join revisionwindow rw) and (join authorwindow aw) that do not have an ON condition but use the WHERE, cause this.
How many records do these two tables have? MySQL probably does first a CROSS JOIN on these and only later checks the complex (WHERE) conditions.
But please post the results of EXPLAIN.
--EDIT--
Oops, I missed your last paragraph which explains that the two tables have 4 and 3 rows.
Can you try this:
(where the concat has been replaced
and the where clauses have been moved as JOIN ON ...)
select aw.same
, rw.k
, count(distint r1.id, r2.id) as num
from deltamethoddeclaration dmd1
join revision r1
on r1.id = dmd1.FKrevID
join methodinvocation mi
on mi.FKcallerID = dmd1.FKMDID
join deltamethoddeclaration dmd2
on mi.FKcalleeID = dmd2.FKMDID
join revision r2
on r2.id = dmd2.FKrevID
join revisionwindow rw
on (dmd1.FKrevID - dmd2.FKrevID) < rw.k
and (dmd2.FKrevID - dmd1.FKrevID) < rw.k
join authorwindow aw
on case aw.same
when 1 then
r1.author = r2.author
when 0 then
r1.author <> r2.author
else
1=1
end
group by aw.same
, rw.k
;