I am trying to compare values within the same table, which has more than 1,000,000 rows. Below is my query; it takes around 25 seconds to return results.
EXPLAIN SELECT DISTINCT a.studyid,a.number,a.load_number,b.studyid,b.number,b.load_number FROM
(SELECT t1.*, buildnumber,platformid FROM t t1
INNER JOIN testlog t2 ON t1.`testid` = t2.`testid`
WHERE (buildnumber =1031719 AND platformid IN (SELECT platformid FROM platform WHERE platform.`Description` = "Windows 7 SP1"))
)AS a
JOIN
(SELECT t1.*,buildnumber,platformid FROM t t1
INNER JOIN testlog t2 ON t1.`testid` = t2.`testid`
WHERE (buildnumber =1030716 AND platformid IN (SELECT platformid FROM platform WHERE platform.`Description` = "Windows 7 SP1"))
)AS b
ON a.studyid=b.studyid AND a.load_number = b.load_number AND a.number = b.number
Could anyone help me improve this query so that it returns results fast enough?
The problem is that even though I have an index on number and load_number, the query doesn't use it. I don't know why it is always ignored.
Thanks.
First, you have a silly query. You are retrieving six columns, but there are only three values. Look at the on clause.
I think your best bet is to rewrite the query using aggregation, keeping only the groups that contain both build numbers. I think the following is equivalent:
SELECT t1.studyid, t1.load_number, t1.number
FROM t t1 INNER JOIN
testlog t2
ON t1.testid = t2.testid
WHERE t2.buildnumber IN (1031719, 1030716) AND
platformid IN (SELECT platformid FROM platform p WHERE p.Description = 'Windows 7 SP1')
GROUP BY studyid, load_number, number
HAVING MIN(buildnumber) <> MAX(buildnumber)
For this query, you want indexes on platform(Description, platformid) and testlog(buildnumber, platformid) and t(testid).
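As a rough sketch, the indexes suggested above could be created like this. The index names are made up for illustration, and this assumes Description is a VARCHAR column (a TEXT column would need a prefix length):

```sql
-- Index names are hypothetical; use whatever fits your naming scheme.
CREATE INDEX idx_platform_description ON platform (Description, platformid);
CREATE INDEX idx_testlog_build_platform ON testlog (buildnumber, platformid);
CREATE INDEX idx_t_testid ON t (testid);
```

Running EXPLAIN on the query afterwards shows whether the optimizer actually picks them up.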
Problem #1:
IN ( SELECT ... ) optimizes very poorly. The subquery is rerun again and again. It looks like you are expecting exactly one id from that query; if so, change it to = ( SELECT ... ). That way it will be run exactly once.
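A minimal sketch of that change, applied to one of the two subqueries from the question. It assumes buildnumber and platformid live in testlog, and that exactly one platform row has this description; if the scalar subquery returns more than one row, MySQL raises an error instead of returning results:

```sql
SELECT t1.*, t2.buildnumber, t2.platformid
FROM t t1
INNER JOIN testlog t2 ON t1.testid = t2.testid
WHERE t2.buildnumber = 1031719
  -- "=" instead of "IN": the subquery must return a single platformid
  AND t2.platformid = (SELECT platformid
                       FROM platform
                       WHERE Description = 'Windows 7 SP1');
```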
Problem #2:
FROM ( SELECT ... )
JOIN ( SELECT ... ) ON ...
optimizes poorly because neither subquery has an index for the ON clause to use. Can you merge the two subqueries into one, as Gordon was trying? If not, then put one of them into a TEMPORARY TABLE and add an appropriate index to that table so that the ON will be able to use it. Probably PRIMARY KEY(studyid, load_number, number).
Footnote: The latest versions of MySQL have made improvements on these problems by dynamically generating indexes. What version are you using?
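The temporary-table approach from Problem #2 might look roughly like the sketch below. The names tmp_b and idx_join are hypothetical, and buildnumber and platformid are assumed to be columns of testlog. A plain INDEX is used here rather than the suggested PRIMARY KEY, because nothing in the question says the three columns are unique and non-NULL:

```sql
-- Materialize one side of the join with an index on the join columns.
CREATE TEMPORARY TABLE tmp_b (
  INDEX idx_join (studyid, load_number, number)
)
SELECT t1.studyid, t1.load_number, t1.number
FROM t t1
INNER JOIN testlog t2 ON t1.testid = t2.testid
WHERE t2.buildnumber = 1030716
  AND t2.platformid IN (SELECT platformid FROM platform
                        WHERE Description = 'Windows 7 SP1');

-- Join the other side against the indexed temporary table.
SELECT DISTINCT a.studyid, a.number, a.load_number
FROM (
  SELECT t1.studyid, t1.load_number, t1.number
  FROM t t1
  INNER JOIN testlog t2 ON t1.testid = t2.testid
  WHERE t2.buildnumber = 1031719
    AND t2.platformid IN (SELECT platformid FROM platform
                          WHERE Description = 'Windows 7 SP1')
) AS a
JOIN tmp_b AS b
  ON a.studyid = b.studyid
 AND a.load_number = b.load_number
 AND a.number = b.number;

DROP TEMPORARY TABLE tmp_b;
```

If the three columns do turn out to be unique and NOT NULL, the index can be swapped for the PRIMARY KEY mentioned above.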
Related
I have the following query, which was developed from a hint found online because of a problem with a GROUP BY returning the maximum value; but it's running really slowly.
Having looked online I'm seeing that WHERE IN (SELECT.... GROUP BY) is probably the issue, but, to be honest, I'm struggling to find a way around this:
SELECT *
FROM tbl_berths a
JOIN tbl_active_trains b on a.train_uid=b.train_uid
WHERE (a.train_id, a.TimeStamp) in (
SELECT a.train_id, max(a.TimeStamp)
FROM tbl_berths a
GROUP BY a.train_id
)
I'm thinking I possibly need a derived table, but my experience in this area is zero and it's just not working out!
You can move that into a derived table (a subquery in the FROM clause) and join to it, and also select only the required columns instead of all (*):
SELECT a.train_uid
FROM tbl_berths a
JOIN tbl_active_trains b on a.train_uid=b.train_uid
JOIN (SELECT a.train_id, max(a.TimeStamp) as TimeStamp
FROM tbl_berths a
GROUP BY a.train_id )T
on a.train_id = T.train_id
and a.TimeStamp = T.TimeStamp
I'm quite sloppy with databases, can't get this working with joins, and I'm not even sure that would be faster...
DELETE FROM atable
WHERE btable_id IN (SELECT id
FROM btable
WHERE param > 2)
AND ctable_id IN (SELECT id
FROM ctable
WHERE ( someblob LIKE '%_ID1_%'
OR someblob LIKE '%_ID2_%' ))
Atable contains ~19M rows, this would delete ~3M of that. At the moment, I can only run the query with LIMIT 100000, and I don't want to sit here with phpmyadmin all day, because each deletion (of 100.000 rows) runs for about 1.5 mins.
Any ways to speed this up / automate it?
MySQL 5.5
(do you think it's already bad DB design if any table contains 20M rows?)
Use EXISTS or JOIN instead of IN to improve performance.
Using EXISTS:
-- No alias on Atable: MySQL 5.5 does not accept an alias in a single-table DELETE
DELETE FROM Atable
WHERE EXISTS (SELECT 1 FROM Btable B WHERE Atable.Btable_id = B.id AND B.param > 2) AND
EXISTS (SELECT 1 FROM Ctable C WHERE Atable.Ctable_id = C.id AND (C.someblob LIKE '%_ID1_%' OR C.someblob LIKE '%_ID2_%'))
Using JOIN:
DELETE A
FROM Atable A
INNER JOIN Btable B ON A.Btable_id = B.id AND B.param > 2
INNER JOIN Ctable C ON A.Ctable_id = C.id AND (C.someblob LIKE '%_ID1_%' OR C.someblob LIKE '%_ID2_%')
First, you should try EXISTS instead of IN. It's faster in many cases.
Then you could try an inner join instead of IN and EXISTS.
Example :
delete a
from a
inner join b on b.id = a.tablebid
Finally, if possible (I don't know whether you also have ID3 and other IDs to match), try replacing the OR with something else. Sometimes an odd-looking, more complicated rewrite helps the optimizer: CASE WHEN, a subquery, and so on.
I don't see where a simple index would help much. I'd do:
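delete from atable where id in (
-- extra derived table: MySQL refuses to read the delete target directly in a subquery (error 1093)
select id from (
select
a.id
from
atable a
join btable b on a.btable_id = b.id
join ctable c on a.ctable_id = c.id
where
b.param > 2
and (
c.someblob LIKE '%_ID1_%'
OR c.someblob LIKE '%_ID2_%'
)
) as ids
)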
Correction: I'm assuming you've got indexes on btable and ctable's id's (probably, if they're primary keys...) and on b.param (if it's numeric).
Besides optimizing the query, you could also look at making good use of indexes, since they might prevent a full table scan.
For BTable, for example, create an index on id and param.
To explain why this helps:
If the database has to look up the id and param values in the table in an unsorted manner, it has to read ALL rows. If the database reads the index, which is SORTED, it can look up the id and param at reduced cost.
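A sketch of such an index, with a hypothetical name. Putting param first is an assumption on my part: it lets the range condition param > 2 use the index, with id available from the same index entry for the join back to atable:

```sql
-- idx_btable_param_id is a made-up name.
CREATE INDEX idx_btable_param_id ON btable (param, id);
```

Note that an ordinary index on ctable.someblob would not help the LIKE '%_ID1_%' conditions, because a pattern with a leading wildcard cannot use a B-tree index.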
Here's an SQL statement (actually two statements) that works -- it's taking a series of matching rows and adding a delivery_number which increments for each row:
SELECT @i:=0;
UPDATE pipeline_deliveries AS d
SET d.delivery_number = @i:=@i+1
WHERE d.pipelineID = 11
ORDER BY d.setup_time;
But now, the client no longer wants them ordered by setup_time. They needed to be ordered according to departure time, which is a field in another table. I can't figure out how to do this.
The MySQL docs, as well as this answer, suggest that in version 4.0 and up (we're running MySQL 5.0) I should be able to do this:
SELECT @i:=0;
UPDATE pipeline_deliveries AS d RIGHT JOIN pipeline_routesXdeliveryID AS rXd
ON d.pipeline_deliveryID = rXd.pipeline_deliveryID
LEFT JOIN pipeline_routes AS r
ON rXd.pipeline_routeID = r.pipeline_routeID
SET d.delivery_number = @i:=@i+1
WHERE d.pipelineID = 11
ORDER BY r.departure_time,d.pipeline_deliveryID;
but I get the error #1221 - Incorrect usage of UPDATE and ORDER BY.
So what's the correct usage?
You can't mix UPDATE joining 2 (or more) tables and ORDER BY.
You can bypass the limitation, with something like this:
UPDATE
pipeline_deliveries AS upd
JOIN
( SELECT t.pipeline_deliveryID,
@i := @i+1 AS row_number
FROM
( SELECT @i:=0 ) AS dummy
CROSS JOIN
( SELECT d.pipeline_deliveryID
FROM
pipeline_deliveries AS d
JOIN
pipeline_routesXdeliveryID AS rXd
ON d.pipeline_deliveryID = rXd.pipeline_deliveryID
LEFT JOIN
pipeline_routes AS r
ON rXd.pipeline_routeID = r.pipeline_routeID
WHERE
d.pipelineID = 11
ORDER BY
r.departure_time, d.pipeline_deliveryID
) AS t
) AS tmp
ON tmp.pipeline_deliveryID = upd.pipeline_deliveryID
SET
upd.delivery_number = tmp.row_number ;
The above uses two features of MySQL, user defined variables and ordering inside a derived table. Because the latter is not standard SQL, it may very well break in a future release of MySQL (when the optimizer is clever enough to figure out that ordering inside a derived table is useless unless there is a LIMIT clause). In fact, the query would do exactly that in the latest versions of MariaDB (5.3 and 5.5): it would run as if the ORDER BY were not there, and the results would not be as expected. See a related question at the MariaDB site: GROUP BY trick has been optimized away.
The same may very well happen in any future release of mainstream MySQL (maybe in 5.6, anyone care to test this?) that improves the optimizer code.
So, it's better to write this in standard SQL. The best would be window functions, which haven't been implemented yet. But you could also use a self-join, which will not be too bad for efficiency as long as the update affects only a small subset of rows.
UPDATE
pipeline_deliveries AS upd
JOIN
( SELECT t1.pipeline_deliveryID
, COUNT(*) AS row_number
FROM
( SELECT d.pipeline_deliveryID
, r.departure_time
FROM
pipeline_deliveries AS d
JOIN
pipeline_routesXdeliveryID AS rXd
ON d.pipeline_deliveryID = rXd.pipeline_deliveryID
LEFT JOIN
pipeline_routes AS r
ON rXd.pipeline_routeID = r.pipeline_routeID
WHERE
d.pipelineID = 11
) AS t1
JOIN
( SELECT d.pipeline_deliveryID
, r.departure_time
FROM
pipeline_deliveries AS d
JOIN
pipeline_routesXdeliveryID AS rXd
ON d.pipeline_deliveryID = rXd.pipeline_deliveryID
LEFT JOIN
pipeline_routes AS r
ON rXd.pipeline_routeID = r.pipeline_routeID
WHERE
d.pipelineID = 11
) AS t2
ON t2.departure_time < t1.departure_time
OR t2.departure_time = t1.departure_time
AND t2.pipeline_deliveryID <= t1.pipeline_deliveryID
OR t1.departure_time IS NULL
AND ( t2.departure_time IS NOT NULL
OR t2.departure_time IS NULL
AND t2.pipeline_deliveryID <= t1.pipeline_deliveryID
)
GROUP BY
t1.pipeline_deliveryID
) AS tmp
ON tmp.pipeline_deliveryID = upd.pipeline_deliveryID
SET
upd.delivery_number = tmp.row_number ;
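The answer above was written before MySQL had window functions. On MySQL 8.0 or MariaDB 10.2 and later, which do support them, the same numbering can be sketched with ROW_NUMBER(). The alias rn is my own choice, used because ROW_NUMBER is a reserved word in those versions. Note that NULL departure times sort first here, unlike in the self-join version, which places them last:

```sql
UPDATE pipeline_deliveries AS upd
JOIN (
  SELECT d.pipeline_deliveryID,
         -- number the rows in the required order
         ROW_NUMBER() OVER (ORDER BY r.departure_time,
                                     d.pipeline_deliveryID) AS rn
  FROM pipeline_deliveries AS d
  JOIN pipeline_routesXdeliveryID AS rXd
    ON d.pipeline_deliveryID = rXd.pipeline_deliveryID
  LEFT JOIN pipeline_routes AS r
    ON rXd.pipeline_routeID = r.pipeline_routeID
  WHERE d.pipelineID = 11
) AS tmp
  ON tmp.pipeline_deliveryID = upd.pipeline_deliveryID
SET upd.delivery_number = tmp.rn;
```

Because the ordering is part of the window definition, it does not depend on the optimizer preserving an ORDER BY inside a derived table.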
Based on this documentation
For the multiple-table syntax, UPDATE updates rows in each table named
in table_references that satisfy the conditions. In this case, ORDER
BY and LIMIT cannot be used.
Without knowing too much about MySQL you could open up a cursor and process this row by row, or by passing it back to the client code (PHP,Java, etc) that you maintain to handle this processing.
After more digging:
To eliminate the badly optimized subquery, you need to rewrite the
subquery as a join, but how can you do that and retain the LIMIT and
ORDER BY? One way is to find the rows to be updated in a subquery in
the FROM clause, so the LIMIT and ORDER BY can be nested inside the
subquery. In this way work_to_do is joined against the ten
highest-priority unclaimed rows of itself. Normally you can’t
self-join the update target in a multi-table UPDATE, but since it’s
within a subquery in the FROM clause, it works in this case.
update work_to_do as target
inner join (
select w.client, work_unit
from work_to_do as w
inner join eligible_client as e on e.client = w.client
where processor = 0
order by priority desc
limit 10
) as source on source.client = target.client
and source.work_unit = target.work_unit
set processor = @process_id;
There is one downside: the rows are not locked in primary key order.
This may help explain the occasional deadlock we get on this table
The hard way:-
ALTER TABLE eav_attribute_option
ADD temp_value TEXT NOT NULL
AFTER sort_order;
UPDATE eav_attribute_option o
JOIN eav_attribute_option_value ov ON o.option_id=ov.option_id
SET o.temp_value = ov.value
WHERE o.attribute_id=90;
SET @x = 0;
UPDATE eav_attribute_option
SET sort_order = (@x:=@x+1)
WHERE attribute_id=90
ORDER BY temp_value ASC;
ALTER TABLE eav_attribute_option
DROP temp_value;
I'm having some trouble (both technical and conceptual) optimizing some very slow queries.
This is my original query:
select Con_Progr,
Pre_Progr
from contribuente,
preavviso_ru,
comune,
via
where [lots of where clauses]
and ((Via_Progr = Con_Via_Sec) or (Via_Progr = Con_Via_Res and (Con_Via_Sec is null or Con_Via_Sec = '0')))
order by Con_Cognome,
Con_Nome asc;
This query takes about 38secs to execute, which is a really slow time. I manipulated it a bit and managed to speed up to about 0.1sec and now the query looks like this:
(select Con_Progr,
Pre_Progr
from preavviso_ru
join contribuente
on Pre_Contribuente = Con_Progr
join via
on Via_Progr = Con_Via_Sec
join comune
on Via_Comune = Com_CC
where [lots of where clauses]
order by Con_Cognome,
Con_Nome asc
)
union
(
select Con_Progr,
Pre_Progr
from preavviso_ru
join contribuente
on Pre_Contribuente = Con_Progr
join via
on Via_Progr = Con_Via_Res
join comune
on Via_Comune = Com_CC
where [lots of where clauses]
and (Con_Via_Sec is null or Con_Via_Sec = '0')
order by Con_Cognome,
Con_Nome asc
)
As you can see I split up the where clause in the original query that used an OR operator in two different subqueries and then merged them. That resolved the speed problem. The result though is not perfect, 'cause I've lost the ordering. I tried to select the columns in the subqueries and then perform the ordering on that result, like this:
select Con_Progr,
Pre_Progr
from (
[FIRST SUBQUERY]
) as T1 union (
[SECOND SUBQUERY]
) as T2
order by Con_Cognome,
Con_Nome asc
but I get a syntax error near 'union'. Any suggestion?
This was the technical issue. Now for the conceptual, I reckon that the two subqueries are VERY similar (they only differ by a join clause and a where clause), is there a way to rearrange the second query (the fast one) in a more elegant way?
I resolved the technical issue: I had misplaced the parentheses.
select Con_Progr,
Pre_Progr
from (
[FIRST SUBQUERY]
union
[SECOND SUBQUERY]
) as T
order by Con_Cognome,
Con_Nome asc
Now it works almost perfectly (there's still a bit of discrepancy in the ordering, but it's not a problem).
As for the efficiency of the query, I found that the problem is the OR condition on two different columns, since MySQL 4.0 (which I'm using for backwards compatibility reasons) allows only one index per table. But still, why doesn't it use at least one of those indexes? I'll test a bit more...
I'm trying to generate a report using CodeIgniter and Datatables.net.
Now I'm trying to get the number of closed jobs (it's a human resources system). I used to query all jobs and then loop over them in PHP with a foreach to do the calculations.
Because I want to use all the features of Datatables (sorting specifically), I'm trying to do all the calculations in MySQL.
The problem is that the second subquery is very, very slow.
SELECT
jobs.jobs_id, clients.nome_fantasia, concat_ws(' ', user_profiles.first_name, user_profiles.last_name) as fullname,
jobs.titulo_vaga, jobs.qtd_vagas, company.name as nome_company, jobs_status.name as status_name, DATEDIFF(NOW(), jobs.data_abertura) as date_idade,
(select count(job_cv.jobs_id) from job_cv where job_cv.jobs_id = jobs.jobs_id) as qtd_int,
(select count(distinct job_cv.user_id) from job_cv_history join job_cv on job_cv.job_cv_id = job_cv_history.job_cv_id where job_cv_history.status = '11' and job_cv.jobs_id = jobs.jobs_id ) as fechadas
FROM (jobs)
JOIN clients ON clients.clients_id=jobs.clients_id
JOIN user_profiles ON jobs.consultor_id=user_profiles.user_id
JOIN jobs_status ON jobs.status=jobs_status.jobs_status_id
JOIN company ON jobs.company_id=company.company_id
LIMIT 50
Can someone help me? I can provide more information if needed.
UPDATE
The idea of using a JOIN instead of a subquery in the SELECT list works for the first subquery but not for the second. Is there a way to pass a 'variable' into the subquery, such as the current jobs_id?
UPDATE AGAIN
This query works fine by itself, but used as a joined subquery it takes about a minute and returns wrong values:
SELECT job_cv.jobs_id,count(distinct job_cv.user_id) AS fechadas
FROM job_cv_history
JOIN job_cv
ON job_cv.job_cv_id = job_cv_history.job_cv_id
WHERE job_cv_history.status = '11'
GROUP BY job_cv.jobs_id
It is not the subquery itself that is slow. It's the fact that you're executing these subqueries once for each row returned by the outer query. Move them into joins instead, and you should see an increase in performance.
SELECT
jobs.jobs_id, clients.nome_fantasia, concat_ws(' ', user_profiles.first_name, user_profiles.last_name) as fullname,
jobs.titulo_vaga, jobs.qtd_vagas, company.name as nome_company, jobs_status.name as status_name, DATEDIFF(NOW(), jobs.data_abertura) as date_idade,
qtd.qtd_int,
fechadas.fechadas
FROM (jobs)
JOIN clients ON clients.clients_id=jobs.clients_id
JOIN user_profiles ON jobs.consultor_id=user_profiles.user_id
JOIN jobs_status ON jobs.status=jobs_status.jobs_status_id
JOIN company ON jobs.company_id=company.company_id
JOIN (
SELECT jobs_id, count(jobs_id) AS qtd_int FROM job_cv GROUP BY jobs_id
) AS qtd ON qtd.jobs_id = jobs.jobs_id
JOIN (
SELECT job_cv.jobs_id, count(distinct job_cv.user_id) AS fechadas
FROM job_cv_history
JOIN job_cv
ON job_cv.job_cv_id = job_cv_history.job_cv_id
WHERE job_cv_history.status = '11'
GROUP BY job_cv.jobs_id
) AS fechadas ON fechadas.jobs_id = jobs.jobs_id
LIMIT 50
You may try to create these indexes:
ALTER TABLE `job_cv` ADD INDEX `job_cv_cindex` (`job_cv_id` ASC, `jobs_id` ASC, `user_id` ASC);
ALTER TABLE `job_cv_history` ADD INDEX `job_cv_history_cindex` (`job_cv_id` ASC, `status` ASC);
Use joins instead of subqueries. This often improves performance significantly in MySQL.
Try using a LEFT JOIN in your case and see whether performance improves.
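A sketch of what the LEFT JOIN variant could look like, trimmed to the two counts for brevity. It pre-aggregates each count once per jobs_id and uses COALESCE so that jobs with no matching rows show 0, as the original correlated subqueries did, instead of being dropped by an inner join. The alias f is my own choice:

```sql
SELECT jobs.jobs_id,
       COALESCE(qtd.qtd_int, 0) AS qtd_int,
       COALESCE(f.fechadas, 0)  AS fechadas
FROM jobs
LEFT JOIN (
  -- one row per job: number of job_cv rows
  SELECT jobs_id, COUNT(jobs_id) AS qtd_int
  FROM job_cv
  GROUP BY jobs_id
) AS qtd ON qtd.jobs_id = jobs.jobs_id
LEFT JOIN (
  -- one row per job: distinct users whose history has status '11'
  SELECT job_cv.jobs_id, COUNT(DISTINCT job_cv.user_id) AS fechadas
  FROM job_cv_history
  JOIN job_cv ON job_cv.job_cv_id = job_cv_history.job_cv_id
  WHERE job_cv_history.status = '11'
  GROUP BY job_cv.jobs_id
) AS f ON f.jobs_id = jobs.jobs_id
LIMIT 50;
```

The remaining joins and columns from the original query can be added back unchanged.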