Slow MySQL query, join on huge table, not using indexes

SELECT archive.id, archive.file, archive.create_timestamp, archive.spent
FROM archive LEFT JOIN submissions ON archive.id = submissions.id
WHERE submissions.id is NULL
AND archive.file is not NULL
AND archive.create_timestamp < DATE_SUB(NOW(), INTERVAL 6 month)
AND spent = 0
ORDER BY archive.create_timestamp ASC LIMIT 10000
EXPLAIN result:
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
| 1 | SIMPLE | archive | range | create_timestamp,file_in | create_timestamp | 4 | NULL | 111288502 | Using where |
| 1 | SIMPLE | submissions | eq_ref | PRIMARY | PRIMARY | 4 | production.archive.id | 1 | Using where; Using index; Not exists |
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
I've tried hinting index usage for the archive table with:
USE INDEX (create_timestamp,file_in)
The archive table is huge, ~150 million records.
Any help with speeding up this query would be greatly appreciated.

You want to use a composite index. For this query:
create index archive_file_spent_createts on archive(file, spent, create_timestamp);
In such an index, you want the columns compared with = to come first, followed by at most one column used in an inequality. In this case, I'm not sure whether MySQL will use the index for the ORDER BY.
This assumes that the where conditions do, indeed, significantly reduce the size of the data.
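One way to check that concern is to re-run EXPLAIN once the index exists. The sketch below simply wraps the question's own query and forces the index named above; if the Extra column still shows "Using filesort", the index is not being used for the sort.
EXPLAIN
SELECT archive.id, archive.file, archive.create_timestamp, archive.spent
FROM archive FORCE INDEX (archive_file_spent_createts)
LEFT JOIN submissions ON archive.id = submissions.id
WHERE submissions.id IS NULL
AND archive.file IS NOT NULL
AND archive.create_timestamp < DATE_SUB(NOW(), INTERVAL 6 MONTH)
AND archive.spent = 0
ORDER BY archive.create_timestamp ASC LIMIT 10000;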

Related

Speed up MYSQL query that has order by

I've got 300,000 rows in 'tblmessages', and I'm trying to run this query.
If I don't use "order by msgId desc" it runs very fast, but when I add the ordering it's very, very slow.
What am I missing...?
My indexes are the ones listed in the EXPLAIN output below.
SELECT msgId
FROM tblmessages
left join wp_users as a on a.ID = (CASE WHEN (msgFromUserId=1) then tblmessages.msgToUserId else tblmessages.msgFromUserId END)
left join tblforum_users u1 on u1.user_ID =(CASE WHEN (msgFromUserId=1) then tblmessages.msgToUserId else tblmessages.msgFromUserId END)
where (msgFromUserId=1 or msgToUserId=1)
order by msgId desc limit 0,20
EXPLAIN output:
+----+-------------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+----------+-----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+----------+-----------------------------------------------------+
| 1 | SIMPLE | tblmessages | NULL | index_merge | IX_tblmessages_from,IX_tblmessages_to | IX_tblmessages_from,IX_tblmessages_to | 5,5 | NULL | 726454 | 100.00 | Using union(IX_tblmessages_from,IX_tblmessages_to); Using where; Using filesort |
| 1 | SIMPLE | a | NULL | eq_ref | PRIMARY | PRIMARY | 8 | func | 1 | 100.00 | Using where; Using index |
| 1 | SIMPLE | u1 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where; Using index |
+----+-------------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+----------+-----------------------------------------------------+
and more information about tblmessages:
Data 12.1 GiB
Index 190.8 MiB
Total 12.3 GiB
There are 2 reasons for this:
If I don't use "order by msgId desc" it runs very fast, but when I add the ordering it's very, very slow.
The first is that your query has a LIMIT 20 at the end. Ordering the results (via ORDER BY) requires every row that satisfies the WHERE clause to be fetched and sorted before the correct 20 rows can be picked.
The second is somewhat subjective, but I'll say your tables don't have the indexing needed to support optimization of this query. Put another way, a query like this will generally run slowly because of the multiple joins on CASE expressions. My advice is either to simplify the query into multiple queries (see the sketch below), or to update your question with SHOW CREATE TABLE output for the relevant tables along with one of those INFORMATION_SCHEMA queries that summarizes indexes, partitions, etc. From there we could potentially improve the design (e.g. by adding multi-column indexes).
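As a hedged sketch of that "multiple queries" idea: only msgId is selected and the LEFT JOINs never remove rows, so the OR can usually be split into two halves that each use one of the existing single-column indexes and then be merged. This assumes IX_tblmessages_from and IX_tblmessages_to index msgFromUserId and msgToUserId respectively; verify it returns the same ids before relying on it.
-- each half can use its own index and stop after 20 rows; the outer ORDER BY/LIMIT merges them
(SELECT msgId FROM tblmessages WHERE msgFromUserId = 1 ORDER BY msgId DESC LIMIT 20)
UNION
(SELECT msgId FROM tblmessages WHERE msgToUserId = 1 ORDER BY msgId DESC LIMIT 20)
ORDER BY msgId DESC
LIMIT 0, 20;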

mysql slow complex query with order by

The query below, even without the ORDER BY, is very slow and I can't figure out why. I'm guessing it's the WHERE on date_affidavit_filed, but how can I make it fast with that ORDER BY as well? Perhaps a subselect on the job ids that match the WHERE, passed into the rest of the query, but I still need to order by the server name (servername) like this. Any suggestions?
explain select sql_no_cache court_county, job.id as jid, job_status,
DATE_FORMAT(job.datetime_served, '%m/%d/%Y') as dserved ,
CONCAT(server.namefirst, ' ', server.namelast) as servername, client_name,
DATE_FORMAT(job.datetime_received, '%m/%d/%Y') as dtrec ,
DATE_FORMAT(job.datetime_give2server, '%m/%d/%Y') as dtg2s,
DATE_FORMAT(date_kase_filed, '%m/%d/%Y') as dkf,
DATE_FORMAT(job.date_sent_to_court, '%m/%d/%Y') as dtstc ,
TO_DAYS(datetime_served )-TO_DAYS(date_kase_filed) as totaldays from job
left join kase on kase.id=job.kase_id
left join server on job.server_id=server.id
left join client on kase.client_id=client.id
left join LUcourt on LUcourt.id=kase.court_id
where date_affidavit_filed is not null and date_affidavit_filed !='' order by servername;
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | kase | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.kase_id | 1 | |
| 1 | SIMPLE | server | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.server_id | 1 | |
| 1 | SIMPLE | client | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.client_id | 1 | |
| 1 | SIMPLE | LUcourt | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.court_id | 1 | |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
Check that you have indexes on the following columns: job.kase_id and job.server_id.
Also, you are ordering by a calculated field, which is not optimal. Consider ordering by an indexed field instead.
If you need to preserve that exact sort, you might want to add a column in the DB for that value, and either populate it with appropriate values or set up a trigger on the DB to populate it for you automatically.
This can speed up the order by:
CREATE INDEX namefull ON server (namefirst,namelast);
if you do ORDER BY server.namefirst, server.namelast instead of ORDER BY servername, which should produce the same output.
You can also create indexes on each table for any field you are left joining; that can improve the performance of your query too.
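For example (a sketch; the index names are made up, and the joined-to primary keys already exist according to the EXPLAIN, so only the referencing columns are shown):
CREATE INDEX idx_job_kase_id ON job (kase_id);
CREATE INDEX idx_job_server_id ON job (server_id);
CREATE INDEX idx_kase_client_id ON kase (client_id);
CREATE INDEX idx_kase_court_id ON kase (court_id);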
When you write,
where date_affidavit_filed is not null and date_affidavit_filed !=''
you practically are selecting most of the rows, or at least so many that it is not worthwhile to go through the index. The query planner sees that there is an index involving date_affidavit_filed but decides not to use it and to scan the table with the WHERE clause instead; since the WHERE only involves date_affidavit_filed, we know it's not a missing-key issue, it must be a cardinality issue.
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
You can try optimizing this by creating an index on
date_affidavit_filed, kase_id, server_id
in that order. How many rows are returned by the query?
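As DDL, that suggestion would look roughly like this (the index name is made up):
CREATE INDEX idx_job_affidavit_kase_server ON job (date_affidavit_filed, kase_id, server_id);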
You are really selecting everything that isn't empty.
That effectively means everything.
I don't know how many rows of data you have, but it's a lot to go through.
Try narrowing your query to a date range or specific client.
If you really need everything, don't output it one row at a time. Build up one big string, with all the formatting, in the software you use for output, and once you've finished looping through the results, output the whole thing in one go.
You could also use paging.
Just add LIMIT 0,30 on page 1, LIMIT 30,30 on page 2, etc., and let the end user walk through the pages.
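For instance (a sketch of the paging idea on a trimmed-down version of the query; 30 rows per page, with the offset being page number times 30):
-- page 1
SELECT job.id AS jid, CONCAT(server.namefirst, ' ', server.namelast) AS servername
FROM job
LEFT JOIN server ON job.server_id = server.id
WHERE date_affidavit_filed IS NOT NULL AND date_affidavit_filed != ''
ORDER BY server.namefirst, server.namelast
LIMIT 0, 30;
-- page 2 would use LIMIT 30,30, page 3 LIMIT 60,30, and so on.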

joining table in mysql not using index properly?

I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indexes. For table a, (dte, permno) identifies a unique record; for table b, dte identifies a unique record; for table c, (yr, permno) identifies a unique record; and for table d, (dte, permno) identifies a unique record. The EXPLAIN for the SELECT part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| 1 | SIMPLE | d | ALL | idx1 | NULL | NULL | NULL | 264129 | |
| 1 | SIMPLE | c | ref | idx2 | idx2 | 4 | achernya.d.permno | 16 | |
| 1 | SIMPLE | b | ALL | PRIMARY,idx2 | NULL | NULL | NULL | 12336 | Using join buffer |
| 1 | SIMPLE | a | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7 | achernya.b.dte,achernya.d.permno | 1 | Using where |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does MySQL have to read so many rows to process this? And if I am reading this correctly, it has to read (264129 * 16 * 12336) rows, which should take a good month.
Could someone please explain what's going on here?
MySQL has to read the rows because you're using functions as your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, then put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
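A minimal sketch of that denormalization, assuming the new column and index names below (they are made up) and that a.dte is a date/datetime column:
ALTER TABLE a ADD COLUMN dte_year SMALLINT;
UPDATE a SET dte_year = YEAR(dte);
CREATE INDEX idx_a_year_permno ON a (dte_year, permno);
-- the join on c can then compare plain columns instead of YEAR(a.dte):
--   inner join c on (a.dte_year = c.yr and a.permno = c.permno)
The same treatment would apply to d, since the join also computes year(d.dte).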
As for the other columns in your index that you don't apply functions to, they may not be used if the index won't provide much benefit, or they aren't the leftmost column in the index and you don't use the leftmost prefix of that index in your join condition.
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

Why scan type is changed from ALL to RANGE when using LIMIT on SQL queries + Optimize query

I have this query
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
INNER JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
INNER JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
GROUP BY l.licitatii_id
ORDER BY data_publicarii DESC
Explain outputs:
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | ALL | PRIMARY,key_status_tip_domeniu | NULL | NULL | NULL | 120 | 85.83 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
As you can see, type=ALL for the d table.
Now if I add LIMIT 100 to the query,
the plan changes to range:
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | range | PRIMARY,key_status_tip_domeniu | key_status_tip_domeniu | 9 | NULL | 103 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
Why does this happen?
Can this query be optimized further? Both queries take 13 seconds.
The table schema is visible in a gist on GitHub.
MySQL chooses domenii as the leading table for the join.
This table is filtered on (status, tip_domeniu) = (1, 1).
It does not seem to be a very selective condition, so normally a full table scan with filtering would be preferred over the index scan.
We can see that MySQL expects 120 records to be returned from domenii for which this condition would hold.
When you add a LIMIT, the number of records expected to be processed is decreased, and MySQL considers the index scan more efficient for this.
Note that this condition:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita) AS DATE))) < '1300683793'
is not sargable, so it deprives the optimizer of the chance to use an index on data_limita.
Create the following indexes:
licitatii_ue (status, data_limita)
licitatii_ue (status, data_publicarii)
and rewrite the query like this:
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND l.data_limita < FROM_UNIXTIME(((1300683793 - 86400) div 86400) * 86400)
GROUP BY
l.licitatii_id
ORDER BY
data_publicarii DESC
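The two suggested indexes, written out as DDL (a sketch; the index names are made up):
CREATE INDEX idx_licitatii_status_limita ON licitatii_ue (status, data_limita);
CREATE INDEX idx_licitatii_status_pub ON licitatii_ue (status, data_publicarii);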
Ah, the mysteries of the query optimizer are many and unknowable...
At a quick glance, the most obvious thing to optimize might be the
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
clause.
Depending on the number of records in the licitatii_ue table, this looks like an expensive operation, and it will bypass any available indexes.
ALL is a table scan; range is a range scan (chosen because of the LIMIT). There is nothing bad about that; in fact it also causes a key to be used (key_status_tip_domeniu).
The reason you are slow is, most likely, that you are using ORDER BY data_publicarii DESC (this is easy to test: just drop the ORDER BY and benchmark the query; I would expect a difference of a few orders of magnitude).
MySQL admits (in the Extra column of EXPLAIN) that it is using filesort (needed for the ORDER BY because it can't, or doesn't know how to, use an index for it). Adding yet another index to the mix might help, especially if you confirm that the ORDER BY is what makes it slow.
EDIT
Actually, you do have a cardinal sin in your query:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
Avoid applying any functions to your field values if you can apply them to a constant. So switch it around and rewrite it as
l.data_limita < some_function('1300683793')
However complex some_function is, it will be calculated only once, and the MySQL planner will know it is a constant. The way you wrote it forces MySQL to apply unix_timestamp, timestampadd, cast and from_unixtime to the value of data_limita in every row. On I/O-bound systems this will usually just burn some extra CPU cycles while waiting for the disks to spin around (although it can become significant: your system might become CPU-bound, which is just a bad thing). The biggest difference is that you lose the possibility of using an index on data_limita.
Finally, all your indexes are single-field indexes. MySQL does some index merging, but it is not stellar at it. You might want to try creating indexes that cover all your conditions and the sort order (in order of selectivity for the target query).

MySQL query optimization - distinct, order by and limit

I am trying to optimize the following query:
select distinct this_.id as y0_
from Rental this_
left outer join RentalRequest rentalrequ1_
on this_.id=rentalrequ1_.rental_id
left outer join RentalSegment rentalsegm2_
on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
this_.DTYPE='B'
and this_.id<=1848978
and this_.billingStatus=1
and rentalsegm2_.endDate between 1273631699529 and 1274927699529
order by rentalsegm2_.id asc
limit 0, 100;
This query is run multiple times in a row for paginated processing of records (with a different limit each time). It returns the ids I need for the processing. My problem is that this query takes more than 3 seconds. I have about 2 million rows in each of the three tables.
Explain gives:
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | rentalsegm2_ | range | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9 | NULL | 449904 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest | PRIMARY | 8 | solscsm_main.rentalsegm2_.rentalRequest_id | 1 | Using where |
| 1 | SIMPLE | this_ | eq_ref | PRIMARY,index_billingStatus | PRIMARY | 8 | solscsm_main.rentalrequ1_.rental_id | 1 | Using where |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
I tried removing the distinct and the query ran three times faster. EXPLAIN without the distinct gives:
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
| 1 | SIMPLE | rentalsegm2_ | range | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9 | NULL | 451972 | Using where; Using filesort |
| 1 | SIMPLE | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest | PRIMARY | 8 | solscsm_main.rentalsegm2_.rentalRequest_id | 1 | Using where |
| 1 | SIMPLE | this_ | eq_ref | PRIMARY,index_billingStatus | PRIMARY | 8 | solscsm_main.rentalrequ1_.rental_id | 1 | Using where |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
As you can see, the Using temporary is added when using distinct.
I already have an index on all fields used in the where clause.
Is there anything I can do to optimize this query?
Thank you very much!
Edit: I tried ordering by this_.id as suggested and the query was 5x slower. Here is the EXPLAIN plan:
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | this_ | ref | PRIMARY,index_billingStatus | index_billingStatus | 5 | const | 782348 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rentalrequ1_ | ref | PRIMARY,fk_rental_id_BikeRentalRequest | fk_rental_id_BikeRentalRequest | 9 | solscsm_main.this_.id | 1 | Using where; Using index; Distinct |
| 1 | SIMPLE | rentalsegm2_ | ref | index_endDate,fk_rentalRequest_id_BikeRentalSegment | fk_rentalRequest_id_BikeRentalSegment | 8 | solscsm_main.rentalrequ1_.id | 1 | Using where; Distinct |
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
From the execution plan we see that the optimizer is smart enough to understand that you do not require OUTER JOINs here. Still, you should specify that explicitly.
The DISTINCT modifier means that you want to GROUP BY all fields in the SELECT part, that is, ORDER BY all of the specified fields and then discard duplicates. In other words, the order by rentalsegm2_.id asc clause does not make any sense here.
The query below should return the equivalent result:
select distinct this_.id as y0_
from Rental this_
join RentalRequest rentalrequ1_
on this_.id=rentalrequ1_.rental_id
join RentalSegment rentalsegm2_
on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
this_.DTYPE='B'
and this_.id<=1848978
and this_.billingStatus=1
and rentalsegm2_.endDate between 1273631699529 and 1274927699529
limit 0, 100;
UPD
If you want the execution plan to start with RentalSegment, you will need to add the following indices to the database:
RentalSegment (endDate)
RentalRequest (id, rental_id)
Rental (id, DTYPE, billingStatus) or (id, billingStatus, DTYPE)
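As DDL, that would look roughly like this (the index names are made up, and the EXPLAIN output suggests an index on RentalSegment.endDate may already exist as index_endDate):
CREATE INDEX idx_rs_enddate ON RentalSegment (endDate);
CREATE INDEX idx_rr_id_rental ON RentalRequest (id, rental_id);
CREATE INDEX idx_rental_id_billing_dtype ON Rental (id, billingStatus, DTYPE);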
The query then could be rewritten as the following:
SELECT this_.id as y0_
FROM RentalSegment rs
JOIN RentalRequest rr
JOIN Rental this_
WHERE rs.endDate between 1273631699529 and 1274927699529
AND rs.rentalRequest_id = rr.id
AND rr.rental_id <= 1848978
AND rr.rental_id = this_.id
AND this_.DTYPE='B'
AND this_.billingStatus = 1
GROUP BY this_.id
LIMIT 0, 100;
If the execution plan does not start from RentalSegment, you can force it with STRAIGHT_JOIN.
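A sketch of that, applied to the rewritten query above (using STRAIGHT_JOIN as a SELECT modifier, which forces the tables to be joined in the order they are listed):
SELECT STRAIGHT_JOIN this_.id AS y0_
FROM RentalSegment rs
JOIN RentalRequest rr ON rs.rentalRequest_id = rr.id
JOIN Rental this_ ON rr.rental_id = this_.id
WHERE rs.endDate BETWEEN 1273631699529 AND 1274927699529
AND rr.rental_id <= 1848978
AND this_.DTYPE = 'B'
AND this_.billingStatus = 1
GROUP BY this_.id
LIMIT 0, 100;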
The reason the query without the distinct runs faster is that you have a limit clause. Without the distinct, the server only needs to look at the first hundred matches. However, some of those rows may have duplicate fields, so if you introduce the distinct clause, the server has to look at many more rows in order to find ones that do not have duplicate values.
BTW, why are you using OUTER JOIN?
Here for "rentalsegm2_" table, optimizer has chosen "index_endDate" index and its no of rows expected from this table is about 4.5 lakhs. Since there are other where conditions exist, you can check for "this_" table indexes . I mean you can check in "this_ table" for how much records affected for each where conditions.
In summary, you can try for alternate solutions by changing indices used by optimizer.
This can be obtained by "USE INDEX", "FORCE INDEX" commands.
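For example (a syntax sketch only, forcing the index the EXPLAIN already names; whether a different index choice actually helps needs benchmarking):
select distinct this_.id as y0_
from Rental this_
join RentalRequest rentalrequ1_ on this_.id = rentalrequ1_.rental_id
join RentalSegment rentalsegm2_ FORCE INDEX (index_endDate)
on rentalrequ1_.id = rentalsegm2_.rentalRequest_id
where this_.DTYPE = 'B'
and this_.id <= 1848978
and this_.billingStatus = 1
and rentalsegm2_.endDate between 1273631699529 and 1274927699529
order by rentalsegm2_.id asc
limit 0, 100;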
Thanks
Rinson KE
DBA
www.qburst.com