MySQL SELECT queries of same limits getting slower and slower - mysql

I'm using PHP & MySQL for my website and have to aggregate some metrics to generate a report for the client. The metrics can be found in a 4-million row table, which uses the MyISAM engine and is distributed in 12 PARTITIONS.
In a PHP loop (a for-loop), for each iteration, I retrieve 1000 rows that match specific ids with
id = X OR id = Y OR id = Z
(I'm not using any inner join with a temporary table like UNION 1 id UNION 2 etc. as it is a little bit slower, might be because of the partition option that relies on the hash of the id).
The problem is that the queries are getting slower and slower. It might be caused by something that is cached progressively, but I don't know what.
Any help would be very precious, many thanks.

MySQL gets very slow when you use a LIMIT that starts deep in the index. Check out this article for optimizing late row lookups:
http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
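The "late row lookup" idea from that article can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table `metrics`, its columns, and the page sizes are hypothetical. It shows the shape of the queries and that they return the same rows; it says nothing about timings on a real MyISAM table. The third query is keyset pagination, a different technique from the one in the article, and it only applies when you can remember the last id of the previous batch.

```python
import sqlite3

# Hypothetical table; sqlite3 stands in for MySQL purely to show the query shapes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO metrics (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 5001)],
)

OFFSET, PAGE = 4000, 10

# Plain form: the engine steps over OFFSET rows before returning PAGE rows.
plain = conn.execute(
    "SELECT id, payload FROM metrics ORDER BY id LIMIT ? OFFSET ?",
    (PAGE, OFFSET),
).fetchall()

# Deferred join: page through the ids alone, then fetch full rows
# only for the PAGE ids that survive the LIMIT.
deferred = conn.execute(
    """
    SELECT m.id, m.payload
    FROM (SELECT id FROM metrics ORDER BY id LIMIT ? OFFSET ?) AS page
    JOIN metrics AS m ON m.id = page.id
    ORDER BY m.id
    """,
    (PAGE, OFFSET),
).fetchall()

# Keyset variant: no OFFSET at all, continue after the last id already seen.
# Here the ids are 1..5000 without gaps, so the last id of the previous page is OFFSET.
last_seen_id = OFFSET
keyset = conn.execute(
    "SELECT id, payload FROM metrics WHERE id > ? ORDER BY id LIMIT ?",
    (last_seen_id, PAGE),
).fetchall()

print(plain == deferred == keyset)
```

In a PHP loop the same idea would mean carrying the last id from one iteration to the next instead of increasing an offset.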

Related

Why is SELECT COUNT(*) much slower than SELECT * even without WHERE clause in MySQL?

I have a view that has a duration time of ~0.2 seconds when I do a simple SELECT * from it, but has a duration time of ~25 seconds when I do simply SELECT COUNT(*) from it. What would cause this? It seems like if it takes 0.2 seconds to compute the output data then it could run a simple length calculation on that dataset in a trivial amount of time. MySQL 5.7. Details below.
mysql> select count(*) from Lots;
+----------+
| count(*) |
+----------+
|  4136666 |
+----------+
1 row in set (25.29 sec)
In MySQL workbench, the following query produces durations like: 0.217 sec
select * from Lots;
The fetch time is significant given the amount of data, but my understanding is the "Duration" is how long it takes to compute the output dataset of the view.
Definition of Lots view:
select
lot.*,
coalesce(overrides.streetNumber, address.streetNumber, lot.rawStreetNumber) as streetNumber,
coalesce(overrides.street, address.street, lot.rawStreet) as street,
coalesce(overrides.postalCode, address.postalCode, lot.rawPostalCode) as postalCode,
coalesce(overrides.city, address.city, lot.rawCity) as city
from LotsData lot
left join Address address on address.lotNumber = lot.lotNumber
left join Override overrides on overrides.lotId = lot.lotNumber
The data in VIEW objects isn't materialized. That is, it doesn't exist in any sort of tabular form in your database server. Rather, the server pulls it together from its tables when a query (like your COUNT query) references the VIEW. So, there's no simple metadata hanging around in the server that can satisfy your COUNT query instantaneously. The server has to pull together all your joined tables to generate a row count. It takes a while. Remember, your database server may have other clients concurrently INSERTing or DELETEing rows to one or more of the tables in your view.
It's worse than that. In the InnoDB storage engine, even COUNTing the rows of a table is slow. To achieve high concurrency InnoDB doesn't attempt to store any kind of precise row count. So the database server has to count those rows one-by-one as well. (The older MyISAM storage engine does maintain precise row count metadata for tables, but it offers less concurrency.)
Wise data programmers avoid using COUNT(*) on whole tables or views composed from them in production for those reasons.
The real question is why your SELECT * FROM view is so fast. It's unlikely that your database server can compose and deliver a 4-megarow view from its JOINs in less than a second, nor is it likely that Workbench can absorb that many rows in that time. As @ysth said, many GUI-based SQL client programs, like Workbench and HeidiSQL, sometimes silently append something like LIMIT 1000 to interactive operations calling for the display of whole tables or views. You might look for evidence of that.

Temp tables vs subqueries in inner join

Both queries return the same results. In the first, the join is against a subquery; in the second, the final query joins against a temporary table that I create and populate beforehand.
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN (
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa
) hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
DROP TEMPORARY TABLE IF EXISTS hasRemittance;
CREATE TEMPORARY TABLE hasRemittance
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa;
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
Which will have better performance for a few thousand records?
The two formulations are identical except that your explicit temp table version is 3 sql statements instead of just 1. That is, the overhead of the back and forth to the server makes it slower. But...
Since the implicit temp table is in a LEFT JOIN, that subquery may be evaluated in one of two ways...
Older versions of MySQL were 'dumb' and re-evaluated it. Hence slow.
Newer versions automatically create an index. Hence fast.
Meanwhile, you could speed up the explicit temp table version by adding a suitable index. It would be PRIMARY KEY(collegiate_id). If there is a chance of that INNER JOIN producing dups, then say SELECT DISTINCT.
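A minimal sketch of that suggestion, assuming a cut-down version of the schema: it uses Python's sqlite3 module rather than MySQL, the sample rows are invented, and only the `active` filter from the original WHERE clause is kept. It builds the explicit temp table with a primary key and SELECT DISTINCT, then checks that the anti-join returns the same ids as the subquery form.

```python
import sqlite3

# Cut-down, hypothetical version of the schema; sqlite3 stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collegiates (collegiate_id INTEGER PRIMARY KEY, active INTEGER);
CREATE TABLE remittances (remittance_id INTEGER PRIMARY KEY, type_id INTEGER, name TEXT);
CREATE TABLE collegiateRemittances (collegiate_id INTEGER, remittance_id INTEGER);
INSERT INTO collegiates VALUES (1, 1), (2, 1), (3, 1), (4, 0);
INSERT INTO remittances VALUES (10, 1, 'jan'), (11, 1, 'jan'), (12, 2, 'jan');
-- collegiate 1 matches two remittances, so the inner join yields a duplicate
INSERT INTO collegiateRemittances VALUES (1, 10), (1, 11), (2, 12);
""")

remesa = "jan"

# Version 1: anti-join against a derived table (subquery in the LEFT JOIN).
via_subquery = conn.execute(
    """
    SELECT c.collegiate_id
    FROM collegiates c
    LEFT JOIN (
        SELECT r.collegiate_id
        FROM collegiateRemittances r
        INNER JOIN remittances r1 USING (remittance_id)
        WHERE r1.type_id = 1 AND r1.name = ?
    ) hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
    WHERE hasRemittance.collegiate_id IS NULL AND c.active = 1
    ORDER BY c.collegiate_id
    """,
    (remesa,),
).fetchall()

# Version 2: explicit temp table with a primary key on the join column.
# DISTINCT matters here: without it the duplicate for collegiate 1
# would violate the primary key.
conn.execute("CREATE TEMP TABLE hasRemittance (collegiate_id INTEGER PRIMARY KEY)")
conn.execute(
    """
    INSERT INTO hasRemittance
    SELECT DISTINCT r.collegiate_id
    FROM collegiateRemittances r
    INNER JOIN remittances r1 USING (remittance_id)
    WHERE r1.type_id = 1 AND r1.name = ?
    """,
    (remesa,),
)
via_temp = conn.execute(
    """
    SELECT c.collegiate_id
    FROM collegiates c
    LEFT JOIN hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
    WHERE hasRemittance.collegiate_id IS NULL AND c.active = 1
    ORDER BY c.collegiate_id
    """
).fetchall()

print(via_subquery, via_temp)
```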
For "a few thousand" rows, you usually don't need to worry about performance.
Oracle has a zillion options for everything. MySQL has very few, with the default being (usually) the best. So ignore the answer that discussed various options that you could use in MySQL.
There are issues with
AND IF(notCollegiate,
c.collegiate_id NOT IN (notCollegiate),
'1=1')
I can't tell which table notCollegiate is in. notCollegiate cannot be a list, so why use IN? Instead simply use !=. Finally, '1=1' is a 3-character string; did you really want that?
For performance (of either version)
remittances needs INDEX(type_id, name, remittance_id) with remittance_id specifically last.
collegiateRemittances needs INDEX(remittance_id) (unless it is the PK).
collegiates needs INDEX(typePayment, active, exentFee, approvedBoard) in any order.
Bottom line: Worry more about indexes than how you formulate the query.
Ouch. Another wrinkle. What is getFee()? If it is a Stored Function, maybe we need to worry about optimizing it?? And what is dateS?
It depends actually. You'll have to test performance of every option. On my website I had 2 tables with articles and comments to them. It turned out it's faster to call comment counts 20 times for each article, than using a single union query. MySQL (like other DBs) caches queries, so small simple queries can run amazingly fast.
I did not see that you had tagged the question as mysql, so I initially answered for Oracle. Here is what I think about MySQL.
MySQL
There are two options when it comes to temporary tables: memory or disk. For disk you can have MyISAM (non-transactional) or InnoDB (transactional). Of course you can expect better performance from the non-transactional type of storage.
Additionally, you need to figure out how big a result set you are dealing with. For a small result set the memory option would be faster; for a large result set the disk option would be faster.
Again, as in my original answer, in the end you need to figure out what performance is good enough and go for the most descriptive and easy-to-read option.
Oracle
It depends on what kind of temporary table you are dealing with.
You can have session-based temporary tables (data is held until logout) or transaction-based ones (data is held until commit). On top of this they can support transaction logging or not. Depending on configuration you can get better performance from a temporary table.
As with everything in the world, performance is a relative term. Most probably, for a few thousand records there will be no significant difference between the two queries. In that case I would go not for the most performant one but for the one that is easiest to read and understand.

Mysql Large Table Join Query very slow not Key indexes issue

SELECT t1.*
FROM
( SELECT key_a,key_b,MAX(date) as date
FROM large_table
WHERE date <= 20150126
group by key_a,key_b
) AS t2
JOIN large_table AS t1 USING(key_a,key_b ,date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous gaps in the dates for each (key_a, key_b) pair; for each pair I want the most recent row on or before the date entered.
Is it the Mysql Join settings causing it to be slow ?
I can copy the entire set of a & b data with an INSERT to a temp table just by selecting all the rows and then run the same query on the temp table, but why do multi queries (insert selected, then select from) when only 1 is needed.
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization and it's not keys. Is it max sort length or join buffer size? I have 128 GB of RAM on a 32-core server running this, so there is no reason for it to be slow. I have just never bulk-inserted this large a single table to run join queries on before. If anyone else has dealt with tables this size, any info is greatly appreciated.
Edited the query. Yes, it's late after a long day: I had DISTINCT in there when it wasn't needed, and it isn't in the actual query.
WHERE date <= 20150126
group by key_a,key_b
needs an index starting with date. It's about doing what you can with the WHERE clause, not sparse or dense.
Then... Since the inner query references only 3 columns, building a 'covering' index may be useful. (Probably useful in your case.) So, tack on the other two fields, in either order. Such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the data.
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (16KB each) of index scanning. That's about 300MB.
Next the MAX and GROUP BY are performed. At 300MB, this will involve a disk tmp table. (See max_heap_table_size and tmp_table_size.)
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
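Here is a small sketch of that pattern (subquery finds the PKs, then JOIN back on the PK). It runs on Python's sqlite3 module rather than MySQL, the five sample rows and the `val` column are invented, and the index name `idx_date_cover` is hypothetical; it only demonstrates that the query picks, per (key_a, key_b), the latest row on or before the cutoff.

```python
import sqlite3

# Tiny invented data set; sqlite3 stands in for MySQL to show the query shape only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE large_table (
    key_a INTEGER,
    key_b INTEGER,
    date  INTEGER,
    val   TEXT,
    PRIMARY KEY (key_a, key_b, date)
);
-- Covering index for the inner query, starting with date as suggested above.
CREATE INDEX idx_date_cover ON large_table (date, key_a, key_b);
INSERT INTO large_table VALUES
    (1, 1, 20150120, 'a'),
    (1, 1, 20150125, 'b'),
    (1, 1, 20150130, 'c'),
    (1, 2, 20150126, 'd'),
    (2, 1, 20150127, 'e');
""")

cutoff = 20150126

rows = conn.execute(
    """
    SELECT t1.key_a, t1.key_b, t1.date, t1.val
    FROM (
        -- Step 1: find the primary key of the wanted row in each group.
        SELECT key_a, key_b, MAX(date) AS date
        FROM large_table
        WHERE date <= ?
        GROUP BY key_a, key_b
    ) AS t2
    -- Step 2: join back on the full primary key to fetch the remaining columns.
    JOIN large_table AS t1 USING (key_a, key_b, date)
    ORDER BY t1.key_a, t1.key_b
    """,
    (cutoff,),
).fetchall()

for row in rows:
    print(row)
```

The pair (2, 1) has no row on or before the cutoff, so it drops out of the result entirely.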
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
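The estimates above can be reproduced with a few lines of arithmetic. The 10 ms per random read is an assumption for a commodity spinning disk, not a number measured on this server, and the block counts are the rough figures from this answer.

```python
BLOCK_BYTES = 16 * 1024   # 16KB blocks, as used in the estimate above
SEEK_SECONDS = 0.010      # assumption: about 10 ms per random disk read

index_blocks = 20_000     # blocks scanned in the proposed index
row_lookups = 180_000     # one PK lookup per row delivered by the subquery

# Size of the index range scanned, in MB.
index_scan_mb = index_blocks * BLOCK_BYTES / 1024**2

# Worst case: every block touched is a separate disk hit.
total_hits = index_blocks + row_lookups

# Cold-cache run time if each hit costs one seek.
cold_minutes = total_hits * SEEK_SECONDS / 60

# Memory needed to keep every touched block cached for the second run.
working_set_gb = total_hits * BLOCK_BYTES / 1024**3

print(f"index scan: {index_scan_mb:.1f} MB")
print(f"disk hits:  {total_hits}")
print(f"cold run:   {cold_minutes:.1f} minutes")
print(f"cache need: {working_set_gb:.2f} GB")
```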
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
What version are you running? Recent versions have a leapfrog technique that works better for max+groupby than what I described. I think it is called a loose index scan (EXPLAIN shows it as "Using index for group-by"). If so, your PK is actually optimal. (Hmmm... I should play around with that.)
PARTITIONing -- I don't see any benefit (for this query).

MySQL Slow Query despite EXPLAIN shows less rows

I am using MySQL 5.1.34; all tables use the InnoDB engine.
I have 3 tables as below:
TableA (1M rows)
-ID (Auto Increment PK)
-TableB_ID
-Date varchar(indexed)
-Other Fields
TableB (60M rows)
-ID (Auto Increment PK)
-TableC_ID
-Other Fields
TableC (10M rows)
-ID (Auto Increment PK)
-Other Fields
My objective is to join the 3 tables for the rows matching the "date" in TableA. The "date" column is indexed and a simple WHERE clause completes within a second. E.g.
SELECT * FROM TableA where date = '2015-03-13';
10000 rows in set (0.1 sec)
However, when I try to join TableB and TableC with the SQL below, the process becomes extremely slow.
SELECT A.*, C.Something FROM TableA A JOIN TableB B on A.TableB_ID = B.ID JOIN TableC C on B.TableC_ID = C.ID WHERE A.date = '2015-03-13';
10000 rows in set (20 sec)
I've tried to troubleshoot the slowness using the EXPLAIN command; output as follows.
What could be the reason? Please help!
As I said, it's probably a disk seeking issue. As you measured, once the records are in the memory, the query is fast, which, to me, confirms the issue.
Fetching a random location from the disk takes about 10ms as the disk head has to move. The relevant records are probably clustered on the disk and the server had to do approx. 20s/10ms = 2,000 seeks.
There are a couple of obvious approaches:
Use an SSD. No seeking.
Add enough RAM to the server to avoid disk access, or use a dedicated server for these queries. (16GB looks like more than enough, though I don't know the size of the records, so I guess there are other frequently used massive tables.)
Cache the result per date - and store it in a database (memcached/redis/..). This could work really well if the records are static for the old dates as you don't have to worry about cache invalidation.
Anyways, it's probably a good idea to do some back-of-the-envelope calculations and figure out the memory requirements.
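As a rough sketch of the per-date caching idea: the class and function names below are hypothetical, a plain in-process dict stands in for memcached or redis, and the loader is a placeholder for the slow three-table join. It assumes results for a given date never change, so there is no invalidation.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]


class PerDateCache:
    """Hypothetical in-process stand-in for a memcached/redis lookup keyed by date."""

    def __init__(self, loader: Callable[[str], List[Row]]) -> None:
        self._loader = loader
        self._store: Dict[str, List[Row]] = {}

    def get(self, date: str) -> List[Row]:
        # Only run the expensive query on a miss. Old dates are assumed
        # static, so entries are never invalidated in this sketch.
        if date not in self._store:
            self._store[date] = self._loader(date)
        return self._store[date]


calls: List[str] = []


def slow_join_for_date(date: str) -> List[Row]:
    # Placeholder for the expensive join from the question.
    calls.append(date)
    return [(1, f"something-for-{date}")]


cache = PerDateCache(slow_join_for_date)
first = cache.get("2015-03-13")   # miss: runs the loader
second = cache.get("2015-03-13")  # hit: served from the dict
print(len(calls))
```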
You have left out a lot of important information by not providing SHOW CREATE TABLE, but I will make some guesses.
Are you really fetching 10K rows? What are you going to do with all of them? It takes time to get that many rows.
If you are using MyISAM, then B would benefit greatly from INDEX(ID, TableC_ID). InnoDB should not benefit from it, unless the table is quite 'wide'.
SHOW VARIABLES LIKE '%buffer%'; How much RAM do you have?
Is the Query cache in use? If it is, that would explain why it was so fast the second time. Most production systems find it better to have the QC turned off.

MySQL using indexes in RAM; why are the disks running?

I have indices of 800M and told MySQL to use 1500M of RAM. After starting MySQL it uses 1000M on Windows 7 x64.
I want to execute this query:
SELECT oo.* FROM table o
LEFT JOIN table oo ON (oo.order = o.order AND oo.type="SHIPPED")
WHERE o.type="ORDERED" and oo.type IS NULL
This finds all items not yet shipped. The execution plan tells me this:
My indices are:
type_order: Multiple index with type and order
order_type with order as first index value, followed by type
So MySQL should use the index type_order from RAM and then pick out the few entries with the order_type index. I'm expecting only about 1000 non shipped items, so this query should be really fast, but it isn't. Disks are going crazy....
What am I doing wrong?
The query says SELECT sometable.*, so for 1000 matching rows, there will be 1000 fetches of all the fields from the table. Whether the WHERE part indexes are fully loaded into ram or not would only help some. The data fields still have to be retrieved. Odds are, they are scattered all over the disk. So, of course the disk(s) will be doing a thousand small reads.