Why is a count(*) query slow on some tables but not on others? - mysql

I've got a MySQL database running on a WAMP server that I'm using for frequent pattern mining of Flickr data. While loading the data into the database, I ran a count query to determine how many images I had already loaded. I was surprised that it took 3 minutes 49 sec for
select count(*) from image;
In a separate table, "concept", I am storing a list of tags that users give their images. A similar query on the "concept" table took 0.8 sec. The mystery is that both tables have around 200,000 rows. select count(*) from image; returns 283,890 and select count(*) from concept; returns 213,357.
Here's the description of each table (not reproduced here):
Clearly the "image" table has larger rows. I thought that perhaps "image" was too big to hold in memory based on this blog post, so I also tested the size of the tables using code from this answer.
SELECT table_name AS "Tables",
round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
FROM information_schema.TABLES
WHERE table_schema = "$DB_NAME"
ORDER BY (data_length + index_length) DESC;
"image" is 179.98 MB, "concept" is 15.45 MB
I'm running mysql on a machine with 64 GB of RAM, so both these tables should easily fit. What am I missing that is slowing down my queries? And how can I fix it?

When performing SELECT COUNT(*) on an InnoDB table, MySQL must scan an index to count the rows. In this case, your only index is the primary (clustered) index, so MySQL scans through that.
For the clustered index, the actual table data is stored there as well. Not including overhead, your image table is approximately 1973 bytes per row (I'm assuming a single-byte character set for both primary key columns). That's at most about 8 records per 16 KB page, so about 35,486 pages. Your concept table is approximately 257 bytes per row. That's about 63 records per page, so about 3,386 pages. That's a huge difference in the amount of data that must be scanned.
It has to read each page entirely because the pages may not be entirely full.
Performance-wise, perhaps some of those pages are in memory and some are not. There are also marginal differences due to InnoDB's 15/16 page-fill preference, but all the numbers above should be treated as approximations.
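Those page counts can be sanity-checked against the table statistics. A rough sketch (assumes InnoDB's default 16 KB page size and that both tables live in the current default schema; the statistics are themselves estimates):
-- Approximate per-row footprint and 16 KB page counts from table statistics
SELECT table_name,
       AVG_ROW_LENGTH,
       ROUND(DATA_LENGTH / 16384) AS approx_pages
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name IN ('image', 'concept');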
Solution
Adding a secondary index to the larger table should yield approximately the same performance for SELECT COUNT(*) as the smaller table. Of course, with another index to update, updates will be a bit slower.
For improved performance, shorten your primary key because secondary indexes include the indexed column(s) and the full primary key.
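In DDL terms, that could look like the following sketch (the column name is hypothetical, since the table definitions aren't reproduced above; any short column will do):
-- Hypothetical: a narrow secondary index gives COUNT(*) far fewer pages to scan
ALTER TABLE image ADD INDEX idx_count_helper (some_short_column);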
If you only need an estimated number of rows, you can use the rows value from one of the following, which uses the table statistics instead of scanning the index:
SHOW TABLE STATUS LIKE 'image'
or
EXPLAIN SELECT COUNT(*) FROM image

If you're looking for a ballpark figure rather than an exact count, then the Rows column from SHOW TABLE STATUS may be good enough. It's not always accurate for InnoDB tables, but it sounds like a rough estimate is acceptable here.
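The same estimate is also exposed through information_schema, in case you want to read it programmatically. A minimal sketch (TABLE_ROWS is the statistics-based estimate, not an exact count; DATABASE() assumes the table lives in the current default schema):
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'image';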

Related

Weird behavior of SELECT DISTINCT statement with RDS

I have created a 2TB MySQL RDS, and filled it with 2 tables totaling 1.5TB:
+----------+------------+------------+
| Database | Table      | Size in MB |
+----------+------------+------------+
| stam_db  | owl        | 1182043.00 |
| stam_db  | owl_owners |  393695.00 |
+----------+------------+------------+
The instance was set with db.m6g.2xlarge size and 6000 provisioned IOPS.
I ran this query to return the first 10 rows (they are all distinct, no duplicated rows):
SELECT DISTINCT *
FROM owl
ORDER BY name
LIMIT 10;
To my surprise, this query has been running for the last 2 hours...
Even more surprising, the "Free Storage Space" AWS metric started to decrease at a rate of 2.2 GB/minute. For some reason, Write IOPS suddenly rose to 600-700 per second, and Read IOPS went even higher, to about 1,850 per second, bringing total IOPS to around 2,400-2,500. Meanwhile, CPU utilization remained in the low single digits.
I have a few questions:
Why would a SELECT DISTINCT statement cause such massive writes into the database?
Why would the SELECT DISTINCT try to read the entire DB, instead of just the first 10 rows?
Why isn't RDS using the 6000 allocated IOPS? The total IOPS are only about 40% of the allocated amount.
For future reference, here are the answers:
Q2) I think I found an explanation at https://www.percona.com/blog/2019/07/17/mysql-disk-space-exhaustion-for-implicit-temporary-tables/ - "The queries that require a sorting stage most of the time need to rely on a temporary table. For example, when you use GROUP BY, ORDER BY or DISTINCT. Such queries are executed in two stages: the first is to gather the data and put them into a temporary table, the second is to execute the sorting on the temporary table." So even a regular SELECT with ORDER BY needs to read the whole table into a temporary copy.
Q1) The massive writes are caused by the temporary table created for the query; it can reach 100% of the original table's size.
Q3) It looks like the MySQL code that creates the temporary tables simply isn't efficient enough to utilize the entire 6000 IOPS.
Try using EXPLAIN to analyze your SELECT DISTINCT query. I bet it will include "Using temporary" and/or "Using filesort". With a large enough result set, such queries will use temporary disk space, and the more of them run at once, the more disk space they consume.
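A sketch of that check; look for "Using temporary" and "Using filesort" in the Extra column of the output:
EXPLAIN SELECT DISTINCT * FROM owl ORDER BY name LIMIT 10;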
I don't know why you use SELECT DISTINCT * if the rows are already distinct. It may cause the use of a temporary table unnecessarily.
Ideally your query should be:
SELECT *
FROM owl
ORDER BY name
LIMIT 10;
Make sure there is an index on the name column, so MySQL can skip the "Using filesort" step by reading rows in index order by name.
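A sketch of that index (if name is a long VARCHAR, you may need a prefix length such as name(50)):
-- Lets ORDER BY name LIMIT 10 read just the first 10 rows in index order
ALTER TABLE owl ADD INDEX idx_owl_name (name);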
Why isn't it using the full provisioned IOPS? I would guess it's because MySQL is constrained by the code that builds temporary tables: it can't fill the temp tables fast enough to saturate a high number of IOPS. Perhaps if you ran this query concurrently in many threads it would. But maybe not. IMO, provisioned IOPS are pretty much a scam.

Mysql Large Table Join Query very slow not Key indexes issue

SELECT t1.*
FROM
    ( SELECT key_a, key_b, MAX(date) AS date
      FROM large_table
      WHERE date <= 20150126
      GROUP BY key_a, key_b
    ) AS t2
JOIN large_table AS t1 USING (key_a, key_b, date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous missing dates between rows for a given key_a & key_b; I want the most recent row on or before the date entered.
Is it the MySQL join settings causing it to be slow?
I can copy the entire set of key_a & key_b data into a temp table with an INSERT ... SELECT and then run the same query against the temp table, but why run multiple queries (insert the selection, then select from it) when only one should be needed?
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization and it's not keys. Is it the max sort length or the join buffer size? I have 128 GB of RAM on a 32-core server running this, so there is no reason for it to be slow. I've just never bulk-inserted a single table this large to run join queries against before, so if anyone else has dealt with tables this size, any info is greatly appreciated.
Edited the query; yes, it's been a long day. It had a DISTINCT that wasn't needed and isn't in the actual query.
WHERE date <= 20150126
GROUP BY key_a, key_b
needs an index starting with date. It's about doing what you can with the WHERE clause, regardless of whether the dates are sparse or dense.
Then, since the inner query references only 3 columns, building a 'covering' index may be useful (probably useful in your case). So tack on the other two fields, in either order, such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
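In DDL form, the two variants would be (a sketch; the index names are arbitrary):
-- Spelled-out covering index (use this form for MyISAM)
ALTER TABLE large_table ADD INDEX idx_date_a_b (`date`, key_a, key_b);
-- Equivalent for InnoDB: the PK columns (key_a, key_b) ride along implicitly
ALTER TABLE large_table ADD INDEX idx_date (`date`);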
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the index.
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (16 KB each) of index scanning. That's about 300 MB.
Next the MAX and GROUP BY are performed. At 300 MB, this will involve an on-disk tmp table. (See tmp_table_size and max_heap_table_size.)
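To see the thresholds that decide when an implicit temp table spills to disk (the smaller of the two values applies):
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';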
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
What version are you running? Recent versions have a leapfrog technique that works better for MAX plus GROUP BY than what I described; I believe it is called a loose index scan. If so, your PK is actually optimal. (Hmmm... I should play around with that.)
PARTITIONing -- I don't see any benefit (for this query).

Function of deferred join in MySQL

I am reading High Performance MySQL and I am a little confused about the deferred join.
The book says that the following query cannot be optimized well by INDEX(sex, rating) because the high offset requires MySQL to spend most of its time scanning a lot of data that it will then throw away.
mysql> SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;
A deferred join, in contrast, helps minimize the amount of work MySQL must do gathering data that it will only throw away.
SELECT <cols> FROM profiles INNER JOIN (
    SELECT <primary key cols> FROM profiles
    WHERE sex='M' ORDER BY rating LIMIT 100000, 10
) AS x USING(<primary key cols>);
Why does a deferred join minimize the amount of gathered data?
The example you presented assumes that InnoDB is used. Let's say that the PRIMARY KEY is just id.
INDEX(sex, rating)
is a "secondary key". Every secondary key (in InnoDB) includes the PK implicitly, so it is really an ordered list of (sex, rating, id) values. To get to the "data" (<cols>), it uses id to drill down the PK BTree (which contains the data, too) to find the record.
Fast Case: Hence,
SELECT id FROM profiles
WHERE sex='M' ORDER BY rating LIMIT 100000, 10
will do a "range scan" of 100010 'rows' in the index. This will be quite efficient for I/O, since all the information is consecutive, and nothing is wasted. (No, it is not smart enough to jump over 100000 rows; that would be quite messy, especially when you factor in the transaction isolation mode.) Those 100010 rows probably fit in about 1000 blocks of the index. Then it gets the 10 values of id.
With those 10 ids, it can do 10 joins ("NLJ" = "Nested Loop Join"). It is rather likely that the 10 rows are scattered around the table, possibly requiring 10 hits to the disk.
Let's "count the disk hits" (ignoring non-leaf nodes in the BTrees, which are likely to be cached anyway): 1000 + 10 = 1010. On ordinary disks, this might take 10 seconds.
Slow Case: Now let's look at the original query (SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;). Let's continue to assume INDEX(sex, rating) plus the implicit id on the end.
As before, it will index scan through the 100010 rows (est. 1000 disk hits). But as it goes, it is too dumb to do what was done above. It will reach over into the data to get the <cols>. This often (depending on caching) requires a random disk hit. This could be upwards of 100010 disk hits (if the table is huge and caching is not very useful).
Again, 100000 are tossed and 10 are delivered. Total 'cost': 100010 disk hits (worst case), which might take 17 minutes.
Keep in mind that there are 3 editions of High Performance MySQL; they were written over the past 13 or so years. You are probably using a much newer version of MySQL than they covered. I don't happen to know whether the optimizer has gotten any smarter in this area. These, if available to you, may give clues:
EXPLAIN FORMAT=JSON SELECT ...;
OPTIMIZER TRACE...
My favorite "Handler" trick for studying how things work may be helpful:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
You are likely to see numbers like 100000 and 10, or small multiples of such. But, keep in mind that a fast range scan of the index counts as 1 per row, and so does a slow random disk hit for a big set of <cols>.
Overview: To make this technique work, the subquery needs a "covering" index, with the columns correctly ordered.
"Covering" means that (sex, rating, id) contains all the columns touched. (We are assuming that <cols> contains other columns, perhaps bulky ones that won't work in an INDEX.)
"Correct" ordering of the columns: The columns are in just the right order to get all the way through the query. (See also my cookbook.)
First come any WHERE columns compared with = to constants. (sex)
Then comes the entire ORDER BY, in order. (rating)
Finally it is 'covering'. (id)
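Putting the three rules together for this query, the covering index boils down to the following sketch (InnoDB appends id implicitly, so the index is effectively (sex, rating, id)):
ALTER TABLE profiles ADD INDEX idx_sex_rating (sex, rating);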
From the description in the official documentation (https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html):
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause are selected, and most or all of them are sorted, before the first row_count are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.
From that description, the two forms should perform the same.
But Percona suggests the deferred join and provides test data, without giving a reason why it helps. Perhaps there is some "bug" in how MySQL handles this kind of case, so we can simply treat the deferred join as a useful piece of practical experience.

Best way to count rows from mysql database

After facing a slow loading time with a MySQL query, I'm now looking for the best way to count rows. I had (stupidly) been using the mysql_num_rows() function for this, and now realize it's the worst way to do it.
I was actually making a Pagination to make pages in PHP.
I have found several ways to count rows, but I'm looking for the fastest way to do it.
The table type is MyISAM
So the question is now
Which is the best and fastest way to count?
1. `SELECT count(*) FROM 'table_name'`
2. `SELECT TABLE_ROWS
FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = 'database_name'
AND table_name LIKE 'table_name'`
3. `SHOW TABLE STATUS LIKE 'table_name'`
4. `SELECT FOUND_ROWS()`
If there are other, better ways to do this, please let me know as well.
If possible, please describe in the answer why it is best and fastest, so I can understand and choose a method based on my requirements.
Thanks.
Quoting the MySQL Reference Manual on COUNT:
COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause. For example:
mysql> SELECT COUNT(*) FROM student;
This optimization applies only to MyISAM tables, because an exact row count is stored for this storage engine and can be accessed very quickly. For transactional storage engines such as InnoDB, storing an exact row count is more problematic because multiple transactions may be occurring, each of which may affect the count.
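To make that concrete, a small illustration (student is the manual's example table; the age column is hypothetical):
SELECT COUNT(*) FROM student;                -- MyISAM: returns the stored row count instantly
SELECT COUNT(*) FROM student WHERE age > 20; -- any WHERE clause disables the shortcut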
Also read this question
MySQL - Complexity of: SELECT COUNT(*) FROM MyTable;
I would start by using SELECT count(*) FROM 'table_name' because it is the most portable and the easiest to understand, and because it is likely that the DBMS developers optimise common idiomatic queries of this sort.
Only if that wasn't fast enough would I benchmark the approaches you list to find if any were significantly faster.
It's slightly faster to count a constant:
select count('x') from table;
When the parser hits count(*), it has to figure out which columns of the table the * represents and get ready to accept them inside count().
Using a constant bypasses this (albeit slight) column-checking overhead.
As an aside, although not faster, one cute option is:
select sum(1) from table;
I've looked around quite a bit for this recently. It seems that there are a few options here that I'd never seen before.
Special needs: this database has about 6 million records and is getting crushed by multi-insert queries all the time. Getting a true count is difficult, to say the least.
SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = 'admin_worldDomination' AND table_name LIKE 'master'
Showing rows 0 - 0 ( 1 total, Query took 0.0189 sec)
This is decent: very fast but inaccurate. It showed results anywhere from 4 million to almost 8 million rows.
SELECT count( * ) AS counter FROM `master`
No time displayed; it took 8 seconds of real time and will get much worse as the table grows. This had been killing my site before today.
SHOW TABLE STATUS LIKE 'master'
Seems to be as fast as the first, though no time is displayed. It offers lots of other table information, but not much of it is worth anything (average record length, maybe).
SELECT FOUND_ROWS() FROM 'master'
Showing rows 0 - 29 ( 4,824,232 total, Query took 0.0004 sec)
This is good, but it's an average. The spread is closer than the others (4-5 million), so I'll probably end up taking a sample from a few of these queries and averaging.
EDIT: This was really slow when run from PHP, so I ended up going with the first option. The query runs 30 times quickly and I take an average; it's under 1 second, and still ranges between 5.3 & 5.5 million.
One idea I had, to throw this out there, is to try to find a way to estimate the row count. Since it's just to give your user an idea of the number of pages, maybe you don't need to be exact and could even say Page 1 of ~4837 or Page 1 of about 4800 or something.
I couldn't quickly find an estimated-count function, but you could try getting the table size and dividing by a determined/constant average row size. I don't know if, or why, getting the table size from TABLE STATUS would be faster than getting the row estimate from TABLE STATUS.
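A sketch of that estimate, pulling both numbers from the same table statistics (AVG_ROW_LENGTH and DATA_LENGTH are themselves estimates, so treat the result as a ballpark):
SELECT TABLE_ROWS,
       ROUND(DATA_LENGTH / NULLIF(AVG_ROW_LENGTH, 0)) AS rows_from_size
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'master';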

MySQL using indexes in RAM; why are the disks running?

I have indexes totaling 800 MB and told MySQL to use 1500 MB of RAM. After starting, MySQL uses 1000 MB on Windows 7 x64.
I want to execute this query:
SELECT o.* FROM `table` o
LEFT JOIN `table` oo ON (oo.`order` = o.`order` AND oo.type = "SHIPPED")
WHERE o.type = "ORDERED" AND oo.type IS NULL
This finds all items not yet shipped. (The execution plan screenshot is omitted here.)
My indices are:
type_order: a composite index on (type, order)
order_type: a composite index on (order, type)
So MySQL should use the type_order index from RAM and then pick out the few entries with the order_type index. I'm expecting only about 1000 non-shipped items, so this query should be really fast, but it isn't. The disks are going crazy...
What am I doing wrong?
The query says SELECT sometable.*, so for 1000 matching rows there will be 1000 fetches of all the fields from the table. Whether the indexes used by the WHERE clause are fully loaded into RAM or not only helps so much: the data fields still have to be retrieved, and odds are they are scattered all over the disk. So, of course, the disk(s) will be doing a thousand small reads.
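If only a handful of columns are actually needed, one way to avoid those scattered reads is to make the index covering, so the per-row fetches disappear. A sketch with hypothetical table and column names, since the question's schema is anonymized:
-- Widen the index to cover every selected column (names are hypothetical)
ALTER TABLE orders ADD INDEX idx_type_order_item (`type`, `order`, item_id);

SELECT o.`order`, o.item_id
FROM orders o
LEFT JOIN orders oo
  ON oo.`order` = o.`order` AND oo.`type` = 'SHIPPED'
WHERE o.`type` = 'ORDERED'
  AND oo.`order` IS NULL;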