Weird behavior of SELECT DISTINCT statement with RDS - mysql

I have created a 2TB MySQL RDS, and filled it with 2 tables totaling 1.5TB:
+----------+------------+------------+
| Database | Table      | Size in MB |
+----------+------------+------------+
| stam_db  | owl        | 1182043.00 |
| stam_db  | owl_owners |  393695.00 |
+----------+------------+------------+
The instance was set with db.m6g.2xlarge size and 6000 provisioned IOPS.
I ran this query to return the first 10 rows (they are all distinct, no duplicated rows):
SELECT DISTINCT *
FROM owl
ORDER BY
name
LIMIT 10;
To my surprise, this query has been running for the last 2 hours...
Even more surprising, the "Free Storage Space" AWS metric started to decrease at a rate of 2.2GB/minute.
For some reason, Write IOPS suddenly rose to 600-700 per second.
Read IOPS went even higher, to about 1850 per second.
This brings total IOPS to around 2400-2500.
CPU Utilization remained in the low single digits.
I have a few questions:
Q1) Why would a SELECT DISTINCT statement cause such massive writes into the database?
Q2) Why would the SELECT DISTINCT try to read the entire DB, instead of just the first 10 rows?
Q3) Why isn't RDS using the 6000 allocated IOPS? The total IOPS are only about 40% of the allocated amount.
For future reference, here are the answers:
Q2) I think I found an explanation at https://www.percona.com/blog/2019/07/17/mysql-disk-space-exhaustion-for-implicit-temporary-tables/ - "The queries that require a sorting stage most of the time need to rely on a temporary table. For example, when you use GROUP BY, ORDER BY or DISTINCT. Such queries are executed in two stages: the first is to gather the data and put them into a temporary table, the second is to execute the sorting on the temporary table." So even a regular SELECT with ORDER BY needs to read the whole table.
Q1) The massive writes are caused by the temporary table created for the query; it can reach 100% of the size of the original table.
Q3) It looks like the MySQL code that creates the temporary tables simply isn't efficient enough to utilize the entire 6000 IOPS.
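A rough back-of-the-envelope check of the figures in the question. This is only a sketch: it assumes "2TB" means 2,000,000 MB and "2.2GB/minute" means 2,200 MB/minute, and that the write rate stays constant. None of these conversions are RDS-reported values.

```python
# Figures from the question; the unit conversions are assumptions.
ALLOCATED_MB = 2_000_000        # "2TB" treated as decimal megabytes (assumption)
OWL_MB = 1_182_043.00           # size of table owl
OWL_OWNERS_MB = 393_695.00      # size of table owl_owners
WRITE_RATE_MB_PER_MIN = 2_200   # "2.2GB/minute" treated as 2,200 MB/minute (assumption)

used_mb = OWL_MB + OWL_OWNERS_MB
free_mb = ALLOCATED_MB - used_mb

# Minutes until free storage runs out at the observed rate.
minutes_until_full = free_mb / WRITE_RATE_MB_PER_MIN
# Minutes needed to write a temporary copy as large as owl itself.
minutes_for_full_copy = OWL_MB / WRITE_RATE_MB_PER_MIN

print(f"free space: {free_mb:,.0f} MB")
print(f"storage exhausted after about {minutes_until_full:.0f} minutes")
print(f"full-size temporary copy needs about {minutes_for_full_copy:.0f} minutes")
print("full copy fits in free space:", OWL_MB <= free_mb)
```

Under those assumptions, a temporary table approaching 100% of owl would not fit in the remaining space, so the query would likely fill the disk before it finished.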

Try to use EXPLAIN to analyze your SELECT DISTINCT query. I bet it will include "Using temporary" and/or "Using filesort". With a large enough result set, these queries will use temporary disk space. But the more frequently you run these queries, the more disk space it uses.
I don't know why you use SELECT DISTINCT * if the rows are already distinct. This may cause the use of a temporary table unnecessarily.
Ideally your query should be:
SELECT *
FROM owl
ORDER BY
name
LIMIT 10;
Make sure there is an index on the name column, so it can skip the "Using filesort" by reading rows in the index order by name.
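The effect of the index can be sketched with Python's built-in sqlite3 module. SQLite is only a stand-in here: its EXPLAIN QUERY PLAN wording ("USE TEMP B-TREE FOR ORDER BY") differs from MySQL's "Using temporary" / "Using filesort", and the payload column and index name idx_owl_name are made up for the example. The point is the same, though: without an index on name the engine sorts the whole table, and with one it reads rows in index order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the owl table; only the name column matters here.
conn.execute("CREATE TABLE owl (id INTEGER PRIMARY KEY, name TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO owl (name, payload) VALUES (?, ?)",
    [(f"owl-{i:05d}", "x" * 20) for i in range(1000)],
)


def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM owl ORDER BY name LIMIT 10"

plan_without_index = plan(query)   # expected to include a temporary sort step
conn.execute("CREATE INDEX idx_owl_name ON owl (name)")
plan_with_index = plan(query)      # expected to walk the index instead

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

On MySQL the equivalent check is to run EXPLAIN on the query before and after adding the index and compare the Extra column.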
Why isn't it using the full provisioned IOPS? I would guess because MySQL is constrained by the code that builds temporary tables. It can't fill the temp tables fast enough to saturate a high number of IOPS. Perhaps if you were to run this query concurrently in many threads it would. But maybe not. IMO, provisioned IOPS are pretty much a scam.

Related

Why is SELECT COUNT(*) much slower than SELECT * even without WHERE clause in MySQL?

I have a view that has a duration time of ~0.2 seconds when I do a simple SELECT * from it, but has a duration time of ~25 seconds when I do simply SELECT COUNT(*) from it. What would cause this? It seems like if it takes 0.2 seconds to compute the output data then it could run a simple length calculation on that dataset in a trivial amount of time. MySQL 5.7. Details below.
mysql> select count(*) from Lots;
+----------+
| count(*) |
+----------+
| 4136666 |
+----------+
1 row in set (25.29 sec)
In MySQL workbench, the following query produces durations like: 0.217 sec
select * from Lots;
The fetch time is significant given the amount of data, but my understanding is the "Duration" is how long it takes to compute the output dataset of the view.
Definition of Lots view:
select
lot.*,
coalesce(overrides.streetNumber, address.streetNumber, lot.rawStreetNumber) as streetNumber,
coalesce(overrides.street, address.street, lot.rawStreet) as street,
coalesce(overrides.postalCode, address.postalCode, lot.rawPostalCode) as postalCode,
coalesce(overrides.city, address.city, lot.rawCity) as city
from LotsData lot
left join Address address on address.lotNumber = lot.lotNumber
left join Override overrides on overrides.lotId = lot.lotNumber
The data in VIEW objects isn't materialized. That is, it doesn't exist in any sort of tabular form in your database server. Rather, the server pulls it together from its tables when a query (like your COUNT query) references the VIEW. So, there's no simple metadata hanging around in the server that can satisfy your COUNT query instantaneously. The server has to pull together all your joined tables to generate a row count. It takes a while. Remember, your database server may have other clients concurrently INSERTing or DELETEing rows to one or more of the tables in your view.
It's worse than that. In the InnoDB storage engine, even COUNTing the rows of a table is slow. To achieve high concurrency InnoDB doesn't attempt to store any kind of precise row count. So the database server has to count those rows one-by-one as well. (The older MyISAM storage engine does maintain precise row count metadata for tables, but it offers less concurrency.)
Wise data programmers avoid using COUNT(*) on whole tables or views composed from them in production for those reasons.
The real question is why your SELECT * FROM view is so fast. It's unlikely that your database server can compose and deliver a 4-megarow view from its JOINs in less than a second, nor is it likely that Workbench can absorb that many rows in that time. As @ysth said, many GUI-based SQL client programs, like Workbench and HeidiSQL, sometimes silently append something like LIMIT 1000 to interactive operations calling for the display of whole tables or views. You might look for evidence of that.
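A small sketch of why the server cannot answer COUNT(*) on a view from table metadata: the number of rows a LEFT JOIN view produces depends on the joined data, so the joins have to be evaluated. It uses Python's sqlite3 as a stand-in for MySQL, with a cut-down, hypothetical version of the Lots view (only one COALESCE column and made-up sample rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LotsData (lotNumber INTEGER PRIMARY KEY, rawStreet TEXT);
CREATE TABLE Address  (lotNumber INTEGER, street TEXT);

-- Simplified stand-in for the Lots view from the question.
CREATE VIEW Lots AS
    SELECT lot.*, COALESCE(address.street, lot.rawStreet) AS street
    FROM LotsData lot
    LEFT JOIN Address address ON address.lotNumber = lot.lotNumber;

INSERT INTO LotsData VALUES (1, 'raw 1'), (2, 'raw 2'), (3, 'raw 3');
-- Lot 1 has two address rows, lot 2 has one, lot 3 has none.
INSERT INTO Address VALUES (1, 'Main St'), (1, 'Main Street'), (2, 'Oak Ave');
""")

base_count = conn.execute("SELECT COUNT(*) FROM LotsData").fetchone()[0]
view_count = conn.execute("SELECT COUNT(*) FROM Lots").fetchone()[0]

print("rows in LotsData:", base_count)
print("rows in Lots view:", view_count)
```

Because lot 1 matches two address rows, the view has more rows than the base table. The only way to know that is to run the join.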

Why is count(*) query slow on some tables but not on others?

I've got a mysql database running on a wamp server that I'm using to do frequent pattern mining of Flickr data. In the process of loading the data into the database, I ran a count query to determine how many images I had already loaded. I was surprised that it took 3 minutes 49 sec for
select count(*) from image;
In a separate table, "concept", I am storing a list of tags that users give their images. A similar query on the "concept" table took 0.8 sec. The mystery is that both tables have around 200,000 rows. select count(*) from image; returns 283,890 and select count(*) from concept; returns 213,357.
Here's the description of each table
Clearly the "image" table has larger rows. I thought that perhaps "image" was too big to hold in memory based on this blog post, so I also tested the size of the tables using code from this answer.
SELECT table_name AS "Tables",
round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
FROM information_schema.TABLES
WHERE table_schema = "$DB_NAME"
ORDER BY (data_length + index_length) DESC;
"image" is 179.98 MB, "concept" is 15.45 MB
I'm running mysql on a machine with 64 GB of RAM, so both these tables should easily fit. What am I missing that is slowing down my queries? And how can I fix it?
When performing SELECT COUNT(*) on an InnoDB table, MySQL must scan through an index to count the rows. In this case, your only index is the primary (clustered) index, so MySQL scans through that.
For the clustered index, the actual table data is stored there as well. Not including overhead, your image table is approximately 1973 bytes per row (I'm assuming a single-byte character set for both primary key columns). That's about 8 records max per (16k) page, so about 35,486 pages. Your concept table is approximately 257 bytes per row. That's about 63 records per page, so about 3,386 pages. That's a huge difference in the amount of data that must be scanned.
It has to read each page entirely because the pages may not be entirely full.
Then, performance wise, perhaps some of those pages are in memory and some are not. There are also some marginal differences due to MySQL's 15/16 preference, but all numbers above should be considered approximations.
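The page estimates above can be reproduced with a few lines of arithmetic. This sketch uses the row counts from the question and the per-row sizes estimated in this answer. It ignores page overhead and fill factor, and it rounds down the same way the figures above do, so treat the results as approximations.

```python
PAGE_BYTES = 16 * 1024  # 16k InnoDB page, as assumed in the answer


def estimate_pages(row_count, bytes_per_row):
    """Return (rows per page, approximate page count), ignoring overhead."""
    rows_per_page = PAGE_BYTES // bytes_per_row   # whole rows that fit in one page
    return rows_per_page, row_count // rows_per_page


image_rows_per_page, image_pages = estimate_pages(283_890, 1973)
concept_rows_per_page, concept_pages = estimate_pages(213_357, 257)

print("image:  ", image_rows_per_page, "rows/page,", image_pages, "pages")
print("concept:", concept_rows_per_page, "rows/page,", concept_pages, "pages")
print("image scans roughly", round(image_pages / concept_pages, 1), "times as many pages")
```

So although both tables have a similar number of rows, counting image means reading roughly ten times as many pages.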
Solution
Adding a secondary index to the larger table should yield approximately the same performance for SELECT COUNT(*) as the smaller table. Of course, with another index to update, updates will be a bit slower.
For improved performance, shorten your primary key because secondary indexes include the indexed column(s) and the full primary key.
If you only need an estimated number of rows, you can use the rows value from one of the following, which uses the table statistics instead of scanning the index:
SHOW TABLE STATUS LIKE 'image'
or
EXPLAIN SELECT COUNT(*) FROM image
If you're looking for a ballpark number rather than an exact count, then the Rows column from show table status may be good enough. It's not always accurate for InnoDB tables, but it seems like you're probably ok with a rough estimate anyway.

MySQL SELECT queries of same limits getting slower and slower

I'm using PHP & MySQL for my website and have to aggregate some metrics to generate a report for the client. The metrics can be found in a 4-million row table, which uses the MyISAM engine and is distributed in 12 PARTITIONS.
In a PHP loop (a for-loop), for each iteration, I retrieve 1000 rows that match specific ids with
id = X OR id =Y OR id = Z
(I'm not using any inner join with a temporary table like UNION 1 id UNION 2 etc. as it is a little bit slower, might be because of the partition option that relies on the hash of the id).
The problem is that the queries are getting slower and slower. It might be caused by something that is cached progressively, but I don't know what.
Any help would be very precious, many thanks.
MySQL gets very slow when you use a LIMIT that starts deep in the index. Check out this article for optimizing late row lookups:
http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
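The late row lookup pattern from that article can be sketched as follows. This uses Python's sqlite3 purely to show the query shape and to check that both forms return the same rows; it does not reproduce MySQL's performance behaviour, and the metrics table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO metrics (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 5001)],
)

OFFSET, PAGE_SIZE = 4000, 10

# Plain form: select full rows with a deep OFFSET.
plain = conn.execute(
    "SELECT id, payload FROM metrics ORDER BY id LIMIT ? OFFSET ?",
    (PAGE_SIZE, OFFSET),
).fetchall()

# Late row lookup: page through the ids only, then join back for the full rows.
late = conn.execute(
    """
    SELECT m.id, m.payload
    FROM (SELECT id FROM metrics ORDER BY id LIMIT ? OFFSET ?) AS picked
    JOIN metrics AS m ON m.id = picked.id
    ORDER BY m.id
    """,
    (PAGE_SIZE, OFFSET),
).fetchall()

print(plain[0], late[0], plain == late)
```

According to the article, the second form helps in MySQL because the skipped rows are walked in the index only, and full rows are fetched just for the final page.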

MySQL using indexes in RAM; why are the disks running?

I have indices of 800M and told MySQL to use 1500M of RAM. After starting MySQL it uses 1000M on Windows 7 x64.
I want to execute this query:
SELECT oo.* FROM table o
LEFT JOIN table oo ON (oo.order = o.order AND oo.type="SHIPPED")
WHERE o.type="ORDERED" and oo.type IS NULL
This finds all items not yet shipped. The execution plan tells me this:
My indices are:
type_order: Multiple index with type and order
order_type with order as first index value, followed by type
So MySQL should use the index type_order from RAM and then pick out the few entries with the order_type index. I'm expecting only about 1000 non shipped items, so this query should be really fast, but it isn't. Disks are going crazy....
What am I doing wrong?
The query says SELECT sometable.*, so for 1000 matching rows, there will be 1000 fetches of all the fields from the table. Whether the WHERE part indexes are fully loaded into ram or not would only help some. The data fields still have to be retrieved. Odds are, they are scattered all over the disk. So, of course the disk(s) will be doing a thousand small reads.

Should I COUNT(*) or not?

I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.
SELECT COUNT(*) FROM `group_relations`
Or the more specific
SELECT COUNT(`group_id`) FROM `group_relations`
I have a feeling the latter could potentially be faster, but are there any other things to consider?
Update: I am using InnoDB in this case, sorry for not being more specific.
If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,
select count(*)
will count all rows, whereas
select count(group_id)
will only count the rows where group_id is not null.
Also, some database systems, like MySQL, employ an optimization when you ask for count(*), which makes such queries a bit faster than the column-specific one.
Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.
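The NULL behaviour described above is standard SQL and easy to check. A minimal sketch using Python's sqlite3 as a stand-in for MySQL, with a hypothetical group_relations table in which group_id is nullable and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE group_relations (id INTEGER PRIMARY KEY, group_id INTEGER)"
)
# Four rows, one of which has a NULL group_id.
conn.executemany(
    "INSERT INTO group_relations (group_id) VALUES (?)",
    [(1,), (2,), (None,), (2,)],
)

count_star, count_column, count_one = conn.execute(
    "SELECT COUNT(*), COUNT(group_id), COUNT(1) FROM group_relations"
).fetchone()

print("COUNT(*):       ", count_star)    # every row
print("COUNT(group_id):", count_column)  # rows where group_id is not NULL
print("COUNT(1):       ", count_one)     # every row, same as COUNT(*)
```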
If I remember it right, in MySQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.
COUNT(*) counts all rows, while COUNT(column_name) counts only rows without NULL values in the specified column.
Important to note in MySQL:
COUNT() is very fast on MyISAM tables for * or not-null columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance for COUNT(*) or COUNT(column_name), regardless if the column can be null or not. You can read more on the differences on this post at the MySQL performance blog.
If you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
Edit: I just did some research and found out that this only happens in some databases. In SQL Server it's the same to use 1 or *, but on Oracle it's faster to use 1.
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/
Apparently there is no difference between them in MySQL; as in SQL Server, the parser appears to change the query to select(1). Sorry if I misled you in some way.
I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.
I have a MySQL table (InnoDB) that has 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside effects on performance. The table has an auto_increment primary key field (Id) that I know will never be null, which I will use for my WHERE clause test (WHERE Id IS NOT NULL).
The only other possible glitch I see in running tests is the cache. The first time a query is run will always be slower than subsequent queries that use the same indexes. I'll refer to that below as the cache Seeding call. Just to mix it up a little I ran it with a where clause I know will always evaluate to true regardless of any data (TRUE = TRUE).
That said here are my results:
QueryType | w/o WHERE          | WHERE Id IS NOT NULL | WHERE TRUE = TRUE
COUNT(*)  | 9 min 30.13 sec ++ | 6 min 16.68 sec ++   | 2 min 21.80 sec ++
          | 6 min 13.34 sec    | 1 min 36.02 sec      | 2 min 0.11 sec
          | 6 min 10.06 sec    | 1 min 33.47 sec      | 1 min 50.54 sec
COUNT(Id) | 5 min 59.87 sec    | 1 min 34.47 sec      | 2 min 3.96 sec
          | 5 min 44.95 sec    | 1 min 13.09 sec      | 2 min 6.48 sec
COUNT(1)  | 6 min 49.64 sec    | 2 min 0.80 sec       | 2 min 11.64 sec
          | 6 min 31.64 sec    | 1 min 41.19 sec      | 1 min 43.51 sec
++ This is considered the cache Seeding call. It is expected to be slower than the rest.
I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a Where clause dramatically decreases the access time even if it's a clause you know will evaluate to true. The sweet spot appears to be COUNT(Id)... WHERE Id IS NOT NULL.
I would love to see other peoples' results, perhaps with smaller tables or with where clauses against different fields than the field you're counting. I'm sure there are other variations I haven't taken into account.
Seek Alternatives
As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.
Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.
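One common way to offer "next page" controls without a COUNT query is to fetch one row more than the page needs and use its presence as the "has next" signal. This is a sketch of that idea, not a description of how Google does it. It uses Python's sqlite3 as a stand-in, and the records table and fetch_page helper are hypothetical.

```python
import sqlite3


def fetch_page(conn, page, page_size):
    """Return (rows, has_next) for a zero-based page without running COUNT."""
    rows = conn.execute(
        "SELECT id FROM records ORDER BY id LIMIT ? OFFSET ?",
        (page_size + 1, page * page_size),   # ask for one extra row
    ).fetchall()
    # The extra row only tells us a next page exists; it is not returned.
    return rows[:page_size], len(rows) > page_size


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records (id) VALUES (?)", [(i,) for i in range(1, 26)])

first_rows, first_has_next = fetch_page(conn, page=0, page_size=10)
last_rows, last_has_next = fetch_page(conn, page=2, page_size=10)

print(len(first_rows), first_has_next)
print(len(last_rows), last_has_next)
```

The trade-off is that the UI can show "previous" and "next" but not the total number of pages.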
Denormalize
If you absolutely must know the number of records matching a specific count, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion, and decrementing that counter on record deletion.
If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in synch.
-- Insert the row and bump the counter in one transaction.
START TRANSACTION;
INSERT INTO `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;
Alternatively, you could use database triggers if your RDBMS supports them.
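A minimal sketch of the trigger variant, run through Python's sqlite3 so it is self-contained. The trigger syntax shown is SQLite's; MySQL's CREATE TRIGGER syntax differs in detail. The trigger names and the id column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE group_relations (id INTEGER PRIMARY KEY, group_id INTEGER);
-- Single-row table holding the denormalized counter.
CREATE TABLE group_relations_count ("count" INTEGER NOT NULL);
INSERT INTO group_relations_count ("count") VALUES (0);

-- Keep the counter in step with inserts and deletes.
CREATE TRIGGER group_relations_after_insert AFTER INSERT ON group_relations
BEGIN
    UPDATE group_relations_count SET "count" = "count" + 1;
END;

CREATE TRIGGER group_relations_after_delete AFTER DELETE ON group_relations
BEGIN
    UPDATE group_relations_count SET "count" = "count" - 1;
END;
''')

conn.executemany(
    "INSERT INTO group_relations (group_id) VALUES (?)", [(1,), (2,), (3,)]
)
conn.execute("DELETE FROM group_relations WHERE group_id = 2")

counter = conn.execute('SELECT "count" FROM group_relations_count').fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM group_relations").fetchone()[0]
print("counter:", counter, "actual:", actual)
```

Reading the counter is a single-row lookup regardless of table size, at the cost of an extra write on every insert and delete.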
Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.
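The cache fall-through idea can be sketched like this. A plain Python dict stands in for memcached, sqlite3 stands in for the database, and the CachedRowCount class and its cache key are hypothetical. The sketch deliberately does not address the dog-pile effect mentioned above.

```python
import sqlite3


class CachedRowCount:
    """Serve a row count from a cache, running the slow COUNT only on a miss."""

    KEY = "group_relations:count"

    def __init__(self, conn, cache):
        self.conn = conn
        self.cache = cache          # any dict-like store; stand-in for memcached
        self.slow_queries = 0       # how often we fell through to COUNT(*)

    def get(self):
        if self.KEY not in self.cache:
            self.slow_queries += 1
            (n,) = self.conn.execute(
                "SELECT COUNT(*) FROM group_relations"
            ).fetchone()
            self.cache[self.KEY] = n
        return self.cache[self.KEY]

    def insert(self, group_id):
        self.conn.execute(
            "INSERT INTO group_relations (group_id) VALUES (?)", (group_id,)
        )
        if self.KEY in self.cache:  # keep a warm cache in step with the write
            self.cache[self.KEY] += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE group_relations (id INTEGER PRIMARY KEY, group_id INTEGER)")
cache = {}
counter = CachedRowCount(conn, cache)

counter.insert(1)
counter.insert(2)
first = counter.get()       # cache miss: runs COUNT(*)
counter.insert(3)
second = counter.get()      # served from the cache
cache.clear()               # simulate an eviction
third = counter.get()       # miss again: falls through to COUNT(*)
print(first, second, third, counter.slow_queries)
```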
MyISAM tables in MySQL should have an optimisation for COUNT(*) that skips the full table scan.
The asterisk in COUNT has no bearing on the asterisk used to select all fields of a table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field).
I intuit that select COUNT(*) is faster than select COUNT(field). If the RDBMS detected that you specify "*" on COUNT instead of field, it doesn't need to evaluate anything to increment count. Whereas if you specify field on COUNT, the RDBMS will always evaluate if your field is null or not to count it.
But if your field is nullable, specify the field in COUNT.
COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well":
Most count(*) queries are executed the same way by all storage engines if you have a WHERE clause; otherwise InnoDB will have to perform a full table scan.
FACT: InnoDB doesn't optimize count(*) queries without the where clause
It is best to count by an indexed column such as a primary key.
SELECT COUNT(`group_id`) FROM `group_relations`
It should depend on what you are actually trying to achieve as Sebastian has already said, i.e. make your intentions clear! If you are just counting the rows then go for the COUNT(*), or counting a single column go for the COUNT(column).
It might be worth checking out your DB vendor too. Back when I used to use Informix, it had an optimisation for COUNT(*) which had a query plan execution cost of 1, compared to counting single or multiple columns, which would result in a higher figure.
if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMSs are smart enough to know that you don't want to know about the columns.
The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples over MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.
Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*). If you want a count of non-null columns, use COUNT(col) WHERE col IS NOT NULL. Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.
That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.
Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.
I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query since that allows the table to change but still yields the same results.
SELECT COUNT(*) FROM `group_relations`
As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT(`group_id`) gives you; typically you want to know the number of records. That's what COUNT(*) is for, so use it.