In theory, which of these would return results faster? I'm dealing with almost half a billion rows in a table and am coming up with a plan to remove quite a few. I need to ensure I'm providing the quickest possible solution.
+----+-------------+------------------+------+---------------+------+---------+------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+-----------+---------------------------------+
| 1 | PRIMARY | tableA | ALL | NULL | NULL | NULL | NULL | 505432976 | Using where |
| 2 | SUBQUERY | tableA | ALL | NULL | NULL | NULL | NULL | 505432976 | Using temporary; Using filesort |
+----+-------------+------------------+------+---------------+------+---------+------+-----------+---------------------------------+
2 rows in set (0.00 sec)
+----+-------------+------------------+--------+---------------------------------------------+---------+---------+-----------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+--------+---------------------------------------------+---------+---------+-----------+-----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 505432976 | Using where |
| 1 | PRIMARY | a1 | eq_ref | PRIMARY,FK_address_1,idx_address_1 | PRIMARY | 8 | t2.max_id | 1 | Using where |
| 2 | DERIVED | tableA | ALL | NULL | NULL | NULL | NULL | 505432976 | Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------------------------------------+---------+---------+-----------+-----------+---------------------------------+
3 rows in set (0.01 sec)
Your question may be focused on "subquery" versus "derived table".
And your question is related to Deleting a large part of a table. Ignore my discussion of EXPLAIN and skip to my link below. That is, neither is "the quickest"!
Explaining the EXPLAINs
A very crude way to use the EXPLAIN is to multiply the values in the Rows column. In the first query, that is (505432976 * 505432976). This tells me that the query could take years, maybe centuries, to run. The query seems to say "For each row in 'primary', scan all of 'subquery'".
In the second ('DERIVED') query, look at each "table", then "it depends" when it comes to whether to multiply or add the results. I think that "add" would happen -- (505432976 + 505432976). Bad, but not nearly as terrible. It seems to say "First copy all of the 'derived' tableA into a temp table, then scan all of that temp table to get the final results."
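To put the "multiply" and "add" estimates side by side, here is a minimal arithmetic sketch. The row count comes from the EXPLAIN output above; the scan rate of one million rows per second is purely an assumption for scale, not a measurement of any real server.

```python
# Crude cost estimates taken from the "rows" column of the two EXPLAINs.
ROWS_PER_SCAN = 505_432_976  # value shown in the "rows" column

# First plan: one full scan of the subquery for every row of the outer scan.
multiplied = ROWS_PER_SCAN * ROWS_PER_SCAN

# Second plan: build the derived table once, then scan it once.
added = ROWS_PER_SCAN + ROWS_PER_SCAN

ASSUMED_ROWS_PER_SECOND = 1_000_000  # hypothetical rate, for scale only
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_needed(rows_examined, rows_per_second=ASSUMED_ROWS_PER_SECOND):
    """Time to examine the given number of rows at the assumed rate."""
    return rows_examined / rows_per_second


if __name__ == "__main__":
    years = seconds_needed(multiplied) / SECONDS_PER_YEAR
    minutes = seconds_needed(added) / 60
    print(f"multiply: {multiplied:.3e} rows, about {years:,.0f} years")
    print(f"add:      {added:.3e} rows, about {minutes:,.1f} minutes")
```

The exact figures matter much less than the gap between the two orders of magnitude.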
ALL means a "table scan", which may mean that there is no useful index. Or it may mean that you are deliberately looking at all rows of each 500M-row table.
Caveat: LIMIT is usually not factored into the numbers in EXPLAIN. But sometimes LIMIT does not shorten the execution time.
Each table must have a PRIMARY KEY. Secondary indexes are often very useful. "Composite" indexes are often better than single-column indexes.
Look at the WHERE clause for what column(s) should be indexed. (The art of indexing is much more complex than that, but this would get you started.)
See also EXPLAIN FORMAT=JSON SELECT ...
Show us the queries and tell us about what you need to delete (or "keep")!
Plan to remove quite a few
It may be much faster to copy over the rows you want to keep.
I discuss various techniques for deleting lots of rows. Reading that may save you a lot of grief with your 500M rows!
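The "copy over the rows you want to keep" idea can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module as a stand-in; the `obsolete` column and the `tableA_new` / `tableA_old` names are hypothetical. In MySQL the same steps would typically be written with CREATE TABLE ... LIKE, INSERT ... SELECT, and RENAME TABLE, and the details (locking, foreign keys, concurrent writes) need more care than shown here.

```python
import sqlite3


def rebuild_keeping(conn, keep_condition):
    """Replace tableA with a copy that holds only the rows to keep.

    keep_condition is a trusted, hard-coded SQL fragment in this sketch;
    never build it from user input.
    """
    # 1. Create an empty table with the same (hypothetical) structure.
    conn.execute(
        "CREATE TABLE tableA_new (id INTEGER PRIMARY KEY, obsolete INTEGER NOT NULL)"
    )
    # 2. Copy only the rows that should survive.
    conn.execute(f"INSERT INTO tableA_new SELECT * FROM tableA WHERE {keep_condition}")
    # 3. Swap the tables, then drop the old one.
    conn.execute("ALTER TABLE tableA RENAME TO tableA_old")
    conn.execute("ALTER TABLE tableA_new RENAME TO tableA")
    conn.execute("DROP TABLE tableA_old")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tableA (id INTEGER PRIMARY KEY, obsolete INTEGER NOT NULL)"
    )
    # Ten rows; every third one is marked as obsolete.
    conn.executemany(
        "INSERT INTO tableA (id, obsolete) VALUES (?, ?)",
        [(i, int(i % 3 == 0)) for i in range(1, 11)],
    )
    rebuild_keeping(conn, "obsolete = 0")
    return [row[0] for row in conn.execute("SELECT id FROM tableA ORDER BY id")]


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the work is proportional to the rows kept, not the rows removed, which helps when most of the table is going away.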
Related
I have a simple InnoDB table with 1M+ rows and some simple indexes.
I need to sort this table by first_public and id columns and get some of them, this is why I've indexed first_public column.
first_public is unique at the moment, but in real life it might be not.
mysql> desc table;
+--------------+-------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-------------------------+------+-----+---------+----------------+
| id | bigint unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | | NULL | |
| id_category | int | NO | MUL | NULL | |
| active | smallint | NO | | NULL | |
| status | enum('public','hidden') | NO | | NULL | |
| first_public | datetime | YES | MUL | NULL | |
| created_at | timestamp | YES | | NULL | |
| updated_at | timestamp | YES | | NULL | |
+--------------+-------------------------+------+-----+---------+----------------+
8 rows in set (0.06 sec)
It works well with offsets up to around 130,000:
mysql> explain select id from table where active = 1 and status = 'public' order by first_public desc, id desc limit 24 offset 130341;
+----+-------------+--------+------------+-------+---------------+---------------------+---------+------+--------+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+---------------------+---------+------+--------+----------+----------------------------------+
| 1 | SIMPLE | table | NULL | index | NULL | firstPublicDateIndx | 6 | NULL | 130365 | 5.00 | Using where; Backward index scan |
+----+-------------+--------+------------+-------+---------------+---------------------+---------+------+--------+----------+----------------------------------+
1 row in set, 1 warning (0.00 sec)
But when I try to get the next rows (with offset 140000+), it looks like MySQL doesn't use the first_public column index at all.
mysql> explain select id from table where active = 1 and status = 'public' order by first_public desc, id desc limit 24 offset 140341;
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
| 1 | SIMPLE | table | NULL | ALL | NULL | NULL | NULL | NULL | 1133533 | 5.00 | Using where; Using filesort |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-----------------------------+
1 row in set, 1 warning (0.00 sec)
I tried adding the first_public column to the SELECT clause, but nothing changed.
What am I doing wrong?
MySQL's optimizer tries to estimate the cost of doing your query, to decide if it's worth using an index. Sometimes it compares the cost of using the index versus just reading the rows in order, and discarding the ones that don't belong in the result.
In this case, it decided that if you use an OFFSET greater than 140k, it gives up on using the index.
Keep in mind how OFFSET works. There's no way of looking up the location of an offset by an index. Indexes help to look up rows by value, not by position. So to do an OFFSET query, it has to examine all the rows from the first matching row on up. Then it discards the rows it examined up to the offset, and then counts out enough rows to meet the LIMIT and returns those.
It's like if you wanted to read pages 500-510 in a book, but to do this, you had to read pages 1-499 first. Then when someone asks you to read pages 511-520, you have to read pages 1-510 over again.
Eventually the offset gets to be so large that it's less expensive to read 140000 rows in a table-scan than to read 140000 index entries + 140000 rows.
What?!? Is OFFSET really so expensive? Yes, it is. It's much more common to look up rows by value, so MySQL is optimized for that usage.
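A small arithmetic sketch of why this adds up so quickly. The functions below only count rows under the simplified model described above (each OFFSET query walks past every skipped row); the function names are hypothetical and nothing here measures a real server.

```python
def rows_examined_for_page(offset, limit):
    """Rows one LIMIT/OFFSET query walks through: the skipped ones plus the page."""
    return offset + limit


def rows_examined_paging_with_offset(total_rows, page_size):
    """Total rows examined when every page of the result is fetched via OFFSET."""
    total = 0
    for offset in range(0, total_rows, page_size):
        rows_on_page = min(page_size, total_rows - offset)
        total += rows_examined_for_page(offset, rows_on_page)
    return total


def rows_examined_paging_by_value(total_rows):
    """Total rows examined when each page starts where the previous one ended."""
    return total_rows


if __name__ == "__main__":
    # The first query in the question: LIMIT 24 OFFSET 130341.
    print(rows_examined_for_page(130341, 24))
    # Walking 1,000,000 matching rows, 24 at a time.
    print(rows_examined_paging_with_offset(1_000_000, 24))
    print(rows_examined_paging_by_value(1_000_000))
```

Under this model, `rows_examined_for_page(130341, 24)` gives 130365, which is the same figure EXPLAIN reports in the `rows` column for that query.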
So if you can reimagine your pagination queries to look up rows by value instead of using LIMIT/OFFSET, you'll be much happier.
For example, suppose you read "page" 1000, and you see that the highest id value on that page is 13999. When the client requests the next page, you can do the query:
SELECT ... FROM mytable WHERE id > 13999 ORDER BY id LIMIT 24;
This does the lookup by the value of id, which is optimized because it utilizes the primary key index. Then it reads just 24 rows and returns them (MySQL is at least smart enough to stop reading after it reaches the OFFSET + LIMIT rows).
The best index is
INDEX(active, status, first_public, id)
Using huge offsets is terribly inefficient -- it must scan over 140341 + 24 rows to perform the query.
If you are trying to "walk through" the table, use the technique of "remembering where you left off". More discussion of this: http://mysql.rjweb.org/doc.php/pagination
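"Remembering where you left off" for the two-column ordering in the question (first_public DESC, id DESC) could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the `items` table name, the `idx_walk` index name, and the tiny page size are assumptions for the demo. Rows with a NULL first_public are not handled here.

```python
import sqlite3

PAGE_SIZE = 3  # tiny page size so the demo needs several pages

FILTER = "active = 1 AND status = 'public'"
ORDER = "ORDER BY first_public DESC, id DESC"


def first_page(conn):
    sql = f"SELECT first_public, id FROM items WHERE {FILTER} {ORDER} LIMIT ?"
    return conn.execute(sql, (PAGE_SIZE,)).fetchall()


def next_page(conn, last_first_public, last_id):
    # Continue strictly after the last row already shown; id breaks ties
    # between rows that share the same first_public value.
    sql = (
        f"SELECT first_public, id FROM items WHERE {FILTER} "
        "AND (first_public < ? OR (first_public = ? AND id < ?)) "
        f"{ORDER} LIMIT ?"
    )
    params = (last_first_public, last_first_public, last_id, PAGE_SIZE)
    return conn.execute(sql, params).fetchall()


def walk_all_ids(conn):
    """Collect ids page by page, never using OFFSET."""
    ids = []
    page = first_page(conn)
    while page:
        ids.extend(row[1] for row in page)
        last_first_public, last_id = page[-1]
        page = next_page(conn, last_first_public, last_id)
    return ids


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, active INTEGER NOT NULL, "
        "status TEXT NOT NULL, first_public TEXT)"
    )
    conn.execute("CREATE INDEX idx_walk ON items (active, status, first_public, id)")
    # Ids 1..10; consecutive pairs share a date, and id 4 is inactive.
    rows = [
        (i, 0 if i == 4 else 1, "public", f"2024-01-{(i + 1) // 2:02d}")
        for i in range(1, 11)
    ]
    conn.executemany(
        "INSERT INTO items (id, active, status, first_public) VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


if __name__ == "__main__":
    print(walk_all_ids(build_demo()))
```

The client has to carry both values of the last row (first_public and id) to the next request, since first_public alone may not be unique.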
The reason for the Optimizer to abandon the index: It decided that the bouncing back and forth between the index and the table was possibly worse than simply scanning the entire table. (The cutoff is about 20%, but varies widely.)
In a MySQL db I have a table that only has 2 columns, for all intents and purposes: a key hash and a value. Both are INTEGER type. The hash column will have a large number of duplicates (worst case expect ~80k dup for each hash, not possible to make unique due to small hash preimage), and the table contains on the order of 100 billion rows.
Right now I have the hash column indexed (CREATE INDEX idx_hash ON table(hash)); however lookups are very slow. Something like SELECT value FROM table WHERE hash=123 LIMIT 50 will take minutes if not hours, while a similar select on a similar sized table on a primary key column will finish in a jiffy on the same machine.
So my question is how do I optimize for lookup in this case? Is sub-linear time SELECT possible on index columns? This table will be mostly read-only, rebuilding it is possible but will take a long time, so I'd like to gather some information and do it correctly.
EXPLAIN says:
+----+-------------+----------------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| 1 | SIMPLE | partial_lookup | NULL | ALL | NULL | NULL | NULL | NULL | 100401571 | 10.00 | Using where |
+----+-------------+----------------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
ANALYZE:
+--------------------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+--------------------+---------+----------+----------+
| partial_lookup | analyze | status | OK |
+--------------------+---------+----------+----------+
1 row in set (1.47 sec)
What is the performance penalty for SELECT * FROM Table VS SELECT * FROM (SELECT * FROM Table AS A) AS B
My questions are: Firstly, does the SELECT * involve iteration over the rows in the table, or will it simply return all rows as a chunk without any iteration (because no WHERE clause was given), and if so does the nested query in example two involve iterating over the table twice, and will take 2x the time of the first query? thanks...
The answer to this question hinges on whether you are using MySQL before 5.7, or 5.7 and after. I may be altering your question slightly, but hopefully the following captures what you are after.
Your SELECT * FROM Table does a table scan via the clustered index (the physical ordering). In the case of no primary key, one is implicitly available to the engine. There is no where clause as you say. No filtering or choice of another index is attempted.
The EXPLAIN output shows 1 row in its summary. It is relatively straightforward. The EXPLAIN output and performance with your derived table B will differ depending on whether you are on a version before 5.7, or 5.7 and after.
The document Derived Tables in MySQL 5.7 describes it well for versions 5.6 and 5.7. In 5.7 there is no penalty, because the optimizer can merge the derived table into the outer query instead of materializing it. In prior versions, the derived table was materialized into a temporary table, which incurred substantial overhead.
It is quite easy to test the performance penalty prior to 5.7. All it takes is a medium-sized table to see the noticeable impact that your question's derived table has on performance. The following example is on a small table in version 5.6:
explain
select qm1.title
from questions_mysql qm1
join questions_mysql qm2
on qm2.qid<qm1.qid
where qm1.qid>3333 and qm1.status='O';
+----+-------------+-------+-------+-----------------+---------+---------+------+-------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------+---------+---------+------+-------+------------------------------------------------+
| 1 | SIMPLE | qm1 | range | PRIMARY,cactus1 | PRIMARY | 4 | NULL | 5441 | Using where |
| 1 | SIMPLE | qm2 | ALL | PRIMARY,cactus1 | NULL | NULL | NULL | 10882 | Range checked for each record (index map: 0x3) |
+----+-------------+-------+-------+-----------------+---------+---------+------+-------+------------------------------------------------+
explain
select b.title from
( select qid,title from questions_mysql where qid>3333 and status='O'
) b
join questions_mysql qm2
on qm2.qid<b.qid;
+----+-------------+-----------------+-------+-----------------+---------+---------+------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+-------+-----------------+---------+---------+------+-------+----------------------------------------------------+
| 1 | PRIMARY | qm2 | index | PRIMARY,cactus1 | cactus1 | 10 | NULL | 10882 | Using index |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5441 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | questions_mysql | range | PRIMARY,cactus1 | PRIMARY | 4 | NULL | 5441 | Using where |
+----+-------------+-----------------+-------+-----------------+---------+---------+------+-------+----------------------------------------------------+
Note, I did change the question, but it illustrates the impact that derived tables, and the optimizer's inability to use indexes on them, have in versions prior to 5.7. The derived table benefits from indexes as it is being materialized. But thereafter it endures overhead as a temporary table and is incorporated into the outer query without index use. This is not the case in version 5.7.
Table structure:
+-------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| total | int(11) | YES | | NULL | |
| thedatetime | datetime | YES | MUL | NULL | |
+-------------+----------+------+-----+---------+----------------+
Total rows: 137967
mysql> explain select * from out where thedatetime <= NOW();
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | out | ALL | thedatetime | NULL | NULL | NULL | 137967 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
The real query is much longer, with more table joins. The point is, I can't get the table to use the datetime index. This is going to be hard for me if I want to select all data up to a certain date. However, I noticed that I can get MySQL to use the index if I select a smaller subset of data.
mysql> explain select * from out where thedatetime <= '2008-01-01';
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| 1 | SIMPLE | out | range | thedatetime | thedatetime | 9 | NULL | 15826 | Using where |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
mysql> select count(*) from out where thedatetime <= '2008-01-01';
+----------+
| count(*) |
+----------+
| 15990 |
+----------+
So, what can I do to make sure MySQL will use the index no matter what date I use?
There are two things in play here:
1. The index is not selective enough - if the requested range covers more than approx. 30% of the rows, MySQL will decide a full table scan is more efficient. When you contract the range, the index kicks in.
2. One index per table in a join
"The real query is much longer, with more table joins. The point is ..."
The point is exactly that: because it has joins, it probably can't use that index. MySQL can use one index per table in a join (unless it qualifies for an index-merge optimization). If the primary key is already used for the join, thedatetime won't be used. In order to use it, you need to create a multi-column index on the join key + the thedatetime column, in the correct order.
Check the EXPLAIN of the actual query to see which key MySQL uses for the join. Modify that index to include the thedatetime column as well, or create a new multi-column index from both (depending on what you use the join key for).
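As a rough sketch of the multi-column index suggested above, the example below puts the join key first and the date column second. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the `out_rows` table, the `parent_id` join column, and the `idx_parent_dt` index name are all hypothetical, since the real join is not shown in the question. The query is the kind of per-key lookup that the inner side of a join performs: equality on the join key plus a range on the date.

```python
import sqlite3


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE out_rows (
            id          INTEGER PRIMARY KEY,
            parent_id   INTEGER NOT NULL,   -- hypothetical join key
            total       INTEGER,
            thedatetime TEXT
        );
        -- Join key first (equality), then the date column (range).
        CREATE INDEX idx_parent_dt ON out_rows (parent_id, thedatetime);
        """
    )
    conn.executemany(
        "INSERT INTO out_rows (id, parent_id, total, thedatetime) VALUES (?, ?, ?, ?)",
        [
            (1, 1, 10, "2007-06-01"),
            (2, 1, 20, "2008-06-01"),
            (3, 2, 30, "2007-01-01"),
            (4, 1, 5, "2007-12-31"),
        ],
    )
    return conn


def ids_for_parent_until(conn, parent_id, until):
    """Rows for one join-key value, limited to dates up to `until`."""
    sql = (
        "SELECT id FROM out_rows "
        "WHERE parent_id = ? AND thedatetime <= ? ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, (parent_id, until))]


if __name__ == "__main__":
    print(ids_for_parent_until(build_demo(), 1, "2008-01-01"))
```

The column order is the design choice that matters: with the date first, the equality on the join key could not narrow the range as effectively. Whether MySQL actually picks such an index still has to be confirmed with EXPLAIN on the real query.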
Everything works as it is supposed to. :)
Indexes are there to speed up retrieval. They do it using index lookups.
In your first query the index is not used because you are retrieving ALL rows, and in this case using the index is slower (lookup index, get row, lookup index, get row... times the number of rows is slower than getting all rows with a table scan).
In the second query you are retrieving only a portion of the data, and in this case a table scan is much slower.
The job of the optimizer is to use the statistics that the RDBMS keeps on the index to determine the best plan. In the first case the index was considered, but the planner (correctly) threw it away.
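The question's own numbers make this concrete. The sketch below compares the fraction of rows each query matches against the rough 30% cutoff quoted in the earlier answer. It is a toy decision rule, not MySQL's actual cost model, and it assumes that `thedatetime <= NOW()` matches every row in the table.

```python
TOTAL_ROWS = 137_967          # "Total rows" from the question
MATCHED_UNTIL_2008 = 15_990   # count(*) for thedatetime <= '2008-01-01'
MATCHED_UNTIL_NOW = 137_967   # assumption: every row is in the past
ROUGH_CUTOFF = 0.30           # approximate figure; the real cutoff varies


def likely_plan(matched, total, cutoff=ROUGH_CUTOFF):
    """Return the plan a simple selectivity rule would pick, plus the fraction."""
    fraction = matched / total
    plan = "range scan on index" if fraction < cutoff else "full table scan"
    return plan, fraction


if __name__ == "__main__":
    for label, matched in (
        ("<= '2008-01-01'", MATCHED_UNTIL_2008),
        ("<= NOW()", MATCHED_UNTIL_NOW),
    ):
        plan, fraction = likely_plan(matched, TOTAL_ROWS)
        print(f"{label}: {fraction:.1%} of rows -> {plan}")
```

Both outcomes agree with the `type` column in the two EXPLAIN outputs from the question (`range` for the 2008 cutoff, `ALL` for NOW()).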
EDIT
You might want to read something like this to get some concepts and keywords regarding mysql query planner.
Is there any tangible difference (speed/efficiency) between these statements? Assume the column is indexed.
SELECT MAX(someIntColumn) AS someIntColumn
or
SELECT someIntColumn ORDER BY someIntColumn DESC LIMIT 1
This depends largely on the query optimizer in your SQL implementation. At best, they will have the same performance. Typically, however, the first query is potentially much faster.
The first query essentially asks for the DBMS to inspect every value in someIntColumn and pick the largest one.
The second query asks the DBMS to sort all the values in someIntColumn from largest to smallest and pick the first one. Depending on the number of rows in the table and the existence (or lack thereof) of an index on the column, this could be significantly slower.
If the query optimizer is sophisticated enough to realize that the second query is equivalent to the first one, you are in luck. But if you retarget your app to another DBMS, you might get unexpectedly poor performance.
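The two forms can be compared directly. The sketch below runs both against an in-memory SQLite database through Python's sqlite3 module, used here only as a convenient stand-in; the table name `t` and index name `idx_some` are hypothetical. It checks results, not speed, so it says nothing about which form a given optimizer executes faster.

```python
import sqlite3


def max_two_ways(values):
    """Return (MAX(...) result, ORDER BY ... DESC LIMIT 1 result)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (someIntColumn INTEGER)")
    conn.execute("CREATE INDEX idx_some ON t (someIntColumn)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])

    # Aggregate form: always returns exactly one row (NULL for an empty table).
    via_max = conn.execute("SELECT MAX(someIntColumn) FROM t").fetchone()[0]

    # Sort-and-limit form: returns no row at all for an empty table.
    row = conn.execute(
        "SELECT someIntColumn FROM t ORDER BY someIntColumn DESC LIMIT 1"
    ).fetchone()
    via_order = row[0] if row is not None else None

    conn.close()
    return via_max, via_order


if __name__ == "__main__":
    print(max_two_ways([3, 9, 4]))
    print(max_two_ways([]))
```

One practical difference shows up on an empty table: the aggregate form still yields a single NULL row, while the ORDER BY form yields no rows, so calling code has to handle the two cases differently.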
EDIT based on explain plan:
The explain plan shows that max(column) is more efficient. The explain plan says "Select tables optimized away".
EXPLAIN SELECT version from schema_migrations order by version desc limit 1;
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
| 1 | SIMPLE | schema_migrations | index | NULL | unique_schema_migrations | 767 | NULL | 1 | Using index |
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
EXPLAIN SELECT max(version) FROM schema_migrations ;
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)