I have a database table with 300,000 records. The most common query run against it is:
SELECT [..] WHERE `word` LIKE 'userInput%';
At the moment the column type is varchar(50) UNIQUE. I am wondering whether there is a way to optimize the column for this specific query.
Update 2011-11-19 00:00 GMT:
mysql> EXPLAIN SELECT `word_id`, `word` FROM `words` WHERE `word` LIKE 'bar%';
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| 1 | SIMPLE | words | range | word | word | 152 | NULL | 435 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
As long as you are using a "starts with" type of wildcard search, a standard index on word should work fine.
It's only when you get into a "contains" search (a leading wildcard, e.g. LIKE '%bar%') that indexing becomes genuinely hard.
Of course, attaching an explain plan would help...
Based on the explain plan, it looks like you're doing about as well as you can. It indicates that the index is being used properly, with no full table scans or filesorts.
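To make that concrete, here is a minimal sketch using the table and column names from the EXPLAIN above. Note that the UNIQUE constraint you already have creates exactly the kind of index needed, so this is illustration rather than a required change:

-- The UNIQUE constraint on `word` already provides a B-tree index;
-- the table definition is reconstructed here only for illustration.
CREATE TABLE words (
  word_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  word    VARCHAR(50) NOT NULL,
  UNIQUE KEY word (word)
);

-- A prefix pattern (no leading wildcard) can use the index as a range scan:
EXPLAIN SELECT word_id, word FROM words WHERE word LIKE 'bar%';

-- A "contains" pattern (leading wildcard) cannot, and falls back to a full scan:
EXPLAIN SELECT word_id, word FROM words WHERE word LIKE '%bar%';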
Related
In theory, which of these would return results faster? I'm having to deal with almost half a billion rows in a table and am coming up with a plan to remove quite a few of them. I need to ensure I'm providing the quickest possible solution.
+----+-------------+------------------+------+---------------+------+---------+------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+-----------+---------------------------------+
| 1 | PRIMARY | tableA | ALL | NULL | NULL | NULL | NULL | 505432976 | Using where |
| 2 | SUBQUERY | tableA | ALL | NULL | NULL | NULL | NULL | 505432976 | Using temporary; Using filesort |
+----+-------------+------------------+------+---------------+------+---------+------+-----------+---------------------------------+
2 rows in set (0.00 sec)
+----+-------------+------------------+--------+---------------------------------------------+---------+---------+-----------+-----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+--------+---------------------------------------------+---------+---------+-----------+-----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 505432976 | Using where |
| 1 | PRIMARY | a1 | eq_ref | PRIMARY,FK_address_1,idx_address_1 | PRIMARY | 8 | t2.max_id | 1 | Using where |
| 2 | DERIVED | tableA | ALL | NULL | NULL | NULL | NULL | 505432976 | Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------------------------------------+---------+---------+-----------+-----------+---------------------------------+
3 rows in set (0.01 sec)
Your question may be focused on "subquery" versus "derived table".
And your question is related to Deleting a large part of a table; feel free to ignore my discussion of EXPLAIN and skip to my link below. That is, neither approach is "the quickest"!
Explaining the EXPLAINs
A very crude way to use the EXPLAIN output is to multiply the values in the rows column. In the first query, that is (505432976 * 505432976). This tells me that the query could take years, maybe centuries, to run. The plan seems to say "for each row in 'PRIMARY', scan all of 'SUBQUERY'".
In the second ('DERIVED') query, look at each table's rows value; whether you multiply or add them "depends". Here I think "add" applies: (505432976 + 505432976). Bad, but not nearly as terrible. It seems to say "first copy all of the 'DERIVED' tableA into a temp table, then scan all of that temp table to get the final results."
ALL means a "table scan", which may mean that there is no useful index. Or it may mean that you are deliberately looking at all rows of each 500M-row table.
Caveat: LIMIT is usually not factored into the numbers in EXPLAIN. But sometimes LIMIT does not shorten the execution time.
Each table must have a PRIMARY KEY. Secondary indexes are often very useful. "Composite" indexes are often better than single-column indexes.
Look at the WHERE clause for what column(s) should be indexed. (The art of indexing is much more complex than that, but this would get you started.)
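As a hedged illustration of that tip (the column names here are invented, not from your schema):

-- For a WHERE clause like:  WHERE status = 'active' AND created_at > '2020-01-01'
-- a composite index with the equality column first usually serves it best:
ALTER TABLE tableA ADD INDEX idx_status_created (status, created_at);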
See also EXPLAIN FORMAT=JSON SELECT ...
Show us the queries and tell us about what you need to delete (or "keep")!
Plan to remove quite a few
It may be much faster to copy over the rows you want to keep.
I discuss various techniques for deleting lots of rows. Reading that may save you a lot of grief with your 500M rows!
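As a rough sketch of the copy-and-swap technique (the keep-condition is a placeholder; verify the copy before dropping anything):

-- Copy only the rows you want to keep, then swap the tables.
CREATE TABLE tableA_new LIKE tableA;

INSERT INTO tableA_new
  SELECT * FROM tableA
  WHERE keep_condition;   -- placeholder: whatever defines the rows to keep

RENAME TABLE tableA TO tableA_old,
             tableA_new TO tableA;

DROP TABLE tableA_old;    -- only after verifying the new table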
A "select distinct col1,col2 from table1" where col1 and col2 are of type TEXT and table1 has about 65K rows works fine with MySQL 5.5.58. Now that I've upgraded to MySQL 5.7.20 it takes almost an hour! Does anyone know of any changes to MySQL that may be causing this? Does anyone have any suggestions how col1 and col2 should be optimally indexed for this query, or what other settings I should check to make this query run faster? I don't get the feeling that indexes are even being used since EXPLAIN says it's using a temporary table and no keys:
mysql> explain SELECT DISTINCT author,sort_author from itemsbyauthor;
+----+-------------+---------------+------------+------+---------------+------+---------+------+-------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+---------------+------+---------+------+-------+----------+-----------------+
| 1 | SIMPLE | itemsbyauthor | NULL | ALL | NULL | NULL | NULL | NULL | 64727 | 100.00 | Using temporary |
+----+-------------+---------------+------------+------+---------------+------+---------+------+-------+----------+-----------------+
1 row in set, 1 warning (0.00 sec)
In many cases MySQL doesn't use prefix indexes properly, and it seems this is one of those cases.
Do you really need the column type to be TEXT?
From the column names, it looks like the columns hold author names, which seem like relatively short strings (say, up to 50 or 100 characters).
I would reconsider the column type and try altering it to a VARCHAR with a fixed size, instead of TEXT.
Then, add a compound index that includes both columns.
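A minimal sketch of those two steps, assuming 100 characters is enough (check your data first; index length limits also depend on the character set):

-- Check the longest existing values before shrinking the type:
SELECT MAX(CHAR_LENGTH(author)), MAX(CHAR_LENGTH(sort_author))
FROM itemsbyauthor;

-- Convert TEXT to bounded VARCHAR, then index both columns together:
ALTER TABLE itemsbyauthor
  MODIFY author      VARCHAR(100),
  MODIFY sort_author VARCHAR(100);

ALTER TABLE itemsbyauthor
  ADD INDEX idx_author_sort (author, sort_author);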
Possible Duplicate:
Disadvantages of quoting integers in a Mysql query?
I have a very simple table called Device on a MySQL database.
+-----------------------------------+--------------+------+-----+----------------+
| Field | Type | Null | Key | Extra |
+-----------------------------------+--------------+------+-----+----------------+
| DTYPE | varchar(31) | NO | | |
| id | bigint(20) | NO | PRI | auto_increment |
| dateCreated | datetime | NO | | |
| dateModified | datetime | NO | | |
| phoneNumber | varchar(255) | YES | MUL | |
| version | bigint(20) | NO | | |
| oldPhoneNumber | varchar(255) | YES | | |
+-----------------------------------+--------------+------+-----+----------------+
This table has more than 100K records. I am running a very simple query:
select * from AttDevice where phoneNumber = 5107357058;
This query takes almost 4-6 seconds, but when I change it slightly, as shown below:
select * from AttDevice where phoneNumber = '5107357058';
It executes almost instantly.
Notice that the phoneNumber column is a varchar. I don't understand why the former takes so much longer while the latter doesn't; the only difference between the two queries is the single quotes.
Does MySQL treat these two queries differently? If so, why?
EDIT 1
I used EXPLAIN and got the following output, but I don't know how to interpret the two results.
mysql> EXPLAIN select * from AttDevice where phoneNumber = 5107357058;
+----+-------------+-----------+------+---------------------------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------------------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | Device | ALL | phoneNumber,idx_Device_phoneNumber | NULL | NULL | NULL | 6482116 | Using where |
+----+-------------+-----------+------+---------------------------------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)
mysql> EXPLAIN select * from AttDevice where phoneNumber = '5107357058';
+----+-------------+-----------+------+---------------------------------------+-------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------------------------------+-------------+---------+-------+------+-------------+
| 1 | SIMPLE | Device | ref | phoneNumber,idx_Device_phoneNumber | phoneNumber | 258 | const | 2 | Using where |
+----+-------------+-----------+------+---------------------------------------+-------------+---------+-------+------+-------------+
1 row in set (0.00 sec)
Can someone explain the key, key_len, and rows values in the EXPLAIN output?
1) Thank you for the "EXPLAIN". We all (including you, I'm sure) suspected that the problem was that MySQL had to convert the stored string to a number, and had to do it for each row. But your "EXPLAIN" proved it.
2) Here's a nice, short article about EXPLAIN:
http://www.lornajane.net/posts/2011/explaining-mysqls-explain
The possible_keys column shows which indexes apply to this query and the key tells us which of those was actually used. ... Finally, the rows entry tells us how many rows MySQL had to look at to find the result set.
Search value:   key:          type:   ref:    rows:
-------------   -----------   -----   -----   -------
5107357058      NULL          ALL     NULL    6482116
'5107357058'    phoneNumber   ref     const   2
3) The "ref" column is the "The columns compared to the index". In the second case, the string literal ("constant") '5107357058' was compared to the key "phoneNumber". In the first case, there was no usable key (because your search condition was a completely different type); hence "ref" was NULL.
4) The "type" column is "The join type". "Ref" means "All rows with matching index values are read from this table" (in this case, 2 rows). "ALL" mans "full table scan". Which in this case means 6 million rows.
5) Here's the mysql documentation for "EXPLAIN":
http://dev.mysql.com/doc/refman/5.5/en/explain-output.html
You fooled MySQL into making a bad choice by NOT quoting the phone number. Consider:
The column definition is varchar
In the first (unquoted) case you provided the value as an integer (long). I would have thought MySQL could figure this one out, but obviously it didn't, and did a full table scan.
In the second (quoted) case, you gave the search key in the correct datatype (character) and MySQL chose the index over the full-table-scan.
The varchar index cannot be used when you use a number as the operand, excerpt from the fine documentation on implicit type conversions:
For comparisons of a string column with a number, MySQL cannot use an index on the column to look up the value quickly. If str_col is an indexed string column, the index cannot be used when performing the lookup in the following statement:
SELECT * FROM tbl_name WHERE str_col=1;
The reason for this is that there are many different strings that may convert to the value 1, such as '1', ' 1', or '1a'.
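If the value reaches you as a number, you can keep the index usable by making the constant a string yourself; a small sketch against the table above:

-- Numeric literal: every stored phoneNumber is converted for comparison,
-- so the index cannot be used (full table scan):
SELECT * FROM AttDevice WHERE phoneNumber = 5107357058;

-- String literal: compared directly against the varchar index:
SELECT * FROM AttDevice WHERE phoneNumber = '5107357058';

-- Casting the constant once should behave like the string literal:
SELECT * FROM AttDevice WHERE phoneNumber = CAST(5107357058 AS CHAR);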
I believe that MySQL has to convert each stored varchar to a number in the first example, so it can compare it with the integer literal; in the second example it does not. I'm guessing that's where the difference comes from.
So the first example looks through the table row by row, while the other uses the index.
http://dev.mysql.com/doc/refman/5.0/en/show-columns.html
If Key is MUL, multiple occurrences of a given value are permitted within the column. The column is the first column of a nonunique index or a unique-valued index that can contain NULL values.
So instead of scanning all the rows, the second query looks only at the matching non-null index values, which speeds things up.
...I think.
Table structure:
+-------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| total | int(11) | YES | | NULL | |
| thedatetime | datetime | YES | MUL | NULL | |
+-------------+----------+------+-----+---------+----------------+
Total rows: 137967
mysql> explain select * from out where thedatetime <= NOW();
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | out | ALL | thedatetime | NULL | NULL | NULL | 137967 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
The real query is much longer, with more table joins. The point is, I can't get the table to use the datetime index, which is going to be a problem when I want to select all data up to a certain date. However, I noticed that I can get MySQL to use the index if I select a smaller subset of data.
mysql> explain select * from out where thedatetime <= '2008-01-01';
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| 1 | SIMPLE | out | range | thedatetime | thedatetime | 9 | NULL | 15826 | Using where |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
mysql> select count(*) from out where thedatetime <= '2008-01-01';
+----------+
| count(*) |
+----------+
| 15990 |
+----------+
So, what can I do to make sure MySQL will use the index no matter what date that I put?
There are two things in play here -
The index is not selective enough: if the index would cover more than roughly 30% of the rows, MySQL decides that a full table scan is more efficient. When you narrow the range, the index kicks in.
One index per table in a join
The real query is much longer, with more table joins. The point is ...
The point is exactly that: because the query has joins, it probably can't use that index. MySQL can use only one index per table in a join (unless it qualifies for an index-merge optimization). If the primary key is already used for the join, thedatetime won't be. In order to use it, you need to create a multi-column index on the join key plus thedatetime, in the correct order.
Check the EXPLAIN of the actual query to see which key MySQL uses for the join. Modify that index to include the thedatetime column as well, or create a new multi-column index from both (depending on what you use the join key for).
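For instance, a hedged sketch (customer_id stands in for whatever join column EXPLAIN actually shows being used):

-- Extend the join index so the range on thedatetime can be used too:
ALTER TABLE `out`
  ADD INDEX idx_join_datetime (customer_id, thedatetime);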
Everything works as it is supposed to. :)
Indexes are there to speed up retrieval. They do it using index lookups.
In your first query the index is not used because you are retrieving ALL rows, and in that case using the index is slower (look up index, fetch row, look up index, fetch row... repeated for every row is slower than just reading all rows, i.e. a table scan).
In the second query you are retrieving only a portion of the data, and in that case a table scan is much slower.
The job of the optimizer is to use the statistics the RDBMS keeps on the index to determine the best plan. In the first case the index was considered, but the planner (correctly) threw it away.
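If the optimizer's row estimates look stale, refreshing the index statistics is a cheap first check; a minimal sketch:

-- Rebuild the key-distribution statistics the planner relies on:
ANALYZE TABLE `out`;

-- Then re-examine the plan:
EXPLAIN SELECT * FROM `out` WHERE thedatetime <= NOW();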
EDIT
You might want to read something like this to get some concepts and keywords regarding the MySQL query planner.
I'm wondering if MySQL takes collation into account when generating an index, or if the index is generated the same regardless of collation, the collation only being taken into account when later traversing that index.
For my purposes, I'd like to use the collation utf8_unicode_ci on a field. I know this particular collation has a relatively high performance penalty, but it's still important to me to use it.
I have an index on that field which is being used to satisfy an ORDER BY clause, retrieving the rows in order quickly (avoiding a filesort). However, I'm not sure whether using this collation is going to affect the speed of rows as they are read back from the index, or if the index stores data in an already-normalised state according to that collation, allowing for the performance penalty to be entirely in generating the index and not reading it back.
I believe that the btree structure will be different because it has to compare the column values differently.
Look at these two query plans:
mysql> explain select * from sometable where keycol = '3';
+----+-------------+-------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | pro | ref | PRIMARY | PRIMARY | 66 | const | 34 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-------+------+--------------------------+
mysql> explain select * from sometable where binary keycol = '3';
+----+-------------+-------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+-------+--------------------------+
| 1 | SIMPLE | pro | index | NULL | PRIMARY | 132 | NULL | 14417 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+-------+--------------------------+
If we change the collation for the comparison, suddenly MySQL can't even seek the index any more and has to scan every row. Note that the actual values stored in the index are the same regardless of collation: for instance, the index still returns the value in its original casing whether the collation is case sensitive or case insensitive. Only the comparison and ordering rules differ.
So lookups against a case insensitive collation should be a little less efficient.
However, I doubt you'd ever be able to notice the difference; note that MySQL makes everything case insensitive by default, so the impact can't be that terrible.
UPDATE:
You can see a similar effect for order by operations:
mysql> explain select * from sometable order by keycol collate latin1_general_cs;
+----+-------------+-------+-------+---------------+---------+---------+------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+-------+-----------------------------+
| 1 | SIMPLE | pro | index | NULL | PRIMARY | 132 | NULL | 14417 | Using index; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+-------+-----------------------------+
mysql> explain select * from sometable order by keycol ;
+----+-------------+-------+-------+---------------+---------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+-------+-------------+
| 1 | SIMPLE | pro | index | NULL | PRIMARY | 132 | NULL | 14417 | Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+-------+-------------+
Note the extra 'filesort' stage required to execute the query. That means MySQL is queuing up the result in a temporary buffer and sorting it itself (a quicksort) in an extra stage, throwing away whatever order the index provided. With the original collation this step is unnecessary, as MySQL already knows the order from the index.
MySQL will use the collation of the column for the index. So if you make a field utf8_unicode_ci, the index will effectively also be in utf8_unicode_ci order.
Keep in mind that using the index will not always 100% bypass the performance impact, but for most practical purposes it will.
Many database systems aren't CPU bound, so I doubt you would notice the impact.
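Building on that point, a minimal sketch (the column size and character set are assumptions): when the ORDER BY collation matches the column's own collation the index can be read in order, while forcing a different collation brings the filesort back.

-- Declare the desired collation on the column itself, so the index
-- stores and orders its keys using that collation:
ALTER TABLE sometable
  MODIFY keycol VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

ALTER TABLE sometable ADD INDEX idx_keycol (keycol);

-- Matches the column collation: rows come back in index order, no filesort.
EXPLAIN SELECT * FROM sometable ORDER BY keycol;

-- Forces a different collation: the index order no longer applies.
EXPLAIN SELECT * FROM sometable ORDER BY keycol COLLATE utf8_bin;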