I have this big table (about a million records) and I'm trying to retrieve the last record of each type.
The table, the index and the query are very simple, and the fact that MySQL is not using the index means I must be overlooking something.
The table looks like this:
CREATE TABLE `MyTable001` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`TypeField` int(11) NOT NULL,
`Value` bigint(20) NOT NULL,
`Timestamp` bigint(20) NOT NULL,
`AnotherField1` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_MyTable001_TypeField` (`TypeField`),
KEY `idx_MyTable001_Timestamp` (`Timestamp`)
) ENGINE=MyISAM
SHOW INDEX gives this:
+------------+------------+--------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+--------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| MyTable001 | 0 | PRIMARY | 1 | id | A | 626141 | NULL | NULL | | BTREE | | |
| MyTable001 | 1 | idx_MyTable001_TypeField | 1 | TypeField | A | 458 | NULL | NULL | | BTREE | | |
| MyTable001 | 1 | idx_MyTable001_Timestamp | 1 | Timestamp | A | 156535 | NULL | NULL | | BTREE | | |
+------------+------------+--------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
But when I execute EXPLAIN for the following query:
SELECT *
FROM MyTable001
GROUP BY TypeField
ORDER BY id DESC
The result is this:
+----+-------------+------------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | MyTable001 | ALL | NULL | NULL | NULL | NULL | 626141 | Using temporary; Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+--------+---------------------------------+
Why won't MySQL use idx_MyTable001_TypeField?
Thanks in advance.
The problem is that the contents of the fields not in the GROUP BY still have to be inspected. Therefore, all rows must be read, and a full table scan ends up cheaper than going through the index. This is clearly seen in the following examples:
SELECT TypeField, COUNT(*) FROM MyTable001 GROUP BY TypeField uses the index.
SELECT TypeField, COUNT(id) FROM MyTable001 GROUP BY TypeField does not.
The original query was incorrect. The correct query is:
SELECT l.*
FROM MyTable001 l
JOIN (
    SELECT MAX(id) AS m_id
    FROM MyTable001
    GROUP BY TypeField
) l_id ON l_id.m_id = l.id;
It takes 260ms in a table with 630k records. Joachim Isaksson's and fancyPants' alternatives took several minutes in my tests.
Is there a more efficient way to do this? I keep thinking I'm missing something. Thanks
SELECT DISTINCT eventId
FROM event_tags_map
WHERE tagId in (
SELECT tagId FROM event_tags_map WHERE eventId=114778
) ORDER BY RAND() LIMIT 5;
I'm hitting the same table twice and I'm wondering if I can get the same results faster.
Table structure:
mysql> describe event_tags_map;
+---------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| eventId | int(10) unsigned | NO | PRI | NULL | |
| tagId | int(10) unsigned | NO | PRI | NULL | |
+---------+------------------+------+-----+---------+-------+
Indexes:
mysql> show index from event_tags_map;
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| event_tags_map | 0 | PRIMARY | 1 | eventId | A | 302032 | NULL | NULL | | BTREE | | |
| event_tags_map | 0 | PRIMARY | 2 | tagId | A | 604065 | NULL | NULL | | BTREE | | |
+----------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.02 sec)
It looks like you will need to refer to the original table twice, one way or another.
I would recommend not using the IN condition, which is not very scalable and has various counter-intuitive behaviors.
My first option is to use a correlated subquery with an EXISTS condition. This is usually the most efficient way to check that something exists...
SELECT DISTINCT eventId
FROM event_tags_map m
WHERE EXISTS (
SELECT 1 FROM event_tags_map m1 WHERE m1.eventId = 114778 AND m1.tagId = m.tagId
)
ORDER BY RAND() LIMIT 5;
An alternative option is to use a self-INNER JOIN:
SELECT DISTINCT eventId
FROM event_tags_map m
INNER JOIN event_tags_map m1 ON m1.eventId = 114778 AND m1.tagId = m.tagId
ORDER BY RAND() LIMIT 5;
Both solutions should be able to take advantage of a composite index on event_tags_map(eventId, tagId).
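Since the table's PRIMARY KEY is already (eventId, tagId), the inner lookup is covered. As a sketch (the index name is mine, and it assumes no equivalent index exists yet), a reversed composite index could also help the outer lookup by tagId:
-- Hypothetical reversed index covering the outer lookup by tagId
ALTER TABLE event_tags_map ADD INDEX idx_tag_event (tagId, eventId);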
I am creating two tables as follows:
create table temp_test2 (
date_id int(11) NOT NULL DEFAULT '0',
`date` date NOT NULL,
`day` int(11) NOT NULL,
PRIMARY KEY (date_id)
);
create table temp_test1 (
date_id int(11) NOT NULL DEFAULT '0',
`date` date NOT NULL,
`day` int(11) NOT NULL,
PRIMARY KEY (date_id)
);
explain select * from temp_test1 as t inner join temp_test2 as t2 on (t2.date_id = t.date_id) limit 3;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------------------------------------------+
| 1 | SIMPLE | t | ALL | date_id | NULL | NULL | NULL | 4 | NULL |
| 1 | SIMPLE | t2 | ALL | date_id | NULL | NULL | NULL | 4 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------------------------------------------+
Why is the date_id key not used for either table? But when I use date_id = something in the ON condition, the key is used:
explain select * from temp_test1 as t inner join temp_test2 as t2 on (t2.date_id = t.date_id and t.date_id = 1) limit 3;
+----+-------------+-------+-------+-------------------------------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-------------------------------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | t | const | PRIMARY,date_id,date_id_2,date_id_3 | PRIMARY | 4 | const | 1 | NULL |
| 1 | SIMPLE | t2 | ref | date_id,date_id_2,date_id_3 | date_id | 4 | const | 1 | NULL |
+----+-------------+-------+-------+-------------------------------------+---------+---------+-------+------+-------+
I also tried unique, composite primary, and composite keys, but it did not work.
Can anyone explain why this is so?
Because your tables contain a very small number of records, the optimiser decides that it is not worth using the index; a table scan will do just as well.
Also, you selected all fields (SELECT *); even if the index were used to execute the JOIN, a row scan would still be required to get the full contents.
The query would be more likely to use the index if:
you selected only the date_id field
there were more than 4 rows in temp_test1
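For example, a covering query that touches only the indexed column is far more likely to use the key (a sketch against the tables above):
explain select t.date_id from temp_test1 as t inner join temp_test2 as t2 on (t2.date_id = t.date_id) limit 3;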
I have a table with 18,310,298 records right now.
And the following query:
SELECT COUNT(obj_id) AS cnt
FROM
`common`.`logs`
WHERE
`event` = '11' AND
`obj_type` = '2' AND
`region` = 'us' AND
DATE(`date`) = DATE('20120213010502');
With the following structure:
CREATE TABLE `logs` (
`log_id` int(11) NOT NULL AUTO_INCREMENT,
`event` tinyint(4) NOT NULL,
`obj_type` tinyint(1) NOT NULL DEFAULT '0',
`obj_id` int(11) unsigned NOT NULL DEFAULT '0',
`region` varchar(3) NOT NULL DEFAULT '',
`date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`log_id`),
KEY `event` (`event`),
KEY `obj_type` (`obj_type`),
KEY `region` (`region`),
KEY `for_stat` (`event`,`obj_type`,`obj_id`,`region`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=83126347 DEFAULT CHARSET=utf8 COMMENT='Logs table' |
and MySQL EXPLAIN shows the following:
+----+-------------+-------+------+--------------------------------+----------+---------+-------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+--------------------------------+----------+---------+-------------+--------+----------+--------------------------+
| 1 | SIMPLE | logs | ref | event,obj_type,region,for_stat | for_stat | 2 | const,const | 837216 | 100.00 | Using where; Using index |
+----+-------------+-------+------+--------------------------------+----------+---------+-------------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
Running this query during daily peak usage takes about 5 seconds.
What can I do to make it faster?
UPDATED: Following the comments, I modified the index and removed the DATE() function from the WHERE clause:
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| logs | 0 | PRIMARY | 1 | log_id | A | 15379109 | NULL | NULL | | BTREE | |
| logs | 1 | event | 1 | event | A | 14 | NULL | NULL | | BTREE | |
| logs | 1 | obj_type | 1 | obj_type | A | 14 | NULL | NULL | | BTREE | |
| logs | 1 | region | 1 | region | A | 14 | NULL | NULL | | BTREE | |
| logs | 1 | for_stat | 1 | event | A | 157 | NULL | NULL | | BTREE | |
| logs | 1 | for_stat | 2 | obj_type | A | 157 | NULL | NULL | | BTREE | |
| logs | 1 | for_stat | 3 | region | A | 157 | NULL | NULL | | BTREE | |
| logs | 1 | for_stat | 4 | date | A | 157 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
mysql> explain extended SELECT COUNT(obj_id) as cnt
-> FROM `common`.`logs`
-> WHERE `event`= '11' AND
-> `obj_type` = '2' AND
-> `region`= 'est' AND
-> date between '2012-11-25 00:00:00' and '2012-11-25 23:59:59';
+----+-------------+-------+-------+--------------------------------+----------+---------+------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+-------+--------------------------------+----------+---------+------+------+----------+-------------+
| 1 | SIMPLE | logs | range | event,obj_type,region,for_stat | for_stat | 21 | NULL | 9674 | 75.01 | Using where |
+----+-------------+-------+-------+--------------------------------+----------+---------+------+------+----------+-------------+
It seems it's running faster. Thanks everyone.
The EXPLAIN output shows that the query is using only the first two columns of the for_stat index.
This is because the query doesn't use obj_id in the WHERE clause. If you create a new key without obj_id (or modify the existing key to reorder the columns), more of the key can be used and you may see better performance:
KEY `for_stat2` (`event`,`obj_type`,`region`,`date`)
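For example, it could be added like so (a sketch; whether to also drop the old for_stat key is a separate decision):
ALTER TABLE `common`.`logs` ADD KEY `for_stat2` (`event`,`obj_type`,`region`,`date`);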
If it's still too slow, changing the last condition to avoid the DATE() function, as suggested by Salman and Sashi, should improve things further.
@Joni already explained what is wrong with your index. As for the query, I assume that your example query selects all records for 2012-02-13 regardless of time. You can change the WHERE clause to use >= and < instead of the DATE() cast:
SELECT COUNT(obj_id) AS cnt
FROM
`common`.`logs`
WHERE
`event` = 11 AND
`obj_type` = 2 AND
`region` = 'us' AND
`date` >= DATE('20120213010502') AND
`date` < DATE('20120213010502') + INTERVAL 1 DAY
The DATE() function applied to the date column is what causes the full scan.
Try this:
SELECT COUNT(obj_id) as cnt
FROM
`common`.`logs`
WHERE
`event` = 11
AND
`obj_type` = 2
AND
`region` = 'us'
AND
`date` = DATE('20120213010502')
As logging (inserts) needs to be fast too, use as few indexes as possible.
Evaluation may take longer, since that is admin work which does not necessarily need indexes.
CREATE TABLE `logs` (
`log_id` int(11) NOT NULL AUTO_INCREMENT,
`event` tinyint(4) NOT NULL,
`obj_type` tinyint(1) NOT NULL DEFAULT '0',
`obj_id` int(11) unsigned NOT NULL DEFAULT '0',
`region` varchar(3) NOT NULL DEFAULT '',
`date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`log_id`),
KEY `for_stat` (`event`,`obj_type`,`region`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=83126347 DEFAULT CHARSET=utf8 COMMENT='Logs table' |
As for the date search, @SashiKant and @SalmanA have already answered.
In MySQL you should order index columns by cardinality: columns with fewer possible values in the table should be placed closer to the left.
Also, you can try changing the region column to ENUM and searching the date with a BETWEEN clause.
MySQL is not using the third column in the index because using it takes more effort than simply filtering (a common thing in MySQL).
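A sketch of both suggestions (the ENUM value list is hypothetical; use the region values that actually occur in your data):
ALTER TABLE `logs`
    MODIFY `region` ENUM('us','est') NOT NULL DEFAULT 'us';

SELECT COUNT(obj_id) AS cnt
FROM `common`.`logs`
WHERE `event` = 11
    AND `obj_type` = 2
    AND `region` = 'us'
    AND `date` BETWEEN '2012-02-13 00:00:00' AND '2012-02-13 23:59:59';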
I have the following MySQL query:
SELECT pool.username
FROM pool
LEFT JOIN sent ON pool.username = sent.username
AND sent.campid = 'YA1LGfh9'
WHERE sent.username IS NULL
AND pool.gender = 'f'
AND (`location` = 'united states' OR `location` = 'us' OR `location` = 'usa');
The problem is that the pool table contains millions of rows and this query takes over 12 minutes to complete. I realize that in this query the entire left table (pool) is being scanned. The pool table has an auto-incremented id column.
I would like to split this query into multiple queries so that rather than scanning the entire pool table I scan 1000 rows at a time and in the next query I would pick up where I left off (1000-2000,2000-3000) and so on using the id column to keep track.
How can I specify this in my query? Please show examples if you know the answer. Thank you.
here are my indexes if it helps:
mysql> show index from main.pool;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pool | 0 | PRIMARY | 1 | id | A | 9275039 | NULL | NULL | | BTREE | |
| pool | 1 | username | 1 | username | A | 9275039 | NULL | NULL | | BTREE | |
| pool | 1 | source | 1 | source | A | 1 | NULL | NULL | | BTREE | |
| pool | 1 | location | 1 | location | A | 38168 | NULL | NULL | | BTREE | |
| pool | 1 | pdex | 1 | gender | A | 2 | NULL | NULL | | BTREE | |
| pool | 1 | pdex | 2 | username | A | 9275039 | NULL | NULL | | BTREE | |
| pool | 1 | pdex | 3 | id | A | 9275039 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
8 rows in set (0.00 sec)
mysql> show index from main.sent;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| sent | 0 | PRIMARY | 1 | primary_key | A | 351 | NULL | NULL | | BTREE | |
| sent | 1 | username | 1 | username | A | 175 | NULL | NULL | | BTREE | |
| sent | 1 | sdex | 1 | campid | A | 7 | NULL | NULL | | BTREE | |
| sent | 1 | sdex | 2 | username | A | 351 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
and here is the explain for my query:
+----+-------------+-------+-------+---------------+------+---------+-------+---------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+-------+---------+--------------------------------------+
| 1 | SIMPLE | pool | ref | location,pdex | pdex | 5 | const | 6084332 | Using where |
| 1 | SIMPLE | sent | index | sdex | sdex | 309 | NULL | 351 | Using where; Using index; Not exists |
+----+-------------+-------+-------+---------------+------+---------+-------+---------+--------------------------------------+
here is the structure of the pool table:
| pool | CREATE TABLE `pool` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`username` varchar(50) CHARACTER SET utf8 NOT NULL,
`source` varchar(10) CHARACTER SET utf8 NOT NULL,
`gender` varchar(1) CHARACTER SET utf8 NOT NULL,
`location` varchar(50) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
KEY `username` (`username`),
KEY `source` (`source`),
KEY `location` (`location`),
KEY `pdex` (`gender`,`username`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=9327026 DEFAULT CHARSET=latin1 |
here is the structure of the sent table:
| sent | CREATE TABLE `sent` (
`primary_key` int(50) NOT NULL AUTO_INCREMENT,
`username` varchar(50) NOT NULL,
`from` varchar(50) NOT NULL,
`campid` varchar(255) NOT NULL,
`timestamp` int(20) NOT NULL,
PRIMARY KEY (`primary_key`),
KEY `username` (`username`),
KEY `sdex` (`campid`,`username`)
) ENGINE=MyISAM AUTO_INCREMENT=352 DEFAULT CHARSET=latin1 |
This produces a syntax error, but the WHERE clause at the beginning is what I'm after:
SELECT pool.username
FROM pool
WHERE id < 1000
LEFT JOIN sent ON pool.username = sent.username
AND sent.campid = 'YA1LGfh9'
WHERE sent.username IS NULL
AND pool.gender = 'f'
AND (location = 'united states' OR location = 'us' OR location = 'usa');
Splitting your query does not sound like the right approach.
The better way would be to fetch some records from your existing query, send your messages and then continue fetching.
Your query could benefit from another compound index on
pool( location, gender, username )
This should allow your complete query to run from sdex and your new index.
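For example (a sketch; the index name is mine):
ALTER TABLE pool ADD INDEX pool_loc_gender (location, gender, username);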
If you really want to split the query, an easy approach could be to
SELECT MIN(id), MAX(id) FROM pool
and then loop from min to max in steps of 1000, and add id >= r AND id < r+1000 to your query.
This could return 0 rows if you have gaps, but it would never return more than 1000 rows at once. A different compound index on pool including (id, location, gender and maybe username) could help for this query.
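One iteration of that loop might look like this (a sketch with r = 1000; note the range predicate belongs in the WHERE clause, not between FROM and JOIN as in the attempt shown in the question):
SELECT pool.username
FROM pool
LEFT JOIN sent ON pool.username = sent.username
    AND sent.campid = 'YA1LGfh9'
WHERE sent.username IS NULL
    AND pool.gender = 'f'
    AND (location = 'united states' OR location = 'us' OR location = 'usa')
    AND pool.id >= 1000 AND pool.id < 2000;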
Looks like it's using pool.location.
You could try adding an index on gender; it might not be a lot of help though.
Rationalising location to a country code in your data, and indexing that, would probably be useful.
But the first index to add looks to be campid to me; that could make a serious dent in the number of records it has to test.
MySQL Server version: 5.0.95
Tables All: InnoDB
I am having an issue with a MySQL db query. Basically I am finding that if I index a particular varchar(50) field, tag.name, my queries take about 10x longer than without the index. I am trying to speed this query up, but my efforts seem to be counterproductive.
The culprit line and field seems to be:
WHERE `t`.`name` IN ('news','home')
I have noticed that if I query the tag table directly, without a join, using the same criteria and with the name field indexed, I do not have the issue. It actually works faster, as expected.
EXAMPLE Query
SELECT `a`.*, `u`.`pen_name`
FROM `tag_link` `tl`
INNER JOIN `tag` `t`
ON `t`.`tag_id` = `tl`.`tag_id`
INNER JOIN `article` `a`
ON `a`.`article_id` = `tl`.`link_id`
INNER JOIN `user` `u`
ON `a`.`user_id` = `u`.`user_id`
WHERE `t`.`name` IN ('news','home')
AND `tl`.`type` = 'article'
AND `a`.`featured` = 'featured'
GROUP BY `article_id`
LIMIT 0 , 5
EXPLAIN with index
+----+-------------+-------+--------+--------------------------+---------+---------+-------------------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+-------------------+------+-----------------------------------------------------------+
| 1 | SIMPLE | t | range | PRIMARY,name | name | 152 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | tl | ref | tag_id,link_id,link_id_2 | tag_id | 4 | portal.t.tag_id | 10 | Using where |
| 1 | SIMPLE | a | eq_ref | PRIMARY,fk_article_user1 | PRIMARY | 4 | portal.tl.link_id | 1 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | portal.a.user_id | 1 | |
+----+-------------+-------+--------+--------------------------+---------+---------+-------------------+------+-----------------------------------------------------------+
EXPLAIN without index
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+------+-------------+
| 1 | SIMPLE | a | index | PRIMARY,fk_article_user1 | PRIMARY | 4 | NULL | 8742 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | portal.a.user_id | 1 | |
| 1 | SIMPLE | tl | ref | tag_id,link_id,link_id_2 | link_id | 4 | portal.a.article_id | 3 | Using where |
| 1 | SIMPLE | t | eq_ref | PRIMARY | PRIMARY | 4 | portal.tl.tag_id | 1 | Using where |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+------+-------------+
TABLE CREATE
CREATE TABLE `tag` (
`tag_id` int(11) NOT NULL auto_increment,
`name` varchar(50) NOT NULL,
`type` enum('layout','image') NOT NULL,
`create_dttm` datetime default NULL,
PRIMARY KEY (`tag_id`)
) ENGINE=InnoDB AUTO_INCREMENT=43077 DEFAULT CHARSET=utf8
INDEXES
SHOW INDEX FROM tag_link;
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| tag_link | 0 | PRIMARY | 1 | tag_link_id | A | 42023 | NULL | NULL | | BTREE | |
| tag_link | 1 | tag_id | 1 | tag_id | A | 10505 | NULL | NULL | | BTREE | |
| tag_link | 1 | link_id | 1 | link_id | A | 14007 | NULL | NULL | | BTREE | |
+----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
SHOW INDEX FROM article;
+---------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| article | 0 | PRIMARY | 1 | article_id | A | 5723 | NULL | NULL | | BTREE | |
| article | 1 | fk_article_user1 | 1 | user_id | A | 1 | NULL | NULL | | BTREE | |
| article | 1 | create_dttm | 1 | create_dttm | A | 5723 | NULL | NULL | YES | BTREE | |
+---------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Final Solution
It seems that MySQL simply chose the wrong order to process the data. In the end it turned out faster to query the tag table in a subquery returning the ids.
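Roughly like this (a sketch of that rewrite, not the exact production query):
SELECT a.*, u.pen_name
FROM tag_link tl
INNER JOIN article a ON a.article_id = tl.link_id
INNER JOIN user u ON a.user_id = u.user_id
WHERE tl.type = 'article'
    AND a.featured = 'featured'
    AND tl.tag_id IN (SELECT tag_id FROM tag WHERE name IN ('news','home'))
GROUP BY a.article_id
LIMIT 0, 5;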
It seems that article_id is the primary key for the article table.
Since you're grouping by article_id, MySQL needs to return the records in order by that column, in order to perform the GROUP BY.
You can see that without the index, it scans all records in the article table, but they're at least in order by article_id, so no later sort is required. The LIMIT optimization can be applied here, since it's already in order, it can just stop after it gets five rows.
In the query with the index on tag.name, instead of scanning the entire article table, it uses the index against the tag table and starts there. Unfortunately, when doing this, the records must later be sorted by article.article_id to complete the GROUP BY clause. The LIMIT optimization can't be applied, since the entire result set must be produced and ordered before the first 5 rows can be returned.
In this case, MySQL just guesses wrongly.
Without the LIMIT clause, using the index is probably faster, which may be what MySQL was assuming.
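One way to test that guess without dropping the index is an index hint (a diagnostic sketch; it assumes the index on tag is named name, as the EXPLAIN suggests):
SELECT a.*, u.pen_name
FROM tag_link tl
INNER JOIN tag t IGNORE INDEX (name) ON t.tag_id = tl.tag_id
INNER JOIN article a ON a.article_id = tl.link_id
INNER JOIN user u ON a.user_id = u.user_id
WHERE t.name IN ('news','home')
    AND tl.type = 'article'
    AND a.featured = 'featured'
GROUP BY article_id
LIMIT 0, 5;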
How big are your tables?
I noticed in the first EXPLAIN you have "Using temporary; Using filesort", which is bad. Your query is likely being dumped to disk, which makes it far slower than an in-memory query.
Also, try to avoid SELECT * and instead query only the fields you need.