MySQL performance with nested indices

I have a MySQL table (articles) with a nested index (blog_id, published), and it performs poorly. I see a lot of these in my slow query log:
- Query_time: 23.184007 Lock_time: 0.000063 Rows_sent: 380 Rows_examined: 6341
SELECT id from articles WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) order by published DESC LIMIT 380;
I have trouble understanding why MySQL would run through all rows with those blog_ids to figure out my top 380 rows. I would expect the whole purpose of the nested index is to speed that up. At the very least, even a naive implementation should look up each blog_id and get its top 380 rows ordered by published. That should be fast, since the nested index lets us locate exactly those 380 rows per blog. Then it would sort the resulting 19*380=7220 rows.
If one were to implement it optimally, you would build a heap over the per-blog_id streams, pick the stream with the max(published), and repeat that 380 times. Each operation should be fast.
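The heap idea described above can be sketched in Python. This only illustrates the merge the question has in mind; it is not a description of what MySQL does internally, and the `top_n` helper and the sample `streams` data are made up for the example.

```python
import heapq
from itertools import islice

def top_n(streams, n):
    """Return the n newest entries across all streams.

    Each stream must already be sorted newest first, which is the order
    a (blog_id, published) index read backwards would give for one blog_id.
    """
    # heapq.merge keeps one head element per stream in a heap and lazily
    # yields the overall maximum at each step (reverse=True).
    merged = heapq.merge(*streams, reverse=True)
    return list(islice(merged, n))

# Hypothetical per-blog streams of (published, article_id) pairs.
streams = [
    [(900, 1), (500, 2), (100, 3)],  # blog_id 13
    [(800, 4), (700, 5)],            # blog_id 14
    [(950, 6), (50, 7)],             # blog_id 15
]
result = top_n(streams, 4)
print(result)
```

With k streams, each step costs O(log k), so only about n elements are ever pulled from the streams rather than every matching row.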
I'm surely missing something, since Google, Facebook, Twitter, Microsoft and all the big companies are using MySQL in production. Anyone with experience?
Edit: Updating as per thieger's answer. I tried index hinting, and it doesn't seem to help. Results are attached below, at the end. The MySQL ORDER BY optimization documentation claims to address the concern thieger is raising:
I agree that MySQL might possibly use the composite blog_id-published-index, but only for the blog_id part of the query.
SELECT * FROM t1 WHERE key_part1=constant ORDER BY key_part2;
At least MySQL seems to claim the index can be used beyond just the WHERE clause (the blog_id part of the query). Any help, thieger?
Thanks,
-Prasanna
[myprasanna at gmail dot com]
CREATE TABLE IF NOT EXISTS `articles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`category_id` int(11) DEFAULT NULL,
`blog_id` int(11) DEFAULT NULL,
`cluster_id` int(11) DEFAULT NULL,
`title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`description` text COLLATE utf8_unicode_ci,
`keywords` text COLLATE utf8_unicode_ci,
`image_url` varchar(511) COLLATE utf8_unicode_ci DEFAULT NULL,
`url` varchar(511) COLLATE utf8_unicode_ci DEFAULT NULL,
`url_hash` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`author` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`categories` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`published` int(11) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`is_image_crawled` tinyint(1) DEFAULT NULL,
`image_candidates` text COLLATE utf8_unicode_ci,
`title_hash` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`article_readability_crawled` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_articles_on_url_hash` (`url_hash`),
KEY `index_articles_on_cluster_id` (`cluster_id`),
KEY `index_articles_on_published` (`published`),
KEY `index_articles_on_is_image_crawled` (`is_image_crawled`),
KEY `index_articles_on_category_id` (`category_id`),
KEY `index_articles_on_title_hash` (`title_hash`),
KEY `index_articles_on_article_readability_crawled` (`article_readability_crawled`),
KEY `index_articles_on_blog_id` (`blog_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=562907 ;
SELECT id from articles USE INDEX(index_articles_on_blog_id) WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) order by published DESC LIMIT 380;
....
380 rows in set (11.27 sec)
explain SELECT id from articles USE INDEX(index_articles_on_blog_id) WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) order by published DESC LIMIT 380\G;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: articles
type: range
possible_keys: index_articles_on_blog_id
key: index_articles_on_blog_id
key_len: 5
ref: NULL
rows: 8640
Extra: Using where; Using filesort
1 row in set (0.00 sec)

Did you try EXPLAIN to see whether your index is used at all? Did you ANALYZE to update the index statistics?
I agree that MySQL might possibly use the composite blog_id-published-index, but only for the blog_id part of the query. If the index is not used after ANALYZE, you can try giving MySQL a hint with USE INDEX or even FORCE INDEX, but the MySQL optimizer may also correctly assume that a sequential scan is faster than using the index. For your kind of query, I would also propose to add an index on category_id and blog_id and try to use that.

Aside from thieger's excellent answer, you might also want to check:
if an index on (category_id,blog_id,published) is any use.
if there is enough room to keep all indexes in memory (innodb buffer pool usage & flushes for instance, mysqlreport is a very handy tool in that respect)

MySQL has a cutoff mechanism where if it detects that it will probably have to look at more than about a third of the table anyway, it won't use the index. Since it appears your query will match just over 6000 rows of an 8000-odd row table, that is definitely what is happening.
In addition, MySQL can't usually use an index twice on the same table, nor can it use more than one. In this case, it won't use the index for the ORDER BY clause because it has different columns specified than in the WHERE clause.

Related

MySQL query with GROUP BY on a full text index is very slow

I'm building an online tool for collecting feedback.
Right now I'm building a visual summary of all answers per question, with the number of occurrences next to each answer. I use this query:
SELECT
feedback_answer,
feedback_qtype,
COUNT(feedback_answer) as occurence
FROM acc_data_1005
WHERE (feedback_qtype=5 or feedback_qtype=4 or feedback_qtype=12 or feedback_qtype=13 or feedback_qtype=1 or feedback_qtype=2)
and survey_id=205283
GROUP BY feedback_answer ORDER BY feedback_qtype DESC, COUNT(feedback_answer) DESC
DB table:
CREATE TABLE `acc_data_1005` (
`id` int UNSIGNED NOT NULL,
`survey_id` int UNSIGNED NOT NULL,
`feedback_id` int UNSIGNED NOT NULL,
`date_registered` date NOT NULL,
`feedback_qid` int UNSIGNED NOT NULL,
`feedback_question` varchar(140) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`feedback_qtype` tinyint UNSIGNED NOT NULL COMMENT 'nps, text, input etc',
`data_type` tinyint UNSIGNED NOT NULL COMMENT '0 till 10 are sensitive data options (first name, last name, email etc.)',
`feedback_answer` varchar(1500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`additional_data` varchar(500) COLLATE utf8mb4_unicode_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC;
ALTER TABLE `acc_data_1005`
ADD PRIMARY KEY (`id`),
ADD KEY `date_registered` (`date_registered`),
ADD KEY `feedback_qid` (`feedback_qid`,`feedback_question`) USING BTREE,
ADD KEY `feedback_id` (`feedback_id`),
ADD KEY `survey_id` (`survey_id`);
ALTER TABLE `acc_data_1005` ADD FULLTEXT KEY `feedback_answer` (`feedback_answer`);
ALTER TABLE `acc_data_1005`
MODIFY `id` int UNSIGNED NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=2020001;
COMMIT;
The table has around 2 million rows and for this test, they all have the same survey_id.
Profiling says executing takes up 96% of the time. EXPLAIN result:
id: 1
select_type: SIMPLE
table: acc_data_1005
partitions: NULL
type: ref
possible_keys: survey_id,feedback_answer
key: survey_id
key_len: 4
ref: const
rows: 998375
filtered: 46.86
Extra: Using where; Using temporary; Using filesort
This query takes around 22-30 seconds for just 11 rows.
If I remove the survey_id (which is important), the query takes around 2-4 seconds (still way too much).
I've been at it for hours but can't find why this query is so slow.
If it helps I can dump the rows in a SQL file (around 400-600MB).
The GROUP BY is slow because it scans 2 million rows and groups them on feedback_answer, a long character column that only has a fulltext index.
I created another table, "analytic_stats", and a cron job that runs this query every month (for only that month's data) and stores the result in the stats table.
When the customer wants the data for a full year (2 million+ rows, which is too slow), I just read a few rows from the stats table and run the GROUP BY query only for the current month. That only has to group around 10,000-20,000 rows instead of 2 million, which is instant.
Maybe not the most efficient way, but it works for me ;)
Hope it might help someone with a similar problem.
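The summary-table approach can be sketched in Python as follows. This is a minimal illustration that assumes the monthly counts have already been stored; `monthly_stats`, `current_month_rows` and `yearly_counts` are made-up names and data, not part of the original schema.

```python
from collections import Counter

# Pre-aggregated counts per closed month, as the cron job would store them
# in the stats table (hypothetical data).
monthly_stats = {
    "2023-01": Counter({"yes": 120, "no": 30}),
    "2023-02": Counter({"yes": 80, "maybe": 10}),
}

# Raw answers for the current month, which has not been summarised yet.
current_month_rows = ["yes", "no", "no", "maybe"]

def yearly_counts(stats, live_rows):
    """Add the stored monthly counts to a fresh count of the live rows."""
    total = Counter()
    for counts in stats.values():
        total += counts
    # Only the current month needs an actual GROUP BY-style pass.
    total += Counter(live_rows)
    return total

totals = yearly_counts(monthly_stats, current_month_rows)
print(totals.most_common())
```

The expensive grouping touches only the current month's rows; the closed months cost one small lookup each.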

Concurrent queries on composite index with order by id drastically slow

I have a table defined as follows:
CREATE TABLE `book` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`provider_id` int(10) unsigned DEFAULT '0',
`source_id` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`description` longtext COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `provider` (`provider_id`,`source_id`),
KEY `idx_source_id` (`source_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1605425 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
When there are about 10 concurrent reads with the following SQL:
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '1037122800') ORDER BY `book`.`id` ASC LIMIT 1
it becomes slow, taking about 100 ms.
However, if I change it to
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '221630001') LIMIT 1
then it is back to normal and takes a few ms.
I don't understand why adding ORDER BY id makes the query so much slower. Could anyone explain?
Try selecting only the columns you need (SELECT column_name, ...) instead of *, or refer to this question: Why is my SQL Server ORDER BY slow despite the ordered column being indexed?
I'm not a MySQL expert and can't perform a detailed analysis, but my guess would be that because you are providing values for the UNIQUE KEY in the WHERE clause, the engine can go and fetch that row directly using an index.
However, when you ask it to ORDER BY the id column, which is a PRIMARY KEY, that changes the access path. The engine now guesses that since it has an index on id, and you want to order by id, it is better to fetch that data in PK order, which will avoid a sort. In this case though, it leads to a slower result, as it has to compare every row to the criteria (a table scan).
Note that this is just conjecture. You would need to EXPLAIN both statements to see what is going on.

MYSQL - How to add index for group by / order by / sum / with where

I am processing a MySQL table with 40K rows. Current execution time is around 2 seconds with the table indexed. Could someone guide me on how to optimize this query and table, and how to get rid of "Using where; Using temporary; Using filesort"? Any help is appreciated.
The GROUP BY will be used for the following cases:
LS_CHG_DTE_OCR
LS_CHG_DTE_OCR/RES_STATE_HSE
LS_CHG_DTE_OCR/RES_STATE_HSE/RES_CITY_HSE
LS_CHG_DTE_OCR/RES_STATE_HSE/RES_CITY_HSE/POSTAL_CDE_HSE
Thanks in advance
SELECT DATE_FORMAT(`LS_CHG_DTE_OCR`, '%Y-%b') AS fmt_date,
SUM(IF(`TYPE`='Connect',COUNT_SUBS,0)) AS connects,
SUM(IF(`TYPE`='Disconnect',COUNT_SUBS,0)) AS disconnects,
SUM(IF(`TYPE`='Connect',ROUND(REV,2),0)) AS REV,
SUM(IF(`TYPE`='Upgrade',COUNT_SUBS,0)) AS upgrades,
SUM(IF(`TYPE`='Downgrade',COUNT_SUBS,0)) AS downgrades,
SUM(IF(`TYPE`='Upgrade',ROUND(REV,2),0)) AS upgradeRev FROM `hsd`
WHERE LS_CHG_DTE_OCR!='' GROUP BY MONTH(LS_CHG_DTE_OCR) ORDER BY LS_CHG_DTE_OCR ASC
CREATE TABLE `hsd` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`SYS_OCR` varchar(255) DEFAULT NULL,
`PRIN_OCR` varchar(255) DEFAULT NULL,
`SERV_CDE_OHI` varchar(255) DEFAULT NULL,
`DSC_CDE_OHI` varchar(255) DEFAULT NULL,
`LS_CHG_DTE_OCR` datetime DEFAULT NULL,
`SALESREP_OCR` varchar(255) DEFAULT NULL,
`CHANNEL` varchar(255) DEFAULT NULL,
`CUST_TYPE` varchar(255) DEFAULT NULL,
`LINE_BUS` varchar(255) DEFAULT NULL,
`ADDR1_HSE` varchar(255) DEFAULT NULL,
`RES_CITY_HSE` varchar(255) DEFAULT NULL,
`RES_STATE_HSE` varchar(255) DEFAULT NULL,
`POSTAL_CDE_HSE` varchar(255) DEFAULT NULL,
`ZIP` varchar(100) DEFAULT NULL,
`COUNT_SUBS` double DEFAULT NULL,
`REV` double DEFAULT NULL,
`TYPE` varchar(255) DEFAULT NULL,
`lat` varchar(100) DEFAULT NULL,
`long` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx` (`LS_CHG_DTE_OCR`,`CHANNEL`,`CUST_TYPE`,`LINE_BUS`,`RES_CITY_HSE`,`RES_STATE_HSE`,`POSTAL_CDE_HSE`,`ZIP`,`COUNT_SUBS`,`TYPE`)
) ENGINE=InnoDB AUTO_INCREMENT=402342 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
Using where; Using temporary; Using filesort
The only condition you apply is LS_CHG_DTE_OCR != "". Other than that you are doing a full table scan because of the aggregations. Index wise you can't do much here.
I ran into the same problem. I had fully optimized my queries (I had joins and more conditions) but the table kept growing and with it query time. Finally I decided to mirror the data to ElasticSearch. In my case it cut down query time to about 1/20th to 1/100th (for different queries).
The only possible index for that SELECT is INDEX(LS_CHG_DTE_OCR). But it is unlikely for it to be used.
Perform the WHERE -- If there are a lot of '' values, then the index may be used for filtering.
GROUP BY MONTH(...) -- You might be folding the same month from multiple years. The Optimizer can't tell, so it will punt on using the index.
ORDER BY LS_CHG_DTE_OCR -- This is done after the GROUP BY; the ORDER BY can't be performed until the data is gathered -- too late for any index. However, if multiple years are folded together, you could get some strange results. Cure it by making the ORDER BY be the same as the GROUP BY. This will also prevent an extra sort that is caused by the GROUP BY and ORDER BY being different.
Yeah, if that idx you added has all the columns in the SELECT, then it is a "covering index". But it won't help any because of the comments above. "Using index" won't help a lot.
GROUP BY LS_CHG_DTE_OCR/RES_STATE_HSE -- Eh? Divide a DATETIME by a VARCHAR? That sounds like a disaster.
This table will grow even bigger over time, correct? Consider building and maintaining Summary Table(s) with month as part of the PRIMARY KEY.
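The point about GROUP BY MONTH(...) folding the same month from different years together can be shown with a small Python sketch. The rows and the `sum_by` helper are invented for illustration and are not data from the hsd table.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (LS_CHG_DTE_OCR, COUNT_SUBS) pairs spanning two years.
rows = [
    (datetime(2012, 8, 3), 2),
    (datetime(2013, 8, 9), 5),
    (datetime(2013, 9, 1), 1),
]

def sum_by(rows, key):
    """Sum the counts per group, like SUM(...) with GROUP BY key."""
    totals = defaultdict(int)
    for when, count in rows:
        totals[key(when)] += count
    return dict(totals)

# Grouping on the month number alone merges August 2012 and August 2013.
by_month = sum_by(rows, lambda d: d.month)

# Grouping on (year, month) keeps them apart.
by_year_month = sum_by(rows, lambda d: (d.year, d.month))

print(by_month)
print(by_year_month)
```

A (year, month) key is also the natural choice for the month part of a summary table's primary key.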

MySQL does not use index on huge table

I have the following table:
CREATE TABLE `test` (
`fingerprint` varchar(80) COLLATE utf8_unicode_ci NOT NULL,
`country` varchar(5) COLLATE utf8_unicode_ci NOT NULL,
`loader` int(10) unsigned NOT NULL,
`date` date NOT NULL,
`installer` int(10) unsigned DEFAULT NULL,
`browser` varchar(5) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`version` varchar(5) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`os` varchar(10) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`language` varchar(10) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
PRIMARY KEY (`fingerprint`, `date`),
KEY `date_1` (`date`),
KEY `date_2` (`date`,`loader`,`installer`,`country`,`browser`,`os`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Right now it contains 10M records and grows by about 2M records per day.
My question: why does MySQL show "Using where" for the following query?
explain select count(*) from test where date between '2013-08-01' and '2013-08-10'
1 SIMPLE test range date_1,date_2 date_1 3 1601644 Using where; Using index
Update: why does the next query then have type ALL and Using where?
explain select * from test use key(date_1) where date between '2013-08-01' and '2013-08-10'
1 SIMPLE test ALL date_1 null null null 3648813 Using where
It does use the index.
It says so right there: Using where; Using index. The "using where" doesn't mean full scan, it means it's using the WHERE condition you provided.
The 1601644 number also hints at that: it means it expects to read roughly 1.6M records, not the whole 10M in the table, and it correlates with your ~2M/day estimate.
In short, it seems to be doing well, it's just a lot of data you will retrieve.
Still, it's reading the table data too, when it seems the index should be enough. Try changing the count(*) with count(date), so date is the only field mentioned in the whole query. If you get only Using index, then it could be faster.
Your query is not just "Using where", it is actually "Using where; Using index". This means the index is used to match your WHERE condition and the index is being used to perform lookups of key values. This is the best case scenario, because in fact the table has never been scanned, the query could be processed with the index only.
Here you can find a full description of the meaning of the output you are looking at.
Your second query only shows the "Using where" notice. This means the index is only used to filter rows. The data must be read from the table (no "Using index" notice), because the index does not contain all the row data (you selected all columns, but the chosen index only covers date). If you had a covering index (that covers all columns), this index would probably be used instead.

Any way to avoid a filesort when order by is different to where clause?

I have an incredibly simple query (table type InnoDb) and EXPLAIN says that MySQL must do an extra pass to find out how to retrieve the rows in sorted order.
SELECT * FROM `comments`
WHERE (commentable_id = 1976)
ORDER BY created_at desc LIMIT 0, 5
exact explain output:
table select_type type extra possible_keys key key length ref rows
comments simple ref using where; using filesort common_lookups common_lookups 5 const 89
commentable_id is indexed. The comments table has nothing tricky in it, just a content field.
The manual suggests that if the order by is different to the where, there is no way filesort can be avoided.
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
I also tried ORDER BY id, since it's equivalent, but that makes no difference, even if I add id as an index (which I understand is not required, as id is indexed implicitly in MySQL).
thanks in advance for any ideas!
To Mark -- here's SHOW CREATE TABLE
CREATE TABLE `comments` (
`id` int(11) NOT NULL auto_increment,
`user_id` int(11) default NULL,
`commentable_type` varchar(255) default NULL,
`commentable_id` int(11) default NULL,
`content` text,
`created_at` datetime default NULL,
`updated_at` datetime default NULL,
`hidden` tinyint(1) default '0',
`public` tinyint(1) default '1',
`access_point` int(11) default '0',
`item_id` int(11) default NULL,
PRIMARY KEY (`id`),
KEY `created_at` (`created_at`),
KEY `common_lookups` (`commentable_id`,`commentable_type`,`hidden`,`created_at`,`public`),
KEY `index_comments_on_item_id` (`item_id`),
KEY `index_comments_on_item_id_and_created_at` (`item_id`,`created_at`),
KEY `index_comments_on_user_id` (`user_id`),
KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=31803 DEFAULT CHARSET=latin1
Note that the MySQL term filesort doesn't necessarily mean it writes to the disk. It just means it's going to sort without using an index. If the result set is small enough, MySQL will sort it in memory, which is orders of magnitude faster than disk I/O.
You can increase the amount of memory MySQL allocates for in-memory filesorts using the sort_buffer_size server variable. In MySQL 5.1, the default sort buffer size is 2MB, and the maximum you can allocate is 4GB.
update: Regarding Jonathan Leffler's comment about measuring how long the sorting takes, you can learn how to use SHOW PROFILE FOR QUERY which will give you the breakdown of how long each phase of query execution takes.
Try adding a combined index on (commentable_id, created_at).
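Why a combined (commentable_id, created_at) index removes the need for a sort can be pictured with a sorted Python list standing in for the index. This is only a rough model of a B-tree range scan, not MySQL's implementation; the `index` data and the `latest` helper are hypothetical.

```python
from bisect import bisect_left

# Entries of a hypothetical (commentable_id, created_at) index, kept sorted;
# the third element stands in for the primary key each entry points to.
index = sorted([
    (1975, "2010-01-05", 11),
    (1976, "2010-01-01", 12),
    (1976, "2010-02-01", 13),
    (1976, "2010-03-01", 14),
    (1977, "2010-01-09", 15),
])

def latest(index, commentable_id, limit):
    """Return the ids of the newest `limit` entries for one commentable_id."""
    # Locate the contiguous slice of entries for this commentable_id.
    lo = bisect_left(index, (commentable_id,))
    hi = bisect_left(index, (commentable_id + 1,))
    # Within that slice the entries are already ordered by created_at,
    # so walking it backwards yields newest first without any sort.
    return [pk for _, _, pk in reversed(index[lo:hi])][:limit]

newest_ids = latest(index, 1976, 2)
print(newest_ids)
```

In the existing common_lookups index, created_at comes after commentable_type and hidden, so entries for one commentable_id are not in created_at order unless those columns are also fixed by equality conditions.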