I need help optimizing the following query:
select
DATE_FORMAT( traffic.stat_date, '%Y/%m'),
pt.promotion,
sum(traffic.voice_nat_onnet_mins - pt.promo_minutes_onnet) as total_onnet_mins,
sum(traffic.voice_nat_offnet_mins + traffic.voice_nat_landline_mins + traffic.voice_int_mins + traffic.voice_nng_mins + traffic.voice_not_rec_mins - pt.promo_minutes_offnet) as total_offnet_mins,
sum(traffic.sms_ptp_onnet_evts) as total_onnet_sms,
sum(traffic.sms_ptp_offnet_evts + traffic.sms_vas_pta_evts) as total_offnet_sms,
sum(traffic.dati_kb) as internet_kb
from
stats_novercanet.mnp_prod_stat_outgoing_traffic traffic
INNER JOIN stats_novercanet.mnp_prod_stat_promotion_traffic pt
ON pt.id_source_user=traffic.id_source_user
INNER JOIN stats_novercanet.mnp_prod_stat_customer_first_signup fs
ON pt.id_source_user = fs.id_source_user
where
traffic.stat_date between '2013-11-01' and '2013-11-30'
and traffic.stat_date >= (
select min(ft.stat_date)
from stats_novercanet.mnp_prod_stat_promotion_traffic ft
where
traffic.id_source_user=ft.id_source_user
and (ft.sub_rev>0 or ft.ren_rev>0)
and pt.promotion=ft.promotion
)
and pt.stat_date between '2013-11-01' and '2013-11-30'
group by
DATE_FORMAT( traffic.stat_date, '%Y/%m'),
pt.promotion
order by
DATE_FORMAT( traffic.stat_date, '%Y/%m'),
pt.promotion
I ran EXPLAIN on this query and it showed the following result:
+----+--------------------+---------+-------+------------------------------------------------+---------------------------------+---------+-----------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+-------+------------------------------------------------+---------------------------------+---------+-----------------------------------------+--------+----------------------------------------------+
| 1 | PRIMARY | pt | range | idx_prod_stat_pro_tra_stat_date,id_source_user | idx_prod_stat_pro_tra_stat_date | 4 | NULL | 530114 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | fs | ref | id_source_user | id_source_user | 5 | stats_novercanet.pt.id_source_user | 1 | Using where; Using index |
| 1 | PRIMARY | traffic | ref | stat_date,id_source_user | id_source_user | 5 | stats_novercanet.pt.id_source_user | 60 | Using where |
| 2 | DEPENDENT SUBQUERY | ft | ref | id_source_user,promotion | id_source_user | 5 | stats_novercanet.traffic.id_source_user | 93 | Using where |
+----+--------------------+---------+-------+------------------------------------------------+---------------------------------+---------+-----------------------------------------+--------+----------------------------------------------+
Any help with optimization would be great. I have created indexes on id_source_user, stat_date and promotion as well, but no luck. I also tried moving the subquery into a join, but no luck.
The table definition of mnp_prod_stat_promotion_traffic is as follows:
| mnp_prod_stat_promotion_traffic | CREATE TABLE `mnp_prod_stat_promotion_traffic` (
`stat_date` date DEFAULT NULL,
`id_source_user` int(64) DEFAULT NULL,
`promotion` varchar(64) DEFAULT NULL,
`num_of_sub` int(64) DEFAULT NULL,
`num_of_ren` int(64) DEFAULT NULL,
`credit` float DEFAULT NULL,
`minutes` float DEFAULT NULL,
`kb` float DEFAULT NULL,
`sms` int(64) DEFAULT NULL,
`lbs` int(64) DEFAULT NULL,
`sub_rev` float DEFAULT NULL,
`ren_rev` float DEFAULT NULL,
`consumed_credit` float DEFAULT NULL,
`sim_type` varchar(32) DEFAULT NULL,
`price_plan` varchar(64) DEFAULT NULL,
`WiFi_mins` float DEFAULT NULL,
`over_min` float DEFAULT NULL,
`over_min_consumed` float DEFAULT NULL,
`over_sms` float DEFAULT NULL,
`over_sms_consumed` float DEFAULT NULL,
`over_data` float DEFAULT NULL,
`over_data_consumed` float DEFAULT NULL,
`promo_minutes_onnet` float DEFAULT NULL,
`promo_minutes_offnet` float DEFAULT NULL,
`promo_sms_onnet` int(64) DEFAULT NULL,
`promo_sms_offnet` int(64) DEFAULT NULL,
KEY `idx_prod_stat_pro_tra_stat_date` (`stat_date`),
KEY `id_source_user` (`id_source_user`),
KEY `promotion` (`promotion`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
How many results are you expecting to get back? If you know, for example, that you only want one record returned, then you can use LIMIT.
When a query is run it will search the whole table for matching records, but if you know there are only one, two or three results then you can add a LIMIT. This can save MySQL a lot of time, but again it depends on the number of results you are expecting, and you will have to apply it to the tables you are running it on.
Another tip is to check which table types (storage engines) you are using. Have a look at this webpage for more information: http://www.mysqltutorial.org/understand-mysql-table-types-innodb-myisam.aspx
Another option is to build a script that runs your existing query above and stores the result in a new table, and to run the script only once a month via cron, at midnight for instance. I did this for an analytical project, and it worked well.
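As a rough sketch of that last idea, the monthly job could run something like the statements below. The summary table name (mnp_prod_stat_monthly_summary), its index name and its column types are assumptions made up for illustration; the SELECT is the query from the question with the ORDER BY dropped.

```sql
-- Hypothetical summary table: the name and the column types are assumptions.
CREATE TABLE IF NOT EXISTS stats_novercanet.mnp_prod_stat_monthly_summary (
  stat_month        CHAR(7)     NOT NULL,      -- formatted like '2013/11'
  promotion         VARCHAR(64) DEFAULT NULL,
  total_onnet_mins  DOUBLE      DEFAULT NULL,
  total_offnet_mins DOUBLE      DEFAULT NULL,
  total_onnet_sms   DOUBLE      DEFAULT NULL,
  total_offnet_sms  DOUBLE      DEFAULT NULL,
  internet_kb       DOUBLE      DEFAULT NULL,
  KEY idx_stat_month (stat_month)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- Clear the month first so the job can be re-run without duplicating rows.
DELETE FROM stats_novercanet.mnp_prod_stat_monthly_summary
WHERE stat_month = '2013/11';

-- Store the result of the slow query once, instead of running it per report.
INSERT INTO stats_novercanet.mnp_prod_stat_monthly_summary
  (stat_month, promotion, total_onnet_mins, total_offnet_mins,
   total_onnet_sms, total_offnet_sms, internet_kb)
SELECT
  DATE_FORMAT(traffic.stat_date, '%Y/%m'),
  pt.promotion,
  SUM(traffic.voice_nat_onnet_mins - pt.promo_minutes_onnet),
  SUM(traffic.voice_nat_offnet_mins + traffic.voice_nat_landline_mins
      + traffic.voice_int_mins + traffic.voice_nng_mins
      + traffic.voice_not_rec_mins - pt.promo_minutes_offnet),
  SUM(traffic.sms_ptp_onnet_evts),
  SUM(traffic.sms_ptp_offnet_evts + traffic.sms_vas_pta_evts),
  SUM(traffic.dati_kb)
FROM stats_novercanet.mnp_prod_stat_outgoing_traffic traffic
INNER JOIN stats_novercanet.mnp_prod_stat_promotion_traffic pt
  ON pt.id_source_user = traffic.id_source_user
INNER JOIN stats_novercanet.mnp_prod_stat_customer_first_signup fs
  ON pt.id_source_user = fs.id_source_user
WHERE traffic.stat_date BETWEEN '2013-11-01' AND '2013-11-30'
  AND traffic.stat_date >= (
    SELECT MIN(ft.stat_date)
    FROM stats_novercanet.mnp_prod_stat_promotion_traffic ft
    WHERE traffic.id_source_user = ft.id_source_user
      AND (ft.sub_rev > 0 OR ft.ren_rev > 0)
      AND pt.promotion = ft.promotion
  )
  AND pt.stat_date BETWEEN '2013-11-01' AND '2013-11-30'
GROUP BY DATE_FORMAT(traffic.stat_date, '%Y/%m'), pt.promotion;
```

Reports would then read from the small summary table (for example with WHERE stat_month = '2013/11' ORDER BY promotion), so the expensive join only runs once per month.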
Related
I am trying to execute a simple select query using a table indexed on src_ip like so:
SELECT * FROM netflow_nov2 WHERE src_IP=3111950672;
However this is not completed after even 4 or 5 hours. I need the response to be in the range of a few seconds. I am wondering how I can optimize it so this is the case.
Also note that the source IPs were converted to integers using the built-in SQL command.
Other information about the table:
The table contains netflow data parsed from nfdump. I am using the table to get information about specific IP addresses. In other words, basically only queries like the above will be used.
Here is the relevant info as given by SHOW TABLE STATUS for this table:
Rows: 4,205,602,143 (4 billion)
Data Length: 426,564,911,104 (426 GB)
Index Length: 57,283,706,880 (57 GB)
Information about the system:
Hard disk: ~2TB, using close to maximum
RAM: 64GB
my.cnf file:
see gist: https://gist.github.com/ashtonwebster/e0af038101e1b42ca7e3
Table structure:
mysql> DESCRIBE netflow_nov2;
+-----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+-------+
| date | datetime | YES | MUL | NULL | |
| duration | float | YES | | NULL | |
| protocol | varchar(16) | YES | | NULL | |
| src_IP | int(10) unsigned | YES | MUL | NULL | |
| src_port | int(2) | YES | | NULL | |
| dest_IP | int(10) unsigned | YES | MUL | NULL | |
| dest_port | int(2) | YES | | NULL | |
| flags | varchar(8) | YES | | NULL | |
| Tos | int(4) | YES | | NULL | |
| packets | int(8) | YES | | NULL | |
| bytes | int(8) | YES | | NULL | |
| pps | int(8) | YES | | NULL | |
| bps | int(8) | YES | | NULL | |
| Bpp | int(8) | YES | | NULL | |
| Flows | int(8) | YES | | NULL | |
+-----------+------------------+------+-----+---------+-------+
15 rows in set (0.02 sec)
I have additional info about the indexes and the results of explain, but briefly:
-The indexes are b-trees, and there are indexes for date, src_ip, and dest_ip, but only src_ip will really be used
-Based on the output of EXPLAIN, the src_ip index is being used for that particular query mentioned at the top
And the output of mysqltuner:
see gist: https://gist.github.com/ashtonwebster/cbfd98ee1799a7f6b323
SHOW CREATE TABLE output:
| netflow_nov2 | CREATE TABLE `netflow_nov2` (
`date` datetime DEFAULT NULL,
`duration` float DEFAULT NULL,
`protocol` varchar(16) DEFAULT NULL,
`src_IP` int(10) unsigned DEFAULT NULL,
`src_port` int(2) DEFAULT NULL,
`dest_IP` int(10) unsigned DEFAULT NULL,
`dest_port` int(2) DEFAULT NULL,
`flags` varchar(8) DEFAULT NULL,
`Tos` int(4) DEFAULT NULL,
`packets` int(8) DEFAULT NULL,
`bytes` int(8) DEFAULT NULL,
`pps` int(8) DEFAULT NULL,
`bps` int(8) DEFAULT NULL,
`Bpp` int(8) DEFAULT NULL,
`Flows` int(8) DEFAULT NULL,
KEY `src_IP` (`src_IP`),
KEY `dest_IP` (`dest_IP`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
Thanks in advance
Your current table structure is optimized for random writes: records are placed on disk in the order of writes.
Unfortunately the only read pattern that is well supported by such a structure is a full-table scan.
Usage of non-covering secondary indices still results in a lot of random disk seeks which are killing performance.
The best reading performance is obtained when data is read in the same order as it is located on disk, which for InnoDB means in the primary key order.
A materialized view (another InnoDB table that has an appropriate primary key) could be a possible solution. In this case a primary key that starts with src_IP is required.
upd: The idea is to achieve data locality and avoid random disk IO, aiming for sequential reading. This means that your materialized view will look like this:
CREATE TABLE `netflow_nov2_view` (
`row_id` bigint not null, -- see below
`date` datetime DEFAULT NULL,
`duration` float DEFAULT NULL,
`protocol` varchar(16) DEFAULT NULL,
`src_IP` int(10) unsigned NOT NULL, -- part of the primary key, so it cannot be NULL
`src_port` int(2) DEFAULT NULL,
`dest_IP` int(10) unsigned DEFAULT NULL,
`dest_port` int(2) DEFAULT NULL,
`flags` varchar(8) DEFAULT NULL,
`Tos` int(4) DEFAULT NULL,
`packets` int(8) DEFAULT NULL,
`bytes` int(8) DEFAULT NULL,
`pps` int(8) DEFAULT NULL,
`bps` int(8) DEFAULT NULL,
`Bpp` int(8) DEFAULT NULL,
`Flows` int(8) DEFAULT NULL,
PRIMARY KEY (`src_IP`, `row_id`) -- you won't need other keys
) ENGINE=InnoDB DEFAULT CHARSET=latin1
where row_id has to be maintained by your materializing logic, since you don't have it in the original table (or you can introduce an explicit auto_increment field to your original table, it's how InnoDB handles it anyway).
The crucial difference is that now all data on the disk is placed in the primary key order, which means that once you locate the first record with a given 'src_IP' all other records can be obtained as sequentially as possible.
Depending on the way your data is written and adjacent application logic it can be accomplished either via triggers or by some custom external process.
If it is possible to sacrifice current write performance (or use some async queue as a buffer) then probably having a single table optimized for reading would suffice.
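For a one-off backfill, a minimal sketch could look like the statements below. This is only an illustration, not the only way to maintain the view: generating row_id with a user variable is my assumption (it works, but assigning variables inside expressions is deprecated in MySQL 8.0), and on a table of this size the copy would probably have to be split into chunks by src_IP range.

```sql
-- Assumption: row_id is generated with a session variable during the copy.
SET @row_id := 0;

-- Copy the rows in src_IP order so they are inserted in primary key order.
-- Rows with a NULL src_IP are skipped, since they cannot be part of the key.
INSERT INTO netflow_nov2_view
  (`row_id`, `date`, `duration`, `protocol`, `src_IP`, `src_port`,
   `dest_IP`, `dest_port`, `flags`, `Tos`, `packets`, `bytes`,
   `pps`, `bps`, `Bpp`, `Flows`)
SELECT
  (@row_id := @row_id + 1),
  `date`, `duration`, `protocol`, `src_IP`, `src_port`,
  `dest_IP`, `dest_port`, `flags`, `Tos`, `packets`, `bytes`,
  `pps`, `bps`, `Bpp`, `Flows`
FROM netflow_nov2
WHERE `src_IP` IS NOT NULL
ORDER BY `src_IP`;

-- The lookup from the question, now served by a primary key range read.
SELECT * FROM netflow_nov2_view WHERE `src_IP` = 3111950672;
```

The values of row_id only need to be unique within each src_IP, so the exact numbering order does not matter.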
More on InnoDB indexing:
http://dev.mysql.com/doc/refman/5.6/en/innodb-index-types.html
I would think that reading the table without an index would take less than 5 hours. But you do have a big table. There are two "environmental" possibilities that would kill the performance:
The table is locked by another process.
The result set is huge (tens of millions of rows) and the network latency/processing time for returning the result set is causing the problem.
My first guess, though, is that the query is not using the index. I missed this at first, but you have one multi-part index. The only index this query can take advantage of is one where the first key is src_IP. So, if your index is either netflow_nov2(src_IP, date, dest_ip) or netflow_nov2(src_IP, dest_ip, date), then you are OK. If either of the other columns is first, then this index will not be used. You can easily see what is happening by putting EXPLAIN in front of the query to check whether the index is being used.
If this is a problem, create an index with src_IP as the first (or only) key in the index.
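The check and the fix could be sketched like this. The index name idx_src_ip is a made-up example, and adding an index to a table of this size will take a long time, so only do it if the first two statements show that no existing index starts with src_IP.

```sql
-- See which index (if any) the optimizer picks for the lookup.
EXPLAIN SELECT * FROM netflow_nov2 WHERE src_IP = 3111950672;

-- List the existing indexes and their column order (Seq_in_index = 1 is the
-- leading column of each index).
SHOW INDEX FROM netflow_nov2;

-- Only if no index has src_IP as its first column:
-- idx_src_ip is a hypothetical name.
ALTER TABLE netflow_nov2 ADD INDEX idx_src_ip (src_IP);
```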
I can't find a way to speed up simple queries on a huge table.
I don't think I'm asking anything crazy of MySQL, even with this amount of data, and I can't understand why the following queries have such different execution times!
I did my best to read all the articles about big data in MySQL and field optimization, and I already managed to reduce query time by changing field types, but I'm really getting lost now with this kind of simple query!
Here is an example on MySQL 5.1.69:
SELECT rv.`id_prd`,SUM(`quantite`)
FROM `report_ventes` AS rv
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 3.76 sec
Let's add a LEFT JOIN and another selected field :
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 12.10 sec
Explain :
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
And let's add another WHERE clause:
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE rp.`id_clas_prd` LIKE '1%'
AND `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 21.00 sec
Explain :
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY,id_clas_prd | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using where |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
And here are the table definitions:
report_produits : 80 000 rows
CREATE TABLE `report_produits` (
`id_prd` int(11) unsigned NOT NULL,
`acl_cip_7` int(7) NOT NULL,
`acl_cip_ean_13` varchar(255) DEFAULT NULL,
`lib_prd` varchar(255) DEFAULT NULL,
`id_clas_prd` char(7) NOT NULL DEFAULT '',
`id_lab_prd` int(11) unsigned NOT NULL,
`id_rbt_prd` int(11) unsigned NOT NULL,
`id_tva_prd` int(11) unsigned NOT NULL,
`t_gen` varchar(255) NOT NULL,
`id_grp_gen` varchar(16) NOT NULL DEFAULT '',
`id_liste_delivrance` int(11) unsigned NOT NULL,
PRIMARY KEY (`id_prd`),
KEY `index_lab` (`id_lab_prd`),
KEY `index_grp` (`id_grp_gen`),
KEY `id_clas_prd` (`id_clas_prd`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
report_ventes : 16 556 188 rows
CREATE TABLE `report_ventes` (
`id` int(13) NOT NULL AUTO_INCREMENT,
`periode` mediumint(6) DEFAULT NULL,
`id_phie` smallint(4) unsigned NOT NULL,
`id_prd` mediumint(8) unsigned NOT NULL,
`quantite` smallint(11) DEFAULT NULL,
`ca_ht` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `periode` (`periode`)
) ENGINE=MyISAM AUTO_INCREMENT=18491315 DEFAULT CHARSET=utf8;
There is no covering index, and MySQL decides that scanning the whole table is more efficient than using an index and then looking up the requested values.
You are joining on id_prd of report_ventes, but that column is not part of the clustering index (the PK in MySQL). This means the server has to look up all the values. The server probably bypasses the periode index because it is not selective enough.
An index that includes the id_prd, periode and quantite columns could help. With such an index there is a chance that the MySQL server will use it, since it is a covering index for this query.
Give it a try, but it's hard to tell for sure without testing it on the actual environment.
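Something along these lines, as a sketch only. The index name is invented, and the column order simply follows the order mentioned above; whether the optimizer actually picks the index has to be checked with EXPLAIN on the real data.

```sql
-- Hypothetical index name; contains every report_ventes column the query reads.
ALTER TABLE report_ventes
  ADD INDEX idx_prd_periode_quantite (id_prd, periode, quantite);

-- Re-check the plan: if the index is chosen as a covering index, the Extra
-- column should mention "Using index" for rv.
EXPLAIN
SELECT rv.id_prd, SUM(rv.quantite)
FROM report_ventes AS rv
WHERE rv.periode BETWEEN 201301 AND 201312
GROUP BY rv.id_prd;
```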
Basically your indexes are not being used. I can't spot the precise reason without trying it on a SQL server, but a common cause is that the data has different types.
AND periode BETWEEN 201301 AND 201312
"periode" has datatype mediumint(6) and the literal "201301" possibly has datatype int(10).
LEFT JOIN `report_produits` AS rp ON (rv.`id_prd` = rp.`id_prd`)
Here the two datatypes are also different.
I have a MySQL issue. I have to optimize some queries on my website. I have already fixed one of them, but there are still some that I cannot resolve without your help.
I have a table called "news":
CREATE TABLE IF NOT EXISTS `news` (
`id` int(10) NOT NULL auto_increment,
`edited` smallint(1) NOT NULL default '0',
`site` varchar(30) default NULL,
`foreign_id` varchar(25) default NULL,
`title` varchar(255) NOT NULL,
`text` text NOT NULL,
`image` varchar(255) default NULL,
`horizontal` smallint(1) NOT NULL,
`image_author` varchar(255) default NULL,
`text_author` varchar(255) default NULL,
`lang` varchar(3) NOT NULL,
`link` varchar(255) NOT NULL,
`date` date NOT NULL,
`redirect` smallint(1) NOT NULL,
`parent` int(10) NOT NULL,
`views` int(5) NOT NULL,
`status` smallint(1) NOT NULL,
PRIMARY KEY (`id`),
KEY `lang` (`lang`,`status`),
KEY `date` (`date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=47122 ;
As you can see, I have two indexes: "lang" and "date".
I have tried several combinations of indexes and this one gave me the best results ... unfortunately only on my local computer. On the server I still get bad results. Note that the database is the same on both.
query:
SELECT id FROM news WHERE lang = 'en' AND STATUS =1 ORDER BY DATE DESC LIMIT 0, 10
localhost explain:
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | news | index | lang | date | 3 | NULL | 23 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
server explain:
+----+-------------+-------+------+---------------+--------+---------+-------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+--------+---------+-------------+-------+-----------------------------+
| 1 | SIMPLE | news | ref | status | status | 13 | const,const | 15840 | Using where; Using filesort |
+----+-------------+-------+------+---------------+--------+---------+-------------+-------+-----------------------------+
I have looked at a lot of similar topics, but unfortunately I cannot find a solution that works on my server. I would be very glad to hear a solution from you, with some explanation, so that I can optimize my other queries.
Thanks !
This is your query:
SELECT id
FROM news
WHERE lang = 'en' AND STATUS =1
ORDER BY DATE DESC
LIMIT 0, 10
The best index is one that contains all the fields used in the query (four fields in all). The order of columns in the index is: the equality conditions from the WHERE clause, followed by the ORDER BY columns, followed by the other columns in the SELECT clause.
So, try this index: news(lang, status, date, id).
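In MySQL syntax that would be roughly the following; the index name is just an example I made up.

```sql
-- Equality columns first (lang, status), then the ORDER BY column (date),
-- then the selected column (id). idx_lang_status_date_id is a hypothetical name.
ALTER TABLE news
  ADD INDEX idx_lang_status_date_id (`lang`, `status`, `date`, `id`);

-- Check on the server whether the new index is picked and the filesort is gone.
EXPLAIN
SELECT `id`
FROM news
WHERE `lang` = 'en' AND `status` = 1
ORDER BY `date` DESC
LIMIT 0, 10;
```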
Here is the query:
select timespans.id as timespan_id, count(*) as num
from reports, timespans
where timespans.after_date >= '2011-04-13 22:08:38' and
timespans.after_date <= reports.authored_at and
reports.authored_at < timespans.before_date
group by timespans.id;
Here are the table defs:
CREATE TABLE `reports` (
`id` int(11) NOT NULL auto_increment,
`source_id` int(11) default NULL,
`url` varchar(255) default NULL,
`lat` decimal(20,15) default NULL,
`lng` decimal(20,15) default NULL,
`content` text,
`notes` text,
`authored_at` datetime default NULL,
`created_at` datetime default NULL,
`updated_at` datetime default NULL,
`data` text,
`title` varchar(255) default NULL,
`author_id` int(11) default NULL,
`orig_id` varchar(255) default NULL,
PRIMARY KEY (`id`),
KEY `index_reports_on_title` (`title`),
KEY `index_content_on_reports` (`content`(128))
);
CREATE TABLE `timespans` (
`id` int(11) NOT NULL auto_increment,
`after_date` datetime default NULL,
`before_date` datetime default NULL,
`after_offset` int(11) default NULL,
`before_offset` int(11) default NULL,
`is_common` tinyint(1) default NULL,
`created_at` datetime default NULL,
`updated_at` datetime default NULL,
`is_search_chunk` tinyint(1) default NULL,
`is_day` tinyint(1) default NULL,
PRIMARY KEY (`id`),
KEY `index_timespans_on_after_date` (`after_date`),
KEY `index_timespans_on_before_date` (`before_date`)
);
And here is the explain:
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | timespans | range | index_timespans_on_after_date,index_timespans_on_before_date | index_timespans_on_after_date | 9 | NULL | 84 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | reports | ALL | NULL | NULL | NULL | NULL | 183297 | Using where |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+----------------------------------------------+
And here is the explain after I create an index on authored_at. As you can see, the index is not actually getting used (I think...)
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+------------------------------------------------+
| 1 | SIMPLE | timespans | range | index_timespans_on_after_date,index_timespans_on_before_date | index_timespans_on_after_date | 9 | NULL | 86 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | reports | ALL | index_reports_on_authored_at | NULL | NULL | NULL | 183317 | Range checked for each record (index map: 0x8) |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+------------------------------------------------+
There are about 142k rows in the reports table, and far fewer in the timespans table.
The query is taking about 3 seconds now.
The strange thing is that if I add an index on reports.authored_at, it actually makes the query far slower, about 20 seconds. I would have thought it would do the opposite, since it would make it easy to find the reports at either end of the range, and throw the rest away, rather than having to examine all of them.
Can someone clarify? I'm stumped.
Instead of two separate indexes on the timespans table, try merging them into a single multi-column index covering both before_date and after_date. Then add an index on authored_at as well.
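A sketch of what that could look like. The name of the combined index and the column order inside it (after_date first, because the query compares it to a constant) are my assumptions; try the other order too and compare the EXPLAIN output.

```sql
-- Hypothetical combined index on timespans.
ALTER TABLE timespans
  ADD INDEX index_timespans_on_after_before (after_date, before_date);

-- Index on the reports side of the range join.
-- Skip this statement if index_reports_on_authored_at already exists.
ALTER TABLE reports
  ADD INDEX index_reports_on_authored_at (authored_at);

-- Once the combined index is confirmed to help, the two single-column
-- indexes on timespans could be dropped; that step is left out here.
```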
I rewrote your query like this:
select t.id, count(*) as num from timespans t
join reports r where t.after_date >= '2011-04-13 22:08:38'
and r.authored_at >= '2011-04-13 22:08:38'
and r.authored_at < t.before_date
group by t.id order by null;
and changed the indexes on the tables:
alter table reports add index authored_at_idx(authored_at);
You can also use the database's partitioning feature on the after_date column. It should help you a lot.
I currently have 2 tables that are used for a select query with a simple join. The first table houses around 6-9 million rows, and this gets used as the join. The primary table is anywhere from 1mil to 300mil rows. However, I notice when I join above 10mil rows on the primary table the select query goes from instant to very slow (3+ seconds and grows).
Here is my table structure and queries.
CREATE TABLE IF NOT EXISTS `links` (
`link_id` int(10) unsigned NOT NULL,
`domain_id` mediumint(7) unsigned NOT NULL,
`parent_id` int(11) unsigned DEFAULT NULL,
`hash` int(10) unsigned NOT NULL,
`url` text NOT NULL,
`type` enum('html','pdf') DEFAULT NULL,
`processed` enum('N','Y') NOT NULL DEFAULT 'N',
UNIQUE KEY `hash` (`hash`),
KEY `idx_processed` (`processed`),
KEY `domain_id` (`domain_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT;
CREATE TABLE IF NOT EXISTS `domains` (
`domain_id` mediumint(7) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(170) NOT NULL,
`blocked` enum('N','Y') NOT NULL DEFAULT 'N',
`count` mediumint(6) NOT NULL DEFAULT '0',
`mcount` mediumint(3) NOT NULL,
PRIMARY KEY (`domain_id`),
KEY `name` (`name`),
KEY `blocked` (`blocked`),
KEY `mcount` (`mcount`),
KEY `count` (`count`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=10834389 ;
Query:
(SELECT link_id, url, hash FROM links, domains WHERE links.domain_id = domains.domain_id and mcount > 1 and processed='N' limit 200)
UNION
(SELECT link_id, url, hash FROM links where processed='N' and type='html' limit 200)
Explain select:
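+----+--------------+------------+-------+-------------------------+---------------+---------+---------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+------------+-------+-------------------------+---------------+---------+---------------------------+---------+-------------+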
| 1 | PRIMARY | domains | range | PRIMARY,mcount | mcount | 3 | NULL | 257673 | Using where |
| 1 | PRIMARY | links | ref | idx_processed,domain_id | domain_id | 3 | crawler.domains.domain_id | 1 | Using where |
| 2 | UNION | links | ref | idx_processed | idx_processed | 1 | const | 7090017 | Using where |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+------------+-------+-------------------------+---------------+---------+---------------------------+---------+-------------+
Right now, I'm trying a partition with 20 partitions on links using domain_id as the key.
Any other options would be greatly appreciated.
A single SELECT statement would replace your entire UNION statement:
SELECT link_id, url, hash
FROM links, domains
WHERE links.domain_id = domains.domain_id
AND mcount > 1
AND processed='N'
AND type='html'
This may not be THE answer you are looking for, but it should help you simplify your question.
When things suddenly slow down you might want to check the size of your indexes (used in the query execution) vs size of various mysql buffers.
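One way to compare the two, as a rough sketch: the schema name crawler is taken from the ref column of the EXPLAIN output above, and since both tables are MyISAM the relevant buffer is most likely key_buffer_size.

```sql
-- Size of the data and indexes for the two tables, in bytes.
SELECT table_name, engine, table_rows, data_length, index_length
FROM information_schema.TABLES
WHERE table_schema = 'crawler'
  AND table_name IN ('links', 'domains');

-- Index cache used by MyISAM tables.
SHOW VARIABLES LIKE 'key_buffer_size';

-- Only relevant if the tables are ever converted to InnoDB.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```

If index_length for links is much larger than key_buffer_size, the indexes no longer fit in memory, which would be consistent with the query turning slow past a certain table size.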