MySQL indexing for INSERT INTO ... ON DUPLICATE KEY UPDATE

To give you a context, we have a huge table in our database, with well over 15 million rows.
We are executing an INSERT INTO ... ON DUPLICATE KEY UPDATE query on this table, which takes more than 20 minutes to complete the insert/update.
Example query -
INSERT INTO table1 (date_time, block_start, block_end, tx_id, tz_id, z_id, interval_span,
interval_id, updated, req, imp, cli)
VALUES ('2018-02-02 15:55:00', '2018-02-02 15:55:00', '2018-02-02 15:59:59', '51530',
'51530', '8005', '5', '1631', '2018-02-02 15:58:50', '1', '0', '0')
ON DUPLICATE KEY
UPDATE req = req + 1, imp = imp + 0, cli = cli + 0
Table structure is as below -
CREATE TABLE `table1` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`date_time` datetime NOT NULL,
`interval_span` int(10) unsigned NOT NULL,
`interval_id` int(10) unsigned NOT NULL,
`block_start` datetime NOT NULL,
`block_end` datetime NOT NULL,
`tx_id` int(10) unsigned NOT NULL,
`tz_id` int(10) unsigned NOT NULL,
`z_id` int(10) unsigned NOT NULL,
`req` int(10) unsigned NOT NULL DEFAULT '0',
`imp` int(10) unsigned NOT NULL DEFAULT '0',
`cli` int(10) unsigned NOT NULL DEFAULT '0',
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `iaz_table1` (`block_start`,`tx_id`,`z_id`),
KEY `tx_id` (`tx_id`,`date_time`),
KEY `z_id` (`z_id`,`date_time`),
KEY `date_time` (`date_time`),
KEY `block_start` (`block_start`)
) ENGINE=InnoDB AUTO_INCREMENT=257679784 DEFAULT CHARSET=utf8
How can I improve the speed of this insert? I need to achieve execution time of less than 5 seconds.

Sounds like you do not have a PRIMARY KEY or UNIQUE KEY included in the list of columns: (date_time, block_start, block_end, tx_id, tz_id, z_id, interval_span, interval_id, updated, req, imp, cli).

A table definition would be helpful. It looks like all fields are strings, but it seems like many of them could be integers. Integer comparisons are much faster than varchar (integers take up way less space). See this post:
SQL SELECT speed int vs varchar
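For reference, the table in the question does declare a unique key, iaz_table1 on (block_start, tx_id, z_id), and that key is what decides whether the statement inserts or updates. The sketch below is only an illustration of that behaviour, not a tuning recipe. It assumes an in-memory SQLite database as a stand-in for MySQL, trims the table to a few columns, and uses SQLite's ON CONFLICT ... DO UPDATE (SQLite 3.24 or newer) in place of ON DUPLICATE KEY UPDATE; the helper name record_request is made up for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE table1 (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        block_start TEXT    NOT NULL,
        tx_id       INTEGER NOT NULL,
        z_id        INTEGER NOT NULL,
        req         INTEGER NOT NULL DEFAULT 0,
        UNIQUE (block_start, tx_id, z_id)   -- mirrors iaz_table1
    )
""")

# SQLite spelling of MySQL's INSERT ... ON DUPLICATE KEY UPDATE req = req + 1.
UPSERT = """
    INSERT INTO table1 (block_start, tx_id, z_id, req)
    VALUES (?, ?, ?, 1)
    ON CONFLICT (block_start, tx_id, z_id)
    DO UPDATE SET req = req + 1
"""

def record_request(block_start, tx_id, z_id):
    """Hypothetical helper: insert a new block row or bump its counter."""
    with conn:
        conn.execute(UPSERT, (block_start, tx_id, z_id))

# Three hits in the same block collapse into one row; a new block adds a row.
for _ in range(3):
    record_request("2018-02-02 15:55:00", 51530, 8005)
record_request("2018-02-02 16:00:00", 51530, 8005)

rows = conn.execute(
    "SELECT block_start, req FROM table1 ORDER BY block_start"
).fetchall()
print(rows)
```

Because the lookup goes through the unique key, a single-row upsert like this should normally touch only a handful of index pages, so a 20-minute run time suggests the cost lies elsewhere (batch size, locking, or maintenance of the other secondary indexes).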

Related

Keeping last seen values in a summary table

I have two tables:
parameters keeps all the para_ids and their names and is always updated to have all parameters in it.
CREATE TABLE `parameters` (
`para_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`para_id`),
UNIQUE KEY `idx_parameters_name` (`name`)
) ENGINE=InnoDB;
processing is holding a chunk of data every 5 minutes.
CREATE TABLE `processing` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`t_ns` bigint(20) unsigned NOT NULL DEFAULT '0',
`para_id` int(10) unsigned NOT NULL DEFAULT '0',
`value` varchar(1024) NOT NULL DEFAULT '',
`isanchor` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `data` (`para_id`,`t_ns`)
) ENGINE=InnoDB;
I want to keep a table actual_values with the last seen values that each parameter (if it occurred in processing) had. The para_ids are updated with an INSERT IGNORE before the update. Currently I have those queries:
INSERT IGNORE INTO actual_values (para_id) (SELECT DISTINCT para_id FROM parameters);
UPDATE actual_values a
JOIN processing p ON a.para_id = p.para_id
SET a.value = (SELECT p.value FROM processing p WHERE a.para_id = p.para_id ORDER BY t_ns DESC LIMIT 1);
I feel like this is not the optimal way to go; it takes quite a long time. Do you guys have better suggestions?

How to optimise this slow MySQL query - late row lookups?

I'm converting a site over to use XenForo as forum software, however this site has millions of thread rows in the MySQL table. If I try to browse a paginated listing of threads, it slows to a crawl the further I go. Once I'm at page 10,000 it takes almost 30s.
My aim is to improve the query below, perhaps by using late row lookups so that I can make this query run faster:
SELECT thread.*,
user.*, IF(user.username IS NULL, thread.username, user.username) AS username,
NULL AS thread_read_date,
0 AS thread_is_watched,
0 AS user_post_count
FROM xf_thread AS thread
LEFT JOIN xf_user AS user ON
(user.user_id = thread.user_id)
WHERE (thread.node_id = 152) AND (thread.sticky = 0) AND (thread.discussion_state IN ('visible'))
ORDER BY thread.last_post_date DESC
LIMIT 20 OFFSET 238340
Run Time: 4.383607
Select Type | Table | Type | Possible Keys | Key | Key Len | Ref | Rows | Extra
SIMPLE | thread | ref | node_id_last_post_date,node_id_sticky_state_last_post | node_id_last_post_date | 4 | const | 552480 | Using where
SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 4 | sitename.thread.user_id | 1 |
Schema:
CREATE TABLE `xf_thread` (
`thread_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`node_id` INT(10) UNSIGNED NOT NULL,
`title` VARCHAR(150) NOT NULL,
`reply_count` INT(10) UNSIGNED NOT NULL DEFAULT '0',
`view_count` INT(10) UNSIGNED NOT NULL DEFAULT '0',
`user_id` INT(10) UNSIGNED NOT NULL,
`username` VARCHAR(50) NOT NULL,
`post_date` INT(10) UNSIGNED NOT NULL,
`sticky` TINYINT(3) UNSIGNED NOT NULL DEFAULT '0',
`discussion_state` ENUM('visible','moderated','deleted') NOT NULL DEFAULT 'visible',
`discussion_open` TINYINT(3) UNSIGNED NOT NULL DEFAULT '1',
`discussion_type` VARCHAR(25) NOT NULL DEFAULT '',
`first_post_id` INT(10) UNSIGNED NOT NULL,
`first_post_likes` INT(10) UNSIGNED NOT NULL DEFAULT '0',
`last_post_date` INT(10) UNSIGNED NOT NULL,
`last_post_id` INT(10) UNSIGNED NOT NULL,
`last_post_user_id` INT(10) UNSIGNED NOT NULL,
`last_post_username` VARCHAR(50) NOT NULL,
`prefix_id` INT(10) UNSIGNED NOT NULL DEFAULT '0',
`sonnb_xengallery_import` TINYINT(3) DEFAULT '0',
PRIMARY KEY (`thread_id`),
KEY `node_id_last_post_date` (`node_id`,`last_post_date`),
KEY `node_id_sticky_state_last_post` (`node_id`,`sticky`,`discussion_state`,`last_post_date`),
KEY `last_post_date` (`last_post_date`),
KEY `post_date` (`post_date`),
KEY `user_id` (`user_id`)
) ENGINE=INNODB AUTO_INCREMENT=2977 DEFAULT CHARSET=utf8
Can anyone help me improve the speed of this query? I'm a real MySQL novice, but I am running the same dataset on other forum software and it is much faster - so I'm sure there is a way somehow. This table is INNODB and I'd consider the server well optimised.
This might help: http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
The concept is to query just the indexed columns with the paging and ordering you need, then join that short list back to the table to fetch the other columns you want.
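A minimal sketch of that late row lookup, under some assumptions: it runs against an in-memory SQLite database rather than MySQL, uses a cut-down xf_thread with made-up rows, leaves out the join to xf_user, and the function name fetch_page is invented for the example. The inner query pages through thread_id values only; the outer join then fetches full rows for just the 20 ids on the page.

```python
import sqlite3

# Assumption: SQLite in memory as a stand-in for MySQL, with a trimmed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE xf_thread (
        thread_id        INTEGER PRIMARY KEY,
        node_id          INTEGER NOT NULL,
        title            TEXT    NOT NULL,
        sticky           INTEGER NOT NULL DEFAULT 0,
        discussion_state TEXT    NOT NULL DEFAULT 'visible',
        last_post_date   INTEGER NOT NULL
    );
    CREATE INDEX node_id_sticky_state_last_post
        ON xf_thread (node_id, sticky, discussion_state, last_post_date);
""")

# Made-up sample data: 1000 threads in node 152, one post date each.
conn.executemany(
    "INSERT INTO xf_thread (node_id, title, last_post_date) VALUES (152, ?, ?)",
    [(f"thread {i}", i) for i in range(1, 1001)],
)

# Inner query: walk the index and keep only the ids for the requested page.
# Outer query: look up the wide rows for those ids and restore the ordering.
LATE_LOOKUP = """
    SELECT thread.*
    FROM (
        SELECT thread_id
        FROM xf_thread
        WHERE node_id = 152 AND sticky = 0 AND discussion_state = 'visible'
        ORDER BY last_post_date DESC
        LIMIT ? OFFSET ?
    ) AS page
    JOIN xf_thread AS thread ON thread.thread_id = page.thread_id
    ORDER BY thread.last_post_date DESC
"""

def fetch_page(limit, offset):
    """Hypothetical helper returning one page of full thread rows."""
    return conn.execute(LATE_LOOKUP, (limit, offset)).fetchall()

page = fetch_page(20, 500)
titles = [row[2] for row in page]   # column 2 is title
print(titles[0], titles[-1])
```

The offset still has to be skipped inside the index, so very deep pages are not free, but the skipped entries are narrow index records instead of full rows.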
Your user table is already indexed by user ID... good.
For your thread table, the compound index to use is
( node_id, sticky, discussion_state, last_post_date )
Your schema already defines it as node_id_sticky_state_last_post, but the EXPLAIN output shows the optimizer chose node_id_last_post_date instead. With the four-column index, every column in the WHERE clause is covered, and since it ends with last_post_date it can also serve the ORDER BY. An ORDER BY that cannot use an index forces a sort of every matching row, which is a common cause of slow queries.

proper index (or removal) to optimize a large data set table

We have a 'visitor' tracking schema going on - that when pushed, seems to be causing some strain on the DB server.
VISITORS table identifies unique users by a HASH (current records 310,000). A search is performed on the hash, and if not found, it is added. The ID is needed for the following two tables
CREATE TABLE visitors (
id int(10) UNSIGNED NOT NULL auto_increment,
ip varchar(25) NOT NULL,
hash varchar(64) NOT NULL,
first_visit varchar(32) NOT NULL,
created_at datetime NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE visitors ADD UNIQUE INDEX (hash);
ALTER TABLE visitors ADD INDEX (created_at);
VISITOR_VISITS table identifies when a user visited only when we can identify some referral sources (current count 142,000). A search is performed looking for the visitor_id, type and visit_date. If there is nothing found - it is added. The ID is used in the following table.
CREATE TABLE visitor_visits (
id int(10) UNSIGNED NOT NULL auto_increment,
visitor_id int(10) UNSIGNED NOT NULL,
source varchar(64) NULL DEFAULT NULL,
medium varchar(64) NULL DEFAULT NULL,
campaign varchar(256) NULL DEFAULT NULL,
page varchar(32) NULL DEFAULT NULL,
landing varchar(32) NULL DEFAULT NULL,
type enum('fundraiser_view') NULL DEFAULT NULL,
visit_date date NOT NULL default '0000-00-00',
created_at datetime NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE visitor_visits ADD UNIQUE INDEX (visitor_id,type,visit_date);
ALTER TABLE visitor_visits ADD CONSTRAINT FK_visits_visitor_id FOREIGN KEY (visitor_id) REFERENCES visitors(id);
PAGE_VIEWS logs individual page views (not all pages, just pages we are tracking). It can be linked to a visitor and can reference a visitor_visit (current count 2.4 million; it is higher because we started micro-visitor logging after we started logging individual pages). An insert/on duplicate query is used to add the record to this based on the view_date for the identified user. Since the ID is not needed, a pure lookup query isn't required.
CREATE TABLE page_views (
id int(10) UNSIGNED NOT NULL auto_increment,
page_id int(10) UNSIGNED NOT NULL,
current_donations decimal(10,2) NOT NULL DEFAULT 0,
ip varchar(25) NOT NULL,
hash varchar(32) NOT NULL,
visitor_id int(10) UNSIGNED NULL DEFAULT NULL,
visitor_visit_id int(10) UNSIGNED NULL DEFAULT NULL,
page_views int(10) UNSIGNED NOT NULL DEFAULT 0,
widget_views int(10) UNSIGNED NOT NULL DEFAULT 0,
view_date date NOT NULL,
viewed_at datetime NOT NULL default '0000-00-00 00:00:00',
created_at datetime NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE page_views ADD UNIQUE INDEX (page_id,view_date,visitor_id,hash);
ALTER TABLE page_views ADD INDEX (visitor_id);
ALTER TABLE page_views ADD INDEX (visitor_visit_id);
ALTER TABLE page_views ADD CONSTRAINT FK_page_views_page_id FOREIGN KEY (page_id) REFERENCES pages(id);
ALTER TABLE page_views ADD CONSTRAINT FK_page_views_visitor_id FOREIGN KEY (visitor_id) REFERENCES visitors(id);
ALTER TABLE page_views ADD CONSTRAINT FK_page_views_visit_id FOREIGN KEY (visitor_visit_id) REFERENCES visitor_visits(id);
Last week, our site got an inflow of people due to a news article, and this visitor identification really bottlenecked performance. I am wondering if there is an obvious optimization in there. Could it be the foreign key constraints? Over-indexing? A need for better indexing?
Try this:
1) An index on a varchar column doesn't improve performance much.
2) Try partitioning the table on a date range.
You didn't tell us what is bottlenecking your database, so I'm just guessing that it's InnoDB concurrent writes. If that isn't so and the problem is only with SELECTs (which I doubt), you should show us the exact queries. You could try to reduce the write performance hit by creating a staging table and then bulk-moving rows from it into the main table:
CREATE TABLE page_views_tmp (
id int(10) UNSIGNED NOT NULL auto_increment,
page_id int(10) UNSIGNED NOT NULL,
current_donations decimal(10,2) NOT NULL DEFAULT 0,
ip varchar(25) NOT NULL,
hash varchar(32) NOT NULL,
visitor_id int(10) UNSIGNED NULL DEFAULT NULL,
visitor_visit_id int(10) UNSIGNED NULL DEFAULT NULL,
page_views int(10) UNSIGNED NOT NULL DEFAULT 0,
widget_views int(10) UNSIGNED NOT NULL DEFAULT 0,
view_date date NOT NULL,
viewed_at datetime NOT NULL default '0000-00-00 00:00:00',
created_at datetime NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (id)
) ENGINE=MEMORY DEFAULT CHARSET=utf8;
And then, once every couple of seconds, or once this table has a considerable number of rows in it:
START TRANSACTION;
INSERT INTO page_views SELECT * FROM page_views_tmp;
DELETE FROM page_views_tmp;
COMMIT;
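The staging pattern can be sketched end to end as follows. This is an illustration under assumptions, not the answerer's exact setup: it uses in-memory SQLite instead of MySQL (so there is no MEMORY engine), a trimmed column list, and the made-up helpers log_view and flush_staging. One deliberate difference from the SELECT * form: the copy lists its columns and leaves out id, because both tables number their rows independently and copying staging ids into the main table would collide with ids already there.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL; columns are trimmed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        page_id   INTEGER NOT NULL,
        hash      TEXT    NOT NULL,
        view_date TEXT    NOT NULL
    );
    CREATE TABLE page_views_tmp (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        page_id   INTEGER NOT NULL,
        hash      TEXT    NOT NULL,
        view_date TEXT    NOT NULL
    );
    -- A row that already lives in the main table (its id is 1).
    INSERT INTO page_views (page_id, hash, view_date)
    VALUES (1, 'aaa', '2018-01-01');
""")

def log_view(page_id, visitor_hash, view_date):
    """Hypothetical hot path: cheap append to the staging table only."""
    with conn:
        conn.execute(
            "INSERT INTO page_views_tmp (page_id, hash, view_date) VALUES (?, ?, ?)",
            (page_id, visitor_hash, view_date),
        )

def flush_staging():
    """Move all staged rows into the main table in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        pending = conn.execute("SELECT COUNT(*) FROM page_views_tmp").fetchone()[0]
        conn.execute(
            "INSERT INTO page_views (page_id, hash, view_date) "
            "SELECT page_id, hash, view_date FROM page_views_tmp"
        )
        conn.execute("DELETE FROM page_views_tmp")
    return pending

for visitor_hash in ("bbb", "ccc", "ddd"):
    log_view(2, visitor_hash, "2018-01-02")

moved = flush_staging()
total = conn.execute("SELECT COUNT(*) FROM page_views").fetchone()[0]
left = conn.execute("SELECT COUNT(*) FROM page_views_tmp").fetchone()[0]
print(moved, total, left)
```

With concurrent writers on a real server, rows staged between the copy and the DELETE need protecting too, for example by locking the staging table or deleting only up to the highest id that was copied.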

mysql join not use index for 'between' operator

So basically I have three tables:
CREATE TABLE `cdIPAddressToLocation` (
`IPADDR_FROM` int(10) unsigned NOT NULL COMMENT 'Low end of the IP Address block',
`IPADDR_TO` int(10) unsigned NOT NULL COMMENT 'High end of the IP Address block',
`IPLOCID` int(10) unsigned NOT NULL COMMENT 'The Location ID for the IP Address range',
PRIMARY KEY (`IPADDR_TO`),
KEY `Index_2` USING BTREE (`IPLOCID`),
KEY `Index_3` USING BTREE (`IPADDR_FROM`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `cdIPLocation` (
`IPLOCID` int(10) unsigned NOT NULL default '0',
`Country` varchar(4) default NULL,
`Region` int(10) unsigned default NULL,
`City` varchar(90) default NULL,
`PostalCode` varchar(10) default NULL,
`Latitude` float NOT NULL,
`Longitude` float NOT NULL,
`MetroCode` varchar(4) default NULL,
`AreaCode` varchar(4) default NULL,
`State` varchar(45) default NULL,
`Continent` varchar(10) default NULL,
PRIMARY KEY (`IPLOCID`)
) ENGINE=MyISAM AUTO_INCREMENT=218611 DEFAULT CHARSET=latin1;
and
CREATE TABLE `data` (
`IP` varchar(50),
`SCORE` int
);
My task is to join these three tables and find the location data for given IP address.
My query is as follows:
select
t.ip,
l.Country,
l.State,
l.City,
l.PostalCode,
l.Latitude,
l.Longitude,
t.score
from
(select
ip, inet_aton(ip) ipv, score
from
data
order by score desc
limit 5) t
join
cdIPAddressToLocation a ON t.ipv between a.IPADDR_FROM and a.IPADDR_TO
join
cdIPLocation l ON l.IPLOCID = a.IPLOCID
While this query works, it's very very slow, it took about 100 seconds to return the result on my dev box.
I'm using mysql 5.1, the cdIPAddressToLocation has 5.9 million rows and cdIPLocation table has about 0.3 million rows.
When I check the execution plan, I found it's not using any index in the table 'cdIPAddressToLocation', so for each row in the 'data' table it would do a full table scan against table 'cdIPAddressToLocation'.
This seems very weird to me. There are already two indexes on table 'cdIPAddressToLocation', on columns 'IPADDR_FROM' and 'IPADDR_TO', so the execution plan should be able to exploit them to improve performance. Why didn't it use them?
Or was there something wrong with my query?
Please help, thanks a lot.
Have you tried using a composite index on the columns cdIPAddressToLocation.IPADDR_FROM and cdIPAddressToLocation.IPADDR_TO?
See "Multiple-Column Indexes" in the MySQL manual.
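A different technique that is often used for this kind of range table, shown here only as a hedged sketch and not as what the answer above proposes: if the IP blocks never overlap (an assumption about the data), the matching block is the first row whose IPADDR_TO is at or above the address, so one ordered index seek with LIMIT 1 plus a check on IPADDR_FROM replaces the BETWEEN scan. The example uses in-memory SQLite instead of MySQL, Python's ipaddress module in place of INET_ATON, invented sample ranges, and a made-up helper named locate.

```python
import ipaddress
import sqlite3

# Assumption: SQLite in memory stands in for MySQL 5.1 / MyISAM.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cdIPAddressToLocation (
        IPADDR_FROM INTEGER NOT NULL,
        IPADDR_TO   INTEGER NOT NULL PRIMARY KEY,
        IPLOCID     INTEGER NOT NULL
    )
""")

def to_int(ip):
    """Same value INET_ATON would return for a dotted IPv4 address."""
    return int(ipaddress.IPv4Address(ip))

# Invented, non-overlapping sample blocks.
conn.executemany(
    "INSERT INTO cdIPAddressToLocation VALUES (?, ?, ?)",
    [
        (to_int("10.0.0.0"), to_int("10.0.0.255"), 1),
        (to_int("10.0.1.0"), to_int("10.0.1.255"), 2),
        (to_int("192.168.0.0"), to_int("192.168.0.255"), 3),
    ],
)

# One seek on the primary key: the first block ending at or after the address.
LOOKUP = """
    SELECT IPADDR_FROM, IPLOCID
    FROM cdIPAddressToLocation
    WHERE IPADDR_TO >= ?
    ORDER BY IPADDR_TO
    LIMIT 1
"""

def locate(ip):
    """Hypothetical helper: IPLOCID for the address, or None if uncovered."""
    value = to_int(ip)
    row = conn.execute(LOOKUP, (value,)).fetchone()
    if row is None or row[0] > value:
        return None   # no block at all, or the address falls in a gap
    return row[1]

print(locate("10.0.1.7"), locate("11.0.0.1"))
```

For the five addresses in the question this would mean five single-row lookups, each joined to cdIPLocation by IPLOCID afterwards.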

SQL: Refactoring a multi-join query

I have a query that should be quite simple and yet it causes me a lot of headaches.
I have a simple ads system that requires filtering ads according to a few variables.
I need to limit the number of views/clicks per day and the total number of views/clicks for a given ad. Also each ad is linked to one or more slots in which the ad can appear. I have a table that saves the statistics that I need about each ad. Note that the statistics table changes very frequently.
These are the tables that I'm using:
CREATE TABLE `t_ads` (
`id` int(10) unsigned NOT NULL auto_increment,
`name` varchar(255) NOT NULL,
`content` text NOT NULL,
`is_active` tinyint(1) unsigned NOT NULL,
`start_date` date NOT NULL,
`end_date` date NOT NULL,
`max_views` int(10) unsigned NOT NULL,
`type` tinyint(3) unsigned NOT NULL default '0',
`refresh` smallint(5) unsigned NOT NULL default '0',
`max_clicks` int(10) unsigned NOT NULL,
`max_daily_clicks` int(10) unsigned NOT NULL,
`max_daily_views` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `t_ad_slots` (
`id` int(10) unsigned NOT NULL auto_increment ,
`name` varchar(255) NOT NULL,
`width` int(10) unsigned NOT NULL,
`height` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `t_ads_to_slots` (
`ad_id` int(10) unsigned NOT NULL,
`slot_id` int(10) unsigned NOT NULL,
`value` int(10) unsigned NOT NULL,
PRIMARY KEY (`ad_id`,`slot_id`),
KEY `slot_id` (`slot_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `t_ads_to_slots`
ADD CONSTRAINT `t_ads_to_slots_ibfk_1` FOREIGN KEY (`ad_id`) REFERENCES `t_ads` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
ADD CONSTRAINT `t_ads_to_slots_ibfk_2` FOREIGN KEY (`slot_id`) REFERENCES `t_ad_slots` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION;
CREATE TABLE `t_ad_stats` (
`ad_id` int(10) unsigned NOT NULL,
`slot_id` int(10) unsigned NOT NULL,
`date` date NOT NULL,
`views` int(10) unsigned NOT NULL,
`unique_views` int(10) unsigned NOT NULL,
`clicks` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`ad_id`,`slot_id`,`date`),
KEY `slot_id` (`slot_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `t_ad_stats`
ADD CONSTRAINT `t_ad_stats_ibfk_1` FOREIGN KEY (`ad_id`) REFERENCES `t_ads` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
ADD CONSTRAINT `t_ad_stats_ibfk_2` FOREIGN KEY (`slot_id`) REFERENCES `t_ad_slots` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION;
This is the query that I use to get ads for a given slot (Note that in this example I hard coded 20 as the slot id and 0,1,2 as the ad type, I get this data from a php script which invokes this query)
SELECT `ads`.`content`, `slots`.`value`, `ads`.`id`, `ads`.`refresh`, `ads`.`type`,
SUM(`total_stats`.`views`) AS "total_views",
SUM(`total_stats`.`clicks`) AS "total_clicks"
FROM (`t_ads` AS `ads`,
`t_ads_to_slots` AS `slots`)
LEFT JOIN `t_ad_stats` AS `total_stats`
ON `total_stats`.`ad_id` = `ads`.`id`
LEFT JOIN `t_ad_stats` AS `daily_stats`
ON (`daily_stats`.`ad_id` = `ads`.`id`) AND
(`daily_stats`.`date` = CURDATE())
WHERE (`ads`.`id` = `slots`.`ad_id`) AND
(`ads`.`type` IN(0,1,2)) AND
(`slots`.`slot_id` = 20) AND
(`ads`.`is_active` = 1) AND
(`ads`.`end_date` >= NOW()) AND
(`ads`.`start_date` <= NOW()) AND
((`ads`.`max_views` = 0) OR
(`ads`.`max_views` > "total_views")) AND
((`ads`.`max_clicks` = 0) OR
(`ads`.`max_clicks` > "total_clicks")) AND
((`ads`.`max_daily_clicks` = 0) OR
(`ads`.`max_daily_clicks` > IFNULL(`daily_stats`.`clicks`,0))) AND
((`ads`.`max_daily_views` = 0) OR
(`ads`.`max_daily_views` > IFNULL(`daily_stats`.`views`,0)))
GROUP BY (`ads`.`id`)
I believe that this query is self-explanatory, even though it's quite long. Note that the MySQL version that I'm using is: 5.0.51a-community. It seems to me that the big issue here is the double join to the stats table (I did that so that I can get the data from a specific record as well as the sum over multiple records).
How would you implement this query in order to get better results? (Note that I can't change from InnoDB).
Hopefully everything is clear about my question, but if that is not the case, please ask and I will clarify.
Thanks in advance,
Kfir
Add indexes to following columns:
t_ads.is_active
t_ads.start_date
t_ads.end_date
Change the order of the primary key on t_ad_stats to:
(`ad_id`,`date`,`slot_id`)
or add a covering index to t_ad_stats
(`ad_id`,`date`)
Change from 0 meaning "no limit" to 2147483647 meaning no limit, so you can change things like:
((`ads`.`max_views` = 0) OR (`ads`.`max_views` > "total_views"))
to
(`ads`.`max_views` > "total_views")
You could greatly improve this if you kept running totals instead of having to calculate them each time.
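One way the running-totals idea might look, as a sketch under assumptions: the table t_ad_totals and the helpers record_view and eligible_ads are hypothetical and not part of the original schema, the database is in-memory SQLite rather than MySQL, and only the max_views check is shown. It also follows the suggestion above of storing 2147483647 for "no limit" so the filter is a single comparison.

```python
import sqlite3

# Assumption: SQLite in memory stands in for MySQL; t_ads is trimmed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_ads (
        id        INTEGER PRIMARY KEY,
        max_views INTEGER NOT NULL
    );
    -- Hypothetical table: one running-total row per ad.
    CREATE TABLE t_ad_totals (
        ad_id  INTEGER PRIMARY KEY,
        views  INTEGER NOT NULL DEFAULT 0,
        clicks INTEGER NOT NULL DEFAULT 0
    );
    -- Ad 1 is capped at 3 views; ad 2 uses 2147483647 as "no limit".
    INSERT INTO t_ads (id, max_views) VALUES (1, 3), (2, 2147483647);
""")

def record_view(ad_id):
    """Hypothetical helper: bump the running total whenever an ad is shown."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO t_ad_totals (ad_id) VALUES (?)", (ad_id,))
        conn.execute("UPDATE t_ad_totals SET views = views + 1 WHERE ad_id = ?", (ad_id,))

# No SUM and no GROUP BY: the total is read straight from one row per ad.
ELIGIBLE = """
    SELECT ads.id
    FROM t_ads AS ads
    LEFT JOIN t_ad_totals AS totals ON totals.ad_id = ads.id
    WHERE ads.max_views > IFNULL(totals.views, 0)
    ORDER BY ads.id
"""

def eligible_ads():
    return [row[0] for row in conn.execute(ELIGIBLE)]

before = eligible_ads()
for _ in range(3):
    record_view(1)
record_view(2)
after = eligible_ads()
print(before, after)
```

The comparison here is against a real column. In the original query, "total_views" in double quotes is read by MySQL as a string literal unless ANSI_QUOTES is enabled, and a SELECT alias for an aggregate cannot be filtered in WHERE in any case, so those limit checks are not comparing against the sums.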
Expanding on a comment above I believe that the following columns should be indexed:
ads.id
ads.type
ads.start_date
ads.end_date
daily_stats.date
As well as these:
slots.slot_id
ads.is_active
And these as well:
ads.max_views
ads.max_clicks
ads.max_daily_clicks
ads.max_daily_views
daily_stats.clicks
daily_stats.views
Do note that applying indexes on these columns will speed up your SELECTs but slow down your INSERTs, since the indexes will need updating as well. But you don't have to apply all of this at once. You can do it incrementally and see how the performance shakes out for selects as well as inserts. If you cannot find a good middle ground, then I would suggest denormalization.