MySQL INSERT INTO ... SELECT ... GROUP BY is too slow

I have a table with about 50M rows and the following format:
CREATE TABLE `big_table` (
`id` BIGINT NOT NULL,
`t1` DATETIME NOT NULL,
`a` BIGINT NOT NULL,
`type` VARCHAR(10) NOT NULL,
`b` BIGINT NOT NULL,
`is_c` BOOLEAN NOT NULL,
PRIMARY KEY (`id`),
INDEX `a_b_index` (a,b)
) ENGINE=InnoDB;
I then define the table t2, with no indices:
Create table `t2` (
`id` BIGINT NOT NULL,
`a` BIGINT NOT NULL,
`b` BIGINT NOT NULL,
`t1min` DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I then populate t2 using a query from big_table (this will add about 12M rows).
insert into t2
(id, a,b,t1min)
SELECT id,a,b,min(t1)
FROM big_table use index (a_b_index)
where type='SUBMIT' and is_c=1
GROUP BY a,b;
I find that it takes this query about a minute to process 5000 distinct (a,b) in big_table.
Since there are 12M distinct (a,b) in big_table, it would take about 40 hours to run the query on all of big_table.
What is going wrong?
If I just run the SELECT, the query processes 5000 rows in about 2s. If I run SELECT ... INTO OUTFILE ..., the query still takes 60s for 5000 rows.
EXPLAIN SELECT ... gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,big_table,index,NULL,a_b_index,16,NULL,46214255,"Using where"

I found that the problem was that the GROUP BY resulted in too many random-access reads of big_table. The following strategy allows one sequential pass through big_table. First, we add a key to t2:
Create table `t2` (
`id` BIGINT NOT NULL,
`a` BIGINT NOT NULL,
`b` BIGINT NOT NULL,
`t1min` DATETIME NOT NULL,
PRIMARY KEY (a,b),
INDEX `id` (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Then we fill t2 using:
insert into t2
(id, a, b, t1min)
SELECT id, a, b, t1
FROM big_table
where type='SUBMIT' and is_c=1
ON DUPLICATE KEY UPDATE
-- assign id before t1min: the assignments are evaluated left to right,
-- and the comparison must see the old t1min, not the freshly updated one
id=if(t1<t1min,big_table.id,t2.id),
t1min=if(t1<t1min,t1,t1min);
The resulting speed-up is several orders of magnitude.

The GROUP BY might be part of the issue. You are using an index on (a,b), but your WHERE clause cannot use it. I would add an index on
(type, is_c, a, b)
Also, you are selecting id without specifying which row of each group it should come from; you probably want MIN(id) for a deterministic result.
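A sketch of that suggestion (the index name here is arbitrary):
ALTER TABLE big_table ADD INDEX type_c_a_b (type, is_c, a, b);
SELECT MIN(id), a, b, MIN(t1)
FROM big_table
WHERE type='SUBMIT' AND is_c=1
GROUP BY a, b;
The two leading columns satisfy the WHERE clause, and the (a,b) suffix lets the GROUP BY read the groups in index order instead of sorting.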

Use Indexes For Join on Indexed DATETIME and Indexed DATE columns

EDIT
I misread my initial error and blamed the index not being used on the wrong columns.
I was able to recreate the issue I saw, and the solution ysth suggested (casting the mediumtext USER column to a numeric type) worked.
Below are the CREATE TABLE statements, the inserts, and two queries: one that reproduces the error and one with the fix.
# Make tables and indices
DROP TABLE a;
DROP TABLE b;
create table a
(
DT DATE,
USER INT,
COMMENT_SENTIMENT INT,
PRIMARY KEY (USER, DT));
CREATE INDEX a_DT_USER_IDX ON a (DT,USER);
create table b
(
id int auto_increment primary key,
DT DATETIME(6),
USER mediumtext,
COMMENT_SENTIMENT INT);
CREATE INDEX b_DT_USER_IDX ON b (DT);
CREATE UNIQUE INDEX b_DT_USER ON b (USER(16), DT);
# Insert some dummy data
INSERT INTO a VALUES('2023-01-01', 5, 4);
INSERT INTO b VALUES(NULL, '2023-01-01 00:00:00', 5, 4);
# Explain that shows the issue I was seeing.
EXPLAIN
SELECT *
FROM a
JOIN b
ON a.DT = b.DT
AND a.USER = b.USER;
# Out
# 1,SIMPLE,a,,ALL,"PRIMARY,a_DT_USER_IDX",,,,1,100,
# 1,SIMPLE,b,,ref,"b_DT_USER,b_DT_USER_IDX",b_DT_USER_IDX,9,a.DT,1,100,Using index condition; Using where
[2023-01-24 18:00:14] [HY000][1739] Cannot use ref access on index 'b_DT_USER' due to type or collation conversion on field 'USER'
[2023-01-24 18:00:14] [HY000][1003] /* select#1 */ select `a`.`DT` AS `DT`, `a`.`USER` AS `USER`,`a`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT`,`b`.`id` AS `id`,`b`.`DT` AS `DT`,`b`.`USER` AS `USER`,`b`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT` from `a` join `b` where ((`a`.`DT` = `b`.`DT`) and (`a`.`USER` = `b`.`USER`))
# Explain with the fix ysth suggested
EXPLAIN
SELECT *
FROM a
JOIN b
ON a.DT = b.DT
AND a.USER = CAST(b.USER AS DECIMAL);
# 1,SIMPLE,a,,ALL,"PRIMARY,a_DT_USER_IDX",,,,1,100,
# 1,SIMPLE,b,,ref,b_DT_USER_IDX,b_DT_USER_IDX,9,a.DT,1,100,Using index condition; Using where
# [2023-01-24 18:04:24] [HY000][1003] /* select#1 */ select `a`.`DT` AS `DT`,`a`.`USER` AS `USER`,`a`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT`,`b`.`id` AS `id`,`b`.`DT` AS `DT`,`b`.`USER` AS `USER`,`b`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT` from `a` join `b` where ((`a`.`DT` = `b`.`DT`) and (`a`.`USER` = cast(`b`.`USER` as decimal(10,0))))
# [2023-01-24 18:04:24] 2 rows retrieved starting from 1 in 359 ms (execution: 250 ms, fetching: 109 ms)
__
The information below is incorrect. Please see the edit above for the issue I was having and its solution.
I have three tables a, b, and c in my MySQL 5.7 database. SHOW CREATE statements for each table are:
CREATE TABLE `a` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` date DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `a_DT_USER_IDX` (`DT`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `b` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` datetime DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `b_DT_USER_IDX` (`DT`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `c` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` date DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `b_DT_USER_IDX` (`DT`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Table a has a DATE column a.DT, table b has a DATETIME column b.DT, and table c has a DATE column c.DT.
All of these DT columns are indexed.
As a caveat, while b.DT is a DATETIME, all of the 'time' portions in it are 00:00:00 and they always will be. It probably should be a DATE, but I cannot change it.
I want to join table a and table b on their DT columns, but explain tells me that their indices are not used:
Cannot use ref access on index 'b.DT_datetime_index' due to type or collation conversion on field 'DT'
When I join table a and b on a.DT and b.DT
SELECT *
FROM a
JOIN b
ON a.DT = b.DT;
The result is much slower than when I do the same with a and c
SELECT *
FROM a
JOIN c
ON a.DT = c.DT;
Is there a way to use the indices in the first query's join on a.DT = b.DT, specifically without altering the tables? I'm not sure whether b.DT having only 00:00:00 time portions could be relevant to a solution.
The end goal is a faster select using this join.
Thank you!
-- What I've done section --
I compared the joins between a.DT = b.DT and a.DT = c.DT, and saw the time difference.
I also tried wrapping b's DT column with DATE(b.DT), but explain gave the same issue, which is pretty expected.
MySQL won't use an index to join DATE and DATETIME columns.
You can create a virtual column with the corresponding DATE and use that.
CREATE TABLE `b` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` datetime DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
`DT_DATE` DATE AS (DATE(DT)),
PRIMARY KEY (`id`),
KEY `b_DT_USER_IDX` (`DT_DATE`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
SELECT *
FROM a
JOIN b
ON a.DT = b.DT_DATE;
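If b already exists, the same change can be made in place; a sketch (the index name is arbitrary, and since the column is generated, the stored rows themselves are untouched):
ALTER TABLE `b`
ADD COLUMN `DT_DATE` DATE AS (DATE(DT)),
ADD INDEX `b_DT_DATE_USER_IDX` (`DT_DATE`, `USER`);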
Assuming you want to read a and join b rows, you can just do
SELECT *
FROM a
JOIN b
ON b.DT = timestamp(a.DT);
If the other way around, then
SELECT *
FROM b
JOIN a
ON a.DT = date(b.DT);
No need for a virtual column: in each case the function is applied to the column of the table being read first, so the index on the other table's DT column can still be used for the join.
Virtually any function call is "not sargable". That is, CAST(b.USER AS DECIMAL) prevents the use of an index on b.USER.
Do not mix strings and ints in comparisons. The string will be converted to numeric. If the string is a literal, such as '123', the optimizer is smart enough to do the conversion once. If it is a column, every row must be converted.
Tip: if you are likely to test for one user and a range of dates, then this works better than the opposite order:
INDEX(user, dt)
(You may need an index starting with dt for other queries.)
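For example, a query of this hypothetical shape can use INDEX(user, dt) for both the equality and the range, while an index beginning with dt can only seek on the date range:
SELECT COMMENT_SENTIMENT
FROM b
WHERE USER = '5'
AND DT >= '2023-01-01'
AND DT < '2023-02-01';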

How to properly index table

We used to index our tables based on the WHERE clause. That worked fine in our MSSQL days, but now we are using MySQL and things are different: subqueries have terrible performance. Consider this table:
# 250K records per day
create table t_101(
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`transaction_date` datetime not null,
`memo_1` nvarchar(255) not null,
`memo_2` nvarchar(255) not null,
`product_id` bigint not null,
#many more columns
PRIMARY KEY (`id`),
key `index.t_101.101`(`transaction_date`, `product_id`, `memo_1`),
key `index.t_101.102`(`transaction_date`, `product_id`, `memo_2`)
) ENGINE=MyISAM;
A temporary table where I store condition values:
# 150 records
create temporary table `temporary.user.accessibleProducts`
(
product_id bigint not null,
PRIMARY KEY (`product_id`)
) ENGINE=MyISAM;
And this is the original query:
select
COUNT(a.id) as rowCount_la1,
COUNT(DISTINCT a.product_id) as productCount
from t_101 a
where a.transaction_date = '2017-05-01'
and a.product_id in(select xa.product_id from `temporary.user.accessibleProducts` xa)
and a.memo_1 <> '';
It takes 7 seconds to execute, while this query:
select
COUNT(a.id) as rowCount_la1,
COUNT(DISTINCT a.product_id) as productCount
from t_101 a
inner join `temporary.user.accessibleProducts` b on b.product_id = a.product_id
where a.transaction_date = '2017-05-01'
and a.memo_1 <> '';
takes 0.063 seconds to execute. Even though 0.063 seconds is acceptable, I'm worried about the indexing. Given the above, how do I index t_101 properly?
We are using MySQL 5.5.42.
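One way to see why the subquery form is so much slower on 5.5: old MySQL versions rewrite an uncorrelated IN (SELECT ...) into a correlated EXISTS that is re-evaluated for every row of t_101; subquery materialization only arrived in 5.6. A sketch, reusing the original query:
EXPLAIN
select
COUNT(a.id) as rowCount_la1,
COUNT(DISTINCT a.product_id) as productCount
from t_101 a
where a.transaction_date = '2017-05-01'
and a.product_id in(select xa.product_id from `temporary.user.accessibleProducts` xa)
and a.memo_1 <> '';
The subquery line should show DEPENDENT SUBQUERY; the join form avoids this, and the existing compound indexes starting with (transaction_date, product_id, ...) already suit it.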

How to copy a very large table into another table in MYSQL?

I have a large table with 110M rows. I would like to copy some of the fields into a new table, and here is a rough idea of how I am trying to do it:
DECLARE l_seenChangesTo DATETIME DEFAULT '1970-01-01 01:01:01';
DECLARE l_migrationStartTime DATETIME;
SELECT NOW() into l_migrationStartTime;
-- See if we've run this migration before and if so, pick up from where we left off...
IF EXISTS(SELECT seenChangesTo FROM migration_status WHERE client_user = CONCAT('this-migration-script-', user())) THEN
SELECT seenChangesTo FROM migration_status WHERE client_user = CONCAT('this-migration-script-', user()) INTO l_seenChangesTo;
SELECT NOW() as LogTime, CONCAT('Picking up from where we left off: ', l_seenChangesTo) as MigrationStatus;
END IF;
INSERT IGNORE INTO newTable
(field1, field2, lastModified)
SELECT o.column1 AS field1,
o.column2 AS field2,
o.lastModified
FROM oldTable o
WHERE
o.lastModified >= l_seenChangesTo AND
o.lastModified <= l_migrationStartTime;
INSERT INTO migration_status (client_user,seenChangesTo)
VALUES (CONCAT('this-migration-script-', user()), l_migrationStartTime)
ON DUPLICATE KEY UPDATE seenChangesTo=l_migrationStartTime;
Context:
CREATE TABLE IF NOT EXISTS `newTable` (
`field1` varchar(255) NOT NULL,
`field2` tinyint unsigned NOT NULL,
`lastModified` datetime NOT NULL,
PRIMARY KEY (`field1`, `field2`),
KEY `ix_field1` (`field1`),
KEY `ix_lastModified` (`lastModified`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `oldTable` (
`column1` varchar(255) NOT NULL,
`column2` tinyint unsigned NOT NULL,
`lastModified` datetime NOT NULL,
PRIMARY KEY (`column1`, `column2`),
KEY `ix_column1` (`column1`),
KEY `ix_lastModified` (`lastModified`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `migration_status` (
`client_user` char(64) NOT NULL,
`seenChangesTo` char(128) NOT NULL,
PRIMARY KEY (`client_user`)
);
Note: I have a few more columns in oldTable. Both oldTable and newTable are in the same MySQL database schema.
What's the general strategy when copying a very large table? Should I perform this migration iteratively, copying say 50,000 rows at a time?
The insert speed of a migration done iteratively like this is going to be dreadfully slow. Why not SELECT ... INTO OUTFILE, then LOAD DATA INFILE?
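A sketch of that approach, assuming the server is allowed to write to /tmp (check the secure_file_priv setting; the path is only an example):
SELECT column1, column2, lastModified
FROM oldTable
INTO OUTFILE '/tmp/oldTable_dump.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
LOAD DATA INFILE '/tmp/oldTable_dump.csv'
IGNORE INTO TABLE newTable
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(field1, field2, lastModified);
The IGNORE keyword plays the same role as the INSERT IGNORE in the original script: rows that would duplicate the primary key are skipped.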

Super slow query with CROSS JOIN

I have two tables named table_1 (1GB) and reference (250MB).
When I run a cross join against reference to update table_1, it takes 16 hours. We changed the file system from EXT3 to XFS, but it still takes 16 hours. What am I doing wrong?
Here is the update/cross-join query:
mysql> UPDATE table_1 CROSS JOIN reference ON
-> (table_1.start >= reference.txStart AND table_1.end <= reference.txEnd)
-> SET table_1.name = reference.name;
Query OK, 17311434 rows affected (16 hours 36 min 48.62 sec)
Rows matched: 17311434 Changed: 17311434 Warnings: 0
Here is the SHOW CREATE TABLE output for table_1 and reference:
CREATE TABLE `table_1` (
`strand` char(1) DEFAULT NULL,
`chr` varchar(10) DEFAULT NULL,
`start` int(11) DEFAULT NULL,
`end` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`name2` varchar(255) DEFAULT NULL,
KEY `annot` (`start`,`end`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
CREATE TABLE `reference` (
`bin` smallint(5) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`chrom` varchar(255) NOT NULL,
`strand` char(1) NOT NULL,
`txStart` int(10) unsigned NOT NULL,
`txEnd` int(10) unsigned NOT NULL,
`cdsStart` int(10) unsigned NOT NULL,
`cdsEnd` int(10) unsigned NOT NULL,
`exonCount` int(10) unsigned NOT NULL,
`exonStarts` longblob NOT NULL,
`exonEnds` longblob NOT NULL,
`score` int(11) DEFAULT NULL,
`name2` varchar(255) NOT NULL,
`cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`exonFrames` longblob NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`),
KEY `name2` (`name2`),
KEY `annot` (`txStart`,`txEnd`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
You should index the table_1.start, reference.txStart, table_1.end and reference.txEnd fields:
ALTER TABLE `table_1` ADD INDEX ( `start` ) ;
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txStart` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;
Cross joins are Cartesian products, which are probably among the most computationally expensive things to compute (they do not scale well).
For tables T_1 to T_n, the number of rows generated by crossing them is the product of their sizes:
|T_1| * |T_2| * ... * |T_n|
Assuming each table has M rows, the cost of computing the cross join is
M * M * ... * M = M^n = O(M^n),
which is exponential in the number of tables involved. For example, crossing three tables of 1,000 rows each already yields 10^9 candidate rows.
I see two problems with the UPDATE statement.
First, there is no index on the end fields. The compound (annot) indexes you have will be used only for the start fields in this query. You should add the indexes Emre suggested:
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;
Second, the JOIN may (and probably does) find many rows in reference that are related to a row of table_1, so some (or all) of the updated rows of table_1 are updated many times. Check the result of this query to see whether it matches your updated row count (17311434):
SELECT COUNT(*)
FROM table_1
WHERE EXISTS
( SELECT *
FROM reference
WHERE table_1.start >= reference.txStart
AND table_1.`end` <= reference.txEnd
)
There may be other ways to write this query, but the lack of a PRIMARY KEY on both tables makes it harder. If you define a primary key on table_1, try the following, replacing id with that primary key.
Update: no, do not try this as-is on a table with 34M rows. Check the execution plan and try it on smaller tables first.
UPDATE table_1 AS t1
JOIN
( SELECT t2.id
, r.name
FROM table_1 AS t2
JOIN
( SELECT name, txStart, txEnd
FROM reference
GROUP BY txStart, txEnd
) AS r
ON t2.start >= r.txStart
AND t2.`end` <= r.txEnd
GROUP BY t2.id
) AS good
ON good.id = t1.id
SET t1.name = good.name;
You can check the query plan by running EXPLAIN on the equivalent SELECT:
EXPLAIN
SELECT t1.id, t1.name, good.name
FROM table_1 AS t1
JOIN
( SELECT t2.id
, r.name
FROM table_1 AS t2
JOIN
( SELECT name, txStart, txEnd
FROM reference
GROUP BY txStart, txEnd
) AS r
ON t2.start >= r.txStart
AND t2.`end` <= r.txEnd
GROUP BY t2.id
) AS good
ON good.id = t1.id ;
Try this:
UPDATE table_1 SET
table_1.name = (
select reference.name
from reference
where table_1.start >= reference.txStart
and table_1.end <= reference.txEnd);
Note that this sets name to NULL where no reference row matches, and fails with an error if more than one reference row matches a given row.
Somebody already suggested adding some indexes, but I think you may get the best performance with these two compound indexes (they belong to two different tables, so two ALTER statements are needed):
ALTER TABLE `reference`
ADD INDEX `reference_start_end` (`txStart` ASC, `txEnd` ASC);
ALTER TABLE `table_1`
ADD INDEX `table_1_start_end` (`start` ASC, `end` ASC);
Only one of them will be used in the query, but MySQL will automatically decide which is more useful.

Optimizing MySQL query for the Tatoeba project

I've downloaded the Tatoeba project databases and I'm trying to query them, but queries with a subquery are taking way too long.
-- 800,000 rows approx.
CREATE TABLE `sentences` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language` char(3) DEFAULT NULL,
`text` mediumtext,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=912551 DEFAULT CHARSET=utf8
-- 1.5 million rows approx.
CREATE TABLE `links` (
`sentenceId` int(10) unsigned NOT NULL,
`translatedId` int(10) unsigned NOT NULL,
PRIMARY KEY (`sentenceId`,`translatedId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Basically, the links table ties two sentences in the sentences table together (the original sentence and one translation). A sentence can have zero or more translations. So, given the id of a sentence I want to work with, I want to grab ALL the available translations.
This query gets me what I want but takes almost 18 seconds to complete.
SELECT * FROM `sentences` WHERE `id` IN (SELECT `translatedId` FROM `links` WHERE `sentenceId` = 157967);
Running both queries by themselves just takes an instant.
What am I doing wrong?
SELECT `sentences`.*
FROM `sentences`
JOIN `links` ON `id` = `translatedId`
WHERE `sentenceId` = 157967;
Some versions of MySQL are known not to use indexes in sub-queries.
Try this (using an EXISTS clause):
SELECT *
FROM `sentences` a
WHERE EXISTS (SELECT 1 FROM `links` b WHERE `sentenceId` = 157967 AND b.`translatedId`=a.`id`);
If translatedId is unique in links, you can go for an inner join as given below (keeping the filter on the source sentence):
SELECT a.*
FROM `sentences` a INNER JOIN `links` b
ON b.`translatedId` = a.`id`
WHERE b.`sentenceId` = 157967;