Deleting Duplicates in MySQL

Deleting Duplicates in MySQL - mysql

Query was this:
CREATE TABLE `query` (
`id` int(11) NOT NULL auto_increment,
`searchquery` varchar(255) NOT NULL default '',
`datetime` int(11) NOT NULL default '0',
PRIMARY KEY (`id`)
) ENGINE=MyISAM
first I want to drop the table with:
ALTER TABLE `querynew` DROP `id`
and then delete the double entries..
I tried it with:
INSERT INTO `querynew` SELECT DISTINCT * FROM `query`
but with no success.. :(
and with ALTER TABLE query ADD UNIQUE ( searchquery ) - is it possible to save the queries only one time?

I would use MySQL's multi-table delete syntax:
DELETE q2 FROM query q1 JOIN query q2 USING (searchquery, datetime)
WHERE q1.id < q2.id;

I would do this using an index with the MySQL-specific IGNORE keyword. This kills two birds with one stone: it deletes duplicate rows, and adds a unique index so that you will not get any more of them. It is usually faster than the other methods as well:
alter ignore table query add unique index(searchquery, datetime);

You should be able to do it without first removing the column:
DELETE FROM `query`
WHERE `id` IN (
SELECT `id`
FROM `query` q
WHERE EXISTS ( -- Any matching rows with a lower id?
SELECT *
FROM `query`
WHERE `searchquery` = q.`searchquery`
AND `datetime` = q.`datetime`
AND `id` < q.`id`
)
);
You could also go via a temp table:
SELECT MIN(`id`), `searchquery`, `datetime`
INTO `temp_query`
GROUP BY `searchquery`, `datetime`;
DELETE FROM `query`;
INSERT INTO `query` SELECT * FROM `temp_query`;

Related

Use Indexes For Join on Indexed DATETIME and Indexed DATE columns

EDIT
I misread my initial error and blamed the INDEX not being used on the wrong columns.
I was able to recreate the issue that I saw and the solution that ysth suggested worked.
Below are the create tables statements, inserts to the tables, and two queries - one that has the error and another with the solution which does not have it.
# Make tables and indices
DROP TABLE a;
DROP TABLE b;
create table a
(
DT DATE,
USER INT,
COMMENT_SENTIMENT INT,
PRIMARY KEY (USER, DT));
CREATE INDEX a_DT_USER_IDX ON a (DT,USER);
create table b
(
id int auto_increment primary key,
DT DATETIME(6),
USER mediumtext,
COMMENT_SENTIMENT INT);
CREATE INDEX b_DT_USER_IDX ON b (DT);
CREATE UNIQUE INDEX b_DT_USER ON b (USER(16), DT);
# Insert some dummy data
INSERT INTO a VALUES('2023-01-01', 5, 4);
INSERT INTO b VALUES(NULL, '2023-01-01 00:00:00', 5, 4);
# Explain that shows the issue I was seeing.
EXPLAIN
SELECT *
FROM a
JOIN b
ON a.DT = b.DT
AND a.USER = b.USER;
# Out
# 1,SIMPLE,a,,ALL,"PRIMARY,a_DT_USER_IDX",,,,1,100,
# 1,SIMPLE,b,,ref,"b_DT_USER,b_DT_USER_IDX",b_DT_USER_IDX,9,a.DT,1,100,Using index condition; Using where
[2023-01-24 18:00:14] [HY000][1739] Cannot use ref access on index 'b_DT_USER' due to type or collation conversion on field 'USER'
[2023-01-24 18:00:14] [HY000][1003] /* select#1 */ select `a`.`DT` AS `DT`, `a`.`USER` AS `USER`,`a`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT`,`b`.`id` AS `id`,`b`.`DT` AS `DT`,`b`.`USER` AS `USER`,`b`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT` from `a` join `b` where ((`a`.`DT` = `b`.`DT`) and (`a`.`USER` = `b`.`USER`))
# Explain with the fix ysth suggested
EXPLAIN
SELECT *
FROM a
JOIN b
ON a.DT = b.DT
AND a.USER = CAST(b.USER AS DECIMAL );
# 1,SIMPLE,a,,ALL,"PRIMARY,a_DT_USER_IDX",,,,1,100,
# 1,SIMPLE,b,,ref,b_DT_USER_IDX,b_DT_USER_IDX,9,a.DT,1,100,Using index condition; Using where
# [2023-01-24 18:04:24] [HY000][1003] /* select#1 */ select `a`.`DT` AS `DT`,`a`.`USER` AS `USER`,`a`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT`,`b`.`id` AS `id`,`b`.`DT` AS `DT`,`b`.`USER` AS `USER`,`b`.`COMMENT_SENTIMENT` AS `COMMENT_SENTIMENT` from `a` join `b` where ((`a`.`DT` = `b`.`DT`) and (`a`.`USER` = cast(`b`.`USER` as decimal(10,0))))
# [2023-01-24 18:04:24] 2 rows retrieved starting from 1 in 359 ms (execution: 250 ms, fetching: 109 ms)
__
The below information is incorrect. Please use the edit to see the issue I was having and it's solution.
I have three tables a, b, and c in my MySQL 5.7 database. SHOW CREATE statements for each table are:
CREATE TABLE `a` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` date DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `a_DT_USER_IDX` (`DT`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `b` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` datetime DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `b_DT_USER_IDX` (`DT`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `c` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` date DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `b_DT_USER_IDX` (`DT`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Table a has a DATE column a.DT, table b has a DATETIME column b.DT, and table c has a DATE column c.DT.
All of these DT columns are indexed.
As a caveat, while b.DT is a DATETIME, all of the 'time' portions in it are 00:00:00 and they always will be. It probably should be a DATE, but I cannot change it.
I want to join table a and table b on their DT columns, but explain tells me that their indices are not used:
Cannot use ref access on index 'b.DT_datetime_index' due to type or collation conversion on field 'DT'
When I join table a and b on a.DT and b.DT
SELECT *
FROM a
JOIN b
ON a.DT = b.DT;
The result is much slower than when I do the same with a and c
SELECT *
FROM a
JOIN c
ON a.DT = c.DT;
Is there a way to use the indices in join from the first query on a.DT = b.DT, specifically without altering the tables? I'm not sure if b.DT having only 00:00:00 for the time portion could be relevant in a solution.
The end goal is a faster select using this join.
Thank you!
-- What I've done section --
I compared the joins between a.DT = b.DT and a.DT = c.DT, and saw the time difference.
I also tried wrapping b's DT column with DATE(b.DT), but explain gave the same issue, which is pretty expected.

MySQL won't use an index to join DATE and DATETIME columns.
You can create a virtual column with the corresponding DATE and use that.
CREATE TABLE `b` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`DT` datetime DEFAULT NULL,
`USER` int(11) DEFAULT NULL,
`COMMENT_SENTIMENT` int(11) DEFAULT NULL,
`DT_DATE` DATE AS (DATE(DT)),
PRIMARY KEY (`id`),
KEY `b_DT_USER_IDX` (`DT_DATE`,`USER`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
SELECT *
FROM a
JOIN b
ON a.DT = b.DT_DATE;

Assuming you want to read a and join b rows, you can just do
SELECT *
FROM a
JOIN b
ON b.DT = timestamp(a.DT);
If the other way around, then
SELECT *
FROM b
JOIN a
ON a.DT = date(b.DT);
No need for a virtual column.

Virtually any function call is "not sargable " That is, CAST(b.USER AS DECIMAL ) prevents the use of an index.
Do not mix strings and ints in comparisons. The string will be converted to numeric. If the string is a literal, such as '123' then the Optimizer is smart enough to do that once. If it is a column name, it must check every row.
Tip: If you are likely to test for one user and a range of dates, then this works better than the opposite order.
INDEX(user, dt)`
(You may need an index starting with dt for other queries.)

MySQL INSERT INTO ... SELECT ... GROUP BY is too slow

I have a table with about 50M rows and format:
CREATE TABLE `big_table` (
`id` BIGINT NOT NULL,
`t1` DATETIME NOT NULL,
`a` BIGINT NOT NULL,
`type` VARCHAR(10) NOT NULL,
`b` BIGINT NOT NULL,
`is_c` BOOLEAN NOT NULL,
PRIMARY KEY (`id`),
INDEX `a_b_index` (a,b)
) ENGINE=InnoDB;
I then define the table t2, with no indices:
Create table `t2` (
`id` BIGINT NOT NULL,
`a` BIGINT NOT NULL,
`b` BIGINT NOT NULL,
`t1min` DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I then populate t2 using a query from big_table (this will add about 12M rows).
insert into opportunities
(id, a,b,t1min)
SELECT id,a,b,min(t1)
FROM big_table use index (a_b_index)
where type='SUBMIT' and is_c=1
GROUP BY a,b;
I find that it takes this query about a minute to process 5000 distinct (a,b) in big_table.
Since there are 12M distinct (a,b) in big_table then it would take about 40 hours to run
the query on all of big_table.
What is going wrong?
If I just do SELECT ... then the query does 5000 lines in about 2s. If I SELECT ... INTO OUTFILE ..., then the query still takes 60s for 5000 lines.
EXPLAIN SELECT ... gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,stdnt_intctn_t,index,NULL,a_b_index,16,NULL,46214255,"Using where"

I found that the problem was that the GROUP_BY resulted in too many random-access reads of big_table. The following strategy allows one sequential trip through big_table. First, we add a key to t2:
Create table `t2` (
`id` BIGINT NOT NULL,
`a` BIGINT NOT NULL,
`b` BIGINT NOT NULL,
`t1min` DATETIME NOT NULL,
PRIMARY KEY (a,b),
INDEX `id` (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Then we fill t2 using:
insert into t2
(id, a,b,t1min)
SELECT id,a,b,t1
FROM big_table
where type='SUBMIT' and is_c=1
ON DUPLICATE KEY UPDATE
t1min=if(t1<t1min,t1,t1min),
id=if(t1<t1min,big_table.id,t2.id);
The resulting speed-up is several orders of magnitude.

The group by might be part of the issue. You are using an index on (a,b), but your where is not being utilized. I would have an index on
(type, is_c, a, b )
Also, you are getting the "ID", but not specifying which... you probably want to do a MIN(ID) for a consistent result.

Update statement causes fields to be updated with NULL or maximum value

If you had to pick one of the two following queries, which would you choose and why:
UPDATE `table1` AS e
SET e.points = e.points+(
SELECT points FROM `table2` AS ep WHERE e.cardnbr=ep.cardnbr);
or:
UPDATE `table1` AS e
INNER JOIN
(
SELECT points, cardnbr
FROM `table2`
) AS ep ON (e.cardnbr=ep.cardnbr)
SET e.points = e.points+ep.points;
Tables' definitions:
CREATE TABLE `table1` (
`cardnbr` int(10) DEFAULT NULL,
`name` varchar(50) DEFAULT NULL,
`points` decimal(7,3) DEFAULT '0.000',
`email` varchar(50) NOT NULL DEFAULT 'user#company.com',
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25205 DEFAULT CHARSET=latin1$$
CREATE TABLE `table2` (
`cardnbr` int(10) DEFAULT NULL,
`id` int(11) NOT NULL AUTO_INCREMENT,
`points` decimal(7,3) DEFAULT '0.000',
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci$$
UPDATE: BOTH are causing problems the first is causing non matched rows to update into NULL.
The second is causing them to update into the max value 999.9999 (decimal 7,3).
PS the cardnbr field is NOT a key

I prefer the second one..reason for that is
When using JOIN the databse can create an execution plan that is better for your query and save time whereas subqueries (like your first one ) will run all the queries and load all the datas which may take time.
i think subqueries is easy to read but performance wise JOIN is faster...

First, the two statements are not equivalent, as you found out yourself. The first one will update all rows of table1, putting NULL values for those rows that have no related rows in table2.
So the second query looks better because it doesn't update all rows of table1. It could be written in a more simpel way, like this though:
UPDATE table1 AS e
INNER JOIN table2 AS ep
ON e.cardnbr = ep.cardnbr
SET e.points = e.points + ep.points ;
So, the 2nd query would be the best to use, if cardnbr was the primary key of table2. Is it?
If it isn't, then which values from table2 should be used for the update of table1 (added to points)? All of them? You could use this:
UPDATE table1 AS e
INNER JOIN
( SELECT SUM(points) AS points, cardnbr
FROM table2
GROUP BY cardnbr
) AS ep ON e.cardnbr = ep.cardnbr
SET
e.points = e.points + ep.points ;
Just one of them? That would require some other derived table, depending on what you want.

Super slow query with CROSS JOIN

I have two tables named table_1 (1GB) and reference (250Mb).
When I query a cross join on reference it takes 16hours to update table_1 .. We changed the system files EXT3 for XFS but still it's taking 16hrs.. WHAT AM I DOING WRONG??
Here is the update/cross join query :
mysql> UPDATE table_1 CROSS JOIN reference ON
-> (table_1.start >= reference.txStart AND table_1.end <= reference.txEnd)
-> SET table_1.name = reference.name;
Query OK, 17311434 rows affected (16 hours 36 min 48.62 sec)
Rows matched: 17311434 Changed: 17311434 Warnings: 0
Here is a show create table of table_1 and reference:
CREATE TABLE `table_1` (
`strand` char(1) DEFAULT NULL,
`chr` varchar(10) DEFAULT NULL,
`start` int(11) DEFAULT NULL,
`end` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`name2` varchar(255) DEFAULT NULL,
KEY `annot` (`start`,`end`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
CREATE TABLE `reference` (
`bin` smallint(5) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`chrom` varchar(255) NOT NULL,
`strand` char(1) NOT NULL,
`txStart` int(10) unsigned NOT NULL,
`txEnd` int(10) unsigned NOT NULL,
`cdsStart` int(10) unsigned NOT NULL,
`cdsEnd` int(10) unsigned NOT NULL,
`exonCount` int(10) unsigned NOT NULL,
`exonStarts` longblob NOT NULL,
`exonEnds` longblob NOT NULL,
`score` int(11) DEFAULT NULL,
`name2` varchar(255) NOT NULL,
`cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`exonFrames` longblob NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`),
KEY `name2` (`name2`),
KEY `annot` (`txStart`,`txEnd`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;

You should index table_1.start, reference.txStart, table_1.end and reference.txEnd table fields:
ALTER TABLE `table_1` ADD INDEX ( `start` ) ;
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txStart` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;

Cross joins are Cartesian Products, which are probably one of the most computationally expensive things to compute (they don't scale well).
For each table T_i for i = 1 to n, the number of rows generated by crossing tables T_1 to T_n is the size of each table multiplied by the size of each other table, ie
|T_1| * |T_2| * ... * |T_n|
Assuming each table has M rows, the resulting cost of computing the cross join is then
M_1 * M_2 ... M_n = O(M^n)
which is exponential in the number of tables involved in the join.

I see 2 problems with the UPDATE statement.
There is no index for the End fields. The compound indexes (annot) you have will be used only for the start fields in this query. You should add them as suggested by Emre:
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;
Second, the JOIN may (and probably does) find many rows of table reference that are related to a row of table_1. So some (or all) rows of table_1 that are updated, are updated many times. Check the result of this query, to see if it is the same as your updated rows count (17311434):
SELECT COUNT(*)
FROM table_1
WHERE EXISTS
( SELECT *
FROM reference
WHERE table_1.start >= reference.txStart
AND table_1.`end` <= reference.txEnd
)
There can be other ways to write this query but the lack of a PRIMARY KEY on both tables makes it harder. If you define a primary key on table_1, try this, replacing id with the primary key.
Update: No, do not try it on a table with 34M rows. Check the execution plan and try with smaller tables first.
UPDATE table_1 AS t1
JOIN
( SELECT t2.id
, r.name
FROM table_1 AS t2
JOIN
( SELECT name, txStart, txEnd
FROM reference
GROUP BY txStart, txEnd
) AS r
ON t2.start >= r.txStart
AND t2.`end` <= r.txEnd
GROUP BY t2.id
) AS good
ON good.id = t1.id
SET t1.name = good.name;
You can check the query plan by running EXPLAIN on the equivalent SELECT:
EXPLAIN
SELECT t1.id, t1.name, good.name
FROM table_1 AS t1
JOIN
( SELECT t2.id
, r.name
FROM table_1 AS t2
JOIN
( SELECT name, txStart, txEnd
FROM reference
GROUP BY txStart, txEnd
) AS r
ON t2.start >= r.txStart
AND t2.`end` <= r.txEnd
GROUP BY t2.id
) AS good
ON good.id = t1.id ;

Try this:
UPDATE table_1 SET
table_1.name = (
select reference.name
from reference
where table_1.start >= reference.txStart
and table_1.end <= reference.txEnd)

Somebody already offered you to add some indexes. But I think the best performance you may get with these two indexes:
ALTER TABLE `test`.`time`
ADD INDEX `reference_start_end` (`txStart` ASC, `txEnd` ASC),
ADD INDEX `table_1_star_end` (`start` ASC, `end` ASC);
Only one of them will be used by MySQL query, but MySQL will decide which is more useful automatically.

Optimizing MySQL query for the Tatoeba project

I've downloaded the Tatoeba project databases and I'm trying to query them but queries with a subquery are taking way too long.
-- 800.000 rows approx.
CREATE TABLE `sentences` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language` char(3) DEFAULT NULL,
`text` mediumtext,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=912551 DEFAULT CHARSET=utf8
-- 1.5 million rows approx.
CREATE TABLE `links` (
`sentenceId` int(10) unsigned NOT NULL,
`translatedId` int(10) unsigned NOT NULL,
PRIMARY KEY (`sentenceId`,`translatedId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Basically the links table joins two sentences in the sentences table together (the original sentence and one translation). A sentence can have zero or more translations. So I have an id of a sentence I want to work with and want to grab ALL the translations available.
This query gets me what I want but takes almost 18 seconds to complete.
SELECT * FROM `sentences` WHERE `id` IN (SELECT `translatedId` FROM `links` WHERE `sentenceId` = 157967);
Running both queries by themselves just takes an instant.
What am I doing wrong?

SELECT `sentences`.* FROM
`sentences` JOIN
`links` ON `id` = `translatedId`
WHERE `sentenceId` = 157967;
Some versions of MySQL are known not to use indexes in sub-queries.

Try this(using an EXISTS clause):
SELECT *
FROM `sentences` a
WHERE EXISTS (SELECT 1 FROM `links` b WHERE `sentenceId` = 157967 AND b.`translatedId`=a.`id`);
If the translatedId is unique in the links you can go for a inner join as given below
SELECT a.*
FROM `sentences` a INNER JOIN `links` b
ON b.`translatedId`=a.`id`

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Deleting Duplicates in MySQL - mysql

I would use MySQL's multi-table delete syntax: DELETE q2 FROM query q1 JOIN query q2 USING (searchquery, datetime) WHERE q1.id < q2.id;

Related

Use Indexes For Join on Indexed DATETIME and Indexed DATE columns

MySQL INSERT INTO ... SELECT ... GROUP BY is too slow

Update statement causes fields to be updated with NULL or maximum value

Super slow query with CROSS JOIN

Optimizing MySQL query for the Tatoeba project

Categories

Resources