Delete duplicate rows from large DB - MySQL

This is a follow-up to my original question posted here.
My setup:
Amazon RDS, using MySQL Workbench with the connection timeout set to the maximum.
I am trying to DELETE duplicate rows from my DB, which has close to 1 million rows.
The table looks like this; mytext is a MEDIUMTEXT column and id is AUTO_INCREMENT:
+----+-------+-------+--------+---------+
| id | fname | lname | mytext | morevar |
+----+-------+-------+--------+---------+
|  1 | joe   | min   | abc    | 123     |
|  2 | joe   | min   | abc    | 123     |
|  3 | mar   | kam   | def    | 789     |
|  4 | kel   | smi   | ghi    | 456     |
+----+-------+-------+--------+---------+
I would like to end up with a table like this:
+----+-------+-------+--------+---------+
| id | fname | lname | mytext | morevar |
+----+-------+-------+--------+---------+
|  1 | joe   | min   | abc    | 123     |
|  3 | mar   | kam   | def    | 789     |
|  4 | kel   | smi   | ghi    | 456     |
+----+-------+-------+--------+---------+
This solution started working, but after about 10,000 rows the process slows down and eventually hangs.
I let it run for over 20 hours, deleting 10,000 rows at a time with a WHERE condition (I thought deleting in chunks would be safer).
But even with the WHERE clause the system hangs, and I then have to reboot RDS to access the DB.
DELETE FROM yourTable
WHERE id > 40000
  AND id <= 50000
  AND id NOT IN
  (
      SELECT MAXID FROM
      (
          SELECT MAX(id) AS MAXID
          FROM yourTable
          GROUP BY mytext
      ) AS temp_table
  )
Here's the CREATE statement:
CREATE TABLE `yourTable` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `fname` varchar(45) DEFAULT NULL,
  `lname` varchar(45) DEFAULT NULL,
  `mytext` mediumtext,
  `morevar` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1$$
Question
Is this SQL command OK for handling large numbers of rows, and for what I am trying to achieve? Or is there a better solution?
How long would it normally take to process 1 million rows?
Is there a setting, like in php.ini, inside Amazon for large data-set manipulation?
Or would it make more sense to create a new table and insert all rows excluding duplicates?
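For what it's worth, a minimal sketch of that last copy-and-swap approach, assuming the row with the lowest id is the one to keep (matching the desired output above) and that a short write pause during the swap is acceptable:

CREATE TABLE yourTable_dedup LIKE yourTable;

-- Copy only the first row (lowest id) for each distinct mytext value.
INSERT INTO yourTable_dedup
SELECT t.*
FROM yourTable t
JOIN (
    SELECT MIN(id) AS id
    FROM yourTable
    GROUP BY mytext
) keepers ON keepers.id = t.id;

-- Swap the tables in one atomic statement; the old table remains as a backup.
RENAME TABLE yourTable TO yourTable_old, yourTable_dedup TO yourTable;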

I really wouldn't use NOT IN.
I would ensure that there is an index on (mytext, id) and then try this...
DELETE FROM yourTable
WHERE id > 40000
  AND id <= 50000
  AND EXISTS
  (
      SELECT *
      FROM yourTable AS lookup
      WHERE lookup.mytext = yourTable.mytext
        AND lookup.id > yourTable.id
  )
This way you only check the mytext values that you are potentially deleting, whereas your subquery returns ids for mytext values that don't even appear in the range you are checking.
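One caveat to the index advice: mytext is a MEDIUMTEXT column, and MySQL cannot index a TEXT column in full, so a prefix length is required (or a separate hash column). A minimal sketch, where the 255-character prefix is an arbitrary, hypothetical choice:

-- The prefix length is illustrative; rows that share their first 255
-- characters will still need a row lookup to be distinguished.
ALTER TABLE yourTable ADD INDEX idx_mytext_id (mytext(255), id);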

Related

Fetch latest row for a specific identifier

I have a table that looks like this:
+----+------------+-------+---------------------+
| ID | identifier | data  | created_at          |
+----+------------+-------+---------------------+
|  1 | 500        | test1 | 2011-08-30 15:27:29 |
|  2 | 501        | test1 | 2011-08-30 15:27:29 |
|  3 | 500        | test2 | 2011-08-30 15:28:29 |
|  4 | 865        | test3 | 2011-08-30 15:29:29 |
|  5 | 501        | test2 | 2011-08-30 15:31:29 |
|  6 | 500        | test3 | 2011-08-30 15:31:29 |
+----+------------+-------+---------------------+
What I need is the most up-to-date entry for each identifier; that could be decided by either the ID or the date in created_at. I assumed ID is the better choice due to the indexing.
I would expect this result set:
4 | 865 | test3 | 2011-08-30 15:29:29
5 | 501 | test2 | 2011-08-30 15:31:29
6 | 500 | test3 | 2011-08-30 15:31:29
The result should be ordered by either date or ID in ascending order.
It's important to note that this table contains ~8 million rows.
I have tried quite a few approaches with self-joins and subqueries. Unfortunately, all of them produced either wrong results or half a decade of run time.
To provide an example:
SELECT lo1.*
FROM `table` lo1
INNER JOIN
(
    SELECT MAX(id) AS MaxID, identifier
    FROM `table`
    GROUP BY identifier
) lo2
    ON lo1.identifier = lo2.identifier
   AND lo1.id = lo2.MaxID
ORDER BY lo1.id DESC
LIMIT 10
The above query takes very long and sometimes does not return the latest result for an identifier; I'm not quite sure why, though.
Does anyone have an approach that is able to fetch the required result sets and preferably does not take a decade?
As asked, here is the create code:
CREATE TABLE `table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `identifier` int(11) NOT NULL,
  `data` varchar(200) COLLATE latin1_bin NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `identifier` (`identifier`),
  KEY `created_at` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin
Here is a query that gives the correct results, but it won't scale on larger tables.
Query
SELECT
    `table`.*
FROM
    `table`
INNER JOIN
(
    SELECT
        MAX(id) AS MaxID
      , identifier
    FROM
        `table`
    GROUP BY
        identifier
    # Disabling the implicit GROUP BY sorting might make the query faster.
    ORDER BY
        NULL
) `table_group`
ON
    `table`.ID = `table_group`.MaxID
ORDER BY
    `table`.ID DESC
LIMIT 10
Result
| id | identifier | data | created_at |
|----|------------|-------|----------------------|
| 6 | 500 | test3 | 2011-08-30T15:31:29Z |
| 5 | 501 | test2 | 2011-08-30T15:31:29Z |
| 4 | 865 | test3 | 2011-08-30T15:29:29Z |
see demo http://www.sqlfiddle.com/#!9/7f4401/4
But when you check "View Execution Plan" you can see "Using where; Using temporary; Using filesort" in the Extra column, meaning MySQL needs to sort the result with its filesort algorithm; "Using temporary" means the sort first runs on an in-memory temporary table.
If that in-memory temporary table becomes too large, it is converted to an on-disk MyISAM temporary table, and the sort then needs disk-based random I/O, which is slow.
So this method will not scale on a table with ~8 million rows.
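As a side note, the in-memory threshold mentioned above is governed by the tmp_table_size and max_heap_table_size system variables (the smaller of the two applies). You can inspect both with:

SHOW VARIABLES WHERE Variable_name IN ('tmp_table_size', 'max_heap_table_size');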
The query below also gives the same results, but it should be better optimized:
Query
SELECT
    `table`.*
FROM
    `table`
INNER JOIN
(
    SELECT
        `table`.ID
    FROM
        `table`
    INNER JOIN
    (
        SELECT
            MAX(id) AS MaxID
          , identifier
        FROM
            `table`
        GROUP BY
            identifier
        # Disabling the implicit GROUP BY sorting might make the query faster.
        ORDER BY
            NULL
    ) AS `table_group`
    ON
        `table`.ID = `table_group`.MaxID
) AS `table_group_max`
ON
    `table`.ID = `table_group_max`.ID
ORDER BY
    `table`.ID DESC
LIMIT 10
Result
| id | identifier | data | created_at |
|----|------------|-------|----------------------|
| 6 | 500 | test3 | 2011-08-30T15:31:29Z |
| 5 | 501 | test2 | 2011-08-30T15:31:29Z |
| 4 | 865 | test3 | 2011-08-30T15:29:29Z |
see demo http://www.sqlfiddle.com/#!9/7f4401/21
When you check "View Execution Plan" there is no more "Using temporary; Using filesort", meaning the query should be more optimal than the previous one and should in theory execute faster.
That matters because the combination "Using temporary; Using filesort" can be a real performance killer, as explained above.
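As an aside that is not part of the original answer: on MySQL 8.0 and later, the same latest-row-per-identifier result can be written with a window function, avoiding the join against a grouped subquery entirely. A minimal sketch against the table definition above:

SELECT id, identifier, data, created_at
FROM (
    -- Number the rows within each identifier, newest (highest id) first.
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY identifier ORDER BY id DESC) AS rn
    FROM `table` t
) ranked
WHERE rn = 1
ORDER BY id DESC
LIMIT 10;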

MySQL: Strange behavior of UPDATE query (ERROR 1062 Duplicate entry)

I have a MySQL database that stores news articles with the publication date (just day information), the source, and the category. Based on these, I want to generate a table that holds the article counts w.r.t. these 3 parameters.
Since for some combinations of these 3 parameters there might be no article, a simple GROUP BY won't do. I therefore first generate a table news_article_counts with all possible combinations of the 3 parameters and a default article_count of 0 -- like this:
SELECT * FROM news_article_counts;
+--------------+------------+----------+---------------+
| published_at | source     | category | article_count |
+--------------+------------+----------+---------------+
| 2016-08-05   | 1826089206 | 0        | 0             |
| 2016-08-05   | 1826089206 | 1        | 0             |
| 2016-08-05   | 1826089206 | 2        | 0             |
| 2016-08-05   | 1826089206 | 3        | 0             |
| 2016-08-05   | 1826089206 | 4        | 0             |
| ...          | ...        | ...      | ...           |
+--------------+------------+----------+---------------+
For testing, I now created a temporary table tmp as the GROUP BY result from the original news article table:
SELECT * FROM tmp LIMIT 6;
+--------------+------------+----------+-----+
| published_at | source     | category | cnt |
+--------------+------------+----------+-----+
| 2016-08-05   | 1826089206 | 3        | 1   |
| 2003-09-19   | 1826089206 | 4        | 1   |
| 2005-08-08   | 1826089206 | 3        | 1   |
| 2008-07-22   | 1826089206 | 4        | 1   |
| 2008-11-26   | 1826089206 | 8        | 1   |
| ...          | ...        | ...      | ... |
+--------------+------------+----------+-----+
Given these two tables, the following query works as expected:
SELECT * FROM news_article_counts c, tmp t
WHERE c.published_at = t.published_at AND c.source = t.source AND c.category = t.category;
But now I need to update the article_count of table news_article_counts with the values in table tmp where the 3 parameters match up. For this I'm using the following query (I've tried different ways but with the same results):
UPDATE
    news_article_counts c
INNER JOIN
    tmp t
ON
    c.published_at = t.published_at AND
    c.source = t.source AND
    c.category = t.category
SET
    c.article_count = t.cnt;
Executing this query yields this error:
ERROR 1062 (23000): Duplicate entry '2018-04-07 14:46:17-1826089206-1' for key 'uniqueIndex'
uniqueIndex is a joint index over published_at, source, and category of table news_article_counts. But this shouldn't be a problem, since -- as far as I can tell -- I do not update any of those 3 values, only article_count.
What confuses me most is that the error mentions the timestamp at which I executed the query (here: 2018-04-07 14:46:17). I have absolutely no idea where this comes into play. In fact, some rows in news_article_counts now have 2018-04-07 14:46:17 as the value for published_at. While this explains the error, I cannot see why published_at gets overwritten with the current timestamp. There is no ON UPDATE CURRENT_TIMESTAMP on this column; see:
CREATE TABLE IF NOT EXISTS `test`.`news_article_counts` (
  `published_at` TIMESTAMP NOT NULL,
  `source` INT UNSIGNED NOT NULL,
  `category` INT UNSIGNED NOT NULL,
  `article_count` INT UNSIGNED NOT NULL DEFAULT 0,
  UNIQUE INDEX `uniqueIndex` (`published_at` ASC, `source` ASC, `category` ASC))
ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8mb4;
What am I missing here?
UPDATE 1: I actually checked the table definition of news_article_counts in the database, and there's indeed the following:
mysql> SHOW COLUMNS FROM news_article_counts;
+---------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+-------------------+-----------------------------+
| published_at | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| source | int(10) unsigned | NO | | NULL | |
| category | int(10) unsigned | NO | | NULL | |
| article_count | int(10) unsigned | NO | | 0 | |
+---------------+------------------+------+-----+-------------------+-----------------------------+
But why is on update CURRENT_TIMESTAMP set? I double- and triple-checked my CREATE TABLE statement. I removed the joint index; I added an artificial primary key (auto_increment). Nothing helped. I've even tried to explicitly remove these attributes from published_at with:
ALTER TABLE `news_article_counts` CHANGE `published_at` `published_at` TIMESTAMP NOT NULL;
Nothing seems to work for me.
It looks like you have the explicit_defaults_for_timestamp system variable disabled. One of the effects of this is:
The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.
You could try enabling this system variable, but that could potentially impact other applications. I think it only takes effect when you're actually creating a table, so it shouldn't affect any existing tables.
If you don't want to make a system-level change like this, you could add an explicit DEFAULT attribute to the published_at column of this table; then it won't automatically add ON UPDATE.
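A minimal sketch of that column-level fix (the DEFAULT used here is one option; any explicit DEFAULT suppresses the automatic attributes):

-- First confirm the suspected server setting.
SHOW VARIABLES LIKE 'explicit_defaults_for_timestamp';

-- With an explicit DEFAULT and no ON UPDATE clause, MySQL no longer adds
-- ON UPDATE CURRENT_TIMESTAMP automatically.
ALTER TABLE `news_article_counts`
  MODIFY `published_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;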

Restrict insertion based on a count

So, I need to safely restrict the insertion of entries in a table based on the count of other entries in that same table. Say we have the following table:
resource:(id, foreign_key)
I need to create up to a certain number of entries based on the foreign key. So, as soon as I reach a count -- let's say 100 for our example -- I want to restrict creating more entries.
The obvious answer would be something like this:
1. Count the entries with the specified foreign key.
2. If count < limit, insert the new entry.
And in fact, that's what I have been using. The thing is, this approach is not fail-proof, since between steps 1 and 2 another insertion might occur. I considered the possibility of using transactions, but (unless I'm completely misunderstanding transactions) this has the same issue:
1. Start a transaction.
2. Insert the new entry.
3. If the entries have exceeded the limit, roll back; otherwise, commit.
Now, say we already have 99/100 entries and two transactions run at the same time. They both will commit, since they don't see each other's entries.
Short of actually creating the entry and then deleting it if it's invalid (which feels kind of messy in my mind), I can't think of a way to solve this issue. Any ideas?
edit: upon request I'm providing sample data:
table1
+-------------+------------------+------+-----+----------------+
| Field | Type | Null | Key | Extra |
+-------------+------------------+------+-----+----------------+
| id | int(10) unsigned | NO | PRI | auto_increment |
| limit | int(10) unsigned | NO | MUL | |
+-------------+------------------+------+-----+----------------+
table2
+-------------+------------------+------+-----+----------------+
| Field | Type | Null | Key | Extra |
+-------------+------------------+------+-----+----------------+
| id | int(10) unsigned | NO | PRI | auto_increment |
| foreign_id | int(10) unsigned | NO | MUL | |
+-------------+------------------+------+-----+----------------+
and some sample data:
table1
+----+----------+
| id | limit |
+----+----------+
| 1 | 5 |
+----+----------+
table2
+----+---------------+
| id | foreign_id |
+----+---------------+
| 1 | 1 |
+----+---------------+
| 2 | 1 |
+----+---------------+
| 3 | 1 |
+----+---------------+
| 4 | 1 |
+----+---------------+
At this point, let's say that two users attempt to create table2 entries. The first one should be accepted and the second rejected.
With the first approach, if both users go through step 1 (counting the old entries) and then through step 2 (inserting the new entry), both entries will be created.
With the second approach, if both of them run at the same time, they both will count 4 slots before themselves and commit, instead of one of them rolling back.
Hello mate, a stored procedure with a structure similar to this may help you.
UPDATE:
DROP PROCEDURE IF EXISTS sp_insert_record;
DELIMITER //
CREATE PROCEDURE sp_insert_record(
    IN insert_value1 INT(9),
    IN chosen_id INT(9)
)
BEGIN
    -- Look up the configured limit for the chosen parent row.
    SELECT id, `limit`
    INTO @id, @limit
    FROM table1
    WHERE id = chosen_id;

    START TRANSACTION;

    INSERT INTO table2 (id, foreign_id)
    VALUES (insert_value1, chosen_id);

    -- Count the rows for this foreign key, including the row just inserted.
    SELECT COUNT(id)
    INTO @count
    FROM table2
    WHERE foreign_id = @id;

    -- Keep the insert only if it did not push the count past the limit.
    IF @count <= @limit THEN
        COMMIT;
    ELSE
        ROLLBACK;
    END IF;
END//
DELIMITER ;
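A hypothetical invocation, with illustrative values for the new row's id and the parent id:

CALL sp_insert_record(5, 1);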
By using a Stored Procedure, you can also add any validation or process based on your requirements.
Hope this can be of help, cheers!
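One caution about the sketch above: with InnoDB's default isolation level, two concurrent callers may each see only their own insert when counting, so both could still commit. An alternative sketch that closes this gap with a locking read on the parent row (serializing inserters for the same foreign_id; names taken from the sample data):

START TRANSACTION;

-- Lock the parent row; a second session reaching this point for the same
-- parent id blocks here until the first transaction commits or rolls back.
SELECT `limit` INTO @limit
FROM table1
WHERE id = 1
FOR UPDATE;

SELECT COUNT(*) INTO @count
FROM table2
WHERE foreign_id = 1;

-- Insert only if the count is still below the limit.
INSERT INTO table2 (foreign_id)
SELECT 1 FROM DUAL WHERE @count < @limit;

COMMIT;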

MySQL range query is slow

I have read different links like http://goo.gl/1nr3s2, http://goo.gl/gv4Vlc and other Stack Overflow questions, but none of them helped me with this problem.
This problem involves multiple tables, but EXPLAIN helped me identify that the range access is the main problem with the query.
First I need to explain that I have this table with this sample data (I will not use ids in any table, to simplify the process):
+-------+----------+----------------+--------------+---------------+----------------+
| marca | submarca | modelo_inicial | modelo_final | motor | texto_articulo |
+-------+----------+----------------+--------------+---------------+----------------+
| Buick | Century | 1993 | 1996 | 4 Cil 2.2 Lts | BE1254AG4 |
| Buick | Century | 1993 | 1996 | 4 Cil 2.2 Lts | 854G4 |
+-------+----------+----------------+--------------+---------------+----------------+
This table has more than 1.5 million rows, and I have created a composite index that combines the initial and final model columns, plus an independent index on each of them, as in this structure:
CREATE TABLE `general` (
  `id_general` int(11) NOT NULL AUTO_INCREMENT,
  `id_marca_submarca` int(11) NOT NULL,
  `id_modelo_inicial` int(11) NOT NULL,
  `id_modelo_final` int(11) NOT NULL,
  `id_motor` int(11) NOT NULL,
  `id_articulo` int(11) NOT NULL,
  PRIMARY KEY (`id_general`),
  KEY `fk_general_articulo` (`id_articulo`),
  KEY `modelo_inicial_final` (`id_modelo_inicial`,`id_modelo_final`),
  KEY `indice_motor` (`id_motor`),
  KEY `indice_marca_submarca` (`id_marca_submarca`),
  KEY `indice_modelo_inicial` (`id_modelo_inicial`),
  KEY `indice_modelo_final` (`id_modelo_final`),
  CONSTRAINT `fk_general_articulo` FOREIGN KEY (`id_articulo`) REFERENCES `articulo` (`id_articulo`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1191853 DEFAULT CHARSET=utf8
I have another table that contains the different years, like this sample data:
+-----------+--------------+
| id_modelo | texto_modelo |
+-----------+--------------+
| 76        | 2014         |
| 75        | 2013         |
| ...       | ...          |
| 1         | 1939         |
+-----------+--------------+
I created a query that contains a subquery to obtain specific data, but it took a lot of time. I will show some of the queries I have tried, but none of them has worked properly for me.
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular.modelo M
    ON G.id_modelo_inicial <= M.id_modelo
   AND G.id_modelo_final >= M.id_modelo
WHERE EXISTS
(
    SELECT DISTINCT A.id_articulo
    ...subquery...
    WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
This query took many seconds, so I used EXPLAIN; as noted above, the plan showed the range access as the main problem.
This is another query I tried:
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular_rigs.modelo M
    ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
    SELECT DISTINCT A.id_articulo
    ...subquery... WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
Some operations you could do to change the query plan:
OP1: Get rid of all the keys or indexes on table general.
OP2: Use SELECT 1 instead of SELECT DISTINCT A.id_articulo in the subquery in EXISTS (see the sketch below).
Do these operations separately and compare the differences.
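To illustrate OP2, a minimal sketch follows. The subquery body is elided in the question, so the articulo table (known to exist from the foreign key in the CREATE TABLE above) stands in for it here; only the SELECT 1 pattern is the point:

SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular.modelo M
    ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
    -- EXISTS only asks whether any row matches, so there is nothing to gain
    -- from DISTINCT inside it; SELECT 1 lets the optimizer stop at the first hit.
    SELECT 1
    FROM articulo A
    WHERE A.id_articulo = G.id_articulo
)
ORDER BY M.texto_modelo DESC;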

Difficult MySQL Update Query with Self-Join

Our website has listings. We use a connections table with the following structure to connect our members with these various listings:
CREATE TABLE `connections` (
  `cid1` int(9) unsigned NOT NULL DEFAULT '0',
  `cid2` int(9) unsigned NOT NULL DEFAULT '0',
  `type` char(2) NOT NULL,
  `created` datetime DEFAULT NULL,
  `updated` datetime DEFAULT NULL,
  PRIMARY KEY (`cid1`,`cid2`,`type`,`cid3`),
  KEY `cid1` (`cid1`,`type`),
  KEY `cid2` (`cid2`,`type`)
);
The problem we've run into is that when we combine duplicate listings from time to time, we also need to update our member connections. We have been using the following query, which breaks if a member is connected to both listings:
update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001;
What I can't figure out is how to do the following, which would solve this issue:
update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (
select cid1 from connections
where type IN ('MC','MT','MW') AND cid2=100000
);
When I try to run that query I get the following error:
ERROR 1093 (HY000): You can't specify target table 'connections' for update in FROM clause
Here is some sample data. Notice the update conflict for cid1 = 10025925
+----------+--------+------+---------------------+---------------------+
| cid1 | cid2 | type | created | updated |
+----------+--------+------+---------------------+---------------------+
| 10010388 | 100000 | MC | 2010-08-05 18:04:51 | 2011-06-16 16:26:17 |
| 10025925 | 100000 | MC | 2010-10-31 09:21:25 | 2010-10-31 16:21:25 |
| 10027662 | 100000 | MC | 2011-06-13 16:31:12 | NULL |
| 10038375 | 100000 | MW | 2011-02-05 05:32:35 | 2011-02-05 19:51:58 |
| 10065771 | 100000 | MW | 2011-04-24 17:06:35 | NULL |
| 10025925 | 100001 | MC | 2010-10-31 09:21:45 | 2010-10-31 16:21:45 |
| 10034884 | 100001 | MC | 2011-01-20 18:54:51 | NULL |
| 10038375 | 100001 | MC | 2011-02-04 05:00:35 | NULL |
| 10041989 | 100001 | MC | 2011-02-26 09:33:18 | NULL |
| 10038259 | 100001 | MC | 2011-05-07 13:34:20 | NULL |
| 10027662 | 100001 | MC | 2011-06-13 16:33:54 | NULL |
| 10030855 | 100001 | MT | 2010-12-31 20:40:18 | NULL |
| 10038375 | 100001 | MT | 2011-02-04 05:00:36 | NULL |
+----------+--------+------+---------------------+---------------------+
I'm hoping that someone can suggest the right way to run the above query. Thanks in advance!
The reason for the error in your query is that in MySQL you cannot SELECT from the table you are trying to UPDATE in the same query.
Use UPDATE IGNORE to avoid duplicate conflicts.
I think you should read up on INSERT ... ON DUPLICATE KEY UPDATE. The idea is that you frame an INSERT query that always creates a duplicate conflict, and then the UPDATE part does its work.
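An aside not in the original answers: a common workaround for ERROR 1093 is to wrap the subquery in one more derived-table layer, which forces MySQL to materialize it before the UPDATE runs. A minimal sketch based on the query from the question:

update connections set cid2=100000
where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (
    -- The extra derived table is materialized first, so the UPDATE no longer
    -- reads directly from the table it is modifying.
    select cid1 from (
        select cid1 from connections
        where type IN ('MC','MT','MW') AND cid2=100000
    ) as already_connected
);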
One possible way would be to use a temporary table for your subquery, then select from the temp table. Efficiency could break down fast if you need to execute a lot of these queries, though.
create temporary table subq as select cid1 from connections where type IN ('MC','MT','MW') AND cid2=100000
And
update connections set cid2=100000 where type IN ('MC','MT','MW') AND cid2=100001 AND cid1 NOT IN (select cid1 from subq);
UPDATE connections cn1
LEFT JOIN connections cn2 ON cn1.cid1 != cn2.cid1
AND cn2.type IN ('MC','MT','MW')
AND cn2.cid2=100000
SET cn1.cid2=100000
WHERE cn1.TYPE IN ('MC','MT','MW')
AND cn1.cid2=100001
AND cn2.cid1 IS NULL -- i.e. there is no matching record
I was thinking something like the following, but I'm not 100% sure how your data looks before and after to determine accuracy. The idea is to join the table to itself on your subquery's where clause and the exclusion where cid1 must not match.
update connections c1 left outer join connections c2
on (c2.cid2 = 100000 and c2.type in ('MC','MT','MW') and c1.cid1 != c2.cid1)
set c1.cid2 = 100000
where c1.type in ('MC', 'MT', 'MW') and c1.cid2=100001 and c2.cid1 is null;
As near as I can tell, it'll work. I used your create table (minus the cid3 in the primary key) and made sure I had 2 rows with the same cid1 and different cid2 values (one 100000 and the other 100001), and that the statement only affected 1 row.