Remove duplicates from MySQL DB - mysql

I've got a database with over 7000 records. As it turns out, there are several duplicates within those records. I found several suggestions on how to delete duplicates and keep only 1 record.
But in my case things are a bit more complicated: cases are not duplicates simply because they hold the same data as another record. Several cases are perfectly okay holding the same data. They count as duplicates only when they hold the same data AND were both inserted within 40 seconds of each other.
Therefore I need a SQL statement that deletes duplicates (i.e. rows identical in all fields except id and datetime) if they were inserted within a 40-second range (i.e. judged by the datetime field).
Since I'm anything but a SQL expert and can't find a suitable solution online, I truly hope some of you can help me out and point me in the right direction. That would be much appreciated!
The table structure is as follows:
CREATE TABLE IF NOT EXISTS `wp_ttr_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) NOT NULL,
`schoolyear` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
`datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`area` varchar(15) CHARACTER SET utf8 NOT NULL,
`content` varchar(10) CHARACTER SET utf8 NOT NULL,
`types` varchar(100) CHARACTER SET utf8 NOT NULL,
`tasksWrong` varchar(300) DEFAULT NULL,
`tasksRight` varchar(300) DEFAULT NULL,
`tasksData` longtext CHARACTER SET utf8,
`parent_id` varchar(20) DEFAULT NULL,
UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=68696 ;
So just to clarify again, a duplicate case is a case that:
[1]holds the same data as another case for all fields, except the id and datetime field
[2]is inserted in the DB, according to the datetime field, within 40 seconds of another record with the same values
If both conditions are met, all cases except one should be deleted.

As @Juru pointed out in the comments, we need quite a surgical knife to cut this one. It is, however, possible to do this in an iterative way via a stored procedure.
First we use a self-join to identify, for every record that is not itself a duplicate, its first duplicate:
SELECT DISTINCT
MIN(postdups.id) AS id
FROM wp_ttr_results AS base
INNER JOIN wp_ttr_results AS postdups
ON base.id<postdups.id
AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
AND base.user_id=postdups.user_id
-- <=> is NULL-safe equality: two NULLs compare as equal, which plain = would not
AND base.schoolyear<=>postdups.schoolyear
AND base.area=postdups.area
AND base.content=postdups.content
AND base.types=postdups.types
AND base.tasksWrong<=>postdups.tasksWrong
AND base.tasksRight<=>postdups.tasksRight
AND base.parent_id<=>postdups.parent_id
-- tasksData is left out here; include it as well if the longtext payload must match
LEFT JOIN wp_ttr_results AS predups
ON base.id>predups.id
AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
AND base.user_id=predups.user_id
AND base.schoolyear<=>predups.schoolyear
AND base.area=predups.area
AND base.content=predups.content
AND base.types=predups.types
AND base.tasksWrong<=>predups.tasksWrong
AND base.tasksRight<=>predups.tasksRight
AND base.parent_id<=>predups.parent_id
WHERE predups.id IS NULL
GROUP BY base.id
;
This selects the lowest id of all later records (base.id<postdups.id) that have the same payload as an existing record and fall within a 40s window (UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40), but it skips base records that are themselves duplicates. In @Juru's example the :30 record would be hit, as it is a duplicate of the :00 record, which itself is not a duplicate; the :41 record would not be hit, as it is a duplicate only of :30, which itself is a duplicate of :00.
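To make that concrete, here is a hypothetical set of identical-payload rows (a minimal illustration, not data from the question):

id | datetime | first pass
---+----------+----------------------------------------------------------
1  | 12:00:00 | kept - not within 40s of any earlier record
2  | 12:00:30 | selected - within 40s of id 1, which is no duplicate
3  | 12:00:41 | kept for now - within 40s only of id 2, itself a duplicate

After id 2 is deleted, the next pass finds id 3 to be 41s away from id 1, so it is no longer a duplicate and survives; the loop introduced below exists precisely for this kind of re-evaluation.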
Now we have to remove these records. Since MySQL can't delete from a table it is simultaneously reading, we use a temporary table to hold the ids:
CREATE TEMPORARY TABLE cleanUpDuplicatesTemp SELECT DISTINCT
-- as above
;
DELETE FROM wp_ttr_results
WHERE id IN
(SELECT id FROM cleanUpDuplicatesTemp)
;
DROP TABLE cleanUpDuplicatesTemp
;
Up to now we have only removed the first duplicate of each record, in the process possibly changing what counts as a duplicate ...
Finally we must loop through this process, exiting the loop when the SELECT DISTINCT returns nothing.
Putting it all together into a stored procedure:
DELIMITER ;;
CREATE PROCEDURE cleanUpDuplicates()
BEGIN
DECLARE numDuplicates INT;
iterate: LOOP
DROP TABLE IF EXISTS cleanUpDuplicatesTemp;
CREATE TEMPORARY TABLE cleanUpDuplicatesTemp
SELECT DISTINCT
MIN(postdups.id) AS id
FROM wp_ttr_results AS base
INNER JOIN wp_ttr_results AS postdups
ON base.id<postdups.id
AND UNIX_TIMESTAMP(postdups.datetime)-UNIX_TIMESTAMP(base.datetime)<40
AND base.user_id=postdups.user_id
AND base.schoolyear<=>postdups.schoolyear
AND base.area=postdups.area
AND base.content=postdups.content
AND base.types=postdups.types
AND base.tasksWrong<=>postdups.tasksWrong
AND base.tasksRight<=>postdups.tasksRight
AND base.parent_id<=>postdups.parent_id
LEFT JOIN wp_ttr_results AS predups
ON base.id>predups.id
AND UNIX_TIMESTAMP(base.datetime)-UNIX_TIMESTAMP(predups.datetime)<40
AND base.user_id=predups.user_id
AND base.schoolyear<=>predups.schoolyear
AND base.area=predups.area
AND base.content=predups.content
AND base.types=predups.types
AND base.tasksWrong<=>predups.tasksWrong
AND base.tasksRight<=>predups.tasksRight
AND base.parent_id<=>predups.parent_id
WHERE predups.id IS NULL
GROUP BY base.id;
SELECT COUNT(*) INTO numDuplicates FROM cleanUpDuplicatesTemp;
IF numDuplicates<=0 THEN
LEAVE iterate;
END IF;
DELETE FROM wp_ttr_results
WHERE id IN
(SELECT id FROM cleanUpDuplicatesTemp);
END LOOP iterate;
DROP TABLE IF EXISTS cleanUpDuplicatesTemp;
END;;
DELIMITER ;
Now a simple CALL cleanUpDuplicates; should do the trick.

This might work, but it probably won't be very fast...
DELETE FROM dupes
USING wp_ttr_results AS dupes
INNER JOIN wp_ttr_results AS origs
ON dupes.field1 = origs.field1
AND dupes.field2 = origs.field2
AND ....
AND dupes.id <> origs.id
AND dupes.`datetime` BETWEEN origs.`datetime` AND (origs.`datetime` + INTERVAL 40 SECOND)
;
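For reference, a sketch with the question's actual columns filled in for the placeholders above (using NULL-safe <=> for the nullable columns, which is my assumption about the intended matching):

DELETE FROM dupes
USING wp_ttr_results AS dupes
INNER JOIN wp_ttr_results AS origs
ON dupes.user_id = origs.user_id
AND dupes.schoolyear <=> origs.schoolyear
AND dupes.area = origs.area
AND dupes.content = origs.content
AND dupes.types = origs.types
AND dupes.tasksWrong <=> origs.tasksWrong
AND dupes.tasksRight <=> origs.tasksRight
AND dupes.parent_id <=> origs.parent_id
AND dupes.id <> origs.id
AND dupes.`datetime` BETWEEN origs.`datetime` AND (origs.`datetime` + INTERVAL 40 SECOND)
;

Note that this one-shot variant also deletes chains (the :41 record would go, because it is within 40s of the :30 record), which the iterative procedure above deliberately keeps.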

Related

update all rows after insertion in mysql

I am working with mysql and I have a table with the following structure (a summary):
CREATE TABLE `costs` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`utility` DECIMAL(9,2) UNSIGNED NOT NULL,
`tax` DECIMAL(9,2) UNSIGNED NOT NULL,
`active` TINYINT(1) UNSIGNED NOT NULL DEFAULT '1',
`created_at` TIMESTAMP NULL DEFAULT NULL,
`updated_at` TIMESTAMP NULL DEFAULT NULL,
PRIMARY KEY (`id`)
);
where the active field defaults to 1 when inserting. Then, when saving a new record, I would like all other rows to have their active field set to 0, so I tried to create a trigger for this, but I am getting a MySQL error.
DELIMITER //
CREATE TRIGGER after_costs_insert AFTER INSERT ON costs FOR EACH ROW
BEGIN
UPDATE costs SET active = 0 WHERE id <> NEW.id;
END;
//
DELIMITER ;
I think it is not possible to do this, so how can I update these rows?
A trigger cannot modify the table it was fired on. That's a typical limitation in SQL, mainly meant to prevent infinite recursion (a query fires the trigger, which executes a query, which fires the trigger again, and so on).
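For reference, the error you are getting is presumably MySQL's error 1442, along the lines of:

ERROR 1442 (HY000): Can't update table 'costs' in stored function/trigger because it is already used by statement which invoked this stored function/trigger.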
Here, instead of storing this derived information, I would actually recommend using a view that computes the column on the fly when queried.
If you are running MySQL 8.0:
create view costs_view as
select
id,
utility,
tax,
row_number() over(order by id desc) = 1 active,
created_at,
updated_at
from costs
In earlier versions:
create view costs_view as
select
id,
utility,
tax,
id = (select max(id) from costs) active,
created_at,
updated_at
from costs
This gives you an always up-to-date column that you just don't need to maintain.
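A quick usage sketch against the view defined above (costs_view simply replaces costs in reads):

select * from costs_view where active;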
If you want only the most recent row, then you can use:
select c.*
from costs c
order by id desc -- or created_at
limit 1;
This will work in a view.
More often, the situation is that you have one active row per something -- such as a "utility" or whatever. In that case, you can use a secondary table to store one row per "something" along with the id of the active row. The trigger can then set this id, as sketched below.
In your case, you have only one active row in the costs table, so a secondary table might be considered overkill. You can easily get the current active value using the above query.
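A sketch of that secondary-table variant, assuming a hypothetical utility_id grouping column (not present in the costs table above); the trigger is legal here because it modifies a different table than the one that fired it:

create table active_cost (
utility_id int unsigned primary key, -- one row per "something"
cost_id int unsigned not null        -- id of the currently active costs row
);

delimiter //
create trigger costs_set_active after insert on costs for each row
begin
-- upsert: point the group's active row at the newly inserted cost
insert into active_cost (utility_id, cost_id)
values (new.utility_id, new.id)
on duplicate key update cost_id = new.id;
end;
//
delimiter ;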

MySQL group by query with subselect optimization

I have the following tables in MySQL:
CREATE TABLE `events` (
`pv_name` varchar(60) COLLATE utf8mb4_unicode_ci NOT NULL,
`time_stamp` bigint(20) unsigned NOT NULL,
`event_type` varchar(40) COLLATE utf8mb4_unicode_ci NOT NULL,
`value` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
`value_type` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`value_count` bigint(20) DEFAULT NULL,
`alarm_status` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`alarm_severity` varchar(40) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
PRIMARY KEY (`pv_name`,`time_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;
CREATE TEMPORARY TABLE `matching_pv_names` (
`pv_name` varchar(60) NOT NULL,
PRIMARY KEY (`pv_name`)
) ENGINE=Memory DEFAULT CHARSET=latin1;
The matching_pv_names table holds a subset of the unique events.pv_name values.
The following query runs using the 'loose index scan' optimization:
SELECT events.pv_name, MAX(events.time_stamp) AS time_stamp
FROM events
WHERE events.time_stamp <= time_stamp_in
GROUP BY events.pv_name;
Is it possible to improve the time of this query by restricting the events.pv_name values to those in the matching_pv_names table without losing the 'loose index scan' optimization?
Try one of the queries below to limit your output to the matching values found in matching_pv_names.
Query 1:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name;
Query 2:
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
AND EXISTS ( select 1 from matching_pv_names pv WHERE e.pv_name = pv.pv_name )
GROUP BY e.pv_name;
Let me quote the manual here, since I think it applies to your case (emphasis mine):
If the WHERE clause contains range predicates (...), a loose index scan looks up the first key of each group that satisfies the range conditions, and again reads the least possible number of keys. This is possible under the following conditions:
The query is over a single table.
Knowing this, I believe Query 1 would not be able to use a loose index scan, but the second query probably could. If that is still not the case, you could also try the third method proposed below, which uses a derived table.
Query 3:
SELECT e.*
FROM (
SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name
) e
INNER JOIN matching_pv_names pv ON e.pv_name = pv.pv_name;
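Whichever variant you try, EXPLAIN should tell you whether the loose index scan survived: it is reported as "Using index for group-by" in the Extra column. For example:

EXPLAIN SELECT e.pv_name, MAX(e.time_stamp) AS time_stamp
FROM events e
WHERE e.time_stamp <= time_stamp_in
GROUP BY e.pv_name;
-- Extra: "Using index for group-by" indicates a loose index scan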
Your query is very efficient. You can 'prove' it by doing this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Most numbers refer to "rows touched", either in the index or in the data. You will see very low numbers. If the biggest one is about the number of rows returned, that is very good. (I tried a similar query and got about 2x; I don't know why.)
With that few rows touched, then either:
- Outputting the rows will overwhelm the run time, so who cares about the efficiency; or
- You were I/O-bound because of leapfrogging through the index (actually, the table in your case). Run it a second time; it will be fast because of caching.
The only way to speed up leapfrogging is to somehow move the desired rows next to each other. That seems unreasonable for this query.
As for playing games with another table -- Maybe. Will the JOIN significantly decrease the number of events to look at? Then Maybe. Otherwise, I say "a very efficient query is not going to get faster by adding complexity".

Duplicate values in reference tables

Our application calls a stored procedure to normalize its data to reference tables, after which it inserts a record into the main table, partially containing values and partially containing ids that map to the reference tables. This is one of the stored procedures:
CREATE PROCEDURE `sp_name`(IN valueIn varchar(100), OUT valueOut int)
BEGIN
declare maxid int;
declare countid int;
select max(id) into valueOut from tableName where fieldName=valueIn;
IF valueOut is NULL
THEN
start transaction with consistent snapshot;
select count(*) into countid from tableName where fieldName=valueIn;
IF countid=0
THEN
insert into tableName (fieldName) values (valueIn);
select last_insert_id() into valueOut;
ELSE
select max(id) into valueOut from tableName where fieldName=ValueIn;
end IF;
commit;
end IF;
END
When called manually this works fine, but when called in production we end up with multiple duplicate values in the reference tables.
Transaction isolation level is REPEATABLE_READ.
Ref table:
CREATE TABLE `tableName` (
`id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`fieldName` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=100 DEFAULT CHARSET=utf8
Using a unique key constraint on the field fieldName isn't a good option. We have tried this, but then instead of getting duplicate values, we see that the auto-increment skips IDs. We are trying to preserve IDs so that we do not need to over-allocate when it comes to data types. Our main table is huge (multi-billion rows), so we have to make efficient use of data types.
Anybody out there that understands this phenomenon?
There are a lot of hurdles you have to clear if you want to build your own replacement for auto_increment. You'll find problems of serializability, concurrency, performance (usually related to locking), etc.
I think the simplest solution might be to use auto_increment on a column of type bigint unsigned. The maximum value of an unsigned integer is 4,294,967,295: roughly 4x10^9. The maximum value of an unsigned bigint is 18,446,744,073,709,551,615: roughly 1.8x10^19.
The auto_increment will still skip id numbers, but that's by design, and it shouldn't cause trouble with a range of 1.8x10^19.
Before you commit to this path, test big numbers with your client software. Some still don't deal gracefully with bigint, signed or not.
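If the unique constraint on fieldName were reinstated, one common idiom (my suggestion, not part of the answer above; it still skips ids on collisions, by design) is to combine the widened column with INSERT ... ON DUPLICATE KEY UPDATE and LAST_INSERT_ID(expr), which hands back the existing row's id instead of creating a duplicate:

ALTER TABLE tableName
MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
ADD UNIQUE KEY uq_fieldName (fieldName);

-- get-or-create in one statement
INSERT INTO tableName (fieldName) VALUES ('some value')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SELECT LAST_INSERT_ID(); -- id of the new or the pre-existing row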

Update/Assign a foreign key to a pre-existing table row and not have it overwritten [mysql]

I have a table called promotion_codes
CREATE TABLE promotion_codes (
id int(10) UNSIGNED NOT NULL auto_increment,
created_at datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
code varchar(255) NOT NULL,
order_id int(10) UNSIGNED NULL DEFAULT NULL,
allocated_at datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This table is pre-populated with available codes that will be assigned to orders that meet a specific criteria.
What I need to ensure is that after the ORDER is created, that I obtain an available promotion code and update its record to reflect that it has been allocated.
I am not 100% sure how to avoid grabbing the same record twice if simultaneous requests come in.
I have tried locking the row during a select and locking the row during an update - both still seem to allow a second (simultaneous) attempt to grab the same record, which is what I want to avoid:
UPDATE promotion_code
SET allocated_at = "' . $db_now . '", order_id = ' . $donation->id . '
WHERE order_id IS NULL LIMIT 1
You can add a second table which holds all used codes. A unique constraint in that assignment table then makes sure that no code is assigned twice.
CREATE TABLE `used_codes` (
`usage` INTEGER PRIMARY KEY AUTO_INCREMENT,
`id` INTEGER NOT NULL UNIQUE, -- makes sure that there are no two assignments of one code
`allocated_at` DATETIME NOT NULL);
You add the id of a used code into the used_codes table, and afterwards query which code you used. When these two operations run in one transaction, the entire transaction fails when there is a second attempt to use the same code.
I did not test the following code; you might need to adjust it.
Also you need to make sure that your server meets the requirements for transactions (i.e. the tables involved are InnoDB).
-- These changes have to be atomic, so don't use autocommit
SET autocommit = 0;
START TRANSACTION;
-- Pick one unused code. The free code is found via a LEFT JOIN, because
-- a subquery on used_codes, the insert target, would be rejected by MySQL.
INSERT INTO `used_codes` (`id`, `allocated_at`)
SELECT p.`id`, NOW()
FROM `promotion_codes` AS p
LEFT JOIN `used_codes` AS u ON u.`id` = p.`id`
WHERE u.`id` IS NULL
LIMIT 1;
-- Within this connection, LAST_INSERT_ID() is the `usage` value that was
-- just generated, so we can map it back to the allocated code.
SELECT `code` FROM `promotion_codes` WHERE `id` =
(SELECT `id` FROM `used_codes` WHERE `usage` = LAST_INSERT_ID());
COMMIT;
You can use the returned code if the transaction succeeded. If more than one process tried to use the same code, only one of them succeeds, while the rest fail with insert errors about the duplicated row. In your software you need to distinguish between the duplicate-row error and other errors, and re-execute the statement on duplication errors.
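If the retry is to live on the SQL side, a hypothetical wrapper procedure might look like this (allocate_code and out_code are made-up names; catching the duplicate-key error in the application and re-running works just as well). Error 1062 is MySQL's duplicate-key error:

DELIMITER //
CREATE PROCEDURE allocate_code(OUT out_code VARCHAR(255))
BEGIN
DECLARE dup INT DEFAULT 0;
DECLARE CONTINUE HANDLER FOR 1062 SET dup = 1;
retry: LOOP
SET dup = 0;
START TRANSACTION;
INSERT INTO used_codes (id, allocated_at)
SELECT p.id, NOW()
FROM promotion_codes AS p
LEFT JOIN used_codes AS u ON u.id = p.id
WHERE u.id IS NULL
LIMIT 1;
IF dup = 1 THEN
ROLLBACK; -- lost the race to another session; try the next free code
ITERATE retry;
END IF;
-- (a real version would also bail out when no free codes remain)
SELECT code INTO out_code FROM promotion_codes
WHERE id = (SELECT id FROM used_codes WHERE `usage` = LAST_INSERT_ID());
COMMIT;
LEAVE retry;
END LOOP retry;
END;
//
DELIMITER ;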

Insert if not exists

How to have only 3 rows in the table and only update them?
I have a settings table, and at first run there is nothing in it, so I want to insert 3 records like so:
id | label  | value | desc
---+--------+-------+-----
1  | start  | 10    | 0
2  | middle | 24    | 0
3  | end    | 76    | 0
After this I need to update these settings from a PHP script, in one query.
I have researched REPLACE INTO, but I end up with duplicate rows in the DB.
Here is my current query:
$query_insert=" REPLACE INTO setari (`eticheta`, `valoare`, `disabled`)
VALUES ('mentenanta', '".$mentenanta."', '0'),
('nr_incercari_login', '".$nr_incercari_login."', '0'),
('timp_restrictie_login', '".$timp_restrictie_login."', '0')
";
Any ideas?
Here is the create table statement, just so you can see it in case I'm missing something.
CREATE TABLE `setari` (
`id` int(10) unsigned NOT NULL auto_increment,
`eticheta` varchar(200) NOT NULL,
`valoare` varchar(250) NOT NULL,
`disabled` tinyint(1) unsigned NOT NULL default '0',
`data` datetime default NULL,
`cod` varchar(50) default NULL,
PRIMARY KEY (`eticheta`,`id`,`valoare`),
UNIQUE KEY `id` (`eticheta`,`id`,`valoare`)
) ENGINE=MyISAM
As explained in the manual, you need to create a UNIQUE index on (label,value) or (label,value,desc) for REPLACE INTO to be able to determine uniqueness.
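For the table in the question, that presumably means a unique key on eticheta, after which the REPLACE INTO from the question behaves as intended (any rows already duplicated on eticheta would have to be removed first):

ALTER TABLE setari ADD UNIQUE KEY uq_eticheta (eticheta);

Keep in mind that REPLACE deletes and re-inserts the row, so the id changes and the data/cod columns fall back to their defaults.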
What you want is the 'ON DUPLICATE KEY UPDATE' syntax. Read through it for the full details, but essentially you need a unique or primary key on one of your fields; then you start a normal insert query and add that clause (along with what you want to actually update) to the end. The db engine will then try to add the information, and when it comes across a duplicate key already inserted, it updates the fields you tell it to with the new information.
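A sketch of that for the question's table, again assuming a unique key on eticheta (the PHP variables are best bound as parameters instead of interpolated):

INSERT INTO setari (eticheta, valoare, disabled)
VALUES ('mentenanta', ?, 0),
('nr_incercari_login', ?, 0),
('timp_restrictie_login', ?, 0)
ON DUPLICATE KEY UPDATE valoare = VALUES(valoare);

Unlike REPLACE INTO, this updates matching rows in place, so ids and the remaining columns are preserved.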
I simply skip the headache and use a temporary table. Quick and clean.
SQL Server allows you to select into a non-existing temp table, creating it for you. MySQL, however, requires you to first create the temp table and then insert into it.
1.
Create empty temp table.
CREATE TEMPORARY TABLE IF NOT EXISTS insertsetari
SELECT eticheta, valoare, disabled
FROM setari
WHERE 1=0
2.
Insert data into temp table.
INSERT INTO insertsetari
VALUES
('mentenanta', '".$mentenanta."', '0'),
('nr_incercari_login', '".$nr_incercari_login."', '0'),
('timp_restrictie_login', '".$timp_restrictie_login."', '0')
3.
Remove rows in temp table that are already found in target table.
DELETE a FROM insertsetari AS a INNER JOIN setari AS b
ON a.eticheta = b.eticheta
AND a.valoare = b.valoare
AND a.disabled = b.disabled
4.
Insert temp table residual rows into target table.
INSERT INTO setari
SELECT * FROM insertsetari
5.
Cleanup temp table.
DROP TEMPORARY TABLE insertsetari;
Comments:
- You should avoid replacing when the new data and the old data are the same. Replacing should only be for situations where there is a high probability of detecting key values that are the same while the non-key values are different.
- Placing data into a temp table allows the data to be massaged, transformed and modified easily before inserting into the target table.
- Deleting rows from the temp table is faster.
- If anything goes wrong, the temp table gives you an additional debugging stage to find out what went wrong.
- You should consider doing it all in a single transaction.
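One caveat on that last point: the setari table above is MyISAM, which ignores transactions, so a transactional version presumably needs the table converted to InnoDB first:

ALTER TABLE setari ENGINE=InnoDB;

START TRANSACTION;
-- steps 2 through 4 from above go here
COMMIT;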