Remove consecutive duplicated values with different IDs in MySQL

I know there are many similar questions about finding and removing duplicate values in MySQL, but my question is a bit different:
I have a table with the columns id, timestamp and price. A script scrapes data from another web page and saves it in the database every 10 seconds. Sometimes the data ends up like this:
| id | timestamp | price |
|----|-----------|-------|
| 1 | 12:13 | 100 |
| 2 | 12:14 | 120 |
| 3 | 12:15 | 100 |
| 4 | 12:16 | 100 |
| 5 | 12:17 | 110 |
As you can see, the price 100 appears three times, and removing the row with ID = 4 will shrink the table without damaging data integrity. I need to remove consecutive duplicated records except the first one (the one with the lowest ID or timestamp).
Is there an efficient way to do it? (There are about a million records.)
I have edited my scraping script so that it checks for a duplicated price before adding a row, but I still need to shrink and maintain my old data.

Since MySQL 8.0 you can use the window function LAG() like this:
delete tbl
from tbl
join (
-- lag(price) returns the price of the previous row, ordered by id
select id, lag(price) over (order by id) as prev_price from tbl
) l
-- a row whose price equals the previous row's price is a consecutive duplicate and gets deleted
on tbl.id = l.id and tbl.price = l.prev_price;
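Before deleting anything from a table with a million rows, it may be worth previewing which rows the LAG() comparison would pick. The following is only a sketch, assuming MySQL 8.0+ and the table name tbl used above; the alias prev_price is introduced here for illustration.

```sql
-- Preview only: lists the rows that repeat the previous row's price.
-- Nothing is modified by this statement.
SELECT id, `timestamp`, price
FROM (
    SELECT id, `timestamp`, price,
           LAG(price) OVER (ORDER BY id) AS prev_price  -- price of the preceding row
    FROM tbl
) AS x
WHERE price = prev_price;
-- With the five sample rows from the question, only the row with id 4 is returned.
```

The first row has no predecessor, so its prev_price is NULL and it is never selected.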

I am just grouping by price and keeping only one record per group. The lowest id gets displayed. Hope the below helps.
select id,timestamp,price from yourTable group by price having count(price)>0;

My query is based on @Tim Biegeleisen's one.
-- delete records
DELETE
FROM yourTable t1
-- for which an older record with the same price exists
WHERE EXISTS (SELECT 1
FROM yourTable t2
WHERE t2.price = t1.price
AND t2.id < t1.id
-- and no record with a different price lies between the two
AND NOT EXISTS (SELECT 1
FROM yourTable t3
WHERE t1.price <> t3.price
AND t3.id > t2.id
AND t3.id < t1.id));
It deletes each record for which an older record with the same price exists and no record with a different price lies between the two.
The comparison could use the timestamp column instead if the id column is not numeric and ascending.

Related

Time difference between adjacent rows in one column of one mysql table

I have a table with some 100.000 rows having this structure:
+------+---------------------+-----------+
| id | timestamp | eventType |
+------+---------------------+-----------+
| 12 | 2015-07-01 16:45:47 | 3001 |
| 103 | 2015-07-10 19:30:14 | 3001 |
| 1174 | 2015-09-03 12:57:08 | 3001 |
+------+---------------------+-----------+
For each row, I would like to calculate the number of days between its timestamp and that of the previous row.
As you can see, the id is not continuous, since the table contains different events, and I would like to compare only the timestamps of one specific event over time.
I know that DATEDIFF can be used to compare two dates, and that I could pick the two rows with a query that selects each row by its specific id.
But as I have many thousands of rows, I am looking for a way to somehow loop through the whole table.
Unfortunately my SQL knowledge is limited, and searching did not reveal an example close enough to my question that I could continue from there.
I would be very thankful for any hint.
If you are running MySQL 8.0, you can just use lag(). Say you want the difference in seconds:
select t.*,
timestampdiff(
second,
lag(timestamp) over(partition by eventtype order by id),
timestamp
) diff
from mytable t
In earlier versions, one alternative is a correlated subquery:
select t.*,
timestampdiff(
second,
(select timestamp from mytable t1 where t1.eventtype = t.eventtype and t1.id < t.id order by t1.id desc limit 1),
timestamp
) diff
from mytable t
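Since the question asks for days rather than seconds, the same LAG() idea can be combined with DATEDIFF. This is a sketch assuming MySQL 8.0+ and the table name mytable used above; the alias days_since_prev is made up for the example.

```sql
-- DATEDIFF(a, b) returns a - b in whole days, ignoring the time-of-day part.
SELECT t.*,
       DATEDIFF(
           `timestamp`,
           LAG(`timestamp`) OVER (PARTITION BY eventtype ORDER BY id)
       ) AS days_since_prev
FROM mytable t;
-- For the three sample rows of eventType 3001 this gives NULL, 9 and 55.
```

The first row of each event type has no previous row, so its difference is NULL.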

How to Delete Duplicate Rows Based on 3 Column Values and Length in MySQL

I wanted some help in regards to understanding how I can delete duplicate records from my database table. I have a table of 1 million records which has been collected over a 2 year period hence there is a number of records that need to be deleted as they have been added numerous times into the database.
The following is a query that I wrote based on the three columns that I am matching for duplicates, taking a count and I have also added a length of one of the columns as this will determine whether I delete all the records or just the duplicates.
SELECT
Ref_No,
End_Date,
Filename,
count(*) as cnt,
length(Ref_No)
FROM
master_table
GROUP BY
Ref_No,
End_Date,
Filename,
length(Ref_No)
HAVING
COUNT(*) > 1
;
This then gives me an output like the following:
Ref_No | End_Date | Filename | cnt | length(Ref_No)
05011384 | 2018-07-01 | File1 | 2 | 8
1234 | 2018-12-31 | File2 | 11 | 4
1000002975625 | 2018-12-31 | File3 | 13
123456789123456789 | 2019-02-06 | File3 | 18
Now I have a list of rules to follow based on the length column and this will determine whether I leave the records as they are with the duplicates, delete the duplicates or delete all the records and this is where I am stuck.
My rules are the following:
If length is between 0 and 4 - Keep all records with duplicates
If length is between 5 and 10 - Delete Duplicates, keep 1 record
If length equals 13 - Delete Duplicates, keep 1 record
If length is 11, 12, 14-30 - Delete all records
I would really appreciate it if someone could advise me on how to go about completing this task.
Thanks.
I have managed to create a temporary table in which I add a unique id. The only thing is that I am running the query twice with the length part changed for my requirements.
INSERT INTO UniqueIDs
(
SELECT
T1.ID
FROM
master_table T1
LEFT JOIN
master_table T2
ON
(
T1.Ref_No = T2.Ref_No
AND
T1.End_Date = T2.End_Date
AND
T1.Filename = T2.Filename
AND
T1.ID > T2.ID
)
WHERE T2.ID IS NULL
AND
LENGTH(T1.Ref_No) BETWEEN 5 AND 10
)
;
I then just run the following delete to keep the unique ids in the table and remove the rest.
DELETE FROM master_table WHERE id NOT IN (SELECT ID FROM UniqueIDs);
That's it.
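For comparison, on MySQL 8.0+ the length rules from the question could probably be applied in a single statement with ROW_NUMBER(). This is only a sketch under assumptions: it relies on the ID column used in the query above, and it reads "delete all records" as every row of that length, duplicated or not. The aliases m, d, len and rn are illustrative.

```sql
DELETE m
FROM master_table AS m
JOIN (
    SELECT ID,
           LENGTH(Ref_No) AS len,
           -- rn = 1 marks the copy with the lowest ID in each duplicate group
           ROW_NUMBER() OVER (PARTITION BY Ref_No, End_Date, Filename
                              ORDER BY ID) AS rn
    FROM master_table
) AS d ON d.ID = m.ID
WHERE
    -- lengths 5-10 and 13: delete duplicates, keep one record
    ((d.len BETWEEN 5 AND 10 OR d.len = 13) AND d.rn > 1)
    -- lengths 11, 12 and 14-30: delete all records
    OR d.len IN (11, 12)
    OR d.len BETWEEN 14 AND 30;
-- Rows with length 0-4 match none of the conditions and are kept as they are.
```

Running the subquery on its own first, as a plain SELECT, shows which rows would be affected before anything is deleted.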

Remove duplicates from a table based on a non-duplicated field

I need to remove some duplicate data based on a non-duplicated field.
Sample Data:
|status|Size|id# |income|scan_date |
| 0 | 3 |123456| 1000 |2015-10-16|
| 1 | 3 |123456| 1000 |2015-10-16|
| 1 | 4 |112345| 900 |2015-09-05|
| 0 | 7 |122345| 700 |2015-10-01|
When the id# and scan_date are the same, I need to remove only the rows where the status is "0".
delete from table
where status=0
and (id, scan_date) in
(select id, scan_date from
(select id, scan_date from table
group by id, scan_date
having count(*) >=2) t1)
You need the extra subquery in MySQL, since MySQL does not allow selecting from the table that is being updated or deleted from in a subquery.
Update: see this sqlfiddle I created. I think you removed the subquery with the t1 alias, even though I explicitly warned you that it is important!
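A multi-table DELETE with a self-join is another way this might be written, and it needs no nested subquery. This is a sketch only: the table name scans is hypothetical, and the logic differs slightly from the query above, because a status 0 row is removed only when a row with a different status exists for the same id and scan_date.

```sql
DELETE t0
FROM scans AS t0
JOIN scans AS t1
  ON  t1.id = t0.id
  AND t1.scan_date = t0.scan_date
  AND t1.status <> 0          -- a sibling row with a non-zero status must exist
WHERE t0.status = 0;          -- only the status 0 row is deleted
```

If two rows with status 0 share the same id and scan_date and no other row exists for them, this version keeps both, whereas the GROUP BY version deletes both.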

swap "two" column values in two different rows using mysql

I have two rows, each of which contains a week, a day and an event. An auto-increment primary key is used to distinguish the rows.
Here is an example:
ID Week Day event
-------------------------------
1 | 1 | 2 | house keeping
2 | 2 | 3 | house viewing
What I want to do is swap the week and day of the two specified rows so that it looks like this:
ID Week Day event
-------------------------------
1 | 2 | 3 | house keeping
2 | 1 | 2 | house viewing
But the ID must remain the same.
I've been reading through other people's posts and found this solution, which uses a temporary variable to swap only one column's values from each row.
UPDATE my_table SET a=@tmp:=a, a=b, b=@tmp;
Could anyone help me swap two columns instead of just the one?
Thanks.
I assume you have just 2 rows in your table.
If not, you need to modify slightly the JOIN conditions.
Here is one possible approach.
CREATE TEMPORARY TABLE T(ID int, Week int, Day int);
INSERT INTO T(ID, Week, Day)
SELECT ID, Week, Day from TableName;
UPDATE TableName t1
JOIN T t2 on t1.ID <> t2.ID
SET
t1.Week = t2.Week,
t1.Day = t2.Day;
DROP TEMPORARY TABLE T;
And here is a better one.
UPDATE
tablename AS t1
JOIN tablename AS t2 ON
( t1.id <> t2.id )
SET
t1.week = t2.week,
t2.week = t1.week,
t1.day = t2.day,
t2.day = t1.day;
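If the table holds more than two rows, the join condition has to name the pair being swapped. A sketch of the temporary-table approach adapted that way, assuming the two rows are the ones with ID 1 and ID 2 from the example (any other pair would be substituted in the same places):

```sql
-- Snapshot of the two rows, taken before anything is changed.
CREATE TEMPORARY TABLE T AS
SELECT ID, Week, Day FROM TableName WHERE ID IN (1, 2);

-- Each row takes Week and Day from the snapshot of the other row.
UPDATE TableName AS t1
JOIN T AS t2
  ON (t1.ID = 1 AND t2.ID = 2)
  OR (t1.ID = 2 AND t2.ID = 1)
SET t1.Week = t2.Week,
    t1.Day  = t2.Day;

DROP TEMPORARY TABLE T;
```

Because the values are read from the snapshot rather than from the table being updated, the result does not depend on the order in which the two rows are processed.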

MySQL How can I add values of a column together and remove the duplicate rows?

Good day,
I have a MySQL table which has some duplicate rows that have to be removed while adding a value from one column in the duplicated rows to the original.
The problem was caused when another column had the wrong values and that is now fixed but it left the balances split among different rows which have to be added together. The newer rows that were added must then be removed.
In this example, the userid column determines if they are duplicates (or triplicates). userid 6 is duplicated and userid 3 is triplicated.
As an example for userid 3 it has to add up all balances from rows 3, 11 and 13 and has to put that total into row 3 and then remove rows 11 and 13. The balance columns of both of those have to be added together into the original, lower ID row and the newer, higher ID rows must be removed.
ID | balance | userid
---------------------
1 | 10 | 1
2 | 15 | 2
3 | 300 | 3
4 | 80 | 4
5 | 0 | 5
6 | 65 | 6
7 | 178 | 7
8 | 201 | 8
9 | 92 | 9
10 | 0 | 10
11 | 140 | 3
12 | 46 | 6
13 | 30 | 3
I hope that is clear enough and that I have provided enough info. Thanks =)
Two steps.
1. Update:
UPDATE
tableX AS t
JOIN
( SELECT userid
, MIN(id) AS min_id
, SUM(balance) AS sum_balance
FROM tableX
GROUP BY userid
) AS c
ON t.userid = c.userid
SET
t.balance = CASE WHEN t.id = c.min_id
THEN c.sum_balance
ELSE 0
END ;
2. Remove the extra rows:
DELETE t
FROM
tableX AS t
JOIN
( SELECT userid
, MIN(id) AS min_id
FROM tableX
GROUP BY userid
) AS c
ON t.userid = c.userid
AND t.id > c.min_id
WHERE
t.balance = 0 ;
Once you have this solved, it would be good to add a UNIQUE constraint on userid as it seems you want to be storing the balance for each user here. That will avoid any duplicates in the future. You could also remove the (useless?) id column.
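The constraint mentioned above could be added like this once the duplicates are gone. The key name uq_tablex_userid is just an illustrative choice.

```sql
-- Fails if duplicate userid values still exist, so run it after the cleanup.
ALTER TABLE tableX
    ADD UNIQUE KEY uq_tablex_userid (userid);
```

After this, an INSERT with an existing userid is rejected with a duplicate-key error instead of silently creating a second balance row.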
SELECT SUM(balance)
FROM your_table
GROUP BY userid
Should work, but the comment saying fix the table is really the best approach.
You can create a table with the same structure and transfer the data to it with this query
insert into newPriceTable(id, userid, balance)
select u.id, p.userid, sum(p.balance) as summation
from price p
join (
select userid, min(id) as id from price group by userid
) u ON p.userid = u.userid
group by u.id, p.userid;
Play around the query here: http://sqlfiddle.com/#!2/4bb58/2
Work is mainly done in MSSQL but you should be able to convert the syntax.
Using a GROUP BY UserID you can SUM() the Balance, join that back to your main table to update the balance across all the duplicates. Finally you can use RANK() to order your duplicate Userids and preserve only the earliest values.
I'd select all this into a new table and, if it looks good, deprecate your old table and rename the new one.
http://sqlfiddle.com/#!3/068ee/2
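In MySQL 8.0+ the same idea can be sketched with window functions instead of MSSQL syntax. The assumptions here are the table name tableX from the earlier answer and a hypothetical target table tableX_new; the aliases sum_balance and rnk are illustrative.

```sql
CREATE TABLE tableX_new AS
SELECT id, userid, sum_balance AS balance
FROM (
    SELECT id,
           userid,
           SUM(balance) OVER (PARTITION BY userid) AS sum_balance,      -- total per user
           RANK()       OVER (PARTITION BY userid ORDER BY id) AS rnk   -- 1 = earliest row
    FROM tableX
) AS ranked
WHERE rnk = 1;
-- With the sample data, userid 3 ends up as id 3 with balance 470
-- and userid 6 as id 6 with balance 111.
```

CREATE TABLE ... AS SELECT copies only the data, so keys and indexes would still need to be added to the new table before it replaces the old one.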