I need to remove some duplicate data based on a non-duplicated field.
Sample Data:
| status | Size | id#    | income | scan_date  |
| 0      | 3    | 123456 | 1000   | 2015-10-16 |
| 1      | 3    | 123456 | 1000   | 2015-10-16 |
| 1      | 4    | 112345 | 900    | 2015-09-05 |
| 0      | 7    | 122345 | 700    | 2015-10-01 |
When the id# and scan_date are the same, I need to remove only the rows where the status is "0".
delete from `table`
where status = 0
and (id, scan_date) in
    (select id, scan_date from
        (select id, scan_date from `table`
         group by id, scan_date
         having count(*) >= 2) t1)
You need the extra subquery (the t1 alias) in MySQL, since MySQL does not allow you to select from the table you are deleting from or updating.
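As a quick sanity check of the logic, the same statement runs unchanged on SQLite (which does not have this restriction, so the t1 wrapper is not strictly needed there). A sketch via Python's sqlite3, with a made-up table name `scans`; the row-value `(id, scan_date) IN (...)` form needs SQLite 3.15+:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (status INT, size INT, id INT, income INT, scan_date TEXT)")
conn.executemany(
    "INSERT INTO scans VALUES (?, ?, ?, ?, ?)",
    [(0, 3, 123456, 1000, "2015-10-16"),
     (1, 3, 123456, 1000, "2015-10-16"),
     (1, 4, 112345,  900, "2015-09-05"),
     (0, 7, 122345,  700, "2015-10-01")])

# remove status-0 rows only where (id, scan_date) occurs more than once
conn.execute("""
    DELETE FROM scans
    WHERE status = 0
      AND (id, scan_date) IN
          (SELECT id, scan_date FROM
              (SELECT id, scan_date FROM scans
               GROUP BY id, scan_date
               HAVING COUNT(*) >= 2) t1)
""")

rows = conn.execute("SELECT status, id FROM scans ORDER BY id").fetchall()
# the status-0 half of the duplicated (123456, 2015-10-16) pair is gone,
# while the lone status-0 row for id 122345 survives
```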
Update: see this sqlfiddle I created. I think you removed the subquery with the t1 alias, even though I explicitly warned you that it is important!
Related
I know there are a ton of similar questions about finding and removing duplicate values in MySQL, but my question is a bit different:
I have a table with columns ID, Timestamp and Price. A script scrapes data from another webpage and saves it to the database every 10 seconds. Sometimes the data ends up like this:
| id | timestamp | price |
|----|-----------|-------|
| 1 | 12:13 | 100 |
| 2 | 12:14 | 120 |
| 3 | 12:15 | 100 |
| 4 | 12:16 | 100 |
| 5 | 12:17 | 110 |
As you can see, there are three rows with the duplicated price 100, and removing the row with ID = 4 will shrink the table without damaging data integrity. I need to remove consecutive duplicated records except the first one (the one with the lowest ID or Timestamp).
Is there an efficient way to do it? (There are about a million records.)
I edited my scraping script so it checks for a duplicated price before inserting, but I still need to shrink and clean up my old data.
Since MySQL 8.0 you can use the window function LAG() as follows:
delete tbl.* from tbl
join (
    -- use lag(price) to get the value from the previous row
    select id, lag(price) over (order by id) price from tbl
) l
-- join rows whose price equals the previous row's price; these will be deleted
on tbl.id = l.id and tbl.price = l.price;
fiddle
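SQLite (3.25+) has LAG() but no DELETE ... JOIN, so the same idea can be checked from Python's sqlite3 with the join rewritten as an IN; the table name tbl matches the answer, the rest is the sample data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INT, timestamp TEXT, price INT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)",
                 [(1, "12:13", 100), (2, "12:14", 120), (3, "12:15", 100),
                  (4, "12:16", 100), (5, "12:17", 110)])

# delete rows whose price equals the previous row's price
conn.execute("""
    DELETE FROM tbl
    WHERE id IN (SELECT id FROM
                    (SELECT id, price,
                            LAG(price) OVER (ORDER BY id) AS prev
                     FROM tbl)
                 WHERE price = prev)
""")

ids = [r[0] for r in conn.execute("SELECT id FROM tbl ORDER BY id")]
# only id 4 is a consecutive duplicate; id 3 is kept because the
# price changed in between
```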
I am just grouping by price and keeping one record per group; the row with the lowest id is kept. Hope the below helps.
select min(id) as id, min(timestamp) as timestamp, price from yourTable group by price;
My query is based on @Tim Biegeleisen's one.
-- delete records
DELETE t1
FROM yourTable t1
-- where an older record with the same price exists
WHERE EXISTS (SELECT 1
              FROM yourTable t2
              WHERE t2.price = t1.price
                AND t2.id < t1.id
                -- but no record with a different price exists between the two
                AND NOT EXISTS (SELECT 1
                                FROM yourTable t3
                                WHERE t3.price <> t1.price
                                  AND t3.id > t2.id
                                  AND t3.id < t1.id));
It deletes records for which an older record with the same price exists and no record with a different price sits between the two.
If the id column is not numeric and ascending, the timestamp column could be used for the comparison instead.
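The logic can be sanity-checked in SQLite via Python's sqlite3; SQLite lets correlated subqueries reference the delete target directly, so no alias on the outer table is needed there:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INT, timestamp TEXT, price INT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)",
                 [(1, "12:13", 100), (2, "12:14", 120), (3, "12:15", 100),
                  (4, "12:16", 100), (5, "12:17", 110)])

# delete rows that repeat the price of an older row with no
# different-priced row in between
conn.execute("""
    DELETE FROM yourTable
    WHERE EXISTS (SELECT 1 FROM yourTable t2
                  WHERE t2.price = yourTable.price
                    AND t2.id < yourTable.id
                    AND NOT EXISTS (SELECT 1 FROM yourTable t3
                                    WHERE t3.price <> yourTable.price
                                      AND t3.id > t2.id
                                      AND t3.id < yourTable.id))
""")

ids = [r[0] for r in conn.execute("SELECT id FROM yourTable ORDER BY id")]
# id 3 survives because row 2 (price 120) sits between it and row 1
```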
I have a table with some 100,000 rows having this structure:
+------+---------------------+-----------+
| id | timestamp | eventType |
+------+---------------------+-----------+
| 12 | 2015-07-01 16:45:47 | 3001 |
| 103 | 2015-07-10 19:30:14 | 3001 |
| 1174 | 2015-09-03 12:57:08 | 3001 |
+------+---------------------+-----------+
For each row, I would like to calculate the days between the timestamp of this and the previous row.
As you can see, the id is not continuous, as the table contains different events, and I would like to compare only the timestamps of one specific event over time.
I know that DATEDIFF can be used to compare two dates, and I could define the two rows with a query that selects each row by its specific id.
But as I have many thousands of rows, I am searching for a way to loop through the whole table somehow.
Unfortunately my SQL knowledge is limited and searching did not reveal an example close enough to my question that I could continue from there.
I would be very thankful for any hint.
If you are running MySQL 8.0, you can just use lag(). Say you want the difference in seconds:
select t.*,
       timestampdiff(
           second,
           lag(timestamp) over (partition by eventtype order by id),
           timestamp
       ) diff
from mytable t
In earlier versions, one alternative is a correlated subquery:
select t.*,
       timestampdiff(
           second,
           (select t1.timestamp
            from mytable t1
            where t1.eventtype = t.eventtype and t1.id < t.id
            order by t1.id desc
            limit 1),
           timestamp
       ) diff
from mytable t
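A rough check of the LAG() version using SQLite via Python's sqlite3; SQLite has no TIMESTAMPDIFF(), so strftime('%s', ...) converts each timestamp to epoch seconds before subtracting:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INT, timestamp TEXT, eventType INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                 [(12,   "2015-07-01 16:45:47", 3001),
                  (103,  "2015-07-10 19:30:14", 3001),
                  (1174, "2015-09-03 12:57:08", 3001)])

# pair each row with the previous timestamp of the same event,
# then take the difference in seconds
rows = conn.execute("""
    SELECT id,
           strftime('%s', timestamp) - strftime('%s', prev) AS diff
    FROM (SELECT id, timestamp,
                 LAG(timestamp) OVER (PARTITION BY eventType ORDER BY id) AS prev
          FROM mytable)
    ORDER BY id
""").fetchall()
# the first row of each event has no previous row, so its diff is NULL
```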
I would like some help understanding how I can delete duplicate records from my database table. I have a table of 1 million records collected over a 2-year period, and a number of records need to be deleted because they were added to the database numerous times.
The following is a query that I wrote based on the three columns that I am matching for duplicates, taking a count; I have also added the length of one of the columns, as this determines whether I delete all the records or just the duplicates.
SELECT
Ref_No,
End_Date,
Filename,
count(*) as cnt,
length(Ref_No)
FROM
master_table
GROUP BY
Ref_No,
End_Date,
Filename,
length(Ref_No)
HAVING
COUNT(*) > 1
;
This then gives me an output like the following:
Ref_No             | End_Date   | Filename | cnt | length(Ref_No)
05011384           | 2018-07-01 | File1    | 2   | 8
1234               | 2018-12-31 | File2    | 11  | 4
1000002975625      | 2018-12-31 | File3    |     | 13
123456789123456789 | 2019-02-06 | File3    |     | 18
Now I have a list of rules to follow based on the length column, and these determine whether I leave the records as they are (duplicates included), delete just the duplicates, or delete all the records; this is where I am stuck.
My rules are the following:
If length is between 0 and 4 - Keep all records with duplicates
If length is between 5 and 10 - Delete Duplicates, keep 1 record
If length equals 13 - Delete Duplicates, keep 1 record
If length is 11, 12, 14-30 - Delete all records
I would really appreciate it if someone could advise me on how to go about completing this task.
Thanks.
I have managed to create a temporary table in which I add the unique IDs. The only thing is that I am running the query twice, with the length condition changed to match my requirements.
INSERT INTO UniqueIDs
(
SELECT
T1.ID
FROM
master_table T1
LEFT JOIN
master_table T2
ON
(
T1.Ref_No = T2.Ref_No
AND
T1.End_Date = T2.End_Date
AND
T1.Filename = T2.Filename
AND
T1.ID > T2.ID
)
WHERE T2.ID IS NULL
AND
LENGTH(T1.Ref_No) BETWEEN 5 AND 10
)
;
I then just run the following delete to keep the unique ids in the table and remove the rest.
DELETE FROM master_table WHERE id NOT IN (SELECT ID FROM UniqueIDs);
That's it.
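For what it's worth, the four length rules could also be applied in one pass. A sketch using SQLite via Python's sqlite3, with the table layout simplified to the three matched columns plus an ID; unlike MySQL, SQLite allows the delete target inside the subquery, so no UniqueIDs staging table is needed there:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE master_table
                (ID INTEGER PRIMARY KEY, Ref_No TEXT, End_Date TEXT, Filename TEXT)""")
conn.executemany(
    "INSERT INTO master_table (Ref_No, End_Date, Filename) VALUES (?, ?, ?)",
    [("1234", "2018-12-31", "File2")] * 3             # length 4: keep all
    + [("05011384", "2018-07-01", "File1")] * 2       # length 8: keep one
    + [("1000002975625", "2018-12-31", "File3")] * 2  # length 13: keep one
    + [("12345678901", "2019-02-06", "File3")] * 2)   # length 11: delete all

# Rule: lengths 11, 12 and 14-30 lose every record
conn.execute("""DELETE FROM master_table
                WHERE LENGTH(Ref_No) IN (11, 12)
                   OR LENGTH(Ref_No) BETWEEN 14 AND 30""")

# Rule: lengths 5-10 and 13 keep only the lowest ID per duplicate group
# (lengths 0-4 are untouched, so no statement is needed for them)
conn.execute("""DELETE FROM master_table
                WHERE (LENGTH(Ref_No) BETWEEN 5 AND 10 OR LENGTH(Ref_No) = 13)
                  AND ID NOT IN (SELECT MIN(ID) FROM master_table
                                 GROUP BY Ref_No, End_Date, Filename)""")

counts = dict(conn.execute(
    "SELECT Ref_No, COUNT(*) FROM master_table GROUP BY Ref_No").fetchall())
```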
EDIT2: Solved. Thanks all for the fast replies, I appreciate your help. Special thanks to Jeremy Smyth for the working solution.
I'm fairly new to SQL and can't find a solution for an update query. I have the following table:
Table: order
id | cid | pid
1 | 1 | a1
2 | 1 | a2
3 | 2 | a2
4 | 2 | a3
5 | 2 | a4
I want the cid of 2 to become 1, BUT without updating rows that would end up with the same pid (i.e. id 2 & id 3).
The result I want is:
id | cid | pid
1 | 1 | a1
2 | 1 | a2
3 | 2 | a2
4 | '1' | a3
5 | '1' | a4
pseudo query example: UPDATE order SET cid=1 WHERE cid=2 AND 1.pid <> 2.pid;
EDIT1: So as not to confuse the pid values with cid and id, I changed them to start with 'a'. As suggested, I will not use order as a table name.
On update I simply don't want a duplicate pid for a cid.
Sorry for bad English.
I hope I understood you right:
UPDATE `order`
SET cid = 1
WHERE cid = 2
AND cid <> pid
What do you think?
Please notice: ORDER is a reserved word, read more.
I think you need something like this.
UPDATE order SET cid=1 WHERE cid=2 AND cid <> pid;
This can only be done in multiple steps (i.e. not a single UPDATE statement) in MySQL, because of the following points
Point 1: To get a list of rows that do not have the same pid as other rows, you would need to do a query before your update. For example:
SELECT id FROM `order`
WHERE pid NOT IN (
SELECT pid FROM `order`
GROUP BY pid
HAVING COUNT(*) > 1
)
That'll give you the list of IDs that don't share a pid with other rows. However we have to deal with Point 2, from http://dev.mysql.com/doc/refman/5.6/en/subquery-restrictions.html:
In general, you cannot modify a table and select from the same table in a subquery.
That means you can't use such a subquery in your UPDATE statement. You're going to have to use a staging table to store the pids and UPDATE based on that set.
For example, the following code creates a temporary table called badpids that contains all pids that appear multiple times in the orders table. Then, we execute the UPDATE, but only for rows that don't have a pid in the list of badpids:
CREATE TEMPORARY TABLE badpids (pid int);
INSERT INTO badpids
SELECT pid FROM `order`
GROUP BY pid
HAVING COUNT(*) > 1;
UPDATE `order` SET cid = 1
WHERE cid= 2
AND pid NOT IN (SELECT pid FROM badpids);
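The three statements can be checked end to end; here is a sketch using SQLite via Python's sqlite3 (SQLite happens to accept the backtick quoting as well):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE `order` (id INT, cid INT, pid TEXT)")
conn.executemany("INSERT INTO `order` VALUES (?, ?, ?)",
                 [(1, 1, "a1"), (2, 1, "a2"), (3, 2, "a2"),
                  (4, 2, "a3"), (5, 2, "a4")])

# stage the pids that appear more than once
conn.execute("CREATE TEMP TABLE badpids (pid TEXT)")
conn.execute("""INSERT INTO badpids
                SELECT pid FROM `order`
                GROUP BY pid HAVING COUNT(*) > 1""")

# move cid 2 to cid 1, skipping the shared pids
conn.execute("""UPDATE `order` SET cid = 1
                WHERE cid = 2
                  AND pid NOT IN (SELECT pid FROM badpids)""")

rows = conn.execute("SELECT id, cid, pid FROM `order` ORDER BY id").fetchall()
# rows 4 and 5 move to cid 1; row 3 keeps cid 2 because pid a2 is shared
```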
Good day,
I have a MySQL table which has some duplicate rows that have to be removed while adding a value from one column in the duplicated rows to the original.
The problem was caused when another column had the wrong values and that is now fixed but it left the balances split among different rows which have to be added together. The newer rows that were added must then be removed.
In this example, the userid column determines if they are duplicates (or triplicates). userid 6 is duplicated and userid 3 is triplicated.
As an example, for userid 3 the balances from rows 3, 11 and 13 have to be added up, the total put into row 3, and rows 11 and 13 removed. In general, the balance columns of the duplicates have to be added into the original, lower-ID row, and the newer, higher-ID rows must be removed.
ID | balance | userid
---------------------
1 | 10 | 1
2 | 15 | 2
3 | 300 | 3
4 | 80 | 4
5 | 0 | 5
6 | 65 | 6
7 | 178 | 7
8 | 201 | 8
9 | 92 | 9
10 | 0 | 10
11 | 140 | 3
12 | 46 | 6
13 | 30 | 3
I hope that is clear enough and that I have provided enough info. Thanks =)
Two steps.
1. Update:
UPDATE
tableX AS t
JOIN
( SELECT userid
, MIN(id) AS min_id
, SUM(balance) AS sum_balance
FROM tableX
GROUP BY userid
) AS c
ON t.userid = c.userid
SET
t.balance = CASE WHEN t.id = c.min_id
THEN c.sum_balance
ELSE 0
END ;
2. Remove the extra rows:
DELETE t
FROM
tableX AS t
JOIN
( SELECT userid
, MIN(id) AS min_id
FROM tableX
GROUP BY userid
) AS c
ON t.userid = c.userid
AND t.id > c.min_id
WHERE
t.balance = 0 ;
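To check both steps together, here is a sketch in SQLite via Python's sqlite3. SQLite lacks UPDATE ... JOIN, so the per-user totals are precomputed into a temp table, which also guarantees the sums are taken before any row is changed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableX (id INT, balance INT, userid INT)")
conn.executemany("INSERT INTO tableX VALUES (?, ?, ?)",
                 [(1, 10, 1), (2, 15, 2), (3, 300, 3), (4, 80, 4),
                  (5, 0, 5), (6, 65, 6), (7, 178, 7), (8, 201, 8),
                  (9, 92, 9), (10, 0, 10), (11, 140, 3), (12, 46, 6),
                  (13, 30, 3)])

# precompute each user's lowest id and total balance
conn.execute("""CREATE TEMP TABLE c AS
                SELECT userid, MIN(id) AS min_id, SUM(balance) AS sum_balance
                FROM tableX GROUP BY userid""")

# Step 1: move each user's total onto the lowest-id row, zero the rest
conn.execute("""UPDATE tableX
                SET balance = CASE
                    WHEN id = (SELECT min_id FROM c WHERE c.userid = tableX.userid)
                    THEN (SELECT sum_balance FROM c WHERE c.userid = tableX.userid)
                    ELSE 0 END""")

# Step 2: remove the zeroed duplicate rows
conn.execute("""DELETE FROM tableX
                WHERE balance = 0
                  AND id > (SELECT min_id FROM c WHERE c.userid = tableX.userid)""")

result = conn.execute(
    "SELECT id, balance, userid FROM tableX ORDER BY id").fetchall()
# users whose balance was genuinely 0 (ids 5 and 10) are untouched
```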
Once you have this solved, it would be good to add a UNIQUE constraint on userid as it seems you want to be storing the balance for each user here. That will avoid any duplicates in the future. You could also remove the (useless?) id column.
SELECT userid, SUM(balance)
FROM your_table
GROUP BY userid
It should work, but the comment saying to fix the table is really the best approach.
You can create a table with the same structure and transfer the data to it with this query
insert into newPriceTable (id, userid, balance)
select u.id, p.userid, sum(p.balance) as balance
from price p
join (
    select userid, min(id) as id from price group by userid
) u on p.userid = u.userid
group by p.userid
Play around the query here: http://sqlfiddle.com/#!2/4bb58/2
The work here is done in MSSQL, but you should be able to convert the syntax.
Using GROUP BY userid you can SUM() the balance, then join that back to your main table to update the balance across all the duplicates. Finally, you can use RANK() to order your duplicate userids and preserve only the earliest rows.
I'd select all this into a new table and, if it looks good, deprecate your old table and rename the new one.
http://sqlfiddle.com/#!3/068ee/2
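A rough translation of that idea, sketched in SQLite via Python's sqlite3, using ROW_NUMBER() rather than RANK() since the ids here cannot tie; tableX and tableX_clean are made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableX (id INT, balance INT, userid INT)")
conn.executemany("INSERT INTO tableX VALUES (?, ?, ?)",
                 [(3, 300, 3), (6, 65, 6), (11, 140, 3),
                  (12, 46, 6), (13, 30, 3)])

# sum the balance per userid with a window function, number the
# duplicates by id, and keep only the earliest row of each group
conn.execute("""CREATE TABLE tableX_clean AS
                SELECT id, userid, total AS balance
                FROM (SELECT id, userid,
                             SUM(balance) OVER (PARTITION BY userid) AS total,
                             ROW_NUMBER() OVER (PARTITION BY userid
                                                ORDER BY id) AS rn
                      FROM tableX)
                WHERE rn = 1""")

rows = conn.execute(
    "SELECT id, userid, balance FROM tableX_clean ORDER BY id").fetchall()
```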