remove duplicate rows based on one column value - mysql

I have the below table and now I need to delete the rows which are having duplicate "refIDs" but have atleast one row with that ref, i.e i need to remove row 4 and 5. please help me on this
+----+-------+--------+--+
| ID | refID | data | |
+----+-------+--------+--+
| 1 | 1023 | aaaaaa | |
| 2 | 1024 | bbbbbb | |
| 3 | 1025 | cccccc | |
| 4 | 1023 | ffffff | |
| 5 | 1023 | gggggg | |
| 6 | 1022 | rrrrrr | |
+----+-------+--------+--+

This is similar to Gordon Linoff's query, but without the subquery:
DELETE t1 FROM table t1
JOIN table t2
ON t2.refID = t1.refID
AND t2.ID < t1.ID
This uses an inner join to only delete rows where there is another row with the same refID but lower ID.
The benefit of avoiding a subquery is being able to utilize an index for the search. This query should perform well with a multi-column index on refID + ID.

I would do:
delete from t where
ID not in (select min(ID) from table t group by refID having count(*) > 1)
and refID in (select refID from table t group by refID having count(*) > 1)
criteria is refId is among the duplicates and ID is different from the min(id) from the duplicates. It would work better if refId is indexed
otherwise and provided you can issue multiple times the following query until it does not delete anything
delete from t
where
ID in (select max(ID) from table t group by refID having count(*) > 1)

Some another variant, in some cases a bit faster than Marcus and NJ73 answers:
DELETE ourTable
FROM ourTable JOIN
(SELECT ID,targetField
FROM ourTable
GROUP BY targetField HAVING COUNT(*) > 1) t2
ON ourTable.targetField = t2.targetField AND ourTable.ID != t2.ID;
Hope that will help someone. On big tables Marcus answer stalls.

In MySQL, you can do this with a join in delete:
delete t
from table t left join
(select min(id) as id
from table t
group by refId
) tokeep
on t.id = tokeep.id
where tokeep.id is null;
For each RefId, the subquery calculates the minimum of the id column (presumed to be unique over the whole table). It uses a left join for the match, so anything that doesn't match has a NULL value for tokeep.id. These are the ones that are deleted.

Related

Select only latest record for every employees and for specific employee in MySQL

I have a MySQL DB and in it there's a table with activity logs of employees.
+-------------------------------------------------+
| log_id | employee_id | date_time | action_type |
+-------------------------------------------------+
| 1 | 1 | 2015/02/03 | action1 |
| 2 | 2 | 2015/02/01 | action1 |
| 3 | 2 | 2017/01/02 | action2 |
| 4 | 3 | 2016/02/12 | action1 |
| 5 | 1 | 2016/10/12 | action2 |
+-------------------------------------------------+
And I would need 2 queries. First, to get for every employee his last action. So from this example table I would need to get row 3,4 and 5 with all columns. And second, get the latest action only for specified employee.
Any ideas how to achieve this? I'm using Spring Data JPA, but raw SQL Query would be also great.
Thank you in advance.
Ready for a fred ed...
SELECT x.*
FROM my_table x
JOIN
( SELECT employee_id
, MAX(date_time) date_time
FROM my_table
GROUP
BY employee_id
) y
ON y.employee_id = x.employee_id
AND y.date_time = x.date_time;
For your first query. Simply
SELECT t1.*
FROM tableName t1
WHERE t1.log_id = (SELECT MAX(t2.log_id)
FROM tableName t2
WHERE t2.employee_id = t1.employee_id)
For the second one
SELECT t1.*
FROM tableName t1
WHERE t1.employee_id=X and t1.log_id = (SELECT MAX(t2.log_id)
FROM tableName t2
WHERE t2.employee_id = t1.employee_id);
You can get the expected output by doing a self join
select a.*
from demo a
left join demo b on a.employee_id = b.employee_id
and a.date_time < b.date_time
where b.employee_id is null
Note it may return multiple rows for single employee if there are rows with same date_time you might need a CASE statement and another attribute to decide which row should be picked to handle this kind of situation
Demo

Mysql select with IN, limit rows to 1 for each match

lets say I have this table:
| id | record_id | date_updated |
|----|-----------|--------------|
| 1 | 1 | 19-03-2015 |
| 2 | 1 | 18-03-2015 |
| 3 | 1 | 17-03-2014 |
| 4 | 2 | 01-01-2015 |
| 5 | 2 | 05-02-2015 |
so the results I am looking for are :
| id | record_id | date_updated |
|----|-----------|--------------|
| 1 | 1 | 19-03-2015 |
| 4 | 2 | 01-01-2015 |
I have array with record ids.
$records = [1,2];
So I can do something like:
select * from `mytable`
WHERE `record_id` IN ($records)
AND mytable.date_update > 01-01-2014
AND mytable.date_updated < 12-12-2015
so mysql will select records wich match date_updated criteria ( and record id ofc ), which are more then 1 for each record ID, basically I want to make him limit the rows for each $record_id to 1
If it is even possible.
//it is super hard to explain the problem, the real case is that this is a sub query of another query, but the real example is 10 rows query and 100 columns table, so it will be even more hard to explain the situation and for someone to read it / udnerstands it. Hopefully someone will understand my problem, if not I will try to explain more.
Thanks
You can try using the group by clause
SELECT *
FROM `mytable`
WHERE id IN (
SELECT min(id)
FROM `mytable`
WHERE `record_id` IN ($records)
AND mytable.date_update > 01-01-2014
AND mytable.date_updated < 12-12-2015
group by record_id
);
There are many ways to get the record per group, and since you need only once you can easily do as below
select t1.* from table_name t1
where (
select count(*) from table_name t2
where t1.record_id = t2.record_id
) > =0
and
t1.date_updated > '2014-01-01' and date_updated < '2015-12-12'
group by t1.record_id ;
There are other way too using left join
select t1.* from table_name t1
left join table_name t2 on t1.record_id = t2.record_id
and t1.id >t2.id where t2.id is null
This will give you data with asc order with id
If you need data with max(id) for a record_id you can use
t1.id < t2.id
instead of
t1.id >t2.id
The same comparison you can do with first query.

SELECTING SUM based on conditional if from another table

Database: mysql > ver 5.0
table 1: type_id (int), type
table 2: name_id, name, is_same_as = table2.name_id or NULL
table 3: id, table2.name_id, table1.type_id, value (float)
I want to sum values, and count values in table 3 where table2.name_id are same and also include the values of id where is_same_is=name_id. I want to select all data in table3 for all values in table2.
Apologize if my question is not very clear, and if it has already been answered but I am unable to find a relevant answer. Or dont exactly know what to look for.
[data]. table1
id | type
=========
1 | test1
2 | test2
[data].table2
name_id | name | is_same_as
==============================
1 | tb_1 | NULL
2 | tb_2 | 1
3 | tb_3 | NULL
4 | tb_4 | 1
[data].table3
id | name_id | type_id | value
======================================
1 | 1 | 1 | 1.5
2 | 2 | 1 | 0.5
3 | 2 | 2 | 1.0
output:
name_id| type_id|SUM(value)
=======================================================
1 | 1 |2.0 < because in table2, is_same_as = 1
2 | 2 |1.0
I think the following does what you want:
select coalesce(t2.is_same_as, t2.name_id) as name_id, t3.type_id, sum(value)
from table_3 t3 join
table_2 t2
on t3.name_id = t2.name_id
group by coalesce(t2.is_same_as, t2.name_id), t3.type_id
order by 1, 2
It joins the table on name_id. However, it then uses the is_same_as column, if present, or the name_id if not, for summarizing the data.
This might be what you are looking for: (I haven't tested it in MySQL, so there may be a typo)
with combined_names_tab (name_id, name_id_ref) as
(
select name_id, name_id from table2
union select t2a.name_id, t2b.name_id
from table2 t2a
join table2 t2b
on (t2a.name_id = t2b.is_same_as)
)
select cnt.name_id, t3.type_id, sum(t3.value) sum_val
from combined_names_tab cnt
join table3 t3
on ( cnt.name_id_ref = t3.name_id )
group by cnt.name_id, t3.type_id
having sum(t3.value) / count(t3.value) >= 3
Here's what the query does:
First, it creates 'combined_names_tab' which is a join of all the table2 rows that you want to GROUP BY using the "is_same_as" column to make that determination. I make sure to include the "parent" row by doing a UNION.
Second, once you have those rows above, it's a simply join to table3 with a GROUP BY and a SUM.
Note: table1 was unnecessary (I believe).
Let me know if this works!
john...

How to `SELECT` and manufacture missing rows from previous values?

I have the following (simplified) result from SELECT * FROM table ORDER BY tick,refid:
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 3 3333
3 3 333333
Note the "missing" rows for refid 1 (tick 3) and refid 2 (ticks 2 and 3)
If possible, how can I make a query to add these missing rows using the most recent prior value for that refid? "Most recent" means the value for the row with the same refid as the missing row and largest tick such that the tick is less than the tick for the missing row. e.g.
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 2 22
2 3 3333
3 1 1111
3 2 22
3 3 333333
Additional conditions:
All refids will have values at tick=1.
There may be many 'missing' ticks for a refid in sequence, (as above for refid 2).
There are many refids and it's not known which will have sparse data where.
There will be many ticks beyond 3, but all sequential. In the correct result, each refid will have a result for each tick.
Missing rows are not known in advance - this will be run on multiple databases, all with the same structure, and different "missing" rows.
I'm using MySQL and cannot change db just now. Feel free to post answer in another dialect, to help discussion, but I'll select an answer in MySQL dialect over others.
Yes, I know this can be done in the code, which I've implemented. I'm just curious if it can be done with SQL.
What value should be returned when a given tick-refid combination does not exist? In this solution, I simply returned the lowest value for that given refid.
Revision
I've updated the logic to determine what value to use in the case of a null. It should be noted that I'm assuming that ticks+refid is unique in the table.
Select Ticks.tick
, Refs.refid
, Case
When Table.value Is Null
Then (
Select T2.value
From Table As T2
Where T2.refid = Refs.refId
And T2.tick = (
Select Max(T1.tick)
From Table As T1
Where T1.tick < Ticks.tick
And T1.refid = T2.refid
)
)
Else Table.value
End As value
From (
Select Distinct refid
From Table
) As Refs
Cross Join (
Select Distinct tick
From Table
) As Ticks
Left Join Table
On Table.tick = Ticks.tick
And Table.refid = Refs.refid
If you know in advance what your 'tick' and 'refid' values are,
Make a helper table that contains all possible tick and refid values.
Then left join from the helper table on tick and refid to your data table.
If you don't know exactly what your 'tick' and 'refid' values are, you maybe could still use this method, but instead of a static helper table, it would have to be dynamically generated.
The following has too many sub-selects for my taste, but it generates the desired result in MySQL, as long as every tick and every refid occurs separately at least once in the table.
Start with a query that generates every pair of tick and refid. The following uses the table to generate the pairs, so if any tick never appears in the underlying table, it will also be missing from the generated pairs. The same holds true for refids, though the restriction that "All refids will have values at tick=1" should ensure the latter never happens.
SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
Using this, generate every missing tick, refid pair, along with the largest tick that exists in the table by equijoining on refid and θ≥-joining on tick. Group by the generated tick, refid since only one row for each pair is desired. The key to filtering out existing tick, refid pairs is the HAVING clause. Strictly speaking, you can leave out the HAVING; the resulting query will return existing rows with their existing values.
SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
One final select from the above as a sub-select, joined to the original table to get the value for the given ctick, returns the new rows for the table.
INSERT INTO chadwick
SELECT missing.tick, missing.refid, c.value
FROM (SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
) AS missing
JOIN chadwick AS c ON missing.ctick = c.tick AND missing.refid=c.refid
;
Performance on the sample table, along with (tick, refid) and (refid, tick) indices:
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | c | ALL | tick_ref,ref_tick | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 9 | Using temporary; Using filesort |
| 2 | DERIVED | c | ref | tick_ref,ref_tick | ref_tick | 5 | tr.refid | 1 | Using where; Using index |
| 3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 3 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | Using join buffer |
| 5 | DERIVED | chadwick | index | NULL | tick_ref | 10 | NULL | 6 | Using index |
| 4 | DERIVED | chadwick | ref | tick_ref | tick_ref | 5 | | 2 | Using where; Using index |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
As I said, too many sub-selects. A temporary table may help matters.
To check for missing ticks:
SELECT clo.tick+1 AS missing_tick
FROM chadwick AS chi
RIGHT JOIN chadwick AS clo ON chi.tick = clo.tick+1
WHERE chi.tick IS NULL;
This will return at least one row with tick equal to 1 + the largest tick in the table. Thus, the largest value in this result can be ignored.
In order to have the list of pairs (tick, refid) to insert get a whole list:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
Now substract from that query the existing ones:
SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
Now you can join with t to obtain the final query (note that I use inner join + left join to obtain previous result but you could adapt):
INSERT INTO t(tick, refid, value)
SELECT c.tick, c.refid, t1.value
FROM ( SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
) c
INNER JOIN t t1 ON t1.refid = c.refid and t1.tick < c.tick
LEFT JOIN t t2 ON t2.refid = c.refid AND t1.tick < t2.tick AND t2.tick < c.tick
WHERE t2.tick IS NULL

Select the fields are duplicated in mysql

Assuming that I have the below customer_offer table.
My question is:
How to select all the rows where the key(s) are duplicated in that table?
+---------+-------------+------------+----------+--------+---------------------+
| link_id | customer_id | partner_id | offer_id | key | date_updated |
+---------+-------------+------------+----------+--------+---------------------+
| 1 | 99 | 11 | 14 | mmmmmq | 2011-09-21 12:40:46 |
| 2 | 100 | 11 | 14 | qmmmmq | 2011-09-21 12:40:46 |
| 3 | 101 | 11 | 14 | 8mmmmq | 2011-09-21 12:40:46 |
| 4 | 99 | 11 | 14 | Dmmmmq | 2011-09-21 12:59:28 |
| 5 | 100 | 11 | 14 | Nmmmmq | 2011-09-21 12:59:28 |
+---------+-------------+------------+----------+--------+---------------------+
UPDATE:
Thanks so much for all your answer. There are many answers are good. Now I got the solution to do.
select *
from customer_offer
where key in
(select key from customer_offer group by key having count(*) > 1)
Update:
As mentioned from #Scorpi0, if with a big table, it is better to use join. And from mysql6.0 the new optimizer will convert this kind of subqueries into joins.
Self join
SELECT * FROM customer_offer c1 inner join customer_offer c2
on c1.key = c2.key
or group by the field then take when count > 1
SELECT COUNT(key),link_id FROM customer_offer c1
group by key, link_id
having COUNT(Key) > 1
SELECT DISTINCT c1.*
FROM customer_offer c1
INNER JOIN customer_offer c2
ON c1.key = c2.key
AND c1.link_id != c2.link_id
Assuming link_id is a primary key.
Use a sub-query to do the count check, and the main query to select the rows. The count check query is simply:
SELECT `link_id` FROM `customer_offer` GROUP BY `key` HAVING COUNT(`key`) > 1
Then the outer query will use this by joining into it:
SELECT customer_offer.* FROM customer_offer
INNER JOIN (SELECT `link_id` FROM `customer_offer` GROUP BY `key` HAVING COUNT(`key`) > 1) AS count_check
ON customer_offer.link_id = count_check.link_id
There are many threads on the mysql website which explains how to do this. This link will explain how to do this using mysql: http://forums.mysql.com/read.php?10,180556,180567#msg-180567
As a brief example the code below is from the link with a slight modification which better suits your example.
SELECT *
FROM tbl
GROUP BY key
HAVING COUNT(key)>1;
You can also use a joing which is my prefered method, as this removes the slower count method:
SELECT *
FROM this_table t
inner join this_table t1 on t.key = t1.key
SELECT link_id, key, count(key) as Occurrences
FROM table
GROUP BY key
HAVING COUNT(key)>1;