How to delete duplicates in SQL table based on multiple fields

I have a table of games, which is described as follows:
+---------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| date | date | NO | | NULL | |
| time | time | NO | | NULL | |
| hometeam_id | int(11) | NO | MUL | NULL | |
| awayteam_id | int(11) | NO | MUL | NULL | |
| locationcity | varchar(30) | NO | | NULL | |
| locationstate | varchar(20) | NO | | NULL | |
+---------------+-------------+------+-----+---------+----------------+
But each game has a duplicate entry in the table somewhere, because each game was in the schedules for two teams. Is there a sql statement I can use to look through and delete all the duplicates based on identical date, time, hometeam_id, awayteam_id, locationcity, and locationstate fields?

You should be able to use a correlated subquery to delete the data. Find all rows that are duplicates and delete all but the one with the smallest id. For MySQL, an inner join (the functional equivalent of EXISTS) needs to be used, like so:
delete games from games inner join
(select min(id) minid, date, time,
hometeam_id, awayteam_id, locationcity, locationstate
from games
group by date, time, hometeam_id,
awayteam_id, locationcity, locationstate
having count(1) > 1) as duplicates
on (duplicates.date = games.date
and duplicates.time = games.time
and duplicates.hometeam_id = games.hometeam_id
and duplicates.awayteam_id = games.awayteam_id
and duplicates.locationcity = games.locationcity
and duplicates.locationstate = games.locationstate
and duplicates.minid <> games.id)
To test, replace delete games from games with select * from games. Don't just run a delete on your DB :-)
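The "preview first, then delete" advice can be tried on a throwaway database. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for MySQL. SQLite has no DELETE ... JOIN, so the same "keep the smallest id" rule is written with EXISTS. The sample rows and the helper names are made up for illustration.

```python
import sqlite3

# Columns that together identify "the same game" (taken from the question).
KEY_COLS = ["date", "time", "hometeam_id", "awayteam_id",
            "locationcity", "locationstate"]


def make_games_db():
    """Build an in-memory games table holding one duplicated game (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games (id INTEGER PRIMARY KEY, date TEXT, time TEXT, "
        "hometeam_id INTEGER, awayteam_id INTEGER, "
        "locationcity TEXT, locationstate TEXT)"
    )
    rows = [
        ("2019-03-05", "19:00", 1, 2, "Austin", "TX"),
        ("2019-03-05", "19:00", 1, 2, "Austin", "TX"),  # duplicate of id 1
        ("2019-03-06", "18:00", 3, 4, "Boise", "ID"),
    ]
    conn.executemany(
        "INSERT INTO games (date, time, hometeam_id, awayteam_id, "
        "locationcity, locationstate) VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn


def duplicate_condition():
    """WHERE clause: another row has the same key columns and a smaller id."""
    same_key = " AND ".join(f"keep.{c} = games.{c}" for c in KEY_COLS)
    return (f"EXISTS (SELECT 1 FROM games AS keep "
            f"WHERE {same_key} AND keep.id < games.id)")


def preview_duplicates(conn):
    """Dry run: ids of the rows the DELETE would remove."""
    sql = f"SELECT id FROM games WHERE {duplicate_condition()} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


def delete_duplicates(conn):
    """Delete every row that has an identical row with a smaller id."""
    cur = conn.execute(f"DELETE FROM games WHERE {duplicate_condition()}")
    conn.commit()
    return cur.rowcount
```

Calling preview_duplicates before delete_duplicates shows exactly which ids are about to go, using the same condition for both statements.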

You can try a query like this:
DELETE FROM table_name AS t1
WHERE EXISTS (
SELECT 1 FROM table_name AS t2
WHERE t2.date = t1.date
AND t2.time = t1.time
AND t2.hometeam_id = t1.hometeam_id
AND t2.awayteam_id = t1.awayteam_id
AND t2.locationcity = t1.locationcity
AND t2.locationstate = t1.locationstate
AND t2.id < t1.id )
This leaves only one row for each game in the database, the one with the smallest id. Note that MySQL does not allow a subquery to select from the table being deleted from, so on MySQL this has to be rewritten as a join.

The best thing that worked for me was to recreate the table.
CREATE TABLE newtable SELECT * FROM oldtable GROUP BY field1,field2;
You can then rename.

To get a list of duplicate entries matching two fields:
select t.ID, t.field1, t.field2
from (
select field1, field2
from table_name
group by field1, field2
having count(*) > 1) x, table_name t
where x.field1 = t.field1 and x.field2 = t.field2
order by t.field1, t.field2
And to delete only the duplicates (keeping the row with the smallest id in each group):
DELETE x
FROM table_name x
JOIN table_name y
ON y.field1= x.field1
AND y.field2 = x.field2
AND y.id < x.id;

select orig.id,
dupl.id
from games orig,
games dupl
where orig.date = dupl.date
and orig.time = dupl.time
and orig.hometeam_id = dupl.hometeam_id
and orig.awayteam_id = dupl.awayteam_id
and orig.locationcity = dupl.locationcity
and orig.locationstate = dupl.locationstate
and orig.id < dupl.id
This should give you the duplicates; you can use it as a subquery to specify the IDs to delete.

As long as you do not select the id (primary key) of the table and the other columns are exactly the same, you can use SELECT DISTINCT to avoid getting duplicate results.

delete from games
where id not in
(select max(id) from games
group by date, time, hometeam_id, awayteam_id, locationcity, locationstate
);
Workaround (MySQL does not allow a subquery to select from the table being deleted from):
create temporary table temp_table as
select max(id) id from games
group by date, time, hometeam_id, awayteam_id, locationcity, locationstate;
delete from games where id not in (select id from temp_table);

DELETE FROM table
WHERE id IN
(SELECT t.id
FROM table AS t
JOIN table AS tj ON (t.date = tj.date
AND t.hometeam_id = tj.hometeam_id
AND t.awayteam_id = tj.awayteam_id
...))

DELETE FROM tbl
USING tbl, tbl t2
WHERE tbl.id > t2.id
AND t2.field = tbl.field;
in your case:
DELETE FROM tbl
USING games tbl, games t2
WHERE tbl.id > t2.id
AND t2.date = tbl.date
AND t2.time = tbl.time
AND t2.hometeam_id = tbl.hometeam_id
AND t2.awayteam_id = tbl.awayteam_id
AND t2.locationcity = tbl.locationcity
AND t2.locationstate = tbl.locationstate;
reference: https://dev.mysql.com/doc/refman/5.7/en/delete.html

Related

Problems using SQL ALL operator

I'm having trouble using/understanding the SQL ALL operator. I have a table FOLDER_PERMISSION with the following columns:
+----+-----------+---------+----------+
| ID | FOLDER_ID | USER_ID | CAN_READ |
+----+-----------+---------+----------+
| 1 | 34353 | 45453 | 0 |
| 2 | 46374 | 342532 | 1 |
| 3 | 46374 | 32352 | 1 |
+----+-----------+---------+----------+
I want to select the folders where all the users have permission to read, how could I do it?

Use aggregation and having:
select folder_id
from t
group by folder_id
having min(can_read) = 1;
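
To see why MIN works here, note that CAN_READ only holds 0 or 1, so the minimum within a folder is 1 only when no row for that folder is 0. A minimal sketch that runs the same query against the question's sample rows, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL and a hypothetical helper name:

```python
import sqlite3

# Sample rows from the question: (ID, FOLDER_ID, USER_ID, CAN_READ).
SAMPLE = [
    (1, 34353, 45453, 0),
    (2, 46374, 342532, 1),
    (3, 46374, 32352, 1),
]


def folders_readable_by_all(rows):
    """Return folder ids for which every listed user has CAN_READ = 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE folder_permission (id INTEGER PRIMARY KEY, "
        "folder_id INTEGER, user_id INTEGER, can_read INTEGER)"
    )
    conn.executemany("INSERT INTO folder_permission VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT folder_id FROM folder_permission "
        "GROUP BY folder_id HAVING MIN(can_read) = 1 "
        "ORDER BY folder_id"
    ).fetchall()
    conn.close()
    return [row[0] for row in result]
```

With the sample rows only folder 46374 qualifies; adding a single CAN_READ = 0 row for that folder removes it from the result.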

Gordon's answer seems better but for the sake of completeness, using ALL a query could look like:
SELECT x1.folder_id
FROM (SELECT DISTINCT
fp1.folder_id
FROM folder_permission fp1) x1
WHERE 1 = ALL (SELECT fp2.can_read
FROM folder_permission fp2
WHERE fp2.folder_id = x1.folder_id);
If you have a table for the folders themselves replace the derived table (aliased x1) with it.
But this only respects users present in folder_permissions. If not all users have a reference in that table you possibly won't get the folders really all users can read.

You can do aggregation :
SELECT fp.FOLDER_ID
FROM folder_permission fp
GROUP BY fp.FOLDER_ID
HAVING SUM( can_read = 0 ) = 0;
You can also express it :
SELECT fp.FOLDER_ID
FROM folder_permission fp
GROUP BY fp.FOLDER_ID
HAVING MIN(CAN_READ) = MAX(CAN_READ) AND MIN(CAN_READ) = 1;

If you wanted to return the full matching records, you could try using some exists logic:
SELECT ID, FOLDER_ID, USER_ID, CAN_READ
FROM yourTable t1
WHERE NOT EXISTS (SELECT 1 FROM yourTable t2
WHERE t2.FOLDER_ID = t1.FOLDER_ID AND t2.CAN_READ = 0);
The existence of a matching record in the above exists subquery would imply that there exist one or more users for that folder who do not have read access rights.

Create trigger for several rows

I have two tables, users and orders. After every UPDATE of a row in orders, I want to update DATA in the users table, namely to concat(OLD.DATA, ID of the row that was updated).
Table 'users'.
ID NAME DATA
1 John 1|2
2 Michael 3|4
3 Someone 5
Table 'orders'.
ID USER CONTENT
1 1 ---
2 1 ---
3 2 ---
4 2 ---
5 3 ---
For example:
SELECT `data` from `users` where `id` = 2; // Result: 3|4
UPDATE `orders` SET '...' WHERE `id` > 0;
**NEXT LOOP**
UPDATE `users` SET `data` = concat(OLD.data, ID.rowUpdated) WHERE `user` = 1;
UPDATE `users` SET `data` = concat(OLD.data, ID.rowUpdated) WHERE `user` = 1;
UPDATE `users` SET `data` = concat(OLD.data, ID.rowUpdated) WHERE `user` = 2;
UPDATE `users` SET `data` = concat(OLD.data, ID.rowUpdated) WHERE `user` = 2;
UPDATE `users` SET `data` = concat(OLD.data, ID.rowUpdated) WHERE `user` = 3;
Result:
SELECT data from users where id = 1; // Result: 1|2|1|2
SELECT data from users where id = 2; // Result: 3|4|3|4
SELECT data from users where id = 3; // Result: 5|5
How can I do it?

I think you are making the same mistake I made not too long ago, ie storing an array/object in a column.
I would recommend using the following tables in your scenario:
users
+-----------+-----------+
| id | user_name |
+-----------+-----------+
| 1 | John |
+-----------+-----------+
| 2 | Michael |
+-----------+-----------+
orders
+-----------+-----------+------------+
| id | user_id |date_ordered|
+-----------+-----------+------------+
| 1 | 1 | 2019-03-05 |
+-----------+-----------+------------+
| 2 | 2 | 2019-03-05 |
+-----------+-----------+------------+
Where user_id is the foreign key to users
sales
+-----------+-----------+------------+------------+------------+
| id | order_id | item_sku | qty | price |
+-----------+-----------+------------+------------+------------+
| 1 | 1 | 1001 | 1 | 2.50 |
+-----------+-----------+------------+------------+------------+
| 2 | 1 | 1002 | 2 | 3.00 |
+-----------+-----------+------------+------------+------------+
| 3 | 2 | 1001 | 2 | 2.00 |
+-----------+-----------+------------+------------+------------+
where order_id is the foreign key to orders
Now for the confusing part. You will need to use a series of JOINs to access the relevant data for each user.
SELECT
t3.id AS user_id,
t3.user_name,
t1.id AS order_id,
t1.date_ordered,
SUM((t2.price * t2.qty)) AS order_total
FROM orders t1
JOIN sales t2 ON (t2.order_id = t1.id)
LEFT JOIN users t3 ON (t1.user_id = t3.id)
WHERE user_id=1
GROUP BY order_id;
This will return:
+-----------+--------------+------------+------------+--------------+
| user_id | user_name | order_id |date_ordered| order_total |
+-----------+--------------+------------+------------+--------------+
| 1 | John | 1 | 2019-03-05 | 8.50 |
+-----------+--------------+------------+------------+--------------+
These types of JOIN statements should come up in basically any project using a relational database (that is, if you are designing your DB correctly). Typically I create a view for each of these complicated queries, which can then be accessed with a simple SELECT * FROM orders_view.
For example:
CREATE
ALGORITHM = UNDEFINED
DEFINER = `root`@`localhost`
SQL SECURITY DEFINER
VIEW orders_view AS (
SELECT
t3.id AS user_id,
t3.user_name,
t1.id AS order_id,
t1.date_ordered,
SUM((t2.price * t2.qty)) AS order_total
FROM orders t1
JOIN sales t2 ON (t2.order_id = t1.id)
LEFT JOIN users t3 ON (t1.user_id = t3.id)
GROUP BY order_id
)
This can then be accessed by:
SELECT * FROM orders_view WHERE user_id=1;
Which would return the same results as the query above.
Depending on your needs, you will probably need to add a few more tables (addresses, products etc.) and several more rows to each of these tables. Very often you will find that you need to JOIN 5+ tables into a view, and sometimes you might need to JOIN the same table twice.
I hope this helps despite it not exactly answering your question!

It is probably a bad idea to update the USERS table after inserting into (or updating) the ORDERS table. Avoid storing data twice. In your case: you can always get all "order ids" for a user by querying the ORDERS table. Thus, you don't need to store them in the USERS table (again). Example (tested with MySQL 8.0, see dbfiddle):
Tables and data
create table users( id integer primary key, name varchar(30) ) ;
insert into users( id, name ) values
(1, 'John'),(2, 'Michael'),(3, 'Someone') ;
create table orders(
id integer primary key
, userid integer references users (id)
, content varchar(3)
);
insert into orders ( id, userid, content ) values
(101, 1, '---'),(102, 1, '---')
,(103, 2, '---'),(104, 2, '---'),(105, 3, '---') ;
Maybe a VIEW - similar to the one below - will do the trick. (Advantage: you don't need additional columns or tables.)
-- View
-- Inner SELECT: group order ids per user (table ORDERS).
-- Outer SELECT: fetch the user name (table USERS)
create or replace view userorders (
userid, username, userdata
)
as
select
U.id, U.name, O.orders_
from (
select
userid
, group_concat( id order by id separator '|' ) as orders_
from orders
group by userid
) O join users U on O.userid = U.id ;
Once the view is in place, you can just SELECT from it, and you will always get the current "userdata" eg
select * from userorders ;
-- result
userid username userdata
1 John 101|102
2 Michael 103|104
3 Someone 105
-- add some more orders
insert into orders ( id, userid, content ) values
(1000, 1, '***'),(4000, 1, '***'),(7000, 1, '***')
,(2000, 2, ':::'),(5000, 2, ':::'),(8000, 2, ':::')
,(3000, 3, '###'),(6000, 3, '###'),(9000, 3, '###') ;
select * from userorders ;
-- result
userid username userdata
1 John 101|102|1000|4000|7000
2 Michael 103|104|2000|5000|8000
3 Someone 105|3000|6000|9000
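
For comparison, if the DATA column has to stay as it is, a row-level trigger behaves the way the question describes: it fires once for every updated row, so one UPDATE touching five orders appends five ids. This is only a sketch, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL. The trigger name is made up, and in MySQL the body would use CONCAT(data, '|', NEW.id) instead of the || operator.

```python
import sqlite3


def build_db():
    """Create the question's users/orders tables plus an AFTER UPDATE trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, data TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user INTEGER, content TEXT);
        INSERT INTO users VALUES
            (1, 'John', '1|2'), (2, 'Michael', '3|4'), (3, 'Someone', '5');
        INSERT INTO orders VALUES
            (1, 1, '---'), (2, 1, '---'), (3, 2, '---'),
            (4, 2, '---'), (5, 3, '---');
        -- Hypothetical trigger name; runs once per updated orders row.
        CREATE TRIGGER orders_after_update AFTER UPDATE ON orders
        FOR EACH ROW
        BEGIN
            UPDATE users SET data = data || '|' || NEW.id WHERE id = NEW.user;
        END;
    """)
    return conn


def update_all_orders(conn):
    """One statement that updates every order; the trigger fires per row."""
    conn.execute("UPDATE orders SET content = '...' WHERE id > 0")
    conn.commit()


def user_data(conn):
    """Map user id to the current contents of the data column."""
    return dict(conn.execute("SELECT id, data FROM users"))
```

The order in which ids are appended follows the order in which the engine updates the rows, which is one more reason the answers above prefer deriving the list from orders at read time.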

Writing more better SQL

I've got a query here that's painfully slow. Part of the problem may be that tableA in the sub-query has a quite substantial size in comparison to the other tables.
TABLES STRUCTURE
*-------------------*------------------*-------------------*
| ID_TABLE | DATA_TABLE | DATA_TABLE_EXT |
*-------------------*------------------*-------------------*
| id n<|>1 id 1<|>n owner_id |
| foreign_id | owner_id | information |
| foreign_id_source | date_field | ... |
| ... | ... | |
*-------------------*------------------*-------------------*
QUERY
SELECT ID_TABLE.foreign_id_source, count(ID_TABLE.id) as count
FROM DATA_TABLE
LEFT JOIN ID_TABLE ON DATA_TABLE.id = ID_TABLE.id
WHERE DATA_TABLE.owner_id = 'some_id'
AND DATA_TABLE.date_field > 'some_date'
AND DATA_TABLE.id IN (
SELECT DATA_TABLE_EXT.owner_id FROM DATA_TABLE_EXT
JOIN DATA_TABLE ON DATA_TABLE_EXT.owner_id = DATA_TABLE.id
WHERE DATA_TABLE.owner_id = 'some_id'
GROUP BY DATA_TABLE.id
HAVING SUM(ABS(DATA_TABLE_EXT.information)) <> 0
)
GROUP BY ID_TABLE.foreign_id_source
ORDER BY count ASC
REQUIRED RESULT
*-------------------*-------------*
| foreign_id_source | count |
*-------------------*-------------*
| source1 | 45 |
| source2 | 10 |
| ... | |
*-------------------*-------------*
Each id in DATA_TABLE may have multiple records in ID_TABLE.
Many records in DATA_TABLE may have the same owner_id.
I'm looking for the number of records in DATA_TABLE with a foreign_id_source, grouped by that foreign_id_source, where the record is after 'some_date' and its DATA_TABLE_EXT records do not all have a value of 0 in the information field.
Short of creating indexes or other database manipulation is there a way to improve this query in terms of performance?
Any other suggestions are also welcome.

The point is: SUM(ABS(DATA_TABLE_EXT.information)) <> 0 can only be true if at least one DATA_TABLE_EXT.information is non-zero. So we don't have to sum() them; we only need to check whether a non-zero one exists.
[ I don't know if MySQL is smart enough to optimize the EXISTS(), but in theory it is cheaper, and can be faster ]
SELECT it.foreign_id_source, count(it.id) as count
FROM DATA_TABLE dt
LEFT JOIN ID_TABLE it ON dt.id = it.id
WHERE dt.owner_id = 'some_id'
AND dt.date_field > 'some_date'
AND EXISTS (
SELECT *
FROM DATA_TABLE_EXT x
JOIN DATA_TABLE dt2 ON x.owner_id = dt2.id
WHERE x.owner_id = dt.id
AND dt2.owner_id = 'some_id'
AND x.information <> 0
)
GROUP BY it.foreign_id_source
ORDER BY count ASC
;
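
The equivalence this answer relies on can be checked on a small data set. A minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL, a cut-down DATA_TABLE_EXT with only the two relevant columns, made-up rows, and no NULLs in information:

```python
import sqlite3


def owners_with_nonzero(rows):
    """Return (owners found via SUM(ABS(..)), owners found via EXISTS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_table_ext (owner_id INTEGER, information INTEGER)"
    )
    conn.executemany("INSERT INTO data_table_ext VALUES (?, ?)", rows)
    # Original formulation: aggregate every row of each owner.
    by_sum = conn.execute(
        "SELECT owner_id FROM data_table_ext "
        "GROUP BY owner_id HAVING SUM(ABS(information)) <> 0 "
        "ORDER BY owner_id"
    ).fetchall()
    # Rewritten formulation: one non-zero row is enough.
    by_exists = conn.execute(
        "SELECT DISTINCT o.owner_id FROM data_table_ext AS o "
        "WHERE EXISTS (SELECT 1 FROM data_table_ext AS x "
        "              WHERE x.owner_id = o.owner_id AND x.information <> 0) "
        "ORDER BY o.owner_id"
    ).fetchall()
    conn.close()
    return [r[0] for r in by_sum], [r[0] for r in by_exists]


# Made-up rows: owner 1 is all zeros, owner 2 has values that cancel out
# under a plain SUM (which is why the original uses ABS), owner 3 has one hit.
ROWS = [(1, 0), (1, 0), (2, 5), (2, -5), (3, 0), (3, 7)]
```

Both formulations pick the same owners here; whether EXISTS is actually faster depends on the optimizer and the available indexes, so it is worth measuring on real data.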

Often moving the subquery to the FROM will help:
SELECT ID_TABLE.foreign_id_source, count(DATA_TABLE.id) as count
FROM ID_TABLE LEFT JOIN
DATA_TABLE
ON DATA_TABLE.id = ID_TABLE.id JOIN
(SELECT DATA_TABLE.id
FROM DATA_TABLE_EXT JOIN
DATA_TABLE
ON DATA_TABLE_EXT.owner_id = DATA_TABLE.id
WHERE DATA_TABLE.owner_id = 'some_value'
GROUP BY DATA_TABLE.id
HAVING SUM(ABS(DATA_TABLE_EXT.information)) <> 0
) xx
ON DATA_TABLE.id = xx.id
WHERE DATA_TABLE.owner_id = 'some_value' AND
DATA_TABLE.date_field > 'some_date'
GROUP BY ID_TABLE.foreign_id_source
ORDER BY count ASC;
Then, you can think about indexes. These would be tableX(field2, fieldZ, field1, fieldX), tableI(field1), tableX(field2, field1, fieldB), and tableA(field1).

Find longest range of free ids in mysql, using a query?

This can certainly be done using a simple script and reading the database.
I am interested to know if the same is possible using some MySQL query.
Table Schema :
+--------+-------------------------------+
| doc_id | doc_title |
+--------+-------------------------------+
| 40692 | hello |
| 13873 | isaac |
| 37739 | einstei |
| 36042 | cricket |
| 96249 | astronaut |
| 81931 | discovery |
| 28447 | scooby |
| 99632 | popeye |
+--------+-------------------------------+
Here doc_id is a random number between 1 and 99999, and the distribution is sparse. I want to know the longest (or all of the longest) unused number ranges in my MySQL table.
That is, if 71000 to 83000 is the longest such range, there will be no record with a doc_id lying between these two values.

create table documents ( id int, name varchar(255) );
insert into documents (id, name ) values
( 40692,'hello'),
( 13873,'isaac'),
( 37739,'einstei'),
( 36042,'cricket'),
( 96249,'astronaut'),
( 81931,'discovery'),
( 28447,'scooby'),
( 99632,'popeye');
select
d1.id,
min( d2.id ),
min( d2.id ) - d1.id as 'gap'
from
documents d1
join documents d2 on d2.id > d1.id
group by
d1.id
order by
3 desc;

Try this query -
SELECT start_doc_id, doc_id end_doc_id, delta FROM (
SELECT
doc_id,
@d start_doc_id,
IF(@d IS NULL, 0, doc_id - @d) delta,
@d:=doc_id
FROM
doc, (SELECT @d:= NULL) t
ORDER BY doc_id
) doc
ORDER BY delta DESC

How about something like this
SELECT t1.doc_id,
MAX(t1.doc_id-IFNULL(t2.doc_id,0)) AS difference
FROM `table` t1
LEFT JOIN `table` t2 ON t1.doc_id>t2.doc_id
LEFT JOIN `table` t3 ON (t1.doc_id>t3.doc_id AND t3.doc_id>t2.doc_id)
WHERE t3.doc_id IS NULL
GROUP BY t1.doc_id
ORDER BY difference DESC

I'd look for a different solution from Wolfgang's if your documents table gets big (hundreds of thousands or millions of records) because a join with "on d2.id > d1.id" is going to create a huge temporary table and take lots of time.
If you create the temporary table, you can do it quickly:
create table tdoc (id int, nextid int);
insert into tdoc (id) (select id from documents);
update tdoc join documents as d1
on d1.id = tdoc.id
set nextid =
(select id from documents
where id > tdoc.id order by id limit 1);
select id, nextid, nextid - id as gap from tdoc order by gap desc limit 1;
There's no self-join and no heavy crunching; this solution scales to a large documents table if necessary
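
The "next id" idea used in these answers can be run end to end on the sample data from the question. A minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL, a hypothetical helper name, and a correlated MIN subquery in place of the temporary table:

```python
import sqlite3

# doc_id values from the question's sample table.
SAMPLE_IDS = [40692, 13873, 37739, 36042, 96249, 81931, 28447, 99632]


def longest_gap(ids):
    """Return (id, next_id, gap) for the widest gap between neighbouring ids.

    Only gaps between existing ids are considered, as in the answers above;
    the ranges before the first id and after the last id are ignored.
    Returns None when there are fewer than two ids.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO documents VALUES (?)", [(i,) for i in ids])
    row = conn.execute(
        "SELECT id, next_id, next_id - id AS gap FROM ("
        "  SELECT d1.id AS id,"
        "         (SELECT MIN(d2.id) FROM documents AS d2"
        "          WHERE d2.id > d1.id) AS next_id"
        "  FROM documents AS d1"
        ") AS pairs "
        "WHERE next_id IS NOT NULL "
        "ORDER BY gap DESC LIMIT 1"
    ).fetchone()
    conn.close()
    return row
```

For the sample ids the widest gap lies between 40692 and 81931, so the free range is 40693 to 81930. If several gaps tie for the maximum, LIMIT 1 returns only one of them.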

Mysql unique values query

I have a table with name-value pairs and an additional attribute. The same name can have more than one value. If that happens, I want to return the row which has the higher attribute value.
Table:
ID | name | value | attribute
1 | set1 | 1 | 0
2 | set2 | 2 | 0
3 | set3 | 3 | 0
4 | set1 | 4 | 1
Desired results of query:
name | value
set2 | 2
set3 | 3
set1 | 4
What is the best performing sql query to get the desired results?

the best performing query would be as follows:
select
s.set_id,
s.name as set_name,
a.attrib_id,
a.name as attrib_name,
sav.value
from
sets s
inner join set_attribute_values sav on
sav.set_id = s.set_id and sav.attrib_id = s.max_attrib_id
inner join attributes a on sav.attrib_id = a.attrib_id
order by
s.set_id;
+--------+----------+-----------+-------------+-------+
| set_id | set_name | attrib_id | attrib_name | value |
+--------+----------+-----------+-------------+-------+
| 1 | set1 | 3 | attrib3 | 20 |
| 2 | set2 | 0 | attrib0 | 10 |
| 3 | set3 | 0 | attrib0 | 10 |
| 4 | set4 | 4 | attrib4 | 10 |
| 5 | set5 | 2 | attrib2 | 10 |
+--------+----------+-----------+-------------+-------+
obviously for this to work you're gonna also have to normalise your design and implement a simple trigger:
drop table if exists attributes;
create table attributes
(
attrib_id smallint unsigned not null primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists sets;
create table sets
(
set_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null,
max_attrib_id smallint unsigned not null default 0,
key (max_attrib_id)
)
engine=innodb;
drop table if exists set_attribute_values;
create table set_attribute_values
(
set_id smallint unsigned not null,
attrib_id smallint unsigned not null,
value int unsigned not null default 0,
primary key (set_id, attrib_id)
)
engine=innodb;
delimiter #
create trigger set_attribute_values_before_ins_trig
before insert on set_attribute_values
for each row
begin
update sets set max_attrib_id = new.attrib_id
where set_id = new.set_id and max_attrib_id < new.attrib_id;
end#
delimiter ;
insert into attributes values (0,'attrib0'),(1,'attrib1'),(2,'attrib2'),(3,'attrib3'),(4,'attrib4');
insert into sets (name) values ('set1'),('set2'),('set3'),('set4'),('set5');
insert into set_attribute_values values
(1,0,10),(1,3,20),(1,1,30),
(2,0,10),
(3,0,10),
(4,4,10),(4,2,20),
(5,2,10);

This solution will probably perform the best:
Select ...
From Table As T
Left Join Table As T2
On T2.name = T.name
And T2.attribute > T.attribute
Where T2.ID Is Null
Another solution which may not perform as well (you would need to evaluate against your data):
Select ...
From Table As T
Where Not Exists (
Select 1
From Table As T2
Where T2.name = T.name
And T2.attribute > T.attribute
)

select name,max(value)
from table
group by name

SELECT name, value
FROM (SELECT name, value, attribute
FROM table_name
ORDER BY attribute DESC) AS t
GROUP BY name;

There is no easy way to do this.
A similar question was asked here.
Edit: Here's a suggestion:
SELECT `name`,`value` FROM `mytable` ORDER BY `name`,`attribute` DESC
This isn't quite what you asked for, but it'll at least give you the higher attribute values first, and you can ignore the rest.
Edit again: Another suggestion:
If you know that value is a positive integer, you can do this. It's yucky, but it'll work.
SELECT `name`,CAST(GROUP_CONCAT(`value` ORDER by `attribute` DESC) as UNSIGNED) FROM `mytable` GROUP BY `name`
To include negative integers you could change UNSIGNED to SIGNED.

Might want to benchmark all these options, here's another one.
SELECT t1.name, t1.value
FROM temp t1
WHERE t1.attribute IN (
SELECT MAX(t2.attribute)
FROM temp t2
WHERE t2.name = t1.name);

How about:
SELECT ID, name, value, attribute
FROM table A
WHERE A.attribute = (SELECT MAX(B.attribute) FROM table B WHERE B.NAME = A.NAME);
Edit: Seems like someone has said the same already.

Did not benchmark them, but here is how it is doable:
TableName = temm
1) Row with maximum value of attribute :
select t.name, t.value
from (
select name, max(attribute) as maxattr
from temm group by name
) as x inner join temm as t on t.name = x.name and t.attribute = x.maxattr;
2) Top N rows with maximum attribute value :
select name, value
from temm
where (
select count(*) from temm as n
where n.name = temm.name and n.attribute > temm.attribute
) < 1 ; /* 1 can be changed to 2,3,4 ..N to get N rows */
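
Query 1) can be checked against the table from the question. A minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL and a hypothetical helper name; the table name temm is the one used in this answer:

```python
import sqlite3

# Rows from the question: (ID, name, value, attribute).
SAMPLE = [
    (1, "set1", 1, 0),
    (2, "set2", 2, 0),
    (3, "set3", 3, 0),
    (4, "set1", 4, 1),
]


def value_with_highest_attribute(rows):
    """For each name, return the value of the row with the highest attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE temm (id INTEGER PRIMARY KEY, name TEXT, "
        "value INTEGER, attribute INTEGER)"
    )
    conn.executemany("INSERT INTO temm VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT t.name, t.value "
        "FROM (SELECT name, MAX(attribute) AS maxattr "
        "      FROM temm GROUP BY name) AS x "
        "JOIN temm AS t ON t.name = x.name AND t.attribute = x.maxattr "
        "ORDER BY t.name"
    ).fetchall()
    conn.close()
    return result
```

For the sample rows this yields set1 with value 4, set2 with value 2 and set3 with value 3. If two rows of the same name share the highest attribute, both rows are returned.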
