MySQL query to delete duplicates - mysql

I have duplicate results like the ones below, where some columns may or may not have data:
| contact_info | icon | id | title | lastmodified_by |
+--------------+------+-----+---------------+------------------+
| 169 | 305 | 123 | Whakarewarewa | 2011100400305262 |
| NULL | NULL | 850 | Whakarewarewa | NULL |
+--------------+------+-----+---------------+------------------+
| contact_info | icon | id | title | lastmodified_by |
+--------------+------+-----+---------------+------------------+
| NULL | NULL | 123 | Paris | NULL |
| NULL | NULL | 850 | Paris | NULL |
+--------------+------+-----+---------------+------------------+
I want to delete the record that has less data; if all the field values are exactly the same, then delete either row.
There are thousands of records like this.

Try this two-step solution:
Run this query to view all duplicates (the records having less data):
SELECT t1.* FROM table t1
JOIN (
SELECT
title,
MIN(IF(contact_info IS NULL, 0, 1) + IF(icon IS NULL, 0, 1) + IF(lastmodified_by IS NULL, 0, 1)) min_value_data,
MAX(IF(contact_info IS NULL, 0, 1) + IF(icon IS NULL, 0, 1) + IF(lastmodified_by IS NULL, 0, 1)) max_value_data
FROM table GROUP BY title HAVING min_value_data <> max_value_data
) t2
ON t1.title = t2.title AND IF(t1.contact_info IS NULL, 0, 1) + IF(t1.icon IS NULL, 0, 1) + IF(t1.lastmodified_by IS NULL, 0, 1) <> t2.max_value_data
Rewrite it as a DELETE statement and execute it.
Then run this query to remove all duplicates except min ID:
DELETE t1 FROM table t1
JOIN (SELECT MIN(id) id, title FROM table GROUP BY title) t2
ON t1.id <> t2.id AND t1.title = t2.title;
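The two steps above can be tried end to end with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, a hypothetical table name `places`, and distinct ids for the Paris rows (the sample in the question repeats 123 and 850). SQLite lets a DELETE read from the table it is deleting from, so the joins are written as subqueries here; on MySQL the multi-table DELETE form shown above is the one to use.

```python
import sqlite3

def filled(alias):
    """SQL expression counting the non-NULL optional columns of one row."""
    return (f"(({alias}.contact_info IS NOT NULL)"
            f" + ({alias}.icon IS NOT NULL)"
            f" + ({alias}.lastmodified_by IS NOT NULL))")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (contact_info INTEGER, icon INTEGER, "
             "id INTEGER PRIMARY KEY, title TEXT, lastmodified_by INTEGER)")
conn.executemany("INSERT INTO places VALUES (?, ?, ?, ?, ?)", [
    (169, 305, 123, "Whakarewarewa", 2011100400305262),
    (None, None, 850, "Whakarewarewa", None),
    (None, None, 124, "Paris", None),   # hypothetical id
    (None, None, 851, "Paris", None),   # hypothetical id
])

# Step 1: delete rows that carry less data than the best row with the same title.
conn.execute(f"""
    DELETE FROM places
    WHERE {filled('places')} < (
        SELECT MAX({filled('p2')}) FROM places AS p2
        WHERE p2.title = places.title)
""")

# Step 2: among the remaining exact ties, keep only the lowest id per title.
conn.execute("""
    DELETE FROM places
    WHERE id NOT IN (SELECT MIN(id) FROM places GROUP BY title)
""")

remaining = conn.execute("SELECT id, title FROM places ORDER BY id").fetchall()
print(remaining)
```

Running step 1 before step 2 matters: step 2 alone would keep the lowest id even when a higher id holds more data.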

Use this to select duplicates, feel free to alter this to a delete statement:
SELECT * FROM `test`,
(SELECT title, count( title ) AS ttl
FROM `test`
GROUP BY title
HAVING ttl >1) AS sub
WHERE test.title = sub.title
AND contact_info IS NULL AND lastmodified_by IS NULL

Main table = tes1
Create temp
CREATE TEMPORARY TABLE my_temp ( id INT(20) NOT NULL ) ENGINE=MEMORY;
Fill with id's to remove
INSERT INTO my_temp (id) SELECT id FROM tes1 AS main, ( SELECT title, count( title ) AS ttl FROM tes1 GROUP BY
title HAVING ttl >1 ) AS sub WHERE main.title = sub.title AND
main.contact_info IS NULL AND main.lastmodified_by IS NULL GROUP BY
main.contact_info, main.icon, main.title, main.lastmodified_by;
Delete!
DELETE FROM tes1 WHERE id IN (select id from my_temp);
Cleanup (optional, since a temporary table is dropped automatically when the session ends):
DROP TABLE my_temp;

Related

Creating summary VIEW from fields from multiple tables

I am trying to write a select query for creating a view in MySQL. Each row in the view should display a weekly summary (sum, avg) of user values collected from multiple tables. The tables are similar to each other but not identical. The view should also include a row when one of the tables has no values for that week. Something like this:
| week_year | sum1 | avg1 | sum2 | user_id |
| --------- | ---- | ---- | ---- | ------- |
| 201840 | | | 3 | 1 |
| 201844 | 45 | 55 | | 1 |
| 201845 | 55 | 65 | | 1 |
| 201849 | 65 | 75 | | 1 |
| 201849 | 75 | 85 | 3 | 2 |
The tables (simplified) are as follows:
CREATE TABLE IF NOT EXISTS `t1` (
`user_id` INT NOT NULL AUTO_INCREMENT,
`date` DATE NOT NULL,
`value1` int(3) NOT NULL,
`value2` int(3) NOT NULL,
PRIMARY KEY (`user_id`,`date`)
) DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `t2` (
`id` INT NOT NULL AUTO_INCREMENT,
`date` DATE NOT NULL,
`value3` int(3) NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `t3` (
`t3_id` INT NOT NULL,
`user_id` INT NOT NULL
) DEFAULT CHARSET=utf8;
My current solution doesn't seem reasonable and I am not sure how it would perform in case of thousands of rows:
select ifnull(yearweek(q1.date1), yearweek(q1.date2)) as week_year,
sum(value1) as sum1,
avg(value2) as avg1,
sum(value3) as sum2,
q1.user_id
from (select t2.date as date2,
t1.date as date1,
ifnull(t3.user_id, t1.user_id) as user_id,
t1.value1,
t1.value2,
t2.value3
from t2
join t3 on t3.t3_id=t2.id
left join t1 on yearweek(t1.date) = yearweek(t2.date) and t1.user_id = t3.user_id
union
select t2.date as date2,
t1.date as date1,
ifnull(t3.user_id, t1.user_id) as user_id,
t1.value1,
t1.value2,
t2.value3
from t2
join t3 on t3.t3_id=t2.id
right join t1 on yearweek(t1.date) = yearweek(t2.date) and t1.user_id = t3.user_id) as q1
group by week_year, user_id;
Is the current solution okay performance-wise, or are there better options? If a third (or fourth) table is added in the future, how would I manage the query? Should I consider creating a separate table that is updated with triggers?
Thanks in advance.
Another way you can do it is to union all the data and then group it. You'll have to perf test to see which is better:
SELECT
u.user_id,
YEARWEEK(u.date) AS week_year,
SUM(u.value1) AS sum1,
AVG(u.value2) AS avg1,
SUM(u.value3) AS sum2
FROM
(
SELECT user_id, date, value1, value2, CAST(NULL AS SIGNED) AS value3 FROM t1
UNION ALL
SELECT t3.user_id, t2.date, NULL, NULL, t2.value3 FROM t2 INNER JOIN t3 ON t2.id = t3.t3_id
) AS u
GROUP BY
u.user_id,
YEARWEEK(u.date)
MySQL's CAST does not accept INT as a target type, so the null is cast to SIGNED instead. The derived table also needs an alias.
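The union-then-group idea can be checked on a tiny data set. This is only a sketch: it assumes Python's built-in sqlite3 module in place of MySQL, simplified table definitions, and made-up sample rows. SQLite has no YEARWEEK(), so strftime('%Y%W', ...) stands in for it here; its week numbering is not guaranteed to match MySQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (user_id INTEGER, date TEXT, value1 INTEGER, value2 INTEGER);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, date TEXT, value3 INTEGER);
    CREATE TABLE t3 (t3_id INTEGER, user_id INTEGER);
    -- hypothetical sample rows: one week only in t1, another only in t2
    INSERT INTO t1 VALUES (1, '2018-11-05', 45, 55);
    INSERT INTO t2 VALUES (1, '2018-10-03', 3);
    INSERT INTO t3 VALUES (1, 1);
""")

weekly = conn.execute("""
    SELECT strftime('%Y%W', u.date) AS week_year,
           SUM(u.value1) AS sum1,
           AVG(u.value2) AS avg1,
           SUM(u.value3) AS sum2,
           u.user_id
    FROM (
        -- each source contributes NULLs for the columns it does not have
        SELECT user_id, date, value1, value2, NULL AS value3 FROM t1
        UNION ALL
        SELECT t3.user_id, t2.date, NULL, NULL, t2.value3
        FROM t2 JOIN t3 ON t2.id = t3.t3_id
    ) AS u
    GROUP BY u.user_id, week_year
    ORDER BY week_year
""").fetchall()

for row in weekly:
    print(row)
```

A week that exists in only one source still produces a row, with NULL aggregates for the other source. Adding a third table would mean adding one more UNION ALL branch rather than another outer join.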

SQL improvement in MySQL

I have these tables in MySQL.
CREATE TABLE `tableA` (
`id_a` int(11) NOT NULL,
`itemCode` varchar(50) NOT NULL,
`qtyOrdered` decimal(15,4) DEFAULT NULL,
:
PRIMARY KEY (`id_a`),
KEY `INDEX_A1` (`itemCode`)
) ENGINE=InnoDB
CREATE TABLE `tableB` (
`id_b` int(11) NOT NULL AUTO_INCREMENT,
`qtyDelivered` decimal(15,4) NOT NULL,
`id_a` int(11) DEFAULT NULL,
`opType` int(11) NOT NULL, -- '0' delivered to customer, '1' returned from customer
:
PRIMARY KEY (`id_b`),
KEY `INDEX_B1` (`id_a`),
KEY `INDEX_B2` (`opType`)
) ENGINE=InnoDB
tableA shows the quantity ordered by the customer; tableB shows the quantity delivered to the customer for each order.
I want to write a query that computes the quantity remaining for delivery for each itemCode.
The SQL is below. It works, but it is slow.
SELECT T1.itemCode,
SUM(IFNULL(T1.qtyOrdered,'0')-IFNULL(T2.qtyDelivered,'0')+IFNULL(T3.qtyReturned,'0')) as qty
FROM tableA AS T1
LEFT JOIN (SELECT id_a,SUM(qtyDelivered) as qtyDelivered FROM tableB WHERE opType = '0' GROUP BY id_a)
AS T2 on T1.id_a = T2.id_a
LEFT JOIN (SELECT id_a,SUM(qtyDelivered) as qtyReturned FROM tableB WHERE opType = '1' GROUP BY id_a)
AS T3 on T1.id_a = T3.id_a
WHERE T1.itemCode = '?'
GROUP BY T1.itemCode
I tried explain on this SQL, and the result is as below.
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | T1 | ref | INDEX_A1 | INDEX_A1 | 152 | const | 1 | Using where |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 21211 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 10 | |
| 3 | DERIVED | tableB | ref | INDEX_B2 | INDEX_B2 | 4 | | 96 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | tableB | ref | INDEX_B2 | INDEX_B2 | 4 | | 55614 | Using where; Using temporary; Using filesort |
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
I want to improve my query. How can I do that?
First, opType in tableB is an int, but you are comparing it to the strings '0' and '1'. Compare against numeric 0 and 1 instead. To optimize the pre-aggregates, replace the individual column indexes with a single composite index that is also a covering index: an index on tableB (opType, id_a, qtyDelivered). opType serves the WHERE, id_a serves the GROUP BY, and qtyDelivered lets the aggregate be computed from the index without going to the raw data pages.
Since you need both types, you can roll them up into a single subquery that handles both in one pass, and then join that to your tableA results.
SELECT
T1.itemCode,
SUM( IFNULL(T1.qtyOrdered, 0 )
- IFNULL(T2.qtyDelivered, 0)
+ IFNULL(T2.qtyReturned, 0)) as qty
FROM
tableA AS T1
LEFT JOIN ( SELECT
id_a,
SUM( IF( opType=0,qtyDelivered, 0)) as qtyDelivered,
SUM( IF( opType=1,qtyDelivered, 0)) as qtyReturned
FROM
tableB
WHERE
opType IN ( 0, 1 )
GROUP BY
id_a) AS T2
on T1.id_a = T2.id_a
WHERE
T1.itemCode = '?'
GROUP BY
T1.itemCode
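To see the single-pass roll-up produce the expected numbers, here is a small sketch. It assumes Python's built-in sqlite3 module instead of MySQL, trimmed-down table definitions, a hypothetical index name `idx_b_cover`, and made-up sample rows. CASE is used in place of MySQL's IF() because SQLite does not have the latter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id_a INTEGER PRIMARY KEY, itemCode TEXT NOT NULL,
                         qtyOrdered REAL);
    CREATE TABLE tableB (id_b INTEGER PRIMARY KEY, qtyDelivered REAL NOT NULL,
                         id_a INTEGER, opType INTEGER NOT NULL);
    -- composite index in the order suggested above (name is hypothetical)
    CREATE INDEX idx_b_cover ON tableB (opType, id_a, qtyDelivered);
    -- hypothetical sample rows
    INSERT INTO tableA VALUES (1, 'X', 10), (2, 'X', 5), (3, 'Y', 7);
    INSERT INTO tableB VALUES (1, 4, 1, 0), (2, 1, 1, 1), (3, 5, 2, 0), (4, 2, 3, 0);
""")

def remaining_qty(conn, item_code):
    """Ordered minus delivered plus returned, summed over one item code."""
    return conn.execute("""
        SELECT a.itemCode,
               SUM(IFNULL(a.qtyOrdered, 0)
                   - IFNULL(d.qtyDelivered, 0)
                   + IFNULL(d.qtyReturned, 0)) AS qty
        FROM tableA AS a
        LEFT JOIN (
            -- one pass over tableB computes both totals per order
            SELECT id_a,
                   SUM(CASE WHEN opType = 0 THEN qtyDelivered ELSE 0 END) AS qtyDelivered,
                   SUM(CASE WHEN opType = 1 THEN qtyDelivered ELSE 0 END) AS qtyReturned
            FROM tableB
            WHERE opType IN (0, 1)
            GROUP BY id_a
        ) AS d ON a.id_a = d.id_a
        WHERE a.itemCode = ?
        GROUP BY a.itemCode
    """, (item_code,)).fetchone()

print(remaining_qty(conn, "X"))
print(remaining_qty(conn, "Y"))
```

For item X the sample works out as (10 - 4 + 1) + (5 - 5 + 0), and for item Y as 7 - 2. Whether the index is actually chosen depends on the engine and the data, so checking with EXPLAIN on the real MySQL tables is still needed.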
Now, depending on the size of your tables, you might be better off joining tableA inside the inner query so that you only aggregate rows for the item code you expect. If you have 50k items and only 120 of them qualify, the inner query above still aggregates across all 50k, which is overkill. In that case I would suggest an index on tableA (itemCode, id_a) and adjusting the inner query to:
LEFT JOIN ( SELECT
b.id_a,
SUM( IF( b.opType = 0, b.qtyDelivered, 0)) as qtyDelivered,
SUM( IF( b.opType = 1, b.qtyDelivered, 0)) as qtyReturned
FROM
( select distinct id_a
from tableA
where itemCode = '?' ) pqA
JOIN tableB b
on pqA.id_a = b.id_a
AND b.opType IN ( 0, 1 )
GROUP BY
b.id_a) AS T2

Log affected rows into another table in MySQL

Given the table:
CREATE TABLE `records` (
`id_type` varchar(50) NOT NULL,
`old_id` INT,
`new_id` INT
) ENGINE=InnoDB;
And the data:
id_type | old_id | new_id
USER | 11 | NULL
USER | 15 | NULL
USER | 56 | NULL
USER | NULL | 500
USER | NULL | 523
USER | NULL | 800
I want to perform a query that will return:
id_type | old_id | new_id
USER | 11 | 500
USER | 15 | 523
USER | 56 | 800
Create table records_old
(
id_type varchar(20) primary key,
old_id int not null
);
Create table records_new
(
id_type varchar(20),
new_id int not null
);
insert into records_old(id_type,old_id) values ('USER1',11);
insert into records_old(id_type,old_id) values ('USER2',12);
insert into records_old(id_type,old_id) values ('USER3',13);
insert into records_new(id_type,new_id) values ('USER1',500);
insert into records_new(id_type,new_id) values ('USER2',600);
insert into records_new(id_type,new_id) values ('USER3',700);
select * from records_old;
select * from records_new;
select a.id_type,a.old_id,b.new_id from records_old a
inner join records_new b
on a.id_type=b.id_type;
SET @old_row_number = 0;
SET @new_row_number = 0;
SELECT OldData.id_type, OldData.old_id, NewData.new_id
FROM (SELECT id_type, old_id, (@old_row_number:=@old_row_number + 1) AS OldRowNumber
FROM `records`
WHERE old_id IS NOT NULL) OldData
JOIN (SELECT id_type, new_id, (@new_row_number:=@new_row_number + 1) AS NewRowNumber
FROM `records`
WHERE new_id IS NOT NULL) NewData ON NewData.NewRowNumber = OldData.OldRowNumber;
Filter on IS NOT NULL to split the rows into two subqueries, add a row number to each, and then join the two on the row number.
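The row-number pairing can also be sketched with the ROW_NUMBER() window function in place of user variables. This assumes Python's built-in sqlite3 module (with a bundled SQLite of 3.25 or later) as a stand-in for MySQL, and it assumes the intended pairing is "n-th smallest old_id with n-th smallest new_id", which the question implies but does not state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id_type TEXT NOT NULL, old_id INTEGER, new_id INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", [
    ("USER", 11, None), ("USER", 15, None), ("USER", 56, None),
    ("USER", None, 500), ("USER", None, 523), ("USER", None, 800),
])

paired = conn.execute("""
    SELECT o.id_type, o.old_id, n.new_id
    FROM (
        -- number the rows that carry an old_id
        SELECT id_type, old_id,
               ROW_NUMBER() OVER (ORDER BY old_id) AS rn
        FROM records
        WHERE old_id IS NOT NULL
    ) AS o
    JOIN (
        -- number the rows that carry a new_id
        SELECT new_id,
               ROW_NUMBER() OVER (ORDER BY new_id) AS rn
        FROM records
        WHERE new_id IS NOT NULL
    ) AS n ON n.rn = o.rn
    ORDER BY o.old_id
""").fetchall()

for row in paired:
    print(row)
```

The explicit ORDER BY inside each OVER clause makes the pairing deterministic, whereas a counter variable follows whatever order the rows happen to be read in.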

How to rewrite a NOT IN subquery as join

Let's assume that the following tables in MySQL describe documents contained in folders.
mysql> select * from folder;
+----+----------------+
| ID | PATH |
+----+----------------+
| 1 | matches/1 |
| 2 | matches/2 |
| 3 | shared/3 |
| 4 | no/match/4 |
| 5 | unreferenced/5 |
+----+----------------+
mysql> select * from DOC;
+----+------+------------+
| ID | F_ID | DATE |
+----+------+------------+
| 1 | 1 | 2000-01-01 |
| 2 | 2 | 2000-01-02 |
| 3 | 2 | 2000-01-03 |
| 4 | 3 | 2000-01-04 |
| 5 | 3 | 2000-01-05 |
| 6 | 3 | 2000-01-06 |
| 7 | 4 | 2000-01-07 |
| 8 | 4 | 2000-01-08 |
| 9 | 4 | 2000-01-09 |
| 10 | 4 | 2000-01-10 |
+----+------+------------+
The columns ID are the primary keys and the column F_ID of table DOC is a not-null foreign key that references the primary key of table FOLDER. By using the 'DATE' of documents in the where clause, I would like to find which folders contain only the selected documents. For documents earlier than 2000-01-05, this could be written as:
SELECT DISTINCT d1.F_ID
FROM DOC d1
WHERE d1.DATE < '2000-01-05'
AND d1.F_ID NOT IN (
SELECT d2.F_ID
FROM DOC d2 WHERE NOT (d2.DATE < '2000-01-05')
);
and it correctly returns '1' and '2'. According to
http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html
performance on big tables could be improved if the subquery is replaced with a join. I have already found questions about NOT IN and joins, but not exactly what I was looking for. So, any ideas on how this could be written with joins?
The general answer is:
select t.*
from t
where t.id not in (select id from s)
Can be rewritten as:
select t.*
from t left outer join
(select distinct id from s) s
on t.id = s.id
where s.id is null
I think you can apply this to your situation.
select distinct d1.F_ID
from DOC d1
left outer join (
select F_ID
from DOC
where date >= '2000-01-05'
) d2 on d1.F_ID = d2.F_ID
where d1.date < '2000-01-05'
and d2.F_ID is null
If I understand your question correctly, and you want to find the F_IDs of folders that contain only documents from before '2000-01-05', then simply:
SELECT F_ID
FROM DOC
GROUP BY F_ID
HAVING MAX(DATE) < '2000-01-05'
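The three formulations above (NOT IN, LEFT JOIN ... IS NULL, and GROUP BY ... HAVING) can be compared on the question's sample data. A minimal sketch, assuming Python's built-in sqlite3 module in place of MySQL; it only checks that the results agree and says nothing about relative performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DOC (ID INTEGER PRIMARY KEY, F_ID INTEGER NOT NULL, DATE TEXT)")
conn.executemany("INSERT INTO DOC VALUES (?, ?, ?)", [
    (1, 1, "2000-01-01"), (2, 2, "2000-01-02"), (3, 2, "2000-01-03"),
    (4, 3, "2000-01-04"), (5, 3, "2000-01-05"), (6, 3, "2000-01-06"),
    (7, 4, "2000-01-07"), (8, 4, "2000-01-08"), (9, 4, "2000-01-09"),
    (10, 4, "2000-01-10"),
])

def folder_ids(sql, cutoff="2000-01-05"):
    """Run one formulation and return the matching F_ID values."""
    return [row[0] for row in conn.execute(sql, {"cutoff": cutoff})]

not_in = folder_ids("""
    SELECT DISTINCT d1.F_ID FROM DOC d1
    WHERE d1.DATE < :cutoff
      AND d1.F_ID NOT IN (SELECT d2.F_ID FROM DOC d2 WHERE d2.DATE >= :cutoff)
    ORDER BY d1.F_ID
""")

anti_join = folder_ids("""
    SELECT DISTINCT d1.F_ID FROM DOC d1
    LEFT JOIN (SELECT DISTINCT F_ID FROM DOC WHERE DATE >= :cutoff) d2
           ON d1.F_ID = d2.F_ID
    WHERE d1.DATE < :cutoff AND d2.F_ID IS NULL
    ORDER BY d1.F_ID
""")

having = folder_ids("""
    SELECT F_ID FROM DOC
    GROUP BY F_ID
    HAVING MAX(DATE) < :cutoff
    ORDER BY F_ID
""")

print(not_in, anti_join, having)
```

The equivalence relies on F_ID being NOT NULL, as stated in the question: if the subquery of a NOT IN can return NULL, NOT IN yields no rows while the LEFT JOIN form still does.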
Sample Table and Insert Statements
CREATE TABLE `tleft` (
`id` int(2) NOT NULL,
`name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
CREATE TABLE `tright` (
`id` int(2) NOT NULL,
`t_left_id` int(2) DEFAULT NULL,
`description` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
INSERT INTO `tleft` (`id`, `name`)
VALUES
(1, 'henry'),
(2, 'steve'),
(3, 'jeff'),
(4, 'richards'),
(5, 'elon');
INSERT INTO `tright` (`id`, `t_left_id`, `description`)
VALUES
(1, 1, 'sample'),
(2, 2, 'sample');
Left Join : SELECT l.id,l.name FROM tleft l LEFT JOIN tright r ON l.id = r.t_left_id ;
Returns Id : 1, 2, 3, 4, 5
Right Join : SELECT l.id,l.name FROM tleft l RIGHT JOIN tright r ON l.id = r.t_left_id ;
Returns Id : 1,2
Subquery Not in tright : select id from tleft where id not in ( select t_left_id from tright);
Returns Id : 3,4,5
Equivalent Join For above subquery :
SELECT l.id,l.name FROM tleft l LEFT JOIN tright r ON l.id = r.t_left_id WHERE r.t_left_id IS NULL;
A condition in the ON clause (after AND) is applied while the join is performed; a condition in the WHERE clause is applied after the join.
Example : SELECT l.id,l.name FROM tleft l LEFT JOIN tright r ON l.id = r.t_left_id AND r.description ='hello' WHERE r.t_left_id IS NULL ;
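The difference between filtering in ON and filtering in WHERE shows up clearly on the sample tables. A minimal sketch, assuming Python's built-in sqlite3 module in place of MySQL; the helper name `ids` is only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tleft (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tright (id INTEGER PRIMARY KEY, t_left_id INTEGER, description TEXT);
    INSERT INTO tleft VALUES (1, 'henry'), (2, 'steve'), (3, 'jeff'),
                             (4, 'richards'), (5, 'elon');
    INSERT INTO tright VALUES (1, 1, 'sample'), (2, 2, 'sample');
""")

def ids(sql):
    """Return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]

# Plain anti-join: left rows without any partner on the right.
anti_join = ids("""
    SELECT l.id FROM tleft l
    LEFT JOIN tright r ON l.id = r.t_left_id
    WHERE r.t_left_id IS NULL ORDER BY l.id
""")

# Extra condition in ON: no right row has description 'hello', so every
# left row is kept with NULLs on the right and passes the IS NULL test.
on_filter = ids("""
    SELECT l.id FROM tleft l
    LEFT JOIN tright r ON l.id = r.t_left_id AND r.description = 'hello'
    WHERE r.t_left_id IS NULL ORDER BY l.id
""")

# Same condition in WHERE: it runs after the join and rejects every row,
# including the NULL-extended ones.
where_filter = ids("""
    SELECT l.id FROM tleft l
    LEFT JOIN tright r ON l.id = r.t_left_id
    WHERE r.description = 'hello' ORDER BY l.id
""")

print(anti_join, on_filter, where_filter)
```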
Hope this helps

Join two tables where table A has a date value and needs to find the next date in B below the date in A

I got this table "A":
| id | date |
===================
| 1 | 2010-01-13 |
| 2 | 2011-04-19 |
| 3 | 2011-05-07 |
| .. | ... |
and this table "B":
| date | value |
======================
| 2009-03-29 | 0.5 |
| 2010-01-30 | 0.55 |
| 2011-08-12 | 0.67 |
Now I am looking for a way to JOIN those two tables so that the "value" column in "B" is mapped to the dates in "A". The tricky part is that table "B" only stores the date of each change and the new value. So when I need the value for a date in table "A", the SQL has to look back for the closest date in "B" that lies before the date in question.
So in the end the JOIN of those tables should look like this:
| id | date | value |
===========================
| 1 | 2010-01-13 | 0.5 |
| 2 | 2011-04-19 | 0.55 |
| 3 | 2011-05-07 | 0.55 |
| .. | ... | ... |
How can I do this?
-- Create and fill first table
CREATE TABLE `id_date` (
`id` int(11) NOT NULL auto_increment,
`iddate` date NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `id_date` VALUES(1, '2010-01-13');
INSERT INTO `id_date` VALUES(2, '2011-04-19');
INSERT INTO `id_date` VALUES(3, '2011-05-07');
-- Create and fill second table
CREATE TABLE `date_val` (
`mydate` date NOT NULL,
`myval` varchar(4) collate utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
INSERT INTO `date_val` VALUES('2009-03-29', '0.5');
INSERT INTO `date_val` VALUES('2010-01-30', '0.55');
INSERT INTO `date_val` VALUES('2011-08-12', '0.67');
-- Get the result table as asked in question
SELECT iddate, t2.mydate, t2.myval
FROM `id_date` t1
JOIN date_val t2 ON t2.mydate <= t1.iddate
AND t2.mydate = (
SELECT MAX( t3.mydate )
FROM `date_val` t3
WHERE t3.mydate <= t1.iddate )
What we're doing:
for each date in the id_date table (your table A),
we find the date in the date_val table (your table B)
which is the highest date in the date_val table (but still not later than id_date.iddate)
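The lookup described above can be run against the sample rows as a quick check. This sketch assumes Python's built-in sqlite3 module in place of MySQL and stores myval as a number rather than a varchar; dates are ISO strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE id_date (id INTEGER PRIMARY KEY, iddate TEXT NOT NULL);
    CREATE TABLE date_val (mydate TEXT NOT NULL, myval REAL NOT NULL);
    INSERT INTO id_date VALUES (1, '2010-01-13'), (2, '2011-04-19'), (3, '2011-05-07');
    INSERT INTO date_val VALUES ('2009-03-29', 0.5), ('2010-01-30', 0.55),
                                ('2011-08-12', 0.67);
""")

joined = conn.execute("""
    SELECT t1.id, t1.iddate, t2.myval
    FROM id_date t1
    JOIN date_val t2
      ON t2.mydate = (
            -- latest change date that is not after this row's date
            SELECT MAX(t3.mydate)
            FROM date_val t3
            WHERE t3.mydate <= t1.iddate)
    ORDER BY t1.id
""").fetchall()

for row in joined:
    print(row)
```

Because this is an inner join, a row in id_date that is older than every change date drops out of the result; a LEFT JOIN would keep it with a NULL value.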
You could use a subquery with limit 1 to look up the latest value in table B:
select id
, date
, (
select value
from B
where B.date < A.date
order by
B.date desc
limit 1
) as value
from A
I have been inspired by the other answers but ended with my own solution using common table expressions:
WITH datecombination (id, adate, bdate) AS
(
SELECT id, A.date, MAX(B.Date) as Bdate
FROM tableA A
LEFT JOIN tableB B
ON B.date <= A.date
GROUP BY A.id, A.date
)
SELECT DC.id, DC.adate, B.value FROM datecombination DC
LEFT JOIN tableB B
ON DC.bdate = B.date
An INNER JOIN returns rows only when there is a match in both tables, so the query below matches only rows whose dates are equal. Try this:
Select A.id,A.date,b.value
from A inner join B
on A.date=b.date