MySQL - SQL query to delete duplicate rows based on a condition

I have a database table with nearly 1 million records. When I wrote a query to count the duplicates, close to 90K records turned out to be duplicates. By duplicate I mean records with the same email address; a single email address could have as many as 10 records.
Sample data
ID | Name | Email | phone
1 | abc | abc#gmail.com | 12345
2 | def | def#gmail.com | 12533
3 | abc | abc#gmail.com |
4 | hij | hij#gmail.com | 50633
5 | abc | abc#gmail.com | 12345
6 | def | def#gmail.com |
1) ID is the autoincrement primary key of the table
2) If there are two records present like def#gmail.com - I need to keep the record that has the phone and delete the other record
3) In the case of abc#gmail.com there are 3 records: the one without a phone gets deleted, and of the remaining two, although both have all the data, keep the first one and delete the second
Is it possible to write a DELETE statement based on a condition, or is there an easier way to accomplish this?
A SQLfiddle to play around with - http://sqlfiddle.com/#!2/cf8c7
Thanks very much.

DELETE FROM phoney ph
WHERE ph.zphone IS NULL
  AND EXISTS (SELECT *
              FROM phoney ex
              WHERE ex.zname = ph.zname
                AND ex.zemail = ph.zemail
                AND ex.zphone IS NOT NULL
             );
DELETE FROM phoney ph
WHERE ph.zphone IS NOT NULL
  AND EXISTS (SELECT *
              FROM phoney ex
              WHERE ex.zname = ph.zname
                AND ex.zemail = ph.zemail
                AND ex.id < ph.id
             );
SELECT * FROM phoney;
RESULT:
DELETE 2
DELETE 1
id | zname | zemail | zphone
----+-------+---------------+--------
1 | abc | abc#gmail.com | 12345
2 | def | def#gmail.com | 12533
4 | hij | hij#gmail.com | 50633
NOTE: You could combine the two delete-queries, but that will result in a messy kludge of AND/OR conditions in the WHERE CLAUSE, which is very error-prone.
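For comparison, here is a minimal sketch of a single-statement variant that avoids the AND/OR kludge by picking one "keeper" id per email and deleting everything else. It is not the answer's method, just an alternative under stated assumptions: it runs against an in-memory SQLite database from Python as a stand-in for MySQL, it assumes the blank phone cells in the sample are NULL, and the function name dedupe_single_pass is made up for illustration. On MySQL the inner SELECT would typically have to be wrapped in a derived table, because MySQL does not let a DELETE subquery read the target table directly.

```python
import sqlite3

# Sample rows from the question; blank phone cells are assumed to be NULL.
ROWS = [
    (1, "abc", "abc#gmail.com", "12345"),
    (2, "def", "def#gmail.com", "12533"),
    (3, "abc", "abc#gmail.com", None),
    (4, "hij", "hij#gmail.com", "50633"),
    (5, "abc", "abc#gmail.com", "12345"),
    (6, "def", "def#gmail.com", None),
]


def dedupe_single_pass(rows):
    """Return the ids that survive a one-statement dedupe (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE phoney "
        "(id INTEGER PRIMARY KEY, zname TEXT, zemail TEXT, zphone TEXT)"
    )
    conn.executemany("INSERT INTO phoney VALUES (?, ?, ?, ?)", rows)
    # Keeper per email: the lowest id that has a phone,
    # or the lowest id overall when no row for that email has one.
    conn.execute(
        """
        DELETE FROM phoney
        WHERE id NOT IN (
            SELECT COALESCE(MIN(CASE WHEN zphone IS NOT NULL THEN id END),
                            MIN(id))
            FROM phoney
            GROUP BY zemail
        )
        """
    )
    remaining = [r[0] for r in conn.execute("SELECT id FROM phoney ORDER BY id")]
    conn.close()
    return remaining


print(dedupe_single_pass(ROWS))  # prints [1, 2, 4]
```

The two-statement version above is easier to reason about; this one trades that for a single pass over the table.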

Try the query below:
DELETE b.* FROM table1 a INNER JOIN table1 b ON a.name = b.name AND a.id < b.id

Related

MySQL query based on results of another query

I have a table in MySQL which looks like this.
+---------+------------+--------------+
| user_id | key | value |
+---------+------------+--------------+
| 1 | full_name | John Smith |
+---------+------------+--------------+
| 1 | is_active | 1 |
+---------+------------+--------------+
| 1 | user_level |Administrator |
+---------+------------+--------------+
I need to get value of key full_name where user_id is 1, but only if value of key is_active is 1. I can do it with 2 separate queries, but I would like to know if it is possible to do it in a single query.
Note: I cannot change the structure of the table.
One method is to use joins:
select tn.value
from t tn join
     t ta
     on tn.user_id = ta.user_id and ta.key = 'is_active' and ta.value = '1'
where tn.key = 'full_name';
I think you need the query below, using exists:
select t.value from your_table t where
exists ( select 1 from your_table t1
         where t1.user_id = t.user_id
         and t1.key = 'is_active'
         and t1.value = '1'
       ) and t.key = 'full_name'
DEMO IN MYSQL 8
value
john smith
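Another way to read two keys of the same user in one query is conditional aggregation. The sketch below is only an illustration, not taken from either answer: it uses Python's sqlite3 module with an in-memory database as a stand-in for MySQL, the table name t follows the first answer, and the helper name full_name_if_active is hypothetical.

```python
import sqlite3


def full_name_if_active(rows, user_id):
    """Return full_name for user_id, or None unless is_active is '1'."""
    conn = sqlite3.connect(":memory:")
    # "key" is quoted in the DDL because it is a keyword in many SQL dialects.
    conn.execute('CREATE TABLE t (user_id INTEGER, "key" TEXT, value TEXT)')
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    row = conn.execute(
        """
        SELECT MAX(CASE WHEN t."key" = 'full_name' THEN t.value END)
        FROM t
        WHERE t.user_id = ?
        GROUP BY t.user_id
        HAVING MAX(CASE WHEN t."key" = 'is_active' THEN t.value END) = '1'
        """,
        (user_id,),
    ).fetchone()
    conn.close()
    return row[0] if row else None


active = [
    (1, "full_name", "John Smith"),
    (1, "is_active", "1"),
    (1, "user_level", "Administrator"),
]
inactive = [(1, "full_name", "John Smith"), (1, "is_active", "0")]

print(full_name_if_active(active, 1))    # prints John Smith
print(full_name_if_active(inactive, 1))  # prints None
```

Each CASE expression picks out one key, so the table is scanned once per user instead of being joined to itself.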

MySQL - Select statement of a table that contains 2 same-foreign-key columns

I can't really explain it in words because it is hard for me. I'll just show you what I need to accomplish.
So let's say I have 2 tables, admin and records.
admin table with sample data:
a_id | a_name
1 | haime
2 | joseph
Record table with sample data:
r_id | r_amount | r_a_create_by | r_a_update_by
1 | 99 | 1 | 2
So I have a transaction record that was created by the admin with ID 1 and updated by the admin with ID 2. How can I write a SELECT query for that, if I want output like this:
1 | 99 | haime | joseph
You need to join the admin table twice, once per foreign-key column. This should work:
select r.r_id, r.r_amount, a.a_name, b.a_name
from records r
join admin a on r.r_a_create_by = a.a_id
join admin b on r.r_a_update_by = b.a_id
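A quick way to check the double join is to run it on the sample data. The sketch below does that with Python's sqlite3 module (an in-memory stand-in for MySQL). Two things in it are assumptions that are not in the question: the second record, which has never been updated, and the LEFT JOIN on the updater so that such a record is still returned. The column aliases created_by and updated_by and the function name are also made up for illustration.

```python
import sqlite3


def records_with_admin_names():
    """Join admin twice, once per foreign-key column (illustrative helper)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE admin (a_id INTEGER PRIMARY KEY, a_name TEXT);
        CREATE TABLE records (
            r_id INTEGER PRIMARY KEY,
            r_amount INTEGER,
            r_a_create_by INTEGER,
            r_a_update_by INTEGER
        );
        INSERT INTO admin VALUES (1, 'haime'), (2, 'joseph');
        -- Record 2 is hypothetical: created by joseph, never updated.
        INSERT INTO records VALUES (1, 99, 1, 2), (2, 50, 2, NULL);
        """
    )
    rows = conn.execute(
        """
        SELECT r.r_id, r.r_amount,
               creator.a_name AS created_by,
               updater.a_name AS updated_by
        FROM records r
        JOIN admin creator ON r.r_a_create_by = creator.a_id
        LEFT JOIN admin updater ON r.r_a_update_by = updater.a_id
        ORDER BY r.r_id
        """
    ).fetchall()
    conn.close()
    return rows


for row in records_with_admin_names():
    print(row)
# prints (1, 99, 'haime', 'joseph')
# prints (2, 50, 'joseph', None)
```

With a plain inner join on both columns, the second record would drop out of the result because its r_a_update_by is NULL.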

Removing duplicates based on one column, and keeping the row that has value in different column, and if there isn't any, keep lowest ID row

Using MySQL 5.7 on Google Cloud, I'm trying to deduplicate MySQL data based on an "EmailAddress" column, but some of the rows have a value in the "FullName" column and some of them don't. I want to keep the ones that have a value in the FullName column, but if none of the rows with that EmailAddress value have a FullName value, then just keep the duplicate with the lowest ID number (first column - primary key).
I've finally broken it down into two separate queries, one to first remove the rows with no value in the FullName column IF there's another duplicate row that does have a value in the FullName column:
DELETE
FROM customer_info
WHERE id IN
(
SELECT *
FROM
(
SELECT c1.id
FROM customer_info c1
INNER JOIN customer_info c2 on c1.EmailAddress=c2.EmailAddress and c1.id!=c2.id
WHERE
(trim(c1.FullName)='' or c1.FullName is NULL)
and c2.FullName is not NULL
and length(trim(c2.FullName))!=0
) t
)
and another query to remove the rows with the bigger IDs where no value was found in the FullName column:
DELETE
FROM customer_info
WHERE id IN
(
SELECT *
FROM
(
SELECT c1.id
FROM customer_info c1
INNER JOIN customer_info c2 on c1.EmailAddress=c2.EmailAddress and c1.id>c2.id
) t
)
This "works", but not really. It worked one time when I left it running overnight for a smaller segment of the data, and when I woke up there was an error, but I looked at the data and it was complete.
Am I missing something in my query that's making it highly inefficient, or is it just par for the course for this type of query, and there's no optimization possible in my code that would make a tangible improvement? I've maxed out a Google Cloud SQL instance to their db-n1-highmem-32 size, with 32 GB of memory and 1000 GB of storage space, and it still chokes up and spits out a 2013 error after running for an hour. I need to do this for a total of a little over 3 million rows.
For example, this:
id | FullName | EmailAddress |
----------------------------------------------
1 | John Doe | john.doe#email.com |
2 | null | janedoe#box.com |
3 | null | billybob#bobby.com |
4 | null | john.doe#email.com |
5 | John Lennon | jlennon#yoohoo.com |
6 | null | james.smith#coolmail.com|
7 | null | billybob#bobby.com |
8 | Jane Doe | janedoe#box.com |
would result in this:
id | FullName | EmailAddress |
----------------------------------------------
1 | John Doe | john.doe#email.com |
3 | null | billybob#bobby.com |
5 | John Lennon | jlennon#yoohoo.com |
6 | null | james.smith#coolmail.com|
8 | Jane Doe | janedoe#box.com |
Using exists() might be simpler in this situation:
delete
from customer_info c
where (trim(c.FullName)='' or c.FullName is null)
and exists (
    select 1
    from customer_info i
    where i.EmailAddress = c.EmailAddress
    and trim(i.FullName)>''
)
delete
from customer_info c
where exists (
    select 1
    from customer_info i
    where i.EmailAddress = c.EmailAddress
    and i.id < c.id
)
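The two-step logic can be checked against the sample rows from the question. This is a minimal sketch using Python's sqlite3 module with an in-memory database as a stand-in for MySQL, so it says nothing about how MySQL 5.7 will plan or accept these statements. The index on EmailAddress is an assumption (the question does not say whether one exists); without one, each EXISTS probe has to scan the table, which is a plausible cause of hour-long runs on 3 million rows. The names dedupe_customers and idx_customer_email are made up for illustration.

```python
import sqlite3

# Sample rows from the question; "null" cells are stored as NULL.
CUSTOMERS = [
    (1, "John Doe", "john.doe#email.com"),
    (2, None, "janedoe#box.com"),
    (3, None, "billybob#bobby.com"),
    (4, None, "john.doe#email.com"),
    (5, "John Lennon", "jlennon#yoohoo.com"),
    (6, None, "james.smith#coolmail.com"),
    (7, None, "billybob#bobby.com"),
    (8, "Jane Doe", "janedoe#box.com"),
]


def dedupe_customers(rows):
    """Run both deletes and return the surviving ids (illustrative helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_info "
        "(id INTEGER PRIMARY KEY, FullName TEXT, EmailAddress TEXT)"
    )
    # Assumed index so each EXISTS probe is a lookup rather than a scan.
    conn.execute("CREATE INDEX idx_customer_email ON customer_info (EmailAddress)")
    conn.executemany("INSERT INTO customer_info VALUES (?, ?, ?)", rows)
    # Step 1: drop nameless rows when a named row shares the email.
    conn.execute(
        """
        DELETE FROM customer_info
        WHERE (FullName IS NULL OR TRIM(FullName) = '')
          AND EXISTS (SELECT 1 FROM customer_info i
                      WHERE i.EmailAddress = customer_info.EmailAddress
                        AND TRIM(i.FullName) > '')
        """
    )
    # Step 2: among what is left, keep only the lowest id per email.
    conn.execute(
        """
        DELETE FROM customer_info
        WHERE EXISTS (SELECT 1 FROM customer_info i
                      WHERE i.EmailAddress = customer_info.EmailAddress
                        AND i.id < customer_info.id)
        """
    )
    ids = [r[0] for r in conn.execute("SELECT id FROM customer_info ORDER BY id")]
    conn.close()
    return ids


print(dedupe_customers(CUSTOMERS))  # prints [1, 3, 5, 6, 8]
```

The surviving ids match the expected result table in the question.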

Update Statement Using Max Date and userid as the criteria

I am trying to update a contract table: the row to change is the one with the latest Contract Start Date for the relevant employee ID. The contract table stores all past information about the employee.
eg.
contract_tbl
+------------+------------+--------------------+-----------------+---------------+
|Contractid |EmployeeId |ContractStartDate |ContractEndDate | Position |
+------------+------------+--------------------+-----------------+---------------+
| 1 | 1 | 2012-12-13 | 2013-12-12 | Data Entry |
+------------+------------+--------------------+-----------------+---------------+
| 2 | 1 | 2014-01-26 | 2015-01-25 | Data Entry |
+------------+------------+--------------------+-----------------+---------------+
| 3 | 2 | 2014-01-26 | 2015-01-25 | Data Entry |
+------------+------------+--------------------+-----------------+---------------+
This is the SQL that I have but it does not work. (using a mysql db)
UPDATE contract_tbl
SET Position='Data Analyst'
WHERE EmployeeId = 1 And ContractStartDate= (
select max(ContractStartDate
FROM contract_tbl))
So it should Update the second row shown above with Data Analyst in the Position column but I am getting an error.
Does anybody have any idea how to fix this?
Thanks in advance
This will also do:
UPDATE contract_tbl a
JOIN (
SELECT MAX(ContractStartDate) m
FROM contract_tbl
WHERE EmployeeId = 1) b ON a.ContractStartDate = b.m AND a.EmployeeId = 1
SET a.Position='Data Analyst';
Probably this is what you want:
UPDATE contract_tbl c1
SET Position='Data Analyst'
WHERE EmployeeId = 1 And ContractStartDate= (
SELECT max(ContractStartDate)
FROM contract_tbl c2
WHERE c2.EmployeeId = c1.EmployeeId
)
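To see that the correlated form picks the right row, here is a small sketch run on the sample data with Python's sqlite3 module (an in-memory stand-in, not MySQL). The function name promote_latest_contract is hypothetical. On MySQL itself, a subquery that reads the table being updated can be rejected with error 1093, in which case the JOIN form is the one to use.

```python
import sqlite3


def promote_latest_contract(employee_id, new_position):
    """Update only the employee's most recent contract (illustrative helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE contract_tbl (
            Contractid INTEGER PRIMARY KEY,
            EmployeeId INTEGER,
            ContractStartDate TEXT,
            ContractEndDate TEXT,
            Position TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO contract_tbl VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "2012-12-13", "2013-12-12", "Data Entry"),
            (2, 1, "2014-01-26", "2015-01-25", "Data Entry"),
            (3, 2, "2014-01-26", "2015-01-25", "Data Entry"),
        ],
    )
    # The MAX is restricted to the same employee, so another employee's
    # later contract cannot change which row matches.
    conn.execute(
        """
        UPDATE contract_tbl
        SET Position = ?
        WHERE EmployeeId = ?
          AND ContractStartDate = (
              SELECT MAX(c2.ContractStartDate)
              FROM contract_tbl c2
              WHERE c2.EmployeeId = contract_tbl.EmployeeId
          )
        """,
        (new_position, employee_id),
    )
    rows = conn.execute(
        "SELECT Contractid, Position FROM contract_tbl ORDER BY Contractid"
    ).fetchall()
    conn.close()
    return rows


print(promote_latest_contract(1, "Data Analyst"))
# prints [(1, 'Data Entry'), (2, 'Data Analyst'), (3, 'Data Entry')]
```

Employee 2 has a contract with the same start date, and it stays untouched because of the EmployeeId filter in the outer WHERE.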

MySQL Select all differences between 2 tables?

I have 3 tables, 'old', 'new' and a 'result' table (from a phonebook database), they have the same structure and nearly the same entries.
old:
ID | name | number | email | ...
----+--------------------+--------+-------+-----
1 | foo | 123 | ...
2 | bar | 456 |
3 | entrry with typo | 012345 |
4 | John Doe | 123345 |
new:
ID | name | number | email | ...
----+--------------------+--------+-------+-----
1 | foo | 123 | ...
2 | bar | 456 |
3 | entry without typo | 012345 |
4 | John Doe | 12345 |
5 | newly added entry | 09876 |
From this 'new' table I would like to select all rows that are different from the 'old' table, so the result would be:
result:
ID | name | number | email | ...
----+--------------------+--------+-------+-----
3 | entry without typo | 012345 | ...
4 | John Doe | 12345 |
5 | newly added entry | 09876 |
including all entries that have changed data plus all entries that don't appear in 'old' table...
To make it more complicated, there are about 10 columns in those tables (including ID, name, number, email and several flags and other info).
Is there a more performant way to do this, or will I have to compare each column with a separate query?
You'll have to do some comparison on the old records for correctness, but I think this is the most straightforward solution.
Update: I was a little confused about "including all entries that have changed data plus all entries that don't appear in 'old' table", so I added the WHERE and modified the join clause.
insert into result (id, name, number, email, ...)
select new.id, new.name, new.number, new.email, ...
from new
LEFT JOIN old
  ON new.ID = old.id
WHERE
  old.ID is null
  OR
  ( new.name <> old.name
    or
    new.number <> old.number
    or
    new.email <> old.email
    ...)
SELECT new.*
FROM new
JOIN old ON new.id = old.id
WHERE (CONCAT(new.ID,new.name,new.number,etc...) <> CONCAT(old.ID,old.name,old.number,etc...))
That should pull up any records in the new table where at least one of its fields differs from the equivalent record in the old table.
Assuming the IDs must match up in order to make the comparisons legitimate:
select n.*
from new n
left join old o on o.id = n.id
where o.id is null
or not (
    o.name = n.name
    and o.number = n.number
    and o.email = n.email
    and ...)
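Note: this handles rows that are missing from old, but not fields that can be NULL. Both (o.name <> n.name) and not (o.name = n.name) evaluate to NULL when either side is NULL, so a row whose only change involves a NULL is not returned. If the columns are nullable, use MySQL's null-safe equality operator instead, e.g. not (o.name <=> n.name and ...), which treats two NULLs as equal and a NULL as different from a non-NULL value.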
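How NULLs affect the comparison can be checked with a small experiment. The sketch below uses Python's sqlite3 module with an in-memory database, so it is a stand-in rather than MySQL: SQLite's IS operator plays the role of MySQL's null-safe <=>. The table names old_entries and new_entries, the helper changed_ids, and the email added to row 2 are all assumptions made for illustration and are not part of the question's data.

```python
import sqlite3

OLD = [
    (1, "foo", "123", None),
    (2, "bar", "456", None),
    (3, "entrry with typo", "012345", None),
    (4, "John Doe", "123345", None),
]
NEW = [
    (1, "foo", "123", None),
    (2, "bar", "456", "bar@example.com"),  # hypothetical: email filled in
    (3, "entry without typo", "012345", None),
    (4, "John Doe", "12345", None),
    (5, "newly added entry", "09876", None),
]


def changed_ids(eq):
    """Return ids of new or changed rows, comparing columns with operator eq."""
    conn = sqlite3.connect(":memory:")
    for table, rows in (("old_entries", OLD), ("new_entries", NEW)):
        conn.execute(
            f"CREATE TABLE {table} "
            "(id INTEGER PRIMARY KEY, name TEXT, number TEXT, email TEXT)"
        )
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
    # eq is '=' (plain equality) or 'IS' (SQLite's null-safe equality).
    sql = f"""
        SELECT n.id
        FROM new_entries n
        LEFT JOIN old_entries o ON o.id = n.id
        WHERE o.id IS NULL
           OR NOT (o.name {eq} n.name
                   AND o.number {eq} n.number
                   AND o.email {eq} n.email)
        ORDER BY n.id
    """
    ids = [r[0] for r in conn.execute(sql)]
    conn.close()
    return ids


print(changed_ids("="))   # prints [3, 4, 5]
print(changed_ids("IS"))  # prints [2, 3, 4, 5]
```

With plain equality, row 2 is missed: its email went from NULL to a value, the comparison yields NULL, and NOT NULL is still NULL. The null-safe comparison reports it.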