Update duplicate email addresses on mysql database table

I have a large database: my user table has roughly 10k rows, and there are 2700 duplicate email addresses in it.
The application did not stop users from registering repeatedly with the same email address. I have manually cleaned up the addresses that occurred more than twice (there weren't many), but there are still 2700 email addresses that occur at least twice. I want to update the duplicates by changing the email address on the row with the smaller id from something like "email#mail.com" to "1email#mail.com", that is, prepending "1" to one copy of each duplicated address. I can select and display the duplicate email addresses, but I could not find a way to update only one of the rows and leave the other one untouched.
My table structure is: id, username, email, password.

If you do not have MySQL 8:
Here I am just prepending the id of the row to the email address:
UPDATE my_table JOIN (
SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1
) sq ON my_table.email = sq.email AND my_table.id <> sq.max_id
SET my_table.email = CONCAT( my_table.id, my_table.email)
;
See DB-Fiddle
The inner query:
SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1
looks for all emails that are duplicated (i.e. there is more than one row with the same email address) and computes the maximum id value for each such email address. For the sample data in my DB-Fiddle demo, it would return the following:
| email | max_id | cnt |
| ---------------- | ------ | --- |
| emaila#dummy.com | 3 | 3 |
| emailb#dummy.com | 5 | 2 |
The above inner query is aliased as table sq.
Now if I join my_table with the above query as follows:
SELECT my_table.* from my_table join (
SELECT email, MAX(id) AS max_id, COUNT(*) AS cnt FROM my_table
GROUP BY email
HAVING cnt > 1
) sq on my_table.email = sq.email and my_table.id <> sq.max_id
I get:
| id | email |
| --- | ---------------- |
| 1 | emaila#dummy.com |
| 2 | emaila#dummy.com |
| 4 | emailb#dummy.com |
because I am selecting from my_table all rows that have duplicate email addresses (condition my_table.email = sq.email) except for the rows that have the highest value of id for each email address (condition my_table.id <> sq.max_id).
It is the ids from the above join whose email addresses are to be modified.
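Before running the UPDATE, it can help to preview which rows would change and what they would become. A minimal sketch, assuming the same my_table as above; the column aliases old_email and new_email are only illustrative:

```sql
-- Preview only: nothing is modified.
-- Lists every row the UPDATE above would touch, with the value it would receive.
SELECT t.id,
       t.email               AS old_email,
       CONCAT(t.id, t.email) AS new_email
FROM my_table t
JOIN (
    SELECT email, MAX(id) AS max_id
    FROM my_table
    GROUP BY email
    HAVING COUNT(*) > 1
) sq ON t.email = sq.email AND t.id <> sq.max_id
ORDER BY t.email, t.id;
```

If the preview looks right, the UPDATE can be run inside a transaction so it can still be rolled back.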

WITH cte AS ( SELECT id,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) rn
FROM sourcetable )
UPDATE sourcetable src, cte
SET src.email = CONCAT(rn - 1, src.email)
WHERE src.id = cte.id
AND cte.rn > 1;
I want to update the duplicate email addresses and change the email address with a smaller id number
If so, the ordering in the window function must be reversed, so that the row with the highest id gets rn = 1 and is left untouched:
WITH cte AS ( SELECT id,
email,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id DESC) rn
FROM sourcetable )
UPDATE sourcetable src, cte
SET src.email = CONCAT(rn - 1, src.email)
WHERE src.id = cte.id
AND cte.rn > 1;
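After either update, a quick check can confirm that no duplicates remain, since a prefixed value could in principle collide with an address that already exists. The sketch below assumes the sourcetable name used above; the index name uq_email is hypothetical, and adding the index will fail if any duplicates are still present.

```sql
-- Should return zero rows once every duplicate has been renamed.
SELECT email, COUNT(*) AS cnt
FROM sourcetable
GROUP BY email
HAVING COUNT(*) > 1;

-- Optional: make the database reject duplicate emails from now on.
-- uq_email is an illustrative name, not something from the original schema.
ALTER TABLE sourcetable ADD UNIQUE INDEX uq_email (email);
```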

Related

Update columns of a table which data contains from another table mysql

I am about to update records which contain data from another table. From the given example table data below, I want to update the NULL values of company_name and domain of log table:
User table
id | email_address
1 | test#gmail.com
2 | test#yahoo.com
Log table
id | user_id | company_name | domain
1 | 1 | NULL | NULL
2 | 1 | NULL | NULL
3 | 2 | Yahoo | yahoo.com
4 | 1 | Google Inc | gmail.com
Company_domain table
id | company | domain
1 | Google Inc | google.com,gmail.com,gmail.com.us
2 | Yahoo | yahoo.com,yahoomail.com
The company_name should be based on the domain of user email address. From the example log table above, the company_name of id #3 is Yahoo since the user_id=2 which is test#yahoo.com. This should also reflect on log.domain
My sql query below does not match with the company.
UPDATE user_log AS log INNER JOIN user AS u ON u.id=log.user_id
SET log.domain = (
select (SUBSTR(u.email_address, INSTR(u.email_address, '#') + 1))
),
log.company_name = (
SELECT company FROM company_domain
WHERE find_in_set(
(
SELECT (SUBSTR(log.domain, INSTR(log.domain, '#') + 1))
),
domain
)
);
Does anybody know?

I got this SQL query working locally. Note that the domain has to be derived from the email address, since the domain in the log table is NULL.
UPDATE user_log ul
INNER JOIN user u ON u.id = ul.user_id
SET
ul.domain = (
SELECT (SUBSTR(u.email_address, INSTR(u.email_address, '#') + 1))
),
ul.company_name = (
SELECT company FROM company_domain
WHERE FIND_IN_SET(
(
SELECT (SUBSTR(u.email_address, INSTR(u.email_address, '#') + 1))
),
domain
) LIMIT 1
);
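To check the matching before updating anything, the same lookup could be run as a plain SELECT. This is only a sketch: it assumes the user_log, user and company_domain tables from the question, keeps the '#' separator exactly as it appears in the sample data, and uses SUBSTRING_INDEX as a shorter way to take everything after the separator. The aliases derived_domain and derived_company are illustrative.

```sql
-- Preview only: shows what each incomplete log row would be filled with.
SELECT ul.id,
       SUBSTRING_INDEX(u.email_address, '#', -1) AS derived_domain,
       cd.company                                AS derived_company
FROM user_log ul
JOIN user u ON u.id = ul.user_id
LEFT JOIN company_domain cd
       ON FIND_IN_SET(SUBSTRING_INDEX(u.email_address, '#', -1), cd.domain) > 0
WHERE ul.company_name IS NULL
   OR ul.domain IS NULL;
```

Rows where derived_company comes back NULL have a domain that is not listed in company_domain.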

MySQL Query with the count, group by

Table: statistics
id | user | Message
----------------------
1 | user1 |message1
2 | user2 |message2
3 | user1 |message3
I am able to find the count of messages sent by each user using this query.
select user, count(*) from statistics group by user;
How to show message column data along with the count? For example
user | count | message
------------------------
user1| 2 |message1
|message3
user2| 1 |message2

You seem to want to show, for each user, the message count alongside every message that user sent.
If your MySQL version does not support window functions, you can compute a row number with a correlated subquery in the SELECT list, then display the user and count only on the row where rn = 1.
CREATE TABLE T(
id INT,
user VARCHAR(50),
Message VARCHAR(100)
);
INSERT INTO T VALUES(1,'user1' ,'message1');
INSERT INTO T VALUES(2,'user2' ,'message2');
INSERT INTO T VALUES(3,'user1' ,'message3');
Query 1:
SELECT (case when rn = 1 then user else '' end) 'users',
(case when rn = 1 then cnt else '' end) 'count',
message
FROM (
select
t1.user,
t2.cnt,
t1.message,
(SELECT COUNT(*) FROM T tt WHERE tt.user = t1.user AND t1.id >= tt.id) rn
from T t1
join (
select user, count(*) cnt
from T
group by user
) t2 on t1.user = t2.user
) t1
order by user,message
Results:
| users | count | message |
|-------|-------|----------|
| user1 | 2 | message1 |
| | | message3 |
| user2 | 1 | message2 |
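On MySQL 8.0 or later, where window functions are available, the same layout could be sketched without the correlated subquery. This assumes the table T created above; the derived-table alias x is arbitrary.

```sql
-- COUNT(*) OVER gives the per-user total on every row;
-- ROW_NUMBER() marks the first message of each user.
SELECT CASE WHEN rn = 1 THEN user ELSE '' END AS users,
       CASE WHEN rn = 1 THEN cnt  ELSE '' END AS `count`,
       message
FROM (
    SELECT id,
           user,
           message,
           COUNT(*)     OVER (PARTITION BY user)             AS cnt,
           ROW_NUMBER() OVER (PARTITION BY user ORDER BY id) AS rn
    FROM T
) x
ORDER BY user, rn;
```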

select user, count(*) as 'total' , group_concat(message) from statistics group by user;

You could join the result of your group by with the full table (or vice versa)?
Or, depending on what you want, you could use group_concat() using \n as separator.

Use Group_concat
select user, count(0) as ct,group_concat(Message) from statistics group by user;
This gives you the messages as a comma-separated list.
NOTE: by default, GROUP_CONCAT truncates its result at 1024 bytes in MySQL.
Because the limit is in bytes, multi-byte character sets can fit as few as 1024/3 = 341 characters for utf8 and 1024/4 = 256 for utf8mb4.
You can raise the limit through the group_concat_max_len system variable as needed, but take memory use into account in a production environment.
SET group_concat_max_len=100000000
Update:
You can use any separator in group_concat
Group_concat(Message SEPARATOR '----')
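Putting these pieces together, a sketch against the statistics table from the question might look like this. The limit value and the aliases cnt and messages are arbitrary choices for illustration.

```sql
-- Raise the limit for the current connection only.
SET SESSION group_concat_max_len = 1000000;

-- One row per user: the message count plus all messages,
-- in id order, one per line.
SELECT user,
       COUNT(*) AS cnt,
       GROUP_CONCAT(Message ORDER BY id SEPARATOR '\n') AS messages
FROM statistics
GROUP BY user;
```

Setting the variable at session level avoids changing the limit for every other connection.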

Try grouping with self-join:
select s1.user, s2.cnt, s1.message
from statistics s1
join (
select user, count(*) cnt
from statistics
group by user
) s2 on s1.user = s2.user

Groupwise maximum in larger query

Really struggling with a query that uses groupwise maximum, any help would be much appreciated. Feel free to point out if I should not be using groupwise maximum.
I have two tables application and email, one application can have many emails. What I'm trying to do in my query is get all details from application and join the email table (I'm actually only getting a foreign key from email for another table which indicates if the email has been replied to), getting the last email sent based on the max(timestamp), which is why I am trying to use groupwise maximum.
I've tried this, but it seems to make a duplicate of each row:
SELECT `application` . * , `email1`.`student_email_id` AS `email_student_email_id`
FROM `application`
LEFT JOIN (
SELECT MAX( tstamp ) AS tstamp, id, student_email_id, application_id
FROM email
GROUP BY id, student_email_id, application_id
) AS email1 ON `email1`.`application_id` = `application`.`id`
WHERE `application`.`status` = 'returned'
This is what seemed to work at first but is causing issues now and I'm sure it's pretty sloppy code:
select `application`.*, `email1`.`student_email_id` as `email_student_email_id`
from `application`
left join (
select student_email_id, max(tstamp) as tstamp, application_id
from email
group by application_id, tstamp
order by tstamp desc
limit 1) as email1 on `email1`.`application_id` = `application`.`id`
where `application`.`status` = 'returned'
Any guidance would be highly appreciated, if you need to see more code please ask! Thanks.
Further clarity if needed for my db set up and what should be happening (left out unimportant parts):
Application Table
+----+----------+
| id | status |
+----+----------+
| 1 | returned |
+----+----------+
Email Table
+----+------------+----------------+------------------+
| id | tstamp | application_id | student_email_id |
+----+------------+----------------+------------------+
| 1 | 2014-12-26 | 1 | NULL |
| 2 | 2014-12-27 | 1 | 3 |
+----+------------+----------------+------------------+
The query should be showing the following:
+----+----------+------------------------+
| id | status | email_student_email_id |
+----+----------+------------------------+
| 1 | returned | 3 |
+----+----------+------------------------+
First solution above shows duplicates of everything (maybe I'm nearly there) and second one shows null for the joined table columns, although I'm sure it did work at one stage or in isolation at least!

You're looking for the latest row in your Email table for each distinct application_id.
Your subquery to get that isn't quite right. Here's how you get that.
SELECT s.application_id, e.student_email_id
FROM email e
JOIN (
SELECT MAX(tstamp) tstamp, application_id
FROM email
GROUP BY application_id
) s ON e.application_id = s.application_id AND e.tstamp = s.tstamp
There's another way to do this, that might be more efficient. It will work if the id column is an autoincrement column.
SELECT s.application_id, e.student_email_id
FROM email e
JOIN (
SELECT MAX(id) id
FROM email
GROUP BY application_id
) s ON e.id = s.id
Either of these preceding subqueries gets the latest student_email_id for each application_id. The second one uses the JOIN to extract only the highest id number for each application_id, and uses that id to find the latest student_email_id.
Your subquery was this. It doesn't get what you hoped for.
SELECT MAX( tstamp ) AS tstamp, id, student_email_id, application_id /*wrong*/
FROM email
GROUP BY id, student_email_id, application_id
You grouped this by id. That means you're going to get all the detail rows. That's not what you want. Even this
SELECT MAX( tstamp ) AS tstamp, student_email_id, application_id /*wrong*/
FROM email
GROUP BY student_email_id, application_id
will give you more than one record for each application_id value.
So the query you need is:
SELECT application.* , email1.student_email_id AS email_student_email_id
FROM application
LEFT JOIN (
SELECT s.application_id, e.student_email_id
FROM email e
JOIN (
SELECT MAX(id) id
FROM email
GROUP BY application_id
) s ON e.id = s.id
) AS email1 ON email1.application_id = application.id
WHERE application.status = 'returned'
When you're designing queries like this, it's smart to test from the inside out, starting with the innermost subquery.
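If MySQL 8.0 or later is available, the "latest email per application" step could also be sketched with ROW_NUMBER(). This is an alternative to the MAX(id) approach above, not a replacement for it; it assumes the application and email tables from the question and breaks timestamp ties by id.

```sql
SELECT a.*, e.student_email_id AS email_student_email_id
FROM application a
LEFT JOIN (
    -- rn = 1 marks the most recent email of each application.
    SELECT application_id,
           student_email_id,
           ROW_NUMBER() OVER (PARTITION BY application_id
                              ORDER BY tstamp DESC, id DESC) AS rn
    FROM email
) e ON e.application_id = a.id AND e.rn = 1
WHERE a.status = 'returned';
```

Keeping the rn = 1 condition in the ON clause, rather than in WHERE, preserves applications that have no emails at all.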

How to select SQL rows where a specific column is not the same value as another row?

I have a table with some columns, one of which is "email".
I want to select the rows in this table, where there is no duplicate value in "email".
Meaning if the table was like this:
id - email
10 - hello#hello.com
11 - bro#lift.com
12 - hello#hello.com
13 - hey#hello.com
The query would return only id 11 and 13, as 10 and 12 are duplicates.

I'll recommend the query that uses JOIN.
SELECT *
FROM tableName
WHERE email IN
(
SELECT email
FROM tableName
GROUP BY email
HAVING COUNT(*) = 1
)
or using JOIN
SELECT a.*
FROM tableName a
INNER JOIN
(
SELECT email
FROM tableName
GROUP BY email
HAVING COUNT(*) = 1
) b ON a.email = b.email
For better performance, define an index on the email column.

Try this:
SELECT *
FROM Emails
WHERE email NOT IN(SELECT email
FROM Emails
GROUP BY email
HAVING COUNT(email) > 1);
This will give you:
| ID | EMAIL |
----------------------
| 11 | bro#lift.com |
| 13 | hey#hello.com |
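On MySQL 8.0 or later, the same filter could be sketched with a window function instead of a subquery on the grouped table. This assumes the Emails table used above; the alias counted is arbitrary.

```sql
-- Keep only rows whose email appears exactly once in the table.
SELECT id, email
FROM (
    SELECT id,
           email,
           COUNT(*) OVER (PARTITION BY email) AS cnt
    FROM Emails
) counted
WHERE cnt = 1
ORDER BY id;
```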

MySQL getting the lowest ID for a certain user -or- the ID of the entry with the highest urgency for each row

I have the following database
id | user | urgency | problem | solved
The information in there has different users, but these users all have multiple entries
1 | marco | 0 | MySQL problem | n
2 | marco | 0 | Email problem | n
3 | eddy | 0 | Email problem | n
4 | eddy | 1 | MTV doesn't work | n
5 | frank | 0 | out of coffee | y
What I want to do is this: Normally I would check everybody's oldest problem first. I use this query to get the ID's of the oldest problem.
select min(id) from db group by user
This gives me a list of the oldest problem IDs. But I want people to be able to make a certain problem more urgent. So for each user I want either the lowest ID or, if one of their problems has a higher urgency, the ID of the problem with the highest urgency.
Getting the max(urgency) won't give me the ID of the problem; it will only give me the max urgency.
To be clear: I want to get this as a result
row | id
0 | 1
1 | 4
The last entry should not be in the results, since it is solved.

Select ...
From SomeTable As T
Join (
Select T1.User, Min( T1.Id ) As Id
From SomeTable As T1
Join (
Select T2.User, Max( T2.Urgency ) As Urgency
From SomeTable As T2
Where T2.Solved = 'n'
Group By T2.User
) As MaxUrgency
On MaxUrgency.User = T1.User
And MaxUrgency.Urgency = T1.Urgency
Where T1.Solved = 'n'
Group By T1.User
) As Z
On Z.User = T.User
And Z.Id = T.Id
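On MySQL 8.0 or later, the same per-user pick (highest urgency first, lowest id as the tie-breaker, unsolved rows only) could be sketched with ROW_NUMBER(). It assumes the SomeTable name and columns used in the query above; the alias ranked is arbitrary.

```sql
SELECT Id
FROM (
    -- Within each user, rank unsolved problems by urgency (highest first),
    -- then by Id (oldest first).
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY User
                              ORDER BY Urgency DESC, Id ASC) AS rn
    FROM SomeTable
    WHERE Solved = 'n'
) AS ranked
WHERE rn = 1
ORDER BY Id;
```

For the sample rows in the question this picks id 1 for marco and id 4 for eddy, and skips frank because his only problem is solved.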

There are lots of esoteric ways to do this, but here's one of the clearer ones.
First build a query to get your min id and max urgency:
SELECT
user,
MIN(id) AS min_id,
MAX(urgency) AS max_urgency
FROM
db
GROUP BY
user
Then incorporate that as a logical table into a larger query for your answers:
SELECT
user,
min_id,
max_urgency,
( SELECT MIN(id) FROM db
WHERE user = a.user
AND urgency = a.max_urgency
) AS max_urgency_min_id
FROM
(
SELECT
user,
MIN(id) AS min_id,
MAX(urgency) AS max_urgency
FROM
db
GROUP BY
user
) AS a
Given the obvious indexes, this should be pretty efficient.

The following will get you exactly one row back -- the most urgent, probably oldest problem in your table.
select id from my_table where id = (
select min(id) from my_table where urgency = (
select max(urgency) from my_table
)
)
I was about to suggest adding a create_date column to your table so that you could get the oldest problem first for those problems of the same urgency level. But I'm now assuming you're using the lowest ID for that purpose.
But now I see you wanted a list of them. For that, you'd sort the results by ID:
select id from my_table where urgency = (
select max(urgency) from my_table
) order by id;
[Edit: Left out the order by!]
I forget, honestly, how to get the row number. Someone on the interwebs suggests something like this, but no idea if it works:
select @rownum:=@rownum+1 'row', id from my_table where ...
