Deleting partially duplicate rows from a table [duplicate] - mysql

We have a table with the following schema:
+--------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(128) | YES | | NULL | |
| label | varchar(10) | NO | | NULL | |
| f1 | varchar(8) | YES | | NULL | |
| f2 | varchar(6) | YES | | NULL | |
+--------+------------------+------+-----+---------+----------------+
We would like to consider all rows where the name-label pair is non-distinct to be duplicates (i.e. ignoring f1 and f2). In such cases we would like to delete the duplicate rows by retaining only the row with the highest id (which incidentally would have been entered in the table at a later time, and therefore assumed to be more current).
What would be the most efficient way to realize it in MySQL 5.6.51?

Join the table to itself and delete the row with the smaller id in each matching pair:
DELETE a
FROM duplicates a
JOIN duplicates b ON a.label = b.label AND a.name = b.name
WHERE a.id < b.id
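name can be NULL, and NULL = NULL is never true, so the query above leaves rows with a NULL name untouched. If those rows should also count as duplicates, compare with COALESCE (note that this also treats NULL and the empty string as the same name):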
DELETE a
FROM duplicates a
JOIN duplicates b ON a.label = b.label AND COALESCE(a.name, '') = COALESCE(b.name, '')
WHERE a.id < b.id
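The effect of the second query can be checked on a small data set. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL and made-up sample rows. SQLite has no `DELETE ... JOIN`, so the same condition is written with `EXISTS`, which is a swapped-in form rather than the statement from the answer.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE duplicates ("
    " id INTEGER PRIMARY KEY, name TEXT, label TEXT NOT NULL, f1 TEXT, f2 TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO duplicates (id, name, label) VALUES (?, ?, ?)",
    [
        (1, "a", "x"),
        (2, "a", "x"),   # same (name, label) as id 1, higher id
        (3, "a", "y"),   # same name, different label: not a duplicate
        (4, None, "z"),
        (5, None, "z"),  # NULL name, same label as id 4
        (6, "b", "x"),
    ],
)

# Delete every row for which a row with the same (name, label) and a higher id exists.
# COALESCE makes two NULL names compare as equal, as in the second query above.
conn.execute(
    """
    DELETE FROM duplicates
    WHERE EXISTS (
        SELECT 1
        FROM duplicates AS b
        WHERE b.label = duplicates.label
          AND COALESCE(b.name, '') = COALESCE(duplicates.name, '')
          AND b.id > duplicates.id
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM duplicates ORDER BY id")]
print(remaining_ids)
```

Only the highest id of each (name, label) group should survive. On a large MySQL table, an index on (label, name) would likely help the self-join find matching rows.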

Related

MySQL - UPDATE one column based on results of a SELECT when the SELECT returns multiple columns

I've read MySQL - UPDATE query based on SELECT Query and am trying to do something similar - i.e. run an UPDATE query on a table and populate it with the results from a SELECT.
In my case the table I want to update is called substances and has a column called cas_html which is supposed to store CAS Numbers (chemical codes) as an HTML string.
Due to the structure of the database I am running the following query, which gives me a result set of the substance ID and name (substances.id, substances.name) and the CAS as an HTML string (cas_values, which comes from cas.value):
SELECT s.`id`,
       GROUP_CONCAT(c.`value` ORDER BY c.`id` SEPARATOR '<br>') cas_values,
       GROUP_CONCAT(s.`name` ORDER BY s.`id`) substance_name
FROM substances s
LEFT JOIN cas_substances cs ON s.id = cs.substance_id
LEFT JOIN cas c ON cs.cas_id = c.id
GROUP BY s.id;
Sample output:
id | cas_values                  | substance_name
---|-----------------------------|---------------
1  | 133-24<br>455-213<br>21-234 | Chemical A
2  | 999-23                      | Chemical B
3  |                             | Chemical C
As you can see the cas_values column contains the HTML string (which may also be an empty string as in the case of "Chemical C"). I want to write the data in the cas_values column into substances.cas_html. However I can't piece together how to do this because other posts I'm reading get the data for the UPDATE in one column - I have other columns returned by my SELECT query.
Essentially the problem is that in my "sample output" table above I have 3 columns being returned. Other SO posts seem to have just 1 column being returned which is the actual values that are used in the UPDATE query (in this case on the substances table).
Is this possible?
I am using MySQL 5.5.56-MariaDB
These are the structures of the tables, if this helps:
mysql> DESCRIBE substances;
+-------------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-----------------------+------+-----+---------+----------------+
| id | mediumint(8) unsigned | NO | PRI | NULL | auto_increment |
| app_id | varchar(8) | NO | UNI | NULL | |
| name | varchar(1500) | NO | | NULL | |
| date | date | NO | | NULL | |
| cas_html | text | YES | | NULL | |
+-------------+-----------------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)
mysql> DESCRIBE cas;
+-------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+----------------+
| id | mediumint(8) unsigned | NO | PRI | NULL | auto_increment |
| value | varchar(13) | NO | UNI | NULL | |
+-------+-----------------------+------+-----+---------+----------------+
2 rows in set (0.01 sec)
mysql> DESCRIBE cas_substances;
+--------------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-----------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| cas_id | mediumint(8) unsigned | NO | MUL | NULL | |
| substance_id | mediumint(8) unsigned | NO | MUL | NULL | |
+--------------+-----------------------+------+-----+---------+----------------+
3 rows in set (0.02 sec)
Try something like this. The derived table t can return any number of columns; only the ones you reference in SET and WHERE are used, so the extra substance_name column does no harm:
UPDATE substances AS s,
(
    SELECT s.`id`,
           GROUP_CONCAT(c.`value` ORDER BY c.`id` SEPARATOR '<br>') cas_values,
           GROUP_CONCAT(s.`name` ORDER BY s.`id`) substance_name
    FROM substances s
    LEFT JOIN cas_substances cs ON s.id = cs.substance_id
    LEFT JOIN cas c ON cs.cas_id = c.id
    GROUP BY s.id
) AS t
SET s.cas_html = t.cas_values
WHERE s.id = t.id
If you don't want to modify every row at once, you can test the update on a single row first by adding a condition to the WHERE clause, for example:
...
WHERE s.id = t.id AND s.id = 1
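To try the idea on sample data, here is a minimal sketch that assumes an in-memory SQLite database in place of MariaDB and uses the values from the question's sample output. SQLite does not support the multi-table `UPDATE` shown above, so this sketch uses a correlated subquery instead, and it does not pin the order in which the CAS values are concatenated.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MariaDB (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE substances (id INTEGER PRIMARY KEY, name TEXT NOT NULL, cas_html TEXT);
    CREATE TABLE cas (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE);
    CREATE TABLE cas_substances (
        id INTEGER PRIMARY KEY, cas_id INTEGER NOT NULL, substance_id INTEGER NOT NULL
    );
    INSERT INTO substances (id, name)
        VALUES (1, 'Chemical A'), (2, 'Chemical B'), (3, 'Chemical C');
    INSERT INTO cas (id, value)
        VALUES (1, '133-24'), (2, '455-213'), (3, '21-234'), (4, '999-23');
    INSERT INTO cas_substances (cas_id, substance_id)
        VALUES (1, 1), (2, 1), (3, 1), (4, 2);
    """
)

# For each substance, concatenate its CAS values and store them in cas_html.
# A substance without any CAS value gets NULL, because the aggregate sees no rows.
conn.execute(
    """
    UPDATE substances
    SET cas_html = (
        SELECT group_concat(c.value, '<br>')
        FROM cas_substances AS cs
        JOIN cas AS c ON cs.cas_id = c.id
        WHERE cs.substance_id = substances.id
    )
    """
)

cas_html_by_id = dict(conn.execute("SELECT id, cas_html FROM substances"))
print(cas_html_by_id)
```

Only one value per row is written here, which mirrors the point of the answer: the SELECT may compute more than the UPDATE needs, and the UPDATE simply picks the column it wants.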

Select the most recent note/comment from an application [duplicate]

I have 2 relevant tables here, application and application_note.
I want to find the latest note (user and text/note) for each application.
application_note looks like
+----------------+-------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+-------------------+-----------------------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| create_time | timestamp | NO | | CURRENT_TIMESTAMP | |
| update_time | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| application_id | int(11) | YES | MUL | NULL | |
| note | text | YES | | NULL | |
| user | varchar(50) | YES | | NULL | |
+----------------+-------------+------+-----+-------------------+-----------------------------+
I've been trying a bunch of different queries. The closest thing I have is
SELECT user, note, application_id, MAX(create_time)
FROM application_note
GROUP BY application_id
In the result, MAX(create_time) is what I expect, but the other values (user and note) do not come from the same row. I've been trying at this for a while and have been getting nowhere.
edit: I plan to eventually add this to a view or to a larger query, if that changes anything.
You have to join the table back on itself:
SELECT b.application_id, b.ct, a.note, a.user
FROM (SELECT application_id, MAX(create_time) AS ct
FROM application_note
GROUP BY application_id) b
INNER JOIN application_note a ON a.application_id=b.application_id
AND a.create_time=b.ct
This returns the record with the latest create_time for each application_id. Duplicates can occur when several records share the same create_time for a given application_id.
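The tie behaviour mentioned above can be reproduced on a few rows. This is a minimal sketch, assuming an in-memory SQLite database instead of MySQL and invented sample notes. The second query is one possible way to break ties (highest id wins among equal timestamps); that rule is an assumption, not something the question specifies.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE application_note ("
    " id INTEGER PRIMARY KEY, create_time TEXT NOT NULL,"
    " application_id INTEGER, note TEXT, user TEXT)"
)
# Hypothetical sample notes; application 3 has two notes with the same create_time.
conn.executemany(
    "INSERT INTO application_note (id, create_time, application_id, note, user)"
    " VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-01 10:00:00", 1, "first", "ann"),
        (2, "2020-01-02 10:00:00", 1, "second", "bob"),
        (3, "2020-01-01 09:00:00", 2, "only", "cat"),
        (4, "2020-01-05 12:00:00", 3, "tie a", "dan"),
        (5, "2020-01-05 12:00:00", 3, "tie b", "eve"),
    ],
)

# Join each application's MAX(create_time) back to the notes table.
latest_by_time = conn.execute(
    """
    SELECT a.application_id, a.id, a.user, a.note
    FROM (SELECT application_id, MAX(create_time) AS ct
          FROM application_note
          GROUP BY application_id) AS b
    JOIN application_note AS a
      ON a.application_id = b.application_id AND a.create_time = b.ct
    ORDER BY a.application_id, a.id
    """
).fetchall()

# Tie-break variant: keep a note only if no later note exists, where "later"
# means a greater create_time, or the same create_time and a greater id.
latest_unique = conn.execute(
    """
    SELECT a.application_id, a.id, a.user, a.note
    FROM application_note AS a
    WHERE NOT EXISTS (
        SELECT 1
        FROM application_note AS n
        WHERE n.application_id = a.application_id
          AND (n.create_time > a.create_time
               OR (n.create_time = a.create_time AND n.id > a.id))
    )
    ORDER BY a.application_id
    """
).fetchall()

print(latest_by_time)
print(latest_unique)
```

The first query returns two rows for application 3, the second returns exactly one row per application.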

SQL join not working (or very slow)

I have the following tables in mysql:
Table A:
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| sid | varchar(50) | YES | | NULL | |
| type | int(11) | YES | | NULL | |
+-------+-------------+------+-----+---------+-------+
Table B:
+---------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| channel | varchar(20) | YES | | NULL | |
| sid | varchar(50) | YES | | NULL | |
| type | varchar(20) | YES | | NULL | |
+---------+-------------+------+-----+---------+-------+
I want to find the rows from A that have an entry in B with the same sid. I tried the following Join command:
SELECT A.sid FROM A join B on A.sid=B.sid;
This query never gives me the answer.
Table A has 465420 entries and table B has 291326 entries.
Why does it not work?
Are there too many entries?
Or does it have anything to do with the fact that I have no primary keys assigned?
Your query is fine. You would appear to need an index. I would suggest B(sid).
You can also write the query as:
select a.sid
from a
where exists (select 1 from b where a.sid = b.sid);
This will not affect performance -- unless there are lots of duplicates in b -- but it will eliminate issues caused by duplicates in b.
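The difference between the two forms shows up as soon as B contains the same sid more than once. Below is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL, three invented rows per table, and a hypothetical index name `idx_b_sid`.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (sid TEXT, type INTEGER);
    CREATE TABLE B (channel TEXT, sid TEXT, type TEXT);
    INSERT INTO A (sid, type) VALUES ('s1', 1), ('s2', 2), ('s3', 3);
    -- 's1' appears twice in B.
    INSERT INTO B (channel, sid, type)
        VALUES ('c1', 's1', 'x'), ('c2', 's1', 'y'), ('c3', 's3', 'z');
    -- The index suggested in the answer; the name is hypothetical.
    CREATE INDEX idx_b_sid ON B (sid);
    """
)

# Plain join: one output row per matching pair, so 's1' is returned twice.
join_sids = [
    row[0]
    for row in conn.execute(
        "SELECT A.sid FROM A JOIN B ON A.sid = B.sid ORDER BY A.sid"
    )
]

# EXISTS: each row of A is returned at most once, however many matches B has.
exists_sids = [
    row[0]
    for row in conn.execute(
        "SELECT a.sid FROM A AS a"
        " WHERE EXISTS (SELECT 1 FROM B AS b WHERE a.sid = b.sid)"
        " ORDER BY a.sid"
    )
]

print(join_sids)
print(exists_sids)
```

The `CREATE INDEX` statement has the same form in MySQL. Without it, each lookup into B has to scan the table, which is the likely reason the original query appears never to finish.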
Try
SELECT A1.sid
FROM (select A.sid from A order by sid) A1
join (select B.sid from B order by sid) B1
on A1.sid=B1.sid;
Otherwise the answer above holds: you need an index.

Updating 10 million rows while having a check with another table

There are 12 million rows in table A and 10 million in table B.
Both tables have a common field, say user_id.
I've added a column to table A to hold the primary key of B.
So the tables look something like this:
Table A
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | NO | | NULL | |
| b_id | int (11) | YES | MUL | NULL | |
+-------------+--------------+------+-----+---------+----------------+
Table B
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
Now I want to update b_id in table A. To do that, I wrote the following query:
update A
set A.b_id = (select B.id from B
where A.user_id = B.user_id
);
But even after indexing and running it in chunks of 100K rows, it takes a really long time (around 3 minutes per chunk).
Is there a better and faster way to update it?
In MySQL, a multi-table UPDATE puts the join before SET:
update A
INNER JOIN B ON A.user_id = B.user_id
SET A.b_id = B.id;
Make sure you have indexes set up on the user_id columns.
Can't comment yet: joins have historically been faster than correlated subqueries in MySQL, as long as there isn't duplicated data.
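The chunked approach from the question can be sketched as a loop over primary-key ranges. This is only an illustration, assuming an in-memory SQLite database in place of MySQL, ten invented rows, a hypothetical helper `update_in_chunks`, and at most one row in B per user_id. It uses the correlated-subquery form, since join-update syntax differs between engines.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, b_id INTEGER);
    CREATE TABLE B (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
    -- Index on the lookup column, as recommended above (hypothetical name).
    CREATE INDEX idx_b_user_id ON B (user_id);
    """
)
# B has users 1..8 with ids 101..108; A has users 1..10, so two rows have no match.
conn.executemany("INSERT INTO B (id, user_id) VALUES (?, ?)", [(100 + u, u) for u in range(1, 9)])
conn.executemany("INSERT INTO A (id, user_id) VALUES (?, ?)", [(i, i) for i in range(1, 11)])


def update_in_chunks(conn, chunk_size):
    """Fill A.b_id one primary-key range at a time; returns the number of chunks run."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM A").fetchone()[0]
    chunks = 0
    for start in range(1, max_id + 1, chunk_size):
        conn.execute(
            "UPDATE A SET b_id = (SELECT B.id FROM B WHERE B.user_id = A.user_id)"
            " WHERE A.id BETWEEN ? AND ?",
            (start, start + chunk_size - 1),
        )
        conn.commit()  # commit per chunk to keep each transaction small
        chunks += 1
    return chunks


chunks_run = update_in_chunks(conn, chunk_size=4)
b_ids = [row[0] for row in conn.execute("SELECT b_id FROM A ORDER BY id")]
print(chunks_run, b_ids)
```

Chunking by the primary key of A keeps each range scan cheap, and the index on B.user_id keeps each lookup cheap. Rows of A without a partner in B end up with NULL in b_id.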

Select from another table if result not matched

This has been asked before, but I can't quite map the existing questions onto my case.
I have two tables. The first is credentials, with id, username and email.
The second is email-alias, with user-id (which corresponds to credentials.id) and email. Emails in credentials are usually of the form "user.name#domain.com", while in email-alias they would be "usern#domain.com".
All I have now is
SELECT `username` FROM `credentials` WHERE `email` LIKE ?
But the email will not always match if I query with "usern#domain.com". What I want is a single query that returns the username and falls back to email-alias: it would use "usern#domain.com" to get a user-id from there, and then use that id in credentials to find the username.
The pitfall is that the supplied email could be either form, the aliased "usern#.." or the full "user.name#..".
mysql> describe `email-alias`;
+---------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user-id | int(11) | NO | UNI | NULL | |
| email | varchar(100) | YES | | NULL | |
+---------+--------------+------+-----+---------+----------------+
mysql> describe `credentials`;
+-----------------+--------------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| username | varchar(100) | NO | UNI | NULL | |
| email | varchar(100) | NO | | NULL | |
+-----------------+--------------+------+-----+---------------------+----------------+
Your question is a bit confusing, but I think you want to select the username from the first table if the email matches there, and otherwise look the email up in the second table. You can use a subquery for that. Note that `email-alias` and `user-id` contain a hyphen, so they must be quoted with backticks, and the backticks go around each identifier separately. Hope this helps.
SELECT c.`username`
FROM credentials c
WHERE c.email = 'usern#domain.com' OR c.id IN
(SELECT e.`user-id` FROM `email-alias` e WHERE e.email = 'usern#domain.com')
I'm not sure I understand the question, but let me give it a try. You need to join the two tables using INNER JOIN:
SELECT a.`username`
FROM credentials a
INNER JOIN `email-alias` b
on a.id = b.`user-id`
WHERE b.email = 'usern#domain.com'
UPDATE 1
To match on only part of the address, use LIKE:
SELECT a.`username`
FROM credentials a
INNER JOIN `email-alias` b
on a.id = b.`user-id`
WHERE b.email LIKE '%usern#%'
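Since the supplied address may be either the full one or the alias, one option is to check both tables in a single query with a LEFT JOIN. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL, invented sample users, and a hypothetical helper `find_username`. SQLite quotes the hyphenated names with double quotes where MySQL would use backticks.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE credentials (
        id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE, email TEXT NOT NULL
    );
    CREATE TABLE "email-alias" (
        id INTEGER PRIMARY KEY, "user-id" INTEGER NOT NULL UNIQUE, email TEXT
    );
    INSERT INTO credentials (id, username, email)
        VALUES (1, 'username', 'user.name#domain.com'),
               (2, 'other', 'other.person#domain.com');
    INSERT INTO "email-alias" ("user-id", email) VALUES (1, 'usern#domain.com');
    """
)


def find_username(conn, email):
    """Return the username whose primary or alias email equals `email`, else None."""
    row = conn.execute(
        """
        SELECT c.username
        FROM credentials AS c
        LEFT JOIN "email-alias" AS e ON e."user-id" = c.id
        WHERE c.email = ? OR e.email = ?
        LIMIT 1
        """,
        (email, email),
    ).fetchone()
    return row[0] if row else None


print(find_username(conn, "usern#domain.com"))
print(find_username(conn, "user.name#domain.com"))
```

The LEFT JOIN keeps users that have no alias row, so a lookup by the full address still works for them, which an INNER JOIN on the alias table alone would not cover.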