Removing duplicate data from many rows in MySQL?

I am a web developer so my knowledge of manipulating mass data is lacking.
A coworker is looking for a solution to our data problems. We have a table of about 400k rows with company names listed.
Whoever designed this didn't realize there needed to be some kind of unique identifier for a company, so there are duplicate entries for company names.
What method would one use to match these records up by company name and delete the duplicates based on some criterion (another column)?
I was thinking of writing a script to do this in PHP, but I have a hard time believing that my script could finish while making comparisons between so many rows. Any advice?

Answer:
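1) delete from table1
2) USING table1, table1 as vtable
3) WHERE (table1.ID > vtable.ID)
4) AND (table1.field_name = vtable.field_name)
Line 1 tells MySQL that rows will be deleted from table1.
Line 2 tells it to use table1 together with a virtual table (the alias vtable) holding the same rows as table1.
Line 3 keeps MySQL from comparing a record with itself: a row is only matched against rows with a lower ID, so the row with the lowest ID in each group survives. (Written as NOT table1.ID > vtable.ID, the condition would also match every row against itself and delete the whole table.)
Line 4 tells it that there shouldn't be two records with the same field_name.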

The way I've done this in the past is to write a query that returns only the set I want (usually using DISTINCT + a subquery to determine the right record based on other values), and insert that into a different table. You can then delete the old table and rename the new one to the old name.
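The copy-and-rename approach can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in so it runs without a MySQL server, and the table layout (`companies` with `name` and `col`), the helper name `dedupe_by_copy`, and the rule "keep the highest `col` per name" are all assumptions, not something from the question.

```python
import sqlite3

def dedupe_by_copy(conn):
    """Rebuild `companies`, keeping for each name only the row with the highest `col`."""
    conn.executescript("""
        -- 1. Write the wanted set into a new table (DISTINCT collapses exact ties).
        CREATE TABLE companies_new AS
        SELECT DISTINCT c.name, c.col
        FROM companies AS c
        WHERE c.col = (SELECT MAX(c2.col)
                       FROM companies AS c2
                       WHERE c2.name = c.name);
        -- 2. Drop the old table and give the new one its name.
        DROP TABLE companies;
        ALTER TABLE companies_new RENAME TO companies;
    """)

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, col INTEGER)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Acme", 1), ("Acme", 3), ("Acme", 3), ("Globex", 2)],
)
dedupe_by_copy(conn)
rows = sorted(conn.execute("SELECT name, col FROM companies"))
print(rows)  # [('Acme', 3), ('Globex', 2)]
```

Note that a table created with CREATE TABLE ... AS SELECT does not inherit indexes or constraints from the old one, so those would have to be recreated before the swap.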

To find the companies that have duplicates in your table, you can use a query like this:
SELECT NAME
FROM companies
GROUP BY NAME
HAVING COUNT(*) > 1
And the following will delete all duplicates except the row holding the maximum value of the col column for each name (rows that tie for the maximum are all kept):
DELETE del
FROM companies AS del
INNER JOIN (
SELECT NAME, MAX(col) AS col
FROM companies
GROUP BY NAME
HAVING COUNT(*) > 1
) AS sub
ON del.NAME = sub.NAME AND del.col <> sub.col
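To try the "keep the maximum, delete the rest" idea without touching real data, here is a small sketch using Python's built-in sqlite3. SQLite does not support the DELETE ... JOIN syntax used above, so the sketch swaps in a correlated subquery with the same effect; the helper name `delete_all_but_max` and the sample rows are made up for illustration.

```python
import sqlite3

def delete_all_but_max(conn):
    """Delete every row whose `col` is below the maximum `col` for its name.

    Returns the number of rows deleted. Rows that tie for the maximum
    are all kept, as in the JOIN-based statement.
    """
    cur = conn.execute("""
        DELETE FROM companies
        WHERE col < (SELECT MAX(c2.col)
                     FROM companies AS c2
                     WHERE c2.name = companies.name)
    """)
    return cur.rowcount

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, col INTEGER)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("Acme", 1), ("Acme", 3), ("Acme", 2), ("Globex", 5)],
)
deleted = delete_all_but_max(conn)
remaining = sorted(conn.execute("SELECT name, col FROM companies"))
print(deleted, remaining)  # 2 [('Acme', 3), ('Globex', 5)]
```

On 400k rows either form should have an index on the name column to work with, otherwise the per-row lookup becomes the bottleneck.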

Related

How to delete rows in a mySQL table specified by data from the same table (in one expression)? [duplicate]

This question already has answers here:
MySQL Error 1093 - Can't specify target table for update in FROM clause
I want to delete rows from a table "Book", where two columns have appeared before.
I successfully selected the ids of the rows which should be deleted like so:
SELECT all_dupes.book_id
FROM (SELECT *
FROM Book as BBook NATURAL JOIN Book as BBBook
WHERE book_id NOT IN (SELECT book_id
FROM Book as BBook NATURAL JOIN Book as BBBook
GROUP BY buying_price,
selling_price
HAVING Count(*) = 1
ORDER BY book_id)
ORDER BY book_id) AS all_dupes
WHERE book_id NOT IN (SELECT book_id
FROM Book as BBook NATURAL JOIN Book as BBBook
GROUP BY buying_price,
selling_price
HAVING Count(*) >= 2
ORDER BY book_id);
…but when I try to delete the rows with
DELETE FROM Book
WHERE book_id IN (
<expression as above without trailing ;>
) ;
I get an error ERROR 1093 (HY000): Table 'Book' is specified twice, both as a target for 'DELETE' and as a separate source for data
I already tried to alias the table and to natural join the table to itself, like suggested in other questions regarding this issue.
I have also read quite a number of questions here, but they are mostly very specific and I don't see how to change my delete query based on the answers provided there.
What do I have to change in order to get this done? Splitting the expression is not an option (meaning there mustn't be two ;, but just one expression).
Database used: MariaDB
There are a few problems with your SQL. I would fix those first, even though your DBMS isn't rejecting the query. It might fix the problem, because in your DELETE statement you might have finally pushed the system past its limit. In any case it will clarify the question.
ORDER BY is, in standard SQL, permitted only once, in the outermost SELECT clause. It is a way to return the rows to the calling process in a particular order, not a way to express order internally to the SQL processor. Your extra ORDER BYs don't affect your query, so remove them.
GROUP BY should repeat any column names not aggregated in the SELECT clause. Because you select book_id, you should also group by book_id.
I doubt you actually need all those joins anyway. I'm not sure what you're trying to do, but I think your query might just be
delete from Book
where exists ( select 1
from Book as B
where B.book_id = Book.book_id
group by B.book_id, B.buying_price, B.selling_price
having count(*) > 1
)
That would eliminate all rows with a book_id for which any combination of {book_id, buying_price, selling_price} is not unique. But I'm not sure that's what you really want.
I want to delete rows from a table "Book", where two columns have appeared before.
Yeah, there is no "before" in SQL, because there's no order. I think what you mean is that if you have 3 "duplicate" rows, you'd like to eliminate the extra 2. SQL has no such operation.
SQL operates by predicate logic: rows are deleted according to whether or not they match some criteria. Duplicate rows, by definition, all meet the same criteria. Because there's no order, there's no notion of deleting all those that match except the first one.
The best solution, it must be said, is to prevent duplication in the first place by correctly declaring uniqueness in the table definition. Failing that, the remedy is usually to insert the distinct set into a temporary table, delete in the main table all those that exist in the temporary one, insert from the temporary into the main, and drop the temporary table.
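The temporary-table remedy could look roughly like this. It is a sketch only, run against Python's built-in sqlite3 rather than MariaDB, and it assumes that "duplicate" means the same (buying_price, selling_price) pair, that neither price is NULL, and that the row with the lowest book_id is the one to keep; the temporary table name `book_keep` is made up.

```python
import sqlite3

def dedupe_via_temp_table(conn):
    """Keep one Book row per (buying_price, selling_price) pair."""
    conn.executescript("""
        -- 1. Insert the distinct set into a temporary table.
        CREATE TEMP TABLE book_keep AS
        SELECT MIN(book_id) AS book_id, buying_price, selling_price
        FROM Book
        GROUP BY buying_price, selling_price;

        -- 2. Delete from the main table everything that exists in the temporary one.
        DELETE FROM Book
        WHERE EXISTS (SELECT 1 FROM book_keep AS k
                      WHERE k.buying_price = Book.buying_price
                        AND k.selling_price = Book.selling_price);

        -- 3. Insert from the temporary table back into the main table.
        INSERT INTO Book (book_id, buying_price, selling_price)
        SELECT book_id, buying_price, selling_price FROM book_keep;

        -- 4. Drop the temporary table.
        DROP TABLE book_keep;
    """)

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Book (book_id INTEGER PRIMARY KEY, buying_price INTEGER, selling_price INTEGER)"
)
conn.executemany(
    "INSERT INTO Book VALUES (?, ?, ?)",
    [(1, 10, 15), (2, 10, 15), (3, 20, 30), (4, 10, 15), (5, 20, 35)],
)
dedupe_via_temp_table(conn)
books = sorted(conn.execute("SELECT book_id, buying_price, selling_price FROM Book"))
print(books)  # [(1, 10, 15), (3, 20, 30), (5, 20, 35)]
```

This takes several statements, so it does not satisfy the "single expression" constraint in the question; it only illustrates the remedy described above.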

Delete Query deletes all records without leaving the original ones when duplicates are detected

I have two tables, tbl_NTE and tbl_PAH. Some records in tbl_PAH are already available in tbl_NTE, so I created an append query to automatically transfer and update those records. As a result, duplicates are created every time I click the save button, because the save button triggers the append query.
I want to run a query where all the data with duplicates are deleted and just leave the original ones.
I created a delete query and typed the criteria:-
In (SELECT [CaseIDNo]
FROM [tbl_PAH] As Tmp GROUP BY [CaseIDNo]
HAVING Count(*)>1 )
I've also tried Last, First, Max and Group By as criteria, but all it does is delete all the records as well.
In (SELECT DISTINCTROW tbl_PAH.CaseIDNo
FROM tbl_PAH
GROUP BY tbl_PAH.CaseIDNo
HAVING (((tbl_PAH.CaseIDNo) In (SELECT Last(tbl_PAH.CaseIDNo) AS
LastOfCaseIDNo FROM tbl_PAH Group By tbl_PAH.CaseIDNo HAVING
(((Count(tbl_PAH.CaseIDNo))>1));)));)
Here is the other one I've tried but also deletes the whole records of duplicates without leaving the original one.
DELETE tbl_PAH.CaseIDNo
FROM tbl_PAH
WHERE (((tbl_PAH.CaseIDNo) In (SELECT DISTINCTROW tbl_PAH.CaseIDNo
FROM tbl_PAH
GROUP BY tbl_PAH.CaseIDNo;)));
and when I run it, all the duplicates are deleted without leaving the original ones. Any idea on how I can work this out?
I've already set the Unique Records to Yes. I set the index to Yes (Duplicates Ok) to have no error while automatically appending the records to other tables but as a result, duplicates are created. Any help on deleting the duplicates with the criteria "When a record has duplicates in terms of CaseIDNo, the duplicates will be deleted leaving only the original record." I am a newbie at MS Access 2010 that is why I am still learning. I am using Microsoft Access 2010. Thank you in advance to those who will answer.
You can use the following query to delete all duplicate records where ID is not the minimal value of ID. Since ID is a unique column, that should leave the originals in place.
Note that I've refactored your first condition from an IN to an EXISTS, because those are often faster and more reliable.
DELETE t.*
FROM tbl_PAH t
WHERE EXISTS (SELECT 1 FROM tbl_PAH s WHERE s.CaseIDNo = t.CaseIDNo HAVING COUNT(s.CaseIDNo) > 1)
AND t.ID <> (SELECT Min(s2.ID) FROM tbl_PAH s2 WHERE t.CaseIDNo = s2.CaseIDNo)
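The difference between "delete everything that has a duplicate" and "delete everything except the first of each group" can be checked on a toy table. The sketch below uses Python's built-in sqlite3 instead of Access, so the SQL dialect differs slightly; the helper name and the sample rows are made up, and ID is assumed to be a unique, increasing key.

```python
import sqlite3

def delete_duplicates_keep_first(conn):
    """Delete every tbl_PAH row that is not the lowest ID for its CaseIDNo.

    Returns the number of rows deleted.
    """
    cur = conn.execute("""
        DELETE FROM tbl_PAH
        WHERE ID <> (SELECT MIN(s.ID)
                     FROM tbl_PAH AS s
                     WHERE s.CaseIDNo = tbl_PAH.CaseIDNo)
    """)
    return cur.rowcount

# Hypothetical sample data: C1 appears three times
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_PAH (ID INTEGER PRIMARY KEY, CaseIDNo TEXT)")
conn.executemany(
    "INSERT INTO tbl_PAH VALUES (?, ?)",
    [(1, "C1"), (2, "C2"), (3, "C1"), (4, "C1"), (5, "C3")],
)
deleted = delete_duplicates_keep_first(conn)
remaining = sorted(conn.execute("SELECT ID, CaseIDNo FROM tbl_PAH"))
print(deleted, remaining)  # 2 [(1, 'C1'), (2, 'C2'), (5, 'C3')]
```

A criterion such as In (SELECT CaseIDNo ... HAVING Count(*)>1) matches every row whose CaseIDNo is duplicated, the original included, which is why it removes all of them; comparing against the minimum ID is what spares one row per group.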

Check what table a record belongs in using MySQL

I have a requirement to check what table a record belongs in out of 2 tables and set a variable depending on the returned table.
e.g. I have 2 tables (tbl_registered_users, tbl_unregistered_users). If I search for an email address that existed in tbl_registered_users I would like the query to return 'tbl_registered_users' so I can set a variable $whatTable = ... (for example).
I know I could do this with 2 queries or even 1 if I can guarantee the record will exist in at least one table however I would potentially like to use the query on 3/4/5/10 tables and on records that may not exist in any.
Thanks
You can use a UNION for that with a subquery:
SELECT *
FROM (
SELECT 'Registered' WhichTable, Email
FROM tbl_registered_users
UNION
SELECT 'UnRegistered', Email
FROM tbl_unregistered_users
) t
WHERE Email = 'emailaddress'
Using UNION ALL would yield better performance, but it won't remove duplicates (in case you have duplicated data within either table).
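For the "3/4/5/10 tables" case, the UNION ALL query can be generated from a list of table names. The following is a sketch under assumptions: it uses Python's built-in sqlite3 rather than MySQL, every table is assumed to have an `Email` column, and `KNOWN_TABLES` and `tables_containing` are hypothetical names. Table names cannot be passed as bound parameters, so the sketch only accepts names from a fixed allow-list.

```python
import sqlite3

# Allow-list of tables that may be searched (hypothetical; extend as needed)
KNOWN_TABLES = ("tbl_registered_users", "tbl_unregistered_users")

def tables_containing(conn, email, tables=KNOWN_TABLES):
    """Return the sorted names of the tables that hold `email` (empty list if none)."""
    for name in tables:
        if name not in KNOWN_TABLES:
            raise ValueError(f"unexpected table name: {name}")
    # One branch per table, each tagged with the table's own name.
    branches = [
        f"SELECT '{name}' AS which_table FROM {name} WHERE Email = ?"
        for name in tables
    ]
    sql = " UNION ALL ".join(branches)
    found = {row[0] for row in conn.execute(sql, [email] * len(tables))}
    return sorted(found)

# Hypothetical sample data
conn = sqlite3.connect(":memory:")
for name in KNOWN_TABLES:
    conn.execute(f"CREATE TABLE {name} (Email TEXT)")
conn.executemany("INSERT INTO tbl_registered_users VALUES (?)",
                 [("a@example.com",), ("both@example.com",)])
conn.executemany("INSERT INTO tbl_unregistered_users VALUES (?)",
                 [("b@example.com",), ("both@example.com",)])

print(tables_containing(conn, "a@example.com"))       # ['tbl_registered_users']
print(tables_containing(conn, "nobody@example.com"))  # []
```

An empty list covers the "may not exist in any table" case, and a list with more than one entry shows that the address is present in several tables.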

Comparing 2 identical MySQL tables

I have 2 identical tables called LIVE and BACKUP.
What I am trying to do is to compare a LIVE record with its equivalent BACKUP record to see if they match. This check is required each time an individual LIVE record is accessed.
i.e. I only want to compare record number 59 (as an example) rather than all records in the LIVE table.
Currently I can do what I want by simply comparing the LIVE record and its equivalent BACKUP record on a field by field basis.
However, I was wondering if it is possible to do a simple "Compare LIVE record A with BACKUP record A".
I don't need to know what the differences are or even in which fields they occur. I only need to know a simple yes/no as to whether both records match or not.
Is such a thing possible or am I stuck comparing the tables on a field by field basis?
Many thanks,
Pete
Here is a hack, assuming the columns really are all the same:
select count(*)
from ((select *
from live
where record = 'A'
) union
(select *
from backup
where record = 'A'
)
) t
This will return "1" if the two records are identical and "2" if they differ (the union then keeps both rows). If you want to ensure against two values being in the same table, then use the modified form:
select count(distinct which)
from ((select 'live' as which, l.*
from live l
where record = 'A'
) union
(select 'backup' as which, b.*
from backup b
where record = 'A'
)
) t;
Also, note the use of union. The duplicate removal is very intentional here.
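Wrapped as a yes/no check, the union trick might look like the sketch below. It runs on Python's built-in sqlite3 rather than MySQL; the column layout and the function name `records_match` are assumptions, and the second query (checking that the record is present in both tables) is an addition of this sketch, not part of the answer above.

```python
import sqlite3

def records_match(conn, record_id):
    """True if `record_id` exists in both tables with identical column values."""
    params = (record_id, record_id)
    # UNION removes duplicates, so two identical rows collapse into one.
    distinct_rows = conn.execute("""
        SELECT COUNT(*) FROM (
            SELECT * FROM live WHERE record = ?
            UNION
            SELECT * FROM backup WHERE record = ?
        )
    """, params).fetchone()[0]
    # Count how many of the two tables contain the record at all.
    sources = conn.execute("""
        SELECT COUNT(DISTINCT which) FROM (
            SELECT 'live' AS which FROM live WHERE record = ?
            UNION ALL
            SELECT 'backup' AS which FROM backup WHERE record = ?
        )
    """, params).fetchone()[0]
    return distinct_rows == 1 and sources == 2

# Hypothetical sample data: A matches, B differs, C is missing from backup
conn = sqlite3.connect(":memory:")
for name in ("live", "backup"):
    conn.execute(f"CREATE TABLE {name} (record TEXT, name TEXT, qty INTEGER)")
conn.executemany("INSERT INTO live VALUES (?, ?, ?)",
                 [("A", "x", 1), ("B", "y", 2), ("C", "z", 3)])
conn.executemany("INSERT INTO backup VALUES (?, ?, ?)",
                 [("A", "x", 1), ("B", "y", 9)])

print(records_match(conn, "A"))  # True
print(records_match(conn, "B"))  # False
print(records_match(conn, "C"))  # False
```

Because the comparison relies on SELECT *, both tables must have the same columns in the same order, as the answer already assumes.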

MySQL: Grabbing the latest ID from duplicate records within a table

I'm trying to grab the latest ID from a duplicate record within my table, without using a timestamp to check.
SELECT *
FROM `table`
WHERE `title` = "bananas"
table
id  title
--  -------
1   bananas
2   apples
3   bananas
Ideally, I want to grab the ID 3
I'm slightly confused by the SELECT in your example, but hopefully you will be able to piece this out from my example.
If you want to return the latest row, you can simply use a MAX() function
SELECT MAX(id) FROM `table` WHERE `title` = 'bananas'
Though I definitely recommend trying to determine what makes that row the "latest". If it's just because it has the highest [id], you may want to consider what happens down the road. What if you want to combine two databases that use the same data? Going off the [id] column might not be the best decision. If you can, I suggest adding a [LastUpdated] or [Added] datestamp column to your design.
I'm assuming the IDs are auto-incremented.
You can count how many rows you have, store that in a variable, and then have the WHERE clause check the id against that variable.
BUT this is a hack solution: if you delete a row, the remaining IDs are not decremented, so the row count no longer lines up with the highest ID and you can end up skipping an id.
select max(a.id) from mydb.myTable a join mydb.myTable b on a.id <> b.id and a.title=b.title;
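A quick way to try the MAX(id) approach on the sample rows from the question is the sketch below. It uses Python's built-in sqlite3 instead of MySQL, and the table name `items` and both helper names are made up (the question's table is literally called `table`, which would need quoting).

```python
import sqlite3

def latest_id(conn, title):
    """Highest id among rows with this title, or None if there is no such row."""
    return conn.execute(
        "SELECT MAX(id) FROM items WHERE title = ?", (title,)
    ).fetchone()[0]

def latest_ids_of_duplicates(conn):
    """(title, highest id) for every title that occurs more than once."""
    return sorted(conn.execute("""
        SELECT title, MAX(id)
        FROM items
        GROUP BY title
        HAVING COUNT(*) > 1
    """))

# Sample rows from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "bananas"), (2, "apples"), (3, "bananas")])

print(latest_id(conn, "bananas"))      # 3
print(latest_ids_of_duplicates(conn))  # [('bananas', 3)]
```

This treats "latest" as "highest id", which only holds as long as ids are assigned in increasing order.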