Remove Duplicates in MySQL - mysql

I have a database table that was generated by importing several thousand very large text documents. For some reason, some files were imported multiple times.
I am trying to remove the duplicate rows using the following query:
ALTER IGNORE TABLE mytable ADD UNIQUE INDEX myindex (LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX);
but I was getting an error
1062 - Duplicate entry
Apparently, the IGNORE option of ALTER TABLE has been deprecated.
How can I remove duplicates from my database?
I guess I have to do a DELETE with a JOIN but I can't figure out the code.
The table is InnoDB and currently has about 40,000,000 rows (there should be about 17,000,000). Each row has a primary key.
Considering the size, I am hesitant to temporarily change the table to MyISAM.

"Each row has a primary key"
Is it a unique number?
Create an AUX table like this (assuming ID is the PK):
create table mytable_aux as (
select LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX, MIN(ID) as ID
from mytable
group by LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX);
Then delete everything that is not in aux table:
delete from mytable where id not in (select aux.id from mytable_aux aux) ;
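If you want to try the idea on a handful of rows before touching 40,000,000 of them, here is a minimal sketch in Python. It uses the standard sqlite3 module with an in-memory database as a stand-in for MySQL, so treat it as an illustration of the logic only. The schema is trimmed to three of the question's columns and the sample rows are made up.

```python
import sqlite3

# SQLite in-memory database standing in for MySQL (an assumption made so the
# sketch runs anywhere); columns are a trimmed-down version of mytable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (
        ID INTEGER PRIMARY KEY,
        LASTNAME TEXT, FIRSTNAME TEXT, ZIP TEXT
    );
    INSERT INTO mytable (LASTNAME, FIRSTNAME, ZIP) VALUES
        ('Smith', 'Ann', '10001'),
        ('Smith', 'Ann', '10001'),
        ('Jones', 'Bob', '20002'),
        ('Smith', 'Ann', '10001'),
        ('Jones', 'Bob', '20002');

    -- One row per group of identical values, remembering the lowest ID.
    CREATE TABLE mytable_aux AS
        SELECT LASTNAME, FIRSTNAME, ZIP, MIN(ID) AS ID
        FROM mytable
        GROUP BY LASTNAME, FIRSTNAME, ZIP;

    -- Delete every row whose ID was not kept as a group's survivor.
    DELETE FROM mytable WHERE ID NOT IN (SELECT aux.ID FROM mytable_aux aux);
""")

remaining = conn.execute(
    "SELECT ID, LASTNAME, FIRSTNAME, ZIP FROM mytable ORDER BY ID"
).fetchall()
print(remaining)
```

Note that the aggregate needs an alias (MIN(ID) AS ID here); without one, the aux table has no column called ID for the DELETE to refer to.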

Assuming it is just one table and you have the SQL dump available...
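CREATE the table with all the relationships established but no data inserted. Make sure the empty table has a unique index over the columns that define a duplicate, because INSERT IGNORE only skips rows that violate a unique key. Keep the INSERT statements stored in a separate .sql file.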
Change all the INSERT statements to INSERT IGNORE.
Import the updated .sql file containing only the INSERT IGNORE statements. The duplicates will be automatically ignored.
Please note that, without comparing manually, you won't be able to figure out which or how many records were ignored.
However, if you're absolutely sure that you really don't need the duplicates based on the relationships defined on the table, then this approach works fairly well.
Also, if you'd like to do the same with multiple tables, you'll have to make sure that you CREATE all the tables at the start, define the foreign keys / dependencies AND, most importantly, arrange the new .sql file so that the INSERT statements for tables with no dependencies are loaded first. Likewise, the last set of INSERT statements will be for the table with the most dependencies.
Hope that helps.

If those are the only fields in your table you can always:
create table temp_unique as
select distinct LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX
from mytable
then rename (or drop if you dare) mytable and rename temp_unique to mytable, then create your indexes (make sure to create any other indexes or FKs or whatever that already exist).
If you're working on a live table you'll have to delete the underlying records one at a time. That's quite a bit different -- add a uid then perform deletes. If that's your situation, let us know, we can refactor.

Related

Mysql copying parent child data to another table with different parent child ids

I have a table structure similar to screenshot attached below.
I want to copy data from these tables to different tables with similar structure.
NEW_EMPLOYEE
NEW_EMPLOYEE_COURSE_COMPLETION
NEW_EMPLOYEE_COURSE_COMPLETION_SUPPORTING_DOCUMENTS
I am currently using a cursor to copy the data to the NEW_xxx tables.
I am wondering whether this causes any performance issues, or whether there is a better approach. Please share your thoughts; I appreciate the help. Thanks.
Note: Primary keys from the original tables are not copied. They will be generated in the NEW_xxx tables. But the mapping should be preserved.
Add an original_ID column to each of the new tables, to hold the mappings as rows are copied. I assume the ID columns are AUTO_INCREMENT.
You can use INSERT INTO ... SELECT .... The SELECT query can join with the NEW_EMPLOYEE table to map the original employee IDs to the new IDs.
INSERT INTO NEW_EMPLOYEE (original_ID, FIRST_NAME, LAST_NAME, D_O_B)
SELECT ID, FIRST_NAME, LAST_NAME, D_O_B
FROM EMPLOYEE;
INSERT INTO NEW_EMPLOYEE_COURSE_COMPLETION (original_ID, employee_id, course_id)
SELECT ecc.id, ne.id, ecc.course_id
FROM EMPLOYEE_COURSE_COMPLETION AS ecc
JOIN NEW_EMPLOYEE AS ne ON ecc.employee_id = ne.original_id;
INSERT INTO NEW_EMPLOYEE_COURSE_COMPLETION_SUPPORTING_DOCUMENTS (original_id, employee_course_completion_id, document_id)
SELECT eccsd.id, necc.id, eccsd.document_id
FROM EMPLOYEE_COURSE_COMPLETION_SUPPORTING_DOCUMENTS AS eccsd
JOIN NEW_EMPLOYEE_COURSE_COMPLETION AS necc ON eccsd.employee_course_completion_id = necc.original_id;
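To see the original_ID mapping at work on a tiny data set, here is a rough sketch in Python using the standard sqlite3 module as a stand-in for MySQL (AUTOINCREMENT here plays the role of AUTO_INCREMENT). Only two of the three tables are shown, the column lists are shortened, and all sample values are invented.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; table names follow the question,
# column lists are shortened and the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (ID INTEGER PRIMARY KEY, FIRST_NAME TEXT);
    CREATE TABLE EMPLOYEE_COURSE_COMPLETION (
        ID INTEGER PRIMARY KEY, EMPLOYEE_ID INTEGER, COURSE_ID INTEGER);
    INSERT INTO EMPLOYEE VALUES (7, 'Ann'), (9, 'Bob');
    INSERT INTO EMPLOYEE_COURSE_COMPLETION VALUES
        (70, 7, 100), (71, 9, 100), (72, 9, 200);

    CREATE TABLE NEW_EMPLOYEE (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        original_ID INTEGER, FIRST_NAME TEXT);
    CREATE TABLE NEW_EMPLOYEE_COURSE_COMPLETION (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        original_ID INTEGER, EMPLOYEE_ID INTEGER, COURSE_ID INTEGER);

    -- Parents first: new IDs are generated, old IDs are kept in original_ID.
    INSERT INTO NEW_EMPLOYEE (original_ID, FIRST_NAME)
        SELECT ID, FIRST_NAME FROM EMPLOYEE;

    -- Children next: the join translates each old parent ID to the new one.
    INSERT INTO NEW_EMPLOYEE_COURSE_COMPLETION (original_ID, EMPLOYEE_ID, COURSE_ID)
        SELECT ecc.ID, ne.ID, ecc.COURSE_ID
        FROM EMPLOYEE_COURSE_COMPLETION AS ecc
        JOIN NEW_EMPLOYEE AS ne ON ecc.EMPLOYEE_ID = ne.original_ID;
""")

# Follow the new foreign keys back to the employee names to check the mapping.
copied = conn.execute("""
    SELECT necc.original_ID, ne.FIRST_NAME, necc.COURSE_ID
    FROM NEW_EMPLOYEE_COURSE_COMPLETION AS necc
    JOIN NEW_EMPLOYEE AS ne ON necc.EMPLOYEE_ID = ne.ID
    ORDER BY necc.original_ID
""").fetchall()
print(copied)
```

Each level of the hierarchy joins against the new table one level up, so the grandchild table would join to NEW_EMPLOYEE_COURSE_COMPLETION in the same way.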

Check if entry exists and not insert in mysql

I am doing an insert from an imported table and adding data into multiple tables using MySQL.
When doing the insert there are some null fields, as the data has been imported from a CSV.
What I want to do is extract the data without creating multiple null entries. An example is when adding contacts which have no entries. I want to have one entry in the table which can be bound to the id within the table.
How can I do this?
My current code is
Insert into Contact(FirstName, Surname, Position, TelephoneNo, EmailAddress, RegisteredDate)
Select Distinct Import.FirstName, Import.SecondName, Import.JobTitle,
Import.ContactTelNumber, Import.EmailAddress, Import.RegistrationDate
FROM Import
This imports the data without any checks. Where can I add a check for this?
It's hard to infer exactly what you mean from your description. It would help if you showed a couple of example lines, one that you want included and one that you want to be excluded.
But you can add a variety of conditions in the WHERE clause of your SELECT. For example, if you just want to make sure that at least one column in Import is non-null, you could do this:
INSERT INTO Contact(FirstName, Surname, Position, TelephoneNo,
EmailAddress, RegisteredDate)
SELECT DISTINCT FirstName, SecondName, JobTitle,
ContactTelNumber, EmailAddress, RegistrationDate
FROM Import
WHERE COALESCE(FirstName, SecondName, JobTitle, ContactTelNumber,
EmailAddress, RegistrationDate) IS NOT NULL
COALESCE() is a function that accepts a variable number of arguments, and returns the first non-null argument. If all the arguments are null, it returns null. So if we coalesce all the columns, and we get a null, then we know that all the columns are null, and we exclude that row.
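Here is a small sketch of that filter in Python, using the standard sqlite3 module as a stand-in for MySQL (COALESCE behaves the same way in this case). The tables are cut down to three columns and the sample rows are invented.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; shortened column lists, made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Import (FirstName TEXT, SecondName TEXT, EmailAddress TEXT);
    CREATE TABLE Contact (FirstName TEXT, Surname TEXT, EmailAddress TEXT);
    INSERT INTO Import VALUES
        ('Ann', 'Smith', 'ann@example.com'),
        (NULL, NULL, NULL),
        (NULL, 'Jones', NULL),
        (NULL, NULL, NULL);

    -- COALESCE is NULL only when every argument is NULL, so this skips
    -- the rows that are empty across all columns.
    INSERT INTO Contact (FirstName, Surname, EmailAddress)
        SELECT DISTINCT FirstName, SecondName, EmailAddress
        FROM Import
        WHERE COALESCE(FirstName, SecondName, EmailAddress) IS NOT NULL;
""")

contacts = conn.execute(
    "SELECT FirstName, Surname, EmailAddress FROM Contact"
).fetchall()
print(contacts)
```

The partially filled row (only a surname) is still copied; only the rows that are null in every column are dropped.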
Re your comment:
Okay, it sounds like you want a unique constraint over the whole row, and you want to copy only rows that don't violate the unique constraint.
One way to accomplish this would be the following:
ALTER TABLE Contact ADD UNIQUE KEY (FirstName, Surname, Position, TelephoneNo,
EmailAddress, RegisteredDate);
INSERT IGNORE INTO Contact(FirstName, Surname, Position, TelephoneNo,
EmailAddress, RegisteredDate)
SELECT DISTINCT FirstName, SecondName, JobTitle,
ContactTelNumber, EmailAddress, RegistrationDate
FROM Import;
The INSERT IGNORE means if it encounters an error like a duplicate row, don't insert it, but also don't abort the insert for other rows.
The unique constraint creates an index, so it will take some time to run that ALTER TABLE, depending on the size of your table.
Also it may be impractical to have a key containing many columns. An index can have at most 16 columns, and its total key length is capped by the storage engine (1000 bytes for MyISAM, 3072 bytes for InnoDB). However, I would expect that what you really want is to restrict to one row per EmailAddress or some other subset of the columns.
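A minimal sketch of the "one row per EmailAddress" variant, written in Python with the standard sqlite3 module as a stand-in for MySQL. One difference to keep in mind: SQLite spells the statement INSERT OR IGNORE, whereas MySQL uses INSERT IGNORE. The index name contact_email and the sample rows are made up for illustration.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; shortened column lists, made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contact (FirstName TEXT, Surname TEXT, EmailAddress TEXT);
    -- Hypothetical index name; the unique key is what defines a duplicate.
    CREATE UNIQUE INDEX contact_email ON Contact (EmailAddress);
    INSERT INTO Contact VALUES ('Ann', 'Smith', 'ann@example.com');

    CREATE TABLE Import (FirstName TEXT, SecondName TEXT, EmailAddress TEXT);
    INSERT INTO Import VALUES
        ('Ann', 'Smith', 'ann@example.com'),
        ('Bob', 'Jones', 'bob@example.com'),
        ('Robert', 'Jones', 'bob@example.com');

    -- Rows that would violate the unique index are skipped, the rest go in.
    -- (MySQL syntax: INSERT IGNORE INTO ...)
    INSERT OR IGNORE INTO Contact (FirstName, Surname, EmailAddress)
        SELECT FirstName, SecondName, EmailAddress FROM Import;
""")

emails = [row[0] for row in conn.execute(
    "SELECT EmailAddress FROM Contact ORDER BY EmailAddress")]
print(emails)
```

Three rows are offered and only one is new, so the table ends up with two contacts and the statement does not abort on the duplicates.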

Inserting into a table from an incompatible table

I have a MySQL table called Person, and one day I accidentally deleted someone from this table. I have a backup table, called PersonBak, so I was going to restore my deletion from the backup. However, in the course of moving forward on my application I renamed all the fields in Person, except for the primary key, PersonID. Now Person and PersonBak have the same data, but only one matching column name.
Is there any way to restore my missing person to Person from PersonBak without doing a lot of work? I have quite a few columns. Of course I could just do the work now, but I can imagine this coming up again.
Is there some way to tell MySQL that these are really the same table, with the columns in the same order, just different column names? Or any way at all to do this without writing out specifics of which columns in PersonBak match which ones in Person?
If the column datatypes are the same between the tables, the column count is the same, and they are all in the same order, then MySQL will do all of the work for you:
INSERT INTO t1 SELECT * FROM t2;
The column names are ignored. The server uses ordinal position only, to decide how to line up the from/to columns.
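A quick way to convince yourself of the positional matching is this sketch in Python, using the standard sqlite3 module as a stand-in for MySQL (both line up the columns by position when no column list is given). The column names other than PersonID are invented, and a WHERE clause is added so that only the missing row is copied and the existing primary keys are not hit.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; the renamed columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PersonBak (PersonID INTEGER PRIMARY KEY, name TEXT, town TEXT);
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, FullName TEXT, City TEXT);
    INSERT INTO PersonBak VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome');
    -- Person 2 has been deleted by accident.
    INSERT INTO Person VALUES (1, 'Ann', 'Oslo');

    -- No column list on either side: values are matched by position only.
    INSERT INTO Person
        SELECT * FROM PersonBak WHERE PersonID = 2;
""")

restored = conn.execute(
    "SELECT PersonID, FullName, City FROM Person ORDER BY PersonID"
).fetchall()
print(restored)
```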
What about this:
insert into Person(id, col11, col12) (select id, col21, col22 from personBak where id=5)
person schema:
columns (id, col11, col12)
personBak schema:
columns (id, col21, col22)
Look at MySQL's INSERT ... SELECT, where you can specify the field names on both sides of the statement. (MySQL's SELECT ... INTO only writes to variables or files, not to another table.)

Inserting database row with values from another table

Basically, I have two tables: images and servers. When I want to insert a row into the images table, I need to specify a s_id as one of the fields. Problem is, I only have name, which is another field in the servers table. I need to find what s_id belongs to name, and then use that in my INSERT INTO query on the images table.
Maybe this image will help:
http://i.imgur.com/rYXbW.png
I only know the name field from the servers table, and I need to use it to get the s_id field from the servers table. When I have that, I can use it in my INSERT INTO query, as it's a foreign key.
I found this:
http://www.1keydata.com/sql/sqlinsert.html
But it just confused me even more.
One solution would be to run two queries. One to get the s_id, and one to run the insert query. But I'd like to limit the number of queries I run if there's a reasonable alternative.
Thanks!
You can use the INSERT ... SELECT form, something like this (with real column names and values of course):
INSERT INTO images (s_id, u_id, name, filename, uploaded)
SELECT s_id, ...
FROM servers
WHERE name = 'the name'
I don't know where you're getting the u_id, name, filename, or uploaded column values for images but you can include them as literal values in the SELECT:
INSERT INTO images (s_id, u_id, name, filename, uploaded)
SELECT s_id, 11, 'pancakes', 'pancakes.jpg', '2011-05-28 11:23:42'
FROM servers
WHERE name = 'the name'
This sort of thing will insert multiple rows if servers.name is not unique.
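If the query is issued from application code, the literal values can be passed as bound parameters. Below is a sketch in Python using the standard sqlite3 module as a stand-in for MySQL. The i_id column, the server names and the UNIQUE constraint on servers.name are assumptions for the example; a MySQL driver would typically use %s placeholders instead of ?.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; i_id and the sample data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE servers (s_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE images (
        i_id INTEGER PRIMARY KEY,
        s_id INTEGER REFERENCES servers(s_id),
        u_id INTEGER, name TEXT, filename TEXT, uploaded TEXT);
    INSERT INTO servers (s_id, name) VALUES (3, 'alpha'), (8, 'beta');
""")

# One statement: look up s_id by server name and insert the image row.
conn.execute(
    """
    INSERT INTO images (s_id, u_id, name, filename, uploaded)
    SELECT s_id, ?, ?, ?, ?
    FROM servers
    WHERE name = ?
    """,
    (11, "pancakes", "pancakes.jpg", "2011-05-28 11:23:42", "beta"),
)

image_rows = conn.execute(
    "SELECT s_id, u_id, name, filename FROM images"
).fetchall()
print(image_rows)
```

If no server matches the name, the SELECT returns nothing and no row is inserted, so it is worth checking the affected row count in real code.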
You should be able to do something like this, but you'll need to fill in the items in <> with the values you want to insert.
INSERT INTO images (s_id, u_id, name, filename, uploaded)
(SELECT s_id, <u_id>, <name>, <filename>, <uploaded>
FROM imgstore.servers
WHERE name = #server_name)
This is the syntax for SQL Server, but I think it will work with MySQL as well.
Here's an article on INSERT ... SELECT Syntax
Please see my comment above regarding a potential data integrity issue. I am assuming that the name field in your server table has a unique constraint placed on it.
There are a couple of ways that you can approach this INSERT, and I'm sure that some are better than others. I make no claim that my way is the best way, but it should work. I don't know how you're writing this query, so I'm going to use #FieldValue to represent the variable input. My approach is to use a subquery in your insert statement to get the data that you require.
INSERT INTO images (field1, field2... s_id) VALUES ('#field1val', '#field2val'... (SELECT s_id FROM servers WHERE name='#nameval'));

Fix DB duplicate entries (MySQL bug)

I'm using MySQL 4.1. Some tables have duplicate entries that violate the constraints.
When I try to group rows, MySQL doesn't recognise the rows as being identical.
Example:
Table A has a column "Name" with the Unique property.
The table contains one row with the name 'Hach?' and one row with the same name but with a square at the end instead of the '?' (which I can't reproduce in this text field).
A "Group by" on these 2 rows returns 2 separate rows.
This causes several problems, including the fact that I can't export and reimport the database. On reimporting, an error reports that an INSERT has failed because it violates a constraint.
In theory I could try to import, wait for the first error, fix the import script and the original DB, and repeat. In practice, that would take forever.
Is there a way to list all the anomalies or force the database to recheck constraints (and list all the values/rows that go against them)?
I can supply the .MYD file if it can be helpful.
To list all the anomalies:
SELECT name, count(*) FROM TableA GROUP BY name HAVING count(*) > 1;
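For illustration, here is the same query run against a few made-up rows, sketched in Python with the standard sqlite3 module as a stand-in for MySQL. An ORDER BY is added only to make the output stable.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; the names are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (name TEXT);
    INSERT INTO TableA VALUES
        ('Bob'), ('Bill'), ('Bob'), ('Jim'), ('Jim'), ('Jim');
""")

# Every name that occurs more than once, with its number of occurrences.
anomalies = conn.execute("""
    SELECT name, COUNT(*) FROM TableA
    GROUP BY name HAVING COUNT(*) > 1
    ORDER BY name
""").fetchall()
print(anomalies)
```

This only finds values the server itself considers equal, so names that differ by an invisible trailing character will not show up as duplicates here.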
There are a few ways to tackle deleting the dups and your path will depend heavily on the number of dups you have.
See this SO question for ways of removing those from your table.
Here is the solution I provided there:
-- Setup for example
create table people (fname varchar(10), lname varchar(10));
insert into people values ('Bob', 'Newhart');
insert into people values ('Bob', 'Newhart');
insert into people values ('Bill', 'Cosby');
insert into people values ('Jim', 'Gaffigan');
insert into people values ('Jim', 'Gaffigan');
insert into people values ('Adam', 'Sandler');
-- Show table with duplicates
select * from people;
-- Create table with one version of each duplicate record
create table dups as
select distinct fname, lname, count(*)
from people group by fname, lname
having count(*) > 1;
-- Delete all matching duplicate records
delete people from people inner join dups
on people.fname = dups.fname AND
people.lname = dups.lname;
-- Insert single record of each dup back into table
insert into people select fname, lname from dups;
-- Show Fixed table
select * from people;
Create a new table, then select all rows grouped by the unique key (the Name column in the example) and insert them into the new table.
To find out what that character is, run the following query:
SELECT HEX(Name) FROM TableName WHERE Name LIKE 'Hach%'
You will see the hexadecimal byte values of that 'square'.
If that character is 'x', you could update like this (but if that column is Unique you will get some errors):
UPDATE TableName SET Name=TRIM(TRAILING 'x' FROM Name);
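Here is a small sketch of the HEX() check in Python, using the standard sqlite3 module as a stand-in for MySQL. The control character \x01 is only a hypothetical placeholder for the unknown 'square'; the real byte is whatever the query shows on your data.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; "\x01" is a made-up stand-in for the
# unprintable character from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (Name TEXT)")
conn.executemany(
    "INSERT INTO TableName (Name) VALUES (?)",
    [("Hach?",), ("Hach\x01",)],
)

# HEX() exposes the stored bytes, so the two names can be told apart.
hex_names = [row[0] for row in conn.execute(
    "SELECT HEX(Name) FROM TableName WHERE Name LIKE 'Hach%' ORDER BY HEX(Name)"
)]
print(hex_names)
```

The two values share the prefix 48616368 ('Hach') and differ only in the last byte, which is the one to strip or replace.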
I'll assume this is a random MySQL 4.1 bug. Some values are just changing on their own for no particular reason, even if they violate MySQL constraints. MySQL is simply ignoring those violations.
To solve my problem, I will write a program that tries to reinsert every line of data into the same table (to be precise: another table with the same characteristics) and logs every failure.
I will leave the incident open for a while in case someone gets the same problem and someone else finds a more practical solution.