Remove Duplicate Data From MySQL Database - mysql

Working on database-related tasks often means loading client data: the client sends its data as Excel sheets, and you push that data into database tables after some Excel manipulation. I have done this many times.
A very common problem with this approach is that it can produce duplicate rows, because the data mostly comes from departments like HR and finance, where people are not well aware of data normalization techniques :-).
I will use an Employee table whose columns are id, name, department and email.
Below are the SQL scripts for generating the test data.
Create schema TestDB;
CREATE TABLE EMPLOYEE
(
ID INT,
NAME Varchar(100),
DEPARTMENT INT,
EMAIL Varchar(100)
);
INSERT INTO EMPLOYEE VALUES (1,'Anish',101,'anish@howtodoinjava.com');
INSERT INTO EMPLOYEE VALUES (2,'Lokesh',102,'lokesh@howtodoinjava.com');
INSERT INTO EMPLOYEE VALUES (3,'Rakesh',103,'rakesh@howtodoinjava.com');
INSERT INTO EMPLOYEE VALUES (4,'Yogesh',104,'yogesh@howtodoinjava.com');
-- These are the duplicate rows
INSERT INTO EMPLOYEE VALUES (5,'Anish',101,'anish@howtodoinjava.com');
INSERT INTO EMPLOYEE VALUES (6,'Lokesh',102,'lokesh@howtodoinjava.com');
Solution:
DELETE e1 FROM EMPLOYEE e1, EMPLOYEE e2 WHERE e1.name = e2.name AND e1.id > e2.id;
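The DELETE above matches duplicates on name only and relies on MySQL's multi-table DELETE syntax. As a minimal sketch of the same idea (keep the row with the lowest id in each group of duplicates), the Python snippet below uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server. Grouping on name, department and email, the NOT IN (SELECT MIN(id) ...) formulation, and the example.com addresses are assumptions for illustration, not the original author's query.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INT, name VARCHAR(100), department INT, email VARCHAR(100))"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "Anish", 101, "anish@example.com"),
        (2, "Lokesh", 102, "lokesh@example.com"),
        (3, "Rakesh", 103, "rakesh@example.com"),
        (4, "Yogesh", 104, "yogesh@example.com"),
        (5, "Anish", 101, "anish@example.com"),    # duplicate of id 1
        (6, "Lokesh", 102, "lokesh@example.com"),  # duplicate of id 2
    ],
)

# Keep only the lowest id within each (name, department, email) group.
conn.execute(
    """
    DELETE FROM employee
    WHERE id NOT IN (
        SELECT MIN(id) FROM employee GROUP BY name, department, email
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM employee ORDER BY id")]
print(remaining_ids)
```

Note that MySQL itself rejects a subquery that reads directly from the table being deleted from (error 1093), so on MySQL the self-join form shown above, or a subquery wrapped in a derived table, is the usual route.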

Delete all duplicate records in the Excel sheet itself using a filter, and then insert the remaining records into your database.
Alternatively, use the DISTINCT keyword to select only unique rows.
See the Stack Overflow question "How to insert Distinct Records from Table A to Table B (both tables have same structure)".

Add a unique constraint on the name field and use the query below:
INSERT INTO EMPLOYEE VALUES
Please refer to "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE".
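As a rough illustration of how a unique constraint plus an "ignore" style insert skips duplicates, here is a small Python sketch using the standard-library sqlite3 module as a stand-in for MySQL. SQLite spells the statement INSERT OR IGNORE, whereas MySQL uses INSERT IGNORE; the index name ux_employee_name and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INT, name VARCHAR(100), department INT, email VARCHAR(100))"
)
# Hypothetical index name; it enforces uniqueness on the name column.
conn.execute("CREATE UNIQUE INDEX ux_employee_name ON employee (name)")

incoming_rows = [
    (1, "Anish", 101, "anish@example.com"),
    (2, "Lokesh", 102, "lokesh@example.com"),
    (5, "Anish", 101, "anish@example.com"),    # same name as id 1, skipped
    (6, "Lokesh", 102, "lokesh@example.com"),  # same name as id 2, skipped
]

# Rows that would violate the unique index are silently dropped.
conn.executemany("INSERT OR IGNORE INTO employee VALUES (?, ?, ?, ?)", incoming_rows)

stored_names = [row[0] for row in conn.execute("SELECT name FROM employee ORDER BY id")]
print(stored_names)
```

A unique index on name alone would also reject two different employees who happen to share a name, so a composite key is often the safer choice.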

Before inserting, you can always check whether the record already exists in the database. In your case, the condition can be on the unique key, which will be different for every employee. The unique key can be a single column, or a composite key that uses more than one column to identify a record uniquely.

Related

How to insert into table a new row then do nothing if already exists(without UNIQUE key)

Let's say I have these two tables, where I insert employees into the employee table from the staging table.
staging table:
id  employee_id  name
1   12           Paul
2   13           Kyle

employee table:
id  employee_id  name
5   4            Will
6   13           Kyle
Now let's say I'd like to copy what is currently in my staging table into the employee table using an INSERT INTO ... SELECT statement. Paul should be inserted, but I don't want Kyle to be inserted since he is already in the employee table (employee.employee_id is the defining column).
I know this could be done by setting a unique or primary key on employee_id and using ON DUPLICATE KEY UPDATE, which then effectively does nothing by setting the columns back to their original values.
I'm new to SQL, and so far the only solution I have is the UNIQUE key with ON DUPLICATE KEY UPDATE. I'd like to know how to do this without that solution.
First of all, keep in mind that the decision whether to create unique or primary keys does not depend on how you write your insert statements. It is a matter of what your table should and should not allow.
To achieve your goal, you can add a WHERE clause to your insert statement that excludes the entries that already exist, for example:
INSERT INTO employee (id, employee_id, name)
SELECT id, employee_id, name
FROM staging
WHERE employee_id NOT IN (SELECT employee_id FROM employee);
Break the problem down into its constituent parts:
1) get a list of employees who are in the staging table but not in the target table
2) insert those records only
Part 1) can be achieved with a LEFT JOIN and a WHERE condition that the target table's column is NULL. Wrap that up as a CTE and then use it in 2).
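The two steps above can be sketched as follows. This is a minimal example using Python's standard-library sqlite3 module as a stand-in for MySQL; the CTE name missing is hypothetical, and the new rows let the employee table assign its own id rather than copying the staging id (an assumption). SQLite puts the WITH clause before INSERT, whereas MySQL 8.0 expects it after the INSERT INTO target (INSERT ... WITH ... SELECT).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging  (id INTEGER PRIMARY KEY, employee_id INT, name TEXT);
    CREATE TABLE employee (id INTEGER PRIMARY KEY, employee_id INT, name TEXT);
    INSERT INTO staging  VALUES (1, 12, 'Paul'), (2, 13, 'Kyle');
    INSERT INTO employee VALUES (5, 4, 'Will'), (6, 13, 'Kyle');
    """
)

conn.execute(
    """
    WITH missing AS (
        -- Step 1: staging rows with no matching employee_id in the target
        SELECT s.employee_id, s.name
        FROM staging AS s
        LEFT JOIN employee AS e ON e.employee_id = s.employee_id
        WHERE e.employee_id IS NULL
    )
    -- Step 2: insert only those rows
    INSERT INTO employee (employee_id, name)
    SELECT employee_id, name FROM missing
    """
)

employee_rows = conn.execute(
    "SELECT employee_id, name FROM employee ORDER BY employee_id"
).fetchall()
print(employee_rows)
```

Unlike NOT IN, the LEFT JOIN form is not thrown off by NULL values in employee.employee_id.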

How do I get the updated rows id when using on duplicate update in a inserting multiple rows query?

I have two tables, Person(id, name, phone, partner) and Partner(id, personA, personB). The Person table has a unique key on (name, phone), and the id is auto-incremented.
I have an insert query that inserts multiple rows. It first checks whether name or phone is the same, and then decides whether to update the partner field or do a normal insert.
INSERT INTO Table1(name, phone, partner)
VALUES ("name1", "phone1", "partner1"), ("name2", "phone2", "partner2")
ON DUPLICATE KEY UPDATE partner = VALUES(partner);
The point is that every time I execute this query, it may update or insert multiple rows.
Is there a way to get the ids of the updated rows so that I can insert them into the Partner table? I don't need to update the Partner table, just insert into it.
A follow-up question: I could use row_count before and after the query is executed to work out how many rows were inserted and their id range, but is there a better way to work around this?
Thanks!

My SQL INSERT INTO based on conditions

I have a table called autoSaves where my web app stores a user autosave every 4 seconds in case the web app crashes.
The autoSaves table has these columns:
customerId  designType  autosaveFile
The condition is this:
If a row with the given customerId and designType already exists, update it with the new values (customerId, designType, autosaveFile).
Otherwise, if no row matches both conditions, create a new row with the new values.
I have come across the INSERT INTO statement, but I cannot work out how to formulate it so that it updates when the 2 conditions are met.
You need to create a unique index on the customerId and designType columns:
CREATE UNIQUE INDEX ix_cust_design ON autoSaves (customerId, designType);
Then you can use the following INSERT statement:
INSERT INTO autoSaves (customerId, designType, autosaveFile)
VALUES (@id, @type, @file)
ON DUPLICATE KEY UPDATE autosaveFile = VALUES(autosaveFile);
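To show the insert-or-update behaviour end to end, here is a minimal runnable sketch in Python using the standard-library sqlite3 module as a stand-in for MySQL. SQLite has no ON DUPLICATE KEY UPDATE, so the sketch swaps in SQLite's ON CONFLICT ... DO UPDATE clause, which plays the same role; the helper name save_autosave and the sample values are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE autoSaves (customerId INT, designType TEXT, autosaveFile TEXT)"
)
conn.execute("CREATE UNIQUE INDEX ix_cust_design ON autoSaves (customerId, designType)")


def save_autosave(customer_id, design_type, autosave_file):
    """Insert a new autosave, or overwrite the file if the pair already exists."""
    conn.execute(
        """
        INSERT INTO autoSaves (customerId, designType, autosaveFile)
        VALUES (?, ?, ?)
        ON CONFLICT (customerId, designType)
        DO UPDATE SET autosaveFile = excluded.autosaveFile
        """,
        (customer_id, design_type, autosave_file),
    )


save_autosave(1, "logo", "v1.json")    # new row
save_autosave(1, "logo", "v2.json")    # same pair: row is updated
save_autosave(1, "banner", "v1.json")  # different designType: new row

saved_rows = conn.execute(
    "SELECT customerId, designType, autosaveFile FROM autoSaves ORDER BY designType"
).fetchall()
print(saved_rows)
```

Either way, the unique index on (customerId, designType) is what turns the second call into an update instead of a second row.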

MySQL - insert into with foreign key index

Here is the scenario:
I have 2 tables and 2 temporary tables. Before I insert user data to the official tables, I insert them to a temp table to let them do the checks. There is a company table with company info, and a contact table that has contact info. The contact table has a field called company_id which is a foreign key index for the company table.
Temp tables are set up the same way.
I want to do something like: INSERT INTO company () SELECT * FROM temp_company; and INSERT INTO contact () SELECT * FROM temp_contact
My question is, how do I transfer the foreign key from the temp_company to the newly inserted id on the company table using a statement like this? Is there a way to do it?
Currently I am:
grabbing the temp rows
going one by one and inserting them
grabbing the last insert id
then inserting the contacts afterwards with the new last insert id
I just don't know if that is the most efficient way. Thanks!
If you have the same number of columns in both tables, you should be able to use the syntax you have there. Just take out the (), and make sure there aren't any duplicate primary keys:
INSERT INTO company SELECT * FROM temp_company;
INSERT INTO contact SELECT * FROM temp_contact;
You can also list the inserted columns explicitly. This way you can specify exactly which column you insert as the new ID.
INSERT INTO company (`ID`,`col_1`,...,`last_col`) SELECT `foreign_key_col`,`col_1`,...,`last_col` FROM temp_company;
INSERT INTO contact (`ID`,`col_1`,...,`last_col`) SELECT `foreign_key_col`,`col_1`,...,`last_col` FROM temp_contact;
Just make sure you are selecting the right number of columns.
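If the official company table generates its own ids, the contacts need to be re-pointed at the new ids in a set-based way. One possible approach, sketched below in Python with the standard-library sqlite3 module as a stand-in for MySQL, is to insert the companies first and then join temp_contact back to company through a natural key. The column name "name" and the assumption that company names are unique are hypothetical and not taken from the question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE);
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        company_id INT REFERENCES company(id),
        name TEXT
    );
    CREATE TABLE temp_company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE temp_contact (id INTEGER PRIMARY KEY, company_id INT, name TEXT);

    INSERT INTO company (name) VALUES ('Existing Ltd');
    INSERT INTO temp_company VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO temp_contact VALUES (1, 1, 'Alice'), (2, 2, 'Bob'), (3, 1, 'Carol');

    -- Step 1: copy companies; the official table assigns new ids.
    INSERT INTO company (name) SELECT name FROM temp_company;

    -- Step 2: copy contacts, mapping temp ids to new ids via the company name.
    INSERT INTO contact (company_id, name)
    SELECT c.id, tc.name
    FROM temp_contact AS tc
    JOIN temp_company AS tco ON tco.id = tc.company_id
    JOIN company AS c ON c.name = tco.name;
    """
)

contact_pairs = conn.execute(
    """
    SELECT co.name, ct.name
    FROM contact AS ct
    JOIN company AS co ON co.id = ct.company_id
    ORDER BY ct.name
    """
).fetchall()
print(contact_pairs)
```

This replaces the row-by-row loop with two statements, but it only works when some column (or combination of columns) identifies a company in both tables.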

Fix DB duplicate entries (MySQL bug)

I'm using MySQL 4.1. Some tables have duplicate entries that violate the constraints.
When I try to group rows, MySQL doesn't recognise the rows as being the same.
Example:
Table A has a column "Name" with the Unique property.
The table contains one row with the name 'Hach?' and one row with the same name but with a square at the end instead of the '?' (which I can't reproduce in this text field).
A "Group by" on these 2 rows returns 2 separate rows.
This causes several problems, including the fact that I can't export and reimport the database. On reimporting, an error reports that an Insert has failed because it violates a constraint.
In theory I could try to import, wait for the first error, fix the import script and the original DB, and repeat. In practice, that would take forever.
Is there a way to list all the anomalies, or to force the database to recheck constraints (and list all the values/rows that violate them)?
I can supply the .MYD file if it would be helpful.
To list all the anomalies:
SELECT name, count(*) FROM TableA GROUP BY name HAVING count(*) > 1;
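For reference, here is the same GROUP BY ... HAVING query run against a few sample rows. It is a small Python sketch using the standard-library sqlite3 module as a stand-in for MySQL, and the sample names are made up. Rows that differ by an invisible character still land in separate groups, so this query only finds values the server considers equal.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (name TEXT)")
conn.executemany(
    "INSERT INTO TableA VALUES (?)",
    [("Bob",), ("Bob",), ("Bill",), ("Jim",), ("Jim",), ("Jim",)],
)

# Every name that occurs more than once, with its number of occurrences.
duplicate_counts = conn.execute(
    "SELECT name, COUNT(*) FROM TableA GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
).fetchall()
print(duplicate_counts)
```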
There are a few ways to tackle deleting the dups and your path will depend heavily on the number of dups you have.
See this SO question for ways of removing those from your table.
Here is the solution I provided there:
-- Setup for example
create table people (fname varchar(10), lname varchar(10));
insert into people values ('Bob', 'Newhart');
insert into people values ('Bob', 'Newhart');
insert into people values ('Bill', 'Cosby');
insert into people values ('Jim', 'Gaffigan');
insert into people values ('Jim', 'Gaffigan');
insert into people values ('Adam', 'Sandler');
-- Show table with duplicates
select * from people;
-- Create table with one version of each duplicate record
create table dups as
select distinct fname, lname, count(*)
from people group by fname, lname
having count(*) > 1;
-- Delete all matching duplicate records
delete people from people inner join dups
on people.fname = dups.fname AND
people.lname = dups.lname;
-- Insert single record of each dup back into table
insert into people select fname, lname from dups;
-- Show Fixed table
select * from people;
Create a new table, select all rows grouped by the unique key (the name column in the example), and insert them into the new table.
To find out what that character is, run the following query:
SELECT HEX(Name) FROM TableName WHERE Name LIKE 'Hach%';
You will see the hexadecimal byte values of each name, including the code of that 'square'.
If that character is 'x', you could update like this (but if that column is Unique, you will get some errors):
UPDATE TableName SET Name=TRIM(TRAILING 'x' FROM Name);
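To make the HEX inspection concrete, the sketch below stores two look-alike names and prints their byte values. It uses Python's standard-library sqlite3 module as a stand-in for MySQL (SQLite's hex() plays the role of MySQL's HEX()), and the control character \x01 is only an assumed placeholder for the unknown 'square' character.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableName (Name TEXT)")
conn.executemany(
    "INSERT INTO TableName VALUES (?)",
    [
        ("Hach?",),     # the name as it appears in the question
        ("Hach\x01",),  # hypothetical unprintable trailing character
    ],
)

# The hex dump exposes the trailing byte that the display hides.
hex_values = [
    row[0]
    for row in conn.execute(
        "SELECT hex(Name) FROM TableName WHERE Name LIKE 'Hach%' ORDER BY rowid"
    )
]
print(hex_values)
```

The first four byte pairs (48 61 63 68) spell "Hach" in both rows; only the final byte differs, which is the value to look up or strip.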
I'll assume this is a random MySQL 4.1 bug. Some values are just changing on their own for no particular reason, even when the change violates MySQL constraints. MySQL is simply ignoring those violations.
To solve my problem, I will write a program that tries to reinsert every row of data into the same table (to be precise: another table with the same characteristics) and logs every failure.
I will leave the incident open for a while in case someone gets the same problem and someone else finds a more practical solution.
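The re-insert-and-log program described above could look roughly like this. It is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for MySQL; the table names source_rows and checked_rows and the sample data are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
# source_rows plays the damaged table: it holds values that break uniqueness.
conn.execute("CREATE TABLE source_rows (name TEXT)")
# checked_rows has the same shape, with the constraint actually enforced.
conn.execute("CREATE TABLE checked_rows (name TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO source_rows VALUES (?)",
    [("Alpha",), ("Beta",), ("Alpha",), ("Gamma",)],
)

failures = []
source = conn.execute("SELECT rowid, name FROM source_rows ORDER BY rowid").fetchall()
for rowid, name in source:
    try:
        conn.execute("INSERT INTO checked_rows (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError as exc:
        # Record which source row was rejected and why, then carry on.
        failures.append((rowid, name, str(exc)))

for failure in failures:
    print(failure)
```

Every row that violates a constraint ends up in the failures list in a single pass, instead of being discovered one import error at a time.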