Delete Duplicates from large mysql Address DB - mysql

I know, deleting duplicates from mysql is often discussed here. But none of the solution work fine within my case.
So, I have a DB with Address Data nearly like this:
ID; Anrede; Vorname; Nachname; Strasse; Hausnummer; PLZ; Ort; Nummer_Art; Vorwahl; Rufnummer
ID is primary Key and unique.
And i have entrys for example like this:
1;Herr;Michael;Müller;Testweg;1;55555;Testhausen;Mobile;012345;67890
2;Herr;Michael;Müller;Testweg;1;55555;Testhausen;Fixed;045678;877656
The different PhoneNumber are not the problem, because they are not relevant for me. So i just want to delete the duplicates in Lastname, Street and Zipcode. In that case ID 1 or ID 2. Which one of both doesn't matter.
I tried it actually like this with delete:
DELETE db
FROM Import_Daten db,
Import_Daten dbl
WHERE db.id > dbl.id AND
db.Lastname = dbl.Lastname AND
db.Strasse = dbl.Strasse AND
db.PLZ = dbl.PLZ;
And insert into a copy table:
INSERT INTO Import_Daten_1
SELECT MIN(db.id),
db.Anrede,
db.Firstname,
db.Lastname,
db.Branche,
db.Strasse,
db.Hausnummer,
db.Ortsteil,
db.Land,
db.PLZ,
db.Ort,
db.Kontaktart,
db.Vorwahl,
db.Durchwahl
FROM Import_Daten db,
Import_Daten dbl
WHERE db.lastname = dbl.lastname AND
db.Strasse = dbl.Strasse And
db.PLZ = dbl.PLZ;
The complete table contains over 10Mio rows. The size is actually my problem. The mysql runs on a MAMP Server on a Macbook with 1,5GHZ and 4GB RAM. So not really fast. SQL Statements run in a phpmyadmin. Actually i have no other system possibilities.

You can write a stored procedure that will each time select a different chunk of data (for example by rownumber between two values) and delete only from that range. This way you will slowly bit by bit delete your duplicates

A more effective two table solution can look like following.
We can store only the data we really need to delete and only the fields that contain duplicate information.
Let's assume we are looking for duplicate data in Lastname , Branche, Haushummer fields.
Create table to hold the duplicate data
DROP TABLE data_to_delete;
Populate the table with data we need to delete ( I assume all fields have VARCHAR(255) type )
CREATE TABLE data_to_delete (
id BIGINT COMMENT 'this field will contain ID of row that we will not delete',
cnt INT,
Lastname VARCHAR(255),
Branche VARCHAR(255),
Hausnummer VARCHAR(255)
) AS SELECT
min(t1.id) AS id,
count(*) AS cnt,
t1.Lastname,
t1.Branche,
t1.Hausnummer
FROM Import_Daten AS t1
GROUP BY t1.Lastname, t1.Branche, t1.Hausnummer
HAVING count(*)>1 ;
Now let's delete duplicate data and leave only one record of all duplicate sets
DELETE Import_Daten
FROM Import_Daten LEFT JOIN data_to_delete
ON Import_Daten.Lastname=data_to_delete.Lastname
AND Import_Daten.Branche=data_to_delete.Branche
AND Import_Daten.Hausnummer = data_to_delete.Hausnummer
WHERE Import_Daten.id != data_to_delete.id;
DROP TABLE data_to_delete;

You can add a new column e.g. uq and make it UNIQUE.
ALTER TABLE Import_Daten
ADD COLUMN `uq` BINARY(16) NULL,
ADD UNIQUE INDEX `uq_UNIQUE` (`uq` ASC);
When this is done you can execute an UPDATE query like this
UPDATE IGNORE Import_Daten
SET
uq = UNHEX(
MD5(
CONCAT(
Import_Daten.Lastname,
Import_Daten.Street,
Import_Daten.Zipcode
)
)
)
WHERE
uq IS NULL;
Once all entries are updated and the query is executed again, all duplicates will have the uq field with a value=NULL and can be removed.
The result then is:
0 row(s) affected, 1 warning(s): 1062 Duplicate entry...
For newly added rows always create the uq hash and and consider using this as the primary key once all entries are unique.

Related

MYSQL: How to update unique random number to existing rows

It's been my first question to this website, I'm sorry if I used any wrong keywords. I have been with one problem from quite a few days.
The Problem is, I have a MYSQL table named property where I wanted to add a ref number which will be a unique 6 digit non incremental number so I alter the table to add a new column named property_ref which has default value as 1.
ALTER TABLE property ADD uniqueIdentifier INT DEFAULT (1) ;
Then I write a script to first generate a number then checking it to db if exist or not and If not exist then update the row with the random number
Here is the snippet I tried,
with cte as (
select subIdentifier, id from (
SELECT id, LPAD(FLOOR(RAND() * (999999 - 100000) + 100000), 6, 0) AS subIdentifier
FROM property as p1
WHERE "subIdentifier" NOT IN (SELECT uniqueIdentifier FROM property as p2)
) as innerTable group by subIdentifier
)
UPDATE property SET uniqueIdentifier = (
select subIdentifier from cte as c where c.id = property.id
) where property.id != ''
this query returns a set of record for almost all the rows but I have a table of entries of total 20000,
but this query fills up for ~19000 and rest of the rows are null.
here is a current output
[current result picture]
If anyone can help, I am extremely thanks for that.
Thanks
Instead of trying to randomly generate unique numbers that do not exist in the table, I would try the approach of randomly generating numbers using the ID column as a seed; as long as the ID number is unique, the new number will be unique as well. This is not technically fully "random" but it may be sufficient for your needs.
https://www.db-fiddle.com/f/iqMPDK8AmdvAoTbon1Yn6J/1
update Property set
UniqueIdentifier = round(rand(id)*1000000)
where UniqueIdentifier is null
SELECT id, round(rand(id)*1000000) as UniqueIdentifier FROM test;

SQL normalized data INSERT WHERE NOT EXISTS ; ON DUPLICATE KEY UPDATE

i am using MySql Workbench version 6.3.9 with mySql 5.6.35.
i have the following tables:
EQUIPMENT
eID | caochID | eName
COACH
coachID | coachName
SQLfiddle prepared http://sqlfiddle.com/#!9/e333d/1
eID is a primary key. there are multiple coachID's in different equipment, so there will be duplicate coachIDs with different equipment, but the eID will be unique as it is a primary key.
REQUIRED
i need to insert a row in the equipment table, if it does not already exist. If it exists, do nothing.
various posts online have pointed me towards two options:
a) INSERT...ON DUPLICATE KEY UPDATE...
b)INSERT...WHERE NOT EXISTS
PROBLEM i have problems with both of these solutions. for the first solution (ON DUPLICATE KEY UPDATE) the query inserts the row as required but does not update the existing row. instead it creates a new entry. for the second solution (WHERE NOT EXISTS) i get an error : SYNTAX ERROR: 'WHERE' (WHERE) is not a valid input at this position.
the sql query doesnt need to make any joins. i listed both tables so that you can see how they are related. the insert query i need will only insert for the equipment table.
You can insert by using a tmp table and ensuring that the same record is not existing from current table. Add limit 1 to ensure only one record is inserted. Below query will not insert since 1 and small ball exists.
INSERT INTO `Equipment` (`c_id`, `eName`)
SELECT * FROM (SELECT '1', 'small ball') tmp
WHERE NOT EXISTS (
SELECT c_id FROM Equipment WHERE `c_id`='1' and `eName` = 'small ball'
) LIMIT 1;
NOT EXISTS
insert into table2 (....) --- all if not columns ... destination
select ....
from table1 t1 --- source of data to check
where not exists (
select 1
from table2 t2
where t2.col = t1.col --- match source and destination table making sure table1 data is not in table2
)

How can I update an existing record to have a new auto_increment id in MySQL?

I have a table with primary key (its name is "id") defined as auto_increment. I use NULL in INSERT statements to "fill" the id value. It works, of course. However now I need to "move" an existing record to a new primary key value (the next available, the value is not so much important, but it must be a new one, and the last one if ordered by id). How can I do it in an "elegant" way? Since the "use NULL at INSERT" does not work too much with UPDATE:
update idtest set id=NULL where id=1;
This simply makes the id of the record zero. I would expect to do the same thing as with INSERT, but it seems my idea was incorrect.
Of course I can use "INSERT ... SELECT" statement, then a DELETE on the old one, or I can use something like MAX(id) + 1 to UPDATE the id of the old record in one step, etc, but I am curious if there is a finer solution.
Also, the MAX(id) solution doesn't seem to work either by the way:
mysql> update idtest set id=max(id)+1 where id=3;
ERROR 1111 (HY000): Invalid use of group function
mysql> update idtest set id=(select max(id)+1 from idtest) where id=3;
ERROR 1093 (HY000): You can't specify target table 'idtest' for update in FROM clause
This is the way I believe:
UPDATE users SET id = (SELECT `AUTO_INCREMENT`
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test'
AND TABLE_NAME = 'users') WHERE id = 2;
select * from users;
I used by own tables substitute yours.
test is database name, users is table name and id is AUTO_INCREMENT in my case.
EDIT: My Query above works perfect but its side effects are somewhat 'dangerous', upon next insert as AUTO_INCREMENT value will collide with this recently updated record so just next single insert will fail. To avoid that case I've modified above query to a transaction:
START transaction;
UPDATE users SET id = (SELECT `AUTO_INCREMENT`
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test'
AND TABLE_NAME = 'users') WHERE id = 2;
#renew auto increment to avoid duplicate warning on next insert
INSERT IGNORE INTO users(username) values ('');
COMMIT
Hope this will help someone if not OP.
The way you are trying to update same table is wrong but you can use join on same table
update idtest t
join (select id +1 as id
from idtest order by id desc
limit 1) t1
set t.id=t1.id
where t.id=3;
or
update idtest t
join (select max(id) +1 as id
from idtest ) t1
set t.id=t1.id
where t.id=3;
You can use the REPLACE INTO clause to do the trick.
From the manual:
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 13.2.5, "INSERT Syntax".
EDIT
My mistake (in the comments) that you have to have two unique constraint to achieve this:
When you use the auto_increment value to REPLACE the record, the record will be replaced with the give ID and will not change (however the AI value will increment).
You have to exclude the AI column from the query. You can do that if you have one more UQ constraint.
Check this SQLFiddle demo: http://sqlfiddle.com/#!2/1a702e
The first query will replace all the records (but the id's value will not change).
The second one will replace it too, and the new AI value will be used. (Please note, that the second query does not contain the id column, and there is a UQ constraint on the some column).
You can notice, that the second query uses higher AI values than it is excepted: this is because the first replace incremented the AI value.
If you do not have two unique keys (one for the AI and one for another columns), the REPLACE statement will work as a normal INSERT statement!
(Ofcourse you can change one of the UNIQUE KEYs with a PRIMARY KEY)

Update with Subquery never completes

I'm currently working on a project with a MySQL Db of more than 8 million rows. I have been provided with a part of it to test some queries on it. It has around 20 columns out of which 5 are of use to me. Namely: First_Name, Last_Name, Address_Line1, Address_Line2, Address_Line3, RefundID
I have to create a unique but random RefundID for each row, that is not the problem. The problem is to create same RefundID for those rows whose First_Name, Last_Name, Address_Line1, Address_Line2, Address_Line3 as same.
This is my first real work related to MySQL with such large row count. So far I have created these queries:
-- Creating Teporary Table --
CREATE temporary table tempT (SELECT tt.First_Name, count(tt.Address_Line1) as
a1, count(tt.Address_Line2) as a2, count(tt.Address_Line3) as a3, tt.RefundID
FROM `tempTable` tt GROUP BY First_Name HAVING a1 >= 2 AND a2 >= 2 AND a3 >= 2);
-- Updating Rows with First_Name from tempT --
UPDATE `tempTable` SET RefundID = FLOOR(RAND()*POW(10,11))
WHERE First_Name IN (SELECT First_Name FROM tempT WHERE First_Name is not NULL);
This update query keeps on running but never ends, tempT has more than 30K rows. This query will then be run on the main DB with more than 800K rows.
Can someone help me out with this?
Regards
The solutions that seem obvious to me....
Don't use a random value - use a hash:
UPDATE yourtable
SET refundid = MD5('some static salt', First_Name
, Last_Name, Address_Line1, Address_Line2, Address_Line3)
The problem is that if you are using an integer value for the refundId then there's a good chance of getting a collision (hint CONV(SUBSTR(MD5(...),1,16),16,10) to get a SIGNED BIGINT). But you didn't say what the type of the field was, nor how strict the 'unique' requirement was. It does carry out the update in a single pass though.
An alternate approach which creates a densely packed seguence of numbers is to create a temporary table with the unique values from the original table and a random value. Order by the random value and set a monotonically increasing refundId - then use this as a look up table or update the original table:
SELECT DISTINCT First_Name
, Last_Name, Address_Line1, Address_Line2, Address_Line3
INTO temptable
FROM yourtable;
set #counter=-1;
UPDATE temptable t SET t,refundId=(#counter:=#counter + 1)
ORDER BY r.randomvalue;
There are other solutions too - but the more efficient ones rely on having multiple copies of the data and/or using a procedural language.
Try using the following:
UPDATE `tempTable` x SET RefundID = FLOOR(RAND()*POW(10,11))
WHERE exists (SELECT 1 FROM tempT y WHERE First_Name is not NULL and x.First_Name=y.First_Name);
In MySQL, it is often more efficient to use join with update than to filter through the where clause using a subquery. The following might perform better:
UPDATE `tempTable` join
(SELECT distinct First_Name
FROM tempT
WHERE First_Name is not NULL
) fn
on temptable.First_Name = fn.First_Name
SET RefundID = FLOOR(RAND()*POW(10,11));

trouble copying data from one table to another with auto increment field

I have the following SQL statement which was working perfectly until I moved it to another server. The middle query (encapsulated in ** ) does not seem to work. I am getting an error saying that 'AUTO' is an incorrect integer type. If I remove it altogether, it says that I have an incorrect number of fields. I am trying to copy data from one table to another and allow the destination table to auto increment its ID number.
SET sql_safe_updates=0;
START TRANSACTION;
DELETE FROM shares
WHERE asset_id = '$asset_ID';
/*************************************************************/
INSERT INTO shares
SELECT 'AUTO', asset_ID, member_ID, percent_owner, is_approved
FROM pending_share_changes
WHERE asset_ID = '$asset_ID';
/*************************************************************/
DELETE FROM pending_share_changes
WHERE asset_ID = '$asset_ID';
DELETE FROM shares
WHERE asset_ID = '$asset_ID' AND percent_owner = '0';
COMMIT;";
Based on this page of the mysql docs, you have to do:
INSERT INTO shares
(column_name1, column_name2, column_name3, column_name4) -- changed!
SELECT asset_ID, member_ID, percent_owner, is_approved
FROM pending_share_changes
WHERE asset_ID = '$asset_ID';
The difference is that the column names of the "receiving" table are explicitly listed after the name of the receiving table.
The docs say
AUTO_INCREMENT columns work as usual.