Fast delete duplicate records in MySQL - mysql

I'm trying to import a very big SQL dump (around 37 million rows) into an InnoDB table. There are tons of duplicates, and what I want to achieve is to prevent duplicate rows from being inserted without changing the actual dump. The field email may contain duplicates. I tried the following: after importing the whole dump into the db, I executed this SQL:
set session old_alter_table=1;
ALTER IGNORE TABLE sample ADD UNIQUE (email);
But the second query had been running for around an hour, so I just canceled it.
What is the proper way to get rid of the duplicates?
I have a couple of ideas:
Maybe create the table with a unique index before starting the import, so duplicates are rejected during insertion without breaking the whole process?
Maybe after importing the dump, select the distinct email values and insert them into another table? (A rough sketch of this is below.)
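A rough sketch of that second idea, once the dump has been loaded into sample (sample_dedup and sample_with_dupes are placeholder names):
CREATE TABLE sample_dedup LIKE sample;
ALTER TABLE sample_dedup ADD UNIQUE KEY uq_email (email);

-- INSERT IGNORE keeps the first row seen for each email and silently skips the rest
INSERT IGNORE INTO sample_dedup
SELECT * FROM sample;

-- swap the tables once the copy looks right
RENAME TABLE sample TO sample_with_dupes, sample_dedup TO sample;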

From a .dump file
When importing, use -f for "force":
mysql -f -p < 2015-10-01.sql
This causes the import to continue after an error is encountered, which is useful in this case if you create the unique key constraint before importing.
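For example, a minimal sketch using the table and dump file from the question: create the constraint on the empty table, then force the import so rows that hit the unique key are skipped.
-- run this first, while the table is still empty
ALTER TABLE sample ADD UNIQUE (email);
mysql -f -p < 2015-10-01.sql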
From a .csv file
If you are using "LOAD DATA", use "IGNORE", e.g.:
LOAD DATA LOCAL INFILE 'somefile.csv' IGNORE
INTO TABLE some_db.some_tbl
FIELDS TERMINATED BY ';'
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(`somefield1`,`somefield2`);
According to the documentation:
If you specify IGNORE, rows that duplicate an existing row on a unique key value are discarded.
This requires you to create the unique key constraint before importing, which will be fast on an empty table.

Edit the dump file as follows:
Modify the CREATE TABLE statement to add a unique key on the email field, or add an ALTER TABLE statement after it.
Find all the INSERT INTO sample statements, and change them to INSERT IGNORE INTO sample.
You could also do step 2 using a pipeline:
sed 's/INSERT INTO sample/INSERT IGNORE INTO sample/' sample_table.dump | mysql -u root -p sample_db
If the file is too big to edit to add the ALTER TABLE statement, I suggest you create the dump with the --no-create-info option to mysqldump, and create the table by hand (with the unique key) before loading the dump file.
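A sketch of that variant (the column list is a guess at the schema; root, source_db, and sample_db are placeholders):
# dump only the data, not the CREATE TABLE statement
mysqldump --no-create-info -u root -p source_db sample > sample_data.sql

-- create the table by hand, with the unique key, before loading
CREATE TABLE sample (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  email VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_email (email)
) ENGINE=InnoDB;

# then load with -f (or pipe through the sed command above) so duplicates are skipped
mysql -f -u root -p sample_db < sample_data.sql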

Related

Mysql load data infile leaving unchanged fields

Suppose I have a MySQL table with three fields: key, value1, value2
I want to load data for two fields (key,value1) from file inserts.txt.
Content of inserts.txt:
1;2
3;4
with:
LOAD DATA LOCAL INFILE
"inserts.txt"
REPLACE
INTO TABLE
`test_insert_timestamp`
FIELDS TERMINATED BY ';'
But in the case of REPLACE, I want to leave value2 unchanged.
How could I achieve this?
From the documentation, MySQL uses the following algorithm for REPLACE (and LOAD DATA ... REPLACE):
Try to insert the new row into the table
While the insertion fails because a duplicate-key error occurs for a primary key or unique index:
Delete from the table the conflicting row that has the duplicate key value
Try again to insert the new row into the table
(https://dev.mysql.com/doc/refman/5.7/en/replace.html)
So you can't keep a value from a row that is going to be deleted.
What you want to do is emulate "ON DUPLICATE KEY UPDATE" logic.
You can't do that within a single LOAD DATA query. What you have to do is load your data into a temporary table first, then INSERT from the temporary table into your destination table, where you can use the "ON DUPLICATE KEY UPDATE" feature.
The whole process is fully detailed in the most upvoted answer to this question: MySQL LOAD DATA INFILE with ON DUPLICATE KEY UPDATE
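In outline, the two-step load looks something like this (the staging column types are assumptions based on the sample file):
CREATE TEMPORARY TABLE staging (`key` INT, value1 INT);

LOAD DATA LOCAL INFILE 'inserts.txt'
INTO TABLE staging
FIELDS TERMINATED BY ';'
(`key`, value1);

-- update value1 on existing keys and insert new keys; value2 is never touched
INSERT INTO test_insert_timestamp (`key`, value1)
SELECT `key`, value1 FROM staging
ON DUPLICATE KEY UPDATE value1 = VALUES(value1);

DROP TEMPORARY TABLE staging;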

How to execute SQL queries skipping error?

I'm trying to import data into a table from SQL files using the command line.
This data contains duplicates in the field url.
But the field url in the table is unique, so when I try to insert the data I get a "Duplicate entry" error.
How do I import all the data, skipping this error?
You can use the --force (-f) flag.
mysql -u userName -p -f -D dbName < script.sql
From man mysql:
· --force, -f
Continue even if an SQL error occurs.
Create a staging table with the same structure as your destination table, but without the constraints (unique index included).
Manually check the duplicates and decide how you want to choose between duplicate rows or merge them.
Write the appropriate query and use "insert into ... select ..." (see the sketch below).
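For that last step, something along these lines keeps one row per url (the table and column names are placeholders, and MIN() is just one possible merge rule):
INSERT INTO destination_table (url, title)
SELECT url, MIN(title)   -- keep one value per group of duplicates
FROM staging_table
GROUP BY url;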
How do I import all the data, skipping this error?
Drop the index for the time being -> run your batch insert -> recreate the index afterwards.
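Roughly (index, column, and table names assumed). Note that re-adding the UNIQUE index will fail if duplicates were loaded, so they have to be removed first:
ALTER TABLE your_table DROP INDEX uq_url;

-- ... run the batch insert / import here ...

-- remove duplicates, keeping the row with the lowest id for each url
DELETE t1
FROM your_table t1
JOIN your_table t2 ON t1.url = t2.url AND t1.id > t2.id;

ALTER TABLE your_table ADD UNIQUE INDEX uq_url (url);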
If you are using INSERT, then you can ignore errors using INSERT IGNORE or ON DUPLICATE KEY UPDATE (the latter is preferable because it only handles duplicate-key errors).
If you are using LOAD DATA INFILE, then you can use the IGNORE keyword. As described in the documentation:
If you specify IGNORE, rows that duplicate an existing row on a unique key value are discarded. For more information, see Comparison of the IGNORE Keyword and Strict SQL Mode.
Or, do as I would normally do:
Load the data into a staging table.
Validate the staging table and only load the appropriate data into the final table.

adding data from excel to existing table in MySQL

I have an existing table in MySQL with 7 fields: 4 are integer (one acting as the primary key with auto increment) and 3 are text. This table already has data.
I am collecting new data in Excel and wish to bulk-add it to the MySQL table. Some of the cells in Excel will be blank and need to show as NULL in MySQL, and the primary key won't be part of the Excel file; I let MySQL auto-add it. Previously I was manually adding each record through phpMyAdmin, but it is far too time-consuming (I have thousands of records to add).
How do I go about this in terms of setting up the Excel sheet the right way and making sure I add to the existing data instead of replacing it? I have heard CSV files are the way? I use phpMyAdmin if that helps.
Thanks
If you want to append data to a MySQL table, I would use .csv files to do it.
In MySQL:
load data local infile 'yourFile'
into table yourTable
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'  -- on Windows you may need to use '\r\n'
ignore 1 lines            -- ignore the header line
(field1, field2, field3);
Check the reference manual: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
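The question also mentions blank Excel cells that should end up as NULL; LOAD DATA reads empty fields as empty strings for text columns, so one common trick is to route them through user variables (the column names here are made up):
load data local infile 'yourFile'
into table yourTable
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
ignore 1 lines
(field1, @f2, @f3)
set field2 = nullif(@f2, ''),  -- blank cells become NULL
    field3 = nullif(@f3, '');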

LOAD DATA LOCAL INFILE help required

Here's my query for loading mysql table using csv file.
LOAD DATA LOCAL INFILE 'table.csv' REPLACE INTO TABLE table1 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\'' LINES TERMINATED BY 'XXX' IGNORE 1 LINES
SET date_modified = CURRENT_TIMESTAMP;
Suppose my CSV contains 500 records with 15 columns. I changed three rows and terminated them with 'XXX'. I now want to update the MySQL table with this file. My primary key is an auto-incremented value. When I run this query, all 500 rows get updated with the old data and the rows I changed get added as new ones. I don't want the new ones; I want my table to be replaced with the CSV as-is. I tried changing my primary key to non-AI, and it still didn't work. Any pointers, please? Thanks.
I am making some assumptions here.
1) You don't have the autonumber value in your file.
Since your primary key is not in your file, MySQL will not be able to match rows. An autonumber primary key is an artificial key, and thus it is not part of the data; MySQL adds this artificial primary key when the row is inserted.
Let's assume your file contained some unique identifier, let's call it Identification_Number. If this number is both in the file and used by your table as the primary key, then MySQL will be able to identify the rows from the file and match them to the rows in the table.
While a lot of people will only use autonumbers in a database, I always check whether there is a natural key in the data. If I identify one, I do some performance testing with this natural key in a table, and then, based on the performance metrics of both, I decide on a key.
Hopefully I did not get your question wrong but I suspect this might be the case.

mysqldump table without dumping the primary key

I have one table spread across two servers running MySql 4. I need to merge these into one server for our test environment.
These tables literally have millions of records each, and the reason they are on two servers is because of how huge they are. Any altering and paging of the tables would be too big a performance hit.
Because they are on a production environment, it is impossible for me to alter them in any way on their existing servers.
The issue is the primary key is a unique auto incrementing field, so there are intersections.
I've been trying to figure out how to use the mysqldump command to ignore certain fields, but the --disable-keys merely alters the table, instead of getting rid of the keys completely.
At this point it's looking like I'm going to need to modify the database structure to utilize a checksum or hash for the primary key as a combination of the two unique fields that actually should be unique... I really don't want to do this.
Help!
To solve this problem, I looked up this question, found #pumpkinthehead's answer, and realized that all we need to do is find+replace the primary key in each row with NULL so that MySQL will use the default auto_increment value instead.
(your complete mysqldump command) | sed -e "s/([0-9]*,/(NULL,/gi" > my_dump_with_no_primary_keys.sql
Original output:
INSERT INTO `core_config_data` VALUES
(2735,'default',0,'productupdates/configuration/sender_email_identity','general'),
(2736,'default',0,'productupdates/configuration/unsubscribe','1'),
Transformed Output:
INSERT INTO `core_config_data` VALUES
(NULL,'default',0,'productupdates/configuration/sender_email_identity','general'),
(NULL,'default',0,'productupdates/configuration/unsubscribe','1'),
Note: this is still a hack; for example, it will fail if your auto-increment column is not the first column, but it solves my problem 99% of the time.
If you don't care what the value of the auto_increment column will be, then just load the first file, rename the table, then recreate the table and load the second file. Finally, use
INSERT newly_created_table_name (all, columns, except, the, auto_increment, column)
SELECT all, columns, except, the, auto_increment, column
FROM renamed_table_name
You can create a view of the table without the primary key column, then run mysqldump on that view.
So if your table "users" has the columns: id, name, email
> CREATE VIEW myView AS
SELECT name, email FROM users
Edit: ah I see, I'm not sure if there's any other way then.
Clone your table.
Drop the column in the clone table.
Dump the clone table without the structure (but with the -c option to get complete inserts).
Import it where you want (a sketch of these steps follows).
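A sketch of those steps, assuming a users table whose primary key column is id (your_db and the file names are placeholders):
CREATE TABLE users_clone LIKE users;           -- clone the structure
INSERT INTO users_clone SELECT * FROM users;   -- copy the data
ALTER TABLE users_clone DROP COLUMN id;        -- drop the primary key column

# complete INSERTs, no CREATE TABLE; the statements will name users_clone,
# so adjust the table name when importing
mysqldump -c --no-create-info -u root -p your_db users_clone > users_no_pk.sql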
This is a total pain. I get around this issue by running something like
sed -e "s/([0-9]*,/(/gi" export.sql > expor2.sql
on the dump to get rid of the primary keys and then
sed -e "s/VALUES/(col1,col2,...etc.) VALUES/gi" LinxImport2.sql > LinxImport3.sql
for all of the columns except for the primary key. Of course, you'll have to be careful that ([0-9]*, doesn't replace anything that you actually want.
Hope that helps someone.
SELECT null as fake_pk, `col_2`, `col_3`, `col_4` INTO OUTFILE 'your_file'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM your_table;
LOAD DATA INFILE 'your_file' INTO TABLE your_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
For added fanciness, you can set a BEFORE INSERT trigger on your receiving table that sets the new primary key for each row before the insertion occurs, thereby using regular dumps and still clearing your pk. Not tested, but feeling pretty confident about it.
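Untested like the suggestion itself, but a sketch of such a trigger might look like this (assuming the auto-increment column is named id; with the default SQL mode, a value of 0 makes AUTO_INCREMENT assign the next value):
CREATE TRIGGER clear_pk_before_insert
BEFORE INSERT ON your_table
FOR EACH ROW
SET NEW.id = 0;  -- AUTO_INCREMENT supplies the real value at insert time

-- remember to DROP TRIGGER clear_pk_before_insert afterwards,
-- or every later insert will also have its explicit id discarded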
Use a dummy temporary primary key:
Use mysqldump normally (--opts -c). For example, say your primary key is 'id'.
Edit the output files and add a column "dummy_id" to the structure of your table with the same type as 'id' (but not a primary key, of course). Then modify the INSERT statements and replace 'id' with 'dummy_id'. Once imported, drop the column 'dummy_id'.
jimyi was on the right track.
This is one of the reasons why autoincrement keys are a PITA. One solution is not to delete data but add to it.
CREATE VIEW myView AS
SELECT id*10+$x, name, email FROM users
(where $x is a single digit uniquely identifying the original database), either creating the view on the source database (which you hint may not be possible), using an extract routine like the one described by Autocracy, or loading the data into staging tables on the test box.
Alternatively, don't create the table on the test system - instead put in separate tables for the src data then create a view which fetches from them both:
CREATE VIEW users AS
(SELECT * FROM users_on_a) UNION (SELECT * FROM users_on_b)
C.
The solution I've been using is to do a regular SQL export of the data, then remove the primary key from the INSERT statements using a regex find-and-replace in an editor. Personally I use Sublime Text, but I'm sure TextMate, Notepad++, etc. can do the same.
Then I just run the query in whichever database the data should be inserted into, by copy-pasting it into HeidiSQL's query window or phpMyAdmin. If there's a LOT of data, I save the insert query to an SQL file and use file import instead. Copy & paste with huge amounts of text often makes Chrome freeze.
This might sound like a lot of work, but I rarely spend more than a couple of minutes between the export and the import. Probably a lot less than I would spend on the accepted solution. I've used this method on several hundred thousand rows without issue, but I think it would get problematic when you reach the millions.
I like the temporary table route.
create temporary table my_table_copy
select * from my_table;
alter table my_table_copy drop id;
// Use your favorite dumping method for the temporary table
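One caveat: a TEMPORARY table is only visible to the connection that created it, so mysqldump (a separate connection) won't see it. Either drop the TEMPORARY keyword or dump it from the same session, e.g. (output path assumed):
SELECT * FROM my_table_copy
INTO OUTFILE '/tmp/my_table_no_pk.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';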
Like the others, this isn't a one-size-fits-all solution (especially given the OP's millions of rows), but even at 10^6 rows it only takes a few seconds to run, and it works.