Super slow load data infile when deleting duplicate rows


I ran into an issue while loading data into my MySQL database. This is how I insert the data:
USE database;
ALTER TABLE country
ADD UNIQUE INDEX idx_name (`insee_code`,`post_code`,`city`);
LOAD DATA INFILE 'C:/wamp64/tmp/myfile-csv'
REPLACE
INTO TABLE `country` CHARACTER SET utf8
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
My table is simply:
CREATE TABLE `country` (`insee_code` VARCHAR(250),
`post_code` VARCHAR(250),
`city` VARCHAR(250));
Previously I used a PHP script to load other tables, and it was pretty fast (3 GB in 3 minutes), but with this approach it takes 17 minutes to load 1 GB.
I don't know why it is so slow. Also, with the index, some rows are lost or corrupted, and I'm wondering why. If someone has another way to delete duplicate rows while loading data from a CSV, I'd appreciate hearing it.
Thanks in advance.

With REPLACE, a row that collides with an existing one is effectively deleted first, and then the new row is inserted. What you want is IGNORE instead.
Read more about it here: 13.2.7 LOAD DATA INFILE Syntax
The REPLACE and IGNORE keywords control handling of input rows that
duplicate existing rows on unique key values:
If you specify REPLACE, input rows replace existing rows. In other words, rows that have the same value for a primary key or unique index
as an existing row. See Section 13.2.9, “REPLACE Syntax”.
If you specify IGNORE, rows that duplicate an existing row on a unique key value are discarded. For more information, see Comparison
of the IGNORE Keyword and Strict SQL Mode.
It would also be better to add a primary key. If you don't, InnoDB uses the first UNIQUE index whose columns are all NOT NULL as the clustered index; your columns are nullable, so it instead generates a hidden clustered index on a 6-byte row ID. That hidden key is not visible to you and is not optimal performance- and storage-wise. Execute this:
ALTER TABLE country ADD column id int unsigned auto_increment primary key;
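As a sketch of the suggestion above, the original statement could be rewritten with IGNORE in place of REPLACE. It reuses the file path and table from the question unchanged. The explicit column list at the end is an assumption: it is only needed once the table has an extra id column (as added by the ALTER TABLE above) that is not present in the CSV.

```sql
-- Same load as in the question, but duplicate rows are skipped
-- instead of being deleted and re-inserted.
LOAD DATA INFILE 'C:/wamp64/tmp/myfile-csv'
IGNORE
INTO TABLE `country` CHARACTER SET utf8
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
-- Map the three CSV fields explicitly so the auto-increment id is generated.
(`insee_code`, `post_code`, `city`);
```

Note that the two IGNORE keywords do different things: the first controls duplicate handling, the second (IGNORE 1 LINES) skips the header line.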

Related

AUTO_INCREMENT Indexing Duplicate Entry

I am attempting to load data from a csv file into a MySQL database using the LOAD DATA command.
My csv is structured like:
Index   Name      ...
0       blah
1       blahbla
...
But when trying to read my data using
CREATE TABLE data (
id INT NOT NULL AUTO_INCREMENT,
KernelName VARCHAR(255) NOT NULL,
...,
primary key (id)
);
USE myDatabase;
LOAD DATA INFILE '/filepath/myFile.csv'
INTO TABLE myTable
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
I receive the error ERROR 1062 (23000): Duplicate entry '1' for key 'myTable.PRIMARY'
I suspect this is happening because AUTO_INCREMENT makes the id start at 1 instead of the 0 that I'm reading, causing the duplicate entry error.
I'm new to MySQL and don't care whether indexing starts at 0 or 1; I'm just not sure what the easiest fix would be. Should I skip the index row? Change auto_increment to start at 0?

Update
I was able to fix this issue by omitting the AUTO_INCREMENT in my initial CREATE TABLE command. Then after importing data, I used ALTER TABLE table MODIFY id INTEGER NOT NULL AUTO_INCREMENT; to restore auto increment functionality.
Thanks to everyone who helped with this in the comments!
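Another option that may avoid dropping and restoring AUTO_INCREMENT is the NO_AUTO_VALUE_ON_ZERO SQL mode, which tells MySQL to store a literal 0 in an AUTO_INCREMENT column instead of generating the next value. The following is a minimal sketch, assuming the table and file from the question; the user variable @old_sql_mode is a name chosen here for illustration.

```sql
-- Remember the current mode so it can be restored afterwards.
SET @old_sql_mode = @@SESSION.sql_mode;

-- Append NO_AUTO_VALUE_ON_ZERO so that a 0 read from the file is stored as 0
-- rather than being replaced by the next AUTO_INCREMENT value.
SET SESSION sql_mode =
    CONCAT_WS(',', NULLIF(@old_sql_mode, ''), 'NO_AUTO_VALUE_ON_ZERO');

LOAD DATA INFILE '/filepath/myFile.csv'
INTO TABLE myTable
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

-- Restore the previous mode for the rest of the session.
SET SESSION sql_mode = @old_sql_mode;
```

The mode change is limited to the current session, so other connections are not affected.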

Removing duplicates with unique index

I copied fields A, B, C, D between two tables, believing I had created a unique index on A, B, C, D to prevent duplicates. However, I had somehow created only a normal index on them, so duplicates got inserted. It is a 20 million record table.
If I change my existing index from normal to unique, or simply add a new unique index on A, B, C, D, will the duplicates be removed, or will adding it fail because duplicate records exist? I'd test it, but it is 30 million records and I wish neither to mess the table up nor to duplicate it.

If you have duplicates in your table and you use
ALTER TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);
the query will fail with Error 1062 (duplicate key).
But if you use IGNORE
-- (only works before MySQL 5.7.4)
ALTER IGNORE TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);
the duplicates will be removed. But the documentation doesn't specify which row will be kept:
IGNORE is a MySQL extension to standard SQL. It controls how ALTER TABLE works if there are duplicates on unique keys in the new table or
if warnings occur when strict mode is enabled. If IGNORE is not
specified, the copy is aborted and rolled back if duplicate-key errors
occur. If IGNORE is specified, only one row is used of rows with
duplicates on a unique key. The other conflicting rows are deleted.
Incorrect values are truncated to the closest matching acceptable
value.
As of MySQL 5.7.4, the IGNORE clause for ALTER TABLE is removed and
its use produces an error.
(ALTER TABLE Syntax)
If your version is 5.7.4 or later, you can:
Copy the data into a temporary table (it doesn't technically need to be temporary).
Truncate the original table.
Create the UNIQUE INDEX.
And copy the data back with INSERT IGNORE (which is still available).
CREATE TABLE tmp_data SELECT * FROM mytable;
TRUNCATE TABLE mytable;
ALTER TABLE mytable ADD UNIQUE INDEX myindex (A, B, C, D);
INSERT IGNORE INTO mytable SELECT * from tmp_data;
DROP TABLE tmp_data;
If you use the IGNORE modifier, errors that occur while executing the
INSERT statement are ignored. For example, without IGNORE, a row that
duplicates an existing UNIQUE index or PRIMARY KEY value in the table
causes a duplicate-key error and the statement is aborted. With
IGNORE, the row is discarded and no error occurs. Ignored errors
generate warnings instead.
(INSERT Syntax)
Also see: INSERT ... SELECT Syntax and Comparison of the IGNORE Keyword and Strict SQL Mode

If the table contains duplicates, adding the unique index will fail.
First check which duplicates there are:
select * from
(select a,b,c,d,count(*) as n from table_name group by a,b,c,d) x
where x.n > 1;
This may be an expensive query on 20M rows, but it will get you all the duplicate keys that would prevent you from adding the unique index.
You could split this up into smaller chunks by adding a where clause to the subquery: where a='some_value'
For the rows retrieved, you will have to change something to make them unique. Once that is done (the query returns 0 rows), you should be safe to add the unique index.
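If the extra copies can simply be dropped rather than edited, one possible way to make the rows unique is a self-join delete. This sketch assumes the table also has an auto-increment primary key named id, which the question does not mention. It keeps the row with the lowest id in each group of duplicates.

```sql
-- Delete every row for which an identical (a, b, c, d) row with a lower id exists.
-- Plain "=" is used on purpose: rows containing NULL in a, b, c or d are left
-- alone, matching how a UNIQUE index treats NULL values.
DELETE t1
FROM table_name AS t1
JOIN table_name AS t2
  ON  t1.a = t2.a
  AND t1.b = t2.b
  AND t1.c = t2.c
  AND t1.d = t2.d
  AND t1.id > t2.id;
```

On a table of this size the join is likely to be slow unless an index on (a, b, c, d) already exists, so it may be worth trying it on a copy or on a single value of a first.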

Instead of IGNORE you can use ON DUPLICATE KEY UPDATE, which will give you control over which values should prevail.
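A minimal sketch of that idea, reusing the mytable and tmp_data names from the earlier answer and assuming a hypothetical extra column E that is not part of the unique index:

```sql
-- Copy the rows back; when a row collides on the unique index (A, B, C, D),
-- overwrite E with the value from the incoming row instead of discarding it.
INSERT INTO mytable (A, B, C, D, E)
SELECT A, B, C, D, E
FROM tmp_data
ON DUPLICATE KEY UPDATE E = VALUES(E);
```

Here the last row read for a given key wins. Without an ORDER BY on the SELECT, the read order is not guaranteed, so add one if a specific row should prevail. The VALUES() function in this position is deprecated in recent MySQL 8.0 releases but is still accepted.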

To answer your question: adding a UNIQUE constraint on a column that has duplicate values will throw an error.
For example, you can try the following script:
CREATE TABLE `USER` (
`USER_ID` INT NOT NULL,
`USERNAME` VARCHAR(45) NOT NULL,
`NAME` VARCHAR(45) NULL,
PRIMARY KEY (`USER_ID`));
INSERT INTO USER VALUES(1,'apple', 'woz'),(2,'apple', 'jobs'),
(3,'google', 'sergey'),(4,'google', 'larry');
ALTER TABLE `USER`
ADD UNIQUE INDEX `USERNAME_UNIQUE` (`USERNAME` ASC);
/*
Operation failed: There was an error while applying the SQL script to the database.
ERROR 1062: Duplicate entry 'apple' for key 'USERNAME_UNIQUE'
*/

How to reseed an "Auto increment" column for InnoDB engine database?

I am using an artificial primary key for a table. The table has two columns: one is the primary key and the other is a Dates (datatype: Date) column. When I tried to load bulk data from a file (which contained values for the second column only), the YYYY part of each date was put into the primary key column (which is the first column in the table) and the rest of the date was truncated.
So I needed to reset the table. I tried the TRUNCATE TABLE statement, but it failed with an error because this table is referenced in a foreign key constraint of another table. So I had to use the delete * from table; statement instead. I did delete all the records, but when I inserted the records again (using INSERT INTO this time), the ID started incrementing from the value after the last year I had previously inserted (i.e. the counter was not reset).
NOTE:- I am using MySQL 5.5 and InnoDB engine.
MY EFFORT SO FAR:-
I tried ALTER TABLE table1 AUTO_INCREMENT=0; (Reference Second Answer) ---> IT DID NOT HELP.
I tried ALTER TABLE table1 DROP column; (Reference- answer 1) ---> Error on rename of table1
Deleted the table again and tried to do:
DBCC CHECKIDENT('table1', RESEED, 0);
(Reference) ---> Syntax error at "DDBC" - Unexpected INDENT_QUOTED
(This statement is right after the delete table statement, if that
matters)
In this article, under the section named "Auto Increment Columns for INNODB Tables" and the heading "Update 17 Feb 2009:", it says that in InnoDB truncate does reset the AUTO_INCREMENT index in versions higher than MySQL 4.1... So I want some way to truncate my table, or do something else to reset the AUTO_INCREMENT index.
QUESTION:-
Is there a way to somehow reset the auto_increment when I delete the data in my table?
I need a way to fix the aforementioned DBCC CHECKIDENT error, or to somehow truncate a table that is referenced in a foreign key constraint of another table.

Follow the steps below:
Step 1: Disable foreign key checks, truncate the table, and then enable the checks again:
set foreign_key_checks=0;
truncate table mytable;
set foreign_key_checks=1;
Step 2: When bulk loading, list only the table columns that are present in your CSV file (leave out the rest, including the auto-increment id), and make sure the columns in the CSV are in the same order as in that list. The auto-increment id column should not be in your CSV file.
You can use the command below to load the data.
LOAD DATA LOCAL INFILE '/root/myfile.csv' INTO TABLE mytable fields terminated by ',' enclosed by '"' lines terminated by '\n' (field2,field3,field5);
Note: If you are working in a Windows environment, change the file path and line terminator accordingly.

You can only reset the auto increment value to 1 (not 0). Therefore, unless I am mistaken you are looking for
alter table a auto_increment = 1;
You can query the next used auto increment value using
select auto_increment from information_schema.tables where
table_name='a' and table_schema=schema();
(Do not forget to replace 'a' with the actual name of your table).
You can play around with a test database (it is likely that your MySQL installation already has a database called test, otherwise create it using create database test;)
use test;
create table a (id int primary key auto_increment, x int); -- auto_increment = 1
insert into a (x) values (1), (42), (43), (12); -- auto_increment = 5
delete from a where id > 1; -- auto_increment = 5
alter table a auto_increment = 2; -- auto_increment = 2
delete from a;
alter table a auto_increment = 1; -- auto_increment = 1
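One detail the walkthrough above relies on: with InnoDB, the counter cannot be set to a value at or below the highest id still in the table, which is why the rows are deleted before each reset. A small sketch, using a hypothetical table b in the same test database:

```sql
use test;
create table b (id int primary key auto_increment, x int) engine=InnoDB;
insert into b (x) values (1), (42), (43), (12); -- ids 1 to 4
alter table b auto_increment = 1;               -- no error, but rows 1 to 4 still exist
insert into b (x) values (7);                   -- gets id 5, not 1
select id, x from b order by id;
```

The alter statement is accepted silently, so checking information_schema.tables afterwards is the easiest way to see which value actually took effect.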

CSV data import and data processing

I'm having to import, on a very regular basis, data from a CSV into a MySQL database.
LOAD DATA LOCAL INFILE '/path/to/file.csv' INTO TABLE `tablename` FIELDS TERMINATED BY ','
The data I'm importing doesn't have a primary key column and equally I can't alter the structure of the CSV file as I have no control over it.
So I need to import this CSV data into a temporary MySQL table which is fine but then I need to take this data and process it line by line. As each row is run through a process, I need to delete that row from the temporary table so that I don't re-process it.
Because the temporary table has no primary key I can't do DELETE FROM tablename WHERE id=X which would be the best option, instead I have to match against a bunch of alphanumeric columns (probably up to 5 in order to avoid accidentally deleting a duplicate).
Alternatively I was thinking I could alter the table AFTER the CSV import process was complete and add a primary key column, then process the data as previously explained. Then when complete, alter the table again to remove the primary key column ready for a new import. Can someone please tell if this is a stupid idea or not? What would be most efficient and quick?
Any ideas or suggestions greatly appreciated!

You can have an auto_increment column in your temporary table from the beginning and let MySQL populate its values as you load the data:
CREATE TEMPORARY TABLE tablename
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
col1 INT,
col2 VARCHAR(32),
col3 INT,
...
);
Then specify all columns in parentheses, but leave id out:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE `tablename`
FIELDS TERMINATED BY ','
(col1, col2, col3,...); -- specify all columns, but leave id out
That way you don't need to add and remove the id column before and after each import. Since you're importing on a regular basis, you could consider using a permanent table instead of a temporary one and simply TRUNCATE it once you're done, which clears the table and resets the id column.
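With the id column in place, the process-then-delete loop from the question could look roughly like this. It is only a sketch: the batch size of 100 and the literal id in the DELETE are placeholders, and the column names are the ones from the example table above.

```sql
-- Fetch the next batch of unprocessed rows in a stable order.
SELECT id, col1, col2, col3
FROM tablename
ORDER BY id
LIMIT 100;

-- After the application has handled the batch, remove it by key.
-- 100 stands in for the highest id that was actually processed.
DELETE FROM tablename WHERE id <= 100;
```

Deleting by the primary key avoids matching on several alphanumeric columns and cannot remove an unprocessed duplicate by accident.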

Delete Duplicates in MYSQL using a text column for comparison

We've got a database table that needs a multiple-column unique key. However, one of those columns is TEXT and has values as long as 1000 chars (so varchar won't work either). Because of the TEXT column, I can't actually have a unique key on those columns. What's a good way to remove duplicates? Of course, fast would be nice.

The best way is to use a UNIQUE INDEX to avoid duplicates.
Creating a new unique key on the columns that need to be unique will automatically clean the table of any duplicates (ALTER IGNORE only works before MySQL 5.7.4).
ALTER IGNORE TABLE `table_name`
ADD UNIQUE KEY `key_name`(`column_1`,`column_2`);
With IGNORE, the statement does not abort when the first duplicate-key error occurs; instead, the duplicate rows are deleted.

Add a unique constraint as below:
ALTER IGNORE TABLE table1
ADD UNIQUE unique_name(column1, column2, column3 ... Text);
Here IGNORE will help in removing the duplicates while creating the constraint.
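A TEXT column cannot be placed in an index without a prefix length, and a prefix would only compare the first characters. A different technique that is sometimes used instead is to index a fixed-length hash of the text. The sketch below assumes MySQL 5.7 or later (for generated columns) and uses hypothetical names: table1_dedup, text_column, text_hash and uq_cols_hash.

```sql
-- Empty copy of the original table.
CREATE TABLE table1_dedup LIKE table1;

-- Stored generated column holding a fixed-length digest of the TEXT value,
-- plus a unique key that uses the digest in place of the TEXT column itself.
ALTER TABLE table1_dedup
  ADD COLUMN text_hash CHAR(64) AS (SHA2(text_column, 256)) STORED,
  ADD UNIQUE KEY uq_cols_hash (column1, column2, text_hash);

-- Copy the rows across; rows that collide on the unique key are skipped.
INSERT IGNORE INTO table1_dedup (column1, column2, text_column)
SELECT column1, column2, text_column
FROM table1;
```

Two caveats: the digest is sensitive to case and trailing whitespace even when the column's collation is not, and rows whose text_column is NULL get a NULL hash, which a unique key does not treat as a duplicate. INSERT IGNORE also downgrades other errors to warnings, so it is worth checking SHOW WARNINGS after the copy.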
