Remove duplicated data from a very large table - MySQL

I have a table containing more than 500 million records in a MySQL database,
and I need to remove the duplicates from it.
I tried this query on a table containing 20 million records and it was fine, but on the 500 million records it takes a very long time:
-- Create temporary table
CREATE TABLE temp_table LIKE names_tbles;
-- Add constraint
ALTER TABLE temp_table ADD UNIQUE(name , family);
-- Copy data
INSERT IGNORE INTO temp_table SELECT * FROM names_tbles;
Is there a better solution?

One option is aggregation rather than insert ignore. That way, there is no need for the database to manage rejected records:
insert into temp_table(id, name, family)
select min(id), name, family
from names_tbles
group by name, family;
I would take one step further and suggest adding the unique constraints only after the table is populated, so there is no need for the database to check for duplicates (the query guarantees that already), which should speed up the insert statement.
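A minimal sketch of that sequence, assuming the table has only the three columns used in the query above (id, name, family); any other columns would need their own aggregate or a join back to names_tbles:

```sql
-- Empty copy with the same columns and indexes as the original.
CREATE TABLE temp_table LIKE names_tbles;

-- Keep one row per (name, family): the one with the smallest id.
INSERT INTO temp_table (id, name, family)
SELECT MIN(id), name, family
FROM names_tbles
GROUP BY name, family;

-- Add the constraint only after the bulk load has finished.
ALTER TABLE temp_table ADD UNIQUE (name, family);
```

Building the unique index once at the end is usually cheaper than maintaining it row by row during the load, although the actual gain depends on the engine and the data.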

Related

MySQL renaming and create table at the same time

I need to rename a MySQL table and create a new MySQL table at the same time.
There is a critical live table with a large number of records. Scripts are constantly inserting records into master_table.
I need to back up the master table and create another master table with the same name at the same time.
The general SQL is like this:
RENAME TABLE master_table TO backup_table;
CREATE TABLE master_table LIKE backup_table;
INSERT INTO master_table (id, value) VALUES ('1', '5000');
Is there a possibility of losing records while the above queries execute?
Is there any way to avoid missing records? Lock the master table, etc.?
What I do is the following. It results in no downtime, no data loss, and nearly instantaneous execution.
CREATE TABLE mytable_new LIKE mytable;
...possibly update the AUTO_INCREMENT of the new table...
RENAME TABLE mytable TO mytable_old, mytable_new TO mytable;
By renaming both tables in one statement, they are swapped atomically. There is no chance for any data to be written "in between" while there is no table to receive the write. If you don't do this atomically, some writes may fail.
RENAME TABLE is virtually instantaneous, no matter how large the table. You don't have to wait for data to be copied.
If the table has an auto-increment primary key, I like to make sure the new table starts with an id value greater than the current id in the old table. Do this before swapping the table names.
SELECT AUTO_INCREMENT FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA='mydatabase' AND TABLE_NAME='mytable';
I like to add some comfortable margin to that value. You want to make sure that the id values inserted to the old table won't exceed the value you queried from INFORMATION_SCHEMA.
Change the new table to use this new value for its next auto-increment:
ALTER TABLE mytable_new AUTO_INCREMENT=<increased value>;
Then promptly execute the RENAME TABLE to swap them. As soon as new rows are inserted to the new, empty table, it will use id values starting with the increased auto-increment value, which should still be greater than the last id inserted into the old table, if you did these steps promptly.
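The steps above could be combined into one script along these lines. This is a sketch, not a tested procedure: the margin of 100000 and the user variables @next_id and @ddl are assumptions for illustration. ALTER TABLE does not accept a variable for the AUTO_INCREMENT value, hence the prepared statement.

```sql
CREATE TABLE mytable_new LIKE mytable;

-- Read the current counter and add a margin (100000 is an arbitrary choice).
SELECT AUTO_INCREMENT + 100000 INTO @next_id
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'mydatabase' AND TABLE_NAME = 'mytable';

-- Build and run the ALTER TABLE with the computed value.
SET @ddl = CONCAT('ALTER TABLE mytable_new AUTO_INCREMENT = ', @next_id);
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

-- Swap both names in one atomic statement.
RENAME TABLE mytable TO mytable_old, mytable_new TO mytable;
```

On MySQL 8.0 the INFORMATION_SCHEMA value may come from cached statistics (see the information_schema_stats_expiry setting), so it is worth checking that the number looks current before relying on it.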
Instead of renaming the master_table and recreating it, you could
just create a backup_table with the data from the master_table for the first backup run.
CREATE TABLE backup_table AS
SELECT * FROM master_table;
If you must add a primary key to the backup table then run this just once, that is for the first backup:
ALTER TABLE backup_table ADD CONSTRAINT pk_backup_table PRIMARY KEY(id);
For future backups do:
INSERT INTO backup_table
SELECT * FROM master_table;
Then you can delete from the master_table all the rows that are now in the backup_table, like this:
DELETE A FROM master_table A JOIN
backup_table B ON A.id=B.id;
Then you can add data to the master_table with this query:
INSERT INTO master_table (`value`) VALUES ('5000'); -- I assume the id field is auto_incrementable
I think this should work perfectly even without locking the master table, and with no missing executions.

How can I dynamically set the table name using SELECT INTO?

DROP TABLE Backup_LOAD_EMPLOYEE
SELECT * INTO dbo.Backup_LOAD_Employee FROM LOAD_Employee WHERE 1=1
TRUNCATE TABLE LOAD_Employee
I am bulk inserting employee data from an external source. In my SP, each time after the import, I truncate the LOAD_Employee table. Before truncating, I would like to take a backup of the table; the previous day's data should be truncated.
How can I give the backup table an auto-incremented name (in an SP)?
This doesn't answer your question directly (but you can use dynamic SQL), but a better solution is probably to put the backup date into a column, instead of creating one table per day. Then you can more easily query the archived data for multiple days, because it's all in one table. Something like this:
create table dbo.Backup_LOAD_Employee (
BackupDate date,
--- other columns
)
go
insert into dbo.Backup_LOAD_Employee (BackupDate, ...)
select cast(getdate() as date), ... -- other columns
from dbo.LOAD_Employee
truncate table dbo.LOAD_Employee
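If one table per run is still wanted, the dynamic SQL route mentioned above might look like the following T-SQL sketch. The date-suffix naming scheme and the variable names @table_name and @sql are assumptions, not something the question specifies.

```sql
-- Build a name such as Backup_LOAD_Employee_<yyyymmdd> (style 112 = yyyymmdd).
DECLARE @table_name sysname =
    N'Backup_LOAD_Employee_' + CONVERT(char(8), GETDATE(), 112);

-- QUOTENAME brackets the identifier so the concatenated name is safe to embed.
DECLARE @sql nvarchar(max) =
    N'SELECT * INTO dbo.' + QUOTENAME(@table_name) +
    N' FROM dbo.LOAD_Employee;';

EXEC sys.sp_executesql @sql;

TRUNCATE TABLE dbo.LOAD_Employee;
```

SELECT ... INTO fails if the target table already exists, so a second run on the same day would need a different suffix or a prior DROP.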

Which is faster for check-and-skip insert if the row exists in SQL / MySQL?

I have read many articles about this. I want to hear from you.
My problem is:
A table: ID(INT, Unique, Auto Increment), Title(varchar), Content(text), Keywords(varchar)
My PHP code always inserts new records, but must not accept duplicate records based on Title or Keywords. So the title or keywords can't be the primary key. My PHP code needs to check for existing records and insert 10-20 records at a time.
So, I check like this:
SELECT * FROM TABLE WHERE TITLE=XXX
And if it returns nothing, then I do the INSERT.
I read some other posts, and one person says:
INSERT IGNORE INTO Table values()
Another person suggests:
SELECT COUNT(ID) FROM TABLE
If it returns 0, then do the INSERT.
I don't know which of those queries is faster.
And I have one more question: what is the difference between these queries, and which is faster:
SELECT COUNT(ID) FROM ..
SELECT COUNT(0) FROM ...
SELECT COUNT(1) FROM ...
SELECT COUNT(*) FROM ...
All of them show me the total number of records in the table, but I don't know whether MySQL thinks the number 0 or 1 is my ID field. Even if I do SELECT COUNT(1000), I still get the total number of records in my table, while my table only has 4 columns.
I'm using MySQL Workbench; is there an option for testing speed in this app?
I would use the INSERT ... ON DUPLICATE KEY UPDATE command. One important comment from the documentation states that: "...if there is a single multiple-column unique index on the table, then the update uses (seems to use) all columns (of the unique index) in the update query."
So if there is a UNIQUE(Title,Keywords) constraint on the table in the example, then, you would use:
INSERT INTO table (Title,Content,Keywords) VALUES ('blah_title','blah_content','blah_keywords')
ON DUPLICATE KEY UPDATE Content='blah_content';
It should work, and it is a single query to the database.
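Since the question asks to reject a row when either Title or Keywords already exists, one possible setup is a separate unique index on each column, with INSERT IGNORE skipping the rows that collide. A sketch, assuming a hypothetical table name articles and varchar columns short enough to be indexed in full:

```sql
-- One unique index per column, so a clash on either one rejects the row.
ALTER TABLE articles
  ADD UNIQUE KEY uq_title (Title),
  ADD UNIQUE KEY uq_keywords (Keywords);

-- A batch insert; rows that violate either index are skipped with a warning.
INSERT IGNORE INTO articles (Title, Content, Keywords) VALUES
  ('title one', 'content one', 'kw one'),
  ('title two', 'content two', 'kw two');

-- Number of rows actually inserted by the previous statement.
SELECT ROW_COUNT();
```

Note that INSERT IGNORE also turns some other errors into warnings, so it is worth checking the warnings if rows go missing unexpectedly.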
SELECT COUNT(*) FROM ... is faster than SELECT COUNT(ID) FROM ..., or you can build something like this:
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=3;
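On the COUNT question: COUNT(expr) counts the rows where expr is not NULL. A constant such as 0, 1 or 1000 is never NULL, so it counts every row and is not treated as a column position. A small illustration, using a hypothetical table count_demo:

```sql
CREATE TABLE count_demo (
  id INT AUTO_INCREMENT PRIMARY KEY,
  keywords VARCHAR(50) NULL
);

INSERT INTO count_demo (keywords) VALUES ('a'), (NULL), ('b');

SELECT COUNT(*)        AS all_rows,       -- 3
       COUNT(1)        AS const_one,      -- 3
       COUNT(1000)     AS const_big,      -- 3
       COUNT(id)       AS non_null_ids,   -- 3 (a primary key is never NULL)
       COUNT(keywords) AS non_null_kw     -- 2 (the NULL row is not counted)
FROM count_demo;
```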

MySQL table 30 million records insert into another

I have a MySQL table with 30 million records.
It has about 20 columns, of which I will use 15 to insert into another table.
Now I can't use PHP to load this large dataset (selecting 30 million rows and loading into memory isn't feasible), what would be the best method of loading all these records? MySQL 5.X
I'm using EMS to connect to the database.
What about doing an INSERT INTO MySmallerTable SELECT Col1, col2, col3... FROM MyBiggerTable
It might be worth breaking it into multiple INSERT statements.
Like:
INSERT INTO ... SELECT ... WHERE ID between 1 and 100000;
INSERT INTO ... SELECT ... WHERE ID between 100001 and 200000;
etc.
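The ID ranges above can also be generated on the server instead of being typed by hand. A sketch as a stored procedure, assuming ID is a positive integer key; the procedure name copy_in_batches, the batch size and the three-column list are placeholders:

```sql
DELIMITER //
CREATE PROCEDURE copy_in_batches()
BEGIN
  DECLARE batch_size  BIGINT DEFAULT 100000;
  DECLARE batch_start BIGINT DEFAULT 0;
  DECLARE max_id      BIGINT;

  SELECT MAX(ID) INTO max_id FROM MyBiggerTable;

  -- If the table is empty, max_id is NULL and the loop body never runs.
  WHILE batch_start < max_id DO
    INSERT INTO MySmallerTable (Col1, Col2, Col3)
    SELECT Col1, Col2, Col3
    FROM MyBiggerTable
    WHERE ID > batch_start AND ID <= batch_start + batch_size;

    SET batch_start = batch_start + batch_size;
  END WHILE;
END //
DELIMITER ;

CALL copy_in_batches();
```

DELIMITER is a client command (the mysql command-line client and Workbench understand it), so other tools may need the procedure body submitted differently. With autocommit enabled, each batch commits on its own, which keeps individual transactions small.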
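You can do this:
INSERT INTO new_table (`col1`,`col2`,`col3`) SELECT `oldcol1`,`oldcol2`,`oldcol3`
FROM old_table LIMIT 0,100000
and repeat it in a PHP loop (changing the LIMIT start value each time). Without an ORDER BY, the row order is not guaranteed to be the same between runs, so add one on a unique column to avoid skipped or repeated rows.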
There are a few ways you can do this, including the one user M_M provided above. I have not used EMS and am not sure what it can and can't do, but I have used Workbench extensively.
A.
Create the new destination table
Create view on the source table with the columns of interest
Insert into the destination from that view with a simple INSERT INTO DESTINATION_TABLE SELECT * FROM SOURCE_VIEW (where SOURCE_VIEW is the view created in the previous step)
B.
Use mysqldump
Upload into new table
Alter table by dropping the columns you don't need

Deleting duplicates from a large table

I have quite a large table with 19 000 000 records, and I have a problem with duplicate rows. There are a lot of similar questions, even here on SO, but none of them seems to give me a satisfactory answer. Some points to consider:
Row uniqueness is determined by two columns, location_id and datetime.
I'd like to keep the execution time as fast as possible (< 1 hour).
Copying tables is not very feasible as the table is several gigabytes in size.
No need to worry about relations.
As said, each combination of location_id and datetime should occur only once, and I would like to remove all the duplicate instances. It does not matter which one of them survives, as the data is identical.
Any ideas?
I think you can use this query to delete the duplicate records from the table:
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)
Before doing this, test with some sample data first, and then try it.
Note: On version 5.5, it works on MyISAM but not InnoDB. The IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this does not work on later versions.
SELECT *, COUNT(*) AS Count
FROM table
GROUP BY location_id, datetime
HAVING Count > 1
UPDATE table SET datetime = null
WHERE location_id IN (
SELECT location_id
FROM table as tableBis
WHERE tableBis.location_id = table.location_id
AND table.datetime > tableBis.datetime)
SELECT * INTO tableCopyWithNoDuplicate FROM table WHERE datetime is not null
DROP TABLE table
RENAME TABLE tableCopyWithNoDuplicate TO table
So you keep the row with the lowest datetime. I'm not sure about performance; it depends on your table columns, your server, etc.
This query worked in every case I tried; tested on the MyISAM engine with 2 million rows.
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)
You can delete duplicates using these steps:
1- Export the following query's results into a txt file:
select dup_col from table1 group by dup_col having count(dup_col) > 1
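2- Prepend this to the contents of the txt file and run the final query:
delete from table1 where dup_col in (.....)
Please note that '.....' stands for the contents of the txt file created in the first step. Also note that this deletes every row whose dup_col value is duplicated, including the copy you may want to keep.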
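A different in-place technique that keeps one row per pair is a self-join DELETE. The sketch below is an alternative to the answers above, not a restatement of them, and it rests on assumptions the question does not confirm: a hypothetical table name readings and a unique integer column id. On 19 million rows the run time is hard to predict, so it is best tried on a copy first.

```sql
-- Helps the join find matching pairs; not unique yet, duplicates still exist.
ALTER TABLE readings ADD INDEX idx_loc_dt (location_id, `datetime`);

-- Delete every row that has a twin with a smaller id, so the lowest id survives.
DELETE r1
FROM readings r1
JOIN readings r2
  ON  r1.location_id = r2.location_id
  AND r1.`datetime`  = r2.`datetime`
  AND r1.id > r2.id;

-- Once the duplicates are gone, enforce uniqueness going forward.
ALTER TABLE readings
  DROP INDEX idx_loc_dt,
  ADD UNIQUE KEY uq_loc_dt (location_id, `datetime`);
```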