MySQL/ASP - Delete Duplicate Rows
I have a table with 100,000 rows called 'photoSearch'. When transferring the data from other tables (that took bloody ages and I was bloody tired), I accidentally forgot to remove the test transfer I did, which left 3500 rows in the table before I transferred everything over in one go.
The ID column is 'photoID' (INT) and I need to remove all duplicates that have a photoID of less than 6849. If I could just remove the duplicates, it would be less painful than to delete the table and start another transfer.
Has anybody got any suggestions on the most practical and safest way to do this?
UPDATE:
I actually answered my own question. I backed up my table for safety, and then I ran this:
ALTER IGNORE TABLE photoSearch ADD UNIQUE INDEX unique_id_index (photoID);
This removed all 3500 duplicates in under a minute :) (Note that ALTER IGNORE TABLE was removed in MySQL 5.7.4, so this only works on older versions.)
Traditional method
Back up your existing table photoSearch to something like tmp_photoSearch using:
create table tmp_photoSearch select * from photoSearch;
After that, you can clean up the data in tmp_photoSearch.
Once the results are as expected, swap the tables:
rename table photoSearch to photoSearch_backup, tmp_photoSearch to photoSearch;
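The copy, clean up and swap steps above can be sketched end to end. This is only a rough illustration, not the exact MySQL procedure: it uses Python's sqlite3 module as a stand-in so it runs anywhere, SQLite's ALTER TABLE ... RENAME TO replaces MySQL's RENAME TABLE, the helper name dedupe_by_copy_and_swap is made up, and it assumes the duplicates are exact copies of whole rows.

```python
import sqlite3

def dedupe_by_copy_and_swap(conn):
    """Copy distinct rows into a work table, then swap it in by renaming.

    Assumption: duplicate rows are identical in every column, so
    SELECT DISTINCT is enough to clean the copy.
    """
    cur = conn.cursor()
    # Work on a copy so the original stays untouched until the swap.
    cur.execute(
        "CREATE TABLE tmp_photoSearch AS SELECT DISTINCT * FROM photoSearch")
    # Swap: keep the original as a backup and promote the cleaned copy.
    # (MySQL can do both renames in one atomic RENAME TABLE statement.)
    cur.execute("ALTER TABLE photoSearch RENAME TO photoSearch_backup")
    cur.execute("ALTER TABLE tmp_photoSearch RENAME TO photoSearch")
    conn.commit()
```

Note that a table created from a SELECT does not carry over indexes, so those have to be re-created on the new table.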
To increase insert speed (if the bottleneck is not network transfer), see:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
To increase performance for MyISAM tables, for both LOAD DATA INFILE and INSERT, enlarge the key cache by increasing the key_buffer_size system variable.
Related
This is taking many hours on a table with over 4.6 million records.
Is there a way to speed this up?
UPDATE tableA
SET SKU = CONCAT("X-", tableA.supplier_SKU);
There is no index on any column yet.
EXPLAIN indicates rows=4.6 million, filtered = 100% !
If there are one or more indexes on SKU, dropping them, running the update and then recreating them might help.
Can you lock the table first (to ensure no other user is blocking your operation)?
lock tables tableA write;
Can you create another table, update there and then rename?
https://dev.mysql.com/doc/refman/5.7/en/rename-table.html
Note: the link above describes how to swap two tables in one statement.
4.6M records doesn't sound like something that should take hours, unless you can't lock the table because other users keep updating it.
Please provide SHOW CREATE TABLE tableA.
The slow part is that InnoDB has to save the 4.6 million "old" rows (in case of a rollback) before it gets to the "commit".
Do not ever use LOCK TABLES with InnoDB.
You could break the task into chunks so that it blocks other actions less. (But the total time will probably be longer.) See this for 'chunking': http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks
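For what it's worth, here is a minimal sketch of the chunking idea, walking the primary key in fixed-size ranges and committing after each one. It is not taken from the linked page. It assumes tableA has an integer primary key called id (the question never shows the schema), the function name is made up, and it runs against Python's sqlite3 so it can be tried without a MySQL server. SQLite concatenates with || where MySQL uses CONCAT.

```python
import sqlite3

def update_sku_in_chunks(conn, chunk_size=1000):
    """Prefix SKU range by range over the (assumed) integer primary key id.

    Each range is committed separately, so locks are held only briefly
    and only one chunk of old row versions is kept at a time.
    Returns the number of chunks processed.
    """
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM tableA").fetchone()
    if lo is None:  # empty table, nothing to do
        return 0
    chunks = 0
    while lo <= hi:
        conn.execute(
            "UPDATE tableA SET SKU = 'X-' || supplier_SKU "
            "WHERE id >= ? AND id < ?",
            (lo, lo + chunk_size))
        conn.commit()
        lo += chunk_size
        chunks += 1
    return chunks
```

Walking key ranges instead of using OFFSET means each chunk is located by an index lookup, so later chunks cost the same as early ones.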
I want to delete old records from 10 related tables every 6 months using primary keys and foreign keys. I am planning to do it in a single transaction block, because in case of any failure I have to roll back the changes. My queries will be something like this:
DELETE FROM PARENT_TABLE WHERE PARENT_ID IN (1, 2, 3,etc);
DELETE FROM CHILD_TABLE1 WHERE PARENT_ID IN (1, 2, 3,etc);
The records to delete will number around 1 million. Is it safe to delete all of these in a single transaction? How will it perform?
Edit
To be more clear on my question, I will detail my execution plan.
I first retrieve the primary keys of all the parent-table records that have to be deleted and store them in a temporary table.
START TRANSACTION;
DELETE FROM CHILD_ONE WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
DELETE FROM CHILD_TWO WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
DELETE FROM PARENT_TABLE WHERE PARENT_ID IN (SELECT * FROM TEMP_ID_TABLE);
COMMIT;
ROLLBACK on any failure.
Given that I can have around a million records to delete from all these tables, is it safe to put everything inside a single transaction block?
You can probably succeed. But it is not wise. Something random (eg, a network glitch) could come along to cause that huge transaction to abort. You might be blocking other activity for a long time. Etc.
Are the "old" records everything older than date X? If so, it would be much more efficient to make use of PARTITIONing for DROPping old rows. We can discuss the details. Oops, you have FOREIGN KEYs, which are incompatible with PARTITIONing. Do all the tables have FKs?
Why do you wait 6 months before doing the delete? 6K rows a day would have the same effect and be much less invasive and risky.
IN ( SELECT ... ) has terrible performance; use a JOIN instead.
If some of the tables are just normalizations, why bother deleting from them?
Would it work to delete 100 ids per transaction? That would be much safer and less invasive.
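A rough sketch of the "100 ids per transaction" idea follows. The table and column names come from the question. The function name, the default batch size and the use of Python's sqlite3 as a stand-in for MySQL are assumptions made for illustration. Children are deleted before the parent within each batch, and a failure rolls back only the current batch.

```python
import sqlite3

def delete_in_batches(conn, ids, batch_size=100):
    """Delete the given parent ids from children and parent, one small
    transaction per batch. Returns the number of committed batches."""
    ids = list(ids)
    batches = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        marks = ",".join("?" * len(batch))  # one placeholder per id
        try:
            # Children first, so foreign keys never point at a missing parent.
            for table in ("CHILD_ONE", "CHILD_TWO", "PARENT_TABLE"):
                conn.execute(
                    f"DELETE FROM {table} WHERE PARENT_ID IN ({marks})",
                    batch)
            conn.commit()
        except sqlite3.Error:
            conn.rollback()  # only this batch is undone
            raise
        batches += 1
    return batches
```

The trade-off is that an interrupted run leaves earlier batches deleted, so the job has to be safe to restart from the remaining ids.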
First of all: create a proper backup AND test it before you start to delete the records.
The number of records you can delete at once depends mostly on the configuration (hardware) of your database server. You have to test how many records can be deleted on that specific server without problems. Start with e.g. 1000 records, then increase the amount in each iteration until it becomes too slow. If you have replication, the setup and the slave's performance affect the row count too (too many write requests can cause serious replication delay).
A piece of advice: if possible, remove all foreign keys and indexes (except the primary key and the indexes used by the WHERE clauses of your deletes) before you start the delete.
Edit:
If the number of records to be deleted is larger than the number to be kept, consider copying the records you want to keep into a new table, then renaming the old and new tables. First copy the table structure using the CREATE TABLE .. LIKE statement, then drop all unnecessary indexes and constraints, copy the records, add the indexes back, and rename the tables. (Copy the latest new records from the original table into the copy if necessary.) Then you can drop the old table.
I believe you should first copy the data to another database, then use a single transaction to delete from all 10 tables, which is safe to roll back immediately.
Delete the data from the live database at a time when user activity is low.
I have a MySQL database with approx. 1 TB of data. Table fuelinjection_stroke has approx. 1.000.000.000 rows. DBID is the primary key, which is automatically incremented by one with each insert.
I am trying to delete the first 1.000.000 rows using a very simple statement:
Delete from fuelinjection_stroke where DBID < 1000000;
This query is taking very long (>24h) on my dedicated 8-core Xeon server (32 GB memory, SAS storage).
Any idea whether the process can be sped up?
I believe that your table becomes locked. I've faced the same problem and found that I could delete 10k records pretty fast. So you might want to write a simple script/program that deletes the records in chunks.
DELETE FROM fuelinjection_stroke WHERE DBID < 1000000 LIMIT 10000;
And keep executing it until it has deleted everything.
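The "simple script" could look roughly like this. It is a sketch under assumptions: it runs on Python's sqlite3 rather than MySQL, the function name is made up, and because stock SQLite has no DELETE ... LIMIT, the limit sits in a subquery here. Against MySQL the DELETE ... LIMIT statement above can be issued directly in the same loop.

```python
import sqlite3

def delete_in_chunks(conn, cutoff, chunk_size=10000):
    """Repeatedly delete up to chunk_size rows with DBID < cutoff,
    committing after each chunk, until nothing is left to delete.
    Returns the total number of rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM fuelinjection_stroke WHERE DBID IN ("
            "  SELECT DBID FROM fuelinjection_stroke"
            "  WHERE DBID < ? ORDER BY DBID LIMIT ?)",
            (cutoff, chunk_size))
        conn.commit()  # release locks between chunks
        if cur.rowcount == 0:  # no matching rows remain
            break
        total += cur.rowcount
    return total
```

Committing after every chunk keeps each transaction small, at the cost of not being able to roll back the whole delete at once.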
Are you space deprived? Is down time impossible?
If not, you could fit in a new INT column length 1 and default it to 1 for "active" (or whatever your terminology is) and 0 for "inactive". Actually, you could use 0 through 9 as 10 different states if necessary.
Adding this new column will take a looooooooong time, but once it's over, your UPDATEs should be lightning fast as long as you do it off the PRIMARY (as you do with your DELETE) and you don't index this new column.
The reason InnoDB takes so long to DELETE on such a massive table as yours is the clustered index. It physically orders your table by your PRIMARY key (or the first UNIQUE NOT NULL index it finds, or a hidden row ID if it can't find either), so when you pull out rows, it has to reorganize the affected index pages (and every secondary index) on disk. So it's not just the DELETE itself that's taking so long. It's the physical reorganization after the rows are removed.
When you create a new INT column with a default value, the space will be filled, so when you UPDATE it, there's no need for physical reordering across your huge table.
I'm not sure exactly what your schema is exactly, but using a column for a row's state is much faster than DELETEing; however, it will take more space.
Try setting values:
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT (for non-windows machine)
innodb_buffer_pool_size=25GB (currently it is close to 21GB)
innodb_doublewrite=0
innodb_support_xa=0
innodb_thread_concurrency=0...1000 (try different values, beginning with 200)
References:
MySQL docs for description of different variables.
MySQL Server Setting Tuning
MySQL Performance Optimization basics
http://bugs.mysql.com/bug.php?id=28382
What indexes do you have?
I think your issue is that the delete is updating the indexes on every iteration.
I'd drop the indexes, if any, do the delete, then re-add the indexes. It'll be far faster (I think).
I was having the same problem, and my table has several indices that I didn't want to have to drop and recreate. So I did the following:
create table keepers
select * from origTable where {clause to retrieve rows to preserve};
truncate table origTable;
insert into origTable select null,keepers.col2,...keepers.col(last) from keepers;
drop table keepers;
About 2.2 million rows were processed in about 3 minutes.
Your database may be checking for records that need to be modified through a foreign key (cascades, delete).
But I-Conica's answer makes a good point (+1). Deleting a single record and updating a lot of indexes, repeated 100000 times, is inefficient. Just drop the indexes, delete all the records and create the indexes again.
And of course, check whether there is any kind of lock in the database. One user or application can lock a record or table, and your query will wait until the user releases the resource or it reaches a timeout. One way to check whether your database is doing real work or just waiting is to launch the query from a connection that sets the innodb_lock_wait_timeout variable to a few seconds. If it fails, at least you know that the query is OK and that you need to find and release that lock. Examples of locks are SELECT * FROM XXX FOR UPDATE and uncommitted transactions.
For such long tables, I'd rather use MyISAM, especially if not a lot of transactions are needed.
I don't know the exact answer to your question, but here is another way to delete those rows. Please try this:
-- MySQL has no TOP; ORDER BY ... LIMIT on the DELETE selects the same rows
delete from fuelinjection_stroke
order by DBID asc
limit 1000000;
I have a table called research_words which has some hundred million rows.
Every day I have tens of millions of new rows to add. About 5% of them are totally new rows, and 95% are updates which have to add to some columns in an existing row. I don't know which is which, so I use:
INSERT INTO research_words
(word1,word2,origyear,cat,numbooks,numpages,numwords)
VALUES
(34272,268706,1914,1,1,1,1)
ON DUPLICATE KEY UPDATE
numbooks=numbooks+1,numpages=numpages+1,numwords=numwords+1
This is an InnoDB table where the primary key is over word1,word2,origyear,cat.
The issue I'm having is that I have to insert the new rows each day and it's taking longer than 24 hours to insert each day's rows! Obviously I can't have it taking longer than a day to insert the rows for the day. I have to find a way to make the inserts faster.
For other tables I've had great success with ALTER TABLE ... DISABLE KEYS; and LOAD DATA INFILE, which allows me to add billions of rows in less than an hour. That would be great, except that unfortunately I am incrementing columns in this table. I doubt disabling the keys would help either, because surely it will need them to check whether the row exists in order to add it.
My scripts are in PHP but when I add the rows I do so by an exec call directly to MySQL and pass it a text file of commands, instead of sending them with PHP, since it's faster this way.
Any ideas to fix the speed issue here?
Old question, but perhaps worth an answer all the same.
Part of the issue stems from the large number of inserts being run essentially one at a time, with a unique index update after each one.
In these instances, a better technique might be to select n rows to insert and put them in a temp table, left join them to the destination table, calculate their new values (in OP's situation IFNULL(dest.numpages+1,1) etc.) and then run two further commands - an insert where the insert fields are 1 and an update where they're greater. The updates don't require an index refresh, so they run much faster; the inserts don't require the same ON DUPLICATE KEY logic.
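Here is one way the staging-table approach might look, as a sketch rather than a drop-in replacement. It deviates slightly from the description above: duplicate keys inside a batch are first collapsed into a hit count, existing rows are then updated, and only afterwards are the missing rows inserted via a LEFT JOIN. The staging table, its hits column and the function name are made up for illustration, and Python's sqlite3 stands in for MySQL so the example is runnable.

```python
import sqlite3
from collections import Counter

# Join condition between the staging rows (alias s) and the destination.
KEY_MATCH = ("s.word1 = research_words.word1 AND s.word2 = research_words.word2 "
             "AND s.origyear = research_words.origyear AND s.cat = research_words.cat")

def merge_batch(conn, keys):
    """keys: iterable of (word1, word2, origyear, cat) tuples, one per hit."""
    hits = Counter(keys)  # collapse repeated keys within the batch
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging "
                "(word1 INT, word2 INT, origyear INT, cat INT, hits INT)")
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?, ?, ?)",
                    [key + (n,) for key, n in hits.items()])
    # Step 1: bump the counters of rows that already exist.
    cur.execute(f"""
        UPDATE research_words
        SET (numbooks, numpages, numwords) = (
            SELECT research_words.numbooks + s.hits,
                   research_words.numpages + s.hits,
                   research_words.numwords + s.hits
            FROM staging s WHERE {KEY_MATCH})
        WHERE EXISTS (SELECT 1 FROM staging s WHERE {KEY_MATCH})""")
    # Step 2: insert the keys that found no match in the destination.
    cur.execute("""
        INSERT INTO research_words
            (word1, word2, origyear, cat, numbooks, numpages, numwords)
        SELECT s.word1, s.word2, s.origyear, s.cat, s.hits, s.hits, s.hits
        FROM staging s
        LEFT JOIN research_words d
          ON d.word1 = s.word1 AND d.word2 = s.word2
         AND d.origyear = s.origyear AND d.cat = s.cat
        WHERE d.word1 IS NULL""")
    conn.commit()
```

Running the update before the insert matters: in the other order, the freshly inserted rows would match the staging table and be incremented a second time.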
I have a table with 10 million records. What is the fastest way to delete the old rows and retain only the last 30 days?
I know this can be done with the event scheduler, but my worry is that if it takes too long, it might lock the table for a long time.
It will be great if you can suggest some optimum way.
Thanks.
Offhand, I would:
Rename the table
Create an empty table with the same name as your original table
Grab the last 30 days from your "temp" table and insert them back into the new table
Drop the temp table
This will enable you to keep the table live through (almost) the entire process and get the past 30 days worth of data at your leisure.
You could try partition tables.
PARTITION BY LIST (TO_DAYS( date_field ))
This would give you 1 partition per day, and when you need to prune data you just:
ALTER TABLE tbl_name DROP PARTITION p#
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Not that it helps you with your current problem, but if this is a regular occurrence, you might want to look into a merge table: just add tables for different periods in time, and remove them from the merge table definition when they are no longer needed. Another option is partitioning, where it is equally trivial to drop the oldest partition.
To expand on Michael Todd's answer.
If you have the space,
Create a blank staging table similar to the table you want to reduce in size
Fill the staging table with only the records you want to have in your destination table
Do a double rename like the following
Assuming:
table is the table name of the table you want to purge a large amount of data from
newtable is the staging table name
no other tables are called temptable
rename table table to temptable, newtable to table;
drop table temptable;
The rename is done as a single atomic operation, which requires only a momentary schema lock. Most high concurrency applications won't notice the change.
Alternatively, if you don't have the space, and you have a long window to purge this data, you can use dynamic SQL to insert the primary keys into a temp table, and join the temp table in a delete statement. When you insert into the temp table, be aware of what max_allowed_packet is. Most installations of MySQL use 16MB (16777216 bytes). Your insert command for the temp table should be under max_allowed_packet. This will not lock the table. You'll want to run OPTIMIZE TABLE to reclaim space for the rest of the engine to use. You probably won't be able to reclaim disk space unless you shut down the engine and move the data files.
Shut down your resource,
SELECT .. INTO OUTFILE, parse the output, delete the table, then LOAD DATA LOCAL INFILE optimized_db.txt. It is cheaper to re-create the table than to UPDATE it.