mysql: removing duplicates while avoiding client timeout

Issue: hundreds of tables with identical schemas. Some of them contain duplicated data that needs to be removed. My usual strategy for this is:
walk the list of tables and, for each one, do the following (sketched in SQL below):
create temp table with unique key on all fields
insert ignore select * from old table
truncate original table
insert select * back into original table
drop or clean temp table
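In SQL, that per-table routine looks roughly like this (a sketch only; my_table, the column list, and the key name are placeholders for whatever the real schema uses):
-- build a deduplicated copy via a unique key over all fields
CREATE TEMPORARY TABLE tmp_dedupe LIKE my_table;
ALTER TABLE tmp_dedupe ADD UNIQUE KEY uq_all_cols (col1, col2, col3);
INSERT IGNORE INTO tmp_dedupe SELECT * FROM my_table;
-- swap the deduplicated rows back into the original table
TRUNCATE TABLE my_table;
INSERT INTO my_table SELECT * FROM tmp_dedupe;
DROP TEMPORARY TABLE tmp_dedupe;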
For smaller tables this works fine. Unfortunately, the tables I'm cleaning often have hundreds of millions of records, so my jobs and client connections time out while I'm running this. (Since there are hundreds of these tables, I'm using Perl to walk the list and clean each one; this is where the timeout happens.)
Some options I'm looking into:
mysqldump - fast but I don't see how to do the subsequent 'insert ignore' step
into outfile / load infile - also fast but I'm running from a remote host and 'into outfile' creates all the files on the mysql server. Hard to clean up.
do the insert/select in blocks of 100K records - this avoids the db timeout but it's pretty slow.
I'm sure there is a better way. Suggestions?

If an SQL query to find the duplicates can complete without timing out, I think you should be able to do a SELECT with COUNT() and a GROUP BY, plus a HAVING clause that restricts the output to only the rows with duplicate data (HAVING COUNT(DUPEDATA) > 1). The results of that SELECT can be placed INTO a temporary table, which can then be joined with the primary table for the DELETE query.
This approach uses the set-operations strengths of SQL/MySQL -- no need for Perl coding.
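A minimal sketch of that approach, assuming a hypothetical table my_table with a primary key id and a duplicated column dupedata, and keeping the lowest id in each duplicate group:
-- collect the duplicate groups and the one row to keep from each
CREATE TEMPORARY TABLE dupes AS
SELECT dupedata, MIN(id) AS keep_id
FROM my_table
GROUP BY dupedata
HAVING COUNT(*) > 1;

-- delete every duplicate row except the kept one
DELETE my_table
FROM my_table
JOIN dupes
  ON my_table.dupedata = dupes.dupedata
 AND my_table.id <> dupes.keep_id;

DROP TEMPORARY TABLE dupes;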

Related

Inserting 1 million records in mysql

I have two tables and in both tables I get 1 million records. I am using a cron job every night to insert the records. In the first table I truncate the table first and then insert the records, and in the second table I update and insert records according to the primary key. I am using MySQL as my database. My problem is that I need to do this task each day but I am unable to insert all the data, so what can be a possible solution for this problem?
The important thing is to switch off all the actions and checks MySQL wants to perform while inserting data, like autocommit, indexing, etc.
https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html
Because if you do not do this, MySQL does a lot of work after every record added, and that work adds up as the process proceeds, resulting in very slow processing and importing towards the end; the job may not complete in one day.
If you must use MySQL: for the first table, disable the indexes, do the inserts, then re-enable the indexes. This will work faster.
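A minimal sketch of that session setup, following the InnoDB bulk-loading guide linked above (first_table is a placeholder; note that ALTER TABLE ... DISABLE KEYS only affects non-unique indexes on MyISAM tables, so for InnoDB these session settings are the usual route):
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

TRUNCATE TABLE first_table;
-- ... bulk INSERTs or LOAD DATA into first_table go here ...

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;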
Alternatively, MongoDB will be faster, and Redis is very fast.

What's MySQL's fastest way to delete a database's content?

What is the fastest way in MySQL to delete a database's content on FreeBSD? Please help.
I've tried deleting from Navicat (400,000+ rows),
but in an hour just 100,000 were deleted.
I don't have phpMyAdmin.
To delete everything in a table:
TRUNCATE TABLE table_you_want_to_nuke
To delete certain rows, you have two options:
Follow these steps:
Create a temporary table using CREATE TABLE the_temp_table LIKE current_table
Drop all of the indexes on the temp table.
Copy the records you want to keep with INSERT INTO the_temp_table SELECT * FROM current_table WHERE ...
TRUNCATE TABLE current_table
INSERT INTO current_table SELECT * FROM the_temp_table
To speed this option up, you may want to drop all indexes from current_table before the final INSERT INTO, then recreate them after the INSERT. MySQL is much faster at indexing existing data than it is at indexing on the fly.
The option you're currently trying: DELETE FROM your_table WHERE whatever_condition. You probably need to break this into chunks using the WHERE condition or LIMIT, so you can batch it and not bog down the server forever (see the sketch below).
Which is better/faster depends on lots of things, mostly the ratio of deleted records to retained records and the number of indexes involved. As always, test this carefully before doing it on a live database, as both DELETE and TRUNCATE will permanently destroy data.
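A minimal sketch of the chunked-DELETE option; the table name, condition, and batch size are placeholders:
DELETE FROM your_table
WHERE whatever_condition
LIMIT 10000;
-- rerun this statement (e.g. from a small driver script) until ROW_COUNT() reports 0 rows affected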

mysql - best way to avoid duplicates

I am very new to the database game so forgive my ignorance.
I am loading millions of rows into a simply structured MySQL database table
SQLStr = "LOAD DATA LOCAL INFILE 'f:/Smallscale/02 MMToTxt/flat.txt'
INTO TABLE `GMLObjects` FIELDS TERMINATED BY ','
LINES STARTING BY 'XXX';"
At the moment the table has been set to disallow duplicates on one field.
However, I am wondering if it would be quicker to remove the no duplicates rule and deal with the problem of duplicates later, either by using ALTER TABLE or SELECT DISTINCT, or some such query.
What are your thoughts?
P.S. the database engine is InnoDB
You cannot "alter" a table with duplicates into a table without them.
MySQL cannot know which of the rows it should remove. It would mean a lot of work and trouble to delete them later, and it would produce deleted entries without any benefit.
So avoid duplicates as early as possible.
Why load duplicates in your DB in the first place?
Avoid them as early as possible. That is better for performance and you don't have to write complicated queries.
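One way to follow that advice with the statement from the question is to keep the unique key and add the IGNORE keyword to LOAD DATA, so rows that would violate the key are skipped instead of aborting the load (a sketch; the file path and table come from the question above):
LOAD DATA LOCAL INFILE 'f:/Smallscale/02 MMToTxt/flat.txt'
IGNORE INTO TABLE `GMLObjects`
FIELDS TERMINATED BY ','
LINES STARTING BY 'XXX';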

Remove over 100,000 rows from mysql table - server crashes

I have a question: when I try to remove over 100,000 rows from a MySQL table, the server freezes and none of its websites can be accessed anymore!
I waited 2 hours and then restarted the server and restored the account.
I used following query:
DELETE FROM `pligg_links` WHERE `link_id` > 10000
whereas
SELECT * FROM `pligg_links` WHERE `link_id` > 10000
works perfectly.
Is there a better way to do this?
You could delete the rows in smaller sets. A quick script that deletes 1000 rows at a time should see you through.
"Delete from" can be very expensive for large data sets.
I recommend using partitioning.
This may be done slightly differently in PostgreSQL and MySQL, but in PostgreSQL you can create many tables that are "partitions" of the larger table. Queries can be run against the larger table or against a single partition, and this can greatly increase query speed provided you partition correctly. Also, you can delete a partition simply by dropping it. This happens very quickly because it is roughly equivalent to dropping a table.
Documentation for table partitioning can be found here:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html
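MySQL has its own RANGE/LIST/HASH partitioning; a rough sketch of the same idea for the table in the question (the column list is hypothetical, and MySQL requires every unique key, including the primary key, to include the partitioning column):
CREATE TABLE pligg_links_part (
  link_id INT NOT NULL,
  link_url VARCHAR(255),          -- hypothetical payload column
  PRIMARY KEY (link_id)
)
PARTITION BY RANGE (link_id) (
  PARTITION p0   VALUES LESS THAN (10000),
  PARTITION p1   VALUES LESS THAN (20000),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

-- dropping a partition discards its rows almost instantly
ALTER TABLE pligg_links_part DROP PARTITION p0;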
Make sure you have an index on the link_id column,
and try to delete in chunks of about 10,000 rows at a time.
Deleting from a table is a very costly operation.
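For the query in the question, that chunked approach is roughly the following (the batch size is arbitrary; rerun until no rows are affected):
DELETE FROM `pligg_links`
WHERE `link_id` > 10000
ORDER BY `link_id`        -- walks the link_id index so each batch stays cheap
LIMIT 10000;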

Running optimize on table copy?

I have an InnoDB table in MySQL which used to contain about 600k rows. After deleting 400k+ rows, my guess is that I need to run an OPTIMIZE.
However, since the table will be locked during this operation, the site will not be usable at that time. So, my question is: should I run the OPTIMIZE on the live database table (with a little under 200k rows)? Or is it possible to create a copy of that table, run the OPTIMIZE on that copy, and then rename both tables so that the copy becomes the live table?
If you create a copy, then it should be optimised already if you build it with CREATE TABLE ... AS SELECT ...; there is no need to run OPTIMIZE separately.
However, I'd consider copying the 200k rows you want to keep into a new table, then renaming the tables.
That way is fewer steps and less work all round.
CREATE TABLE MyTableCopy AS
SELECT *
FROM myTable
WHERE (insert Keep condition here);
RENAME TABLE
myTable TO myTable_DeleteMelater,
MyTableCopy TO myTable;