mysql - best way to avoid duplicates

I am very new to the database game so forgive my ignorance.
I am loading millions of rows into a simply structured MySQL database table:
SQLStr = "LOAD DATA LOCAL INFILE 'f:/Smallscale/02 MMToTxt/flat.txt'
INTO TABLE `GMLObjects` FIELDS TERMINATED BY ','
LINES STARTING BY 'XXX';"
At the moment the table has a no-duplicates (unique) constraint on one field.
However, I am wondering if it would be quicker to remove the no duplicates rule and deal with the problem of duplicates later, either by using ALTER TABLE or SELECT DISTINCT, or some such query.
What are your thoughts?
P.S the database engine is InnoDB

You cannot ALTER a table that contains duplicates into one without them.
MySQL cannot know which of the rows it should remove. Deleting them later would mean a lot of work and trouble, and it would produce deleted entries without any benefit.
So avoid duplicates as early as possible.
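A hedged sketch of doing that at load time: since the table already has a unique key on one field, adding the IGNORE keyword makes LOAD DATA skip rows that would violate that key instead of aborting (path and table name taken from the question):

LOAD DATA LOCAL INFILE 'f:/Smallscale/02 MMToTxt/flat.txt'
IGNORE INTO TABLE `GMLObjects`
FIELDS TERMINATED BY ','
LINES STARTING BY 'XXX';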

Why load duplicates into your DB in the first place?
Avoid them as early as possible. That is better for performance and you don't have to write complicated queries.

Related

Manage querying from a huge table which takes a lot of time to alter/update

I have a very huge table, "table1", which I query continuously all day (24x7).
At the end of the day, say at 12 AM, I run a query that alters "table1" at row level. This activity takes around 3-4 hours until my updated "table1" is finished.
But during that time I still want to be able to query "table1".
So I decided to create two tables: "table1_active" and "table1_passive".
Normally during the day I will query "table1_passive", and after I have updated "table1_active" I should switch my querying from "table1_passive"
to "table1_active".
This switching should be done every day, so that my all-day querying is not hampered.
I don't know whether there is a better way, maybe setting a trigger. Can anyone suggest a method to do this?
In my experience, the use of a secondary table like table1_passive is risky. You don't know exactly (as I understand it) when the update process finishes, so you won't know when you should switch querying between table1_passive and table1_active.
There are several ways to improve the update process on your table, but keep in mind that these are temporary solutions if table1 grows constantly:
Use MyISAM as the storage engine. There is a very good article about improving updates on a MyISAM table.
If you are updating table1 based on a WHERE clause, you might use indexes to help the database engine find which records have to be updated (see the sketch after this list).
Consider using partitions to work with your table faster.
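As a rough illustration of the indexing point above, assuming a hypothetical updated_at column that appears in the WHERE clause of the nightly update (column, index, and flag names are placeholders):

ALTER TABLE table1 ADD INDEX idx_updated_at (updated_at);

-- the nightly update can then locate the affected rows through the index
UPDATE table1
SET some_flag = 1
WHERE updated_at >= '2013-01-01';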
If you still have those two tables, you can:
Create a unique index on table1_active and use ON DUPLICATE KEY UPDATE (see the sketch after this list).
Update table1_passive.
Use bulk inserts into table1_active to speed up the process; the database will make sure there are no duplicate rows based on your criteria.
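A minimal sketch of that idea, assuming a hypothetical id column as the unique key and a value column to refresh:

ALTER TABLE table1_active ADD UNIQUE INDEX uq_id (id);

INSERT INTO table1_active (id, value)
VALUES (1, 'a'), (2, 'b'), (3, 'c')
ON DUPLICATE KEY UPDATE value = VALUES(value);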
BUT, if you are querying all day and the table grows constantly, I suggest using NoSQL, because the problem will remain even if you optimize the update process now.
If we knew the table structure and the update query you are using, maybe we could help you better ;)
Regards and good luck!

Is there an equivalent of EXPLAIN that will work in front of an ALTER TABLE query?

It looks like the MySQL EXPLAIN prefix only works in front of certain queries. Is there an equivalent of EXPLAIN that will work in front of an ALTER TABLE query?
I would love to be able to find out how long my planned ALTER TABLE statement is likely to take.
Background: I have a table from someone else that contains 300 columns of data. I know that I'm only going to need to use a few of those columns, and in order to figure out which columns I need, I'm planning to do a full-text search for a few key words. But in order to do that, I need to add a full-text index. And since I'm new to this size of data set, I'm not entirely sure that this is a realistic plan. I'm hoping something like EXPLAIN (or, more likely, a substitute tool from this thread) might help determine that.
EDIT: In answer to a couple of questions below, I should mention that this table has about 4 million rows and is on a local testing machine, so I can just run this thing blindly if needed. I'd just prefer not to, if possible. Thanks for all the good information so far.
Most "Alter table" will trigger the copy to tmp table operation, which it will create temp table with new schema, then lock table, copy data from old table to new table, then rename, drop old table.
So most time consumed is copy to temp table, it's depend on how big of that table if the server have enough memory. Use show table status to check how big of the table (data_length+ index_length), sample on other table to know the transfer speed on your mysql server, then you can estimate how long it will take.
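For example, a rough size check (database and table names are placeholders):

SHOW TABLE STATUS LIKE 'your_table';

-- or via information_schema, with the size expressed in MB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'your_db'
  AND table_name = 'your_table';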
Another way is mentioned in the MySQL docs about EXPLAIN on DML statements, but I didn't get a result; maybe it isn't finished yet:
http://dev.mysql.com/doc/refman/5.6/en/explain.html
As of MySQL 5.6.3, permitted explainable statements for EXPLAIN are SELECT, DELETE, INSERT, REPLACE, and UPDATE. Before MySQL 5.6.3, SELECT is the only explainable statement.

MySQL how to ignore indexing while inserting rows

I have a table in my MySQL database with around 5M rows. Inserting rows into the table is too slow because MySQL updates the indexes while inserting. How can I stop index updating while inserting and do the indexing separately later?
Thanks
Kamrul
Sounds like your table might be over-indexed. Maybe post your table definition here so we can have a look.
You have two choices:
Keep current indexes and remove unused ones. If you have 3 indexes on a table, every single write to the table results in 3 writes to the indexes. An index is only helpful during reads, so you might want to remove unused indexes. During a load, indexes will be updated, which will slow down your load.
Drop your indexes before the load, then recreate them after the load. You can drop your indexes before the data load, then insert and rebuild. The rebuild might take longer than the slow inserts, and you will have to rebuild all indexes one by one. Also, rebuilding a unique index can fail if duplicates were loaded while the index was absent.
Now, I suggest you take a good look at the indexes on the table and reduce them if they are not used in any queries. Then try both approaches and see what works for you. There is no way I know of in MySQL to disable indexes, as they need the inserted values to be written to their internal structures.
Another thing you might want to try is to split the I/O over multiple drives, i.e. partition your table over several drives to get some hardware performance in place.
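A minimal sketch of the second choice (drop, load, rebuild), with a hypothetical my_table and an index on col_a as placeholders:

-- drop the secondary index before the bulk load
ALTER TABLE my_table DROP INDEX idx_col_a;

-- ... perform the bulk inserts / LOAD DATA here ...

-- rebuild the index once the load has finished
ALTER TABLE my_table ADD INDEX idx_col_a (col_a);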

mysql: removing duplicates while avoiding client timeout

Issue: hundreds of tables with identical schemas. Some of these have duplicated data that needs to be removed. My usual strategy for this (sketched in SQL after the list) is:
walk list of tables - for each do
create temp table with unique key on all fields
insert ignore select * from old table
truncate original table
insert select * back into original table
drop or clean temp table
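A rough SQL sketch of those steps for a single table (table and column names are placeholders):

CREATE TABLE tmp_t LIKE t;
ALTER TABLE tmp_t ADD UNIQUE INDEX uq_all (col1, col2, col3);

INSERT IGNORE INTO tmp_t SELECT * FROM t;  -- duplicate rows are silently skipped
TRUNCATE TABLE t;
INSERT INTO t SELECT * FROM tmp_t;
DROP TABLE tmp_t;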
For smaller tables this works fine. Unfortunately the tables I'm cleaning often have 100s of millions of records so my jobs and client connections are timing out while I'm running this. (Since there are hundreds of these tables I'm using Perl to walk the list and clean each one. This is where the timeout happens).
Some options I'm looking into:
mysqldump - fast but I don't see how to do the subsequent 'insert ignore' step
into outfile / load infile - also fast but I'm running from a remote host and 'into outfile' creates all the files on the mysql server. Hard to clean up.
do the insert/select in blocks of 100K records - this avoids the db timeout but it's pretty slow.
I'm sure there is a better way. Suggestions?
If an SQL query to find the duplicates can complete without timing out, I think you should be able to do a SELECT with a COUNT() aggregate and a GROUP BY ... HAVING clause that restricts the output to only the rows with duplicate data (HAVING COUNT(DUPEDATA) > 1). The results of this SELECT can be placed INTO a temporary table, which can then be joined with the primary table for the DELETE query.
This approach uses the set-operations strengths of SQL/MySQL -- no need for Perl coding.
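A hedged sketch of that approach, assuming a hypothetical table t with a primary key id and a column dupedata that defines what counts as a duplicate:

CREATE TEMPORARY TABLE dupes AS
SELECT dupedata, MIN(id) AS keep_id
FROM t
GROUP BY dupedata
HAVING COUNT(*) > 1;

-- remove every duplicate row except the one chosen to keep
DELETE t
FROM t
JOIN dupes d ON t.dupedata = d.dupedata
WHERE t.id <> d.keep_id;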

Mysql performance with multiple update in ONE query

I need to update about 100,000 records in a MySQL table (with indexes), so this process can take a long time. I'm searching for a solution that will work faster.
I have three solutions, but no time for speed tests.
Solutions:
a usual UPDATE for each new record in an array loop (bad performance)
using UPDATE syntax like in "Update multiple rows with one query?" - I can't find any performance results
using LOAD DATA INFILE with the same value for the key field; I guess in this case it will perform an UPDATE instead of an INSERT, which I guess should be faster
Do you know which solution is best?
The one important criteria is execution speed.
Thanks.
LOAD DATA INFILE is the fastest way to upsert a large amount of data from a file.
The second solution is not as bad as you might think, especially if you can execute something like:
UPDATE some_table
SET some_field = 'new_value'
WHERE id IN (list_of_ids);
But it would be better if you posted your update query.
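For reference, a common way to fold many per-row updates into one statement (a sketch with hypothetical table and column names; it assumes id is a primary or unique key) is a multi-row INSERT with ON DUPLICATE KEY UPDATE:

INSERT INTO some_table (id, some_field)
VALUES (1, 'a'), (2, 'b'), (3, 'c')
ON DUPLICATE KEY UPDATE some_field = VALUES(some_field);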