How to UPDATE many rows (1,500,000) fast - MySQL

I have a table with 1.5 million rows and 47k values to update.
I've tried two ways of doing it and both are pretty slow.
The first is 47k individual statements of the form:
UPDATE $table
SET name = '$name'
WHERE id = '$id'
The second builds one giant CASE expression:
$prefix = "UPDATE $table SET name = (CASE ";
$mid = "";
while (/* loop over the 47k id/name pairs */) {
    $mid .= "WHEN id = '$id' THEN '$name' ";
}
$suffix = "END);";
$query = $prefix . $mid . $suffix;
Is there a faster way of doing this? Maybe with LOAD DATA INFILE? I can't figure out the UPDATE syntax for that one.
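A caveat on the CASE version: as written it has no ELSE branch and no WHERE clause, so every row whose id is not in the batch gets its name set to NULL. A minimal sketch of the safer shape (table name and values are placeholders):
UPDATE my_table
SET name = CASE id
    WHEN '1' THEN 'first name'
    WHEN '2' THEN 'second name'
    ELSE name
END
WHERE id IN ('1', '2');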

I had to import large files on a daily basis, and tried all sorts of things.
In the end I got the best performance with a specific combination of:
First copy the CSV to the database server and load it from the local disk there, instead of loading it from your client machine.
Make sure that you have a table structure that matches the file exactly. I used a temporary table for the import, and then used separate queries on that to get the data into the final table.
Put no foreign keys or unique indexes on the tmp table.
That will speed things up a lot already. If you need to squeeze out more performance, you can increase the log buffer size. (A sketch of the whole sequence follows below.)
And obviously:
make sure that you don't import stuff that you don't need. Be critical about which fields and which rows you include.
If you only have a few different values of text in a column, use a numeric value for it instead.
Do you really need 8 decimals in your floats?
Are you repeatedly importing the same data, where you could insert updates only?
Make sure that you don't trigger unnecessary type conversions during import. Prepare your data to be as close as possible to the table that you're importing into.
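For illustration, here is a minimal sketch of that combination, with hypothetical table names and file path (the CSV is assumed to already sit on the database server):
-- staging table with no foreign keys or unique indexes
CREATE TABLE import_tmp (
    id INT NOT NULL,
    name VARCHAR(255)
);
-- load from the server's local disk
LOAD DATA INFILE '/var/lib/mysql-files/import.csv'
INTO TABLE import_tmp
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(id, name);
-- then move the data into the final table with a single join update
UPDATE final_table f
JOIN import_tmp t ON t.id = f.id
SET f.name = t.name;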

Related

Get subset of rows based on a list of primary keys in file

I have a large (1000s) list of IDs in a text file. I want to return the rows from my database that correspond to these IDs. How can I do this?
My current method is to simply paste the whole list into a gigantic SQL query and run that. It works, but I feel like there must be a better way.
As the list of values grows bigger and bigger, a better solution is to load it into a table that you can then use in your query. In MySQL, the LOAD DATA statement syntax comes in handy for this.
Consider something like:
create temporary table all_ids (id int);
load data infile 'myfile.txt' into table all_ids;
create index idx_all_ids on all_ids(id); -- for performance
select t.*
from mytable t
where exists (select 1 from all_ids a where a.id = t.id);
The load data syntax accepts many options to accommodate the format of the input file - you can read the documentation for more information.
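For example, a comma-separated file with a header row could be loaded with something along these lines (the exact clauses depend on your file):
load data infile 'myfile.txt'
into table all_ids
fields terminated by ','
lines terminated by '\n'
ignore 1 lines
(id);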

MySQL and queries performance

A question regarding best practices for queries and Perl.
Is there a performance issue, or is it 'bad practice', to run a query inside a Perl for loop like this? Is it a problem to send so many statements to the DB in rapid fire?
code is quasi pseudo code
# @line has 5000 lines
foreach my $elem ( @line ) {
    SQL = INSERT IGNORE INTO <table> ( column1, .. , column10 ) VALUES ( 'a', .. , 'j' )
}
What about deletes and/or updates?
foreach my $elem ( @line ) {
    my $UN = substr( $elem, 0, 10 );
    SQL = UPDATE <table> SET <column> = $UN;
}
foreach my $elem ( @line ) {
    my $UN = substr( $elem, 0, 10 );
    SQL = DELETE FROM <table> WHERE <column> = $UN
}
Also, a question in the same arena: I have 5000 items to check, and my database has anywhere from 1 to 5000 of them at any given time. Is it acceptable to loop through my 5000 items in Perl and delete each ID from the database, or should there be a check first to see whether the ID exists before issuing the delete command?
foreach my $elem ( @line ) {
    $ID = substr( $elem, 5, 0 );
    SQL = DELETE FROM <table> WHERE id = $ID;
}
or should it be something like:
foreach my $elem ( @line ) {
    $ID = substr( $elem, 5, 0 );
    SQL = DELETE FROM <table> WHERE id = $ID if ID exists;
}
Thanks,
--Eherr
As for inserts in rapid succession, not a problem. The server is tailored to handle that.
Caution should be taken with INSERT IGNORE for other reasons, though: program logic that ought to handle a failure will never get the chance, because the failure was silently ignored.
As for the particular UPDATE you showed, it does not make much sense in a loop (or perhaps at all), because you are not specifying a WHERE clause. Why loop, say, 1000 times, with each iteration updating every row in the table? Maybe that was just a typo.
As for deletes, there is no problem doing those in a loop either, in general. If you are looking to empty a table, look into TRUNCATE TABLE: it is faster, and not logged, if that is ever a requirement. Note, though, that TRUNCATE is disallowed on a table that is the referenced table in a foreign key constraint.
Other general comments: take care that any referential integrity that is (or should be) in place is honored. An INSERT IGNORE, UPDATE, or DELETE can fail due to foreign key constraints. Also, checking for the existence of a row that you are about to delete anyway is overkill: the DELETE has to walk down the B-tree to find the row either way, so a prior existence check just does that work twice, and on a table scan it is even more added cost.
Lastly, for massive bulk inserts, loops in any programming language are never up to the task compared to LOAD DATA INFILE. Several of your peers have seen 8-to-10-hour operations reduced to 2 minutes by switching to LOAD DATA INFILE; a minimal sketch follows the manual links below.
MySQL manual pages:
Referential Integrity with Foreign Key Constraints
Quickly clearing tables with Truncate Table
Bulk inserts with LOAD DATA INFILE.
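As a rough sketch of that bulk-insert path (file path, delimiter, and column list are assumptions, not your actual data):
LOAD DATA INFILE '/path/to/lines.txt'
INTO TABLE some_table
FIELDS TERMINATED BY ','
(column1, column2, column3);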
In my opinion, it is a bit slow to make multiple queries. It is better to construct a single UPDATE, INSERT, SELECT, or DELETE query and fire that (a sketch follows the list below).
A few things to keep in mind before choosing between multiple queries and a single query:
1) If the database is configured to kill any query that takes more than a specified time, a single query that is too large can end up being killed.
2) Also, if a user is waiting for the response, pagination can help: fetch a few records now and subsequent ones later, rather than one by one.
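As a sketch, a single multi-row INSERT (or one DELETE over a list of keys) replaces thousands of round trips; the table, columns, and values below are placeholders:
INSERT IGNORE INTO some_table (column1, column2)
VALUES ('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3');

DELETE FROM some_table WHERE id IN (101, 102, 103);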
5000 queries into a database shouldn't be a performance bottleneck. You're fine. You can always benchmark a read-only run.

Large MySQL Table daily updates?

I have a MySQL table that has a bunch of product pricing information on around 2 million products. Every day I have to update this information for any products whose pricing information has changed [huge pain].
I am wondering what the best way to handle these changes is, other than running something that compares and updates every product that has changed?
I'd love any advice you can provide.
For bulk updates you should definitely be using LOAD DATA INFILE rather than a lot of smaller update statements.
First, load the new data into a temporary table:
LOAD DATA INFILE 'foo.txt' INTO TABLE bar (productid, info);
Then run the update:
UPDATE products, bar SET products.info = bar.info WHERE products.productid = bar.productid;
If you also want to INSERT new records from the same file that you're updating from, you can SELECT INTO OUTFILE all of the records that don't have a matching ID in the existing table then load that outfile into your products table using LOAD DATA INFILE.
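A sketch of that insert path, reusing the bar staging table from above (the output path is hypothetical, and the MySQL server needs permission to write there):
SELECT bar.productid, bar.info
INTO OUTFILE '/tmp/new_products.txt'
FROM bar
LEFT JOIN products ON products.productid = bar.productid
WHERE products.productid IS NULL;

LOAD DATA INFILE '/tmp/new_products.txt' INTO TABLE products (productid, info);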
I maintain a price comparison engine with millions of prices, and I select each row that I find in the source and update each row individually. If there is no row, then I insert. It's best to use InnoDB transactions to speed this up.
This is all done by a PHP script that knows how to parse the source files and update the tables.
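A sketch of that pattern: wrap each batch of per-row statements in one InnoDB transaction so they share a single commit (table and column names are made up):
START TRANSACTION;
UPDATE prices SET price = 9.99 WHERE product_id = 1001;
UPDATE prices SET price = 4.50 WHERE product_id = 1002;
INSERT INTO prices (product_id, price) VALUES (1003, 7.25);
COMMIT;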

How to check if a value in the MySQL DB exists?

I have a mySQL database and I have a Perl script that connects to it and performs some data manipulation.
One of the tables in the DB looks like this:
CREATE TABLE `mydb`.`companies` (
`company_id` INT NOT NULL AUTO_INCREMENT,
`company_name` VARCHAR(100) NULL ,
PRIMARY KEY (`company_id`) );
I want to insert some data into this table. The problem is that some companies in the data can be repeated.
The question is: how do I check whether the company_name already exists? If it exists, I need to retrieve the company_id and use it to insert the data into another table. If it does not, the company should be inserted into this table, but I already have that code.
Here is an additional requirement: the script can be run multiple times simultaneously, so I can't just read the data into a hash and check whether it already exists there.
I could throw in an additional SELECT query, but it would mean an additional hit on the DB.
I tried to look for an answer, but every question here and every thread on the web talks about checking against the primary key. I don't need that. The DB structure is already set, but I can make changes if need be. This table will be used as an additional table.
Is there another way, in either the DB or Perl?
"The script can be run multiple times simultaneously, so I can't just read the data into the hash and check if it already exist."
It sounds like your biggest concern is that one instance of the script may insert a new company name while another instance is running. The two scripts might both check the DB for a company name that doesn't exist yet, and then both insert it, resulting in a duplicate.
Assuming I'm understanding your problem correctly, you need to look at transactions. You need to be able to check for the data and insert it before anyone else is allowed to check for that data. That will keep a second instance of the script from checking for the data until the first instance is done checking AND inserting.
Check out: http://dev.mysql.com/doc/refman/5.1/en/innodb-transaction-model.html
And: http://dev.mysql.com/doc/refman/5.1/en/commit.html
MyISAM doesn't support transactions; InnoDB does. So you need to make sure your table is InnoDB, and start your set of queries with START TRANSACTION.
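A minimal sketch of that flow, assuming the companies table is InnoDB and company_name is indexed; the unique index suggested below is still the simplest guarantee, since two concurrent runs of this block can deadlock and one will need a retry:
START TRANSACTION;
-- lock the (possibly missing) row so a concurrent insert of the same name blocks
SELECT company_id FROM companies WHERE company_name = 'Acme Corp' FOR UPDATE;
-- if the SELECT returned nothing, insert and pick up the new id
INSERT INTO companies (company_name) VALUES ('Acme Corp');
SELECT LAST_INSERT_ID();
COMMIT;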
Alternatively, you could do this, if you have a unique index on company_name (which you should).
$query_string = "INSERT INTO `companies` (NULL,'$company_name')";
This will result in an error if the company_name already exists. Try a sample run attempting to insert a duplicate company name. In PHP,
$result = mysql_query($query_string);
$result will equal false on error. So,
if (!$result) {
    $query2 = "INSERT INTO `other_table` VALUES (NULL,'$company_name')";
    $result2 = mysql_query($query2);
}
If you have a unique key on company_name in both tables, then MySQL will not allow you to insert duplicates. Your multiple scripts may spend a lot of time trying to insert duplicates, but they will not succeed.
EDIT: continuing from the above code, and doing your work for you, here is what you would do if the insert was successful.
if (!$result) {
    $query2 = "INSERT INTO `other_table` VALUES (NULL,'$company_name')";
    $result2 = mysql_query($query2);
} else if ($result !== false) { // must use '!==' rather than '!=' because of loose PHP typing
    $last_id = mysql_insert_id();
    $query2 = "UPDATE `other_table` SET `some_column` = 'some_value' WHERE `id` = '$last_id'";
    // OR, maybe you want this query:
    // $query2a = "INSERT INTO `other_table` (`id`,`foreign_key_id`) VALUES (NULL,'$last_id')";
}
I suggest you write a stored procedure that takes the company name as input.
In this procedure, first check for an existing company with that name. If it exists, return its id. Otherwise, insert it and return the new id.
This way you hit the DB only once.
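A sketch of what such a procedure could look like (names are made up; a unique index on company_name is still advisable so two concurrent callers cannot both insert the same name):
DELIMITER //
CREATE PROCEDURE get_or_create_company(IN p_name VARCHAR(100), OUT p_id INT)
BEGIN
    -- look for an existing company; p_id stays NULL if there is none
    SELECT company_id INTO p_id FROM companies WHERE company_name = p_name;
    IF p_id IS NULL THEN
        INSERT INTO companies (company_name) VALUES (p_name);
        SET p_id = LAST_INSERT_ID();
    END IF;
END //
DELIMITER ;

-- usage from a client:
CALL get_or_create_company('Acme Corp', @id);
SELECT @id;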
For InnoDB, use a transaction. For MyISAM, lock the table, do the modifications, then unlock it.
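For the MyISAM route, a rough sketch (the lock must be held across both the check and the insert; names are again made up):
LOCK TABLES companies WRITE;
SELECT company_id FROM companies WHERE company_name = 'Acme Corp';
-- if no row came back:
INSERT INTO companies (company_name) VALUES ('Acme Corp');
UNLOCK TABLES;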

Is there a smart way to mass UPDATE in MySQL?

I have a table that needs regular updating. These updates happen in batches. Unlike with INSERT, I can't just include multiple rows in a single query. What I do now is prepare the UPDATE statement, then loop through all the rows and execute it for each one. Sure, the preparation happens only once, but there are still a lot of executions.
I created several versions of the table of different sizes (thinking that maybe better indexing or splitting the table would help). However, that had no effect on update times: 100 updates take about 4 seconds whether the table has 1,000 rows or 500,000.
Is there a smarter, faster way of doing this?
As asked in the comments, here is the actual code (PHP) I have been testing with. Column id is the primary key.
$stmt = $dblink->prepare("UPDATE my_table SET col1 = ?, col2 = ? WHERE id = ?");
$rc = $stmt->bind_param("ssi", $c1, $c2, $id);
foreach ($items as $item) {
    $c1 = $item['c1'];
    $c2 = $item['c2'];
    $id = $item['id'];
    $rc = $stmt->execute();
}
$stmt->close();
If you really want to do it all in one big statement, a kludgy way is to use the ON DUPLICATE KEY functionality of the INSERT statement, even though all the rows should already exist and the duplicate-key update will fire for every single row.
INSERT INTO table (a,b,c) VALUES (1,2,3),(4,5,6)
ON DUPLICATE KEY UPDATE a=VALUES(a), b=VALUES(b), c=VALUES(c);
Try LOAD DATA INFILE. It is much faster than MySQL INSERTs or UPDATEs, as long as you can get the data into a flat format.