I want to update about a million (or half a million) records in my database, but it is going very slowly. It took a couple of hours to update only 100,000. Do you guys have any ideas?
Basically I have a process that encrypts a particular column value and then updates it back to the database. I can't do it at the database level because of a code integration dependency.
sample code:
dbContext.Configuration.AutoDetectChangesEnabled = false;
List<Users> usersLst = dbContext.Users.AsNoTracking().Take(500000).ToList();
foreach (var usr in usersLst) {
    usr.Password = this.Encrypt(usr.Password);
    dbContext.Entry(usr).State = System.Data.Entity.EntityState.Modified;
}
dbContext.SaveChanges();
Note: I tried the same thing with SQL Server and it is much faster: 1M records update in 15-20 minutes.
An UPDATE will take too much time. From my personal experience, don't update row by row if you have more than 10k records. A better way is:
Write a subquery for that field that returns the new values you need alongside the old data from your table.
Convert this SELECT query into a view. If you are new to creating views, just check the structure here.
Now write an INSERT query that takes its values from the view you just created.
The INSERT won't take too much time; for me it took 10 minutes to insert 10 million (1 crore) rows into a table from a view.
If you feel that is still too much time, you can do the insert using a trigger in MySQL, which took only 3 or 4 minutes.
Hope I have answered your question. Just try it and give me feedback.
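For example, a rough MySQL-flavoured sketch of that pattern (table, column, and function names are placeholders, and it assumes the new value can be computed in SQL, e.g. by a hypothetical encrypt_value() function):

-- 1. A view pairing each row's key with the value it should end up with
CREATE VIEW users_new_password AS
SELECT u.id,
       encrypt_value(u.password) AS new_password   -- hypothetical transformation
FROM users u;

-- 2. One bulk INSERT ... SELECT from the view into a fresh table
CREATE TABLE users_encrypted AS
SELECT id, new_password AS password
FROM users_new_password;

In practice the view would carry the other columns along too; the point is that the database does all the work in two set-based statements instead of millions of individual updates.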
I have a table named TRENDS, containing around 20k records. I need to manipulate each row of the TRENDS table based on each column value; the final output for a row is a string called an insight, which is nothing but that manipulated row. I then need to store that insight in an INSIGHTS table. Along with each insight I generate 3 more queries, which live in three separate functions. The result of each query is stored in another table called FACTS, along with an insight_id to indicate that these 3 facts belong to the same insight.
Since the data is in a MySQL database, I used the mysql-connector library for Python in my scripts for the retrieval and insertion operations.
For each insight and its 3 facts I perform an execute() and commit(), which takes 3 seconds per set of records to insert, and there are 20k records in the TRENDS table, so it is taking a very long time to complete.
What is the fastest way to solve this problem?
Please suggest a better algo if possible.
We'll be able to provide much more help if you can provide a sample of the data in each table and your desired output. Here are some very general pointers based on what you've said:
There is a lot of overhead to executing a query; if you read and write one row at a time, you will waste a huge amount of execution time waiting for queries to execute.
Bulk operations are much faster. Why not read all 20k rows at once and then write 20k back, or if that's too demanding on your local system, why not do 1000 at a time?
...or see if you can write a query which completes the entire operation in SQL
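For instance, if the insight really is just a combination of the row's columns, something along these lines (the column names are placeholders for however the row is actually manipulated) would replace 20k round trips with a single statement:

INSERT INTO insights (insight)
SELECT CONCAT(col1, col2, col3)   -- placeholder for the real row manipulation
FROM trends;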
The fastest method would probably be to use a stored procedure that you call just once, in order to avoid round trips (i.e. your app calling MySQL, waiting for a response, sending the statement, getting an okay, executing the statement, getting a success message, ...). As for the three fact inserts, you can probably make them one statement using UNION ALL.
Here is an example:
DELIMITER //
CREATE PROCEDURE myproc()
BEGIN
    DECLARE v_first_new_insight_id INT DEFAULT 2147483647;
    START TRANSACTION;
    -- build the insight string on every trends row in one statement
    UPDATE trends SET insight = CONCAT(col1, col2, col3, DATE_FORMAT(CURRENT_DATE, '%d'));
    -- copy all insights over in one bulk INSERT ... SELECT
    INSERT INTO insights (insight) SELECT insight FROM trends;
    -- id of the first row generated by the multi-row INSERT above
    SET v_first_new_insight_id = LAST_INSERT_ID();
    -- three facts per new insight, written with a single INSERT
    INSERT INTO facts (insight_id, fact)
    SELECT insight_id, get_fact1(insight) FROM insights WHERE insight_id >= v_first_new_insight_id
    UNION ALL
    SELECT insight_id, get_fact2(insight) FROM insights WHERE insight_id >= v_first_new_insight_id
    UNION ALL
    SELECT insight_id, get_fact3(insight) FROM insights WHERE insight_id >= v_first_new_insight_id;
    COMMIT;
END //
DELIMITER ;
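Your Python script would then make a single call instead of 20k round trips:

CALL myproc();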
I have a database with 200+ entries, and with a cronjob I'm updating the database every 5 minutes. All entries are unique.
My code:
foreach ($players as $pl) {
    mysql_query("UPDATE rp_players SET state_one = '".$pl['s_o']."', state_two = '".$pl['s_t']."' WHERE id = '".$pl['id']."' ")
        or die(mysql_error());
}
There are 200+ queries every 5 minutes. I don't know what would happen if my database had many more entries (2000... 5000+). I think the server would die.
Is there any solution (optimization or something...)?
I don't think you can do much except make the cron execute every 10 minutes if it's getting slower and slower. Also, you can set a rule to delete entries older than X days.
If id is your primary (and unique, as you mentioned) key, updates should be fast and can't be optimised much further (since it's a primary key; if not, see whether you can add an index).
The only problem which could occur (to my mind) is cronjob overlapping due to slow updates: let's assume your job starts at 1:00am and isn't finished at 1:05am... this will mean that your queries pile up, creating server load, slow response times, etc...
If this is your case, you should use RabbitMQ to queue your update queries so they are processed in a more controlled way...
I would load all data that is to be updated into a temporary table using the LOAD DATA INFILE command: http://dev.mysql.com/doc/refman/5.5/en/load-data.html
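For example (the file path, table name, and column definitions below are placeholders):

CREATE TEMPORARY TABLE tmp_players (
    id INT PRIMARY KEY,
    state_one VARCHAR(50),
    state_two VARCHAR(50)
);

LOAD DATA INFILE '/tmp/players.csv'
INTO TABLE tmp_players
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(id, state_one, state_two);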
Then, you could update everything with one query:
UPDATE rp_players p
INNER JOIN tmp_players t
ON p.id = t.id
SET p.state_one = t.state_one
, p.state_two = t.state_two
;
This would be much more efficient because you would remove a lot of the back and forth to the server that you are incurring by running a separate query every time through a php loop.
Depending on where the data is coming from, you might be able to remove PHP from this process entirely.
I need to update about 100,000 records in a MySQL table (with indexes), so this process can take a long time. I'm searching for the solution that will work fastest.
I have three possible solutions, but I have no time for speed tests.
Solutions:
a usual UPDATE with each new record in an array loop (bad performance)
using UPDATE syntax like here: Update multiple rows with one query? - I can't find any performance results for it
using LOAD DATA INFILE with the same value for the key field; I guess in this case it will perform an UPDATE instead of an INSERT - I guess it should work faster in any case
Do you know which solution is best?
The one important criterion is execution speed.
Thanks.
LOAD DATA INFILE is the fastest way to upsert a large amount of data from a file (see the sketch at the end of this answer);
the second solution is not as bad as you might think, especially if you can execute something like
UPDATE your_table
SET field = new_value
WHERE id IN (list_of_ids);
but it would be better to post your update query.
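As a sketch of the LOAD DATA INFILE upsert: the REPLACE keyword makes rows whose key already exists overwrite the old ones (file, table, and column names are placeholders):

LOAD DATA INFILE '/tmp/updates.csv'
REPLACE INTO TABLE your_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(id, field);

Note that REPLACE deletes and re-inserts the whole row, so the file needs to contain every column you care about keeping.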
I am deleting approximately 1/3 of the records in a table using the query:
DELETE FROM `abc` LIMIT 10680000;
The query appears in the processlist with the state "updating". There are 30m records in total. The table has 5 columns and two indexes, and when dumped to SQL the file is about 9GB.
This is the only database and table in MySQL.
This is running on a machine with 2GB of memory, a 3 GHz quad-core processor and a fast SAS disk. MySQL is not performing any reads or writes other than this DELETE operation. No other "heavy" processes are running on the machine.
This query has been running for more than 2 hours -- how long can I expect it to take?
Thanks for the help! I'm pretty new to MySQL, so any tidbits about what's happening "under the hood" while running this query are definitely appreciated.
Let me know if I can provide any other information that would be pertinent.
Update: I just ran a COUNT(*), and in 2 hours, it's only deleted 200k records. I think I'm going to take Joe Enos' advice and see how well inserting the data into a new table and dropping the previous table performs.
Update 2: Sorry, I actually misread the number. In 2 hours, it's not deleted anything. I'm confused. Any suggestions?
Update 3: I ended up using mysqldump with --where "true LIMIT 10680000,31622302" and then importing the data into a new table. I then deleted the old table and renamed the new one. This took just over half an hour.
Don't know if this would be any better, but it might be worth thinking about doing the following:
Create a new table and insert 2/3 of the original table into the new one.
Drop the original table.
Rename the new table to the original table's name.
This would prevent the log file from having all the deletes, but I don't know if inserting 20m records is faster than deleting 10m.
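Something like this, with a placeholder WHERE condition for whichever 2/3 of the rows you want to keep (assuming an abc_pk primary key column):

CREATE TABLE abc_new LIKE abc;

INSERT INTO abc_new
SELECT * FROM abc
WHERE abc_pk > 10680000;      -- placeholder: the rows you want to keep

DROP TABLE abc;
RENAME TABLE abc_new TO abc;

CREATE TABLE ... LIKE copies the indexes as well; it can be faster to create the new table without the secondary indexes and add them back after the insert.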
You should post the table definition.
Also, to understand why it is taking so much time, try enabling profiling on the delete request via:
SET profiling=1;
DELETE FROM abc LIMIT 10680000;
SET profiling=0;
SHOW PROFILES;
SHOW PROFILE ALL FOR QUERY X; (X is the ID of your query shown in SHOW PROFILES)
and post what it returns (but I think the query must finish before it will return the profiling data).
http://dev.mysql.com/doc/refman/5.0/en/show-profiles.html
Also, I think you'll get more responses on ServerFault ;)
When you run this query, the InnoDB log file for the database is used to record all the details of the rows that are deleted - and if this log file isn't large enough from the outset it'll be auto-extended as and when necessary (if configured to do so) - I'm not familiar with the specifics but I expect this auto-extension is not blindingly fast. 2 hours does seem like a long time - but doesn't surprise me if the log file is growing as the query is running.
Is the table from which the records are being deleted on the end of a foreign key (i.e. does another table reference it through a FK constraint)?
I hope your query has finished by now... :) but from what I've seen, LIMIT with large numbers (and I have never tried numbers this large) is very slow. I would try something based on the PK, like:
DELETE FROM abc WHERE abc_pk < 10680000;
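If a single huge DELETE is still too big a transaction, a common variation on the same idea (not from the original answer, just a widely used pattern) is to delete in smaller chunks so each statement stays manageable:

DELETE FROM abc WHERE abc_pk < 10680000 LIMIT 10000;
-- repeat (e.g. from a small script) until it affects 0 rows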
EDIT: To clarify, the records originally come from a flat-file database and are not in the MySQL database.
One of our existing C programs, whose purpose is to take data from the flat file and insert it (based on criteria) into the MySQL tables, works like this:
Open connection to MySQL DB
for record in all_records_of_my_flat_file:
    if record contains a certain field:
        if record is NOT in sql_table A:                      // see #1
            insert record information into sql_table A and B  // see #2
Close connection to MySQL DB
#1: select field from sql_table A where field=XXX
#2: 2 inserts
I believe that management did not feel it was worth it to add the functionality so that when the field in the flat file is created, it would be inserted into the database. This is specific to one customer (that I know of). I, too, felt it odd that we use a tool such as this to "sync" the data. I was given the duty of using and maintaining this script, so I haven't heard too much about the entire process. The intent is primarily to handle additional records, so this is not the first time it is used.
This is typically done every X months to sync everything up, or so I'm told. I've also been told that this process takes roughly a couple of days. There are (currently) at most 2.5 million records (though not necessarily all 2.5M will be inserted, and most likely many fewer). One of the tables contains 10 fields and the other 5 fields. There isn't much to be done about iterating through the records, since that part can't be changed at the moment. What I would like to do is speed up the part where I query MySQL.
I'm not sure if I have left out any important details -- please let me know! I'm also no SQL expert so feel free to point out the obvious.
I thought about:
Putting all the inserts into a transaction (at the moment I'm not sure how important it is for the transaction to be all-or-none or if this affects performance)
Using Insert X Where Not Exists Y
LOAD DATA INFILE (but that would require I create a (possibly) large temp file)
I read that (hopefully someone can confirm) I should drop indexes so they aren't re-calculated.
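For the last point, the drop/re-add would look roughly like this (the index and column names are made up):

ALTER TABLE sql_table_a DROP INDEX idx_field;
-- ... perform the bulk inserts ...
ALTER TABLE sql_table_a ADD INDEX idx_field (field);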
mysql Ver 14.7 Distrib 4.1.22, for sun-solaris2.10 (sparc) using readline 4.3
Why not upgrade your MySQL server to 5.0 (or 5.1), and then use a trigger so it's always up to date (no need for the monthly script)?
DELIMITER //
CREATE TRIGGER insert_into_a AFTER INSERT ON source_table
FOR EACH ROW
BEGIN
    IF NEW.foo > 1 THEN
        -- only insert if the record is not already in table a
        IF NOT EXISTS (SELECT 1 FROM a WHERE a.id = NEW.id) THEN
            INSERT INTO a (col1, col2) VALUES (NEW.col1, NEW.col2);
            INSERT INTO b (col1, col2) VALUES (NEW.col1, NEW.col2);
        END IF;
    END IF;
END //
DELIMITER ;
Then, you could even set up update and delete triggers so that the tables are always in sync (if the source table's col1 is updated, it'll automatically propagate to a and b)...
Here's my thoughts on your utility script...
1) Is just good practice anyway; I'd do it no matter what.
2) May save you a considerable amount of execution time. If you can solve a problem in straight SQL without using iteration in a C program, this can save a fair amount of time. You'll have to profile it first in a test environment to ensure it really does.
3) LOAD DATA INFILE is a tactic to use when inserting a massive amount of data. If you have a lot of records to insert (I'd write a query to do an analysis to figure out how many records you'll have to insert into table B), then it might behoove you to load them this way.
Dropping the indexes before the insert can be helpful to reduce running time, but you'll want to make sure you put them back when you're done.
Although... why aren't all the records in table B in the first place? You haven't mentioned how processing works, but I would think it would be advantageous to ensure (in your app) that the records got there without your service script's intervention. Of course, you understand your situation better than I do, so ignore this paragraph if it's off-base. I know from experience that there are lots of reasons why utility cleanup scripts need to exist.
EDIT: After reading your revised post, your problem domain has changed: you have a bunch of records in a (searchable?) flat file that you need to load into the database based on certain criteria. I think the trick to doing this as quickly as possible is to determine where the C application is actually the slowest and spends the most time spinning its proverbial wheels:
If it's reading off the disk, you're stuck, you can't do anything about that, unless you get a faster disk.
If it's doing the SQL query-insert operation, you could try optimizing that, but you're doing a comparison between two data sources (the flat file and the MySQL database).
A quick thought: by doing a LOAD DATA INFILE bulk insert to populate a temporary table very quickly (perhaps even an in-memory table if MySQL allows that), and then doing the INSERT IF NOT EXISTS might be faster than what you're currently doing.
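A rough sketch of that idea (all table, column, and file names are placeholders; the in-memory variant would just be ENGINE=MEMORY on the temporary table, if you want to try it):

CREATE TEMPORARY TABLE tmp_flat (
    id INT PRIMARY KEY,
    col1 VARCHAR(100),
    col2 VARCHAR(100)
);

-- bulk load the flat-file extract
LOAD DATA INFILE '/tmp/flatfile_extract.csv' INTO TABLE tmp_flat
FIELDS TERMINATED BY ',';

-- insert only the rows that table a doesn't already have
INSERT INTO a (col1, col2)
SELECT t.col1, t.col2
FROM tmp_flat t
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.id = t.id);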
In short, do profiling, and figure out where the slowdown is. Aside from that, talk with an experienced DBA for tips on how to do this well.
I discussed this with another colleague and here are some of the improvements we came up with:
For:
SELECT X FROM TABLE_A WHERE Y=Z;
Change to (currently awaiting verification on whether X is, and always will be, unique):
SELECT X FROM TABLE_A WHERE X=Z LIMIT 1;
This was an easy change and we saw some slight improvements. I can't really quantify it well, but I did:
SELECT X FROM TABLE_A ORDER BY RAND() LIMIT 1
and compared it against the first two queries. For a few tests there was about a 0.1 second improvement. Perhaps it cached something, but the LIMIT 1 should help somewhat.
Then another (yet to be implemented) improvement(?):
for record number X in entire record range:
    if (no CACHE)
        CACHE = retrieve Y records (sequentially) from the database
    if (X exceeds the highest record number in cache)
        CACHE = retrieve the next set of Y records (sequentially) from the database
    search for record number X in CACHE
    ...etc
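The "retrieve the next Y records" step could be a simple keyset query, assuming TABLE_A has an auto-increment id for the record number (names are placeholders):

SELECT id, X
FROM TABLE_A
WHERE id > 12345      -- the highest record number currently in the cache
ORDER BY id
LIMIT 1000;           -- Y, the cache batch size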
I'm not sure what to set Y to; are there any methods for determining a good size to try? The table has 200k entries. I will edit in some results when I finish the implementation.