I am using MySQL as my database. I need to update some data, but the data may not have changed, in which case I do not need to update the row.
I wanted to know which approach will perform better:
a) Search the table first to determine whether the data has changed. For example, I can look the row up by primary key and check whether the values of the remaining fields have changed. If they have, continue with the UPDATE statement; if not, skip it.
b) Use the UPDATE query directly. If there are no changes in the data, MySQL will automatically notice this and not actually update the row.
So which one will perform better in this case?
From the MySQL manual:
If you set a column to the value it currently has, MySQL notices this
and does not update it.
So save yourself the latency and leave that task to MySQL. It will even tell you how many rows were actually affected.
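For illustration, a minimal sketch (the table and column names here are made up):
UPDATE customers
SET email = 'new@example.com'
WHERE id = 42;
-- reports "0 rows affected" if the email was already 'new@example.com'
SELECT ROW_COUNT(); -- rows actually changed by the last statement
Most client libraries expose the same affected-rows count directly, so the extra SELECT is usually unnecessary.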
The first option seems better to me, but only in a specific scenario:
Select all or some rows from the table and fetch them into a result set.
Traverse the result set (in-memory traversal is fast enough), pick out the primary keys of the records you actually want to update, and then execute the UPDATE queries.
This seems like a comparatively efficient solution to me, compared to executing an UPDATE query on each row regardless of whether it needs one.
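A rough sketch of that approach (the table and column names are hypothetical):
SELECT id, status FROM orders WHERE batch_id = 7;
-- compare the fetched values in application code, collect the ids whose values differ, then:
UPDATE orders SET status = 'shipped' WHERE id IN (3, 9, 27);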
If you use the select-then-update approach, you need to lock the row (e.g. SELECT ... FOR UPDATE), otherwise you are in a race condition - the row can be changed after you selected it and checked that it hasn't been changed.
As @AndreKR pointed out, MySQL won't perform any write operation if the values are the same, so using UPDATE directly is faster than using two queries.
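A minimal sketch of the locking variant (table and column names are again hypothetical):
START TRANSACTION;
SELECT status FROM orders WHERE id = 42 FOR UPDATE; -- locks the row until COMMIT
-- compare the fetched value in application code, then update only if it differs:
UPDATE orders SET status = 'shipped' WHERE id = 42;
COMMIT;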
I'm using DBAL in my project because it is easier to convert the database statements of an already written project that I'm porting to Symfony v2.8 and MySQL than to go with full-on Doctrine, but now I need to implement "read-only row locks" to prevent data changes by other users while a pair of tightly coupled but separate SELECT statements are executed consecutively, and I'm thinking that I should use transactions and SELECT FOR UPDATE statements. However, I don't see that DBAL supports SELECT FOR UPDATE statements in its documentation. I do see that transactions are supported, but as I understand it, these alone won't prevent other users from UPDATE-ing or DELETE-ing the data in the same row that the SELECT statements are using.
Specifically, the two SELECTs share data: the first SELECT retrieves one row, and a second SELECT retrieves multiple rows from the same tables based on that first result. The two SELECTs are somewhat complex, and I don't know if I could combine them into a super-sized single SELECT, nor do I really want to, as that would make the new SELECT harder to maintain in the future.
The problem is that other users could be updating the same values retrieved by the first SELECT, and if this happens between the two SELECTs, it would break the second SELECT of the pair and either prevent it from returning data or at least make it return the wrong data.
I believe that I need to use a SELECT FOR UPDATE to lock the row it retrieves, to temporarily prevent other users from performing their updates and deletes on the single row retrieved by the first SELECT of the pair. But since I'm not actually performing an update, just two SELECTs, how do I release the lock on the one row locked by the first SELECT without performing a 'fake' update, say by UPDATE-ing a column value with the same value it already had?
Thanks
For a transaction in which you want repeatable results:
START TRANSACTION READ ONLY
SELECT ...
{some processing}
SELECT {that covers the same rows} [will return the same result]
COMMIT
note: READ ONLY is optional
Experiment by running two mysql client connections and observe the results. The other connection can modify or insert rows covered by the first transaction's SELECT criteria, and the first transaction won't observe them.
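A sketch of such an experiment (the table name is made up); under the default REPEATABLE READ isolation level the first connection keeps seeing its original snapshot:
-- connection 1
START TRANSACTION READ ONLY;
SELECT * FROM accounts WHERE id = 1;
-- connection 2, in the meantime
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
-- connection 1 again: returns the same values as its first SELECT
SELECT * FROM accounts WHERE id = 1;
COMMIT;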
I'm using MySQL 5.6. Let's say we have the following two tables:
Every DataSet has a huge number of child DataEntry records, on the order of 10000 or 100000 or more. DataSet.md5sum and DataSet.version get updated in one transaction when child DataEntry records are inserted or deleted. A DataSet.md5sum is calculated over all of its children's DataEntry.content values.
In this situation, what's the most efficient way to fetch consistent data from those two tables?
If I issue the following two distinct SELECTs, I think I might get inconsistent data due to concurrent INSERT / UPDATEs:
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000
SELECT dataentry_id, content FROM DataEntry WHERE dataset_id = 1000 -- I think the result of this query may be inconsistent with the md5sum fetched by the former query
I think I can get consistent data with one query as follows:
SELECT e.dataentry_id, e.content, s.md5sum, s.version
FROM DataSet s
INNER JOIN DataEntry e ON (s.dataset_id = e.dataset_id)
WHERE s.dataset_id = 1000
But it produces a redundant result set filled with 10000 or 100000 duplicated md5sums, so I guess it's not efficient (EDIT: my concerns are high network bandwidth and memory consumption).
I think using a pessimistic read/write lock (SELECT ... LOCK IN SHARE MODE / FOR UPDATE) would be another option, but it seems overkill. Are there any better approaches?
The join will ensure that the data returned is not affected by any updates that would have occurred between the two separate selects, since they are being executed as a single query.
When you say that md5sum and version are updated, do you mean the child table has a trigger on it for inserts and updates?
When you join the tables, you will get a "duplicate md5sum and version" because you are pulling the matching record for each item in the DataEntry table. It is perfectly fine and isn't going to be an efficiency issue. The alternative would be to use the two individual selects, but depending upon the frequency of inserts/updates, without a transaction, you run the very slight risk of getting data that may be slightly off.
I would just go with the join. You can run explain plans on your query from within mysql and look at how the query is executed and see any differences between the two approaches based upon your data and if you have any indexes, etc...
Perhaps it would be more beneficial to copy these groups of records into a staging table of sorts. Before processing, you could call a pre-processor function that takes a "snapshot" of the data about to be processed, putting a copy into a staging table. Then you could select just the version and md5sum alone, and then all of the records, as two different selects. Since these are copied into a separate staging table, you won't have to worry about immediate updates corrupting your processing session. You could set up timed jobs to do this or have it as an on-demand call. Again, though, you would need to research the best approach given the hardware/network setup you are working with and any job-scheduling software you have available.
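A rough sketch of that snapshot idea, reusing the question's table names (the staging table itself and the single-transaction copy are assumptions):
CREATE TABLE DataEntry_staging LIKE DataEntry;
START TRANSACTION;
SELECT md5sum, version FROM DataSet WHERE dataset_id = 1000;
INSERT INTO DataEntry_staging
SELECT * FROM DataEntry WHERE dataset_id = 1000;
COMMIT;
-- later processing reads the snapshot, unaffected by concurrent changes to DataEntry
SELECT dataentry_id, content FROM DataEntry_staging WHERE dataset_id = 1000;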
Use this pattern:
START TRANSACTION;
SELECT ... FOR UPDATE; -- this locks the row
...
UPDATE ...
COMMIT;
(and check for errors after every statement, including COMMIT.)
"100000" is not "huge", but BIGINT is. I recommend INT UNSIGNED instead.
For an MD5, make sure you are not using utf8; use CHAR(32) CHARACTER SET ascii. This goes for any other hex strings.
Or, use BINARY(16) for half the space. Then use UNHEX(md5...) when inserting, and HEX(...) when fetching.
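For example, assuming the column is defined as md5sum BINARY(16) (the column list below is an assumption):
INSERT INTO DataSet (dataset_id, md5sum) VALUES (1000, UNHEX(MD5('some content')));
SELECT dataset_id, HEX(md5sum) FROM DataSet WHERE dataset_id = 1000;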
You are concerned about bandwidth, etc. Please describe your client (PHP? Java? ...). Please explain how much (100K rows?) needs to be fetched to re-do the MD5.
Note that there is an MD5() function in MySQL. If each of your items had an MD5, you could take the MD5 of the concatenation of those -- and do it entirely in the server; no bandwidth needed. (Be sure to increase group_concat_max_len.)
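A sketch of doing it entirely on the server, assuming DataEntry.content is what the per-item MD5s are computed from:
SET SESSION group_concat_max_len = 1024 * 1024 * 8; -- enough room for 100K 32-byte MD5s
SELECT MD5(GROUP_CONCAT(MD5(content) ORDER BY dataentry_id SEPARATOR ''))
FROM DataEntry
WHERE dataset_id = 1000;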
I need to retrieve around 60,000+ MySQL records from a partner's server and save them to my database. My script needs to do this 3 times a day (60K+ x 3).
Which one is better and faster?
DELETE ALL Records from my DB -> Retrieve Records from Partner DB -> Insert Records to my DB
OR
Retrieve records from partner DB -> Update my DB records (if exist) / INSERT (if not exist)
NOTE: if I UPDATE, I need to update all the fields of the record.
In my opinion, the second approach will be faster than the first one, because if a record already exists it will be skipped instead of being inserted again.
The two operation sequences you've proposed are NOT equivalent.
The second operation sequence does NOT delete the rows which were removed from the partner DB, while the first sequence does delete them.
MySQL provides the REPLACE statement, which does effectively the same as your second sequence and will probably be the fastest option. Benchmark your code to be sure.
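A sketch of what that could look like (the table and column names are made up); note that REPLACE requires a PRIMARY KEY or UNIQUE index to detect existing rows, and it works by deleting the old row and inserting the new one:
REPLACE INTO my_records (id, field1, field2)
VALUES (1, 'foo', 10),
       (2, 'bar', 20);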
Definitely the 2nd one
Retrieve records from partner DB -> Update my DB records (if exist) / INSERT (if not exist)
Deleting is a costly operation, especially when you have 60k+ records, and considering that the schema stays the same over time, only the values change.
Also, consider it from this point of view: if updating, not all values may need to be updated, only some of them, so this is comparatively cheaper than deleting and then re-writing values that might even be the same as the ones you just deleted! :)
Don't just consider it from the deleting point of view; also consider that you have to update the DB too. What would you prefer: always rewriting 60k+ records, or fewer than that?
I need to update about 100,000 records in a MySQL table (with indexes), so this process can take a long time. I'm searching for the solution that will work fastest.
I have three solutions, but I have no time for speed tests.
Solutions:
the usual UPDATE for each new record in an array loop (bad performance)
using UPDATE syntax like in Update multiple rows with one query? - I can't find any performance results
using LOAD DATA INFILE with the same value for the key field; I guess in this case it will perform an UPDATE instead of an INSERT - I guess this should work faster than the others
Do you know which solution is best?
The one important criterion is execution speed.
Thanks.
LOAD DATA INFILE is the fastest way to upsert a large amount of data from a file (see the sketch at the end of this answer);
the second solution is not as bad as you might think, especially if you can execute something like
UPDATE my_table
SET field = new_value
WHERE id IN (list_of_ids);
but it would be better to post your update query.
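A sketch of the LOAD DATA INFILE route mentioned in the question, assuming a CSV file and a table whose first column is the primary key (the file name, table and column names are made up); the REPLACE keyword makes it overwrite existing rows instead of failing on duplicate keys:
LOAD DATA INFILE '/tmp/updates.csv'
REPLACE INTO TABLE my_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(id, field1, field2);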
How can I undo the most recently executed mysql query?
If you define the table type as InnoDB, you can use transactions. You will need to SET AUTOCOMMIT=0, and then you can issue COMMIT or ROLLBACK at the end of a query or session to commit or cancel the transaction.
ROLLBACK -- will undo the changes that you have made
You can only do so during a transaction.
BEGIN;
INSERT INTO xxx ...;
DELETE FROM ...;
Then you can either:
COMMIT; -- will confirm your changes
Or
ROLLBACK; -- will undo your previous changes
Basically: If you're doing a transaction just do a rollback. Otherwise, you can't "undo" a MySQL query.
For some statements, like ALTER TABLE, this is not possible with MySQL, even with transactions (1 and 2).
You can stop a query that is currently being processed like this:
Find the Id of the query process by => show processlist;
Then => kill id;
In case you need to undo more than just your last query (although your question actually only asks about that, I know), and a transaction therefore might not help you out, you need to implement a workaround:
Copy the original data before committing your query, and write it back on demand based on a unique id that must be the same in both tables: your rollback table (with the copies of the unchanged data) and your actual table (containing the data that should be "undone" later).
For databases with many tables, a single "rollback table" containing structured dumps/copies of the original data would be better than one per actual table. It would contain the name of the actual table, the unique id of the row, and, in a third field, the content in any format that represents the data structure and values clearly (e.g. XML). Based on the first two fields, this third one would be parsed and written back to the actual table. A fourth field with a timestamp would help in cleaning up this rollback table.
Since there is no real undo in SQL dialects apart from ROLLBACK within a transaction (please correct me if I'm wrong - maybe there is one by now), this is, I guess, the only way, and you have to write the code for it yourself.
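A minimal sketch of such a rollback table (all names and types here are assumptions):
CREATE TABLE rollback_log (
    log_id     BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    table_name VARCHAR(64) NOT NULL,       -- which table the copy came from
    row_id     BIGINT UNSIGNED NOT NULL,   -- unique id of the copied row
    row_data   TEXT NOT NULL,              -- serialized copy of the row, e.g. XML
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP -- helps with cleanup
);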