I'm in a situation where an entire column in a table (used for user tokens) needs to be wiped, i.e., all user tokens are reset simultaneously. There are two ways of going about it: reset each user's token individually with a separate UPDATE query; or make one big query that affects all rows.
The advantage of one big query is that it will obviously be much faster, but I'm worried about the implications of a large UPDATE query when the database is big. Will requests that occur during the query be affected?
Afraid it's not that simple. Even if you enable dirty reads, running one big update has a lot of drawbacks:
A long-running transaction that updates one column will effectively block other INSERT, UPDATE and DELETE transactions against the table.
A long-running transaction puts an enormous load on the disk, because the server has to log everything that takes place so that the huge transaction can be rolled back.
If the transaction fails, you have to rerun it from the start; it is not restartable.
So if the "simultaneously" requirement can be interpreted as "in one batch that may take a while to run", I would opt for batching it. A good research write-up on the performance of DELETEs in MySQL is here: http://mysql.rjweb.org/doc.php/deletebig, and I think most of its findings are applicable to UPDATE.
The trick will be finding the optimal "batch size".
An added benefit of batching is that you can make the process resilient to failures and restart-friendly.
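For illustration, here is a minimal sketch of the batched variant, assuming a users table with a token column (both names are placeholders) and that a wiped token is stored as NULL:

-- Each run is a short, self-contained transaction; repeat until it affects 0 rows.
UPDATE users
SET token = NULL
WHERE token IS NOT NULL
LIMIT 10000;

Driving the loop from application code and checking the affected-row count after each statement makes the process both restartable and easy to throttle.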
The answer depends on the transaction and isolation level you've established.
You can set isolation to allow "dirty reads", "phantom reads", or force serialization of reads and writes.
However you do that UPDATE, you'll want it to be a single unit of work.
I'd recommend minimizing network latency and updating all the user tokens in one network roundtrip. This means either writing a single query or batching many into one request.
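As a rough sketch of what that could look like (READ COMMITTED is only an example level, and the table/column names are placeholders):

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
UPDATE users SET token = NULL;   -- one statement, one round trip, one unit of work
COMMIT;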
There are a few large tables in one of the databases of a customer (each table is ~50M rows and not too wide). The intent is to infrequently read these tables completely. As there are no reasonable CDC indices present, the plan is to read the tables by querying them:
SELECT * from large_table;
The reads will be performed using a JDBC driver. With the following fetch configuration, the intent is to read the data approximately one record at a time (it may take a significant amount of time) so that the client code is never overwhelmed.
// setFetchSize(Integer.MIN_VALUE) is the MySQL Connector/J signal to stream rows
// one at a time instead of buffering the whole result set in client memory.
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
I was going through the execution path of a query in High Performance MySQL, but some questions seemed unanswered:
With no temp tables being explicitly created and no use being made of the query cache, how are the streamed reads tracked on the server?
Is any temporary data created (in main memory or files on disk) whatsoever? If so, where is it created and how much?
If temporary data is not created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? In case there are several such queries running on the server, are the earliest "Tracked" files purged in favor of queries submitted recently?
PS: I want to understand the effect of this approach on the MySQL server (I'm not saying there aren't better ways of reading the tables).
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the details of how to specify that in JDBC.
You may want to page through the table. If so, don't use OFFSET, but use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
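As a rough sketch of that keyset-pagination pattern, assuming an AUTO_INCREMENT primary key named id (a placeholder):

-- first page
SELECT * FROM large_table ORDER BY id LIMIT 1000;
-- each subsequent page: bind the largest id seen on the previous page
SELECT * FROM large_table WHERE id > ? ORDER BY id LIMIT 1000;
-- stop when a page returns fewer than 1000 rows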
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (eg power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.
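To make that timestamp analogy concrete, here is a rough two-session sketch under the default REPEATABLE READ level (accounts and balance are placeholder names):

-- Session A
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;   -- snapshot established; returns, say, 10

-- Session B (another connection, autocommit on)
UPDATE accounts SET balance = 20 WHERE id = 1;   -- creates a newer row version

-- Session A, still inside its transaction
SELECT balance FROM accounts WHERE id = 1;   -- still returns 10: the newer version is
                                             -- "younger" than A's snapshot, so invisible
COMMIT;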
I have a server that sends up to 20 UPDATE statements to a separate MySQL server every 3-5 seconds for a game. My question is: is it faster to concatenate them together (UPDATE; UPDATE; UPDATE)? Is it faster to run them in a transaction and then commit the transaction? Or is it faster to just execute each UPDATE individually?
Any insight would be appreciated!
It sort of depends on how the server connects. If the connection between the servers is persistent, you probably won't see a great deal of difference between concatenated statements or multiple separate statements.
However, if the execution involves establishing the connection, executing the SQL statement, then tearing down the connection, you will save a lot of resources on the database server by executing multiple statements at a time. The process of establishing the connection tends to be an expensive and time-consuming one, and has the added overhead of DNS resolution since the machines are separate.
It makes the most logical sense to me to establish the connection, begin a transaction, execute the statements individually, commit the transaction and disconnect from the database server. Whether you send all the UPDATE statements as a single concatenation or multiple individual statements is probably not going to make a big difference in this scenario, especially if this just involves regular communication between these two servers and you need not expect it to scale up with user load, for example.
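For example, one burst could look roughly like this (players, score and player_id are made-up names for whatever game state is being updated):

START TRANSACTION;
UPDATE players SET score = score + 10 WHERE player_id = 17;
UPDATE players SET score = score + 3  WHERE player_id = 42;
-- ... the remaining UPDATEs of this burst ...
COMMIT;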
The use of the transaction assumes that your 3-5 second periodic bursts of UPDATE statements are logically related somehow. If they are not interdependent, then you could skip the transaction, saving some resources.
As with any question regarding performance, the best answer is: if your current system is meeting your performance and scaling needs, you ought not pay too much attention to micro-optimizing it just yet.
It is always faster to wrap these UPDATEs in a single transaction block.
The price for this is that if anything fails inside that block, it is as if nothing happened at all, and you will have to repeat the work.
Also, keep in mind that transactions in MySQL only work when using the InnoDB engine.
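If you're not sure which engine your tables use, a quick check along these lines will tell you ('your_db' is a placeholder):

SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_db';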
As the question says, is there ever a reason to wrap read-only sql statements in a transaction? Obviously updates require transactions.
You still need a read-lock on the objects you operate on. You want to have consistent reads, so writing the same records shouldn't be possible while you're reading them...
If you issue several SELECT statements in a single transaction, you will also produce several read-locks.
SQL Server has some good documentation on this (the "read-lock" is called a shared lock there):
http://msdn.microsoft.com/en-us/library/aa213039%28v=sql.80%29.aspx
I'm sure MySQL works in a similar way.
Yes, if it's important that the data is consistent across the SELECT statements being run. For instance, if you were getting the balances of several bank accounts for a user, you wouldn't want the values read to be inconsistent. E.g., if this happened:
With balance values B1 = 10 and B2 = 20:
1. Your code reads B1 = 10.
2. Transaction TA1 starts on another DB client.
3. TA1 writes B1 = 20 and B2 = 10.
4. TA1 commits.
5. Your code reads B2 = 10.
So you now think that B1 is 10 and B2 is 10, which could be displayed to the user and would suggest that $10 has disappeared!
Transactions for reading will prevent this, since we would read B2 as 20 in step 5 (assuming a multiversion concurrency control DB, which MySQL + InnoDB is).
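A minimal sketch of the reads wrapped in a transaction (accounts, balance and account_id are placeholder names):

START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT balance FROM accounts WHERE account_id = 'B1';
SELECT balance FROM accounts WHERE account_id = 'B2';  -- sees the same snapshot as the first read
COMMIT;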
MySQL 5.1 with the InnoDB engine has a default transaction isolation level of REPEATABLE READ. So if you perform your SELECTs inside a transaction, no dirty reads or non-repeatable reads can happen. That means that even with transactions committing between two of your queries, you'll always see a consistent database. In theory, under REPEATABLE READ you could still fear phantom reads, but with InnoDB this cannot even occur. So by simply opening a transaction you can assume database consistency (coherence) and perform as many SELECTs as you want without fearing parallel write transactions that run and commit in the meantime.
Do you have any interest in having such a strong consistency constraint? Well, it depends on what you're doing with your queries. Having inconsistent reads means that if one of your queries is based on a result from a previous one, you may have problems:
if you're performing only one query, you do not care at all
if none of your queries assumes a result from a previous one, you do not need to care
if you never re-read a record in the same session, same thing
if you always read the dependencies of your main record in the same query and do not use lazy loading, no problem
if a small inconsistency between your first and last query will not break your code, then forget about it. But be careful: this can produce an application bug that is very hard to debug (and hard to reproduce). So write robust application code, something that can handle database errors and fail gracefully (or not even crash) when this occurs (twice a year?)
if you show critical data (I mean bank accounts, not blogs or chats), then you should probably care about it
if you have a lot of write operations, you increase the risk of inconsistent reads, and you may need to add transactions at least at some key points
you may need to test the impact on performance: having all read requests in transactions while several write transactions are really altering the data will certainly slow the engine down, since it needs to keep several versions of the data. So check that the impact is not too big for your application
I have a large quantity of data in a production database that I want to update with batches of data while the data in the table is still available for end user use. The updates could be insertions of new rows or updates of existing rows. The specific table is approximately 50M rows, and the updates will be between 100k and 1M rows per "batch". What I would like to do is run the inserts/replaces with a low priority. In other words, I want the database to slowly work through the batch import without impacting the performance of other queries hitting the same disk spindles concurrently. To complicate this, the data being updated is heavily indexed: 8 B-tree indexes across multiple columns to facilitate various lookups, which adds quite a bit of overhead to the import.
I've thought about batching the inserts down into 1-2k record blocks and having the external script that loads the data pause for a couple of seconds between each insert, but that's really kind of hokey IMHO. Plus, during a 1M record batch, I really don't want to add 500-1000 two-second pauses, tacking on 20-40 minutes of extra load time, if it's not needed. Anyone have ideas on a better way to do this?
I've dealt with a similar scenario using InnoDB and hundreds of millions of rows. Batching with a throttling mechanism is the way to go if you want to minimize risk to end users. I'd experiment with different pause times and see what works for you. With small batches you have the benefit that you can adjust accordingly. You might find that you don't need any pause if you run this all sequentially. If your end users are using more connections then they'll naturally get more resources.
If you're using MyISAM there's a LOW_PRIORITY option for UPDATE. If you're using InnoDB with replication, be sure to check that it's not getting too far behind because of the extra load. Apparently replication runs in a single thread, and that turned out to be the bottleneck for us. Consequently, we programmed our throttling mechanism to check how far behind replication was and pause as needed.
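A sketch of that kind of check between batches (the 30-second threshold is an arbitrary placeholder):

SHOW SLAVE STATUS;
-- inspect the Seconds_Behind_Master column; if it is above ~30,
-- sleep and re-check before sending the next batch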
An INSERT DELAYED might be what you need. From the linked documentation:
Each time that delayed_insert_limit rows are written, the handler checks whether any SELECT statements are still pending. If so, it permits these to execute before continuing.
Check this link: http://dev.mysql.com/doc/refman/5.0/en/server-status-variables.html. What I would do is write a script that executes your batch updates only when MySQL is showing Threads_running or Connections under a certain number. Hopefully you have some sort of test server where you can determine what a good threshold might be for either of those server variables. There are plenty of other server status variables to look at in there as well. Maybe control the executions by the Innodb_data_pending_writes number? Let us know what works for you; it's an interesting question!
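A sketch of that gating check (the threshold of 20 is an arbitrary placeholder you would calibrate on a test server):

SHOW GLOBAL STATUS LIKE 'Threads_running';
-- if the value is below ~20, send the next batch of updates;
-- otherwise sleep a few seconds and check again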
Once in a while, I need to perform a massive update to a very large table. If users continue hitting the Web site while the update is being run, there will be a line-up of MySQL clients.
It appears that the longer the line-up, the slower the main operation gets (i.e. it updates fewer rows per unit time). Killing those processes can speed things up, but they're bound to come back.
Is there a way to address this (other than by bringing the site down)? I don't mind the users waiting a few minutes, but once the line-up has reached a certain size, the operation never completes.
This applies to UPDATE statements as well as to statements that result in a temporary table being created (e.g. ALTER TABLE).
The waiting connections take up memory, and they're queueing up lock requests on your large and busy table. Eventually you're going to exhaust your maximum DB connections or one of your memory pools due to the number of connections held open. If I had to guess, I'd guess that your slowdown is due to memory exhaustion and the resultant swap-thrashing.
If the update you're doing doesn't require consistency between rows in the large table, you can try lowering the isolation level of the update transaction using SET TRANSACTION ISOLATION LEVEL. This will greatly decrease the amount of locking and work that MySQL normally does to provide each client "repeatable reads" on a table that is being updated and read concurrently. You could also try partitioning your large table and running one update per partition, or otherwise breaking the update operation into multiple pieces so that the table isn't locked for a long stretch at any one time.
If you do require consistency to be maintained between the rows, i.e. the whole table has to go from state X to X' in a single transaction with no intermediate states ever being visible, you're not going to be able to use the above techniques. You might try cloning the table, doing the update on the new table, then renaming the old table out of the way and renaming the new table into its place. Since it's a large table, this may require a significant increase in the runtime and storage needed for the operation. There are also caveats for doing this when triggers and constraints are present. The benefit is that you avoid holding a write lock on the table being updated, except during the relatively fast rename operations. Your users will only be delayed during that small swap window, and this will likely not take so long as to cause the major slowdowns you've experienced.
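A rough sketch of the clone-and-swap approach (big_table and country_code are placeholders, and writes arriving during the copy, triggers and constraints all need separate handling, as noted above):

CREATE TABLE big_table_new LIKE big_table;
INSERT INTO big_table_new SELECT * FROM big_table;            -- bulk copy
UPDATE big_table_new SET country_code = UPPER(country_code);  -- the hypothetical mass update
-- atomic swap: readers are only blocked for the duration of the rename
RENAME TABLE big_table TO big_table_old,
             big_table_new TO big_table;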