I am looking for someone more knowledgeable about the MySQL query optimizer / cache.
I have to perform bulk insert operations on a dataset that changes in size (so the number of rows will be different each time). I wanted to know: assuming these queries are parametrized, will that bulk insert query be cached / reused? Basically, I'm wondering whether MySQL is smart enough to see that these inserts are all the same and cache a single insert query (whether it's one insert per row or one insert with multiple rows), and then reuse the optimized, cached query.
To give you a better idea of the amount of data involved: there are around 500-600 batches of bulk inserts, and each bulk insert has an unknown number of rows (anywhere between 1 and 100,000). They are executed one after another.
So will that bulk insert be cached and reused? If not, would splitting the bulk insert into smaller batches (i.e. a single INSERT query with X rows each) change that?
I'm not looking for another way of doing this for now, simply because it isn't possible given how the application is written; I'm limited to choosing between a parametrized and a non-parametrized query, and between one bulk insert and multiple smaller batches. This is then called once for each "object" I have, which, as I said, may be 500 or 600 times (but that I can't change for now).
EDIT
I thought this description might be vague, so here it is as a list:
Step into an iteration of the loop over objects
Identify the data to be inserted for that object
Create a bulk insert query (currently non-parametrized, but I want to change that; see the sketch below)
Execute the bulk query
Go to the next iteration
So what I really can change for now is points 3 and 4.
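For reference, this is roughly what a parametrized multi-row INSERT looks like at the SQL level (a sketch with a made-up table `items` and a fixed batch of three rows; the statement text has to contain one `(?, ?)` group per row, so it changes whenever the batch size changes):
PREPARE bulk_ins FROM 'INSERT INTO items (col_a, col_b) VALUES (?, ?), (?, ?), (?, ?)';
SET @a1 = 1, @b1 = 'x', @a2 = 2, @b2 = 'y', @a3 = 3, @b3 = 'z';
EXECUTE bulk_ins USING @a1, @b1, @a2, @b2, @a3, @b3;
DEALLOCATE PREPARE bulk_ins;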
Related
What is the best way to execute a bulk update of all records in a large table in MySQL?
During the sanitization process, we are updating all rows in the users table, which has 28M rows, to mask a few columns. This is currently taking right around 2 hours to complete in a rake task, and the AWS session expiration is also 2 hours. If the rake task takes longer than the session expiration, the build will fail.
Due to the large number of records, we are updating 25K rows at a time using find_in_batches and then update_all on the results. We throttle between batches by sleeping for 0.1s to avoid high CPU.
So the question is, is there any way we can optimize the bulk update further, or shall we increase the AWS session expiration to 3 hours?
One option could be to batch by id ranges rather than by exact batch sizes: update ids 1-100000, then 100001-200000, and so on. This avoids passing large sets of ids around. As there will be gaps in the ids, each batch would be a different size, but that might not be an issue.
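A rough sketch of one such batch in SQL (the `email` column and the masking expression are made up):
UPDATE users
SET email = CONCAT('masked_', id, '@example.com')
WHERE id BETWEEN 1 AND 100000;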
Thanks for your input.
For updates this large, the overhead of fetching records and instantiating ActiveRecord objects is very significant (and there will also be a slowdown from GC). The fastest way to execute it is to write a raw SQL query that does the update (or to use update_all to construct it, which is very similar but lets you use scopes/joins on relations).
I have a MySQL table that keeps gaining new records every 5 seconds.
The questions are:
Can I run a query on this set of data that may take more than 5 seconds?
If a SELECT statement takes more than 5 seconds, will it affect the scheduled INSERT statements?
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
I'll go over your questions and some of the comments you added later.
Can I run a query on this set of data that may take more than 5 seconds?
Can you? Yes. Should you? It depends. In a MySQL configuration I set up, any query taking longer than 3 seconds was considered slow and logged accordingly. In addition, you need to keep in mind the frequency of the queries you intend to run.
For example, if you try to run a 10 second query every 3 seconds, you can probably see how things won't end well. If you run a 10 second query every few hours or so, then it becomes more tolerable for the system.
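If it helps, the slow-query threshold is just a server setting; a minimal sketch with the 3-second threshold mentioned above:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 3;  -- queries running longer than 3 seconds are written to the slow query log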
That being said, slow queries can often benefit from optimization, such as not scanning the entire table (e.g. searching by primary key) and using the EXPLAIN keyword to get the database's query planner to tell you how it intends to execute the query internally (e.g. is it using PKs, FKs, or indexes, or is it scanning all table rows?).
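A minimal example of using EXPLAIN (the table and column are made up):
EXPLAIN SELECT * FROM readings WHERE sensor_id = 42;
-- in the output, `key` shows which index (if any) is used;
-- `type: ALL` means a full table scan, while `ref` or `range` means an index lookup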
If a SELECT statement takes more than 5 seconds, will it affect the scheduled INSERT statements?
"Affect" in what way? If you mean "prevent the insert from actually inserting until the select has completed", that depends on the storage engine. MyISAM and InnoDB are different, and that includes their locking policies: MyISAM tends to lock entire tables, while InnoDB tends to lock specific rows. InnoDB is also ACID-compliant, which means it can provide certain integrity guarantees. You should read the docs for the details.
What happens when an INSERT statement is invoked while a SELECT is still running? Will the SELECT get the newly inserted records?
Part of "what happens" is determined by how the specific storage engine behaves. Regardless of what happens, the database is designed to answer application queries in a way that's consistent.
As an example, if the select statement were to lock the entire table, then the insert statement would have to wait until the select has completed and the lock has been released, meaning that the app would see the results as they were prior to the insert.
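If you're not sure which storage engine a given table uses, you can check it directly (the table name is hypothetical):
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'readings';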
I understand that locking the database can prevent messing up the SELECT statement.
It can also introduce a potentially unacceptable performance bottleneck, especially if, as you say, the system is inserting lots of rows every 5 seconds, and depending on how frequently you run your queries, how efficiently they have been built, and so on.
What is good practice when I need the data for calculations while that data will be updated within a short period?
My recommendation is to simply accept the fact that the calculations are based on a snapshot of the data at the specific point in time the calculation was requested, and to let the database do its job of ensuring the consistency and integrity of that data. When the app requests data, it should trust that the database has done its best to provide the most up-to-date, consistent information (i.e. not a row where some columns have been updated but others haven't yet).
With new rows coming in at the frequency you mentioned, reasonable users will understand that the results they're seeing are based on data available at the time of request.
All of your questions are related to table locking.
They all depend on how the database is configured.
Read: http://www.mysqltutorial.org/mysql-table-locking/
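For illustration, explicit table locking looks roughly like this (the table name is hypothetical):
LOCK TABLES readings READ;   -- other sessions can still read, but writes block until the lock is released
SELECT COUNT(*) FROM readings;
UNLOCK TABLES;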
Performing a SELECT statement while an INSERT statement is running
If you want to perform a SELECT statement while an INSERT is still executing, you should check by opening a new connection and closing it every time. For example, if I want to insert lots of records and find out via a SELECT whether the last record has been inserted, I have to open and close the connection inside a for or while loop.
# the INSERT runs elsewhere and may take a long time; poll with a SELECT in a while loop,
# opening and closing a fresh connection each time (Python sketch using mysql-connector-python;
# the credentials, table name, and expected_count are placeholders)
import time
import mysql.connector
expected_count = 100000  # placeholder: however many rows you expect to end up with
while True:
    cnx = mysql.connector.connect(user='app_user', password='secret', database='app_db')
    cursor = cnx.cursor()
    cursor.execute("SELECT COUNT(*) FROM records")
    (count,) = cursor.fetchone()
    cursor.close()
    cnx.close()
    if count >= expected_count:  # break the while loop once you get the result
        break
    time.sleep(1)
I'm currently building a system that runs continuous computations and, every 5 seconds, inserts or updates information based on those computations into a few rows in MySQL. I'm now working on running this system on a few different servers at once, with a few agents that each do similar processing and then write to the same set of rows. I already randomize the order in which each agent writes its set of rows, but there are still a lot of deadlocks happening. What's the best/fastest way to get through those deadlocks? Should I just rerun the query each time one happens, use row locks, or something else entirely?
I suggest you try something that won't require more than one client to update your 'few rows.'
For example, you could have each agent that produces results do an INSERT into a staging table that uses the MEMORY storage engine.
Then, every five seconds you can run a MySQL event (a stored procedure within the server) that loops through all the rows in that table, posting their results to your 'few rows' and then deleting them. If it's important for the rows in your staging table to be processed in order, then you can use an AUTO_INCREMENT id field. But it might not be important for them to be in order.
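A rough sketch of the staging table and the event (all names are made up; the event scheduler must be enabled, and the CREATE EVENT body needs a changed DELIMITER in the mysql client):
CREATE TABLE result_staging (
  id        INT AUTO_INCREMENT PRIMARY KEY,
  target_id INT NOT NULL,
  result    DOUBLE NOT NULL
) ENGINE=MEMORY;
CREATE EVENT flush_staging
ON SCHEDULE EVERY 5 SECOND
DO
BEGIN
  -- post the staged results to the 'few rows', then clear the staging table
  UPDATE targets t
  JOIN result_staging s ON s.target_id = t.id
  SET t.result = s.result;
  DELETE FROM result_staging;
END;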
If you want to get fancier and more scalable than that, you'll need a queue management system like Apache ActiveMQ.
I have a MySQL table with 237 million rows. I want to process all of these rows and update them with new values.
I do have sequential IDs, so I could just use a lot of select statements:
where id = '1'
where id = '2'
This is the method mentioned in Sequentially run through a MYSQL table with 1,000,000 records?.
But I'd like to know if there is a faster way using something like a cursor that would be used to sequentially read a big file without needing to load the full set into memory. The way I see it, a cursor would be much faster than running millions of select statements to get the data back in manageable chunks.
Ideally, you get the DBMS to do the work for you: you write the SQL statement so that it runs entirely in the database, without returning data to the application. Apart from anything else, this saves the overhead of 237 million messages to the client and 237 million messages back to the server.
Whether this is feasible depends on the nature of the update:
Can the DBMS determine what the new values should be?
Can you get the necessary data into the database so that the DBMS can determine what the new values should be?
Will every single one of the 237 million rows be changed, or only a subset?
Can the DBMS be used to determine the subset?
Will any of the id values be changed at all?
If the id values will never be changed, then you can arrange to partition the data into manageable subsets, for any flexible definition of 'manageable'.
You may need to consider transaction boundaries; can it all be done in a single transaction without blowing out the logs? If you do operations in subsets rather than as a single atomic transaction, what will you do if your driving process crashes at 197 million rows processed? Or the DBMS crashes at that point? How will you know where to resume operations to complete the processing?
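One way to sketch the chunked, in-database approach is a stored procedure that walks the id ranges (the table, column, and new-value expression are all hypothetical):
DELIMITER //
CREATE PROCEDURE update_in_chunks()
BEGIN
  DECLARE start_id BIGINT DEFAULT 0;
  DECLARE max_id   BIGINT;
  SELECT MAX(id) INTO max_id FROM big_table;
  WHILE start_id < max_id DO
    -- each chunk commits on its own (with autocommit), so a crash loses at most the chunk in flight;
    -- recording start_id somewhere durable would tell you where to resume
    UPDATE big_table
    SET some_col = UPPER(some_col)
    WHERE id > start_id AND id <= start_id + 100000;
    SET start_id = start_id + 100000;
  END WHILE;
END //
DELIMITER ;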
I have about 30,000 records that I need to insert into a MySQL table. I group the data into packages of 1,000 and create multi-row inserts like this:
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
How can I optimize the performance of these inserts? Can I insert more than 1,000 records at a time? Each row contains about 1 KB of data. Thanks.
Try wrapping your bulk insert inside a transaction.
START TRANSACTION;
INSERT INTO `table_name` VALUES (data1), (data2), ..., (data1000);
COMMIT;
That might improve performance. I'm not sure whether MySQL can partially commit a bulk insert, though (if it can't, then this likely won't help much).
Remember that even at 1.5 seconds for 30,000 records of ~1 KB each, you're committing at about 20 MB/s, so you could actually be drive-limited depending on your hardware setup.
My advice, then, would be to investigate an SSD, change your RAID setup, or get faster mechanical drives (there are plenty of online articles on the pros and cons of running a SQL database on an SSD).
You need to check the MySQL server configuration, specifically the buffer sizes and similar settings.
You can also remove any indexes from the table to make the inserts faster, and recreate the indexes once the data is in.
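A sketch of dropping and recreating an index around the load (the index and column names are hypothetical):
ALTER TABLE `table_name` DROP INDEX idx_some_col;
-- ... run all the bulk INSERTs ...
ALTER TABLE `table_name` ADD INDEX idx_some_col (some_col);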
Look here; you should find all you need:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
One insert statement with multiple values, it says, is much faster than multiple insert statements.
Is this a once-off operation?
If so, just generate a single SQL statement per data element and execute them all on the server. 30,000 really shouldn't take very long, and you will have the simplest means of inputting your data.