Let's say that in MySQL I want to update a column in one of the tables. I need to SELECT the record, change the value, and then UPDATE it back to the database. In some cases I can't do these 2 operations in one SQL query by nesting them into a subquery (due to a MySQL limitation), so I have to load the value into a program (let's say Java), change it there, and then write it back to the database.
For example, program A gets a column's value, wants to increase it by one, and then put it back. At the same time, program B wants to do the same thing. Before program A puts back the increased value, program B has already read the stale value (program B is supposed to see the value after program A's increment, but because it runs at the same time as program A, it retrieves the same value as A did).
Now my question is: what are good ways to handle this kind of problem?
My other question is: I believe MySQL is not a single-threaded system, but let's say two identical queries (updating the same table, same column, same record) come in at the same time. How does MySQL handle this situation? Which one will it schedule first and which one later?
Moreover, could anyone explain a bit how MySQL supports multithreading? Is it one thread per connection? Are all the statements issued on that connection scheduled in the same queue?
If you're using InnoDB, you can use transactions to provide fine-grained mutual exclusion.
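For example, here is a minimal sketch of the read-modify-write case, assuming a hypothetical counters table with name and value columns:

START TRANSACTION;
-- lock the row; a concurrent transaction doing the same SELECT ... FOR UPDATE blocks here
SELECT value FROM counters WHERE name = 'hits' FOR UPDATE;
-- the application (e.g. Java) computes the new value from what it just read
UPDATE counters SET value = 43 WHERE name = 'hits';  -- 43 stands for the computed value
COMMIT;

When the change can be expressed in SQL, a single atomic statement such as UPDATE counters SET value = value + 1 avoids the read-modify-write round trip entirely.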
If you're using MyISAM, you can use LOCK TABLE to prevent B from accessing the table until A finishes making its changes.
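A rough sketch of that approach (tbl and its columns are placeholders):

LOCK TABLES tbl WRITE;
-- other sessions' reads and writes on tbl wait here until UNLOCK TABLES
SELECT value FROM tbl WHERE id = 1;
-- ... the application changes the value ...
UPDATE tbl SET value = 43 WHERE id = 1;
UNLOCK TABLES;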
If two clients try to update the same field at the same time, it's unpredictable which one will win the race. The database has internal mutual exclusion to serialize the two queries, but the specific order is essentially random.
This is mostly a theoretical question, and it's mostly about MySQL.
Can I write a single query that will give me the number of records inserted between the time the query started to run and the time it ended, assuming the table has no timestamps etc., so this info cannot be inferred from the data in the table?
I tried this (and maybe it will clarify the above):
select -(count(*) - (sleep(300) + count(*))) from my_table;
But it doesn't seem to do the job.
I know I can write a stored procedure to do it, but I'm just curious if there's a way to do it in a single query, without writing a new function/stored procedure.
No, you really cannot, at least in theory. Databases support the ACID properties of transactions. The "I" in ACID stands for isolation, which specifically means that two queries do not interfere with each other. In other words, a query should not see inserts that happen after the query begins.
In practice, depending on settings, SELECT does not necessarily behave as its own transaction. However, it only sees the database as it is at any given instant, rather than knowing when particular changes occur.
There are proper ways to accomplish what you want. One simple method is to completely lock the table for the SELECT (in MySQL you can do that with the FOR UPDATE directive). The query still will not be able to count the number of new rows. But it will know the answer anyway: 0.
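As a sketch of the locking idea (here using a table-level read lock rather than FOR UPDATE, on the my_table from the question), inserts from other sessions simply wait until the lock is released:

LOCK TABLES my_table READ;
-- concurrent INSERTs block here, so no rows can be added while we count
SELECT COUNT(*) FROM my_table;
UNLOCK TABLES;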
First, sorry if this question has already been answered; I searched both here and on Google and couldn't find an answer. This question has surely been asked before, but it is hidden pretty deep under all the "Just use LEFT JOIN" and "store it in an array" answers.
I need to load a lot of data spread across multiple tables (then insert it into another database engine, but that's not important, I need to optimize my SELECTs).
The table layout looks like this:
Table A with an a_id field
Table B with a_id and b_id fields
Table C with b_id and c_id fields
... (goes another 3-4 levels like this).
I currently access the data this way (pseudo code):
query1 = SELECT ... FROM TableA WHERE something=$something
foreach query1 as result1:
    query2 = SELECT ... FROM TableB WHERE a_id=result1.a_id
    foreach query2 as result2:
        query3 = SELECT ... FROM TableC WHERE b_id=result2.b_id
        foreach query3 as result3:
            // Another few levels of this, see the millions of SELECTs coming?
The only solutions I have found so far are:
Use the slow way and send multiple queries (current solution, and it takes ages to complete my small test set)
Use a ton of LEFT JOINs to get all the data in one query. That involves transmitting a ton of duplicated data thousands of times, plus some fancy logic on the client side to split it back into the appropriate tables, since each row will contain the contents of its parent tables. (I use OOP and each table maps to an object, and the objects contain each other.)
Store each object from Table A in an array, then load all of Table B into an array, and continue with Table C. Works for small sets, but mine is a few GB large and won't fit into RAM at all.
Is there a way to avoid doing 10k queries per second in such a loop?
(I'm using PHP, converting from MySQL to MongoDB, which handles nested objects like these much better, if this helps)
EDIT: There seems to be some confusion about what I'm trying to do and why, so I will try to explain better: I need to do a one-time batch conversion to a new structure. The new structure works very well, don't even bother looking at that. I'm remaking a very old website from scratch, and chose MongoDB as my storage engine because we have loads of nested data like this and it works very well for me. Switching back to MySQL is not even an option; the new structure and code are already well established and I've been working on this for about a year now. I am not looking for a way to optimize the current schema, I can't change it. The data is the way it is, and I need to read the whole database. Once. Then I'm done with it.
All I need to do is import the data from the old website, process it and convert it so I can insert it into our new website. Here is where MySQL comes in: the old site was a very normal PHP/MySQL site. We have a lot of tables (about 70, actually). We don't have many users, but each user has a ton of data spread across 7 tables.
What I currently do is loop over each user (1 query). For each of these users (70k), I load Table A, which contains 10-80 rows per user. I then query Table B on every iteration over A (so, 10-80 times 70k), and it contains 1-16 rows for each A row. Then comes Table C, which holds 1-4 rows for each B row. We are now at 4*80*70k queries. Then I have D with 1-32 rows for each C, E with 1-16 rows for each D, and F with 1-16 rows for each E. Table F has a couple of million rows.
Problem is
I end up doing thousands if not millions of queries to the MySQL server, and the production database is not even on my local machine but 5-10ms away. Even at 0.01ms, I have hours just in network latency. I created a local replica so my restricted test set runs quite a bit faster, but it's still going to take a long while to download a few GB of data like this.
I could keep the members table in RAM, and maybe Table A, so I could download each table in one shot instead of doing thousands of queries, but once at Table B and beyond it would be a real mess to track this in memory, especially since I use PHP (command line, at least), which uses a bit more memory than a C++ program where I could have tight RAM control. So this solution doesn't work either.
I could JOIN all the tables together, but while that works for 2-3 tables, doing it for 7 tables would waste a huge amount of bandwidth, transferring the same parent data from the server millions of times for no benefit (while also making the code really complicated to split the rows back into the appropriate objects).
Question is: Is there a way to not query the database so often? For example, telling the MySQL server, via a procedure or something, that I will need all these data sets in this order, so I don't have to re-run a query for each row and the database just continually streams data to me? Right now the problem is simply that I issue so many queries that both the script AND the database are almost idle, because each is always waiting for the other. The queries themselves are actually very fast: basic prepared SELECT queries on indexed int fields.
This is a problem I have always run into with MySQL in the past, and it never really caused me trouble until now. In its current state, the script takes several hours if not days to complete. It's not THAT bad, but if there's a way I can do better I'd appreciate knowing about it. If not, then okay, I'll just wait for it to finish; at worst it will run 3-4 times (2-3 test runs, have users check that their data is converted correctly, fix bugs, try again, and then the final run with the last bug fixes).
Thanks in advance!
Assuming your 7 tables are linked by ids, do something like this
First query
'SELECT * FROM table_a WHERE a_id IN (12,233,4545,67676,898999)'
// store the result in $result_of_first_query
Then do a foreach and collect the ids you want to use in the next query into a comma-separated string (CSV)
$csv_for_second_query = ''; // initialize so we don't get an undefined-variable notice
foreach($result_of_first_query as $a_row_from_first_table)
{
    $csv_for_second_query = $csv_for_second_query.$a_row_from_first_table['a_id'].",";
}
$csv_for_second_query = trim($csv_for_second_query,", "); // problem is we will have a lot of duplicate entries
$temp_arr = array(); // so lets remove the duplicates
$temp_arr = explode(",",$csv_for_second_query); // explode values in array
$temp_arr = array_unique($temp_arr); // remove duplicates
$csv_for_second_query = implode(",",$temp_arr); // create csv string again. ready!
Now, for your second table, you will get with only 1 query all the values you need to JOIN (not by MySQL, we will do this with PHP)
Second query
'SELECT * FROM table_b where a_id IN ('.$csv_for_second_query.')'
// store the result in $result_of_second_query;
Then we just need to programmatically join the two arrays.
$result_a_and_b = array(); // we will store the joined result of every row here
// let's scan every row from the first table
foreach($result_of_first_query as $inc => $a_row_from_first_table)
{
    // assign every row from the first table to result_a_and_b
    $result_a_and_b[$inc]['a'] = $a_row_from_first_table;
    $inc_b = 0; // counter for the joins that will happen with data from the second table
    // for every row from the first table we will scan every row from the second table
    // so we need this nested foreach
    foreach($result_of_second_query as $a_row_from_second_table)
    {
        // do these rows need to be joined? if yes then do so! :)
        if($a_row_from_first_table['a_id'] == $a_row_from_second_table['a_id'])
        {
            $result_a_and_b[$inc]['b'][$inc_b] = $a_row_from_second_table; // "join" in our "own" way :)
            ++$inc_b; // needed for the next join
        }
    }
}
now we have the array $result_a_and_b with this format:
$result_a_and_b[INDEX]['a']
$result_a_and_b[INDEX]['b'][INDEX]
so with just 2 queries we get a result that would otherwise need TABLE_A_ROWS_NUMBER + 1 queries (one per row of the first table, plus the initial query)
Keep doing this for as many levels as you want:
query the database with the id that links the tables
collect the ids into a CSV string
do the query on the next table using WHERE id IN (11,22,33,44,55,.....)
join programmatically
Tip: You can use unset() to free up memory on temp variables.
I believe this answers your question "Is there a way to not query the database so often?"
Note: the code is not checked for typos, maybe I missed a comma or two (or maybe not).
I believe you get the point :) Hope it helps!
Thanks everyone for the answers. I came to the conclusion that I can't actually do it any other way.
My own solution is to set up a replica database (or just a copy if a snapshot is enough) on localhost. That way, it cuts down the network latency and allows both the script and the database to reach 100% CPU usage, and it seems to be the fastest I can get without reorganizing my script entirely.
Of course, this only works for one-time scripts. The correct way to handle this would be a mix of the two answers I've received so far: use multiple unbuffered connections in threads, and process by batch (load 50 rows from Table A, store them in RAM, load all the related data from Table B, store it in RAM, then process all of that and continue with the next batch from Table A).
Thanks anyway for the answers all!
I have 5+ simultaneous processes selecting rows from the same MySQL table. Each process SELECTs 100 rows, PROCESSes them, and DELETEs the selected rows.
But I'm getting the same row selected and processed 2 times or more.
How can I avoid this from happening, on the MySQL side or the Ruby on Rails side?
The app is built on Ruby On Rails...
Your table appears to be a workflow, which means you should have a field indicating the state of the row ("claimed", in your case). The other processes should be selecting for unclaimed rows, which will prevent the processes from stepping on each others' rows.
If you want to take it a step further, you can use process identifiers so that you know what is working on what, and maybe how long is too long to be working, and whether it's finished, etc.
And yeah, go back to your old questions and approve some answers. I saw at least one that you definitely missed.
Eric's answer is good, but I think I should elaborate a little...
You add some additional columns to your table, say:
lockhost VARCHAR(60),
lockpid INT,
locktime INT, -- Or your favourite timestamp.
Default them all to NULL.
Then you have the worker processes "claim" the rows by doing:
UPDATE tbl SET lockhost='myhostname', lockpid=12345,
locktime=UNIX_TIMESTAMP() WHERE lockhost IS NULL ORDER BY id
LIMIT 100
Then you process the claimed rows with SELECT ... WHERE lockhost='myhostname' and lockpid=12345
After you finish processing a row, you make whatever updates are necessary, and set lockhost, lockpid and locktime back to NULL (or delete it).
This stops the same row being processed by more than one process at once. You need the hostname, because you might have several hosts doing processing.
If a process crashes while it is processing a batch, you can check if the "locktime" column is very old (much older than processing can possibly take, say several hours). Then you can just reclaim some rows which have an old "locktime" even though their lockhost is not null.
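A sketch of that reclaim step (the six-hour cutoff is an arbitrary example):

-- release claims that are much older than any plausible processing time
UPDATE tbl SET lockhost = NULL, lockpid = NULL, locktime = NULL
  WHERE lockhost IS NOT NULL AND locktime < UNIX_TIMESTAMP() - 6*60*60;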
This is a pretty common "queue pattern" in databases; it is not extremely efficient. If you have a very high rate of items entering / leaving the queue, consider using a proper queue server instead.
http://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html
should do it for you
I am working with an application which has 3 tables, each with more than 10 million records and larger than 2GB.
Every time data is inserted there's at least one record added to each of the three tables and possibly more.
After every INSERT, a script is launched which queries all these tables in order to extract data relevant to the last INSERT (let's call this the aggregation script).
What is the best way to divide the DB in smaller units and across different servers so that the load for each server is manageable?
Notes:
1. There are in excess of 10 inserts per second and hence the aggregation script is run the same number of times.
2. The aggregation script is resource intensive
3. The aggregation script has to be run on all the data in order to find which one is relevant to the last insert
4. I have not found a way of somehow dividing the DB into smaller units
5. I know very little about distributed DBs, so please use very basic terminology and provide links for further reading if possible
There are two answers to this from a database point of view.
Find a way of breaking up the database into smaller units. This is very dependent on the use of your database. This is really your best bet because it's the only way to get the database to look at less stuff at once. This is called sharding:
http://en.wikipedia.org/wiki/Shard_(database_architecture)
Have multiple "slave" databases in read only mode. These are basically copies of your database (with a little lag). For any read only queries where that lag is acceptable, they access these databases across the code in your entire site. This will take some load off of the master database you are querying. But, it will still be resource intensive on any particular query.
From a programming perspective, you already have nearly all your information (aside from ids). You could try to find some way of using that information for all your needs rather than having to requery the database after insert. You could have some process that only creates ids that you query first. Imagine you have tables A, B, C. You would have other tables that only have primary keys that are A_ids, B_ids, C_ids. Step one, get new ids from the id tables. Step two, insert into A, B, C and do whatever else you want to do at the same time.
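A rough sketch of that id-table idea, with made-up table and column names:

-- step one: reserve an id from a tiny table that only hands out ids
CREATE TABLE A_ids (id BIGINT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO A_ids VALUES (NULL);
SELECT LAST_INSERT_ID();   -- the reserved id, scoped to this connection
-- step two: insert into the real tables using the reserved id
INSERT INTO A (id, payload) VALUES (42, '...');   -- 42 stands for the id reserved above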
Also, the general efficiency/performance of all queries should be reviewed. Make sure you have indexes on anything you are querying. Run EXPLAIN on all the queries you are running to make sure they are using indexes.
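For example (the query is just an illustration):

EXPLAIN SELECT * FROM A WHERE some_indexed_column = 12345;
-- the "key" column of the output should name an index, not be NULL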
This is really a midlevel/senior dba type of thing to do. Ask around your company and have them lend you a hand and teach you.
I use a table with one row to keep the last used ID (I have my reasons not to use auto_increment). My app will run on a server farm, so I wonder how I can update the last inserted ID (i.e. increment it) and select the new ID in one step, to avoid thread-safety problems (a race condition between servers in the server farm).
You're going to use a server farm for the database? That doesn't sound "right".
You may want to consider using GUIDs for IDs. They may be big, but they don't have duplicates.
With a single "next id" value you will run into locking contention for that record. What I've done in the past is use a table of ranges of id's (RangeId, RangeFrom, RangeTo). The range table has a primary key of "RangeId" that is a simple number (eg. 1 to 100). The "get next id" routine picks a random number from 1 to 100, gets the first range record with an id lower than the random number. This spreads the locks out across N records. You can use 10's, 100's or 1000's of range records. When a range is fully consumed just delete the range record.
If you're really using multiple databases then you can manually ensure each database's set of range records do not overlap.
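A minimal sketch of such a range table, with illustrative names and 100 range rows:

CREATE TABLE id_ranges (
  range_id   INT PRIMARY KEY,    -- 1 .. 100
  range_from BIGINT NOT NULL,    -- next unused id in this block
  range_to   BIGINT NOT NULL     -- last id in this block
) ENGINE=InnoDB;

SET @r = FLOOR(1 + RAND() * 100);   -- pick a random block number
START TRANSACTION;
-- lock only the one chosen range row, spreading contention across blocks
SELECT range_id, range_from FROM id_ranges
  WHERE range_id <= @r
  ORDER BY range_id DESC LIMIT 1 FOR UPDATE;
UPDATE id_ranges SET range_from = range_from + 1 WHERE range_id = 37;  -- 37 stands for the range_id returned above
COMMIT;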
You need to make sure that your ID column is only ever accessed under a lock; then only one client can read the highest ID and set the new highest ID.
You can do this in C# using a lock statement around the code that accesses the table, or in your database you can wrap the read and the write in a transaction. I don't know the exact syntax for this in MySQL.
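In MySQL with InnoDB the equivalent is roughly this, assuming the one-row table is called last_id with a single column id:

START TRANSACTION;
SELECT id FROM last_id FOR UPDATE;   -- other writers doing FOR UPDATE block until we COMMIT
UPDATE last_id SET id = id + 1;
SELECT id FROM last_id;              -- the new value, now safe to use
COMMIT;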
Use a transactional database and control transactions manually. That way you can submit multiple queries without risking having something mixed up. Also, you may store the relevant query sets in stored procedures, so you can simply invoke these transactional queries.
If you have problems with performance, increment the ID by 100 and use a thread per "client" server. The thread does the increment and hands each interested party a new ID; this way it only needs to access the DB once per 100 IDs.
If the thread crashes, you'll lose a couple of IDs, but if that doesn't happen all the time, you shouldn't need to worry about it.
AFAIK the only way to get this out of a DB with nicely incrementing numbers is transactional locks at the DB, which is hideous performance-wise. You can get lock-free behaviour using GUIDs, but frankly you're going to run into transaction requirements in every CRUD operation you can think of anyway.
Assuming that your database is configured to run with a transaction isolation of READ_COMMITTED or better, then use one SQL statement that updates the row, setting it to the old value selected from the row plus an increment. With lower levels of transaction isolation you might need to use INSERT combined with SELECT FOR UPDATE.
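A concrete single-statement version of that, using MySQL's LAST_INSERT_ID(expr) trick (again assuming a one-row table last_id with a column id):

UPDATE last_id SET id = LAST_INSERT_ID(id + 1);
SELECT LAST_INSERT_ID();   -- returns the value this connection just set; no race with other connections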
As pointed out [by Aaron Digulla] it is better to allocate blocks of IDs, to reduce the number of queries and table locks.
The application must perform the ID acquisition in a separate transaction from any business logic, otherwise any transaction that needs an ID will end up waiting for every transaction that asks for an ID first to commit/rollback.
This article: http://www.ddj.com/architect/184415770 explains the HIGH-LOW strategy that allows your application to obtain IDs from multiple allocators. Multiple allocators improve concurrency, reliability and scalability.
There is also a long discussion here: http://www.theserverside.com/patterns/thread.tss?thread_id=4228 "HIGH/LOW Singleton+Session Bean Universal Object ID Generator"
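As a rough SQL sketch of the high/low idea (table, column and sequence names are made up; the low counter lives in the application):

CREATE TABLE hi_lo (sequence_name VARCHAR(50) PRIMARY KEY, next_hi BIGINT NOT NULL) ENGINE=InnoDB;

-- grab a whole block of ids in one atomic statement
UPDATE hi_lo SET next_hi = LAST_INSERT_ID(next_hi + 1) WHERE sequence_name = 'orders';
SELECT LAST_INSERT_ID();   -- call the result H
-- the application can now hand out ids (H-1)*BLOCK_SIZE + 1 .. H*BLOCK_SIZE
-- without touching the database again until the block is exhausted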