Producer/consumer pattern via MySQL

I have 2 processes that act as a producer/consumer via a table.
One process only INSERTs into the table, while the other SELECTs new records and, once it has processed them, UPDATEs them to mark them as finished.
This keeps happening constantly.
As far as I can see there is no need for any locking or transactions for this simple interaction. Am I right on this?
Am I overlooking something?

I would say the prime consideration to take into account is a scenario where multiple workers retrieve the same row.
The UPDATE and SELECT operations themselves should be fine, but if you have multiple workers consuming via SELECT on the same table, then you might get two workers simultaneously processing the same row.
If each worker must process a distinct set of rows, you may need locking on the SELECT. Watch out for deadlocks if a significant unit of work is done while the lock is held.

Related

Running a cron to update 1 million records in every hour fails

We have an e-commerce system with more than 1 million users and a total of 4 to 5 million records in the order table. We use the CodeIgniter framework as the back end and MySQL as the database.
Because of this large number of users and purchases, we use cron jobs to update the order details and referral bonus points every hour.
Now the data updates take longer than one hour, so the next batch of updates starts before the previous one has finished, leading to a deadlock and failure of the system.
I'd like to know about the different possible architectural and database scaling options and suggestions to get rid of this situation. We are using only the monolithic architecture to run this application.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would have only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERT. (One INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is tricky, but can be done with IODKU (INSERT ... ON DUPLICATE KEY UPDATE).
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
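To make the batching advice above concrete, here is a minimal sketch of multi-row INSERTs in chunks of 100, one transaction per chunk. It uses Python's built-in sqlite3 module purely so that it runs anywhere. The bonus_points table, its columns, and the insert_in_batches helper are made up for illustration, and against MySQL you would use your own driver and its placeholder style.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=100):
    """Insert (user_id, points) pairs using one multi-row INSERT per chunk.

    Returns the number of INSERT statements issued.
    """
    statements = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Build "VALUES (?, ?), (?, ?), ..." with one group per row in the chunk.
        placeholders = ", ".join(["(?, ?)"] * len(chunk))
        params = [value for row in chunk for value in row]
        with conn:  # commits the chunk on success, rolls it back on error
            conn.execute(
                "INSERT INTO bonus_points (user_id, points) VALUES " + placeholders,
                params,
            )
        statements += 1
    return statements

# Hypothetical table and data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bonus_points (user_id INTEGER, points INTEGER)")
rows = [(i, i % 7) for i in range(2350)]
statements = insert_in_batches(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM bonus_points").fetchone()[0]
```

Keeping each chunk in its own short transaction means a failure (a deadlock, say) costs you one chunk to retry rather than the whole run.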
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect your way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records - are they all active orders, or that's how many you have in total for all time? Obviously just process the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can do updates over multiple records at once. I'm pretty sure there's a proper term for that but I can't recall it. Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?
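As a rough illustration of the multi-record point above, the sketch below updates every matching row with a single UPDATE statement instead of looping over rows in application code. It runs on Python's built-in sqlite3 module as a stand-in for MySQL. The orders columns, the bonus rule (bonus = total * 2), and the award_bonus_set_based helper are all invented for the example.

```python
import sqlite3

# Hypothetical schema and data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, total INTEGER, status TEXT, bonus INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO orders (total, status) VALUES (?, ?)",
    [(100, "paid"), (250, "paid"), (80, "pending")],
)

def award_bonus_set_based(conn):
    """Award bonus points to all paid, not-yet-processed orders in one statement."""
    with conn:  # one transaction for the whole update
        cur = conn.execute(
            "UPDATE orders SET bonus = total * 2 "
            "WHERE status = 'paid' AND bonus = 0"
        )
    return cur.rowcount  # number of rows the statement changed

updated = award_bonus_set_based(conn)
bonuses = [row[0] for row in conn.execute("SELECT bonus FROM orders ORDER BY id")]
```

The WHERE clause also covers the "process the bare minimum" advice: rows that already have a bonus are skipped without ever leaving the database.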

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column still creates duplicate entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure whether that is why this happens, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and that there is never more than one write to this DB table at the same time. Is this possible? What issues would I have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data, and each batch is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT request to the API. Sometimes requests arrive at exactly the same time for the same SID, so I need a way to make sure they are not all persisted at the same time, but one after the other, or that only the last one sent to the API is persisted.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
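For readers who have not met optimistic locking, here is a minimal sketch of the idea, written against Python's built-in sqlite3 module rather than ActiveRecord. Every row carries a version counter, and an update only succeeds if the version is still the one the writer originally read. The sessions table, the StaleRecordError class, and update_payload are hypothetical names. In Rails, as far as I know, the equivalent is a lock_version column, with ActiveRecord raising ActiveRecord::StaleObjectError on conflict.

```python
import sqlite3

class StaleRecordError(Exception):
    """Raised when the row changed since the caller read it (hypothetical name)."""

def update_payload(conn, sid, new_payload, expected_version):
    """Write new_payload only if lock_version still equals expected_version."""
    with conn:
        cur = conn.execute(
            "UPDATE sessions "
            "SET payload = ?, lock_version = lock_version + 1 "
            "WHERE sid = ? AND lock_version = ?",
            (new_payload, sid, expected_version),
        )
    if cur.rowcount == 0:
        # No row matched: someone else bumped the version first.
        raise StaleRecordError(sid)

# Hypothetical table and data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (sid TEXT PRIMARY KEY, payload TEXT, lock_version INTEGER)"
)
conn.execute("INSERT INTO sessions VALUES ('abc', 'initial', 0)")

# Two writers both read version 0; only the first write goes through.
update_payload(conn, "abc", "first writer", expected_version=0)
try:
    update_payload(conn, "abc", "second writer", expected_version=0)
    conflict = False
except StaleRecordError:
    conflict = True

row = conn.execute(
    "SELECT payload, lock_version FROM sessions WHERE sid = 'abc'"
).fetchone()
```

The losing writer has to decide what to do on conflict: re-read and retry, or give up. Nothing is locked while the work is in progress, which is why the non-concurrent case stays cheap.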
If you are worried about multiple processes writing to the same rows (for example, several users updating the same order_header row), I'd suggest setting a marker bound to current_user.id on the row when /order_headers/:id/edit is called, and removing it once the current_user releases the row, either by updating or by canceling the edit.
Your use case (from your description) seems a bit different to me, so I'd suggest leaving it to the DB. With a fairly recent MySQL (post 5.1), you could add a trigger or function that does the actual update, and implement logic there similar to the above: some marker bound to the sequenced job id.

How to atomically select rows in MySQL?

I have 5+ simultaneous processes selecting rows from the same MySQL table. Each process SELECTs 100 rows, processes them, and DELETEs the selected rows.
But I'm getting the same row selected and processed 2 times or more.
How can I avoid it from happening on MYSQL side or Ruby on Rails side?
The app is built on Ruby On Rails...
Your table appears to be a workflow, which means you should have a field indicating the state of the row ("claimed", in your case). The other processes should be selecting for unclaimed rows, which will prevent the processes from stepping on each others' rows.
If you want to take it a step further, you can use process identifiers so that you know what is working on what, and maybe how long is too long to be working, and whether it's finished, etc.
And yeah, go back to your old questions and approve some answers. I saw at least one that you definitely missed.
Eric's answer is good, but I think I should elaborate a little...
You have some additional columns in your table say:
lockhost VARCHAR(60),
lockpid INT,
locktime INT, -- Or your favourite timestamp.
Default them all to NULL.
Then you have the worker processes "claim" the rows by doing:
UPDATE tbl SET lockhost='myhostname', lockpid=12345,
locktime=UNIX_TIMESTAMP() WHERE lockhost IS NULL ORDER BY id
LIMIT 100
Then you process the claimed rows with SELECT ... WHERE lockhost='myhostname' AND lockpid=12345.
After you finish processing a row, you make whatever updates are necessary, and set lockhost, lockpid and locktime back to NULL (or delete it).
This stops the same row being processed by more than one process at once. You need the hostname, because you might have several hosts doing processing.
If a process crashes while it is processing a batch, you can check if the "locktime" column is very old (much older than processing can possibly take, say several hours). Then you can just reclaim some rows which have an old "locktime" even though their lockhost is not null.
This is a pretty common "queue pattern" in databases; it is not extremely efficient. If you have a very high rate of items entering / leaving the queue, consider using a proper queue server instead.
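The claim-then-select steps above can be sketched end to end as follows. This is an illustration on Python's built-in sqlite3 module, not MySQL itself. SQLite does not accept UPDATE ... ORDER BY ... LIMIT in its default build, so the sketch picks the rows with a subquery instead. On MySQL you would keep the UPDATE ... ORDER BY id LIMIT 100 form shown earlier. The claim_rows helper and the host and pid values are made up.

```python
import sqlite3
import time

def claim_rows(conn, host, pid, limit=100):
    """Atomically mark up to `limit` unclaimed rows as ours, then return their ids."""
    with conn:  # the claim is a single statement in its own transaction
        conn.execute(
            "UPDATE tbl SET lockhost = ?, lockpid = ?, locktime = ? "
            "WHERE id IN ("
            "  SELECT id FROM tbl WHERE lockhost IS NULL ORDER BY id LIMIT ?"
            ")",
            (host, pid, int(time.time()), limit),
        )
    # Read back only the rows this worker holds.
    return [
        row[0]
        for row in conn.execute(
            "SELECT id FROM tbl WHERE lockhost = ? AND lockpid = ? ORDER BY id",
            (host, pid),
        )
    ]

# Hypothetical table with five queued rows, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl ("
    "id INTEGER PRIMARY KEY, lockhost TEXT, lockpid INTEGER, locktime INTEGER)"
)
conn.executemany("INSERT INTO tbl (id) VALUES (?)", [(i,) for i in range(1, 6)])

worker_a = claim_rows(conn, "host-a", 111, limit=3)
worker_b = claim_rows(conn, "host-b", 222, limit=3)
```

Because the claim is one UPDATE, a row can only ever match lockhost IS NULL for one worker. Reclaiming stale rows would be a second UPDATE with an extra condition on locktime, which is not shown here.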
http://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html
should do it for you

How to ensure MySQL selects do not interfere?

I have a table in my MySQL DB which basically contains "cron"-like tasks. Basically a user visits a page and the script (php) checks the DB cron table, gets the latest 5 results that are "available" and executes the scripts related to the tasks.
The only issue I foresee at the moment is that 2 users might get the same tasks. Note that currently I first run an UPDATE query which assigns 5 tasks to the current user. After that I do a SELECT query to get the 5 tasks assigned to the current user, and when he's done I mark the tasks as completed.
Theoretically no 2 users should ever get the same tasks, but I'm uncertain. I'm simply wondering whether MySQL has a built-in mechanism to ensure this, or whether there are known methods for it?
Thanks.
You want to use Transactions. This way you can ensure that a multi-step operation, such as [UPDATE, SELECT, UPDATE] is either wholly completed, or does not happen at all.
This is a classic concurrency problem. It's worth reading up on concurrency and transactions in general so that you understand the principles. This will help you avoid problems down the line (there are lots of knotty problems in concurrency!).

Using a table to keep the last used ID in a web server farm

I use a table with one row to keep the last used ID (I have my reasons to not use auto_increment), my app should work in a server farm so I wonder how I can update the last inserted ID (ie. increment it) and select the new ID in one step to avoid problems with thread safety (race condition between servers in the server farm).
You're going to use a server farm for the database? That doesn't sound "right".
You may want to consider using GUIDs for IDs. They may be big but they don't have duplicates.
With a single "next id" value you will run into lock contention on that record. What I've done in the past is use a table of ranges of IDs (RangeId, RangeFrom, RangeTo). The range table has a primary key of "RangeId" that is a simple number (e.g. 1 to 100). The "get next id" routine picks a random number from 1 to 100 and gets the first range record with an id lower than the random number. This spreads the locks out across N records. You can use 10s, 100s or 1000s of range records. When a range is fully consumed, just delete the range record.
If you're really using multiple databases then you can manually ensure each database's set of range records do not overlap.
You need to make sure that your ID column is only ever accessed in a lock - then only one person can read the highest and set the new highest ID.
You can do this in C# using a lock statement around your code that accesses the table, or in your database you can put together a transaction on your read/write. I don't know the exact syntax for this on mysql.
Use a transactional database and control transactions manually. That way you can submit multiple queries without risking having something mixed up. Also, you may store the relevant query sets in stored procedures, so you can simply invoke these transactional queries.
If you have problems with performance, increment the ID by 100 and use a thread per "client" server. The thread should do the increment and hand each interested party a new ID. This way, the thread needs only access the DB once for 100 IDs.
If the thread crashes, you'll lose a couple of IDs, but if that doesn't happen all the time, you shouldn't need to worry about it.
AFAIK the only way to get nicely incrementing numbers out of a DB is transactional locks at the DB, which is hideous performance-wise. You can get lockless behaviour using GUIDs, but frankly you're going to run into transaction requirements in every CRUD operation you can think of anyway.
Assuming that your database is configured to run with a transaction isolation of READ_COMMITTED or better, then use one SQL statement that updates the row, setting it to the old value selected from the row plus an increment. With lower levels of transaction isolation you might need to use INSERT combined with SELECT FOR UPDATE.
As pointed out [by Aaron Digulla] it is better to allocate blocks of IDs, to reduce the number of queries and table locks.
The application must perform the ID acquisition in a separate transaction from any business logic, otherwise any transaction that needs an ID will end up waiting for every transaction that asks for an ID first to commit/rollback.
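Putting the last three points together (a single incrementing UPDATE, blocks of IDs, and a dedicated short transaction), a minimal sketch could look like this. It runs on Python's built-in sqlite3 module as a stand-in for MySQL. The id_counter table, its last_id column, and the allocate_ids helper are assumptions for illustration. On MySQL with InnoDB the UPDATE would hold a row lock on the counter row until the commit.

```python
import sqlite3

def allocate_ids(conn, block_size=100):
    """Reserve block_size consecutive IDs and return (first, last).

    Runs in its own short transaction, separate from any business logic.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before touching the counter
    try:
        conn.execute(
            "UPDATE id_counter SET last_id = last_id + ?", (block_size,)
        )
        last = conn.execute("SELECT last_id FROM id_counter").fetchone()[0]
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return last - block_size + 1, last

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries are exactly the BEGIN/COMMIT statements above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE id_counter (last_id INTEGER NOT NULL)")
conn.execute("INSERT INTO id_counter (last_id) VALUES (0)")

first_block = allocate_ids(conn)
second_block = allocate_ids(conn)
```

Each application server would then hand out IDs from its reserved block in memory and only come back to the database when the block is used up.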
This article: http://www.ddj.com/architect/184415770 explains the HIGH-LOW strategy that allows your application to obtain IDs from multiple allocators. Multiple allocators improve concurrency, reliability and scalability.
There is also a long discussion here: http://www.theserverside.com/patterns/thread.tss?thread_id=4228 "HIGH/LOW Singleton+Session Bean Universal Object ID Generator"