Ensure auto_increment value ordering in MySQL

I have multiple threads writing events into a MySQL table events.
The table has a tracking_no column configured as AUTO_INCREMENT, used to enforce an ordering of the events.
Different readers consume from events: they poll the table regularly for new events and keep the tracking_no of the last-consumed event so that each poll fetches only what is new.
It turns out that the current implementation leaves a chance of missing some events.
This is what's happening:
Thread-1 begins an "insert" transaction; it takes the next value from the auto_increment column (1) but takes a while to complete.
Thread-2 begins an "insert" transaction; it takes the next auto_increment value (2) and completes the write before Thread-1.
The Reader polls and asks for all events with tracking_no greater than 0; it gets only event 2, because Thread-1's insert is still in flight.
The event gets consumed and the Reader updates its tracking status to 2.
Thread-1 completes the insert and event 1 appears in the table.
The Reader polls again for all events after 2; event 1 is now in the table, but it will never be picked up.
It seems this could be solved by changing the auto_increment locking strategy so that the entire table is locked until the inserting transaction completes, but we would rather avoid that if possible.

I can think of two possible approaches.
1) If your event inserts are guaranteed to succeed (ie, you never roll back an event insert, and therefore there are never any persistent gaps in your tracking_no), then you can rewrite your Readers so that they keep track of the last contiguous event seen -- aka the last event successfully processed.
The reader queries the event store, starts processing the events in order, and then stops if a gap is found. The remaining events are discarded. The next query uses the sequence number of the last successfully processed event.
Rollback makes a mess of this, though - scenarios with concurrent writes can leave persistent gaps in the stream, which would cause your readers to block.
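A minimal sketch of the reader side for option 1 (the payload column and the :last_processed bind parameter are placeholders of mine, not from the question):

SELECT tracking_no, payload
FROM events
WHERE tracking_no > :last_processed
ORDER BY tracking_no
LIMIT 1000;
-- The reader walks the rows in order, processing while tracking_no = previous + 1.
-- It stops at the first gap, stores the last contiguous tracking_no as the new
-- :last_processed, and simply re-fetches the rows after the gap on the next poll.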
2) You could put a time-based upper bound on your query. See "MySQL create time and update time timestamp" for the mechanics of setting up timestamp columns.
The idea then is that your readers query for all events with a higher sequence number than the last successfully processed event, but with a timestamp less than now() - some reasonable SLA interval.
It generally doesn't matter if the projections of an event stream are a little bit behind in time. So you leverage this, reading events in the past, which protects you from writes in the present that haven't completed yet.
That doesn't work for the domain model, though -- if you are loading an event stream to prepare for a write, working from a stream that is a measurable interval in the past isn't going to be much fun. The good news is that the writers know which version of the object they are currently working on, and therefore where in the sequence their generated events belong. So you track the version in the schema, and use that for conflict detection.
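A rough sketch of that query, assuming a created_at TIMESTAMP column on events, the same placeholders as above, and a 5-second SLA interval (tune the window to your own setup):

SELECT tracking_no, payload
FROM events
WHERE tracking_no > :last_processed
AND created_at < NOW() - INTERVAL 5 SECOND
ORDER BY tracking_no;
-- Events newer than the SLA window are deliberately left for the next poll, so an
-- in-flight insert holding a lower tracking_no still has time to become visible.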
Note: it's not entirely clear to me that the sequence numbers should be used for ordering. See https://stackoverflow.com/a/9985219/54734
Synthetic keys (IDs) are meaningless anyway. Their order is not significant, their only property of significance is uniqueness. You can't meaningfully measure how "far apart" two IDs are, nor can you meaningfully say if one is greater or less than another.
So this may be a case of having the wrong problem.

Related

How to create an event stream on MySQL

Given the following InnoDB table:
event_stream
+ id BIGINT PRIMARY KEY AUTO_INCREMENT
+ event TEXT
Considering that there are several clients concurrently inserting events into this stream: What technique could we use so that this event stream could be processed in an incremental way by multiple listeners/consumers?
(edit) I.e., I would like to have multiple clients attached to this stream, each able to react to new events once and keep track of its own position in the stream.
Considerations:
Storing the events somewhere other than MySQL is not an option;
locking the entire table is not acceptable;
I want to leave the tracking of which events have already been seen up to the consumer; there might be multiple consumers for this table;
Creating new columns is acceptable;
This table will grow to hundreds of millions of events;
"Don't queue it, just do it." I have found that a database makes a poor queuing mechanism. If the 'worker' threads don't take long to perform the tasks, then have the queuers simply perform the tasks; this removes the overhead of the queue, thereby possibly making the system faster and scale better.
"Hundreds of millions of events" -- and nearly all of them have been "handled"? This suggests you have two tables -- one for handled events, one for events waiting to be handled. The latter would rarely have more than a few dozen rows?? In that case, the processing will work better.
Have two extra columns: which worker owns the task, and when the worker grabbed it. The time is there so you can handle the case (yes, eventually it will happen) of a worker grabbing a task and then dying, leaving the task orphaned. A separate job can 'reap' these orphans.
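Something like this for the extra columns (the column names match the snippets below; the types and the index are my assumptions):

ALTER TABLE ToDo
ADD COLUMN who VARCHAR(64) NULL, -- which worker owns the task; NULL = unclaimed
ADD COLUMN `when` DATETIME NULL, -- when the worker grabbed it (backticks: WHEN is a reserved word)
ADD INDEX (who); -- grab and release both filter on who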
A single-SQL UPDATE can grab one row in the table. Do this in a transaction by itself, not inside the transaction(s) used by the rest of the processing. Similarly, 'release' the task in its own transaction.
The grab is something like this (with autocommit=ON):
UPDATE ToDo SET who = $me, `when` = NOW() -- `when` needs backticks; it is a reserved word
WHERE who IS NULL
LIMIT 1; -- grab one
SELECT ... FROM ToDo WHERE who = $me; -- get details on the task
The 'release' probably involves both tables, something like this:
BEGIN;
$stuff = SELECT ... FROM ToDo WHERE who = $me;
DELETE FROM ToDo WHERE who = $me;
INSERT ... INTO History ... VALUES (most of stuff from $stuff);
COMMIT;
In between grabbing and releasing, you have as long as you need to perform the 'task'. You won't be tripped up by an InnoDB timeout, etc.
If you would like to give further details about your queue and tasks, I may have further refinements.
What I describe should handle any number of inserters, any number of workers, and tasks lasting any length of time.
AUTO_INCREMENT is not reliable for walking through an event list. An INSERT is multiple steps:
Start transaction
get next auto_incr id
do the insert
COMMIT -- only now do others see the new id
It is possible (especially in replication) for the COMMITs to be "out of order" relative to the auto_incr.
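To make that concrete, here is roughly how two sessions can interleave against the event_stream table above (the values are illustrative):

-- session A
BEGIN;
INSERT INTO event_stream (event) VALUES ('a'); -- reserves id 1, not yet committed

-- session B, concurrently
BEGIN;
INSERT INTO event_stream (event) VALUES ('b'); -- reserves id 2
COMMIT; -- id 2 is now visible

-- a reader polling at this point sees id 2 but not id 1

-- session A, later
COMMIT; -- id 1 appears only now, behind an id the reader may have already passed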

How would you expire (or update) a MySQL record with precision to the expiry time?

Last year I was working on a project for university where one feature necessitated the expiry of records in the database with almost to-the-second precision (i.e. exactly x minutes/hours after creation). I say 'almost' because a few seconds probably wouldn't have meant the end of the world for me, although I can imagine that in something like an auction site, this probably would be important (I'm sure these types of sites use different measures, but just as an example).
I did research on MySQL events and did end up using them, although now that I think back on it I'm wondering if there is a better way to do what I did (which wasn't all that precise or efficient). There are three methods I can think of using events to achieve this - I want to know if these methods would be effective and efficient, or if there is some better way:
1) Schedule an event to run every second and update expired records. I imagine that this would cause issues as the number of records increases and takes longer than a second to execute, and might even interfere with normal database operations. Correct me if I'm wrong.
2) Schedule an event that runs every half-hour or so (could be any time interval, really), updating expired records. At the same time, impose selection criteria when querying the database to only return records whose expiration date has not yet passed, so that any records that expired since the last event execution are not retrieved. While this would be accurate at the time of retrieval, it defeats the purpose of having the event in the first place, and I'd assume the extra selection criteria would slow down the select query. In my project last year, I used this method, and the event updating the records was really only for backend logging purposes.
3) At insert, have a trigger that creates a dynamic event specific to the record that will expire it precisely when it should expire. After the expiry, delete the event. I feel like this would be a great method of doing it, but I'm not too sure if having so many events running at once would impact on the performance of the database (imagine a database that has even 60 inserts an hour - that's 60 events all running simultaneously for just one hour. Over time, depending on how long the expiration is, this would add up).
I'm sure there's more ways that you could do this - maybe using a separate script that runs externally to the RDBMS is an option - but these are the ones I was thinking about. If anyone has any insight as to how you might expire a record with precision, please let me know.
Also, despite the fact that I actually did use it in the past, I don't really like method 2, because while it works for the expiration of records, it doesn't really help me if, instead of expiring a record at a precise time, I wanted to make it active at a certain time (i.e. a scheduled post on a blog site). So for this reason, if you have a method that would work to update a record at a precise time, regardless of what that update does (expire or post), I'd be happy to hear it.
Option 3:
At insert, have a trigger that creates a dynamic event specific to the record that will expire it precisely when it should expire. After the expiry, delete the event. I feel like this would be a great method of doing it, but I'm not too sure if having so many events running at once would impact on the performance of the database (imagine a database that has even 60 inserts an hour - that's 60 events all running simultaneously for just one hour. Over time, depending on how long the expiration is, this would add up).
If you know the expiry time at insert, just put it in the table:
library_record - id, ..., create_at, expire_at
And query live records with the condition:
expire_at > NOW()
Same with publishing:
library_record - id, ..., create_at, publish_at, expire_at
Where:
publish_at <= NOW() AND expire_at > NOW()
You can set publish_at = create_at for immediate publication or just drop create_at if you don't need it.
Each of these, with the correct indexing, will have performance comparable to an is_live = 1 flag in the table and will save you a lot of event-related headaches.
Also, you will easily be able to see exactly why a record isn't live, and when it expired or should be published. You can also query things such as records that expire soon and send reminders with ease.
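A minimal sketch of that layout and the live-records query (table and column names follow the answer; the types, the NOT NULL constraints and the index choice are my assumptions):

CREATE TABLE library_record (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
-- ... your other columns ...
create_at DATETIME NOT NULL,
publish_at DATETIME NOT NULL,
expire_at DATETIME NOT NULL,
KEY idx_publish_expire (publish_at, expire_at) -- one reasonable index for the live-records filter
);

SELECT *
FROM library_record
WHERE publish_at <= NOW()
AND expire_at > NOW();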

synching mysql and memcached/membase

I have an application where I would like to roll up certain information into membase to avoid expensive group by queries. For example, a click conversion will be recorded in MySQL, and I want to keep a running total of clicks grouped by hours for a certain user in a memcache key.
There I can un/serialize an array with the values I need. I have many other needs like this with revenue, likes, etc.
What would be the best way to create some sort of "transaction" that assures MC and Mysql remain in sync? I could always rebuild the key store based on the underlying MySQL store, but I would like to maintain good concurrency between the two products.
At a high level, to use membase / memcache / etc as a cache for mysql, you'll want to do something like the following:
public Object readMethod(String key) {
    value = membaseDriver->get(key);          // try the cache first
    if (value != null) {
        return value;                         // cache hit
    }
    value = getFromMysql(key);                // cache miss: read from the source of truth
    membaseDriver->put(key, value, TTL);      // repopulate the cache
    return value;
}

public Object writeMethod(String key, String value) {
    writeToMysql(key, value);                 // write to the primary store first
    membaseDriver->delete(key);               // invalidate; the next get fetches the value just written to the db
}
This ensures that your DB remains the primary source of the data and keeps membase and MySQL nearly in sync (they are only out of sync while a process is inside writeMethod, after it has written to MySQL and before it has deleted the key from membase).
If you want them to be strictly in sync, you have to ensure that while any process is executing writeMethod, no process can execute readMethod. You can do a simple global lock in memcache/membase by using the add method: add a unique key named after your lock (e.g. "MY_LOCK"); if the add succeeds, you hold the lock, and nobody else can get it until you release it. When you are done with your write, release the lock by calling delete with your lock's key name. By starting both of those methods with that "lock" and ending both of them with the "unlock", you ensure that only one process at a time is executing either one. You could also build separate read and write locks on top of that, but I don't think locking is really what you want to do unless you need to be 100% up to date (as opposed to 99.999% up to date).
In the clicks per hour case, you could avoid having to re-run the query every time you count another click by keeping the current hour (ie: the only one that will change) separate from the array of all previous hours (which will probably never change, right?).
Every time you add a click, just use memcache increment on the current hour's counter. Then when you get a read request, look up the array of all previous hours, then the current hour, and return the previous hours with the current hour appended to the end. As a free bonus, the fact that increment is atomic gives you properly synchronized values, so you can skip locking.

How to atomically select rows in MySQL?

I have 5+ simultaneous processes selecting rows from the same MySQL table. Each process SELECTs 100 rows, PROCESSES them and DELETES the selected rows.
But I'm getting the same rows selected and processed two or more times.
How can I avoid it from happening on MYSQL side or Ruby on Rails side?
The app is built on Ruby On Rails...
Your table appears to be a workflow, which means you should have a field indicating the state of the row ("claimed", in your case). The other processes should be selecting for unclaimed rows, which will prevent the processes from stepping on each other's rows.
If you want to take it a step further, you can use process identifiers so that you know what is working on what, and maybe how long is too long to be working, and whether it's finished, etc.
And yeah, go back to your old questions and approve some answers. I saw at least one that you definitely missed.
Eric's answer is good, but I think I should elaborate a little...
You add some additional columns to your table, say:
lockhost VARCHAR(60),
lockpid INT,
locktime INT, -- Or your favourite timestamp.
Default them all to NULL.
Then you have the worker processes "claim" the rows by doing:
UPDATE tbl
SET lockhost = 'myhostname', lockpid = 12345, locktime = UNIX_TIMESTAMP()
WHERE lockhost IS NULL
ORDER BY id
LIMIT 100;
Then you process the claimed rows with SELECT ... WHERE lockhost='myhostname' and lockpid=12345
After you finish processing a row, you make whatever updates are necessary, and set lockhost, lockpid and locktime back to NULL (or delete it).
This stops the same row being processed by more than one process at once. You need the hostname, because you might have several hosts doing processing.
If a process crashes while it is processing a batch, you can check if the "locktime" column is very old (much older than processing can possibly take, say several hours). Then you can just reclaim some rows which have an old "locktime" even though their lockhost is not null.
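The reclaim step might look something like this (the six-hour threshold is just an example; tbl and the lock columns are the ones from above):

UPDATE tbl
SET lockhost = NULL, lockpid = NULL, locktime = NULL
WHERE lockhost IS NOT NULL
AND locktime < UNIX_TIMESTAMP() - 6*3600; -- locks older than ~6 hours are presumed orphaned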
This is a pretty common "queue pattern" in databases; it is not extremely efficient. If you have a very high rate of items entering / leaving the queue, consider using a proper queue server instead.
http://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html
should do it for you

Should I use a custom 'locks' table with MySQL?

I'm developing a relatively simple, custom web app with a MySQL MyISAM database on the back end. Somehow, I want to avoid the classic concurrency overwrite problem, e.g. that user A overwrites user B's edits because B loads and submits some edit form before A is finished.
That's why I would like to somehow lock a row on displaying the edit form. However...
As I said, I'm using MyISAM, which, as far as I can tell, doesn't support row-level locks. Also, I'm not sure if holding 'real' MySQL locks for a couple of minutes is recommended practice.
I don't really know much about transactions, but from what I've seen, it looks like they're meant to be used inside one connection.
Using some kind of conflict merge system like Git has is not an option really.
Rows would stay locked for a few minutes. Concurrency is very low: there's half a dozen users using the app at any time.
I'm now planning on using a table with details on which user is doing what, and since when. The app can then decide to not show the edit form when some other user recently opened it (e.g. is working on it). This fake lock would be deleted on saving the form.
Would this work? What should I do to avoid deadlocks, livelocks and all that stuff?
You could implement a lock. The easiest would probably be adding two fields to the data you want locked (lock_created DATETIME, locked_by INT). Then on the edit page (and probably also on the edit button) you check whether (lock_created + lock_interval) < NOW(); if not, the data is locked for editing and the user should be informed. (Note you always need the check on the edit page, not just on the edit button.)
Also, on the submission page you need to check that the user still holds the lock before accepting the submit. (See below.)
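As a sketch, taking the lock can be a single conditional UPDATE (the table name articles, the user id 42, the row id 123 and the 2-minute interval are all placeholders; the columns are the ones suggested above):

UPDATE articles
SET lock_created = NOW(), locked_by = 42
WHERE id = 123
AND (locked_by IS NULL
OR locked_by = 42
OR lock_created < NOW() - INTERVAL 2 MINUTE);
-- ROW_COUNT() = 1 means this user now holds the lock; 0 means someone else has a live lock.
-- Run the same statement again at submit time to verify the lock is still held.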
The one difficult part of this is what to do when someone edits but fails to submit within the lock interval.
So:
The lock_interval is 2 minutes.
At time 0:00 Alice locks the page, edits something, but gets a phone call and doesn't submit her changes
At time 2:30 Bob checks the page, gets the edit lock because Alice's lock has expired, and edits
At time 3:00 Alice gets back to her comp, presses submit -> conflict.
Someone doesn't get their data submitted. There is no way around that if you set locks to expire. (And if you don't, locks can be left forever.)
You can only decide which one to give priority (going with the new lock created by Bob is probably easiest) and inform the other that their lock has expired and the data won't be submitted, handing them back their edits so they can redo them.
A note on the table structure: you could create a table 'locks' with fields 'table_name, row_id, lock_created, locked_by', but it probably won't be the easiest way, since joining on variable table names is complex and confusing. There is also little benefit in having a single place where all locks are stored. For a simple mechanism, I think adding the same two fields to every table you want to lock is easier all around.
You should absolutely not use row-level locks for this scenario.
You can use optimistic locking, which basically means that you have a version field for each row, incremented on every save. Before saving you make sure that the version field is the same as it was when the row was loaded, which means that no one else has saved anything since you read the row.
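A bare-bones sketch of that check (the documents table, the body column and the bind parameters are placeholders of mine; only the version column is essential):

UPDATE documents
SET body = :new_body,
version = version + 1
WHERE id = :id
AND version = :version_read_at_load;
-- ROW_COUNT() = 0 means someone else saved since you loaded the row: reload, merge or warn, and retry.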