How to create an event stream on MySQL

Given the following InnoDB table:
event_stream
+ id BIGINT PRIMARY KEY AUTO_INCREMENT
+ event TEXT
Considering that there are several clients concurrently inserting events into this stream: what technique could we use so that this event stream can be processed incrementally by multiple listeners/consumers?
(edit) I.e. I would like to have multiple clients attached to this stream that can react to new events once and keep track of their position in the stream.
Considerations:
Storing the events anywhere other than MySQL is not an option;
locking the entire table is not acceptable;
I want to keep the control of whether an event was already seen/not seen up to the consumer; there might be multiple consumers for this table;
Creating new columns is acceptable;
This table will grow to hundreds of millions of events;

"Don't queue it, just do it." I have found that a database makes a poor queuing mechanism. If the 'worker' threads don't take long to perform the tasks, then have the queuers simply perform the tasks; this removes the overhead of the queue, thereby possibly making the system faster and scale better.
"Hundreds of millions of events" -- and nearly all of them have been "handled"? This suggests you have two tables -- one for handled events, one for events waiting to be handled. The latter would rarely have more than a few dozen rows?? In that case, the processing will work better.
Have two extra columns: which worker owns the task, and when the worker grabbed it. The time is there so that you can take care of the case (yes, eventually it will happen) of a worker grabbing a task, then dying -- thereby leaving the task orphaned. A separate job can 'reap' these orphans.
A single-SQL UPDATE can grab one row in the table. Do this in a transaction by itself, not in any transaction(s) in the process. Similarly 'release' the task in its own transaction.
The grab is something like this (with autocommit=ON):
UPDATE ToDo SET who = $me, `when` = NOW()
WHERE who IS NULL
LIMIT 1; -- grab one
SELECT ... FROM ToDo WHERE who = $me; -- get details on the task
The 'release' probably involves both tables, something like this:
BEGIN;
$stuff = SELECT ... FROM ToDo WHERE who = $me;
DELETE FROM ToDo WHERE who = $me;
INSERT ... INTO History ... VALUES (most of stuff from $stuff);
COMMIT;
In between grabbing and releasing, you have as long as you need to perform the 'task'. You won't be tripped up by an InnoDB timeout, etc.
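The 'reap' job mentioned above can be a single UPDATE run periodically; this is only a sketch, and the 10-minute threshold is an assumption:
UPDATE ToDo SET who = NULL, `when` = NULL
WHERE who IS NOT NULL
AND `when` < NOW() - INTERVAL 10 MINUTE; -- reclaim tasks whose worker died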
If you would like to give further details about your queue and tasks, I may have further refinements.
What I describe should handle any number of inserters, any number of workers, and tasks lasting any length of time.
AUTO_INCREMENT is not reliable for walking through an event list. An INSERT is multiple steps:
Start transaction
get next auto_incr id
do the insert
COMMIT -- only now do others see the new id
It is possible (especially in replication) for the COMMITs to be "out of order" relative to the auto_incr.

Related

Using table locking to prevent multiple users from updating at a given time

I am building a simple shopping cart. Currently, to ensure that a customer can never purchase a product that is out of stock, when processing the order I have a loop for each product in their cart:
-- Begin a transaction --
Loop through each product in the cart and
Select the stock count from the products table
If it is in stock:
I will reduce the stock count from the product
Add the product to the order items table
Otherwise, I call a rollback and return an error
-- If there isn't a call for rollback, everything ends with a commit --
However, if at any given time, the stock count for a product is updated AFTER it has checked for that particular product, there may be inconsistencies.
Question: would it be a good idea to lock the table from writes whenever I am processing an order? So that when the 'loop' above occurs, I can be assured that no one else is able to alter the product count and it will always be accurate.
The idea is that the product count/availability will always be consistent, and there will never be an instance where the stock count goes to -1 (which would be unfulfillable).
However, I have seen so many posts on locks being inefficient/having bad effects. If so, what is the best way to accomplish this?
I have seen alternatives like handling it in an update + select query, but have seen that it may also not be suitable in some cases.
You have at least three strategies:
1. Pessimistic Locking
If your application will experience low activity then you can lock the tables (or single rows) to make sure no other thread changes the values during the processing of a purchase. It works, but it has performance limitations.
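As a rough sketch, pessimistic row locking in InnoDB could look like this (the products/order_items tables and the ids are illustrative):
START TRANSACTION;
SELECT stock FROM products WHERE id = 42 FOR UPDATE; -- lock the row; others block until COMMIT
-- if stock > 0: decrement it and record the order item, otherwise ROLLBACK
UPDATE products SET stock = stock - 1 WHERE id = 42;
INSERT INTO order_items (order_id, product_id, quantity) VALUES (1001, 42, 1);
COMMIT;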
2. Optimistic Locking
If your application/web site must serve a high load then you can opt for the "optimistic locking" strategy. In this case you add a version number column to your critical tables and then you use it when reading/writing it.
When updating the stock you check that the version number you are updating is still the same one you read. If that is no longer the case (another thread modified it), you roll back the transaction and can retry a couple of times until you succeed.
It requires more development effort since you need to identify the failure case and implement retry logic (if you want to).
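A minimal sketch of the version check, assuming a version column on a hypothetical products table:
SELECT stock, version FROM products WHERE id = 42; -- say this returns version = 7
UPDATE products SET stock = stock - 1, version = version + 1
WHERE id = 42 AND version = 7; -- only succeeds if nobody changed the row in the meantime
-- 0 affected rows means another thread won: re-read and retry, or give up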
3. Processing Queues
You can implement processing queues. When a thread wants to "purchase an order" it can submit it to a processing queue for purchase orders. This queue can be implemented by one or more threads dedicated to this task; if you choose multiple threads they can be divided by order types, regions, categories, etc. to distribute the load.
This requires more programming effort since you need to manage asynchronous processing, but can sustain much higher levels of load.
You can use this strategy for multiple different tasks: purchasing orders, refilling stock, sending notifications, processing promotions, etc.
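One way to sketch the queue itself (table and column names are illustrative):
-- producers enqueue purchase requests instead of updating stock directly
INSERT INTO order_queue (customer_id, product_id, quantity, status)
VALUES (1001, 42, 1, 'pending');
-- a dedicated worker drains the queue in order and applies the stock checks
SELECT id, product_id, quantity FROM order_queue
WHERE status = 'pending'
ORDER BY id
LIMIT 10;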

Concurrent writes to MySQL and testing solutions

I was practicing some "system design" coding questions and I was interested in how to solve a concurrency problem in MySQL. The problem was "design an inventory checkout system".
Let's say you are trying to check out a specific item from an inventory, a library book for instance.
If two people are on the website, looking to book it, is it possible that they both check it out? Let's assume the query is updating the status of the row to mark a boolean checked_out to True.
Would transactions solve this issue? It would cause the second query that runs to fail (assuming they are the same query).
Alternatively, we insert rows into a checkouts table. Since both queries read that the item is not checked out currently, they could both insert into the table. I don't think a transaction would solve this, unless the transaction includes reading the table to see if a checkout currently exists for this item that hasn't yet ended.
How would I simulate two writes at the exact same time to test this?
No, transactions alone do not address concurrency issues. Let's quickly revisit mysql's definition of transactions:
Transactions are atomic units of work that can be committed or rolled back. When a transaction makes multiple changes to the database, either all the changes succeed when the transaction is committed, or all the changes are undone when the transaction is rolled back.
To sum it up: transactions are a way to ensure data integrity.
RDBMSs use various types of locking, isolation levels, and storage-engine-level solutions to address concurrency. People often mistake transactions as a means to control concurrency because transactions affect how long certain locks are held.
Focusing on InnoDB: when you issue an update statement, mysql places an exclusive lock on the record being updated. Only the transaction holding the exclusive lock can modify the given record, the others have to wait until the transaction is committed.
How does this help you prevent multiple users checking out the same book? Let's say you have an id field uniquely identifying the books and a checked_out field indicating the status of the book.
You can use the following atomic update to check out a book:
update books set checked_out=1 where id=xxx and checked_out=0
The checked_out=0 criteria makes sure that the update only succeeds if the book is not checked out yet. So, if the above statement affects a row, then the current user checks out the book. If it does not affect any rows, then someone else has already checked out the book. The exclusive lock makes sure that only one transaction can update the record at any given time, thus serializing the access to that record.
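In practice the caller just checks the affected-row count; in plain SQL that could look like this (the id 123 is illustrative):
update books set checked_out=1 where id=123 and checked_out=0;
select row_count(); -- 1: we got the book; 0: someone else already checked it out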
If you want to use a separate checkouts table for reserving books, then you can use a unique index on book ids to prevent the same book being checked out more than once.
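A sketch of that approach (table and column names are assumptions):
CREATE TABLE checkouts (
  book_id INT NOT NULL,
  user_id INT NOT NULL,
  checked_out_at DATETIME NOT NULL,
  UNIQUE KEY uq_book (book_id)
);
-- both users can attempt this; only the first INSERT succeeds,
-- the second fails with a duplicate-key error
INSERT INTO checkouts (book_id, user_id, checked_out_at) VALUES (123, 7, NOW());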
Transactions don't cause updates to fail. They cause sequences of queries to be serialized. Only one accessor can run the sequence of queries; others wait.
Everything in SQL is a transaction, single-statement update operations included. The kind of transaction denoted by BEGIN TRANSACTION; ... COMMIT; bundles a series of queries together.
I don't think a transaction would solve this, unless the transaction includes reading the table to see if a checkout currently exists for this item.
That's generally correct. Checkout schemes must always read availability from the database. The purpose of the transaction is to avoid race conditions when multiple users attempt to check out the same item.
SQL doesn't have thread-safe atomic test-and-set instructions like multithreaded processor cores have. So you need to use transactions for this kind of thing.
The simplest form of checkout uses a transaction, something like this.
BEGIN TRANSACTION;
SELECT is_item_available, id FROM item WHERE catalog_number = whatever FOR UPDATE;
/* if the item is not available, tell the user and commit the transaction without update*/
UPDATE item SET is_item_available = 0 WHERE id = itemIdPreviouslySelected;
/* tell the user the checkout succeeded. */
COMMIT;
It's clearly possible for two or more users to attempt to check out the same item more-or-less simultaneously. But only one of them actually gets the item.
A more complex checkout scheme, not detailed here, uses a two-step system. First step: a transaction to reserve the item for a user, rejecting the reservation if someone else has it checked out or reserved. Second step: reservation holder has a fixed amount of time to accept the reservation and check out the item, or the reservation expires and some other user may reserve the item.
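That scheme is not spelled out in the answer, but the first step could be sketched with an expiring reservations table (all names, the unique key on item_id, and the 15-minute expiry are assumptions):
-- step 1: reserve; the unique key on item_id rejects a second reservation
INSERT INTO reservations (item_id, user_id, expires_at)
VALUES (123, 7, NOW() + INTERVAL 15 MINUTE);
-- a periodic job clears expired holds so the item can be reserved again
DELETE FROM reservations WHERE expires_at < NOW();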

Ensure auto_increment value ordering in MySQL

I have multiple threads writing events into a MySQL table events.
The table has a tracking_no column configured as auto_increment, which is used to enforce an ordering of the events.
Different readers are consuming from events and they poll the table regularly to get the new events and keep the value of the last-consumed event to get all the new events at each poll.
It turns out that the current implementation leaves the chance of missing some events.
This is what's happening:
Thread-1 begins an "insert" transaction; it takes the next auto_increment value (1) but takes a while to complete.
Thread-2 begins an "insert" transaction; it takes the next auto_increment value (2) and completes the write before Thread-1.
A Reader polls and asks for all events with tracking_no greater than 0; it gets only event 2 because Thread-1 is still lagging behind.
The event gets consumed and the Reader updates its tracking status to 2.
Thread-1 completes the insert; event 1 appears in the table.
The Reader polls again for all events after 2, and although event 1 has now been inserted, it will never be picked up.
It seems this could be solved by changing the auto_increment strategy to lock the entire table until a transaction completes, but if possible we would avoid it.
I can think of two possible approaches.
1) If your event inserts are guaranteed to succeed (ie, you never roll back an event insert, and therefore there are never any persistent gaps in your tracking_no), then you can rewrite your Readers so that they keep track of the last contiguous event seen -- aka the last event successfully processed.
The reader queries the event store, starts processing the events in order, and then stops if a gap is found. The remaining events are discarded. The next query uses the sequence number of the last successfully processed event.
Rollback makes a mess of this, though - scenarios with concurrent writes can leave persistent gaps in the stream, which would cause your readers to block.
2) You could rewrite your query with an upper bound on event time. See MySQL create time and update time timestamp for the mechanics of setting up timestamp columns.
The idea then is that your readers query for all events with a higher sequence number than the last successfully processed event, but with a timestamp less than now() - some reasonable SLA interval.
It generally doesn't matter if the projections of an event stream are a little bit behind in time. So you leverage this, reading events in the past, which protects you from writes in the present that haven't completed yet.
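A sketch of such a poll (the created_at column and the 5-second interval are assumptions):
SELECT tracking_no, event
FROM events
WHERE tracking_no > 2 -- the last successfully processed event
AND created_at < NOW() - INTERVAL 5 SECOND -- skip rows too recent to be trusted
ORDER BY tracking_no;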
That doesn't work for the domain model, though -- if you are loading an event stream to prepare for a write, working from a stream that is a measurable interval in the past isn't going to be much fun. The good news is that the writers know which version of the object they are currently working on, and therefore where in the sequence their generated events belong. So you track the version in the schema, and use that for conflict detection.
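As a sketch of that idea (the stream_id/stream_version columns are assumptions):
-- each writer records the stream version its new event belongs to;
-- a unique key on (stream_id, stream_version) turns a concurrent write
-- into a duplicate-key error, i.e. a detected conflict
INSERT INTO events (stream_id, stream_version, event) VALUES (42, 18, '...');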
Note It's not entirely clear to me that the sequence numbers should be used for ordering. See https://stackoverflow.com/a/9985219/54734
Synthetic keys (IDs) are meaningless anyway. Their order is not significant, their only property of significance is uniqueness. You can't meaningfully measure how "far apart" two IDs are, nor can you meaningfully say if one is greater or less than another.
So this may be a case of having the wrong problem.

exclusive read locks in mysql

I have a table which maintains and assigns portions of input to work on (from a big input table) for multiple instances of a process. The table is organised as follows:
BlockInfo Table
---------------
BlockID int primary key
Status varchar
Every process queries for the block of input it should take, and processes that block.
I am expecting the query to be the following:
select BlockID
from BlockInfo
where Status='available'
order by BlockID
limit 1
For this to work, I would require that the server maintain exclusive read locks, since if the read lock is shared, multiple instances may get the same block, which causes duplication of effort and is undesirable.
I could get an exclusive write lock, but not actually write anything. But I want to know if mysql permits an exclusive read lock.
It would also help to hear about alternate ways of implementing this.
What you should do is:
Get an exclusive write lock
Select the row you want to process
Change its status to "processing" (or something other than "available")
Unlock the table
Do all your processing of the row
Update the row to change its status back to "available"
This will then allow other processes to work on other rows concurrently with this. It keeps the table locked for just enough time to keep them from trying to work on the same row.
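A minimal sketch of that flow (the 'processing' status value is an assumption):
LOCK TABLES BlockInfo WRITE;
SET @block = (SELECT BlockID FROM BlockInfo
              WHERE Status = 'available'
              ORDER BY BlockID LIMIT 1);
UPDATE BlockInfo SET Status = 'processing' WHERE BlockID = @block;
UNLOCK TABLES;
-- process the block here, then update its status when finished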
If you want to achieve this at the database level, a table-level lock is the way to go, as mentioned in the other answer. But it will be a bad design if performance is a concern for your application: it will result in frequent table locking and waiting.
I would suggest you to divide the work inside the application.
Let one process read the available rows from the database and fill the queue of the worker processes who would process them.

synching mysql and memcached/membase

I have an application where I would like to roll up certain information into membase to avoid expensive group by queries. For example, a click conversion will be recorded in MySQL, and I want to keep a running total of clicks grouped by hours for a certain user in a memcache key.
There I can un/serialize an array with the values I need. I have many other needs like this with revenue, likes, etc.
What would be the best way to create some sort of "transaction" that assures MC and Mysql remain in sync? I could always rebuild the key store based on the underlying MySQL store, but I would like to maintain good concurrency between the two products.
At a high level, to use membase / memcache / etc as a cache for mysql, you'll want to do something like the following:
public Object readMethod(String key) {
    Object value = membaseDriver->get(key);
    if (value != null) {
        return value;                      // cache hit
    }
    value = getFromMysql(key);             // cache miss: read from the database
    membaseDriver->put(key, value, TTL);   // repopulate the cache
    return value;
}

public Object writeMethod(String key, String value) {
    writeToMysql(key, value);
    membaseDriver->delete(key);
    // next call to get will fetch the value that we just wrote to the db
}
This ensures that your DB remains the primary source of the data and ensures that membase and mysql stay nearly in sync. (it is not in sync while a process is executing the write method, after it has written to mysql and before it has deleted the key from membase).
If you want them to be really in sync, you have to ensure that while any process is executing the writeMethod, no process can execute the readMethod. You can do a simple global lock in memcache / membase by using the add method. Basically, you add a unique key named after your lock (e.g. "MY_LOCK"); if the add succeeds, you have the lock, and after that nobody else can get it. When you are done with your write, you release the lock by calling delete with your lock's key name. By starting both of those methods with that "lock", and ending both of them with the "unlock", you ensure that only one process at a time is executing either one. You could also build separate read and write locks on top of that, but I don't think locking is really what you want to do unless you need to be 100% up to date (as opposed to 99.999% up to date).
In the clicks per hour case, you could avoid having to re-run the query every time you count another click by keeping the current hour (ie: the only one that will change) separate from the array of all previous hours (which will probably never change, right?).
Every time you add a click, just use memcache increment on the current hour's counter. Then when you get a read request, look up the array of all previous hours, then the current hour, and return all previous hours with the current hour appended to the end. As a free bonus, the fact that increment is atomic provides you with actually synchronized values, so you can skip locking.