Syncing MySQL and memcached/membase - mysql

I have an application where I would like to roll up certain information into membase to avoid expensive GROUP BY queries. For example, a click conversion will be recorded in MySQL, and I want to keep a running total of clicks, grouped by hour, for a certain user in a memcache key.
There I can serialize/unserialize an array with the values I need. I have many other needs like this for revenue, likes, etc.
What would be the best way to create some sort of "transaction" that ensures memcache and MySQL remain in sync? I could always rebuild the key store from the underlying MySQL store, but I would like to maintain good consistency between the two products.

At a high level, to use membase / memcache / etc as a cache for mysql, you'll want to do something like the following:
public Object readMethod(String key) {
    Object value = membaseDriver.get(key);
    if (value != null) {
        return value;
    }
    value = getFromMysql(key);
    membaseDriver.put(key, value, TTL);
    return value;
}
public void writeMethod(String key, Object value) {
    writeToMysql(key, value);
    membaseDriver.delete(key);
    // the next call to readMethod will fetch the value we just wrote to the DB
}
This ensures that your DB remains the primary source of the data and that membase and MySQL stay nearly in sync (they are only out of sync while a process is executing the write method, after it has written to MySQL and before it has deleted the key from membase).
If you want them to be strictly in sync, you have to ensure that while any process is executing writeMethod, no process can execute readMethod. You can build a simple global lock in memcache / membase using the add method: add a unique key named after your lock (e.g. "MY_LOCK"); if the add succeeds, you hold the lock, and after that nobody else can acquire it. When you are done with your write, release the lock by calling delete with your lock's key name. By starting both methods by taking that lock and ending both methods by releasing it, you ensure that only one process at a time is executing either one. You could also build separate read and write locks on top of that, but I don't think locking is really what you want to do unless you need to be 100% up to date (as opposed to 99.999% up to date).
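For illustration, here is a minimal sketch of that add-based lock in Python with the pymemcache client; the client library, host, key name and TTL are assumptions made for the example, not part of the original answer:

# Minimal sketch of a global memcache lock built on add/delete (assumed pymemcache client).
import time
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))  # assumed host/port
LOCK_KEY = "MY_LOCK"

def acquire_lock(ttl=10):
    # add() only succeeds if the key does not already exist, so the first caller wins.
    # noreply=False is required so we actually receive the success/failure status.
    return mc.add(LOCK_KEY, b"1", expire=ttl, noreply=False)

def release_lock():
    mc.delete(LOCK_KEY)

def with_lock(critical_section):
    # Spin until the lock is acquired, run the critical section, then release.
    while not acquire_lock():
        time.sleep(0.01)
    try:
        return critical_section()
    finally:
        release_lock()

Both readMethod and writeMethod above would then be wrapped by with_lock so that only one of them runs at a time.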
In the clicks-per-hour case, you can avoid re-running the query every time you count another click by keeping the current hour (i.e. the only one that will change) separate from the array of all previous hours (which will probably never change, right?).
Every time you add a click, just use memcache's increment on the current hour's counter. Then, on a read request, look up the array of all previous hours and the current hour's counter, and return the previous hours with the current hour appended to the end. As a free bonus, increment is atomic, so the values are actually synchronized and you can skip locking.
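As a rough sketch of that split (a plain counter for the mutable current hour, a serialized array for the closed hours), assuming a pymemcache client and a key layout made up for the example:

# Sketch: atomic increment for the current hour, serialized array for already-closed hours.
import json
import time
from pymemcache.client.base import Client

mc = Client(("127.0.0.1", 11211))  # assumed host/port

def hour_key(user_id):
    return "clicks:%s:%s" % (user_id, time.strftime("%Y%m%d%H", time.gmtime()))

def record_click(user_id):
    key = hour_key(user_id)
    if mc.incr(key, 1) is None:                    # counter for this hour does not exist yet
        if not mc.add(key, "1", noreply=False):    # another process created it first
            mc.incr(key, 1)

def clicks_by_hour(user_id):
    # previous (closed) hours are assumed to be cached as a JSON array under a separate key
    previous = json.loads(mc.get("clicks_history:%s" % user_id) or "[]")
    current = int(mc.get(hour_key(user_id)) or 0)
    return previous + [current]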

Related

Ensure auto_increment value ordering in MySQL

I have multiple threads writing events into a MySQL table events.
The table has a tracking_no column configured as auto_increment, which is used to enforce an ordering of the events.
Different readers are consuming from events and they poll the table regularly to get the new events and keep the value of the last-consumed event to get all the new events at each poll.
It turns out that the current implementation leaves the chance of missing some events.
This is what's happening:
Thread-1 begins an "insert" transaction, takes the next value from the auto_increment column (1), but takes a while to complete.
Thread-2 begins an "insert" transaction, takes the next auto_increment value (2) and completes its write before Thread-1.
Reader polls and asks for all events with tracking_number greater than 0; it gets event 2 because Thread-1 is still lagging behind.
The event gets consumed and the Reader updates its tracking status to 2.
Thread-1 completes the insert, event 1 appears in the table.
The Reader polls again for all events after 2; even though event 1 has now been inserted, it will never be picked up.
It seems this could be solved by changing the auto_increment strategy to lock the entire table until a transaction completes, but we would like to avoid that if possible.
I can think of two possible approaches.
1) If your event inserts are guaranteed to succeed (ie, you never roll back an event insert, and therefore there are never any persistent gaps in your tracking_no), then you can rewrite your Readers so that they keep track of the last contiguous event seen -- aka the last event successfully processed.
The reader queries the event store, starts processing the events in order, and then stops if a gap is found. The remaining events are discarded. The next query uses the sequence number of the last successfully processed event.
Rollback makes a mess of this, though - scenarios with concurrent writes can leave persistent gaps in the stream, which would cause your readers to block.
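As an illustration of approach 1, a reader that advances only past contiguous events might look roughly like this (the events table and tracking_no column follow the question; the DB driver usage and the handle_event helper are assumptions):

# Sketch: consume events in tracking_no order and stop at the first gap.
def poll_events(conn, last_processed):
    cur = conn.cursor()
    cur.execute(
        "SELECT tracking_no, payload FROM events"
        " WHERE tracking_no > %s ORDER BY tracking_no",
        (last_processed,),
    )
    for tracking_no, payload in cur.fetchall():
        if tracking_no != last_processed + 1:
            break                        # gap: a slower writer has not committed yet
        handle_event(payload)            # assumed application-level handler
        last_processed = tracking_no     # only advance past contiguous events
    return last_processed                # persist this and use it for the next poll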
2) You could rewrite your query with an upper bound on event time. See MySQL create time and update time timestamp for the mechanics of setting up timestamp columns.
The idea then is that your readers query for all events with a higher sequence number than the last successfully processed event, but with a timestamp less than now() - some reasonable SLA interval.
It generally doesn't matter if the projections of an event stream are a little bit behind in time. So you leverage this, reading events in the past, which protects you from writes in the present that haven't completed yet.
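A sketch of approach 2, assuming a created_at timestamp column on the events table and a tolerated lag of a few seconds (both the column name and the interval are assumptions):

# Sketch: only read events older than a small settling window so in-flight inserts can land.
SLA_SECONDS = 5  # assumed tolerance

def poll_settled_events(conn, last_processed):
    cur = conn.cursor()
    cur.execute(
        "SELECT tracking_no, payload FROM events"
        " WHERE tracking_no > %s"
        "   AND created_at < NOW() - INTERVAL %s SECOND"
        " ORDER BY tracking_no",
        (last_processed, SLA_SECONDS),
    )
    return cur.fetchall()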
That doesn't work for the domain model, though -- if you are loading an event stream to prepare for a write, working from a stream that is a measurable interval in the past isn't going to be much fun. The good news is that the writers know which version of the object they are currently working on, and therefore where in the sequence their generated events belong. So you track the version in the schema, and use that for conflict detection.
Note: it's not entirely clear to me that the sequence numbers should be used for ordering. See https://stackoverflow.com/a/9985219/54734
Synthetic keys (IDs) are meaningless anyway. Their order is not significant, their only property of significance is uniqueness. You can't meaningfully measure how "far apart" two IDs are, nor can you meaningfully say if one is greater or less than another.
So this may be a case of having the wrong problem.

Best practice reading newest rows from database

I have a table which stores the location of my user very frequently. I want to query this table frequently and return only the newest rows I haven't yet read.
What would be the best-practice way to do this? My ideas are:
Add a boolean read flag, query all results where this is false, return them and then update them ALL. This might slow things down with the extra writes
Save the id of the last read row on the client side, and query for rows greater than this. Only issue here is that my client could lose their place
Some stream of data
there will eventually be multiple users and readers of the locations, so this will need to scale somewhat
If what you have is a SQL database storing rows of things, I'd suggest something like option 2.
What I would probably do is keep a timestamp rather than an ID, and an index on that (a clustered index on MSSQL, or a similar construct, so that new rows are physically sorted by time). Then just query for anything newer than that.
That does have the "losing their place" issue. If the client MUST read every row published, then I'd either delete them after processing, or have a flag in the database to indicate that they have been processed. If the client just needs to restart reading current data, then I would do as above, but initialize the time with the most recent existing row.
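A minimal sketch of that timestamp watermark (option 2), with table and column names made up for the example:

# Sketch: poll for location rows newer than the last timestamp we saw.
def fetch_new_locations(conn, last_seen):
    cur = conn.cursor()
    cur.execute(
        "SELECT id, user_id, lat, lng, recorded_at FROM locations"
        " WHERE recorded_at > %s ORDER BY recorded_at",
        (last_seen,),
    )
    rows = cur.fetchall()
    if rows:
        last_seen = rows[-1][4]   # advance the watermark to the newest row returned
    return rows, last_seen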
If you MUST process every record and aren't limited to a database, what you're really talking about is a message queue. If you need to be able to access the individual data points after processing, then one step of the message handling could be to insert into a database for later querying (in addition to whatever else is done with the data read).
Edit per comments:
If there's no processing that needs to be done on receipt and you just want to periodically update data, then you'd be fine with the solution of keeping the last received time or ID and not deleting the data. In that case I would recommend not persisting the last known id/timestamp across restarts/reconnects, since you might end up inadvertently loading a bunch of data. Just reset it to the max when you restart.
On another note, when I did stuff like this I had good success using MQTT to transmit the data, and for the "live" updates. That is a pub/sub messaging protocol. You could have a process subscribing on the back end and forwarding data to the database, while the thing that wants the data frequently can subscribe directly to the stream of data for live updates. There's also a feature to hold onto the last published message and forward that to new subscribers so you don't start out completely empty.
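For reference, a bare-bones subscriber that forwards location messages to the database using the paho-mqtt client could be sketched like this; the broker address, topic and save_to_database helper are assumptions:

# Sketch: subscribe to a location topic and hand each message to a DB writer.
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    save_to_database(msg.topic, msg.payload)    # assumed persistence helper

client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.com", 1883)      # assumed broker host/port
client.subscribe("users/+/location", qos=1)
client.loop_forever()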

redis as write-back view count cache for mysql

I have a very high throughput site for which I'm trying to store "view counts" for each page in a mySQL database (for legacy reasons they must ultimately end up in mySQL).
The sheer number of views is making it impractical to do SQL "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+1" type of statements. There are millions of items but most are only viewed a small number of times, others are viewed many times.
So I'm considering using Redis to gather the view counts, with a background thread that writes the counts to mySQL. What is the recommended method for doing this? There are some issues with the approach:
how often does the background thread run?
how does it determine what to write back to mySQL?
should I store a Redis KEY for every ITEM that gets hit?
what TTL should I use?
is there already some pre-built solution or powerpoint presentation that gets me halfway there, etc.
I have seen very similar questions on StackOverflow but none with a great answer...yet! Hoping there's more Redis knowledge out there at this point.
I think you need to step back and look at some of your questions from a different angle to get to your answers.
"how often does the background thread run?"
To answer this you need to answer these questions: How much data can you lose? What is the reason for the data being in MySQL, and how often is that data accessed? For example, if the DB only needs to be consulted once per day for a report, you might only need to update it once per day. On the other hand, what if the Redis instance dies? How many increments can you lose and still be "ok"? These will provide the answers to the question of how often to update your MySQL instance and aren't something we can answer for you.
I would use a very different strategy for storing this in redis. For the sake of the discussion let us assume you decide you need to "flush to db" every hour.
Store each hit in hashes with a key name structure along these lines:
interval_counter:DD:HH
interval_counter:total
Use the page id (such as MD5 sum of the URI, the URI itself, or whatever ID you currently use) as the hash key and do two increments on a page view; one for each hash. This provides you with a current total for each page and a subset of pages to be updated.
You would then have your cron job run a minute or so after the start of the hour to pull down all pages with updated view counts by grabbing the previous hour's hash. This provides you with a very fast means of getting the data to update the MySQL DB with, while avoiding any need to do math or play tricks with timestamps etc. By pulling data from a key which is no longer being incremented you avoid race conditions due to clock skew.
You could set an expiration on the daily key, but I'd rather use the cron job to delete it when it has successfully updated the DB. This means your data is still there if the cron job fails or fails to be executed. It also provides the front-end with a full set of known hit counter data via keys that do not change. If you wanted, you could even keep the daily data around to be able to do window views of how popular a page is. For example if you kept the daily hash around for 7 days by setting an expire via the cron job instead of a delete, you could display how much traffic each page has had per day for the last week.
Executing the two HINCRBY operations, either individually or pipelined, still performs quite well and is more efficient than doing calculations and munging data in code.
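In redis-py, that per-hit work might look roughly like this; the key names follow the answer's interval_counter pattern, while the client setup is an assumption:

# Sketch: two pipelined HINCRBYs per page view, one into the hourly hash and one into the total hash.
import datetime
import redis

r = redis.Redis()  # assumed local instance

def record_hit(page_id):
    now = datetime.datetime.utcnow()
    hourly_key = "interval_counter:%s:%s" % (now.strftime("%d"), now.strftime("%H"))
    pipe = r.pipeline()
    pipe.hincrby(hourly_key, page_id, 1)
    pipe.hincrby("interval_counter:total", page_id, 1)
    pipe.execute()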
Now for the question of expiring the low-traffic pages vs memory use. First, your data set doesn't sound like one which will require huge amounts of memory. Of course, much of that depends on how you identify each page. If you have a numerical ID the memory requirements will be rather small. If you still wind up with too much memory, you can tune it via the config, and if need be you could even use a 32-bit compile of Redis for a significant reduction in memory use. For example, I used the scheme I describe in this answer to manage one of the ten busiest forums on the Internet and it consumed less than 3GB of data. I also stored the counters in far more "temporal window" keys than I am describing here.
That said, in this use case Redis is the cache. If you are still using too much memory after the above options you could set an expiration on keys and add an expire command to each hit. More specifically, if you follow the above pattern you will be doing the following per hit:
hincr -> total
hincr -> daily
expire -> total
This lets you keep anything that is actively used fresh by extending its expiration every time it is accessed. Of course, to do this you'd need to wrap your display call to catch a null answer from HGET on the totals hash and populate it from the MySQL DB, then increment. You could even do both as an increment. This would preserve the above structure and would likely be the same codebase needed to repopulate the Redis node from the MySQL DB if it ever needed it. For that you'll need to consider and decide which data source will be considered authoritative.
You can tune the cron job's performance by modifying the interval in accordance with the data-integrity parameters you determined from the earlier questions. To get a faster-running cron job you decrease the window. With this method, decreasing the window means you have a smaller collection of pages to update. A big advantage here is that you don't need to figure out which keys need updating and then go fetch them: you can do an HGETALL and iterate over the hash's fields to do the updates. This also saves many round trips by retrieving all the data at once. In either case you will likely want to consider a second Redis instance, slaved to the first, to do your reads from. You would still do deletes against the master, but those operations are much quicker and less likely to introduce delays in your write-heavy instance.
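The hourly flush described above could be sketched along these lines; the MySQL table layout and the previous-hour key computation are assumptions:

# Sketch: a cron job that drains the previous hour's hash into MySQL, then deletes it.
import datetime
import redis

r = redis.Redis()  # assumed local instance

def flush_previous_hour(conn):
    prev = datetime.datetime.utcnow() - datetime.timedelta(hours=1)
    key = "interval_counter:%s:%s" % (prev.strftime("%d"), prev.strftime("%H"))
    counts = r.hgetall(key)            # the whole hash in one round trip
    cur = conn.cursor()
    for page_id, cnt in counts.items():
        cur.execute(
            "UPDATE item SET view_count = view_count + %s WHERE id = %s",
            (int(cnt), page_id.decode()),
        )
    conn.commit()
    r.delete(key)                      # only delete once the DB update has succeeded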
If you need disk persistence of the Redis DB, then certainly put that on a slave instance. Otherwise if you do have a lot of data being changed often your RDB dumps will be constantly running.
I hope that helps. There are no "canned" answers because to use Redis properly you need to think first about how you will access the data, and that differs greatly from user to user and project to project. Here I based the route taken on this description: two consumers accessing the data, one that only displays it and another that determines when to update another datasource.
Consolidation of my other answer:
Define a time interval in which the transfer from Redis to MySQL should happen, e.g. minute, hour or day. Define it in such a way that an identifying key can be obtained quickly and easily. This key must be ordered, i.e. a smaller time should give a smaller key.
Let it be hourly and the key be YYYYMMDD_HH for readability.
Define a prefix like "hitcount_".
Then for every time-interval you set a hash hitcount_<timekey> in redis which contains all requested items of that interval in the form ITEM => count.
There are two parts to the solution:
The actual page that has to count:
a) get the current $timekey, e.g. via date functions
b) get the value of $ITEM
c) send the redis command HINCRBY hitcount_$timekey $ITEM 1
A cronjob which runs in that given interval, not too close to the boundary of those intervals (for example: not exactly on the full hour). This cronjob (see the sketch after this list):
a) Extracts the current time-key (for now it would be 20130527_08)
b) Requests all matching keys from redis with KEYS hitcount_* (those should be a small number)
c) compares every such hash against the current hitcount_<timekey>
d) if that key is smaller than current key, then process it as $processing_key:
read all pairs ITEM => counter by HGETALL $processing_key as $item, $cnt
update the database with UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+$cnt WHERE ITEM=$item
delete that key from the hash by HDEL $processing_key $item
no need to del the hash itself - there are no empty hashes in redis as far as I tried
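Put together, the cronjob side of those steps might be sketched like this; the redis-py client and the MySQL statement parameters are assumptions:

# Sketch: process every hitcount_* hash older than the current interval, then remove its fields.
import time
import redis

r = redis.Redis()  # assumed local instance

def flush_hitcounts(conn):
    current_key = "hitcount_" + time.strftime("%Y%m%d_%H", time.gmtime())
    cur = conn.cursor()
    for key in r.keys("hitcount_*"):            # a small number of keys, per the answer
        if key.decode() >= current_key:
            continue                            # still being incremented, skip it
        for item, cnt in r.hgetall(key).items():
            cur.execute(
                "UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + %s WHERE ITEM = %s",
                (int(cnt), item.decode()),
            )
            conn.commit()                       # persist before dropping the counter
            r.hdel(key, item)                   # remove the processed field from the hash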
If you want a TTL involved, say because the cleanup cronjob may not be reliable (it might not run for many hours), then you could create the future hashes from the cronjob with an appropriate TTL; that means that right now we could create hash 20130527_09 with a TTL of 10 hours, 20130527_10 with a TTL of 11 hours, and 20130527_11 with a TTL of 12 hours. The problem is that you would need a pseudo key, because empty hashes seem to be deleted automatically.
See EDIT3 for the current state of the answer.
I would write a key for every ITEM. A few tens of thousands of keys are definitely no problem at all.
Do the pages change very much? I mean do you get a lot of pages that will never be called again? Otherwise I would simply:
add the value for an ITEM on page request.
every minute (or every 5 minutes) run a cronjob that reads the redis keys, reads the value (say 7) and reduces the key by DECRBY ITEM 7. In MySQL you would then increment the value for that ITEM by 7.
If you have a lot of pages/ITEMS which will never be called again you could make a cleanup-job once a day to delete keys with value 0. This should be locked against incrementing that key again from the website.
I would set no TTL at all, so the values live forever. You could check the memory usage, but with current GBs of memory a lot of different pages fit comfortably.
EDIT: incr is very nice for that, because it sets the key if not set before.
EDIT2: Given the large number of different pages, instead of the slow "KEYS *" command you could use HASHES with HINCRBY (http://redis.io/commands/hincrby). Still, I am not sure whether HGETALL is much faster than KEYS *, and a HASH does not allow a TTL on individual fields.
EDIT3: Oh well, sometimes the good ideas come late. It is so simple: just prefix the key with a timeslot (say day-hour), or make a HASH whose name starts with "requests_" followed by the timeslot. Then no overlap of delete and increment can happen! Every hour you take the keys with older "day_hour_*" values, update MySQL and delete those old keys. The only condition is that your servers' clocks are not too far apart, so use UTC and synchronized servers, and don't start the cron at x:01 but at x:20 or so.
That means: a called page converts a call of ITEM1 at 23:37, May 26 2013 to Hash 20130526_23, ITEM1. HINCRBY count_20130526_23 ITEM1 1
One hour later the list of count_* keys is checked, and everything up to count_20130526_23 is processed (read the key-value pairs with HGETALL, update MySQL) and deleted field by field after processing (HDEL). Once that is finished you check whether HLEN is 0 and DEL count_...
So you only have a small number of keys (one per unprocessed hour), which keeps KEYS count_* fast, and then you process the actions of that hour. You can give the hashes a TTL of a few hours in case your cron is delayed, time-jumped, or down for a while.

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column can still create duplicate entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason this happens, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and that no more than one write to this DB table ever happens at the same time. Is this possible? What are some of the issues I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data, and each batch is identified as belonging together by a session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT request to the API. Sometimes queries arrive at exactly the same time for the same SID, so I need a way to make sure they are not all persisted at the same time, but one after the other, or simply the last one sent by AJAX request to the API.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
If you are worried about multiple processes writing to the 'same' rows - as in multiple users updating the same order_header row - I'd suggest you set some marker bound to the current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by canceling the edit.
Your use-case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB (with a fairly recent - as in post-5.1 - MySQL, you'd add a trigger/function which does the actual update, and there you could implement logic similar to the above suggestion: some marker bound to the sequenced job id of sorts).

Using Memcache as a counter for multiple objects

I have a photo-hosting website, and I want to keep track of views to the photos. Due to the large volume of traffic I get, incrementing a column in MySQL on every hit incurs too much overhead.
I currently have a system implemented using Memcache, but it's pretty much just a hack.
Every time a photo is viewed, I increment its photo-hits_uuid key in Memcache. In addition, I add a row containing the uuid to an invalidation array also stored in Memcache. Every so often I fetch the invalidation array, and then cycle through the rows in it, pushing the photo hits to MySQL and decrementing their Memcache keys.
This approach works and is significantly faster than directly using MySQL, but is there a better way?
I did some research and it looks like Redis might be my solution. It seems to be essentially Memcache with more functionality - the most valuable to me being lists, which pretty much solve my problem.
There is a way that I use.
Method 1: (Size of a file)
Every time someone hits the page, I append one more byte to a file. Then after x seconds or so (I use 600), I count how many bytes are in my file, delete the file, and then update the count in the MySQL database. This also scales if multiple servers are appending to a small file on a cache server. Use fwrite to append to the file and you will never have to read that cache file.
Method 2: (Number stored in a file)
Another method is to store a number in a text file that contains the number of hits, but I recommend against using this because if two processes were updating it simultaneously, the data might be off (maybe the same applies to method 1).
I would use method 1 because although it is a bigger file size, it is faster.
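Method 1 might be sketched as follows; the file path, flush interval and photos table are assumptions, and note the small window between reading the size and deleting the file:

# Sketch: count hits by appending one byte per view, then flush the byte count to MySQL.
import os

COUNTER_FILE = "/tmp/photo_hits_%s"    # assumed path, one counter file per photo

def record_hit(photo_id):
    with open(COUNTER_FILE % photo_id, "ab") as f:
        f.write(b"x")                   # append-only; the hot path never reads the file

def flush_hits(conn, photo_id):
    path = COUNTER_FILE % photo_id
    try:
        size = os.path.getsize(path)    # one byte per hit
    except OSError:
        return                          # no hits recorded since the last flush
    os.remove(path)                     # hits arriving between getsize and remove can be lost
    cur = conn.cursor()
    cur.execute(
        "UPDATE photos SET hits = hits + %s WHERE uuid = %s",
        (size, photo_id),
    )
    conn.commit()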
I'm assuming you're keeping access logs on your server for this solution (a rough sketch follows the steps below).
Keep track of the last time you checked your logs.
Every n seconds or so (where n is less than the time it takes for your logs to be rotated, if they are), scan through the latest log file, ignoring every hit until you find a timestamp after your last check time.
Count how many times each image was accessed.
Add each count to the count stored in the database.
Store the timestamp of the last log entry you processed for next time.
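A rough sketch of that log-scanning approach; the log format, the regular expression, and the photos table are assumptions made for the example:

# Sketch: scan the access log for image hits newer than the last checkpoint and batch them into MySQL.
import re
from collections import Counter
from datetime import datetime

LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "GET /photos/(?P<uuid>[0-9a-f-]+)')  # assumed log format

def process_log(conn, log_path, last_check):
    counts = Counter()
    newest = last_check
    with open(log_path) as log:
        for line in log:
            m = LOG_LINE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")  # assumed timestamp format
            if ts <= last_check:
                continue                       # already counted on a previous pass
            counts[m.group("uuid")] += 1
            newest = max(newest, ts)
    cur = conn.cursor()
    for uuid, cnt in counts.items():
        cur.execute("UPDATE photos SET hits = hits + %s WHERE uuid = %s", (cnt, uuid))
    conn.commit()
    return newest                              # store this as the checkpoint for the next run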