Load balancing KEYs using GET via Redis - mysql

My application, which currently uses MySQL, makes phone calls and fetches information about the dialed numbers and the caller ID from the DB. I want to define a group, i.e. a list of caller IDs, in Redis. Let's say 10 caller IDs. For each dialing I want to SELECT/GET a caller ID from the Redis server, not just pick a random number. Is that possible with Redis? It's like load balancing across the list of keys in Redis, to make sure all keys are given a fair chance to be used.
An example of the data set would be a phonebook as the key, with say 10 phone numbers in that phonebook. I want to draw from those numbers on every dialing so that all numbers in the phonebook are used evenly.
I can do that in MySQL by adding an update field to the table, but that's going to increase the number of UPDATEs hitting MySQL. Is this something that can easily be done with Redis? I can't seem to come up with the logic for it.
Thanks.

There are two ways to do it in Redis:
ZSET
You can track the usage frequency with the score of a zset entry. Each time you fetch the member with the lowest score from Redis, you increase its score by one.
The side benefit is you can easily see exactly how many times each element has been used.
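For illustration, here is a minimal sketch of the zset approach in Python with redis-py; the key name callerids, the sample numbers, and the connection defaults are assumptions, not anything from the question:

```python
import redis

r = redis.Redis(decode_responses=True)

# Seed the pool once: every caller ID starts with a usage count (score) of 0.
r.zadd("callerids", {"+15550001": 0, "+15550002": 0, "+15550003": 0})

def next_caller_id():
    # The member with the lowest score is the least-used caller ID.
    member = r.zrange("callerids", 0, 0)[0]
    # Record the usage by bumping its score.
    r.zincrby("callerids", 1, member)
    return member

print(next_caller_id())
```

With many concurrent dialers you would want to wrap the read-and-increment in a MULTI/EXEC transaction or a small Lua script, so that two dialers cannot grab the same member at once.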
LIST
If you don't need to track usage counts, you can also do it with a Redis list. Just use RPOPLPUSH source destination with the same key as both source and destination to achieve a round-robin load-balancing effect: it pops an element from the tail of the list, pushes it back onto the head, and returns the value of the rotated element.
The benefit is that there is only one command to run, and the operation is atomic.
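A minimal sketch of the list variant, again in Python with redis-py (key name and numbers are made up):

```python
import redis

r = redis.Redis(decode_responses=True)

# Seed the rotation list once.
r.rpush("callerids:rr", "+15550001", "+15550002", "+15550003")

def next_caller_id():
    # Atomically pop from the tail and push back onto the head of the same
    # list, returning the rotated element: round robin in one command.
    return r.rpoplpush("callerids:rr", "callerids:rr")

print(next_caller_id())
```

On Redis 6.2+ the same rotation is available via LMOVE key key RIGHT LEFT, the command that RPOPLPUSH is deprecated in favour of.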

Related

How to handle "View count" in redis

Our DB is mostly reads, but we want to add a "View count" and "thumbs up/thumbs down" to our videos.
When we stress tested incrementing views in mysql, our database started deadlocking.
I was thinking about handling this problem by having a Redis DB that holds the view count and only writes to the DB once the key expires. But I hear the notifications are not consistent, and I don't want to lose the view data.
Is there a better way of going about this? Or is the talk of Redis notifications being inconsistent not true?
Thanks,
Sammy
Redis' keyspace notifications are consistent, but delivery isn't guaranteed.
If you don't want to lose data, implement your own background process that manually expires the counters - i.e. copies them to MySQL and deletes them from Redis.
There are several approaches to implementing this lazy eviction pattern. For example, you can use a Redis Hash with two fields: a value field that you can HINCRBY and a timestamp field for expiry logic purposes. Your background process can then SCAN the keyspace to identify outdated keys.
Another way is to use Sorted Sets to manage the counters. In some cases you can use just one Sorted Set, encoding both TTL and count into each member's score (using the float's integer and fractional parts, respectively), but in most cases it is simpler to use two Sorted Sets - one for TTLs and the other for values.
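As an illustration of the first (Hash-based) variant, here is a rough sketch in Python with redis-py; the key layout views:<video_id>, the one-hour threshold, and the write_to_mysql callback are assumptions, and it uses SCAN rather than KEYS so the background pass doesn't block the server:

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def record_view(video_id):
    # One hash per counter: a running value plus a last-touched timestamp.
    key = f"views:{video_id}"
    pipe = r.pipeline()
    pipe.hincrby(key, "value", 1)
    pipe.hset(key, "ts", int(time.time()))
    pipe.execute()

def flush_stale(max_age=3600, write_to_mysql=lambda video_id, count: None):
    # Background job: copy counters that haven't been touched for a while
    # to MySQL, then delete them from Redis (the "manual expiry").
    now = int(time.time())
    for key in r.scan_iter(match="views:*"):
        data = r.hgetall(key)
        if now - int(data.get("ts", 0)) >= max_age:
            write_to_mysql(key.split(":", 1)[1], int(data.get("value", 0)))
            # A view arriving between HGETALL and DELETE would be lost here;
            # a production version would close that window with WATCH/MULTI
            # or a small Lua script.
            r.delete(key)
```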

redis as write-back view count cache for mysql

I have a very high throughput site for which I'm trying to store "view counts" for each page in a mySQL database (for legacy reasons they must ultimately end up in mySQL).
The sheer number of views is making it impractical to do SQL "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+1" type of statements. There are millions of items but most are only viewed a small number of times, others are viewed many times.
So I'm considering using Redis to gather the view counts, with a background thread that writes the counts to mySQL. What is the recommended method for doing this? There are some issues with the approach:
how often does the background thread run?
how does it determine what to write back to mySQL?
should I store a Redis KEY for every ITEM that gets hit?
what TTL should I use?
is there already some pre-built solution or powerpoint presentation that gets me halfway there, etc.
I have seen very similar questions on StackOverflow but none with a great answer...yet! Hoping there's more Redis knowledge out there at this point.
I think you need to step back and look at some of your questions from a different angle to get to your answers.
"how often does the background thread run?"
To answer this you need to answer these questions: How much data can you lose? What is the reason for the data being in MySQL, and how often is that data accessed? For example, if the DB is only needed to be consulted once per day for a report, you might only need it to be updated once per day. On the other hand, what if the Redis instance dies? How many increments can you lose and still be "ok"? These will provide the answers to the question of how often to update your MySQL instance and aren't something we can answer for you.
I would use a very different strategy for storing this in redis. For the sake of the discussion let us assume you decide you need to "flush to db" every hour.
Store each hit in hashes with a key name structure along these lines:
interval_counter:DD:HH
interval_counter:total
Use the page id (such as MD5 sum of the URI, the URI itself, or whatever ID you currently use) as the hash key and do two increments on a page view; one for each hash. This provides you with a current total for each page and a subset of pages to be updated.
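A minimal sketch of that per-hit write path, in Python with redis-py (key names follow the pattern above; connection defaults are assumed):

```python
from datetime import datetime, timezone
import redis

r = redis.Redis(decode_responses=True)

def record_hit(page_id):
    # Two increments per page view: one in the current hour's window hash,
    # one in the running totals hash. Pipelined to keep it to one round trip.
    hour_key = datetime.now(timezone.utc).strftime("interval_counter:%d:%H")
    pipe = r.pipeline()
    pipe.hincrby(hour_key, page_id, 1)
    pipe.hincrby("interval_counter:total", page_id, 1)
    pipe.execute()
```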
You would then have your cron job run a minute or so after the start of the hour to pull down all pages with updated view counts by grabbing the previous hour's hash. This provides you with a very fast means of getting the data to update the MySQL DB with, while avoiding any need to do math or play tricks with timestamps. By pulling data from a key which is no longer being incremented you avoid race conditions due to clock skew.
You could set an expiration on the daily key, but I'd rather use the cron job to delete it when it has successfully updated the DB. This means your data is still there if the cron job fails or fails to be executed. It also provides the front-end with a full set of known hit counter data via keys that do not change. If you wanted, you could even keep the daily data around to be able to do window views of how popular a page is. For example if you kept the daily hash around for 7 days by setting an expire via the cron job instead of a delete, you could display how much traffic each page has had per day for the last week.
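And a sketch of the matching cron job, run a minute or two after the top of the hour; update_mysql is a placeholder for your actual UPDATE statement:

```python
from datetime import datetime, timedelta, timezone
import redis

r = redis.Redis(decode_responses=True)

def flush_previous_hour(update_mysql=lambda page_id, count: None):
    # The previous hour's hash is no longer being incremented, so there is
    # no race with the front-end writers.
    prev = datetime.now(timezone.utc) - timedelta(hours=1)
    key = prev.strftime("interval_counter:%d:%H")
    for page_id, count in r.hgetall(key).items():
        update_mysql(page_id, int(count))
    # Delete only after the DB update succeeded, so the data survives a failed run.
    r.delete(key)
```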
Executing the two HINCRBY operations, either individually or pipelined, still performs quite well and is more efficient than doing calculations and munging data in code.
Now for the question of expiring the low traffic pages vs memory use. First, your data set doesn't sound like one which will require huge amounts of memory. Of course, much of that depends on how you identify each page. If you have a numerical ID the memory requirements will be rather small. If you still wind up with too much memory, you can tune it via the config, and if need be could even use a 32-bit compile of Redis for a significant memory use reduction. For example, the data I describe in this answer I used to manage for one of the ten busiest forums on the Internet and it consumed less than 3GB of memory. I also stored the counters in far more "temporal window" keys than I am describing here.
That said, in this use case Redis is the cache. If you are still using too much memory after the above options you could set an expiration on keys and add an expire command to each hit. More specifically, if you follow the above pattern you will be doing the following per hit:
hincr -> total
hincr -> daily
expire -> total
This lets you keep anything that is actively used fresh by extending its expiration every time it is accessed. Of course, to do this you'd need to wrap your display call to catch a null answer for HGET on the totals hash and populate it from the MySQL DB, then increment. You could even do both as an increment. This would preserve the above structure and would likely be the same codebase needed to update the Redis server from the MySQL DB if the Redis node needed repopulation. For that you'll need to consider and decide which data source will be considered authoritative.
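One possible reading of that read path, sketched in Python (the TTL, the totals key, and the load_from_mysql fallback are assumptions):

```python
import redis

r = redis.Redis(decode_responses=True)
TOTAL_KEY = "interval_counter:total"
TTL_SECONDS = 7 * 24 * 3600  # hypothetical retention window

def get_total_views(page_id, load_from_mysql=lambda pid: 0):
    count = r.hget(TOTAL_KEY, page_id)
    if count is None:
        # Cache miss: repopulate from the authoritative MySQL value.
        count = load_from_mysql(page_id)
        r.hset(TOTAL_KEY, page_id, count)
    # Keep actively used data fresh by extending the expiration on access.
    r.expire(TOTAL_KEY, TTL_SECONDS)
    return int(count)
```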
You can tune the cron job's performance by modifying your interval in accordance with the data-integrity parameters you determined from the earlier questions. To get a faster-running cron job you decrease the window. With this method decreasing the window means you should have a smaller collection of pages to update. A big advantage here is you don't need to figure out what keys you need to update and then go fetch them: you can do an HGETALL and iterate over the hash's keys to do updates. This also saves many round trips by retrieving all the data at once. In either case you will likely want to consider a second Redis instance slaved to the first to do your reads from. You would still do deletes against the master, but those operations are much quicker and less likely to introduce delays in your write-heavy instance.
If you need disk persistence of the Redis DB, then certainly put that on a slave instance. Otherwise if you do have a lot of data being changed often your RDB dumps will be constantly running.
I hope that helps. There are no "canned" answers because to use Redis properly you need to think first about how you will access the data, and that differs greatly from user to user and project to project. Here I based the route taken on this description: two consumers accessing the data, one to display only and the other to determine updating another datasource.
Consolidation of my other answer:
Define a time interval in which the transfer from Redis to MySQL should happen, i.e. minute, hour or day. Define it in a way that an identifying key can be obtained quickly and easily. This key must be ordered, i.e. a smaller time should give a smaller key.
Let it be hourly and the key be YYYYMMDD_HH for readability.
Define a prefix like "hitcount_".
Then for every time-interval you set a hash hitcount_<timekey> in redis which contains all requested items of that interval in the form ITEM => count.
There are two parts to the solution:
The actual page that has to count:
a) get the current $timekey, i.e. via date functions
b) get the value of $ITEM
c) send the Redis command HINCRBY hitcount_$timekey $ITEM 1
A cron job which runs in that given interval, not too close to the boundary of the interval (for example: not at the full hour). This cron job (sketched in code after this list):
a) Extracts the current time-key (for now it would be 20130527_08)
b) Requests all matching keys from redis with KEYS hitcount_* (those should be a small number)
c) compares every such hash against the current hitcount_<timekey>
d) if that key is smaller than current key, then process it as $processing_key:
read all pairs ITEM => counter by HGETALL $processing_key as $item, $cnt
update the database with `UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+$cnt WHERE ITEM=$item`
delete that key from the hash by HDEL $processing_key $item
no need to DEL the hash itself - Redis removes empty hashes automatically, as far as I have seen
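Here is a rough sketch of that cron job in Python with redis-py; it uses SCAN instead of KEYS to avoid blocking, and update_mysql stands in for the UPDATE statement above:

```python
from datetime import datetime, timezone
import redis

r = redis.Redis(decode_responses=True)

def process_closed_intervals(update_mysql=lambda item, cnt: None):
    # Keys sort lexicographically thanks to the YYYYMMDD_HH format,
    # so anything smaller than the current key belongs to a closed interval.
    current_key = "hitcount_" + datetime.now(timezone.utc).strftime("%Y%m%d_%H")
    for key in r.scan_iter(match="hitcount_*"):
        if key >= current_key:
            continue  # still being written to, skip it
        for item, cnt in r.hgetall(key).items():
            update_mysql(item, int(cnt))  # UPDATE ... SET VIEW_COUNT=VIEW_COUNT+cnt
            r.hdel(key, item)             # drop the field once it is persisted
        # Usually unnecessary: Redis deletes hashes that become empty.
        if r.hlen(key) == 0:
            r.delete(key)
```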
If you want to have a TTL involved, say if the cleanup cron job may not be reliable (it might not run for many hours), then you could create the future hashes via the cron job with an appropriate TTL. That means for now we could create a hash 20130527_09 with a TTL of 10 hours, 20130527_10 with a TTL of 11 hours, 20130527_11 with a TTL of 12 hours. The problem is that you would need a pseudo-key, because empty hashes seem to be deleted automatically.
See EDIT3 for the current state of the answer.
I would write a key for every ITEM. A few tens of thousands of keys are definitely no problem at all.
Do the pages change very much? I mean do you get a lot of pages that will never be called again? Otherwise I would simply:
add the value for an ITEM on page request.
every minute or 5 minutes run a cron job that reads the Redis keys, reads the value (say 7) and reduces it with DECRBY ITEM 7. In MySQL you then increment the value for that ITEM by 7.
If you have a lot of pages/ITEMS which will never be called again you could make a cleanup-job once a day to delete keys with value 0. This should be locked against incrementing that key again from the website.
I would set no TTL at all, so the values live forever. You could check the memory usage, but a lot of different pages should fit comfortably into today's gigabytes of memory.
EDIT: INCR is very nice for that, because it creates the key if it was not set before.
EDIT2: Given the large number of different pages, instead of the slow "KEYS *" command you could use HASHES with HINCRBY (http://redis.io/commands/hincrby). Still, I am not sure if HGETALL is much faster than KEYS *, and a HASH does not allow a TTL for individual fields.
EDIT3: Oh well, sometimes the good ideas come late. It is so simple: just prefix the key with a timeslot (say day-hour), or make a HASH with a name like "requests_<timeslot>". Then no overlap between delete and increment can happen! Every hour you take the keys with older "day_hour" values, update MySQL and delete those old keys. The only condition is that your servers' clocks are not too far apart, so use UTC and synchronized servers, and don't start the cron at x:01 but at x:20 or so.
That means: a page call for ITEM1 at 23:37 on May 26, 2013 maps to the hash for timeslot 20130526_23, field ITEM1: HINCRBY count_20130526_23 ITEM1 1
One hour later the list of count_* keys is checked, and every hash older than the current hour's is processed (read the field/value pairs with HGETALL, update MySQL) and its fields are deleted one by one after processing (HDEL). After finishing that you check whether HLEN is 0 and DEL the hash.
So you only have a small number of keys (one per unprocessed hour), which makes KEYS count_* fast, and then you process the actions of that hour. You can give the hashes a TTL of a few hours in case your cron is delayed, time-jumped, or down for a while.

Would using Redis with Rails provide any performance benefit for this specific kind of queries

I don't know if this is the right place to ask question like this, but here it goes:
I have an intranet-like Rails 3 application managing about 20k users which are in nested-set (preordered tree - http://en.wikipedia.org/wiki/Nested_set_model).
Those users enter stats (data, just plain numeric values). Entered stats are assigned to category (we call it Pointer) and a week number.
Those data are further processed and computed to Results.
Some are computed from users activity + result from some other category... etc.
What a user enters isn't always the same as what they see in reports.
Those computations can be very tricky, some categories have very specific formulae.
But the rest is just "give me sum of all entered values for this category for this user for this week/month/year".
Problem is that those stats needs also to be summed for a subset of users under selected user (so it will basically return sum of all values for all users under the user, including self).
This app has been in production for 2 years and it is doing its job pretty well... but with more and more users it's also getting pretty slow when it comes to server-expensive reports, like "give me a list of all users under myself and their statistics: one line summed for their sub-group and one line for their personal stats". Of course, users want (and need) their reports to be as current as possible; 5 minutes to reflect newly entered data is too much for them. And this specific report is their favorite :/
To stay realtime, we cannot run the heavy SQL directly... That would kill the server. So I'm computing it only once via a background process and the frontend just reads the results.
Those SQL queries are hard to optimize and I'm glad I've moved away from this approach... (caching is not an option. See below.)
Current app goes like this:
frontend: when user enters new data, it is saved to simple mysql table, like [user_id, pointer_id, date, value] and there is also insert to the queue.
backend: then there is a calc_daemon process, which every 5 seconds checks the queue for new "recompute requests". We pop the requests and determine what else needs to be recomputed along with them (pointers have dependencies... the simplest case is: when you change week stats, we must recompute month and year stats...). It does this recomputation the easy way: we select the data via customized, per-pointer SQL generated by their classes.
those computed results are then written back to MySQL, but to partitioned tables (one table per year). One line in such a table looks like [user_id, pointer_id, month_value, w1_value, w2_value, w3_value, w4_value]. This way, the tables have ~500k records (I've basically reduced the number of records 5x).
when frontend needs those results it does simple sums on those partitioned data, with 2 joins (because of the nested set conds).
The problem is that those simple sqls with sums, group by and join-on-the-subtree can take like 200ms each... just for a few records.. and we need to run a lot of these sqls... I think they are optimized the best they can, according to explain... but they are just too hard for it.
So... The QUESTION:
Can I rewrite this to use Redis (or another fast key-value store) and see any benefit from it when I'm using Ruby and Rails? As I see it, if I rewrite it to use Redis, I'll have to run many more queries against it than I do with MySQL, and then perform the sums manually in Ruby... so performance could be hurt considerably... I'm not really sure I could express all the queries I have now with Redis... Loading the users in Rails and then doing something like "redis, give me the sum for users 1,2,3,4,5..." doesn't seem like the right idea... But maybe there is some feature in Redis that could make this simpler?
Also, the tree structure needs to stay a nested set, i.e. it cannot have one entry in Redis with a list of all child ids for some user (something like children_for_user_10: [1,2,3]), because the tree structure changes frequently... That's also the reason why I can't store those sums in the partitioned tables: when the tree changes, I would have to recompute everything. That's why I perform those sums in realtime.
Or would you suggest rewriting this app in a different language (Java?) and computing the results in memory instead? :) (I've tried to do it SOA-style but it failed because I end up, one way or another, with XXX megabytes of data in Ruby... especially when generating the reports... and GC just kills it...) (and a side effect is that generating one report blocks the whole Rails app :/ )
Suggestions are welcome.
Redis would be faster since it is an in-memory database, but can you fit all of that data in memory? Iterating over Redis keys is not recommended, as noted in the comments, so I wouldn't use it to store the raw data. However, Redis is often used for storing the results of sums (e.g. logging counts of events); for example, it has a fast INCR command.
I'm guessing that you would get a sufficient speed improvement by using a stored procedure or a faster language than Ruby (e.g. inline C or Go) to do the recalculation. Are you doing group-by in the recalculation? Is it possible to replace the group-bys with code that orders the result set and then manually checks when the 'group' changes? For example, if you are looping by user and grouping by week inside the loop, change that to ordering by user and week and keep variables for the current and previous values of user and week, as well as variables for the sums.
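As an illustration of the manual group-change technique (the row layout and names are made up):

```python
# Replace "SELECT user_id, week, SUM(value) ... GROUP BY user_id, week"
# with an ordered scan that detects when the group changes.
def summed_by_user_and_week(rows):
    """rows: iterable of (user_id, week, value) tuples, ordered by user_id, week."""
    current = None  # (user_id, week) of the group being accumulated
    total = 0
    for user_id, week, value in rows:
        key = (user_id, week)
        if key != current:
            if current is not None:
                yield current[0], current[1], total  # emit the finished group
            current, total = key, 0
        total += value
    if current is not None:
        yield current[0], current[1], total

# usage: for user_id, week, total in summed_by_user_and_week(ordered_rows): ...
```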
This is assuming the bottleneck is the recalculation, you don't really mention which part is too slow.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
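A tiny illustration of the packing in Python (the sample digest is arbitrary):

```python
hex_id = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"  # 40 hex characters
raw = bytes.fromhex(hex_id)                           # 20 raw bytes
assert len(raw) == 20
# 500,000 IDs * 20 bytes is roughly 9.5 MiB, versus ~22 MB for the text form.
```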
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

Unique, numeric, incremental identifier

I need to generate unique, incremental, numeric transaction id's for each request I make to a certain XML RPC. These numbers only need to be unique across my domain, but will be generated on multiple machines.
I really don't want to have to keep track of this number in a database and deal with row locking etc on every single transaction. I tried to hack this using a microsecond timestamp, but there were collisions with just a few threads - my application needs to support hundreds of threads.
Any ideas would be appreciated.
Edit: What if each transaction id just has to be larger than the previous request's?
If you're going to be using this from hundreds of threads, working on multiple machines, and require an incremental ID, you're going to need some centralized place to store and lock the last generated ID number. This doesn't necessarily have to be in a database, but that would be the most common option. A central server that did nothing but serve IDs could provide the same functionality, but that probably defeats the purpose of distributing this.
If they need to be incremental, any form of timestamp won't be guaranteed unique.
If you don't need them to be incremental, a GUID would work. Potentially doing some type of merge of the timestamp + a hardware ID on each system could give unique identifiers, but the ID number portion would not necessarily be unique.
Could you use a pair of Hardware IDs + incremental timestamps? This would make each specific machine's IDs incremental, but not necessarily be unique across the entire domain.
---- EDIT -----
I don't think using any form of timestamp is going to work for you, for 2 reasons.
First, you'll never be able to guarantee that 2 threads on different machines won't try to schedule at exactly the same time, no matter what resolution of timer you use. At a high enough resolution, it would be unlikely, but not guaranteed.
Second, to make this work, even if you could resolve the collision issue above, you'd have to get every system to have exactly the same clock with microsecond accuracy, which isn't really practical.
This is a very difficult problem, particularly if you don't want to create a performance bottleneck. You say that the IDs need to be 'incremental' and 'numeric' -- is that a concrete business constraint, or one that exists for some other purpose?
If these aren't necessary you can use UUIDs, which most common platforms have libraries for. They allow you to generate many (millions!) of IDs in very short timespans and be quite comfortable with no collisions. The relevant article on wikipedia claims:
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
If you remove 'incremental' from your requirements, you could use a GUID.
I don't see how you can implement incremental across multiple processes without some sort of common data.
If you target a Windows platform, have you tried the Interlocked API?
Google for GUID generators for whatever language you are looking for, and then convert that to a number if you really need it to be numeric. It isn't incremental though.
Or have each thread "reserve" a thousand (or million, or billion) transaction IDs and hand them out one at a time, and "reserve" the next bunch when it runs out. Still not really incremental.
I'm with the GUID crowd, but if that's not possible, could you consider using db4o or SQLite instead of a heavyweight database?
If each client can keep track of its own "next id", then you could talk to a central server and get a range of IDs, perhaps 1,000 at a time. Once a client runs out of IDs, it has to talk to the server again.
This gives your system a central source of IDs while still avoiding a round trip to the database for every ID.
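A minimal sketch of that block-reservation idea, here using a Redis counter as the central source purely for illustration (the key name and block size are assumptions; any single authoritative counter would do):

```python
import redis

r = redis.Redis(decode_responses=True)
BLOCK_SIZE = 1000  # hypothetical block size

class IdAllocator:
    """Hands out IDs locally, reserving a new block from the central counter when empty."""

    def __init__(self):
        self._ids = iter(())  # empty until the first block is reserved

    def next_id(self):
        try:
            return next(self._ids)
        except StopIteration:
            # INCRBY is atomic, so concurrent clients receive disjoint ranges.
            top = r.incrby("txn:id_counter", BLOCK_SIZE)
            self._ids = iter(range(top - BLOCK_SIZE + 1, top + 1))
            return next(self._ids)

alloc = IdAllocator()
print(alloc.next_id(), alloc.next_id())
```

IDs from a single client are strictly increasing; across clients they are unique but not globally ordered, which is the trade-off this answer accepts.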