How to cache popular queries to avoid both stampedes and blank results - MySQL

On the customizable front page of our web site, we offer users the option of displaying modules that show recently updated content, choosing from well over 100 modules.
All of the data is generated by MySQL queries, the results of which are cached via memcached. Our current system works like this: when a user loads a page containing modules, they are immediately served the data from cache, and the query is added to a queue to be updated by a separate Gearman process (so that the page load does not wait for the MySQL query). That query is then run once every 15 minutes to refresh the data in cache. The queue of queries itself is periodically purged so that we do not continually refresh data that has not been requested recently.
The problem is what to do when, for some reason, the cache is empty. This doesn't happen often, but when it does, the user is currently shown an empty module, and the data is refreshed in the Gearman process so that a bit later, when the same (or a different) user reloads the page, there is data to show.
Our traffic is such that, if we were to try to run the query live for the user when the cache is empty, we would have a serious problem with stampeding--we'd be running the same (possibly slow) query many times as many users loaded the page. Is there any way to solve the "blank module" problem without opening up the risk of stampeding?

This is an interesting implementation, though it varies a bit from the way memcached is most typically implemented in front of MySQL.
In most cases users will set things up so that queries are first evaluated against memcached to see if there is an available entry. If so, they serve it from memcached and never query the database at all. If there is a cache miss, the query is made against the database, the results are added to memcached, and the information is returned to the caller. This is how you would typically build up your cache for read queries.
In cases where data is being updated, the update would be made against the database, and then the appropriate data in memcached invalidated and/or updated. Similarly for inserts, you could either do nothing regarding the cache (and let the next read on that record populate the cache), or you could actively add the data related to the insert into the cache, depending on your application needs.
In this way you wouldn't need to take the extra step of calling the database to get authoritative data after getting initial data from memcached. The data in memcached would be a copy of the authoritative data which is just updated/invalidated upon updates/inserts.
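As a rough sketch, the read path and the write path described above might look like this in Python (assuming the pymemcache client; query_database is a hypothetical stand-in for your own data-access layer):

    import json
    from pymemcache.client.base import Client  # assumed memcached client

    cache = Client(("127.0.0.1", 11211))
    CACHE_TTL = 900  # 15 minutes, matching the refresh interval above

    def get_module_data(cache_key, sql, params):
        # Evaluate the query against memcached first.
        cached = cache.get(cache_key)
        if cached is not None:
            return json.loads(cached)
        # Cache miss: run the query against MySQL...
        rows = query_database(sql, params)  # hypothetical helper
        # ...then add the results to memcached for subsequent callers.
        cache.set(cache_key, json.dumps(rows), expire=CACHE_TTL)
        return rows

    def update_record(sql, params, affected_cache_keys):
        # Writes go to the authoritative store first...
        query_database(sql, params)
        # ...then the related cache entries are invalidated so the
        # next read repopulates them from MySQL.
        for key in affected_cache_keys:
            cache.delete(key)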
Based on your comments, one thing you might want to try in order to prevent a flood of queries hitting your database on cache misses is to use a mutex of sorts. For example, when the first client hits memcached and gets a cache miss for that lookup, you could insert a temporary value in memcached indicating that the data is pending, then make the query against the database, and then update the memcached entry with the result.
On the client side, when you get a cache miss or a "pending" result, you could simply retry the cache after a certain period of time (which you may want to increase exponentially). So perhaps clients first wait 1 second, then try again in 2 seconds if they still get a "pending" result, then retry in 4 seconds, and so on.
This may result in more requests against the memcached server, but it should resolve the problem at the database layer.
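A minimal sketch of that mutex-plus-backoff idea, again assuming pymemcache (its atomic add() only succeeds for the first caller, which makes it usable as the lock) and the hypothetical query_database helper:

    import json
    import time
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))
    PENDING = b"__pending__"  # sentinel meaning "refresh in progress"

    def get_with_stampede_guard(cache_key, sql, params,
                                lock_ttl=30, max_retries=5):
        delay = 1  # seconds; doubled on each retry, as described above
        for _ in range(max_retries):
            cached = cache.get(cache_key)
            if cached is not None and cached != PENDING:
                return json.loads(cached)
            if cached is None:
                # add() is atomic: only one client on a miss wins the
                # right to run the query; everyone else sees "pending".
                if cache.add(cache_key, PENDING, expire=lock_ttl,
                             noreply=False):
                    rows = query_database(sql, params)  # hypothetical
                    cache.set(cache_key, json.dumps(rows), expire=900)
                    return rows
            # Another client is already refreshing: back off and retry.
            time.sleep(delay)
            delay *= 2
        return None  # give up; the caller can render an empty module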

Related

How do database transactions happen in MMORPGs?

I've built an MMORPG that uses a MySQL database to store player-related data when the user logs off.
We built in an auto-save timer so that the data of every logged-in user is saved to the database every 3 hours.
In doing so we noticed a fatal flaw...
Because all our database transactions are sent to a single DB thread, that thread can become backlogged with requests. This produces a login/saving issue: when it happens, players are unable to log in, as the login process requires the DB thread to confirm login credentials, and all save requests are queued at the back of the DB thread's schedule.
The only solution that I can think of for this is to introduce multiple threads and have 3-4 threads interacting with the database.
However, this opens up a new issue. Since DB requests are distributed across multiple threads, one thread can receive a save request for a player while another DB thread receives a save request for the same player.
For example....
PlayerA Logs In to the game
3 hours pass and the auto-save happens; PlayerA's data will now be saved.
PlayerA kills a monster and gains experience.
PlayerA logs off, which adds a save request to a DB thread.
Now we have two different save requests queued in the database. Assuming they are assigned to two different DB threads, this could cause the user's data to be saved in the wrong order... For example, the thread handling PlayerA's log-out save might run first, and then the auto-save for PlayerA runs after it on a separate thread... This would cause loss of data (in this case experience).
How do other MMORPG's handle something like this?
You need a database connection pool, if you're not using one already, and you should make sure you're not locking more data than you need: if you are saving how much gold a player has, you don't need to lock the table holding the credentials.
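A minimal fixed-size pool sketch in Python, assuming the PyMySQL driver (in production you would normally reach for an established pooling library instead):

    import queue
    import pymysql  # assumed MySQL driver

    class ConnectionPool:
        # Connections are created once up front and checked in and
        # out, instead of being opened and closed per request.
        def __init__(self, size=4, **conn_kwargs):
            self._pool = queue.Queue()
            for _ in range(size):
                self._pool.put(pymysql.connect(**conn_kwargs))

        def acquire(self):
            return self._pool.get()  # blocks while all are in use

        def release(self, conn):
            self._pool.put(conn)

    # e.g. pool = ConnectionPool(size=4, host="localhost",
    #                            user="game", database="mmorpg")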
Keeping the order of events in a multi-threaded scenario is not a trivial problem. I suggest using a message queue, with a single producer and a single consumer per player. This link shows 2 strategies to keep the order.
A queue is actually important for other reasons: if a save request fails, it remains in the queue to retry later. When dealing with players' money and items, you probably want this.
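Here is a rough Python sketch of that layout: one FIFO queue and one worker per player, so two saves for the same player can never be reordered or run concurrently (write_save_to_db is a hypothetical helper):

    import queue
    import threading
    import time

    class PlayerSaveQueue:
        # Both the auto-save timer and the logout handler enqueue
        # here; the single consumer thread preserves arrival order.
        def __init__(self, player_id):
            self.player_id = player_id
            self.pending = queue.Queue()
            threading.Thread(target=self._worker, daemon=True).start()

        def enqueue_save(self, snapshot):
            self.pending.put(snapshot)

        def _worker(self):
            while True:
                snapshot = self.pending.get()
                while True:
                    try:
                        write_save_to_db(self.player_id, snapshot)
                        break
                    except Exception:
                        # A failed save stays at the front and is
                        # retried, so order is still preserved.
                        time.sleep(1)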
Your auto-save is deterministic, meaning that you know exactly when the last one occurred and when the next one will occur. I would use that somehow, along with the previously suggested idea of adding a timestamp. Actually, it might be better to make the updates represent only the increments/decrements, along with a user timestamp, and calculate the experience upon request (maybe caching it then).
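A sketch of that increment-based idea, assuming a hypothetical xp_events table and a DB-API cursor with PyMySQL-style %s placeholders:

    # xp_events(player_id, delta, happened_at) stores timestamped
    # increments; a late-arriving save can no longer clobber a newer
    # total, because no absolute value is ever written.
    def record_xp_gain(cursor, player_id, delta, happened_at):
        cursor.execute(
            "INSERT INTO xp_events (player_id, delta, happened_at) "
            "VALUES (%s, %s, %s)",
            (player_id, delta, happened_at),
        )

    def current_xp(cursor, player_id):
        # The total is derived on request (and could be cached then).
        cursor.execute(
            "SELECT COALESCE(SUM(delta), 0) FROM xp_events "
            "WHERE player_id = %s",
            (player_id,),
        )
        return cursor.fetchone()[0]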
To avoid this problem in all cases you must not allow users to continue doing stuff before their last database transaction has been successfully committed. Of course that means that the DB has to be very fast -- if it can't keep the request queue below a couple of seconds' worth of transactions at most, you simply have to make it faster. More RAM cache, SSDs, the usual MySQL optimization dance. Adding extra logic in the form of triggers etc. isn't going to help in the long run, especially because they can become really complicated in the case of inventories and the like.
If on average the system is fast enough but struggles in peaks, like when everybody logs in during lunch break, adding something like Redis as a fast cache might help. You'd load the data into Redis when a user logs on (or when they first need a certain piece of data), remove it when they log off or when it expires, and write changes back to the relational DB as fast as it can keep up.

Using Redis to cache SQL result

I have a SQL-based application and I would like to cache the results using Redis. You can think of the application as an address book with multiple SQL tables. The application performs the following tasks:
40% of the time:
Create a new record / Update an existing record
Bulk update multiple records
Review an existing record
60% of the time:
Search records based on user's criteria
This is my current approach:
The system caches a record when it is created or updated.
When user performs a search, the system will cache the query result.
On top of that, I have a Redis look-up table (Redis Set) which stores the MySQL record ID and the Redis cache key. That way I can delete the Redis caches if the MySQL record has been changed (e.g., bulk update).
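That look-up table might be sketched like this with redis-py (key names are illustrative):

    import redis

    r = redis.Redis()

    def cache_search_result(query_key, result_json, record_ids):
        # Store the search result itself, with a TTL...
        r.set(query_key, result_json, ex=3600)
        # ...and index it under every record it contains, so a change
        # to any of those records can find and delete the stale caches.
        for record_id in record_ids:
            r.sadd("record-caches:%s" % record_id, query_key)

    def invalidate_record(record_id):
        # Called on update / bulk update of a MySQL record.
        index_key = "record-caches:%s" % record_id
        stale_keys = r.smembers(index_key)
        if stale_keys:
            r.delete(*stale_keys)
        r.delete(index_key)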
What if a new record is created after the system caches a search result? If the new record matches the search criteria, the system will keep returning the old cache (which does not include the new record) until the cache is deleted (which won't happen until an existing record in the cache is updated).
The search is driven by the users and the combinations of search conditions are endless. It is not possible to work out which caches should be deleted when a new record is created.
So far, the only solution is to remove all caches of a MySQL table when a record is created. However, this is not a good choice, because lots of records are created daily.
In this situation, what's the best way to implement Redis on top of MySQL?
Here's a surprising thing when it comes to PHP and MySQL (I am not sure about other languages) - not caching stuff into memcached or Redis is actually faster. Much faster. Basically, if you just built your app and queried MySQL - you'd get more out of it.
Now for the "why" part.
InnoDB, the default engine, is a superb engine. Specifically, its memory management (allocation and what not) is superior to any memory storage solution. That's a fact; you can look it up or take my word for it - it will, at least, perform as well as Redis.
Now what happens in your app: you query MySQL and cache the result into Redis. However, MySQL is also smart enough to keep cached results. What you just did is create an additional file descriptor that's required to connect to Redis. You also used some storage (RAM) to cache the result that MySQL already cached.
Here comes another interesting part: the preferred way of serving PHP scripts is php-fpm - it's much quicker than any mod_* crap out there. Down to the core, php-fpm is a supervisor process that spawns child processes. These don't shut down after a script is served, which means they cache connections to MySQL - connect once, use many times. If you serve scripts using php-fpm, they will reuse an already established connection to MySQL, meaning you won't be opening and closing connections for each request. This is extremely resource-friendly and gives you a lightning-fast connection to MySQL. MySQL, being memory-efficient and having the cached result, is much quicker than Redis.
Now what does all of this mean for you: having a proper setup lets you have code that's small, simple, and easy, doesn't involve Redis, and eliminates all the problems you might have with cache invalidation and what not, and you won't waste your memory holding the same data twice.
Ingredients you need for this to work:
php-fpm
MySQL with InnoDB-based tables and, most of all, sufficient RAM and a tweaked innodb_buffer_pool_size variable. That one controls how much RAM InnoDB is allowed to allocate for its purposes - the larger, the better.
You eliminated Redis from the game, you kept your code simple and easy to maintain, you didn't duplicate data, you didn't introduce an additional system into play, and you let software that's meant to take care of data do its job. A pretty cheap trade-off for maximum usefulness; even if you compile all the software from scratch, it won't take more than an hour or so to get it up and running.
Or, you can just ignore what I wrote and look for a solution using Redis.
We met the same problem and chose to do the same thing you are thinking of: remove all query caches affected by the table. It is not ideal, as you said, but fortunately our "write" load is not as high as 40%, so it's OK so far.
That's the nature of query-based caching. As an alternative you can add entity-based caching: instead of caching only the search result, cache the entire table and do the search in memory. We use C# LINQ, so we can do fairly common queries in memory, but if the search is too complicated then you are out of luck.
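In Python, that entity-based approach might be sketched roughly like this (load_all_contacts is a hypothetical stand-in for a SELECT over the MySQL table):

    # Hold the whole (smallish) table in memory and filter there,
    # instead of caching individual query results.
    _contacts = None

    def contacts():
        global _contacts
        if _contacts is None:
            _contacts = load_all_contacts()  # hypothetical loader
        return _contacts

    def search(predicate):
        # The in-memory equivalent of the LINQ query.
        return [row for row in contacts() if predicate(row)]

    def on_record_changed():
        # Creates/updates simply drop the in-memory copy; the next
        # read reloads it, so searches never stay stale for long.
        global _contacts
        _contacts = None

    # e.g. search(lambda row: "smith" in row["last_name"].lower())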

Syncing memcache and MySQL

I have not come across a good suggestion on how to keep the database and memcache in sync.
I use MySQL 5.5.28, Zope 2.12.19 in my web application.
So, some of the suggestions go like this: when you get a cache hit, the data is served from memcache; after this, the cache entry is invalidated and the data is selected again from the database so that the cache can be re-populated. But it is precisely because database operations are expensive that we opted to use a cache in the first place. So how does this solve the problem of faster access?
The other solution seems to be to update memcache using triggers on the source table. Any input on this would be appreciated, as I do not understand how this is done.
Below are the links with the best solutions that I could find to the above questions.
The answer to my first question concerns the use of a cache with rapidly changing data.
Well, caching is not ideal if the data changes frequently. This is true with a small number of users.
But if the number of hits to the website increases, then caching is really useful when the following approach is used:
INSERT, UPDATE or DELETE operations will invoke triggers that would invalidate the cache.
And when the page is loaded, SELECT will be used and the resulting data will be stored in the cache until it is changed again. This way, because the triggers handle INSERT, UPDATE, and DELETE on the respective tables, the application's code does not have to be modified throughout the system; only SELECT needs to be handled in the code.
Regarding my second question on how to use triggers to manipulate the cache, the link below has been extremely useful in answering my question:
http://code.openark.org/blog/mysql/using-memcached-functions-for-mysql-an-automated-alternative-to-query-cache.

How to avoid DB requery on pagination using ASP Classic and MySQL?

I have a page that queries products from the database and displays them in pages of 30 items. When I navigate to the next page, the application re-queries the DB, displays page 2, and so on.
How can I avoid this database re-query? Can I store the results somewhere? We are talking about 1500-2000 rows/query and when we have 400-450 users online, our dedicated server runs at 100% CPU capacity.
Do you have enough memory to pre-load your entire "catalog" (in Application-level storage) and then have SQL return all results, but store only the indices (in each Session)?
Something like this:
On Application Start: create my read only Application-level cache
On Search: SQL returns all results (I assume you have to do SQL, so you can check business conditions)
On results: build list of indices that map into Application cache
On Display Page: Read and display the appropriate range from the Application cache (see the sketch after the next paragraph)
If you don't have enough memory, then a "Result" table might provide some optimization: on a per-session basis, cache the entire query result into a "flattened" table, to avoid a potentially expensive (business-logic-heavy) products query. You have to be careful to detect when the query changes, so you can discard the cache, and you also need some server-side logic to clean up old, expired searches.
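The Application/Session split described above might look roughly like this (sketched in Python for brevity, though the question is ASP Classic; query_products and the session object are illustrative):

    PAGE_SIZE = 30
    catalog = {}  # Application-level shared cache: product_id -> row

    def run_search(session, criteria):
        rows = query_products(criteria)  # hypothetical SQL search
        for row in rows:
            catalog[row["id"]] = row     # warm the shared cache
        # Per-session: keep only the list of matching indices.
        session["result_ids"] = [row["id"] for row in rows]

    def get_page(session, page_number):
        # Paging re-reads memory instead of re-querying the database.
        ids = session.get("result_ids", [])
        start = (page_number - 1) * PAGE_SIZE
        return [catalog[i] for i in ids[start:start + PAGE_SIZE]]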
As I stated, the main reason I was asking for a solution was to avoid CPU overload. It seemed unnatural for the server to be clogged up at 100% with only 500-600 users online. I discovered the OPTIMIZE TABLE MySQL command, which works on MyISAM tables, and it totally solved the problem. Immediately after executing the command, the CPU usage went down to 10-12%.
So, if there is anyone else out there running MySQL applications that overload the CPU, you should first try the OPTIMIZE TABLE command and the other maintenance tasks described here: http://dev.mysql.com/doc/refman/5.5/en/optimize-table.html

Ruby on Rails: Why first active record query takes longer?

If I execute an ActiveRecord query after some time gap, it takes longer.
Say Item.all takes 0.11 sec on the first query and 0.003 sec later on. What could be the reason for this behaviour?
Edit:
The ActiveRecord query cache's scope is the controller action. In my case, the ActiveRecord query in a subsequent HTTP request is also faster.
Possible explanations:
ActiveRecord Caching
Connection Pooling (it doesn't have to restart the connection)
Load on the web server or db server.
ActiveRecord caches the results from queries. The first query is actually hitting the database - ActiveRecord then waits for the operation to complete and parses the results into its objects. The next time an identical query is made, it has the results cached so that they are returned to you immediately, instead of going all the way back to the database.
Check the API for the QueryCache: it seems like you can clear the query cache (connection.clear_query_cache) if you want to wipe out cached queries.
This SO question also suggests self.class.uncached do ... end to bypass the cache but I am not sure if this still applies in Rails 3.
It's definitely ActiveRecord's caching you're looking at. See the docs:
All of the methods are built on a simple caching principle that will keep the result of the last query around unless specifically instructed not to. The cache is even shared across methods to make it even cheaper to use the macro-added methods without worrying too much about performance at the first go.