Should I implement my own caching or rely on read-replicas? - mysql

We have an enterprise application that uses an SQL database. The access pattern is about 90% reads. The data that does get updated or created needs to be up-to-date immediately, and the cache needs to be correctly invalidated with high certainty. Entities are referred to by their primary key in 98% of cases.
The application is based on Node.js and is AWS-native, so I'd like to rely on managed AWS services rather than hosting my own. One option is to implement our own read-through Redis-based cache: upon retrieving an entity, we'd check the cache, and if the data is not cached we'd fetch it from the database and put it into the cache before returning it to the user. The parts of the code that update those entities would invalidate the cache by primary key.
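Concretely, the read and invalidation paths would look something like this rough sketch (using ioredis; `dbFindById` is a placeholder for our actual data access, and the TTL is only a safety net in case an invalidation is ever missed):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // connects to localhost:6379 by default

const TTL_SECONDS = 300; // safety-net expiry in case an invalidation is missed

// Placeholder for the real data access layer; not part of any library.
async function dbFindById(table: string, id: string): Promise<object | null> {
  // ... SELECT * FROM <table> WHERE id = ? ...
  return null;
}

// Read-through: check the cache first, fall back to the database on a miss,
// and populate the cache before returning to the caller.
async function getEntity(table: string, id: string): Promise<object | null> {
  const key = `${table}:${id}`;
  const cached = await redis.get(key);
  if (cached !== null) return JSON.parse(cached);

  const row = await dbFindById(table, id);
  if (row !== null) {
    await redis.set(key, JSON.stringify(row), "EX", TTL_SECONDS);
  }
  return row;
}

// Writers invalidate by primary key.
async function invalidateEntity(table: string, id: string): Promise<void> {
  await redis.del(`${table}:${id}`);
}
```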
Generally speaking, cache invalidation is one of the most challenging problems in computer science to get right. My opinion is that rather than implementing a Redis cache and thinking through all the possible scenarios for correctly invalidating it, it is wiser to configure an Aurora read replica specifically for reading frequently accessed entities. The RDBMS will do a much better job at caching than anything we can build ourselves.
So, I am facing two options -- go through the effort of implementing my own caching, or use read replicas. My personal opinion is to use a read replica.
Any advice is greatly appreciated, as always.

Yes, you're right: cache invalidation is a tough problem. The simplest solution is to add code to your data writes that replaces the cached values, so they're always current. But this is easy only if the cached values have a pretty much 1-to-1 correlation with rows in your database.
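For example, a rough sketch of that write path (assuming ioredis and a hypothetical `dbUpdate` helper; the key scheme has to mirror whatever your readers use):

```typescript
import Redis from "ioredis";

const redis = new Redis();
const TTL_SECONDS = 300; // safety-net expiry

// Placeholder for the real write; not part of any library.
async function dbUpdate(table: string, id: string, fields: object): Promise<object> {
  // ... UPDATE <table> SET ... WHERE id = ?, then re-read the row ...
  return { id, ...fields };
}

// Write-through: update the row, then overwrite the cached copy,
// so readers always see the current value.
async function updateEntity(table: string, id: string, fields: object) {
  const updated = await dbUpdate(table, id, fields);
  await redis.set(`${table}:${id}`, JSON.stringify(updated), "EX", TTL_SECONDS);
  return updated;
}
```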
An advantage of your own cache is that you can cache data that is not 1-to-1 with rows in the database. You might cache an entire HTML fragment for a drop-down menu, for example, which could be the result of several SQL queries. It can be quite an advantage to cache data that is higher up the "food chain", so to speak, but cache invalidation becomes less straightforward. This approach is best for storing results of queries that don't change often.
Using a read-replica is not a substitute for using a cache. Querying a read-replica still carries the overhead of making a database connection, authentication, SQL query parsing and optimization, locking, and all the other machinery of an RDBMS.
Querying data from a cache can be orders of magnitude faster.
Both have their place. It's best to use both a cache and a read-replica for different tasks. I would also add message queues as an important technology. I believe database, cache, and queue form a three-legged stool.
But you must have experience and judgment to know when each is the best tool for a given case.

Related

handling stale caches with multiple servers

We currently have a website hosted on one server, and we are looking into adding a second server. The main issue is caching. Some items are cached and invalidated when they change. Right now, changes happen in the same process, so the cache can be invalidated directly.
If the website is hosted on two servers, a change can be made on either server and the other will not be notified of it. The cache needs to stay, as it drastically speeds up the website. I would prefer not to move the cache out of process into a cache server, as that drops access from memory speed to network speed and adds complexity to the servers.
The website is implemented in .NET, with MySQL as its backing datastore. My issue is how a process can be notified when data changes. Is it possible for MySQL to automatically notify all registered clients when data changes? I've used RavenDB, which has a similar feature that comes in very handy, but I couldn't find anything similar for MySQL. If this is not possible, any ideas on how to approach this issue?
Distributed caching is a complex topic. It sounds like you are running a more basic in-memory cache; if so, you will need to handle synchronisation yourself, or be happy with "eventual consistency" of the data, assuming you have some mechanism for checking stale keys.
Personally, I would look into memcached (we use Couchbase). Your concern about it becoming a network bottleneck may be unfounded: yes, memory access is faster in absolute terms, but in practice we found Couchbase caching more than fast enough, and it is atomic at the key level. It also handles key distribution over nodes.
As for MySQL pushing notifications to clients: I am not sure, but I don't think so. You could emulate this yourself if you have a layer of code (a DAL, etc.) over database access.
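As a rough sketch of what that emulation could look like (shown in TypeScript even though your stack is .NET, since the idea is language-agnostic; the peer list and the `/cache-invalidate` endpoint are assumptions you would have to build, not an existing feature):

```typescript
// Hypothetical peer list; in practice this would come from configuration
// or service discovery.
const PEERS = ["http://web1.internal:8080", "http://web2.internal:8080"];

// Each server keeps its own in-memory cache.
const localCache = new Map<string, unknown>();

// The DAL write path: perform the write, invalidate locally, then tell
// every peer to drop the key as well (best effort).
async function writeThroughDal(key: string, performWrite: () => Promise<void>) {
  await performWrite(); // the actual INSERT/UPDATE against MySQL
  localCache.delete(key);
  await Promise.allSettled(
    PEERS.map((peer) =>
      fetch(`${peer}/cache-invalidate`, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ key }),
      })
    )
  );
}
```

Each server would expose the `/cache-invalidate` endpoint and simply delete the named key from its local cache; a broadcast that occasionally fails is exactly the "eventual consistency" trade-off mentioned above.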
It is also difficult to reconcile the desire to have the cache follow the same integrity principles as the database; if you achieve that, all you have done is build an in-memory database. Caching is supposed to trade some data accuracy over time for increased scalability.

multiple Rails engines talking to one MySQL server for horizontally scaling application servers

I've seen pictures like this where multiple Rails engines write to a single MySQL server.
1) Is this possible? Or does Rails want each application server to write to one database server?
2) If this is possible, how is it accomplished? Are there queues and a scheduler between the application servers and the write database server?
Scaling a MySQL database is a pretty difficult thing to do, but it's certainly been done plenty of times and there are a lot of best practices out there for you to take advantage of. The first thing you should know is that you probably won't need to worry about scaling writes for a while yet; you need to scale your reads first.
Scaling reads can be done fairly easily using replication, and there are several tools that make managing replication much easier, such as Amazon RDS. Generally speaking, many web servers can connect to many databases (as suggested by others); however, you quickly run into scale issues once you have a lot of traffic, connections, or whatever else generates load on the server.
Since replica servers are read-only, you need to choose which server to connect to depending on the action you're performing. For example, if you had a users table: when creating, updating, or deleting users you need to use the "write" database (the primary "source" server), but when reading the users table you can use one of the read replicas. This reduces load on the primary write server (allowing it to handle even more writes), and since you can put multiple read databases behind a load balancer, you can get away with this structure for a very long time and scale reads across tens of database servers before you hit any significant issues (though most apps get away with 1-3).
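The question is about Rails, but the routing idea is language-agnostic. As a sketch in TypeScript with the mysql2 driver (hosts and credentials are placeholders):

```typescript
import mysql from "mysql2/promise";

// One pool for the primary (writes) and one for a replica (reads).
const primary = mysql.createPool({
  host: "primary.db.internal", user: "app", password: "...", database: "app",
});
const replica = mysql.createPool({
  host: "replica.db.internal", user: "app", password: "...", database: "app",
});

// Reads go to the replica...
async function findUser(id: number) {
  const [rows] = await replica.query("SELECT * FROM users WHERE id = ?", [id]);
  return rows;
}

// ...writes (and any read that must be perfectly fresh) go to the primary.
async function renameUser(id: number, name: string) {
  await primary.query("UPDATE users SET name = ? WHERE id = ?", [name, id]);
}
```

In Rails specifically, newer versions have built-in support for switching connections between a writing role and a reading role, which implements this same pattern for you.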
There are situations where you will need to use the write database for read actions (although you should avoid that as much as possible), because the read replicas can be slightly behind the write database due to replication latency. Most of the time, though, you can write your code knowing that the read database may be delayed (e.g. queue actions for a reasonable period so that updates propagate across all the read servers) and simply use one of your read databases rather than the write database.
Beyond this, the key items to work on are ensuring you have efficient indexes and applying other best practices around maintaining a sensible data structure. You might also consider having three distinct "groups" of database servers. I generally like to have write, read, and "stats" groups: the write group for create, update, and delete operations (as well as SELECT ... FOR UPDATE); the read group for general reads that must return their results quickly; and the stats group for anything that will be under high load and that you don't rely on for a prompt response (this keeps heavy queries that are not time-sensitive away from the read databases you need quick responses from).
Once you get to the point where you can no longer buy larger hardware and you're near maxing out your write capacity, you'll need to look into sharding. But that takes a lot of traffic/data, so don't worry about it until you've done all of the above.

Memcached or MySQL for session storage - PHP

Is it a good idea to use Memcached for session storage with PHP? We will have a lot of servers and must be able to access the session data from everywhere, so we are forced to use either a database (in our case MySQL) or Memcached as session storage. What do you think?
I know people who've used Memcached for this -- it's very fast, certainly a lot faster than a database, and is built to handle a lot more concurrency.
The primary disadvantage of purely in-memory storage is that all your session data will be wiped if/when you restart the daemon. In my experience, memcached is rock-solid and I've never had to restart it because of a failure, but it is a consideration if your sysadmins aren't used to working that way, or if your systems are updated frequently. It also depends on whether losing all your user sessions once a month or once a year is acceptable (in ecommerce, for example, management probably won't like it).
The obvious solution, if that's the case, is to go with one of the many disk-backed NoSQL/hash-table databases, such as MemcacheDB, which is based on Memcached; or see CouchDB, MongoDB, etc. Each of these daemons (including Memcached) is also a lot less complex to performance-tune than MySQL, where all sorts of things like key and sort buffers and the query cache have to be tuned per install/use case. With Memcached, there's not much to do other than allocate memory and start it up.
Personally, I am a fan of using faster, more appropriate (non-SQL) storage for temporary things like session keys, but if your database is not under load and you don't anticipate it to be, the only thing you lose by storing sessions in the database is that it's a little slower, so users see a little more latency.
Whichever way you go, I suggest that you write your session-management code in such a way that the storage engine is just a layer, and you can swap in a different storage engine relatively painlessly. You don't want to be recoding your application if you find memcached or whatever you choose isn't working well, and you want to try something else. For instance, I once wrote a caching system for a clustered CMS application that used memcached to cache various pages and objects, but when the daemon wasn't reachable, it would fail over to alternate backends that would cache to shared memory or disk on the individual webservers. (In your case, you don't necessarily need the auto-failover, just the ability to change your mind about the backend.)
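The question is about PHP, but the layering idea is language-agnostic; here is a sketch in TypeScript of what such a storage layer might look like (the interface and class names are illustrative, not any particular library):

```typescript
// A thin storage interface so the backend can be swapped without touching
// application code.
interface SessionStore {
  get(sessionId: string): Promise<string | null>;
  set(sessionId: string, data: string, ttlSeconds: number): Promise<void>;
  destroy(sessionId: string): Promise<void>;
}

// In-memory fallback: useful for development, or as a failover target
// like the shared-memory backend described above.
class MemoryStore implements SessionStore {
  private data = new Map<string, { value: string; expiresAt: number }>();

  async get(id: string): Promise<string | null> {
    const entry = this.data.get(id);
    if (!entry || entry.expiresAt < Date.now()) return null;
    return entry.value;
  }

  async set(id: string, value: string, ttlSeconds: number): Promise<void> {
    this.data.set(id, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }

  async destroy(id: string): Promise<void> {
    this.data.delete(id);
  }
}
```

A Memcached-backed or MySQL-backed store would implement the same interface, so changing your mind about the backend becomes a one-line change where the store is constructed.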
I mentioned MemcacheDB because it uses the Memcache protocol, so it's extremely easy to swap in Memcached for MemcacheDB or vice versa.

SQL Azure performance considerations

What performance considerations should I keep in mind when planning an SQL Azure application? Azure Storage and the worker and web roles look very scalable, but if in the end they all use one database... it looks like the bottleneck.
I was trying to find numbers on:
How many concurrent connections does SQL Azure support?
What is the bandwidth?
But no luck.
For example, I'm planning an application with a very high rate of inserts, but I need to return the result of an aggregate function each time (e.g. the sum of all records with the same key in a column), so I cannot go with table storage.
Batching is an option, but response time is critical as well, so I'm afraid the database will be flooded with lots of connections.
Sharding is another option, but even though the number of inserts is massive, the amount of data is very small: 4 to 6 columns with one PK and no FKs. So even a 1 GB database would be overkill (and an overpay :D) for a partition.
What performance keys should I keep in mind when facing this kind of application?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question is primarily about scalability, so you may want to design your application so that your data becomes "eventually" consistent, using queues for example: a worker role would listen for incoming insert requests and perform the inserts asynchronously.
To minimize the number of round trips to the database and optimize connection pooling, make sure to batch your inserts as well; you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (Multiple Active Result Sets), so you can return multiple SELECTs in a single batch to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
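As a sketch of the batching side (the `execute` function, table, and column names are placeholders for whatever client and schema you actually use):

```typescript
// Placeholder for your real SQL client; not any particular library.
async function execute(sql: string, params: unknown[]): Promise<void> {
  // ... send the statement to the database ...
}

type Row = [key: string, value: number];
const buffer: Row[] = [];
const BATCH_SIZE = 100;

// Callers enqueue rows; every 100 rows we flush them in a single statement,
// turning 100 round trips into one.
async function queueInsert(row: Row) {
  buffer.push(row);
  if (buffer.length >= BATCH_SIZE) await flush();
}

async function flush() {
  if (buffer.length === 0) return;
  const rows = buffer.splice(0, buffer.length);
  // INSERT INTO readings (k, v) VALUES (?, ?), (?, ?), ...
  const placeholders = rows.map(() => "(?, ?)").join(", ");
  await execute(`INSERT INTO readings (k, v) VALUES ${placeholders}`, rows.flat());
}
```

In practice you would also flush on a timer, so a partial batch doesn't sit in the buffer indefinitely when traffic is low.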
Sharding usually helps for read operations, not so much for inserts (although I never benchmarked inserts with sharding), so I am not sure sharding will help you much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenant environment, where your database is shared with others on the same server. So if you need strong performance with a guaranteed response time, you may need to reevaluate your hosting choices, or indeed test the performance boundaries of Azure for your needs, as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load, but can also happen when your database is physically moved around). Throttling is non-deterministic, meaning you cannot predict if or when it happens. When throttling, SQL Azure will drop your connection, requiring you to retry. The number of connections supported and the bandwidth are not published "by design", due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
If the bursts happen at known times, you might consider sharding just during those bursts and consolidating the data afterwards. Another way to handle this is to start queueing/batching writes if and only if throttling occurs: you can use an Azure Queue for that, plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of engaging automatically when throttling occurs.
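A rough sketch of that overflow path (using the @azure/storage-queue package; `insertDirect` is a placeholder for the normal write path, and 40501 is SQL Azure's well-known "service is currently busy" throttling error code):

```typescript
import { QueueClient } from "@azure/storage-queue";

// Connection string and queue name are placeholders.
const overflow = new QueueClient(
  process.env.AZURE_STORAGE_CONNECTION_STRING!,
  "insert-overflow"
);

// Placeholder for the normal, direct insert path.
async function insertDirect(row: object): Promise<void> {
  // ... INSERT against SQL Azure ...
}

// Error 40501: "The service is currently busy" (throttling).
function isThrottle(err: unknown): boolean {
  return (err as { number?: number })?.number === 40501;
}

async function insertWithOverflow(row: object) {
  try {
    await insertDirect(row);
  } catch (err) {
    if (!isThrottle(err)) throw err;
    // Throttled: park the row on the queue; a worker role drains it later.
    await overflow.sendMessage(JSON.stringify(row));
  }
}
```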
As an alternative, you could use Azure Table Storage and keep a separate table of running totals that you report back, instead of performing an aggregation over the data to return the required sum each time (this might be tricky due to the lack of locking on the tables, though).
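The usual workaround for that lack of locking is optimistic concurrency on the entity's etag. A sketch with the @azure/data-tables package (the connection string, table, and schema are assumptions):

```typescript
import { TableClient } from "@azure/data-tables";

// Placeholders: connection string and table name are assumptions.
const totals = TableClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING!,
  "RunningTotals"
);

// Add `amount` to the running total for `key`, retrying on etag conflicts.
async function addToTotal(key: string, amount: number) {
  for (;;) {
    const entity = await totals.getEntity<{ sum: number }>("totals", key);
    try {
      await totals.updateEntity(
        { partitionKey: "totals", rowKey: key, sum: entity.sum + amount },
        "Replace",
        { etag: entity.etag } // fail if someone updated the row since we read it
      );
      return;
    } catch (err) {
      // 412 Precondition Failed: another writer got there first; re-read and retry.
      if ((err as { statusCode?: number }).statusCode !== 412) throw err;
    }
  }
}
```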
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.

sql caching disadvantage?

I have a web server with a lot of websites performing many database operations, and I am trying SQL caching as a way to improve the performance of my server.
In general, is there any disadvantage to SQL caching in a typical environment?
Thanks
Well, caching consumes RAM, so you'll need plenty of that.
I'm not sure what caching mechanism SQL Server employs, but it is possible that your queries will return stale data for some time.
Your best option for improving performance is to load as much data into RAM as possible, rather than caching.
The main problem with caching in a normal environment is cache expiration and stale data.
If you invalidate your cache every time data changes, you could end up rarely or never hitting the cache.
If you try to invalidate just the part of the cache that is changed, you have extra processing time to determine what to invalidate.
If you do not invalidate the cache, or rely only on cache timers, you may end up with stale data.
Depending on your environment and your requirements, you need to pick which solution best meets your needs. Sometimes it is ok to have some stale data, and in other applications it is not.
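To make those trade-offs concrete, here is a toy sketch showing both levers: a TTL that bounds how stale an entry can get, and explicit invalidation for writers that know what they changed (all names are illustrative):

```typescript
type Entry = { value: unknown; expiresAt: number };
const cache = new Map<string, Entry>();

// TTL lever: an expired entry counts as a miss, bounding staleness.
function getCached(sql: string): unknown | undefined {
  const entry = cache.get(sql);
  if (!entry) return undefined;
  if (entry.expiresAt < Date.now()) {
    cache.delete(sql);
    return undefined;
  }
  return entry.value;
}

function putCached(sql: string, value: unknown, ttlMs = 30_000) {
  cache.set(sql, { value, expiresAt: Date.now() + ttlMs });
}

// Invalidation lever: drop everything touching a table. Coarse (it can evict
// entries that didn't actually change), but cheap to reason about.
function invalidateTable(table: string) {
  for (const key of cache.keys()) {
    if (key.includes(table)) cache.delete(key);
  }
}
```

The coarse table-level invalidation here is the flip side of the point above: invalidating precisely costs extra processing time to determine what changed.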
All the above points are valid. Invalidating stale cache entries is a key concern, as is keeping a local cache in sync across multiple servers. You may want to look into a grid cache (e.g. Hazelcast, Memcached) and Heimdall Data; Heimdall acts as a transparent cache with invalidation logic built in.
In summary, SQL caching itself is a good thing to do: it increases performance and can buffer SQL traffic away from the database, which helps with scaling.