We currently have a website hosted on one server, and we are looking into adding a new server. The main issue is about caching. Some items are cached based on when they are changed. However right now, they are changed in the same process, hence the cache can be invalidated.
If the website is hosted on two servers, the changes can be done on both servers and they will not be notified of such changes. The cache needs to remain as it drastically speeds up the website. I would prefer if the cache is not taken out-of-process in a cache-server, as it slows down to the speed of network rather than memory, and adds complexity to the servers.
The website is implemented in .Net, with MySQL as it's backing datastore. My issue is how the process can be notified when data changes. Is it possible that MySQL will automatically notify all registered clients when any data changes? I've used RavenDb, which has a similar feature which comes in very handy. I couldn't find anything similar for MySQL. If this is not possible, any ideas how one would approach this issue?
Distributed caching is a complex topic. It sounds like you are running a more basic in-memory cache. If this is the case, you will need to handle synchronisation yourself, or be happy with "eventual consistency" of the data, assuming you have some stale key checking mechanism.
Personally I would look into using memcached (we use Couchbase). Your opinion on this becoming a network bottleneck may be unrealised, although yes in real terms memory access is faster. In practical terms, we noticed that Couchbase caching was more than fast enough, and it is atomic at the key level. It will handle key distribution over nodes.
As for MySQL pushing notifications to clients, I am not sure but I don't think so. You could emulate this yourself if you have a layer of code (DAL etc) over database access.
It is also difficult to reconcile the desire to have the cache follow the same integrity principles as the database. If you achieve this then all you have done is made an in-memory database. Caching is supposed to be a trade-off of data accuracy over time to increase scalability.
Related
We have an enterprise application that uses an SQL database. The database access characteristics are about 90% reads. The data that does get updated or created needs to be up-to-date immediately. The cache needs to be correctly invalidated with high certainty. The entities are referred to by their primary key for 98% of the cases.
The application is based on Node.js and is AWS-native. Since the application is AWS-native, I'd like to rely on managed services from AWS rather than hosting my own. One option is to implement our read-through Redis-based cache. Upon retrieving the entities, we'd check the cache and if the data is not cached we'd put it into the cache before turning it to the user. The parts of the code that update those entities will invalidate the cache by primary key.
Generally speaking, in computer science cache coherency is one of the most challenging problems to get right. I am of the opinion that rather than implementing a Redis cache and thinking through all of the possible scenarios for correctly invalidating it, it is wiser to instead configure an Aurora read-replica specifically for reading frequently accessed entities. The RDBMS will do a much better job at caching than anything we can build ourselves.
So, I am facing two options -- go through the effort of implementing my own caching, or use read replicas. My personal opinion is to use a read replica.
Any advice is greatly appreciated, as always.
Yes, you're right, cache invalidation is a tough problem. The simplest solution is to add code to your data writes, to replace the cached values. So they're always current. But this is easy only if the cached values have a pretty much 1-to-1 correlation with rows in your database.
An advantage of your own cache is that you can cache data that is not 1-to-1 with rows of data in the database. You might cache an entire HTML fragment for a drop-down menu for example. That could be the result of several SQL queries. It could be quite an advantage to cache data that is higher up the "food chain" so to speak. But cache invalidation becomes less straightforward. Best for storing results of queries that don't change often.
Using a read-replica is not a substitute for using a cache. Querying a read-replica still has overhead of making a database connection, authentication, SQL query parsing and optimization, locking, and all the other overhead that goes into RDBMS workings.
Querying data from a cache can be orders of magnitude faster.
Both have their place. It's best to use both a cache and a read-replica for different tasks. I would also add message queues as an important technology. I believe database, cache, and queue form a three-legged stool.
But you must have experience and judgment to know when each is the best tool for a given case.
My partner and I are trying to start a website hosted in cloud. It has pretty heavy ajax traffic and the backend handles money transactions so we need ACID in some of the DB tables.
Currently everything is running off a single server. Some of the AJAX traffic are cached in text files.
Question:
What's the best way to scale the database server? I thought about moving mysql to separate instances and do master-master duplication. However this seems tough and I heard I might lose ACID properties even with InnoDB? Is Amazon RDS a good solution?
The web server is relatively stateless except for some custom log files and the ajax cache files. What's a good way to scale to multiple web servers? I guess the custom log files can be moved to a reliable shared file system or DB but not sure what to do about the AJAX cache file coherency across multiple servers. (I dont care about losing /var/log/* if web server dies)
For performance it might be cheaper to go with larger instance with more cores and memory but eventually I would need redundancy so wondering what's the best way to do this cheaply.
thanks
take a look at this post. there is plenty of presentations on the net discussing scalability. few things i suggest to keep in mind:
plan early for the data sharding [even if you are not going to do it immediately]
try using mechanisms like memcached to limit number of queries sent to the database
prepare to serve static content from other domain, in the longer run - from ngin-x-alike server and later CDN
redundancy - depends on your needs. is 'read-only' mode acceptable for your site? if so - go with mysql replication + rsync of static files and in case of failover have your site work in that mode till you recover the master node. if you need high availability - then take a look either at drbd replication [at least for mysql] or setup with automated promotion of slave server to become master node.
you might find following interesting:
http://yoshinorimatsunobu.blogspot.com/2011/08/mysql-mha-support-for-multi-master.html
http://mysqlperformanceblog.com
http://highscalability.com
http://google.com - search for scalability, lamp, failover... there are tones of case studies and horror stories from the trench lines :-]
Another option is using a scaleable platform such as Amazon Web Services. You can start out with a micro instance and configure load balancing to fire up more instances as needed.
Once you determine average resource requirements you can then resize your image to larger or smaller depending on your needs.
http://aws.amazon.com
http://tuts.pinehead.tv/2011/06/26/creating-an-amazon-ec2-instance-with-linux-lamp-stack/
http://tuts.pinehead.tv/2011/09/11/how-to-use-amazon-rds-relation-database-service-to-host-mysql/
Amazon allows you to either load balance or change instance size based off demand.
Is it good idea to use Memcached for session storage with PHP? We will have a lot of servers and we must access the session data from everywhere so we are forced to use database (in our case that will be MySQL) as session storage or Memcached. What do you think?
I know people who've used Memcached for this -- it's very fast, certainly a lot faster than a database, and is built to handle a lot more concurrency.
The primary disadvantage to purely in-memory storage is that all your session data will be wiped if/when you restart the daemon. In my experience, memcached is rock-solid and I've never had to restart it because of a failure, but it is a consideration if your sysadmins aren't used to working that way, or if your systems are updated frequently. It also depends on whether losing all your user sessions once a month or year is acceptable or not (i.e. in ecommerce, management probably won't like this).
The obvious solution, if that's the case, is to go to one of the many disk-based NoSQL/hash table databases, such as MemcacheDB, which is based off of Memcached. Or see: CouchDB, MongoDB etc. Each of these daemons (including Memcached) is also a lot less complex when it comes to performance tuning than MySQL (where all sorts of things like key and sort buffers, query cache etc. have to be tuned per install/use case) -- I mean, with Memcached there's not much to do other than to allocate memory and start it up.
Personally, I am a fan of using faster, more appropriate (non-SQL) storage for temporary things like session keys, but if your database is not under load and you don't anticipate it to be, the only thing you lose by storing sessions in the database is that it's a little slower, so users see a little more latency.
Whichever way you go, I suggest that you write your session-management code in such a way that the storage engine is just a layer, and you can swap in a different storage engine relatively painlessly. You don't want to be recoding your application if you find memcached or whatever you choose isn't working well, and you want to try something else. For instance, I once wrote a caching system for a clustered CMS application that used memcached to cache various pages and objects, but when the daemon wasn't reachable, it would fail over to alternate backends that would cache to shared memory or disk on the individual webservers. (In your case, you don't necessarily need the auto-failover, just the ability to change your mind about the backend.)
I mentioned MemcacheDB because it uses the Memcache protocol, so it's extremely easy to swap in Memcached for MemcacheDB or vice versa.
Which are the performance considerations I should keep in mind when I'm planning an SQL Azure application? Azure Storage, and the worker and the web roles looks very scalable, but if at the end they are using one database... it looks like the bottleneck.
I was trying to find numbers about:
How many concurrent connections does
SQL Azure support?
Which is the bandwidth?
But no luck.
For example, I'm planning and application that uses a very high level of inserts, but I need return the result of an aggregate function each time (e.g.: the sum of all records with same key in a column), so I can not go with table storage.
Batching is an option, but time response is critical as well, so I'm afraid the database will be bloated with lot of connections.
Sharding is another option, but even when the amount of inserts is massive, the amount of data is very small, 4 to 6 columns with one PK and no FK. So even a 1Gb DB would be an overkill (and an overpay :D) for a partition.
Which would be the performance keys I should keep in mind when I'm facing these kind of applications?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question was primarily about scalability, so you may want to design your application in such a way that your data becomes "eventually" consistent, using queues for example. A worker role would listen for incoming insert requests and would perform the insert asynchronously.
To minimize the number of roundtrips to the database and optimize connection pooling make sure to batch your inserts as well. So you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (multiple active recordsets) so that you can return multiple SELECTs in a single batch back to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
Sharding usually helps for Read operations; not so much for inserts (although I never benchmarked inserts with sharding). So I am not sure sharding will help you that much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenancy environment, where your database is shared with others on the same server. So if you need strong performance with guaranteed response time you may need to reevaluate your hosting choices or indeed test the performance boundaries of Azure for your needs as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load but might also occur when your database is physically moved around). Throttling is non-deterministic, meaning that you cannot predict if and when this happens. When throttling, SQL Azure will drop your connection, requiring you to perform a retry. Number of connections supported and bandwidth is not published "by design" due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
If the bursts happen at a known time, you might consider sharding just during those bursts and consolidating the data after the burst has happened. Another way to handle this, is to start queueing/batching writes if and only if throttling occurs. You can use an Azure Queue for that plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of automatically engaging if throttling occurs.
As an alternative you could use Azure Table Storage and keep a separate table of running totals that you can report back instead of performing an aggregation over the data to return the required sum of all records (this might be tricky due to the lack of locking on the tables though).
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.
I have a web server with a lot of web sites with many database operations, and i am tryng sql caching as a way to improve the performance of my server.
In general, is there any disadvantage about sql caching in a common environment?
Thanks
Well, caching consumes RAM memory, so you'll need plenty of that.
I'm not sure about what caching mechanism SQL server employs, but it might be possible that your queries return stale data for some time.
Your best options of performance improvement is to load as much data into RAM as possible instead of caching.
The main problem with caching in a normal environment is cache expiration and stale data.
If you invalidate your cache every time data changes, you could end up rarely or never hitting the cache.
If you try to invalidate just the part of the cache that is changed, you have extra processing time to determine what to invalidate.
If you do not invalidate the cache or have cache timers, you may end up with stale data.
Depending on your environment and your requirements, you need to pick which solution best meets your needs. Sometimes it is ok to have some stale data, and in other applications it is not.
All the above points are valid. Invalidation of stale cache entries would be a key concern as well as syncing local cache across multiple servers. You may want to look into a grid cache (e.g. Hazelcast, mem-cache) and Heimdall Data. Heimdall acts as a transparent cache and provides invalidation logic built in.
In summary, sql caching itself is a good thing to do. It increases performance and can buffer sql traffic away from the database allowing scaling benefits.