I have a web server hosting a lot of web sites with many database operations, and I am trying SQL caching as a way to improve the performance of my server.
In general, is there any disadvantage to SQL caching in a typical environment?
Thanks
Well, caching consumes RAM, so you'll need plenty of that.
I'm not sure what caching mechanism SQL Server employs, but it is possible that your queries will return stale data for some time.
Your best option for performance improvement is to load as much data into RAM as possible instead of caching.
The main problem with caching in a normal environment is cache expiration and stale data.
If you invalidate your cache every time data changes, you could end up rarely or never hitting the cache.
If you try to invalidate just the part of the cache that is changed, you have extra processing time to determine what to invalidate.
If you do not invalidate the cache or have cache timers, you may end up with stale data.
Depending on your environment and your requirements, you need to pick which solution best meets your needs. Sometimes it is ok to have some stale data, and in other applications it is not.
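To make the trade-off concrete, here is a minimal sketch in PHP using the Memcached extension; the table, key names, and TTL are illustrative assumptions:

```php
<?php
// Two invalidation strategies, sketched with PHP's Memcached extension.
// The table, key names, and TTL are illustrative assumptions.

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// Strategy 1: time-based expiry. Simple, but readers may see data
// that is up to the TTL seconds stale.
function getProduct(Memcached $cache, PDO $db, int $id): array
{
    $key = "product:$id";
    $row = $cache->get($key);
    if ($row === false) {
        $stmt = $db->prepare('SELECT * FROM products WHERE id = ?');
        $stmt->execute([$id]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        $cache->set($key, $row, 300); // accept up to 5 minutes of staleness
    }
    return $row;
}

// Strategy 2: explicit invalidation on write. Always fresh, but every
// code path that updates a product must remember to delete the key.
function updateProductPrice(Memcached $cache, PDO $db, int $id, float $price): void
{
    $stmt = $db->prepare('UPDATE products SET price = ? WHERE id = ?');
    $stmt->execute([$price, $id]);
    $cache->delete("product:$id");
}
```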
All the above points are valid. Invalidation of stale cache entries is a key concern, as is syncing a local cache across multiple servers. You may want to look into a grid cache (e.g. Hazelcast, Memcached) or Heimdall Data, which acts as a transparent cache and provides built-in invalidation logic.
In summary, SQL caching itself is a good thing to do. It increases performance and can buffer SQL traffic away from the database, allowing scaling benefits.
We have an enterprise application that uses an SQL database. The database access characteristics are about 90% reads. The data that does get updated or created needs to be up-to-date immediately. The cache needs to be correctly invalidated with high certainty. The entities are referred to by their primary key for 98% of the cases.
The application is based on Node.js and is AWS-native. Since it is AWS-native, I'd like to rely on managed services from AWS rather than hosting my own. One option is to implement our own read-through Redis-based cache: upon retrieving an entity, we'd check the cache, and if the data is not cached we'd put it into the cache before returning it to the user. The parts of the code that update those entities would invalidate the cache by primary key.
Generally speaking, in computer science cache coherency is one of the most challenging problems to get right. I am of the opinion that rather than implementing a Redis cache and thinking through all of the possible scenarios for correctly invalidating it, it is wiser to instead configure an Aurora read-replica specifically for reading frequently accessed entities. The RDBMS will do a much better job at caching than anything we can build ourselves.
So, I am facing two options -- go through the effort of implementing my own caching, or use read replicas. My personal opinion is to use a read replica.
Any advice is greatly appreciated, as always.
Yes, you're right, cache invalidation is a tough problem. The simplest solution is to add code to your data writes to replace the cached values, so they're always current. But this is easy only if the cached values have a pretty much 1-to-1 correlation with rows in your database.
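A minimal sketch of that write-through approach, in PHP for consistency with the other examples here (the same pattern applies in your Node.js/Redis stack); the table and key names are made up:

```php
<?php
// Write-through on the write path: the writer refreshes the cache itself,
// so the next reader gets a hit rather than a miss. This only stays simple
// when a cache entry maps 1-to-1 onto a database row. Names are made up.
function saveEntity(Memcached $cache, PDO $db, int $id, string $payload): void
{
    $stmt = $db->prepare('UPDATE entities SET payload = ? WHERE id = ?');
    $stmt->execute([$payload, $id]);
    // Replace the entry instead of deleting it, so readers never miss.
    $cache->set("entity:$id", $payload);
}
```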
An advantage of your own cache is that you can cache data that is not 1-to-1 with rows of data in the database. You might cache an entire HTML fragment for a drop-down menu for example. That could be the result of several SQL queries. It could be quite an advantage to cache data that is higher up the "food chain" so to speak. But cache invalidation becomes less straightforward. Best for storing results of queries that don't change often.
Using a read-replica is not a substitute for using a cache. Querying a read-replica still has overhead of making a database connection, authentication, SQL query parsing and optimization, locking, and all the other overhead that goes into RDBMS workings.
Querying data from a cache can be orders of magnitude faster.
Both have their place. It's best to use both a cache and a read-replica for different tasks. I would also add message queues as an important technology. I believe database, cache, and queue form a three-legged stool.
But you must have experience and judgment to know when each is the best tool for a given case.
We currently have a website hosted on one server, and we are looking into adding a second server. The main issue is caching. Some items are cached and invalidated when they change. Right now, changes happen in the same process, so the cache can be invalidated directly.
If the website is hosted on two servers, a change can be made on either server, and the other will not be notified of it. The cache needs to remain, as it drastically speeds up the website. I would prefer not to move the cache out of process into a cache server, as that slows access down to network speed rather than memory speed, and adds complexity to the servers.
The website is implemented in .NET, with MySQL as its backing datastore. My issue is how the process can be notified when data changes. Can MySQL automatically notify all registered clients when data changes? I've used RavenDB, which has a similar feature that comes in very handy, but I couldn't find anything like it for MySQL. If this is not possible, any ideas on how to approach this issue?
Distributed caching is a complex topic. It sounds like you are running a more basic in-memory cache. If this is the case, you will need to handle synchronisation yourself, or be happy with "eventual consistency" of the data, assuming you have some stale key checking mechanism.
Personally I would look into using memcached (we use Couchbase). Your concern about the network becoming a bottleneck may not materialise in practice; yes, in raw terms memory access is faster, but we found Couchbase caching more than fast enough, and it is atomic at the key level. It will also handle key distribution over nodes.
As for MySQL pushing notifications to clients: I am not sure, but I don't think it can. You could emulate this yourself if you have a layer of code (a DAL, etc.) over database access.
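One way to emulate it at the DAL layer is a generation-key scheme: a small shared counter per table that every write bumps, so each server's in-process cache entries for the old generation simply stop being looked up. A hedged sketch in PHP (your site is .NET, but the idea carries over); all names are illustrative:

```php
<?php
// Generation-key sketch: a tiny shared memcached instance holds one counter
// per table; every write bumps it. Each server keys its own in-process cache
// on the current generation, so outdated entries are never looked up again.
// Class and key names are illustrative, not an existing API.
class Dal
{
    public function __construct(private Memcached $shared, private PDO $db) {}

    private function generation(string $table): int
    {
        $gen = $this->shared->get("gen:$table");
        return $gen === false ? 0 : (int)$gen;
    }

    // Build the key the local (in-process) cache should use right now.
    public function cacheKey(string $table, int $id): string
    {
        return sprintf('%s:%d:g%d', $table, $id, $this->generation($table));
    }

    public function update(string $table, int $id, array $fields): void
    {
        // ... issue the UPDATE against $this->db here ...
        if ($this->shared->increment("gen:$table") === false) {
            $this->shared->add("gen:$table", 1); // first write creates the counter
        }
    }
}
```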
It is also difficult to reconcile the desire to have the cache follow the same integrity principles as the database. If you achieve that, all you have done is build an in-memory database. Caching is supposed to trade some data accuracy over time in exchange for scalability.
I have a number of sites with PHP and MySQL, especially running MediaWiki, and I need to enhance the performance. However, I have only a limited percentage of CPU that I'm allowed to use.
The best thing I can think about to improve performance is to enable caching. However, I'm confused: Does that really enhance performance overall or just enhance speed?
My thinking is: if caching uses files, then it takes extra processing to read the contents of those files; if it uses SQL tables, then it takes extra processing to query those tables as well. Perhaps the time will be shorter, but the CPU usage will be higher.
Is that correct or not? Does caching consume more CPU to give speedier results, or does it improve performance overall?
At the most basic level, caching should be used to store the result of CPU-intensive processes. For example, if you have a server-side image handler that creates an image on the fly (say a thumbnail and a larger preview), you don't want this operation to occur on every request. You'd want to run the process once and store the result; then every other request gets the saved result.
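A hedged sketch of that image example in PHP with GD; the cache path and width are assumptions:

```php
<?php
// Generate a thumbnail once and reuse the stored file on every later request.
// Uses GD; the cache path and width are illustrative assumptions.
function thumbnail(string $src, int $w = 150): string
{
    $cached = sys_get_temp_dir() . '/thumb_' . md5($src . $w) . '.jpg';
    if (!file_exists($cached)) {
        [$ow, $oh] = getimagesize($src);
        $h   = (int)($oh * $w / $ow);
        $in  = imagecreatefromjpeg($src);      // the CPU-intensive part...
        $out = imagecreatetruecolor($w, $h);
        imagecopyresampled($out, $in, 0, 0, 0, 0, $w, $h, $ow, $oh);
        imagejpeg($out, $cached);              // ...runs exactly once
    }
    return $cached; // every subsequent request is a cheap file read
}
```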
This is obviously a hugely over-simplified description of basic caching, and the image example is convenient because you don't have to worry about stale data, i.e. how often will the actual image change? Databases are hugely different: if you cache data, how can you guarantee there won't be a momentary mismatch between your real data and your cached data? Also, querying a database is not always a CPU-intensive task (granted, you have to consider how the database is designed in terms of indexing, table size, etc.), but in most cases querying a well-designed database is far more intensive on disk I/O than on CPU cycles.
First, look at your database design, and second, your queries. For example: are you normalizing your database correctly? Are your queries trawling through huge amounts of data that you could archive instead? Are you joining tables on non-indexed fields? Are your WHERE clauses querying fields that could be indexed (IN is particularly bad in these cases)?
I recommend you get hold of a query analyzer and spend some time optimizing your table structure and queries to find the bottleneck before looking into more drastic changes.
Reference: http://msdn.microsoft.com/en-us/library/ee817646.aspx
Performance: Caching techniques are commonly used to improve application performance by storing relevant data as close as possible to the data consumer, thus avoiding repetitive data creation, processing, and transportation.
For example, storing data that does not change, such as a list of countries, in a cache can improve performance by minimizing data access operations and eliminating the need to recreate the same data for each request.
Scalability: The same data, business functionality, and user interface fragments are often required by many users and processes in an application. If this information is processed for each request, valuable resources are wasted recreating the same output. Instead, you can store the results in a cache and reuse them for each request. This improves the scalability of your application because, as the user base increases, the demand for server resources for these tasks remains constant.
For example, in a Web application the Web server is required to render the user interface for each user request. You can cache the rendered page in the ASP.NET output cache to be used for future requests, freeing resources to be used for other purposes.
Caching data can also help scale the resources of your database server. By storing frequently used data in a cache, fewer database requests are made, meaning that more users can be served.
Availability: Occasionally the services that provide information to your application may be unavailable. By storing that data in another place, your application may be able to survive system failures such as network latency, Web service problems, or hardware failures.
For example, each time a user requests information from your data store, you can return the information and also cache the results, updating the cache on each request. If the data store then becomes unavailable, you can still service requests using the cached data until the data store comes back online.
You need to profile your system and find out where the bottlenecking is happening. The best kind of page load is a cached one, because it barely touches the server at all. You can build a very simple caching system that only regenerates the content every 15 minutes: the first load renders the page and writes it to a temp file, and for the next 15 minutes anyone requesting that page gets the pre-rendered copy; after that, the next request creates a new one.
Caching only stores a file that the server has already done the work for. The work to create the file is already done, and you're simply storing it.
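A minimal PHP version of that 15-minute file cache; the temp-file path is an illustrative assumption:

```php
<?php
// Minimal 15-minute file cache. $render stands in for whatever builds the
// page; the temp-file path is an illustrative assumption.
function cachedPage(string $name, callable $render, int $ttl = 900): string
{
    $file = sys_get_temp_dir() . '/page_' . md5($name) . '.html';
    if (file_exists($file) && time() - filemtime($file) < $ttl) {
        return file_get_contents($file); // rendered within the last 15 minutes
    }
    $html = $render();                   // do the real work once...
    file_put_contents($file, $html);     // ...and store the result
    return $html;
}
```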
You use the terms 'performance' and 'speed'. I'll assume 'performance' relates to CPU cycles on your web server and that 'speed' relates to the time it takes to serve the page to the user. You want to maximize web server 'performance' (by lowering the total number of CPU cycles needed to serve pages) whilst maximizing 'speed' (lowering the time it takes to serve a web page).
The good news for you is that caching can improve both of these metrics at the same time. By caching content you create an output page that is stored in the cache and can be served repeatedly to users directly, without having to re-execute the PHP code that originally created it (thus lowering CPU cycles). Fetching a cached page from cache consumes fewer CPU cycles than re-executing PHP code.
Caching is particularly good for web pages that are generally the same for all users who request the page - for example in a wiki, and for pages that generally do not change all too often - again, a wiki.
"Enhance performance" sounds like some of the email I get...
There are two interrelated things happening here. One is "how long does it take to serve a given request?", and the other is "how many requests can I serve concurrently, given my limited resources?". People tend to use either or both of those concepts when talking about performance.
Caching can help with both those things.
The most effective caching strategy uses resources outside your machines to cache your stuff; the most obvious examples are the user's browser and a CDN. I'll assume you can't use a CDN, but by spending a bit of effort on setting the HTTP cache headers, you can quite dramatically reduce the number of requests to your server for static or rarely-changing resources.
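In PHP that can be as little as sending the right headers before the body; the max-age value here is an assumption, so tune it to how often the resource actually changes:

```php
<?php
// Tell browsers (and any CDN) to keep this response for a day, so repeat
// requests never reach your PHP code at all.
header('Cache-Control: public, max-age=86400');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime(__FILE__)) . ' GMT');
```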
For dynamic content - usually the web page you generate by querying your database - the next most effective caching strategy is to cache the HTML generated by (parts of) your page. For instance, if you have a "most popular items" box on your homepage, this will usually run a couple of moderately complex database queries, and then some "turn data to HTML" back-end code. If you can cache the HTML, you save both the database queries and the CPU effort of turning the data into HTML.
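A sketch of that fragment-caching idea in PHP using APCu; the query, cache key, and 60-second refresh window are assumptions:

```php
<?php
// Cache the rendered HTML of the box, not the raw rows, so a hit skips both
// the queries and the templating. Uses APCu; the query, cache key, and
// 60-second window are illustrative assumptions.
function popularItemsBox(PDO $db): string
{
    $html = apcu_fetch('fragment:popular_items');
    if ($html === false) {
        $rows = $db->query('SELECT name, url FROM items ORDER BY views DESC LIMIT 5')
                   ->fetchAll(PDO::FETCH_ASSOC);
        $html = '<ul>';
        foreach ($rows as $r) {
            $html .= '<li><a href="' . htmlspecialchars($r['url']) . '">'
                   . htmlspecialchars($r['name']) . '</a></li>';
        }
        $html .= '</ul>';
        apcu_store('fragment:popular_items', $html, 60); // refresh at most once a minute
    }
    return $html;
}
```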
If that's not possible, you may be able to cache the result of some database queries. That reduces the database load, and usually also the load on your web server: the code required to run the database query and deal with the results is usually more onerous than retrieving the item from cache, and because it's faster, the request is handled quicker, which frees up resources sooner. This lowers the cost of an individual request and thus allows you to serve more concurrent requests.
I think they are both great, and I want to know: is it possible to use both caching systems together at the same time?
Does anyone have experience with this?
MySQL's working set will always be cached by the database itself, so anything you also hold in an application-level cache exists twice in RAM.
This is called double caching, and it represents a major problem in larger systems (Facebook, for example), where user data is double-cached.
Is it a good idea to use Memcached for session storage with PHP? We will have a lot of servers and must be able to access the session data from everywhere, so we are forced to use either a database (in our case MySQL) or Memcached as session storage. What do you think?
I know people who've used Memcached for this -- it's very fast, certainly a lot faster than a database, and is built to handle a lot more concurrency.
The primary disadvantage to purely in-memory storage is that all your session data will be wiped if/when you restart the daemon. In my experience, memcached is rock-solid and I've never had to restart it because of a failure, but it is a consideration if your sysadmins aren't used to working that way, or if your systems are updated frequently. It also depends on whether losing all your user sessions once a month or year is acceptable or not (i.e. in ecommerce, management probably won't like this).
The obvious solution, if that's the case, is to go with one of the many disk-backed NoSQL/hash-table databases, such as MemcacheDB, which is based on Memcached; or see CouchDB, MongoDB, etc. Each of these daemons (including Memcached) is also a lot less complex to performance-tune than MySQL (where all sorts of things like key and sort buffers, the query cache, etc. have to be tuned per install/use case). I mean, with Memcached there's not much to do other than allocate memory and start it up.
Personally, I am a fan of using faster, more appropriate (non-SQL) storage for temporary things like session keys, but if your database is not under load and you don't anticipate it to be, the only thing you lose by storing sessions in the database is that it's a little slower, so users see a little more latency.
Whichever way you go, I suggest that you write your session-management code in such a way that the storage engine is just a layer, and you can swap in a different storage engine relatively painlessly. You don't want to be recoding your application if you find memcached or whatever you choose isn't working well, and you want to try something else. For instance, I once wrote a caching system for a clustered CMS application that used memcached to cache various pages and objects, but when the daemon wasn't reachable, it would fail over to alternate backends that would cache to shared memory or disk on the individual webservers. (In your case, you don't necessarily need the auto-failover, just the ability to change your mind about the backend.)
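For example, here is a hedged sketch of such a swappable layer built on PHP's SessionHandlerInterface (the memcached extension also ships a native handler you can enable with session.save_handler = memcached); the key prefix and TTL are assumptions:

```php
<?php
// A swappable session backend via PHP's SessionHandlerInterface (PHP 8
// signatures). Swapping to a database-backed class later means changing one
// line. The key prefix and TTL are illustrative assumptions.
class MemcachedSessionHandler implements SessionHandlerInterface
{
    public function __construct(private Memcached $mc, private int $ttl = 1440) {}

    public function open(string $path, string $name): bool { return true; }
    public function close(): bool { return true; }

    public function read(string $id): string|false
    {
        $data = $this->mc->get("sess:$id");
        return $data === false ? '' : $data; // '' means "no session yet"
    }

    public function write(string $id, string $data): bool
    {
        return $this->mc->set("sess:$id", $data, $this->ttl);
    }

    public function destroy(string $id): bool
    {
        $this->mc->delete("sess:$id");
        return true;
    }

    public function gc(int $max_lifetime): int|false
    {
        return 0; // memcached expires keys itself
    }
}

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
session_set_save_handler(new MemcachedSessionHandler($mc), true);
session_start();
```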
I mentioned MemcacheDB because it uses the Memcache protocol, so it's extremely easy to swap in Memcached for MemcacheDB or vice versa.