Does caching always enhance performance?

I have a number of PHP/MySQL sites, mostly running MediaWiki, and I need to improve their performance. However, I am only allowed to use a limited percentage of CPU.
The best idea I can come up with is to enable caching. However, I'm confused: does that really enhance performance overall, or just speed?
My concern is this: if the cache uses files, then it takes extra processing to read the content of those files. If it uses SQL tables, then it takes extra processing to query those tables as well; the response time may be shorter, but the CPU usage may be higher.
Is that correct or not? Does caching consume more CPU to give speedier results, or does it improve performance overall?

At the most basic level, caching should be used to store the results of CPU-intensive processes. For example, if you have a server-side image handler that creates an image on the fly (say a thumbnail and a larger preview), you don't want this operation to occur on every request - you'd want to run the process once, store the result, and serve the saved result to every subsequent request.
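For instance, a minimal PHP sketch of that thumbnail idea (the file names and the 200x200 size are illustrative assumptions):

    <?php
    // Serve a cached thumbnail if it exists; otherwise generate and store it.
    $source = 'images/photo.jpg';
    $thumb  = 'cache/photo_200x200.jpg';

    if (!file_exists($thumb)) {
        // The expensive work happens only once, on the first request.
        $img     = imagecreatefromjpeg($source);
        $resized = imagescale($img, 200, 200);
        imagejpeg($resized, $thumb);
        imagedestroy($img);
        imagedestroy($resized);
    }

    header('Content-Type: image/jpeg');
    readfile($thumb); // every other request gets the saved result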
This is obviously a hugely over-simplified description of basic caching, and an image is a convenient example because you don't have to worry about stale data - how often will the actual image change? Databases are very different in your case: if you cache data, how can you guarantee there won't be a momentary mismatch between your real data and your cached data? Also, querying a database is not always a CPU-intensive task (granted, you have to consider how the database is designed in terms of indexing, table size, etc.), but in most cases querying a well-designed database is far more intensive on disk I/O than on CPU cycles.
First look at your database design, and then at your queries. For example: are you normalizing your database correctly? Are your queries trawling through huge amounts of data when you could just archive? Are you joining tables on non-indexed fields? Are your WHERE clauses filtering on fields that could be indexed (IN is particularly bad in these cases)?
I recommend you get hold of a query analyzer and spend some time optimizing your table structure and queries to find the bottleneck before looking into more drastic changes.
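As a concrete starting point, MySQL's EXPLAIN shows whether a query can use an index. A small PDO sketch (the connection details, table, and column are illustrative - adjust for your schema):

    <?php
    // Run EXPLAIN on a suspect query to see how MySQL executes it.
    $pdo  = new PDO('mysql:host=localhost;dbname=wiki', 'user', 'pass');
    $stmt = $pdo->query("EXPLAIN SELECT * FROM page WHERE page_title = 'Main_Page'");
    foreach ($stmt as $row) {
        // A NULL 'key' combined with a large 'rows' estimate usually means
        // a full table scan - a candidate for a new index.
        echo ($row['key'] ?? 'no index'), ' / rows examined: ', $row['rows'], "\n";
    }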

Reference: http://msdn.microsoft.com/en-us/library/ee817646.aspx
Performance: Caching techniques are commonly used to improve application performance by storing relevant data as close as possible to the data consumer, thus avoiding repetitive data creation, processing, and transportation.
For example, storing data that does not change, such as a list of countries, in a cache can improve performance by minimizing data access operations and eliminating the need to recreate the same data for each request.
Scalability: The same data, business functionality, and user interface fragments are often required by many users and processes in an application. If this information is processed for each request, valuable resources are wasted recreating the same output. Instead, you can store the results in a cache and reuse them for each request. This improves the scalability of your application because as the user base increases, the demand for server resources for these tasks remains constant.
For example, in a Web application the Web server is required to render the user interface for each user request. You can cache the rendered page in the ASP.NET output cache to be used for future requests, freeing resources to be used for other purposes.
Caching data can also help scale the resources of your database server. By storing frequently used data in a cache, fewer database requests are made, meaning that more users can be served.
Availability: Occasionally the services that provide information to your application may be unavailable. By storing that data in another place, your application may be able to survive system failures such as network latency, Web service problems, or hardware failures.
For example, each time a user requests information from your data store, you can return the information and also cache the results, updating the cache on each request. If the data store then becomes unavailable, you can still service requests using the cached data until the data store comes back online.
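A rough PHP sketch of that availability pattern, using memcached and a hypothetical fetchFromDb() helper that throws when the data store is down:

    <?php
    // Refresh the cache on every successful read; fall back to the cached
    // copy when the data store is unavailable.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    function getCountries(Memcached $mc): array {
        try {
            $data = fetchFromDb('SELECT * FROM countries'); // may throw
            $mc->set('countries', $data, 3600);             // update the cache on each request
            return $data;
        } catch (Exception $e) {
            $cached = $mc->get('countries');
            if ($cached !== false) {
                return $cached; // survive the outage with the last known data
            }
            throw $e; // nothing cached either - give up
        }
    }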

You need to profile your system and find out where the bottleneck is happening. The best type of page load is one that doesn't hit the server at all. You can build a very simple caching system that only regenerates the page every 15 minutes: if the page was cached within the last 15 minutes, visitors get the pre-rendered page. The page is built once and written to a temp file; a new one is created every 15 minutes (the next time someone loads that page).
Caching only stores a file that the server has already done the work for. The work to create the file is already done, and you're simply storing it.
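A bare-bones PHP version of that 15-minute file cache, using output buffering (the cache directory is an assumption):

    <?php
    // Serve the pre-rendered copy if it is newer than 15 minutes;
    // otherwise rebuild the page and store it for the next visitor.
    $cacheFile = '/tmp/cache/' . md5($_SERVER['REQUEST_URI']) . '.html';

    if (file_exists($cacheFile) && time() - filemtime($cacheFile) < 900) {
        readfile($cacheFile); // cache hit: the server does no real work
        exit;
    }

    ob_start(); // capture everything the page prints
    // ... expensive page rendering happens here ...
    $html = ob_get_contents();
    ob_end_flush(); // send the page to the user
    file_put_contents($cacheFile, $html); // keep it for the next 15 minutes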

You use the terms 'performance' and 'speed'. I'll assume 'performance' relates to CPU cycles on your web server and that 'speed' relates to the time it takes to serve the page to the user. You want to maximize web server 'performance' (by lowering the total number of CPU cycles needed to serve pages) whilst maximizing 'speed' (lowering the time it takes to serve a web page).
The good news is that caching can improve both of these metrics at the same time. By caching content, you create an output page that is stored in the cache and can be served repeatedly to users without having to re-execute the PHP code that originally created it (thus lowering CPU cycles). Fetching a cached page from the cache consumes fewer CPU cycles than re-executing PHP code.
Caching is particularly good for web pages that are generally the same for all users who request the page - for example in a wiki, and for pages that generally do not change all too often - again, a wiki.

"Enhance performance" sounds like some of the email I get...
There are two interrelated things that happen here. One is "how long does it take to serve a given request?", and the other is "how many requests can I serve concurrently, given my limited resources?". People tend to use either or both of those concepts when talking about performance.
Caching can help with both those things.
The most effective caching strategy uses resources outside your machines to cache your stuff - the most obvious examples are the user's browser and a CDN. I'll assume you can't use a CDN, but by spending a bit of effort on setting the HTTP cache headers, you can reduce the number of requests to your server for static or rarely changing resources quite dramatically.
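In PHP that can be as little as sending the right headers before any output (the one-day lifetime is just an example - tune it per resource):

    <?php
    // Allow browsers and proxies to reuse this response for a day.
    header('Cache-Control: public, max-age=86400');
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 86400) . ' GMT');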
For dynamic content - usually the web page you generate by querying your database - the next most effective caching strategy is to cache the HTML generated by (parts of) your page. For instance, if you have a "most popular items" box on your homepage, this will usually run a couple of moderately complex database queries, and then some "turn data to HTML" back-end code. If you can cache the HTML, you save both the database queries and the CPU effort of turning the data into HTML.
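A sketch of that fragment cache with memcached (renderPopularItems() is a hypothetical helper that runs the queries and builds the HTML):

    <?php
    // Cache the rendered "most popular items" box for five minutes.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $html = $mc->get('popular_items_box');
    if ($html === false) {            // cache miss
        $html = renderPopularItems(); // the expensive queries + templating
        $mc->set('popular_items_box', $html, 300);
    }
    echo $html;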
If that's not possible, you may be able to cache the results of some database queries. That helps reduce the database load, and usually the load on your web server as well - running the database query and dealing with the results is usually more onerous than retrieving the item from the cache. Because it's faster, the request completes sooner, which frees up resources more quickly; this reduces the load on your servers for an individual request and thus lets you serve more concurrent requests.

Related

Should I implement my own caching or rely on read-replicas?

We have an enterprise application that uses an SQL database. The database access characteristics are about 90% reads. The data that does get updated or created needs to be up-to-date immediately. The cache needs to be correctly invalidated with high certainty. The entities are referred to by their primary key for 98% of the cases.
The application is based on Node.js and is AWS-native. Since it is AWS-native, I'd like to rely on managed AWS services rather than hosting my own. One option is to implement our own read-through, Redis-based cache. Upon retrieving an entity, we'd check the cache, and if the data is not cached we'd put it into the cache before returning it to the user. The parts of the code that update those entities would invalidate the cache by primary key.
Generally speaking, in computer science cache coherency is one of the most challenging problems to get right. I am of the opinion that rather than implementing a Redis cache and thinking through all of the possible scenarios for correctly invalidating it, it is wiser to instead configure an Aurora read-replica specifically for reading frequently accessed entities. The RDBMS will do a much better job at caching than anything we can build ourselves.
So, I am facing two options -- go through the effort of implementing my own caching, or use read replicas. My personal opinion is to use a read replica.
Any advice is greatly appreciated, as always.
Yes, you're right: cache invalidation is a tough problem. The simplest solution is to add code to your data writes to replace the cached values, so they're always current. But this is easy only if the cached values have a pretty much 1-to-1 correlation with rows in your database.
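The replace-on-write pattern looks roughly like this, sketched with the phpredis extension (loadFromDb(), saveToDb(), and the key scheme are hypothetical; the question is Node.js, but the shape is the same):

    <?php
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // Read-through get, keyed by primary key.
    function getEntity(Redis $redis, int $id): array {
        $cached = $redis->get("entity:$id");
        if ($cached !== false) {
            return json_decode($cached, true); // cache hit
        }
        $entity = loadFromDb($id);             // cache miss: hit the DB
        $redis->set("entity:$id", json_encode($entity));
        return $entity;
    }

    // Replace (rather than just delete) the cached value on write,
    // so the cache is always current.
    function updateEntity(Redis $redis, int $id, array $entity): void {
        saveToDb($id, $entity);
        $redis->set("entity:$id", json_encode($entity));
    }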
An advantage of your own cache is that you can cache data that is not 1-to-1 with rows of data in the database. You might cache an entire HTML fragment for a drop-down menu for example. That could be the result of several SQL queries. It could be quite an advantage to cache data that is higher up the "food chain" so to speak. But cache invalidation becomes less straightforward. Best for storing results of queries that don't change often.
Using a read-replica is not a substitute for using a cache. Querying a read-replica still carries the overhead of making a database connection, authentication, SQL query parsing and optimization, locking, and all the other work that goes into RDBMS internals.
Querying data from a cache can be orders of magnitude faster.
Both have their place. It's best to use both a cache and a read-replica for different tasks. I would also add message queues as an important technology. I believe database, cache, and queue form a three-legged stool.
But you must have experience and judgment to know when each is the best tool for a given case.

How do modern web applications implement caching and data persistence with large amounts of rapidly changing data?

For example, consider something like Facebook or Twitter. All the user tweets / posts are retained indefinitely (so they must ultimately be stored within a static database). At the same time, they can rapidly change (e.g. with replies, likes, etc), so some sort of caching layer is necessary (e.g. you obviously can't be writing directly to the database every time a user "likes" a post).
In a case like this, how are the database / caching layers designed and implemented? How are they tied together?
For example, is it typical to begin by implementing the database in its entirety, and then add the caching layer afterward?
What about the other way around? In other words, begin by implementing the majority of functionality into the cache layer, and then write another layer which periodically flushes the cache to the database (at some point when its activity has gone down)? In this scenario, for current / rapidly changing data, the entire application would essentially be stored in cache.
Or perhaps implement some sort of cache-ranking algorithm based on access / update frequency?
How then should it be handled when a user accesses less frequently used data (which isn't currently in the cache)? Simply bypass the cache completely and query the database directly, or should all data be cached before it's sent to users?
In cases like this, does it make sense to design the database schema with the caching layer in mind, or should it be designed independently?
I'm not necessarily asking for direct answers to all these questions, but they're just to give an idea of where I'm coming from.
I've found quite a bit of information and books on implementing a database, and on implementing a caching layer, independently of one another, but not a whole lot on using them in conjunction / tying them together.
Any information, suggestions, general patterns, articles, or books would be much appreciated. It's just difficult to find some direction here.
Thanks
Probably not the best solution, but I worked on a personal project using OpenResty, where I used its shared memory zones to cache (to avoid the overhead of connecting to something like Redis) and then used Redis as the backend DB.
When a user loads a resource, it checks the shared dict; on a miss it loads the resource from Redis and writes it to the cache on the way back.
If a resource is created or updated, it's written to the cache, and also queued to a shared dict queue.
A background worker ticks away waiting for new items in the queue, writing them to Redis and then sending an event to other servers to either invalidate the resource in their cache if they have it, or even pre-cache it if needed.
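The same two-tier idea can be sketched in PHP, with APCu standing in for the shared memory zone and Redis as the shared backend (key names and the 60-second local TTL are illustrative):

    <?php
    // Tier 1: APCu (per-server shared memory). Tier 2: Redis (shared).
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    function getResource(Redis $redis, string $key) {
        $value = apcu_fetch($key, $hit);
        if ($hit) {
            return $value;                // local hit: no network round-trip
        }
        $value = $redis->get($key);       // fall back to the shared tier
        if ($value !== false) {
            apcu_store($key, $value, 60); // populate the local cache on the way back
        }
        return $value;
    }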

handling stale caches with multiple servers

We currently have a website hosted on one server, and we are looking into adding a second server. The main issue is caching. Some items are cached and invalidated when they change. Right now, all changes happen in the same process, so the cache can be invalidated reliably.
If the website is hosted on two servers, a change can be made on either server and the other will not be notified of it. The cache needs to stay, as it drastically speeds up the website. I would prefer not to move the cache out of process into a cache server, as that slows access down to network speed rather than memory speed and adds complexity to the servers.
The website is implemented in .NET, with MySQL as its backing datastore. My issue is how a process can be notified when data changes. Is it possible for MySQL to automatically notify all registered clients when any data changes? I've used RavenDB, which has a similar feature that comes in very handy, but I couldn't find anything similar for MySQL. If this is not possible, any ideas how one would approach this issue?
Distributed caching is a complex topic. It sounds like you are running a more basic in-memory cache. If this is the case, you will need to handle synchronisation yourself, or be happy with "eventual consistency" of the data, assuming you have some stale key checking mechanism.
Personally, I would look into using memcached (we use Couchbase). Your concern about this becoming a network bottleneck may be unfounded - yes, in raw terms memory access is faster, but in practice we found Couchbase caching more than fast enough, and it is atomic at the key level. It will also handle key distribution over nodes.
As for MySQL pushing notifications to clients, I am not sure but I don't think so. You could emulate this yourself if you have a layer of code (DAL etc) over database access.
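One common way to emulate it in that layer is a version (or "generation") counter stored in MySQL: every write bumps it, and each server re-checks the counter before trusting its in-memory copy. A rough sketch, shown in PHP although the pattern is language-agnostic (the cache_version table and queries are made up):

    <?php
    // A cheap single-row query decides whether the local copy is still fresh.
    function getCurrentVersion(PDO $pdo): int {
        return (int) $pdo->query(
            "SELECT version FROM cache_version WHERE name = 'products'"
        )->fetchColumn();
    }

    // Writers run, in the same transaction as the data change:
    //   UPDATE cache_version SET version = version + 1 WHERE name = 'products';

    function getProducts(PDO $pdo, array &$localCache): array {
        $version = getCurrentVersion($pdo);
        if (isset($localCache['v']) && $localCache['v'] === $version) {
            return $localCache['data']; // still fresh on this server
        }
        $data = $pdo->query('SELECT * FROM products')->fetchAll();
        $localCache = ['v' => $version, 'data' => $data];
        return $data;
    }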
It is also difficult to reconcile the desire to have the cache follow the same integrity principles as the database. If you achieve this then all you have done is made an in-memory database. Caching is supposed to be a trade-off of data accuracy over time to increase scalability.

memcached use cases

What are some use cases that would benefit from using memcached with a MySQL DB? I would guess it is good for data that does not change much over time.
More specifically: if my data changes often, then it's not worth using memcached, right?
Even more specifically, I am trying to use the DB as the data structure for a multiplayer game, so the records are going to change with every move the players make, and all players' views should be updated with the latest moves. My app is therefore read- and write-intensive, and I'm trying to see what I can do about it. If I use memcached, each write is read at most 3 times, since at most 4 players can play the game at a time.
Thanks.
Pav
Use case: a webshop with a lot of products. The products are assigned to various pages, and for each product a user gets to see certain specs. The specs are fetched with a "getSpec" function, which is expensive - a query each time.
If we put these in memcached, it's much quicker. Every time someone changes something about a product, you just update memcached.
So even if your data changes, it can still be worth it! Not everything changes at once.
edit: In your case, you could make your writes also update memcached: no stale cache. But that's just a random thought; I don't know if making your writes heavier like that has any disadvantages. It would essentially mean you're running everything from memcached and just using your DB as a sort of backup :)
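A minimal sketch of that write-through idea in PHP (the games table, key scheme, and state format are illustrative assumptions):

    <?php
    // Every move updates the DB and memcached together, so readers
    // never see a stale game state.
    function recordMove(PDO $pdo, Memcached $mc, int $gameId, array $state): void {
        $stmt = $pdo->prepare('UPDATE games SET state = ? WHERE id = ?');
        $stmt->execute([json_encode($state), $gameId]);
        $mc->set("game:$gameId", $state); // keep the cache current
    }

    function getGameState(PDO $pdo, Memcached $mc, int $gameId): array {
        $state = $mc->get("game:$gameId");
        if ($state !== false) {
            return $state;                // the common case: reads stay off the DB
        }
        $stmt = $pdo->prepare('SELECT state FROM games WHERE id = ?');
        $stmt->execute([$gameId]);
        $state = json_decode($stmt->fetchColumn(), true);
        $mc->set("game:$gameId", $state);
        return $state;
    }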
Caching is a tradeoff between speed and (potentially) stale data. You have to determine if the speed gain is appropriate given your own use cases.
We cache everything that doesn't require real-time data. Some things that are typically cached: reports, user content, entire pages (though you may consider caching these to disk via some other system), etc.
Our API allows clients to query for huge amounts of data. We use memcached to store that for quick paging on the clients end.
If you plan ahead, you can setup your application to cache most everything and just invalidate parts of the cache as needed (for instance, when some data in your db is updated).
It's going to depend on how often "often" is and how busy your app is. For example, if you have a piece of data that changes hourly, but that data is queried 500 times per hour, it would probably make sense to cache it even though it changes relatively frequently.

How To Interpret Siege and/or Apache Bench Results

We have a MySQL driven site that will occasionally get 100K users in the space of 48 hours, all logging into the site and making purchases.
We are attempting to simulate this kind of load using tools like Apache Bench and Siege.
While the key metric seems to me to be the number of concurrent users, and we've got our report results, we still feel like we're in the dark.
What I want to ask is: What kinds of things should we be testing to anticipate this kind of traffic?
50 concurrent users 1000 times? 500 concurrent users 10 times?
We're looking at DB errors, apache timeouts, and response times. What else should we be looking at?
This is a vague question and I know there is no "right" answer, we're just looking for some general thoughts on how to determine what our infrastructure can realistically handle.
Thanks in advance!
Simultaneous users is certainly one of the key factors - especially as it applies to DB connection pools, etc. But you will also want to verify that the page rate (pages/sec) of your tests is in the range you expect. If the think time in your test cases is off by much, you can accidentally simulate a much higher (or lower) page rate than your real-world traffic. Think time is the amount of time a user spends between page requests - reading the page, filling out a form, and so on.
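For example, Siege can approximate think time with its delay option, whereas Apache Bench has no think-time concept and simply issues requests as fast as it can (the numbers below are illustrative):

    # 50 concurrent users, each sleeping a random 0-5s between requests, for 10 minutes
    siege -c 50 -d 5 -t 10M http://www.example.com/

    # ab: 10000 requests at concurrency 50, no think time at all
    ab -n 10000 -c 50 http://www.example.com/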
Depending on what other information you have on hand, this might help you calculate the number of simultaneous users to simulate:
Virtual User Calculators
The complete page load time seen by the end-user is usually the most important metric to evaluate system performance. You'll also want to look for failure rates on all transactions. You should also be on the lookout for transactions that never complete. Some testing tools do not report these very well, allowing simulated users to hang indefinitely when the server doesn't respond...and not reporting this condition. Look for tools that report the number of users waiting on a given page or transaction and the average amount of time those users are waiting.
As for the server-side metrics to look for, what other technologies is your app built on? You'll want to look at different things for a .NET app vs. a PHP app.
Lastly, we have found it very valuable to look at how the system responds to increasing load, rather than looking at just a single level of load. This article goes into more detail.
Ideally you would model your usage per user, but creating simulated concurrent sessions for 100k users is usually not easily accomplished.
The best source is your logs: check out the busiest hour and try to figure out a way to model that load level.
The database is usually a critical piece of infrastructure, so I would look at recording the number and length of lock waits as well as the number and duration of db statements.
Another key item to look at is disk queue lengths.
Mostly the process is to look for slow responses, either across the whole site or for specific pages, and then home in on the cause.
The biggest problem with load testing is that it is quite hard to test your network; if you have (as most public sites do) limited bandwidth through your ISP, that may create a performance issue that is not reflected in the load tests.