I would like to convert my stats tracking system not to write to the database directly, as we're hitting bottlenecks.
We're currently using memcached for certain aspects of the site, and I wanted to use it for storing stats and committing them to mysql DB periodically.
The issue lies however in the number of items (which is in the millions) for which potentially there could be stats collected between the cronjob runs that would commit them into the database. Other than running a SELECT * FROM data and checking for existence of every single memcache key, and then updating the table.... is there any other way to do this?
(I'm not saying below is gospel, this is just my gut feeling. As said later on, I don't have the specifics of your system :) And obviously no offence meant etc :) )
I would advice against using memcached for this. Memcached is build te quickly retrieve values that you've gotten before, not to store values. The big difference is that is your cache is getting full, you'll loose your data.
Normally, you'd just have no data in your cache, and recollect the data from the source, which is impossible in this case. That alone would be a reason for me to try an dissuade you from this.
Now you say the major problem is the mysql connection limit you are hitting. If you do simple stuff (like what we talked about in the comments: the insert delayed), it's just a case of increasing the limit. You should probably have enough power to have your scripts/users go to the database once and say "this should eventually be added", and then go away. If your users can't even open 1 connection for that, there's a serious resource problem you probably won't fix by adding extra layers of cache?
Obviously hard to say without any specs of the system, soft and hardware, but my suggestion would be to see if you can just let them open their connections by increasing the limit, and fiddle with the server variables a bit, instead of monkey-patching your system by using a memcached as an in-between layer.
I had a similar issue with statistic data. But please don't use memcached for it. You can't be sure that ALL your items will moved to DB. You can loose data and/or double process data.
You should analyse your bottleneck against how much data you are writing/reading and how many connections you need. And than switch to something scalable like Hadoop, Cassandra, Scripe and other systems.
You need to provide additional information on the platform that you are running: O/S, database (version), storage engine, RAM, CPU (if possible)?
Are you inserting into a single table or more than one table?
Can you disable the indexes on the tables you are inserting into as this slows down the insert functions.
Are you running any triggers or stored procedures to compute values as you insert the raw data?
Related
I have the app a MySQL DB is a slave for other remote Master DB. And i use memcache to do caching of some DB data.
My slave DB can be updated if there are updates in a Master DB. So in my application i want to know when my local (slave) DB is updated to invalidate related cached data and display fresh data i got from master.
Is there any way to run some program when slave mysql DB is updated ? i would then filter q query and understand if i need to clean a cache or not.
Thanks
First of all you are looking for solution similar to what Facebook did in their db architecture (As I remember they patched MySQL for this).
You can build your own solution based on one of these techniques:
Parse replication log on slave side, remove cache entry when you see update of data in the log
Load UDF (user defined function) for memcached, attach trigger on replica side (it will call UDF remove function) to interested tables inside MySQL.
Please note that this configuration is complicated during the support and maintenance. If you can sacrifice stale data in the cache maybe small ttl will help you.
As Kirugan says, it's as simple as writing your own SQL parser, and ensuring that you also provide an indexed lookup keyed to the underlying data for anything you insert into the cache, then cross reference the datasets for any DML you apply to the database. Of course, this will be a lot simpler if you create a simplified, abstract syntax to represent the DML, but thereby losing the flexibilty of SQL and of course, having to re-implement any legacy code using your new syntax. Apart from fixing the existing code, it should only take a year or two to get this working right. Basing your syntax on MySQL's handler API rather than SQL will probably save a lot of pain later in the project.
Of course, if you need full cache consistency then you need to ensure that a logical transaction now spans all the relevant datacentres which will have something of an adverse impact on your performance (certainly much slower than just referencing the master directly).
For a company like facebook, with hundreds of thousands of servers and terrabytes of data (and no requirement for cache consistency) such an approach to solving the problem leads to massive savings. If you only have 2 servers, a better solution would be to switch to multi-master replication, possibly add another database node, optimize the storage (e.g. switching to ssds / adding fast bcache) make sure you have session affinity to the dbms from the aplication (but not stcky sessions) and spend some time tuning your dbms, particularly its cache performance.
I have a SQL-based application and I like to cache the result using Redis. You can think of the application as an address book with multiple SQL tables. The application performs the following tasks:
40% of the time:
Create a new record / Update an existing record
Bulk update multiple records
Review an existing record
60% of the time:
Search records based on user's criteria
This is my current approach:
The system cache a record when a record is created or updated.
When user performs a search, the system will cache the query result.
On top of that, I have a Redis look-up table (Redis Set) which stores the MySQL record ID and the Redis cache key. That way I can delete the Redis caches if the MySQL record has been changed (e.g., bulk update).
What if a new record is created after the system cache the search result? If the new record matches the search criteria, the system will always return the old cache (which does not include the new record), until the cache is deleted (which won't happen until an existing record in the cache is updated).
The search is driven by the users and the combination of the search condition is countless. It is not possible to evaluate which cache should be deleted when a new record is created.
So far, the only solution is to remove all caches of a MySQL table when a record is created. However this is not a good choice because lots of records are created daily.
In this situation, what's the best way to implement Redis on top of MySQL?
Here's a surprising thing when it comes to PHP and MySQL (I am not sure about other languages) - not caching stuff into memcached or Redis is actually faster. Much faster. Basically, if you just built your app and queried MySQL - you'd get more out of it.
Now for the "why" part.
InnoDB, the default engine, is a superb engine. Specifically, it's memory management (allocation and what not) is superior to any memory storage solutions. That's a fact, you can look it up or take my word for it - it will, at least, perform as good as Redis.
Now what happens in your app - you query MySQL and cache the result into redis. However, MySQL is also smart enough to keep cached results. What you just did is create an additional file descriptor that's required to connect to Redis. You also used some storage (RAM) to cache the result that MySQL already cached.
Here comes another interesting part - the preferred way of serving PHP scripts is by using php-fpm - it's much quicker than any mod_* crap out there. Down to the core, php-fpm is a supervisor process that spawns child processes. They don't shut down after the script is served, which means they cache connections to MySQL - connect once, use multiple times. Basically, if you serve scripts using php-fpm, they will reuse the already established connection to MySQL, meaning that you won't be opening and closing connections for each request - this is extremely resource friendly and it lets you have lightning fast connection to MySQL. MySQL, being memory efficient and having the cached result is much quicker than Redis.
Now what does all of this mean for you - having a proper setup lets you have small code that's simple, easy, doesn't involve Redis and eliminates all the problems that you might have with cache invalidation and what not and you won't waste your memory to contain the same data twice.
Ingredients you need for this to work:
php-fpm
MySQL and InnoDB based tables and most of all - sufficient RAM and tweaked innodb_buffer_pool_size variable. That one controls how much RAM InnoDB is allowed to allocate for its purposes - the larger the better.
You eliminated Redis from the game, you kept your code simple and easy to maintain, you didn't duplicate data, you didn't introduce additional system to the play and you let software that's meant to take care of data do its job. Pretty cheap trade-off for maximum usefulness, even if you compile all the software from scratch - it won't take more than an hour or so to get it up and running.
Or, you can just ignore what I wrote and look for a solution using Redis.
We met the same problem and we chose to do same thing you are thinking of: remove all query caches affected by the table. It is not ideal like your said but fortunately our "write" is not as high as 40% so it's ok so far.
That's the nature of query based caching. As an alternative you can add entity based caching. Instead of caching the search result only, cache the entire table and do the search inside memory. We use C# LINQ so we can do pretty common queries in memory but if the search is too complicated then you are out of luck.
I'm not sure if caching would be the correct term for this but my objective is to build a website that will be displaying data from my database.
My problem: There is a high probability of a lot of traffic and all data is contained in the database.
My hypothesized solution: Would it be faster if I created a separate program (in java for example) to connect to the database every couple of seconds and update the html files (where the data is displayed) with the new data? (this would also increase security as users will never be connecting to the database) or should I just have each user create a connection to MySQL (using php) and get the data?
If you've had any experiences in a similar situation please share, and I'm sorry if I didn't word the title correctly, this is a pretty specific question and I'm not even sure if I explained myself clearly.
Here are some thoughts for you to think about.
First, I do not recommend you create files but trust MySQL. However, work on configuring your environment to support your traffic/application.
You should understand your data a little more (How much is the data in your tables change? What kind of queries are you running against the data. Are your queries optimized?)
Make sure your tables are optimized and indexed correctly. Make sure all your query run fast (nothing causing a long row locks.)
If your tables are not being updated very often, you should consider using MySQL cache as this will reduce your IO and increase the query speed. (BUT wait! If your table is being updated all the time this will kill your server performance big time)
Your query cache is set to "ON". Based on my experience this is always bad idea unless your data does not change on all your tables. When you have it set to "ON" MySQL will cache every query. Then as soon as they data in the table changes, MySQL will have to clear the cached query "it is going to work harder while clearing up cache which will give you bad performance." I like to keep it set to "ON DEMAND"
from there you can control which query should be cache and which should not using SQL_CACHE and SQL_NO_CACHE
Another thing you want to review is your server configuration and specs.
How much physical RAM does your server have?
What types of Hard Drives are you using? SSD is not at what speed do they rotate? perhaps 15k?
What OS are you running MySQL on?
How is the RAID setup on your hard drives? "RAID 10 or RAID 50" will help you out a lot here.
Your processor speed will make a big different.
If you are not using MySQL 5.6.20+ you should consider upgrading as MySQL have been improved to help you even more.
How much RAM does your server have? is your innodb_log_buffer_size set to 75% of your total physical RAM? Are you using innodb table?
You can also use MySQL replication to increase the read sources of the data. So you have multiple servers with the same data and you can point half of your traffic to read from server A and the other half from Server B. so the same work will be handled by multiple server.
Here is one argument for you to think about: Facebook uses MySQL and have millions of hits per seconds but they are up 100% of the time. True they have trillion dollar budget and their network is huge but the idea here is to trust MySQL to get the job done.
I've seen pictures like this where multiple rails engines write to a single mySQL server.
1) Is this possible? Or does Rails want each application server to write to one database server?
2) If this is possible, how is it accomplished? Are there queues and a scheduler between the application servers and the write database server?
Scaling a mysql db is a pretty difficult thing to do, but its certainly been done plenty of times and there are a lot of best practices out there for you to take advantage of. The first thing you should know is that before you worry about scaling writes for a while yet, you probably need to scale your reads first.
Scaling reads can be done fairly easily using replication. There are several tools out there that make managing replication a lot easier such as Amazon RDS. Generally speaking many web severs can connect to many databases (as suggested by others), however you quickly run into scale issues once you have a lot of traffic, connections or whatever other action you are performing that generates load on the server.
As replicated severs are read only, you need to manage which sever you connect to depending on the action you're performing. I.e. if you had a users table, when creating, updating or deleting users you need to use the "write" database (the primary "source" sever) but when reading the user table, you can use one of the read replicas. This reduces the load on the primary write sever (allowing it to deal with even more writes) and as you can have multiple read databases behind a load balancer, you can get away with this structure for a very long time and scale reads across tens of database severs before you'll hit any significant issues (however most apps get away with 1-3).
There are situations where you will need to use your write database for read actions (although you should avoid it as much as possible) as the read replicas can be slightly behind the write dbs due to latency in replicating the write db queries, however most of the time you should be able to code knowing that there is the possibility that the read db is delayed (i.e. queue actions a reasonable period of time such that the updates will propagate across all the read severs) and simply use one of your read dbs rather than the write db.
Beyond this the key items to work on are ensuring you have efficient indexes and applying other best practices around maintaining a sensible data structure. You might also want to consider having 3 distinct "groups" of database servers. I generally like to have write, read and "stats" db groups. The write group for create, update and delete operations (as well as select for update), the read for general read items that must return their results quickly, and stats for anything that is going to be under high load and that you do not rely on for a prompt response (this keeps heavy queries that are not time sensitive away from your read db that you need quick responses from for general reads)
Once you get into a situation where you can no longer buy larger hardware and you're near maxing out your write capacity, you'll need to look into sharding, however that will take a lot of traffic / data (so dont worry about it unless you've done all of the above already).
I'm building a web app. This app will use MySQL to store all the information associated with each user. However, it will also use MySQL to store sys admin type stuff like error logs, event logs, various temporary tokens, etc. This second set of information will probably be larger than the first set, and it's not as important. If I lost all my error logs, the site would go on without a hiccup.
I am torn on whether to have multiple databases for these different types of information, or stuff it all into a single database, in multiple tables.
The reason to keep it all in one, is that I only have to open up one connection. I've noticed a measurable time penalty for connection opening, particularly using remote mysql servers.
What do you guys do?
Fisrt,i must say, i think storing all your event logs, error logs in db is a very bad idea, instead you may want to store them on the filesystem.
You will only need error logs or event logs if something in your web app goes unexpected. Then you download the file, and examine it, thats all. No need to store it on the db. It will slow down your db and your web app.
As an answer to your question, if you really want to do that, you should seperate them, and you should find a way to keep your page running even your event og and error log databases are loaded and responding slowly.
Going with two distinct database (one for your application's "core" data, and another one for "technical" data) might not be a bad idea, at least if you expect your application to have a lot of users :
it'll allow you to put one DB on one server, and the other DB on a second server
and you can think about scaling a bit more, later : more servers for the "core" data, and still only one for the "technical" data -- or the opposite
if the "technical" data is not as important, you can (more easily) have two distinct backup processes / policies
having two distinct databases, and two distinct servers, also means you can have heavy calculations on the technical data, without impacting the DB server that hosts the "core" data -- and those calculations can be useful, on logs, or stuff like that.
as a sidenote : if you don't need that kind of "reporting" calculations, maybe storing those data to a DB is not useful, and files would do perfectly ?
Maybe opening two connections means a bit more time -- but that difference is probably rather negligible, is it not ?
I've worked a couple of times on applications that would use two database :
One "master" / "write" database, that would be used only for writes
and one "slave" database (a replication of the first one, to several slave servers), that would be used for reads
This way, yes, we sometimes open two connections -- bu one server alone would not have been able to handle the load...
Use connection pooling anyway. So the time to get a connection is not a problem. But if you have 2 connections, transaction handling become more complicated. On the other hand, sometimes it's handy to have 2 connections: if something goes wrong on the business transaction, you can rollback transaction and still log the failure on the admin transaction. But I would still stick to one database.
I would only use one databse - mostly for the reason you supply: You only need one connection to reach both logging and user stored data.
Depending on your programming language, some frameworks (J2EE as an example) provide connection pooling. With two databases you would need two pools. In PHP on the other hand, the performance come in to perspective when setting up a connection (or two).
I see no reason for two databases. It'd be perfectly acceptable to have tables that are devoted to "technical" and "business"data, but the logical separation should be sufficient.
Physical separation doesn't seem necessary to me, unless you mean an application and data warehouse star schema. In that case, it's either real-time updates or, more typically, a nightly batch ETL.
It makes no difference to mysql in any way whether you use separate "datbases", they are simply catalogues.
It may make setting permissions easier, this is a legitimate reason to do it. Other than that, it is exactly the same as keeping the tables in the same db (except you can have several tables with the same name ... but please don't)
Putting them on separate servers might be a good idea however, as you probably don't want your core critical (user info, for example) data mixed in with your high-volume, unimportant data. This is particularly true for old audit data, debug logs etc.
Also short-lived data, such as search results, sessions etc, could be placed on a different server - it presumably has no high availability[1] requirement.
Having said that, if you don't need to do this, dump it all on one server where it's easier to manage (backup, provide high availibilty, manage security etc).
It is not generally possible to take a consistent snapshot of data on >1 server. This is a good reason to only have one (or one that you care about for backup purposes)
[1] Of the data, not the database.
In MySQL, InnoDB has an option of storing all tables of a certain database in one file, or having one file per table.
Having one file per table is somewhat recommended anyway, and if you do that, it makes difference on the database storage level if you have one database or several.
With connection pooling, one database or several is probably not going to matter either.
So, in my opinion, the question is if you'd ever consider separating the "other half" of the database to a separate server - with the separate server having perhaps a very different hardware configuration, such as no RAID. If so, consider using separate databases. If not, use a single database.